- Roman Kazinnik

# Active Learning: End-to-End solution for Deduplication

Github: __https://github.com/romankazinnik/romankazinnik_blog/tree/master/active_learning__

**Active Learning: main components**

1. Blocking
2. Iterative labeling:
    - Select data points for manual labeling
    - A small number of iterations
    - Use a classifier and a sampling strategy
    - Iterative update:
        - Training data extended with the newly labeled data
        - Classifier re-trained with the new training data
        - Stopping criteria
3. Run the classifier for all blocks

Deduping with Active Learning

**Input Data, Output Data, Training Data, and data points**

I will explain the entire concept behind deduping with active learning with the help of a simple example. Let's assume I am given 1M records of restaurants. Each record has multiple columns containing information about a restaurant, such as the restaurant's location, its name, and the name of the restaurant listing marketplace. All the columns are assumed to be noisy: the same restaurant may have similar but different names and slightly different locations in different marketplaces. Deduping helps us out by splitting the input data into two datasets: unique restaurants and non-unique restaurants.

Active Learning approaches operate with a classifier trained on training data. The training data consists of pairs of records from the Input Data. The total number of all possible pairs is too high and computationally infeasible (½ · 1M · 1M = 0.5e+12). A naive approach would require manually labeling the entire Training Data of some 0.5e+12 data points. Active learning allows the creation of reasonably good classifiers with under a hundred manual labelings. But how does one choose these one hundred data points?
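To make "pairs of records as data points" concrete, here is a minimal sketch of turning two noisy restaurant rows into one feature vector; the field names and features are illustrative, not the repository's actual schema:

```python
import math
from difflib import SequenceMatcher

def pair_features(rec_a: dict, rec_b: dict) -> list:
    """Turn a pair of (noisy) restaurant records into one data point."""
    # Fuzzy similarity of the restaurant names, in [0, 1].
    name_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    # Euclidean distance between (lat, lon) coordinates, in degrees.
    geo_dist = math.hypot(rec_a["lat"] - rec_b["lat"], rec_a["lon"] - rec_b["lon"])
    # Whether both records come from the same listing marketplace.
    same_market = float(rec_a["marketplace"] == rec_b["marketplace"])
    return [name_sim, geo_dist, same_market]

a = {"name": "Joe's Pizza", "lat": 40.7306, "lon": -73.9866, "marketplace": "A"}
b = {"name": "Joes Pizza", "lat": 40.7305, "lon": -73.9867, "marketplace": "B"}
print(pair_features(a, b))
```

A record pair with a near-identical name and location but a different marketplace yields a high name similarity, a tiny geo distance, and `same_market = 0` — exactly the kind of "same restaurant, different listing" pair the classifier must learn to label as a duplicate.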

**Active Learning: results with some one hundred manual labelings**

### High confidence

### Medium confidence

### Low confidence ('noise')

**Blocks**

Our task is to find duplicates among one million records, i.e. some 0.5e+12 pairs (data points). Without a classifier, one would need to manually label every possible pair: an astronomical number T = 0.5e+12. Even with a classifier, running inference for all 0.5e+12 data points is not feasible.

The blocking approach solves the problem of the high number of pairs T by splitting all the records into thousands of smaller sets of records. We must assume records in different blocks cannot be duplicates. For example, restaurants with locations too far from each other must be different restaurants.

As a result, one can ignore pairs coming from different blocks and process only pairs where both restaurants belong to the same block. I use a quadtree to create thousands of blocks with at most 300 records per block and about 100 records per block on average. The block boundaries must overlap (one can think about why!), so the total number of records across all the blocks increases. However, the total number of pairs T after splitting into B blocks with at most R records each is bounded by B · ½R², which is much smaller than ½ · 10^12.
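The pair-count reduction from blocking can be checked with back-of-the-envelope arithmetic; the block count B below is an assumption (the text only says "thousands"):

```python
N = 1_000_000                 # input restaurant records
T_all = N * N // 2            # ~0.5e+12 candidate pairs without blocking

R = 300                       # maximum records per block (from the text)
B = 10_000                    # assumed number of blocks ("thousands")
T_blocked = B * R * R // 2    # upper bound B * (1/2) R^2 on within-block pairs

print(f"{T_all:.1e} pairs -> {T_blocked:.1e} pairs, x{T_all // T_blocked} fewer")
```

Even with the pessimistic worst case of every block holding the full 300 records, blocking cuts the candidate pairs by three orders of magnitude.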

**Classifier**

The classifier is trained with labeled pairs of records (the data points of the training data) and computes the probability that a pair of restaurant records denotes a single ('duplicate') restaurant.

At each iterative step of manual labeling, a new training set is created and a new classifier is trained.

Specifically, at each step, the training data is extended with newly manually labeled data points, usually fewer than ten.

After that, a new classifier is trained with the newly labeled data, where the manually labeled data points have higher weights in training.

Next, I run the classifier on another subset of unlabeled data points, and the highest-confidence points are added to the training data automatically, without manual labeling. These automatically selected data points receive a lower weight.

After a few iterations of adding manually labeled data points and high-confidence points, the accumulated training data becomes large enough that the classifier can even utilize deep learning architectures.
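A minimal sketch of the weighted retraining step, using scikit-learn's `sample_weight`; the weight values and the synthetic pair features are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_pairs(n):
    """Synthetic pair features; the true label is the sign of feature 0."""
    X = rng.normal(size=(n, 3))
    return X, (X[:, 0] > 0).astype(int)

X_old, y_old = make_pairs(200)       # training data from earlier iterations
X_manual, y_manual = make_pairs(5)   # newly manually labeled pairs: trusted
X_auto, y_auto = make_pairs(50)      # high-confidence auto-labeled pairs

X = np.vstack([X_old, X_manual, X_auto])
y = np.concatenate([y_old, y_manual, y_auto])
# Manual labels get a higher weight, automatic labels a lower one.
w = np.concatenate([np.ones(200), 10.0 * np.ones(5), 0.5 * np.ones(50)])

clf = LogisticRegression().fit(X, y, sample_weight=w)
print(round(clf.score(X, y), 3))
```

The 10x / 0.5x weight split encodes the trust hierarchy from the text: human labels dominate the loss, auto-labels merely nudge it.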

**Sampling and Manual Labeling**

The iterative process of manual labeling adds more data to the training data.

Instead of selecting random pairs for manual labeling, Active Learning selects the pairs with the highest impact, with the goal of minimizing the number of manual labeling iterations required to create an accurate classifier.

The sampling strategy helps select a small number of pairs for manual labeling from the large pool of unlabeled data.

At each iteration K+1, we use Classifier_K to select a small number of pairs of records (as few as five) that would benefit the training of a new classifier the most.

There are two steps:

(1) Selecting the five pairs for manual labeling is still a research challenge. There are three main strategies. One strategy suggests selecting the data with the highest uncertainty, i.e. closest to a probability of fifty percent for each class. In this work, I used another strategy: I observed that the data points at the fifty percent boundary are simply 'noise', and it is impossible to guess their labels. I sample data points far enough from both the 'noise' boundary and the 'too certain' sides.

(2) The newly manually labeled records are added to the training data with higher weights, producing a new classifier denoted Classifier_K+½. After that, I extend the training data with unlabeled data points that receive high confidence from Classifier_K+½, and train Classifier_K+1.
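The sampling rule in step (1) — skip the 'noise' band around fifty percent as well as the 'too certain' tails — can be sketched as a probability-band filter; the band edges here are illustrative parameters, not the repository's values:

```python
import numpy as np

def sample_medium_confidence(probs, n=5, noise=(0.4, 0.6), certain=(0.05, 0.95), seed=0):
    """Pick n unlabeled pairs whose predicted probability is far enough from
    the noisy 0.5 boundary but not yet in the too-certain tails."""
    probs = np.asarray(probs)
    medium = ((probs > certain[0]) & (probs < noise[0])) | \
             ((probs > noise[1]) & (probs < certain[1]))
    idx = np.flatnonzero(medium)
    rng = np.random.default_rng(seed)
    return rng.choice(idx, size=min(n, idx.size), replace=False)

probs = [0.02, 0.15, 0.3, 0.45, 0.5, 0.55, 0.7, 0.85, 0.97]
picked = sample_medium_confidence(probs, n=3)
print(sorted(np.asarray(probs)[picked]))
```

On this toy list only 0.15, 0.3, 0.7, and 0.85 qualify: 0.45–0.55 fall in the 'noise' band, while 0.02 and 0.97 are already too certain to be worth a manual label.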

Figure-1 shows a histogram of probabilities at the 10th iteration (output by Classifier-10 for Training Data-10), which confirms an observation from another Active Learning work (Figure-2): both show that the shape has three phases: noise, medium confidence, and high confidence.

### Algorithm

1. Create blocks.
2. Initialize Classifier-0 and Training Data-0.
3. Iterate k = 1, …:
    - Select M unlabeled data points. Apply Classifier-k to the M data points.
    - Select D data points from the M, in both the positive and negative ranges, using the Sampling Strategy on the medium-certainty range P-medium.
    - Quit if Classifier-k accuracy looks sufficiently high on the selected D data points. Otherwise, label them manually and add the D newly labeled data points with higher weights to create Training Data-k+½.
    - Train Classifier-k+½.
    - Run Classifier-k+½ on M more unlabeled data points and add the data points with confidence greater than P-high to create Training Data-k+1.
    - Train Classifier-k+1.

Parameters: D (as small as five data points), M (about a thousand data points), the P-medium confidence range, the P-high probability threshold, and the classifier architecture (Logistic Regression, Random Forest, or DNN).
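The algorithm above can be sketched end-to-end on synthetic data. Everything here is illustrative: Logistic Regression stands in for the classifier, a labeling oracle stands in for the human, and the parameter values follow the text (D = 5, M = 1000):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic pool of unlabeled pair features; true label = sign of feature 0.
X_pool = rng.normal(size=(5000, 3))
y_true = (X_pool[:, 0] > 0).astype(int)  # oracle, stands in for the human labeler

D, M, P_MEDIUM, P_HIGH = 5, 1000, (0.6, 0.95), 0.99

# Initialize: Classifier-0 trained on a small seed set (Training Data-0).
seed = rng.choice(len(X_pool), size=20, replace=False)
X_tr, y_tr, w_tr = X_pool[seed], y_true[seed], np.ones(20)
clf = LogisticRegression().fit(X_tr, y_tr, sample_weight=w_tr)

for k in range(5):
    batch = rng.choice(len(X_pool), size=M, replace=False)
    p = clf.predict_proba(X_pool[batch])[:, 1]
    conf = np.maximum(p, 1 - p)          # distance from the 0.5 'noise' boundary
    # Sampling strategy: pick D points from the medium-certainty range.
    picked = batch[(conf > P_MEDIUM[0]) & (conf < P_MEDIUM[1])][:D]
    # "Manual" labels (oracle) get a higher weight -> Training Data-k+1/2.
    X_tr = np.vstack([X_tr, X_pool[picked]])
    y_tr = np.concatenate([y_tr, y_true[picked]])
    w_tr = np.concatenate([w_tr, 10.0 * np.ones(len(picked))])
    clf = LogisticRegression().fit(X_tr, y_tr, sample_weight=w_tr)  # Classifier-k+1/2
    # Auto-label high-confidence points with a lower weight -> Training Data-k+1.
    p2 = clf.predict_proba(X_pool[batch])[:, 1]
    mask = np.maximum(p2, 1 - p2) > P_HIGH
    X_tr = np.vstack([X_tr, X_pool[batch[mask]]])
    y_tr = np.concatenate([y_tr, (p2[mask] > 0.5).astype(int)])
    w_tr = np.concatenate([w_tr, 0.5 * np.ones(mask.sum())])
    clf = LogisticRegression().fit(X_tr, y_tr, sample_weight=w_tr)  # Classifier-k+1

print(round(clf.score(X_pool, y_true), 3))
```

Each iteration touches only ~25 "manually" labeled points in total, yet the classifier ends up accurate on the whole pool — the point of the sampling strategy.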

### Computational Steps

- 20 mins: Build the docker image
- 10 mins: Step-1: Create 32K blocks
- 15 mins: Step-2: Create initial labels and an initial classifier
- 10 mins: Iterate until the classifier accuracy is sufficient:
    - Sampling strategy: pull from unlabeled data and label manually
    - Retrain the classifier with the newly added labels
- 100 mins: Step-3: Run classifier inference for all blocks of pairs (32K blocks processed at 10 blocks per minute on 32 cores)
- 3 mins: Step-4: Calibration: select the threshold to decide duplicates
    - Use ten binary splits to decide on the best threshold value
    - Using the selected threshold, create two files with unique and duplicated restaurant names
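The Step-4 calibration via "ten binary splits" reads as a bisection search over the decision threshold. A sketch under the assumption that the target is a desired duplicate rate (the rate and the score distribution are illustrative):

```python
import numpy as np

def calibrate_threshold(scores, target_dup_rate, n_splits=10):
    """Bisection on the decision threshold: each of the n_splits binary
    splits halves the search interval until the duplicate rate at the
    returned threshold approximates the target."""
    lo, hi = 0.0, 1.0
    for _ in range(n_splits):
        mid = (lo + hi) / 2
        if np.mean(scores >= mid) > target_dup_rate:
            lo = mid          # threshold too low -> too many duplicates
        else:
            hi = mid          # threshold too high -> too few duplicates
    return (lo + hi) / 2

rng = np.random.default_rng(2)
scores = rng.beta(2, 5, size=100_000)   # stand-in for classifier probabilities
t = calibrate_threshold(scores, target_dup_rate=0.10)
print(round(t, 3), round(float(np.mean(scores >= t)), 3))
```

Ten splits narrow the threshold to within 1/1024 of the unit interval, which is why this step takes only minutes: each split is a single pass over the scores.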