Active Learning: End-to-End solution for Deduplication
Active Learning: main components
Select data points for manual labeling
A small number of Iterations
Use Classifier and Sampling Strategy
Training data with newly labeled data
Classifier re-trained with new Training data
3. Run classifier for all blocks
Deduping with Active Learning
Input Data, Output Data, Training Data, and data points.
I will explain this entire concept behind deduping with active learning with the help of a simple example. Let’s assume, I am given 1M records of restaurants. Each record has multiple columns containing information about a restaurant, such as a restaurant location, name, and name of the restaurant listing marketplace. All the columns are assumed to be noisy, such as the same restaurants may have similar but different names and slightly different locations in different marketplaces. Deduping helps us out by splitting the input data into two datasets of unique restaurants and non-unique restaurants. Active Learning approaches operate with a classifier trained for training data. Training data consists of pairs of records from the Input Data. The total number of all possible pairs is too high and computationally infeasible (½ 1M * 1M = .5e+12). A naive approach would require manually labeling the entire Training Data if some 1e+12 data points. Active learning allows to creation of reasonably good classifiers with below a hundred manual labelings. But how to choose these one hundred data points?
Active Learning: results with some one hundred manual labeling
Low confidence 'noise'
Our task is to find duplicates in one million records and some 1e+12 pairs (data points). Without a classifier, one would need to manual label every possible pair which could be autonomous number T=1e+12 and computationally not feasible even for fastest classifiers. With a classifier, one would need to run classifier for 1e+12 data points, which is not feasible.
The blocking approach solves the problem of a high number of pairs T by splitting all the records into thousands of smaller sets of records. We must assume records in different blocks can’t be duplicated. For example, restaurants with locations too far from each other must be different restaurants.
As a result, one can ignore pairs coming from different blocks and process only pairs where both restaurants belong to the same block. I use a quadtree to create thousands of blocks with at most 300 records per block and with about 100 records at average. The block boundaries must overlap (one can think about this question!) so that the total number of records in all the blocks will increase. However, the total number of pairs T as a result of splitting into B blocks with maximum R records will be limited by B*½ R^2 which is much smaller than ½ 10^12
The classifier is trained with labeled pairs of records (datapoints of training data) and computes a probability that a pair of two restaurant records denote single (‘duplicate’) restaurant.
At each iterative step of manual labeling, a new training set is created and a new classifier is trained.
Specifically, at each step, training data is added with a new manually labeled data points, usually a small number of less than ten.
After that, a new classifier is trained with the newly labeled data, whereas the manually labeled data points have higher weights in training.
Next, I run classifier for another subset of unlabeled data points, and the highest confidence points are added to the training data automatically, without manual labeling. These automatically selected new data points will have lower weight.
After a few iterations of adding manual labeled data points and points with the high confidence , the accumulated training data becomes large enough. This way, a classifier can utilize deep learning architectures.
Sampling and Manual Labeling
The iterative process of manual labeling adds more data to the training data.
Instead of selecting the random pairs for manual labeling, Active Learning selects pairs with the highest impact with the goal to minimize the number of manual labeling iterations required to create an accurate classifier.
Sampling strategy helps to select a small number of pairs for manual labeling from large unlabeled data
At each iteration K+1, we use Classifer_K to select a small number pairs of records (as small as five) that would benefit the most in training a new classifier.
There are two steps:
(1) selecting the five pairs for manual labeling is still a research challenge. There are three main strategies. One strategy suggests select data with the highest uncertainty, such as closest to the probability of fifty percent for each class. In this work, I used another strategy. I observed that the data points at the fifty percent boundary are simply ‘noise’ and it is impossible to guess the labels for them. I sample data points far enough from the ‘noise’ boundary and from the ‘too certain’ sides.
(2) The newly manually labeled records are added to the training with higher weights. The new classifier denoted Classifeir_K+1/2. After that, I extend the training data with unlabeled data points that have high confidence with Classifier k+½, and create Classifer_K+1.
Figure-1 shows a histogram of probabilities at the 10th iteration (outputs by Classifier-10 for Training Data-10), which confirms observation from another Active Learning work (Figure-2): both show that the shape has three following phases: Noise, Medium and High Confidence.
Initialize: Classifier-0 and Training-Data-0
Iterate k = 1, … :
Select M unlabeled data points. Apply Classifier-k to M datapoints.
Select D datapoints from M, both positive and negative ranges, using Sampling Strategy from the Medium Certainty range P-medium.
Quit if Classifier-K accuracy looks sufficiently high for the selected D data points. Otherwise, label manually and add the M newly labeled datapoints with higher weights to create Training Data-k+½
Run Classifeir-k+½ for M more unlabeled datapoints and add the datapoints with the high confidence greater P-high to create Training Data-k+1
Parameters: D (as small as five data points), M (a thousand of data points), P-Medium confidence range, P-high probability threshold, Classifier architecture (Logistic Regression, RandomForest or DNN)
20 mins: Build docker image
10 mins: Step-1
Create 32K Blocks
15 mins: Step-2
Create Initial Labels and initial Classifier
10 mins: Iterate until Classifier accuracy is sufficient
Sampling Strategy: Pull from Unlabeled and Label manually
Retrain Classifier with newly added Labels
100 mins: Step-3
Run Classifier Inferences for All Block of Pairs.
32K blocks processed 10 blocks per min x32 cores
3 mins: Step-4
Calibration: select threshold to decide duplicates
Use ten binary splits to decide on the best threshold value
Using the selected threshold, create two files with unique and duplicated restaurant names