Blocking
It is not necessary to compare all pairs of records; blocking can save us a lot of time.
Introduction
Blocking is a technique to reduce the number of fine-grained comparisons that are necessary to identify true linkage hits. It works in two steps:
1. Vectorize the source and target datasets.
2. For each row in the source file, find the top-k nearest neighbors in the target file.
Blocking needs to be computationally efficient in order to work in low-resource settings. We use FAISS (for dense vectors) or PySparnn (for sparse vectors). The current default is to use sparse vectors, as returned by the scikit-learn TF-IDF vectorizer.
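The two steps described above can be illustrated with a minimal sketch. It is not the e2elink implementation: the toy records, the character n-gram settings and the value of k are assumptions made for illustration, and scikit-learn's brute-force NearestNeighbors stands in for PySparnn or FAISS so that the example needs only one dependency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy records (hypothetical); real inputs would be the pre-processed
# source and target files.
source = ["john smith london", "maria garcia madrid"]
target = [
    "jon smith london",
    "maria garcia madrid",
    "peter jones leeds",
    "ana lopez sevilla",
]

# Step 1: vectorize both datasets in the same sparse TF-IDF space.
# Character n-grams are an assumption of this sketch; they tolerate
# small spelling differences such as "john" vs "jon".
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
target_vectors = vectorizer.fit_transform(target)
source_vectors = vectorizer.transform(source)

# Step 2: build an index on the target vectors and, for each source row,
# retrieve its k nearest target rows by cosine distance.
k = 2  # illustrative value only
index = NearestNeighbors(n_neighbors=k, metric="cosine", algorithm="brute")
index.fit(target_vectors)
distances, neighbors = index.kneighbors(source_vectors)

# Only these (source row, target row) pairs go on to fine-grained comparison.
candidate_pairs = [(i, int(j)) for i, row in enumerate(neighbors) for j in row]
all_pairs = len(source) * len(target)
print(candidate_pairs)
print(f"{len(candidate_pairs)} candidate pairs instead of {all_pairs}")
```

With n source rows and m target rows, blocking keeps n × k candidate pairs rather than n × m, which is where the time saving comes from once m is much larger than k.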
Usage
You can run blocking on pre-processed data. The only real parameter is the number of neighbors (k). We determine k automatically (between 5 and 100) based on the size of your datasets.
$ e2elink block
This run will produce an index file based on the target data.