Medical record linkage is finally made easy. Get started!
1. Prepare your data
We provide some example datasets. They are synthetic but closely modeled on previous work we have done at CIDRZ. You can download them as follows:
$ e2elink example
Three files will appear in your working directory. These correspond to a source file, a target file, and a truth file.
This will create a mapping file with the suggested correspondence between original column names and standard column names. For more details, please see:
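To give a rough idea of what such a mapping looks like, here is a minimal sketch that matches original column names to standard ones by string similarity. It is not the package's implementation: the standard column names, the `suggest_mapping` helper, and the 0.6 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical standard column names; the package's real schema may differ.
STANDARD_COLUMNS = ["full_name", "birth_date", "sex", "identifier"]


def _normalize(name):
    """Lowercase and unify separators so 'Full Name' looks like 'full_name'."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def suggest_mapping(original_columns, cutoff=0.6):
    """Suggest a standard column for each original column, or None if unsure."""
    mapping = {}
    for column in original_columns:
        matches = difflib.get_close_matches(
            _normalize(column), STANDARD_COLUMNS, n=1, cutoff=cutoff
        )
        mapping[column] = matches[0] if matches else None
    return mapping


mapping = suggest_mapping(["Full Name", "BirthDate", "Sex", "clinic_visits"])
print(mapping)
```

Columns that do not resemble any standard name are mapped to `None`, so they can be reviewed by hand or dropped.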
Once the schema of the input files has been identified, you need to preprocess the data.
This will create preprocessed files with standard column names and cleaned data. Data cleaning is a critical step of record linkage. Learn more about how we do it here:
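As an illustration only, the sketch below shows the kind of work a preprocessing step typically does: renaming columns to standard names and normalizing a name field. The `clean_name` and `preprocess` helpers, the `full_name` column, and the specific cleaning rules are assumptions, not the package's actual logic.

```python
import unicodedata


def clean_name(value):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    letters_only = "".join(c if c.isalpha() else " " for c in without_accents)
    return " ".join(letters_only.lower().split())


def preprocess(rows, mapping):
    """Rename mapped columns and clean the (hypothetical) full_name field."""
    cleaned = []
    for row in rows:
        # Keep only columns that have a standard name in the mapping.
        record = {mapping[k]: v for k, v in row.items() if mapping.get(k)}
        if "full_name" in record:
            record["full_name"] = clean_name(record["full_name"])
        cleaned.append(record)
    return cleaned


# "\u00e9" is an accented e, written as an escape to keep the source ASCII.
rows = [{"Full Name": "  Jos\u00e9  O'Brien-Mwale ", "Notes": "n/a"}]
print(preprocess(rows, {"Full Name": "full_name", "Notes": None}))
```

Applying the same normalization to both the source and the target file matters more than the exact rules chosen, since later steps compare the cleaned values directly.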
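Blocking is a key step to ensure computational performance: it restricts detailed comparisons to a small set of plausible candidate pairs instead of every source-target combination. By default, we block based on full names. For each row in the source file, we look for the nearest neighbors (best candidates) in the target file.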
A blocking index, specific to the target file, will be generated and stored as output. Please read more about blocking here:
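The idea of a target-specific index queried for nearest neighbors can be sketched with character trigrams and Jaccard similarity. This is a generic technique chosen for illustration; the package may use a different representation and search method, and the function names and `k` parameter below are hypothetical.

```python
from collections import defaultdict


def trigrams(name):
    """Return the set of character trigrams of a padded, lowercased name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(target_names):
    """Inverted index: trigram -> positions of target names containing it."""
    index = defaultdict(set)
    for position, name in enumerate(target_names):
        for gram in trigrams(name):
            index[gram].add(position)
    return index


def candidates(source_name, target_names, index, k=2):
    """Return up to k target positions ranked by trigram Jaccard similarity."""
    source_grams = trigrams(source_name)
    # Only targets sharing at least one trigram are ever scored.
    seen = set()
    for gram in source_grams:
        seen |= index.get(gram, set())
    scored = []
    for position in seen:
        target_grams = trigrams(target_names[position])
        overlap = len(source_grams & target_grams)
        jaccard = overlap / len(source_grams | target_grams)
        scored.append((jaccard, position))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [position for _, position in scored[:k]]


targets = ["mary banda", "john phiri", "maria banda", "peter zulu"]
index = build_index(targets)
print(candidates("mary bnda", targets, index))
```

Because the index is built once from the target file, it can be stored and reused for any number of source rows; targets that share no trigram with the query are never scored at all.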
Comparisons are the fun part of record linkage. Each reference field is compared using one or more similarity metrics to achieve the best possible fuzzy matches.
We have made a big effort to provide a comprehensive and efficient set of comparisons for each linkage variable. Learn more here:
We are always happy to include new types of comparisons. Please reach out to us if you have suggestions!
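For readers new to the idea, here is a small sketch of comparing each field with more than one metric. The metrics, the field names, and the `COMPARISONS` table are assumptions for illustration and do not reflect the package's actual comparison set.

```python
import difflib


def exact(a, b):
    """1.0 when the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def sequence_ratio(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    """Word-set overlap in [0, 1]; insensitive to word order."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical assignment of metrics to fields.
COMPARISONS = {
    "full_name": [sequence_ratio, token_jaccard],
    "birth_date": [exact, sequence_ratio],
}


def compare(source, target):
    """Return one feature per (field, metric) pair for a candidate pair."""
    features = {}
    for field, metrics in COMPARISONS.items():
        for metric in metrics:
            key = f"{field}__{metric.__name__}"
            features[key] = metric(source[field], target[field])
    return features


source = {"full_name": "mary banda", "birth_date": "1990-01-02"}
target = {"full_name": "banda mary", "birth_date": "1990-01-02"}
print(compare(source, target))
```

Using several metrics per field lets one capture different kinds of error: here the swapped name order lowers the character-level ratio but leaves the token overlap untouched.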
6. Scoring
We provide a single linkage score that combines the multiple comparisons. The score comes from models that were pre-trained and calibrated on synthetic data, so it can be interpreted as a probability.
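To make the notion of a single probabilistic score concrete, the sketch below combines a few comparison features with a logistic function. The feature names, weights, and intercept are invented for illustration; the package's pre-trained models are not reproduced here, and a logistic output is only a trustworthy probability once the model has been calibrated.

```python
import math

# Hypothetical weights and intercept, chosen by hand for illustration only.
WEIGHTS = {"name_similarity": 4.0, "birth_date_match": 3.0, "sex_match": 1.0}
INTERCEPT = -5.0


def linkage_score(features):
    """Map comparison features in [0, 1] to a single score in (0, 1)."""
    z = INTERCEPT + sum(
        weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


strong = linkage_score(
    {"name_similarity": 1.0, "birth_date_match": 1.0, "sex_match": 1.0}
)
weak = linkage_score(
    {"name_similarity": 0.2, "birth_date_match": 0.0, "sex_match": 1.0}
)
print(round(strong, 3), round(weak, 3))
```

A score that behaves like a probability is convenient in practice: a threshold such as 0.5 or 0.9 then has a direct interpretation, rather than being an arbitrary cutoff on a raw similarity.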
Our scoring methodology is a unique component of this record linkage package. Please learn more here: