Quick start
Medical record linkage is finally made easy. Get started!
Prepare your data
We provide some example datasets. They are synthetic but closely modeled on previous work we have done at CIDRZ. You can download them as follows:
$ e2elink example
Three files will appear in your working directory. These correspond to a source file, a target file, and a truth file.
Run full pipeline
You can run a full linkage pipeline as follows:
$ e2elink run --src_file source.csv --trg_file target.csv
You are done! An output folder has been created. Results are stored in the finish subfolder.
Run step by step
If you prefer to have full control over the linkage pipeline, you can run the code step by step.
1. Set up output directory
First, create an output directory where results will be stored.
$ e2elink setup --src_file source.csv --trg_file target.csv
2. Schema matching
Check your input files and identify column types.
$ e2elink schema
This will create a mapping file with the suggested correspondence between original column names and standard column names. For more details, please see:
Schema matching
3. Data pre-processing
Once the schema of the input files has been identified, you need to preprocess the data.
$ e2elink preprocess
This will create preprocessed files with standard column names and cleaned data. Data cleaning is a critical step of record linkage. Learn more about how we do it here:
Preprocessing
4. Blocking
Blocking is a key step to ensure computational performance. By default, we block based on full names. For each row in the source file, we look for the nearest neighbors (best candidates) in the target file.
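To make the idea of nearest-neighbor blocking concrete, here is a minimal sketch using only the Python standard library. It is not e2elink's implementation: the choice of character trigrams, Jaccard similarity, and the function names `char_ngrams`, `build_index`, and `nearest_candidates` are all assumptions made for illustration.

```python
from collections import defaultdict


def char_ngrams(name, n=3):
    """Return the set of character n-grams of a normalized, padded name."""
    text = " " + " ".join(name.lower().split()) + " "
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(target_names, n=3):
    """Inverted index from n-gram to the target rows that contain it."""
    index = defaultdict(set)
    for row, name in enumerate(target_names):
        for gram in char_ngrams(name, n):
            index[gram].add(row)
    return index


def nearest_candidates(source_name, target_names, index, k=2, n=3):
    """Return up to k target rows with the highest n-gram Jaccard similarity."""
    grams = char_ngrams(source_name, n)
    # Only target rows sharing at least one n-gram are ever scored.
    rows = set().union(*(index.get(g, set()) for g in grams))
    scored = []
    for row in rows:
        other = char_ngrams(target_names[row], n)
        scored.append((len(grams & other) / len(grams | other), row))
    # Best similarity first; ties broken by row number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row for _, row in scored[:k]]


# Invented names, for illustration only.
target_names = ["Mary Banda", "John Phiri", "Maria Bandah", "Joseph Mwale"]
index = build_index(target_names)
candidates = nearest_candidates("Mary Bandaa", target_names, index)
print(candidates)
```

The index is built once from the target file, and each source row is then scored only against the target rows it shares n-grams with. This is what keeps the number of detailed comparisons manageable.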
$ e2elink block
A blocking index, specific to the target file, will be generated and stored as output. Please read more about blocking here:
Blocking
5. Comparison
Comparisons are the fun part of record linkage. Each reference field is compared using one or multiple similarity metrics to achieve the best possible fuzzy matches.
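As a rough illustration of comparing one field with several metrics, the sketch below produces one similarity value per field and metric. The field names `full_name` and `birth_date`, the three metrics, and the `compare` function are hypothetical; the metrics e2elink actually applies may differ.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """1.0 when both values are identical after trimming and lowercasing."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def sequence_ratio(a, b):
    """Fuzzy similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_overlap(a, b):
    """Jaccard overlap of whitespace tokens; ignores token order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical field names and metric assignments, for illustration only.
METRICS = {
    "full_name": [exact, sequence_ratio, token_overlap],
    "birth_date": [exact, sequence_ratio],
}


def compare(source_row, target_row):
    """Return one similarity value per (field, metric) pair."""
    return {
        f"{field}__{metric.__name__}": metric(source_row[field], target_row[field])
        for field, metrics in METRICS.items()
        for metric in metrics
    }


source_row = {"full_name": "Banda Mary", "birth_date": "1990-04-12"}
target_row = {"full_name": "Mary Banda", "birth_date": "1990-04-21"}
features = compare(source_row, target_row)
for name, value in features.items():
    print(name, round(value, 2))
```

Using several metrics per field lets different kinds of error show up in different features: swapped name order leaves the token overlap at its maximum even though the exact match fails, while a transposed digit in a date still yields a high sequence ratio.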
$ e2elink compare
We have made a big effort to provide a comprehensive and efficient set of comparisons for each linkage variable. Learn more here:
Comparisons
6. Scoring
We provide a single linkage score that combines the multiple comparisons. The score comes from models that were pre-trained and calibrated on synthetic data, so it can be interpreted as a probability.
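The sketch below shows, in the simplest possible form, how several comparison values can be folded into one score between 0 and 1. It uses a logistic function with made-up weights. The feature names, the weights, and the `linkage_score` function are assumptions for illustration and do not describe the pre-trained models shipped with e2elink.

```python
import math

# Made-up weights for illustration; a real model would learn them from
# labelled (for example, synthetic) record pairs.
WEIGHTS = {
    "name_similarity": 6.0,
    "birth_date_similarity": 4.0,
    "sex_match": 1.5,
}
BIAS = -8.0


def linkage_score(features):
    """Combine per-field similarities into a single value in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


strong = linkage_score(
    {"name_similarity": 0.95, "birth_date_similarity": 1.0, "sex_match": 1.0}
)
weak = linkage_score(
    {"name_similarity": 0.40, "birth_date_similarity": 0.2, "sex_match": 1.0}
)
print(round(strong, 3), round(weak, 3))
```

A score of this kind only reads as a probability if it is calibrated, meaning that among pairs scored around 0.9, roughly 90% are true matches.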
$ e2elink score
Our scoring methodology is a unique component of this record linkage package. Please learn more here:
Scoring
7. Evaluation
We estimate the performance of the prediction.
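When a truth file is available, one common way to quantify linkage performance is precision, recall, and F1 over the predicted record pairs. The sketch below assumes pairs are represented as `(source_row, target_row)` tuples; that representation, the `evaluate_pairs` function, and the choice of metrics are assumptions, not a description of what `e2elink evaluate` reports.

```python
def evaluate_pairs(predicted_pairs, true_pairs):
    """Precision, recall and F1 of predicted links against known true links."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented pairs, for illustration only.
predicted_pairs = [(0, 0), (1, 3), (2, 2)]
true_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
metrics = evaluate_pairs(predicted_pairs, true_pairs)
print(metrics)
```

Here two of the three predicted links are correct and two of the four true links are found, so precision is 2/3 and recall is 1/2.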
$ e2elink evaluate
8. Finish
Wrap up and write results.
$ e2elink finish
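If you want to script the step-by-step run, the eight commands above can be chained from Python. This is a minimal sketch that assumes `e2elink` is on your PATH and that each step after `setup` needs no extra arguments, as in the commands shown above. The helper names `pipeline_commands` and `run_pipeline` are hypothetical.

```python
import subprocess

# Steps that follow `setup`, in the order given in this guide.
STEPS = ["schema", "preprocess", "block", "compare", "score", "evaluate", "finish"]


def pipeline_commands(src_file, trg_file):
    """Build the command line for every step, starting with `setup`."""
    setup = ["e2elink", "setup", "--src_file", src_file, "--trg_file", trg_file]
    return [setup] + [["e2elink", step] for step in STEPS]


def run_pipeline(src_file, trg_file):
    """Run every step in order; check=True stops at the first failing step."""
    for command in pipeline_commands(src_file, trg_file):
        print("$", " ".join(command))
        subprocess.run(command, check=True)


commands = pipeline_commands("source.csv", "target.csv")
for command in commands:
    print(" ".join(command))
```

Calling `run_pipeline("source.csv", "target.csv")` would then execute the steps one after another, which makes it easy to inspect the output folder between steps or to rerun from a given point.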