Quick start

Medical record linkage is finally made easy. Get started!

Prepare your data

We provide some example datasets. They are synthetic but closely modeled on previous work we have done at CIDRZ. You can download them as follows:

$ e2elink example

Run full pipeline

You can run a full linkage pipeline as follows:

$ e2elink run --src_file source.csv --trg_file target.csv

Results are stored in the finish subfolder.

Run step by step

If you prefer to have full control over the linkage pipeline, you can run the code step by step.

1. Set up output directory

First, create an output directory where results will be stored.

$ e2elink setup --src_file source.csv --trg_file target.csv

2. Schema matching

Check your input files and identify column types.

$ e2elink schema

This will create a mapping file with the suggested correspondence between original column names and standard column names. For more details, please see:

Schema matching
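To illustrate what a mapping between original and standard column names can look like, here is a minimal Python sketch. The standard names and the synonym table are hypothetical, chosen only for this example; they are not taken from e2elink, which may identify column types quite differently.

```python
import re

# Hypothetical standard column names and a few common variants.
# These are assumptions for illustration, not e2elink's actual schema.
SYNONYMS = {
    "full_name": {"name", "fullname", "patientname"},
    "birth_date": {"dob", "dateofbirth", "birthdate"},
    "sex": {"sex", "gender"},
}


def normalize(column):
    """Lowercase a column name and drop everything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", column.lower())


def suggest_mapping(columns):
    """Map each original column name to a standard name, or None if unknown."""
    mapping = {}
    for column in columns:
        key = normalize(column)
        mapping[column] = next(
            (std for std, variants in SYNONYMS.items()
             if key == normalize(std) or key in variants),
            None,
        )
    return mapping
```

For example, `suggest_mapping(["Patient Name", "DOB", "clinic"])` maps the first two columns to `full_name` and `birth_date` and leaves `clinic` unmapped, which is the kind of suggestion you would then review by hand.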

3. Data pre-processing

Once the schema of the input files has been identified, you need to preprocess the data.

$ e2elink preprocess

This will create preprocessed files with standard column names and cleaned data. Cleaning data is a critical step of record linkage. Learn more about how we do it here:

Preprocessing
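As a rough idea of what cleaning a name field can involve, the sketch below lowercases a name, strips accents and punctuation, and collapses whitespace. It is a generic example using only the Python standard library; the function name `clean_name` is hypothetical and the cleaning rules e2elink applies may differ.

```python
import re
import unicodedata


def clean_name(raw):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a lowercase ASCII letter or a space with a space.
    letters = re.sub(r"[^a-z ]", " ", stripped.lower())
    # Collapse runs of whitespace and trim both ends.
    return " ".join(letters.split())
```

With this sketch, `clean_name("  José  O'Brien-Mwale ")` returns `"jose o brien mwale"`, so spelling variants that differ only in case, accents, or punctuation end up identical before comparison.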

4. Blocking

Blocking is a key step to ensure computational performance, because it limits comparisons to a small set of candidate pairs. By default, we block based on full names. For each row in the source file, we look for the nearest neighbors (best candidates) in the target file.

$ e2elink block

A blocking index, specific to the target file, will be generated and stored as output. Please read more about blocking here:

Blocking
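The nearest-neighbor idea can be sketched in a few lines of Python. This example ranks target names by Jaccard similarity of character trigrams, which is one common choice and an assumption here; it is not necessarily the similarity or index structure that e2elink uses. The function names and the parameter `k` are hypothetical.

```python
def trigrams(name):
    """Return the set of character trigrams of a padded, lowercased name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def block(source_names, target_names, k=2):
    """For each source row, return the indices of the k most similar target names."""
    # Trigram sets for the target file are computed once and reused.
    target_grams = [trigrams(name) for name in target_names]
    candidates = {}
    for i, name in enumerate(source_names):
        grams = trigrams(name)
        ranked = sorted(
            range(len(target_names)),
            key=lambda j: jaccard(grams, target_grams[j]),
            reverse=True,
        )
        candidates[i] = ranked[:k]
    return candidates
```

This brute-force version scores every source row against every target row, so it only shows the logic. A precomputed index over the target file serves the same purpose without the exhaustive scan.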

5. Comparison

Comparisons are the fun part of record linkage. Each reference field is compared using one or multiple similarity metrics to achieve the best possible fuzzy matches.

$ e2elink compare

We have put considerable effort into providing a comprehensive and efficient set of comparisons for each linkage variable. Learn more here:

Comparisons

We are always happy to include new types of comparisons. Please reach out to us if you have suggestions!
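To make "multiple similarity metrics" concrete, the following Python sketch compares a single field with three generic metrics: exact match, normalized Levenshtein similarity, and the `difflib.SequenceMatcher` ratio. The metric selection and the function names are assumptions for illustration; the comparisons shipped with e2elink are documented on the page linked above.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def compare_field(a, b):
    """Compare one field with several metrics, each scaled to [0, 1]."""
    longest = max(len(a), len(b))
    return {
        "exact": 1.0 if a == b else 0.0,
        "levenshtein": 1.0 - levenshtein(a, b) / longest if longest else 1.0,
        "ratio": SequenceMatcher(None, a, b).ratio(),
    }
```

Each metric captures a different kind of disagreement, so a pair such as `"banda"` and `"bamda"` fails the exact test while still scoring high on the two fuzzy metrics.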

6. Scoring

We provide a single linkage score that combines the multiple comparisons. The score comes from models that were pre-trained and calibrated on synthetic data, so it can be interpreted as a probability.

$ e2elink score

Our scoring methodology is a unique component of this record linkage package. Please learn more here:

Scoring
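One simple way to turn several comparison values into a single probability-like score is a logistic model, sketched below in Python. This is a stand-in technique chosen for illustration, not a description of e2elink's pre-trained models; the field names, weights, and intercept are all hypothetical.

```python
import math

# Hypothetical weights and intercept (assumptions for illustration only).
# A real model would learn and calibrate these on labeled or synthetic pairs.
WEIGHTS = {"full_name": 4.0, "birth_date": 3.0, "sex": 1.0}
INTERCEPT = -5.0


def linkage_score(similarities):
    """Combine per-field similarities in [0, 1] into a single score in (0, 1)."""
    z = INTERCEPT + sum(
        WEIGHTS[field] * similarities.get(field, 0.0) for field in WEIGHTS
    )
    # The logistic function maps any real number to the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
```

A pair that agrees on every field gets a score close to 1, while a pair that disagrees everywhere gets a score close to 0. Whether such a score can be read as a probability depends on how well the model is calibrated.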

7. Evaluation

Estimate the performance of the predicted links.

$ e2elink evaluate
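If you have a set of pairs with known match status, two common ways to summarize linkage performance are precision and recall at a score threshold. The Python sketch below computes both; it is a generic example, and the function name, the `threshold` parameter, and the choice of metrics are assumptions rather than a description of what the `evaluate` step reports.

```python
def precision_recall(scores, labels, threshold=0.5):
    """Precision and recall of the links predicted at a given score threshold."""
    # A pair is predicted as a link when its score reaches the threshold.
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Raising the threshold usually trades recall for precision, so it is worth checking both numbers at the threshold you intend to use.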

8. Finish

Wrap up and write results.

$ e2elink finish
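The eight steps above can also be chained from a script. The Python sketch below builds the commands exactly as they appear in this guide and runs them in order with `subprocess`. It assumes that `e2elink` is available on your PATH and that each step picks up the output of the previous one, as in the walkthrough; the helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess

# Step names as listed in this guide, in order, after the initial setup.
STEPS = ["schema", "preprocess", "block", "compare", "score", "evaluate", "finish"]


def build_commands(src_file, trg_file):
    """Return the argument lists for the step-by-step pipeline."""
    setup = ["e2elink", "setup", "--src_file", src_file, "--trg_file", trg_file]
    return [setup] + [["e2elink", step] for step in STEPS]


def run_pipeline(src_file, trg_file):
    """Run each step in order, stopping at the first one that fails."""
    for command in build_commands(src_file, trg_file):
        print("$", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("source.csv", "target.csv")` then mirrors the manual walkthrough, while still letting you inspect intermediate outputs by stopping after any step.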
