
ESTHER workshop

Welcome to the record linkage course!

This guidebook contains information about the ESTHER record linkage workshop.

Towards an end-to-end record linkage pipeline using machine learning

The aim of this course is to explore the potential of Machine Learning (ML) techniques applied to Record Linkage (RL). We will start the course by providing a gentle introduction to ML. Then, in a series of more specialized sessions, we will revisit the classical steps of record linkage (data processing/cleaning, blocking, comparison...) and will try to understand how ML can be used to increase linkage quality, performance and level of automation.

The course will be structured in 2 introductory sessions and 4 hands-on sessions focused on synthetic datasets provided to participants.

Basic record linkage course

For more information, visit the ESTHER Project on Moodle.

Materials

ML1 - Basic course: S5

Demystifying machine learning for medical data analysis

Content

Past, present and future of ML applied to medical data analysis. The field of ML is very broad, so the emphasis will be on clarifying concepts and simplifying them for a general audience. Concepts include:

A brief overview of the machine learning cycle
- Data preparation
- Splitting data into train, test and validation sets
- Training an ML model
- A taxonomy of ML models, including the most prominent families of ML models
- Cross-validation and external validation
- Bringing models to production (model deployment)
Links between ML and conventional tools in statistics (the past)
A selection of medical data analysis studies/articles applying ML (the present)
End-to-end pipelines and the promises of ML; how the ML cycle will be embedded in medical data management centers (the future)
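
The splitting and cross-validation steps of the cycle listed above can be sketched in a few lines of plain Python. This is only an illustration, not course material: the function names, the 60/20/20 proportions and the stand-in list of 100 records are assumptions made for the example. In practice a library such as scikit-learn would usually be used for this.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle items and split them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold(items, k=5):
    """Yield (train, held_out) pairs; each item is held out exactly once."""
    items = list(items)
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out


records = list(range(100))  # stand-in for 100 labelled examples (assumption)
train, val, test = train_val_test_split(records)
folds = list(k_fold(train, k=5))

print(len(train), len(val), len(test))
print(len(folds), "cross-validation folds")
```

The test set is put aside once and never used for model selection, while the folds are used to compare models on the training data only.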

Based on this general introduction, participants will be encouraged to suggest at least one current task in their own field of research where ML could be applied.

Learning outcomes

This session is aimed at a general audience and provides a high-level overview of ML. At the end of the session, attendees should have a ‘demystified’ view of ML and, hopefully, will be encouraged to explore the potential of ML in their own projects. Thus, the main outcomes are:

A qualitative understanding of the field of ML
A personalized list of items/tasks related to participants’ projects where ML could be applied

Materials

Pre-recorded video (1h): https://youtu.be/rnQAUeOFl3E
See Moodle for more!

ML2 - Basic course: S6

How can machine learning improve my record linkage procedures?

Content

A brief overview of the machine learning cycle
- Data preparation
- Splitting data into train, test and validation sets
- Training an ML model
- A taxonomy of ML models, including the most prominent families of ML models
- Cross-validation and external validation
- Bringing models to production (model deployment)
Links between ML and conventional tools in statistics (the past)
A selection of medical data analysis studies/articles applying ML (the present)
End-to-end pipelines and the promises of ML; how the ML cycle will be embedded in medical data management centers (the future)

Based on this general introduction, participants will be encouraged to suggest at least one current task in their own field of research where ML could be applied.

Learning outcomes

A qualitative understanding of the field of ML
A personalized list of items/tasks related to participants’ projects where ML could be applied

Materials

Pre-recorded video (45min): https://youtu.be/-M2ISBFbhZU
See Moodle for more!

ML3 - Advanced course: S1

Record linkage with unsupervised ML

Content

In this session, we will introduce the taxonomy of unsupervised ML methods, with a focus on clustering algorithms. We will present a real-case scenario, based on an RL pipeline developed by VO at the NCR. Python code (but not data) will be fully or partially shared with participants through a GitHub repository for educational purposes.
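
To give a flavour of how clustering can be used for RL, the sketch below groups record pairs into "match-like" and "non-match-like" clusters without any labels. It is not the NCR pipeline: the records, the field names, the use of `difflib` as the string similarity and the two-cluster k-means are all assumptions chosen to keep the example small and dependency-free.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] (difflib ratio, used here as a stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a, rec_b, fields):
    """One similarity score per field for a pair of records."""
    return [similarity(rec_a[f], rec_b[f]) for f in fields]


def two_means(vectors, n_iter=20):
    """Plain k-means with k=2.

    Centroids start at all-zeros (non-match-like) and all-ones (match-like).
    Returns one label per vector: 1 = assigned to the match-like cluster.
    """
    dim = len(vectors[0])
    centroids = [[0.0] * dim, [1.0] * dim]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            dists = [sum((x - c) ** 2 for x, c in zip(v, cen)) for cen in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up records for illustration only.
fields = ["first_name", "last_name", "birth_year"]
source = [
    {"first_name": "Mary", "last_name": "Banda", "birth_year": "1990"},
    {"first_name": "John", "last_name": "Phiri", "birth_year": "1985"},
]
target = [
    {"first_name": "Marry", "last_name": "Banda", "birth_year": "1990"},
    {"first_name": "Jon", "last_name": "Phiri", "birth_year": "1985"},
    {"first_name": "Grace", "last_name": "Zulu", "birth_year": "2001"},
]

pairs = [(i, j) for i in range(len(source)) for j in range(len(target))]
vectors = [comparison_vector(source[i], target[j], fields) for i, j in pairs]
labels = two_means(vectors)
matches = [pair for pair, lab in zip(pairs, labels) if lab == 1]
print(matches)
```

Comparing every source record with every target record only works for toy data; on realistic file sizes a blocking step would be needed first.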

Learning outcomes

At the end of this session, and by means of a real-world example relevant to the ESTHER project, participants will have a clear idea of the potential of unsupervised ML for RL, as well as the key requirements (both in terms of infrastructure and coding skills) that are needed to successfully develop an RL study.

Moodle

See Moodle for more!

ML4 - Advanced course: S2

Record linkage with supervised ML

Content

In this session, we will introduce the taxonomy of supervised ML methods, with a focus on binary classification methods. As in ML3, we will present a real-case scenario, this time based on an analysis of CIDRZ data. Unlike ML3, the focus will be on three key steps (rather than the full pipeline), namely schema matching, full name sorting and comparison vector classification. Python code (but not data) will be fully shared with participants through the course GitHub repository.
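
As an illustration of the last of these steps, comparison vector classification, the sketch below trains a small logistic regression to separate matches from non-matches and checks it on held-out pairs. It is not the method used in the CIDRZ analysis: the classifier choice, the hand-written training vectors and all function names are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, n_epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(n_epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b, threshold=0.5):
    """Return 1 (match) if the predicted probability reaches the threshold."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= threshold else 0


# Made-up comparison vectors: [first name, last name, birth date] similarities.
X_train = [
    [0.95, 1.0, 1.0], [0.85, 0.9, 1.0], [0.9, 1.0, 0.5], [1.0, 0.8, 1.0],  # matches
    [0.2, 0.1, 0.5], [0.0, 0.0, 0.25], [0.3, 0.2, 0.0], [0.1, 0.4, 0.5],   # non-matches
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

# Held-out labelled pairs, used only for validation.
X_test = [[0.88, 1.0, 1.0], [0.15, 0.0, 0.5]]
y_test = [1, 0]

w, b = train_logistic(X_train, y_train)
predictions = [predict(x, w, b) for x in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, accuracy)
```

With realistic data, accuracy alone can be misleading because non-matches vastly outnumber matches, so precision and recall on the match class are usually reported as well.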

Learning outcomes

It is impossible to cover in one session the plethora of supervised ML methods available today. Rather, participants will see in detail, and in a real case, how three supervised ML methods are applied to RL, including means to validate them.

Materials

Pre-recorded video (30min): https://youtu.be/fFkkI7SU8_U
See Moodle for more!

ML5 - Advanced course: S3

Towards an end-to-end record linkage pipeline

Content

We will use a synthetic, small-scale dataset to develop an end-to-end ML-based pipeline for RL. The project will be developed collaboratively by all participants, including facilitators. In the first round of discussions, participants will be asked a series of questions regarding the project planning, including the choice of algorithms for each of the key steps. An end-to-end notebook for record linkage will be developed collaboratively. At the end of the hands-on session, participants will present their work to their supervisors (e.g. ESTHER partners).
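
To make the idea of "key steps" concrete, a bare-bones pipeline might chain cleaning, blocking, comparison and a decision rule as in the sketch below. This is a hypothetical skeleton, not the notebook developed in the session: the field names, the blocking key, the `difflib` similarity and the 0.85 threshold are all assumptions, and in the course the fixed threshold would be replaced by an ML model.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def clean(record):
    """Lower-case every field and collapse surrounding/repeated whitespace."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def blocking_key(record):
    """Hypothetical key: first letter of the last name plus birth year."""
    return record["last_name"][:1] + record["birth_year"]


def block(source, target):
    """Yield candidate (source_idx, target_idx) pairs sharing a blocking key."""
    index = defaultdict(list)
    for j, rec in enumerate(target):
        index[blocking_key(rec)].append(j)
    for i, rec in enumerate(source):
        for j in index.get(blocking_key(rec), []):
            yield i, j


def score(a, b, fields):
    """Mean string similarity across the compared fields."""
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)


def link(source, target, fields, threshold=0.85):
    """Clean both files, compare only blocked pairs, keep those above threshold."""
    source = [clean(r) for r in source]
    target = [clean(r) for r in target]
    links = []
    for i, j in block(source, target):
        s = score(source[i], target[j], fields)
        if s >= threshold:
            links.append((i, j, round(s, 3)))
    return links


# Made-up records for illustration only.
fields = ["first_name", "last_name", "birth_year"]
source = [
    {"first_name": " Mary ", "last_name": "BANDA", "birth_year": "1990"},
    {"first_name": "John", "last_name": "Phiri", "birth_year": "1985"},
]
target = [
    {"first_name": "Marry", "last_name": "Banda", "birth_year": "1990"},
    {"first_name": "Moses", "last_name": "Bwalya", "birth_year": "1990"},
    {"first_name": "Jon", "last_name": "Phiri", "birth_year": "1985"},
    {"first_name": "Grace", "last_name": "Zulu", "birth_year": "2001"},
]

links = link(source, target, fields)
print(links)
```

Each function corresponds to one step that could be swapped for an alternative algorithm during the planning discussion.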

Learning outcomes

Participants will experiment with ML tools in Python, including basic data analysis and plotting packages. In addition, they will develop a customized RL pipeline.

Materials

Pre-recorded video (30min): https://youtu.be/-iiNJePTTjA
See Moodle for more!

ML6 - Advanced course: S4

Making record linkage useful to others

Content

In this session, we will build a simple app in Google Colaboratory to make record linkage useful to others.

Learning outcomes

Participants will learn about the importance of distributing code to their collaborators and, hopefully, gain the satisfying experience of building their first web-based application!

Materials

Pre-recorded video (30min): https://youtu.be/P7CmHbLU4sk
See Moodle for more!

Synthetic data generation

Precomputed datasets

This page contains several pre-computed datasets to be used throughout the workshop.

Please contact Miquel Duran-Frigola if you are not satisfied with the current datasets!

Below, you can find several exemplary precomputed synthetic datasets. Each zipped folder contains a source file, a target file and a ground truth file in CSV format. Parameters used for the calculation are given in JSON format.
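
The ground truth file is what allows a linkage result to be scored. The sketch below computes precision and recall of a set of predicted links against the true links. The column names `source_id` and `target_id` and the inline CSV text are assumptions for illustration; the actual layout of the files in the zipped folders may differ and should be checked before reuse.

```python
import csv
import io


def read_pairs(csv_text, src_col="source_id", tgt_col="target_id"):
    """Parse CSV text into a set of (source_id, target_id) pairs.

    The column names are hypothetical; adapt them to the real files.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row[src_col], row[tgt_col]) for row in reader}


def precision_recall(predicted, truth):
    """Precision and recall of predicted links against ground truth links."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Inline stand-ins for a ground truth file and a linkage output file.
truth_csv = "source_id,target_id\ns1,t7\ns2,t3\ns3,t9\ns4,t1\n"
predicted_csv = "source_id,target_id\ns1,t7\ns2,t3\ns3,t5\n"

truth = read_pairs(truth_csv)
predicted = read_pairs(predicted_csv)
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted links are correct and two of the four true links are found, so precision is 2/3 and recall is 1/2.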

S01: Toy example

This small dataset has 10 samples in the source file and 100 samples in the target file. The expected linkage rate is high (90%).

S01.zip (5KB)

S02: Realistic CIDRZ dataset

This dataset is representative of a typical facility-to-SmartCare linkage as performed at CIDRZ. The size of the source file is 5,000 samples and the size of the target file is 50,000. Repeated visits and duplicates are added. The expected linkage rate is 70%.

S02.zip (1MB)

S03: Large clean dataset

This dataset is aimed at testing the computational performance of the linkage pipeline. The source file has 50,000 samples and the target file has 500,000 samples. The expected linkage rate is 80%. No significant noise was added, so the linkage is expected to be simple. Some identifiers are included in the source file.

S03.zip (12MB)

S04: Mid-size noisy dataset

This dataset includes a significant amount of noise: misspellings, name swaps and similar errors are frequent. In addition, there is less consistency in the formats. The number of source samples is 10,000 and the number of target samples is 30,000. The expected linkage rate is 70%.

S04.zip (873KB)