ESTHER workshop

Welcome to the record linkage course!

This guidebook contains information about the ESTHER record linkage workshop.

Towards an end-to-end record linkage pipeline using machine learning

The aim of this course is to explore the potential of Machine Learning (ML) techniques applied to Record Linkage (RL). We will start the course by providing a gentle introduction to ML. Then, in a series of more specialized sessions, we will revisit the classical steps of record linkage (data processing/cleaning, blocking, comparison...) and will try to understand how ML can be used to increase linkage quality, performance and level of automation.

The course will be structured in 2 introductory sessions and 4 hands-on sessions focused on synthetic datasets provided to participants.

Basic record linkage course

For more information, visit the ESTHER Project on Moodle.

  • ML1 - Basic course: S5

  • ML2 - Basic course: S6

  • ML3 - Advanced course: S1

  • ML5 - Advanced course: S3

  • ML6 - Advanced course: S4

Schedule: schedule_overview.pdf (PDF, 19KB)

ML1 - Basic course: S5

Demystifying machine learning for medical data analysis

Content

Past, present and future of ML applied to medical data analysis. The field of ML is very broad, so emphasis will be placed on clarifying concepts and simplifying them for a general audience. Concepts include:

  • A brief overview of the machine learning cycle

    • Data preparation

    • Splitting data into train, test and validation sets

    • Training an ML model

  • A taxonomy of ML models, including the most prominent families of ML models

  • Cross-validation and external validation

  • Bringing models to production (model deployment)

  • Links between ML and conventional tools in statistics (the past)

  • A selection of medical data analysis studies/articles applying ML (the present)

  • End-to-end pipelines and the promises of ML; how the ML cycle will be embedded in medical data management centers (the future)

Based on this general introduction, participants will be encouraged to suggest at least one current task in their own field of research where ML could be applied.

Learning outcomes

This session is aimed at a general audience and provides a high-level overview of ML. At the end of the session, attendees should have a ‘demystified’ view of ML and, hopefully, will be encouraged to explore the potential of ML in their own projects. Thus, the main outcomes are:

  1. A qualitative understanding of the field of ML

  2. A personalized list of items/tasks related to participants’ projects where ML could be applied

Materials

  • Pre-recorded video (1h): https://youtu.be/rnQAUeOFl3E

  • See Moodle for more!

ML3 - Advanced course: S1

Record linkage with unsupervised ML

Content

In this session, we will introduce the taxonomy of unsupervised ML methods, with a focus on clustering algorithms. We will present a real-case scenario, based on an RL pipeline developed by VO at the NCR. Python code (but not data) will be fully or partially shared with participants through a GitHub repository for educational purposes.

Learning outcomes

At the end of this session, and by means of a real-world example relevant to the ESTHER project, participants will have a clear idea of the potential of unsupervised ML for RL, as well as the key requirements (both in terms of infrastructure and coding skills) needed to successfully develop an RL study.

Moodle

  • See Moodle for more!
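
To give a flavour of the unsupervised approach covered in ML3, here is a minimal sketch that clusters candidate record pairs into "likely match" and "likely non-match" groups without any labels. It is not the NCR pipeline: the names, the use of `difflib.SequenceMatcher` as similarity measure and the one-dimensional two-cluster k-means are all assumptions chosen to keep the example within the Python standard library.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_means(scores, iterations=20):
    """1-D k-means with k=2; returns (low_centre, high_centre, labels)."""
    low, high = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iterations):
        # Assign each score to the nearest centre (1 = high-similarity cluster).
        labels = [1 if abs(s - high) < abs(s - low) else 0 for s in scores]
        high_members = [s for s, label in zip(scores, labels) if label == 1]
        low_members = [s for s, label in zip(scores, labels) if label == 0]
        # Move each centre to the mean of its members.
        if high_members:
            high = sum(high_members) / len(high_members)
        if low_members:
            low = sum(low_members) / len(low_members)
    return low, high, labels


# Hypothetical candidate pairs (made-up names, no real data).
pairs = [
    ("Mary Banda", "mary banda"),
    ("John Phiri", "Jon Phiri"),
    ("Grace Zulu", "Grace Zullu"),
    ("Mary Banda", "Peter Mwale"),
    ("John Phiri", "Grace Zulu"),
    ("Grace Zulu", "Mary Banda"),
]
scores = [similarity(a, b) for a, b in pairs]
low, high, labels = two_means(scores)
for (a, b), score, label in zip(pairs, scores, labels):
    print(f"{a!r} vs {b!r}: score={score:.2f} cluster={label}")
```

Pairs that land in the high-similarity cluster are candidate links. A real study would typically cluster multi-field comparison vectors rather than a single name score.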

ML5 - Advanced course: S3

Towards an end-to-end record linkage pipeline

Content

We will use a synthetic, small-scale dataset to develop an end-to-end ML-based pipeline for RL. The project will be developed collaboratively by all participants, including facilitators. In the first round of discussions, participants will be asked a series of questions about project planning, including the choice of algorithms for each of the key steps. An end-to-end notebook for record linkage will then be developed collaboratively. At the end of the hands-on session, participants will present their work to their supervisors (e.g. ESTHER partners).
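
As a starting point for discussion, the classical steps of such a pipeline (cleaning, blocking, comparison, and a decision rule) can be sketched as below. This is only an illustration under stated assumptions: the records are made up, the blocking key (first letter of the last name token) and the 0.85 threshold are arbitrary choices, and a plain threshold stands in for the ML classifier that the session would actually develop.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def clean(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def block_key(name):
    """Blocking key: first letter of the last name token (a crude choice)."""
    tokens = clean(name).split()
    return tokens[-1][0] if tokens else ""


def link(source, target, threshold=0.85):
    """Return (source_id, target_id, score) for each best match above threshold."""
    # Group target records by blocking key so only plausible pairs are compared.
    blocks = defaultdict(list)
    for target_id, name in target.items():
        blocks[block_key(name)].append((target_id, clean(name)))

    links = []
    for source_id, name in source.items():
        cleaned = clean(name)
        best = None
        for target_id, candidate in blocks.get(block_key(name), []):
            score = SequenceMatcher(None, cleaned, candidate).ratio()
            if best is None or score > best[1]:
                best = (target_id, score)
        if best is not None and best[1] >= threshold:
            links.append((source_id, best[0], round(best[1], 2)))
    return links


# Hypothetical source and target files reduced to {id: name}.
source = {"s1": "Mary  BANDA", "s2": "John Phiri", "s3": "Grace Zulu"}
target = {"t1": "mary banda", "t2": "Jon Phiri", "t3": "Peter Mwale"}
links = link(source, target)
print(links)
```

Here `s3` stays unlinked because no target record shares its block. Blocking trades a small risk of missed links for far fewer comparisons, which matters on the larger datasets.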

Learning outcomes

Participants will experiment with ML tools in Python, including basic data analysis and plotting packages. In addition, they will develop a customized RL pipeline.

Materials

  • Pre-recorded video (30min): https://youtu.be/-iiNJePTTjA

  • See Moodle for more!

ML4 - Advanced course: S2

Record linkage with supervised ML

Content

In this session, we will introduce the taxonomy of supervised ML methods, with a focus on binary classification methods. As in ML3, we will present a real-case scenario, this time based on an analysis of CIDRZ data. Unlike ML3, the focus will be on three key steps (rather than the full pipeline), namely schema matching, full name sorting and comparison vector classification. Python code (but not data) will be fully shared with participants through the course GitHub repository.
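
To make the last of these steps concrete, the sketch below builds comparison vectors for labelled record pairs and fits a binary classifier on them. It is not the code shared in the course repository: the records, the two features (name similarity and birth-year agreement) and the hand-written logistic regression are assumptions chosen so that the example runs with the Python standard library alone.

```python
import math
from difflib import SequenceMatcher


def comparison_vector(a, b):
    """Compare two records field by field; every feature lies in [0, 1]."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_year = 1.0 if a["birth_year"] == b["birth_year"] else 0.0
    return [name_sim, same_year]


def predict_proba(weights, x):
    """Probability that the comparison vector x describes a true match."""
    z = weights[0] + sum(w * value for w, value in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors, labels, epochs=2000, learning_rate=0.5):
    """Fit logistic-regression weights (bias first) by batch gradient descent."""
    weights = [0.0] * (len(vectors[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(vectors, labels):
            error = predict_proba(weights, x) - y
            gradient[0] += error
            for j, value in enumerate(x):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / len(vectors)
                   for w, g in zip(weights, gradient)]
    return weights


def rec(name, birth_year):
    """Helper to build a hypothetical record."""
    return {"name": name, "birth_year": birth_year}


# Hypothetical labelled pairs: 1 = same person, 0 = different people.
labelled_pairs = [
    (rec("Mary Banda", 1990), rec("mary banda", 1990), 1),
    (rec("John Phiri", 1985), rec("Jon Phiri", 1985), 1),
    (rec("Grace Zulu", 2001), rec("Grace Zullu", 2001), 1),
    (rec("Mary Banda", 1990), rec("Peter Mwale", 1990), 0),
    (rec("John Phiri", 1985), rec("Grace Zulu", 2001), 0),
    (rec("Grace Zulu", 2001), rec("Mary Banda", 1990), 0),
]
vectors = [comparison_vector(a, b) for a, b, _ in labelled_pairs]
labels = [label for _, _, label in labelled_pairs]
weights = train_logistic(vectors, labels)

new_pair = comparison_vector(rec("Peter Mwale", 1978), rec("peter mwale", 1978))
print(f"match probability: {predict_proba(weights, new_pair):.2f}")
```

Six training pairs are far too few for a meaningful model, and evaluating on the training pairs says nothing about generalisation. In practice the classifier would be validated on held-out pairs.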

Learning outcomes

It is impossible to cover in one session the plethora of supervised ML methods available today. Instead, participants will see in detail, through a real case, how three supervised ML methods are applied to RL and how they can be validated.

Materials

  • Pre-recorded video (30min): https://youtu.be/fFkkI7SU8_U

  • See Moodle for more!

ML2 - Basic course: S6

How can machine learning improve my record linkage procedures?

Content

Past, present and future of ML applied to medical data analysis. The field of ML is very broad, so emphasis will be placed on clarifying concepts and simplifying them for a general audience. Concepts include:

  • A brief overview of the machine learning cycle

    • Data preparation

    • Splitting data into train, test and validation sets

    • Training an ML model

  • A taxonomy of ML models, including the most prominent families of ML models

  • Cross-validation and external validation

  • Bringing models to production (model deployment)

  • Links between ML and conventional tools in statistics (the past)

  • A selection of medical data analysis studies/articles applying ML (the present)

  • End-to-end pipelines and the promises of ML; how the ML cycle will be embedded in medical data management centers (the future)

Based on this general introduction, participants will be encouraged to suggest at least one current task in their own field of research where ML could be applied.

Learning outcomes

This session is aimed at a general audience and provides a high-level overview of ML. At the end of the session, attendees should have a ‘demystified’ view of ML and, hopefully, will be encouraged to explore the potential of ML in their own projects. Thus, the main outcomes are:

  1. A qualitative understanding of the field of ML

  2. A personalized list of items/tasks related to participants’ projects where ML could be applied

Materials

  • Pre-recorded video (45min): https://youtu.be/-M2ISBFbhZU

  • See Moodle for more!
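
Two of the concepts listed in this session, splitting data into train, validation and test sets and cross-validation, can be sketched in a few lines of plain Python. The function names, the 60/20/20 proportions and the five folds are illustrative assumptions, not course prescriptions. In practice a library such as scikit-learn provides equivalent, better-tested utilities.

```python
import random


def train_val_test_split(records, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle records and cut them into train, validation and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, held_out_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, held_out
        start += size


records = list(range(100))  # stand-in for 100 labelled record pairs
train, val, test = train_val_test_split(records)
folds = list(k_fold_indices(len(train), k=5))
print(len(train), len(val), len(test), len(folds))
```

The test set is put aside once and only used at the very end, while cross-validation rotates the held-out fold within the training data to obtain a more stable estimate during model selection.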

ML6 - Advanced course: S4

Making record linkage useful to others

Content

In this session, we will build a simple app in Google Colaboratory to make record linkage useful to others.

Learning outcomes

Participants will learn about the importance of distributing code to their collaborators and, hopefully, gain the satisfying experience of building their first web-based application!

Materials

  • Pre-recorded video (30min): https://youtu.be/P7CmHbLU4sk

  • See Moodle for more!

Precomputed datasets

This page contains several precomputed datasets to be used throughout the workshop.

Please contact Miquel Duran-Frigola if you are not satisfied with the current datasets!

Below, you can find several example precomputed synthetic datasets. Each zipped folder contains a source file, a target file and a ground truth file in CSV format. The parameters used to generate each dataset are given in JSON format.
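
One possible way to read such an archive with the Python standard library is sketched below. The member names and column names inside the real zip files are not documented on this page, so the loader simply collects every CSV and JSON member it finds. The in-memory archive, its file names and its columns are made up so that the example runs without downloading anything.

```python
import csv
import io
import json
import zipfile


def load_dataset(zip_source):
    """Read every CSV and JSON member of a dataset archive into memory.

    Returns (tables, params): tables maps a member name to a list of row
    dicts, params maps a member name to the parsed JSON object.
    """
    tables, params = {}, {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as handle:
                    text = io.TextIOWrapper(handle, encoding="utf-8-sig", newline="")
                    tables[name] = list(csv.DictReader(text))
            elif name.lower().endswith(".json"):
                with archive.open(name) as handle:
                    params[name] = json.load(handle)
    return tables, params


# Build a tiny in-memory archive; all names and columns here are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("source.csv", "id,name\n1,Mary Banda\n2,John Phiri\n")
    archive.writestr("target.csv", "id,name\n10,Mary Banda\n11,Jon Phiri\n12,Grace Zulu\n")
    archive.writestr("truth.csv", "source_id,target_id\n1,10\n2,11\n")
    archive.writestr("params.json", json.dumps({"expected_linkage_rate": 0.9}))
buffer.seek(0)

tables, params = load_dataset(buffer)
for name, rows in tables.items():
    print(name, len(rows), "rows")
```

For a downloaded file, `load_dataset("S01.zip")` would work the same way, since `zipfile.ZipFile` accepts a path as well as a file-like object.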

S01: Toy example

This small dataset has 10 samples in the source file and 100 samples in the target file. The expected linkage rate is high (90%).

  • S01.zip (archive, 5KB)

S02: Realistic CIDRZ dataset

This dataset is representative of a typical facility-to-SmartCare linkage as performed at CIDRZ. The source file has 5,000 samples and the target file has 50,000 samples. Repeated visits and duplicates are added. The expected linkage rate is 70%.

  • S02.zip (archive, 1MB)

S03: Large clean dataset

This dataset is aimed at testing the computational performance of the linkage pipeline. The source file has 50,000 samples and the target file has 500,000 samples. The expected linkage rate is 80%. No significant noise was added, so the linkage is expected to be simple. Some identifiers are included in the source file.

  • S03.zip (archive, 12MB)

S04: Mid-size noisy dataset

This dataset includes a significant amount of noise, meaning that misspellings, name swaps, etc. are frequent. In addition, formats are less consistent. The source file has 10,000 samples and the target file has 30,000 samples. The expected linkage rate is 70%.

  • S04.zip (archive, 873KB)