In this session, we will introduce the taxonomy of supervised ML methods, with a focus on binary classification. As in ML3, we will present a real-case scenario, this time based on an analysis of CIDRZ data. Unlike ML3, the session will concentrate on three key steps rather than the full pipeline: schema matching, full name sorting and comparison vector classification. Python code (but not data) will be fully shared with participants through the course GitHub repository.
It is impossible to cover the plethora of supervised ML methods available today in a single session. Instead, participants will see in detail, on a real case, how three supervised ML methods are applied to RL and how they can be validated.
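To make the idea of comparison vector classification more concrete, here is a minimal sketch in plain Python. It is not the method used in the session: the three features (name similarity, date-of-birth agreement, sex agreement), the toy values and the choice of logistic regression are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic regression with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Difference between predicted probability and true label
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x, threshold=0.5):
    """Return 1 (match) or 0 (non-match) for one comparison vector."""
    score = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return int(score >= threshold)

# Hypothetical comparison vectors: [name similarity, DOB agrees, sex agrees]
X_train = [
    [0.95, 1.0, 1.0], [0.90, 1.0, 1.0], [0.85, 1.0, 0.0], [1.00, 1.0, 1.0],
    [0.20, 0.0, 1.0], [0.30, 0.0, 0.0], [0.10, 0.0, 1.0], [0.40, 0.0, 0.0],
]
# Labels made up for this toy example: 1 = same person, 0 = different people
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_logistic(X_train, y_train)
print(predict(w, b, [0.92, 1.0, 1.0]))  # a pair that closely resembles the matches
print(predict(w, b, [0.15, 0.0, 0.0]))  # a pair that closely resembles the non-matches
```

Each record pair is reduced to a vector of field-level comparisons, and the classifier learns from labelled pairs which vectors correspond to matches. In practice, a library implementation and a much larger labelled set would be used.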
Pre-recorded video (30min): https://youtu.be/fFkkI7SU8_U
See Moodle for more!
Past, present and future of ML applied to Medical Data Analysis. The field of ML is very broad, so the emphasis will be on clarifying concepts and simplifying them for a general audience. Concepts include:
A brief overview of the machine learning cycle
Data preparation
Splitting data into train, test and validation sets
Training an ML model
A taxonomy of ML models, including the most prominent families of ML models
Cross-validation and external validation
Bringing models to production (model deployment)
Links between ML and conventional tools in statistics (the past)
A selection of medical data analysis studies/articles applying ML (the present)
End-to-end pipelines and the promises of ML; how the ML cycle will be embedded in medical data management centers (the future)
Based on this general introduction, participants will be encouraged to suggest at least one current task in their own field of research where ML could be applied.
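Two of the concepts listed above, splitting data into train, validation and test sets and cross-validation, can be sketched in a few lines of plain Python. The function names, split fractions and seeds below are assumptions chosen for illustration, not part of the course material.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the items and cut them into three disjoint subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def k_fold_indices(n, k=5, seed=0):
    """Yield (training indices, held-out indices) for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        training = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield training, held_out

# Hypothetical dataset: 100 record identifiers
records = list(range(100))
train, val, test = train_val_test_split(records)
print(len(train), len(val), len(test))

# Every record is held out exactly once across the folds
folds = list(k_fold_indices(10, k=5))
for training, held_out in folds:
    print(sorted(held_out))
```

The test set is put aside until the very end, the validation set guides model choices, and cross-validation repeats the train/validate cycle so that every sample is used for validation once.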
This session is aimed at a general audience and provides a high-level overview of ML. At the end of the session, attendees should have a ‘demystified’ view of ML and, hopefully, will be encouraged to explore the potential of ML in their own projects. Thus, the main outcomes are:
A qualitative understanding of the field of ML
A personalized list of items/tasks related to participants’ projects where ML could be applied
Pre-recorded video (1h): https://youtu.be/rnQAUeOFl3E
See Moodle for more!
This guidebook contains information about the ESTHER record linkage workshop.
The aim of this course is to explore the potential of Machine Learning (ML) techniques applied to Record Linkage (RL). We will start the course by providing a gentle introduction to ML. Then, in a series of more specialized sessions, we will revisit the classical steps of record linkage (data processing/cleaning, blocking, comparison...) and will try to understand how ML can be used to increase linkage quality, performance and level of automation.
The course is structured as 2 introductory sessions and 4 hands-on sessions that focus on synthetic datasets provided to participants.
For more information, visit the ESTHER Project on Moodle.
In this session, we will introduce the taxonomy of unsupervised ML methods, with a focus on clustering algorithms. We will present a real-case scenario, based on an RL pipeline developed by VO at the NCR. Python code (but not data) will be fully or partially shared with participants through a GitHub repository for educational purposes.
At the end of this session, and by means of a real-world example relevant to the ESTHER project, participants will have a clear idea of the potential of unsupervised ML for RL, as well as the key requirements (both in terms of infrastructure and coding skills) that are needed to successfully develop an RL study.
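As a small illustration of how clustering can be used for RL without labelled data, the sketch below splits pairwise similarity scores into a "match" and a "non-match" group with a one-dimensional two-means procedure. This is an assumption made for illustration, not the pipeline presented in the session, and the scores are made up.

```python
def two_means(scores, iterations=20):
    """Cluster 1-D scores into a low and a high group; return both centers."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        raise ValueError("all scores are identical; nothing to cluster")
    for _ in range(iterations):
        # Assign each score to its nearest center (ties go to the low group)
        low_group = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high_group = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low_group) / len(low_group)
        new_hi = sum(high_group) / len(high_group)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return lo, hi

def label(score, lo, hi):
    """Label a score by the center it is closest to."""
    return "match" if abs(score - hi) < abs(score - lo) else "non-match"

# Hypothetical similarity scores for eight candidate record pairs
scores = [0.05, 0.10, 0.12, 0.20, 0.88, 0.91, 0.95, 0.99]
lo_center, hi_center = two_means(scores)
labels = [label(s, lo_center, hi_center) for s in scores]
print(lo_center, hi_center)
print(labels)
```

No ground truth is needed here: the algorithm only relies on the scores forming two separated groups, which is why data quality and a sensible similarity measure matter so much in the unsupervised setting.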
See Moodle for more!
In this session, we will build a simple app in Google Colaboratory to make record linkage useful to others.
Participants will learn about the importance of distributing code to their collaborators and, hopefully, gain the satisfying experience of building their first web-based application!
Pre-recorded video (30min): https://youtu.be/P7CmHbLU4sk
See Moodle for more!
Past, present and future of ML applied to Medical Data Analysis. The field of ML is very broad, so the emphasis will be on clarifying concepts and simplifying them for a general audience. Concepts include:
A brief overview of the machine learning cycle
Data preparation
Splitting data into train, test and validation sets
Training an ML model
A taxonomy of ML models, including the most prominent families of ML models
Cross-validation and external validation
Bringing models to production (model deployment)
Links between ML and conventional tools in statistics (the past)
A selection of medical data analysis studies/articles applying ML (the present)
End-to-end pipelines and the promises of ML; how the ML cycle will be embedded in medical data management centers (the future)
Based on this general introduction, participants will be encouraged to suggest at least one current task in their own field of research where ML could be applied.
This session is aimed at a general audience and provides a high-level overview of ML. At the end of the session, attendees should have a ‘demystified’ view of ML and, hopefully, will be encouraged to explore the potential of ML in their own projects. Thus, the main outcomes are:
A qualitative understanding of the field of ML
A personalized list of items/tasks related to participants’ projects where ML could be applied
Pre-recorded video (45min): https://youtu.be/-M2ISBFbhZU
See Moodle for more!
This page contains several precomputed datasets to be used throughout the workshop.
Please contact Miquel Duran-Frigola if you are not satisfied with the current datasets!
Below, you can find several example precomputed synthetic datasets. Each zipped folder contains a source file, a target file and a ground truth file in CSV format. The parameters used to generate each dataset are given in JSON format.
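The ground truth file makes it possible to score any linkage result. The sketch below shows one way this could be done; the column names `source_id` and `target_id` and the inline CSV contents are hypothetical, so please check the actual files for their real layout.

```python
import csv
import io

# Hypothetical ground truth: which source record corresponds to which target record
GROUND_TRUTH_CSV = """source_id,target_id
s1,t10
s2,t25
s3,t31
"""

# Hypothetical output of a linkage pipeline, in the same assumed layout
PREDICTED_CSV = """source_id,target_id
s1,t10
s2,t99
s3,t31
s4,t40
"""

def read_pairs(text):
    """Parse CSV text into a set of (source_id, target_id) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["source_id"], row["target_id"]) for row in reader}

def precision_recall(predicted, truth):
    """Precision: share of predicted links that are correct.
    Recall: share of true links that were found."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

truth = read_pairs(GROUND_TRUTH_CSV)
predicted = read_pairs(PREDICTED_CSV)
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With real files, `io.StringIO(text)` would be replaced by an opened file handle; the rest of the logic stays the same.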
This small dataset has 10 samples in the source file and 100 samples in the target file. The expected linkage rate is high (90%).
This dataset is representative of a typical facility-to-SmartCare linkage as performed at CIDRZ. The size of the source file is 5,000 samples and the size of the target file is 50,000. Repeated visits and duplicates are added. The expected linkage rate is 70%.
This dataset is aimed at testing the computational performance of the linkage pipeline. The source file has 50,000 samples and the target file has 500,000 samples. The expected linkage rate is 80%. No significant noise was added, so the linkage is expected to be simple. Some identifiers are included in the source file.
This dataset includes a significant amount of noise: misspellings, name swaps and similar errors are frequent. In addition, formats are less consistent. The number of source samples is 10,000 and the number of target samples is 30,000. The expected linkage rate is 70%.
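To give a feeling for how a comparison step can cope with this kind of noise, here is a minimal sketch using only the Python standard library. Sorting the name tokens before comparing is an assumption made for illustration (it neutralizes swapped first and last names), and the example names are invented.

```python
from difflib import SequenceMatcher

def normalize_name(full_name):
    """Lowercase the name, drop commas and sort its tokens alphabetically."""
    tokens = full_name.lower().replace(",", " ").split()
    return " ".join(sorted(tokens))

def name_similarity(a, b):
    """Return a similarity between 0 and 1 that ignores token order."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

# Invented examples: a swapped name, a misspelling and an unrelated person
swapped = name_similarity("Banda, Mary", "Mary Banda")
misspelled = name_similarity("Mary Banda", "Marry Banda")
unrelated = name_similarity("Mary Banda", "John Phiri")
print(swapped, misspelled, unrelated)
```

The swapped pair becomes identical after normalization, the misspelled pair stays close to 1, and the unrelated pair scores much lower. Scores like these can then feed the comparison vectors used by a classifier.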
We will use a synthetic, small-scale dataset to develop an end-to-end ML-based pipeline for RL. The project will be developed collaboratively by all participants, including facilitators, in the form of an end-to-end notebook for record linkage. In the first round of discussions, participants will be asked a series of questions about project planning, including the choice of algorithms for each of the key steps. At the end of the hands-on session, participants will present their work to their supervisors (e.g. ESTHER partners).
Participants will experiment with ML tools in Python, including basic data analysis and plotting packages. In addition, they will develop a customized RL pipeline.
Pre-recorded video (30min): https://youtu.be/-iiNJePTTjA
See Moodle for more!