Schema matching
The first step of record linkage is recognizing the inputs.
Introduction
Schema matching is the mapping of columns in an input dataset to a set of standard columns that we will eventually use as linkage variables. For example, a person's full name may appear in a single column, Client Name, or be split across two columns, Name and Surname.
The schema matching task can be summarized as follows:
1. Get the column names of the input file
2. Read the content of the columns
3. Map column names to standard fields based on column contents
The mapping is produced by pre-trained text classifiers built from a massive synthetic dataset. The pre-trained models are downloaded automatically (you can find them here too). We used the fantastic FastText tool to build these language-based models.
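The three steps above can be sketched in a few lines of Python. This is only an illustration: the `guess_field` rules are a toy, hypothetical stand-in for the pre-trained FastText classifiers, and `match_schema` is not a function of the tool.

```python
import csv
import io
import re
from collections import Counter


def guess_field(value):
    """Toy content-based rule (an assumption, not the tool's trained classifier)."""
    v = value.strip().lower()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return "birth_date"
    if v in {"m", "f", "male", "female"}:
        return "sex"
    if re.fullmatch(r"\d{1,3}", v):
        return "age"
    if re.fullmatch(r"[a-z]+(\s+[a-z]+)+", v):
        return "full_name"
    return None


def match_schema(csv_text):
    """Return a {standard_field: input_column} mapping for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    mapping = {}
    for column in reader.fieldnames:              # step 1: column names
        values = [row[column] for row in rows]    # step 2: column contents
        votes = Counter(guess_field(v) for v in values)
        votes.pop(None, None)                     # ignore unrecognized values
        if votes:
            field = votes.most_common(1)[0][0]    # step 3: majority vote
            mapping[field] = column
    return mapping


sample = (
    "Client Name,DOB,Gender\n"
    "John Smith,1985-01-31,M\n"
    "Mary Banda,1990-07-12,F\n"
)
mapping = match_schema(sample)
```

Note that the decision is driven by the column contents, not by the header text, so `DOB` and `Gender` are recognized even though neither name matches a standard field. A toy rule like this cannot tell a birth date from a visit date, which is one reason a trained classifier is preferable.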
Standard fields
The majority of datasets that we encounter at CIDRZ contain only a limited number of linkage variables. We identify the following types of columns:
Reference fields
| Field | Description | Standard format |
| --- | --- | --- |
| full_name | First name and last name | john smith |
| birth_date | Date of birth | 1985-01-31 |
| visit_date | Date of data event, typically a visit to the clinic | 2020-12-31 |
| sex | Gender | m |
| identifier | Unique identifier for the patient | 23956b40-2cc9-4e50 |
Other standard fields
| Field | Reference field | Description | Standard format |
| --- | --- | --- | --- |
| first_name | full_name | First name | john |
| last_name | full_name | Last name | smith |
| age | birth_date | Age | 35 |
| birth_year | birth_date | Year of birth | 1985 |
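One way to read the relationship between the other standard fields and their reference fields is sketched below. The helper names are hypothetical, and how the tool actually combines these fields is not described here, so treat this as an assumption. In particular, deriving a birth year from an age and a visit date is only accurate to within about one year.

```python
# Each "other" standard field and the reference field it relates to (from the table).
REFERENCE_FIELD = {
    "first_name": "full_name",
    "last_name": "full_name",
    "age": "birth_date",
    "birth_year": "birth_date",
}


def reference_of(field):
    """Return the reference field for a standard field (reference fields map to themselves)."""
    return REFERENCE_FIELD.get(field, field)


def build_full_name(first_name, last_name):
    """Combine two name parts into the lowercase, single-spaced full_name format."""
    return " ".join(f"{first_name} {last_name}".lower().split())


def approximate_birth_year(age, visit_date):
    """Estimate the year of birth from an age and a visit date in YYYY-MM-DD format."""
    return int(visit_date[:4]) - int(age)


full_name = build_full_name("John", " Smith ")
birth_year = approximate_birth_year("35", "2020-12-31")
```

With the example values from the tables, an age of 35 at a visit on 2020-12-31 gives 1985, which matches the birth_year example.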
Usage
$ e2elink step match
Edit matching
The schema matching tool will do its best to match columns to standard columns. The mapping will be stored in a file that looks like this:
# standard fields
standard:
- first_name: "Name"
- last_name: "Surname"
- age: "Age"
- visit_date: "Date"
If you are not satisfied with the mapping, you can edit this file manually. The preprocessing step will read it.
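To make the file format concrete, the sketch below shows how a mapping like the one above could be applied to a record. It assumes the file has already been parsed: the `parsed` literal is the structure a YAML parser such as PyYAML's `yaml.safe_load` returns for that file. The functions `column_renames` and `standardize` are hypothetical and do not describe how the preprocessing step is actually implemented.

```python
# Parsed form of the mapping file: a list of single-key dictionaries under "standard".
parsed = {
    "standard": [
        {"first_name": "Name"},
        {"last_name": "Surname"},
        {"age": "Age"},
        {"visit_date": "Date"},
    ]
}


def column_renames(parsed):
    """Flatten the parsed mapping into {input_column: standard_field}."""
    renames = {}
    for entry in parsed.get("standard") or []:
        for field, column in entry.items():
            renames[column] = field
    return renames


def standardize(row, renames):
    """Keep only mapped columns and rename them to their standard field names."""
    return {renames[col]: value for col, value in row.items() if col in renames}


row = {"Name": "John", "Surname": "Smith", "Age": "35", "Date": "2020-12-31", "Notes": "n/a"}
standard_row = standardize(row, column_renames(parsed))
```

Columns that are not listed in the file, such as `Notes` here, are simply dropped in this sketch.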
Predefined matching
If you know the matching beforehand, you can specify it to speed up the process. For example, you can pre-specify the visit date column and let the tool guess the rest:
# standard fields
standard:
- visit_date: "Date of visit"
Then, pass this file when running the matching step:
$ e2elink step match --file my_match.yml
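Conceptually, a predefined matching constrains the automatic guess. The sketch below shows one plausible way to combine the two, in which predefined entries always win and a column that has already been assigned is not reused. The function `merge_mappings` is hypothetical, and the precedence rule is an assumption rather than documented behavior of the tool.

```python
def merge_mappings(predefined, guessed):
    """Combine {standard_field: column} mappings, giving priority to predefined entries."""
    merged = dict(predefined)
    taken = set(predefined.values())  # columns the user has already assigned
    for field, column in guessed.items():
        if field not in merged and column not in taken:
            merged[field] = column
    return merged


predefined = {"visit_date": "Date of visit"}
guessed = {
    "visit_date": "Registration date",  # overridden by the predefined entry
    "birth_date": "Date of visit",      # skipped: column already assigned
    "first_name": "Name",
}
final_mapping = merge_mappings(predefined, guessed)
```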