Schema matching

The first step of record linkage is recognizing the inputs.

Introduction

Schema matching maps the columns of an input dataset to a set of standard columns that are later used as linkage variables. For example, a person's full name may appear in a single column called Client Name, or it may be split across two columns, Name and Surname.

The schema matching task can be summarized as follows:

  1. Get column names of the input file

  2. Read the content of the columns

  3. Map column names to standard fields based on column contents

The mapping relies on pretrained text classifiers built from a large synthetic dataset. The pretrained models are downloaded automatically, and they are also available for manual download. We used the FastText tool to build the language-based models.
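
The three steps above can be illustrated with a minimal sketch. It is not the tool's implementation: the pretrained FastText classifiers are replaced here by a few hand-written rules, and the names guess_value_type and guess_columns are hypothetical. The sketch only guesses a coarse type per column (name, date, sex, age) from the values it contains, which is the general idea behind content-based matching.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical hand-written rules standing in for the pretrained classifiers.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
SEX_VALUES = {"m", "f", "male", "female"}


def guess_value_type(value):
    """Guess a coarse type for a single cell value, or None if unknown."""
    v = value.strip().lower()
    if not v:
        return None
    if DATE_RE.match(v):
        return "date"
    if v in SEX_VALUES:
        return "sex"
    if v.isdecimal() and 0 <= int(v) <= 120:
        return "age"
    if all(part.isalpha() for part in v.split()):
        return "name"
    return None


def guess_columns(csv_text):
    """Return {input column name: most frequently guessed type}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Step 1: get the column names of the input file.
    votes = {name: Counter() for name in reader.fieldnames}
    # Step 2: read the content of the columns, one vote per value.
    for row in reader:
        for name, value in row.items():
            if name is None:  # extra cells beyond the header are ignored
                continue
            guess = guess_value_type(value or "")
            if guess:
                votes[name][guess] += 1
    # Step 3: map each column to the type most of its values voted for.
    return {name: c.most_common(1)[0][0] for name, c in votes.items() if c}


sample = (
    "Client Name,Sex,Age,Date\n"
    "john smith,m,35,2020-12-31\n"
    "mary banda,f,28,2020-11-02\n"
)
print(guess_columns(sample))
```

Note that rules like these cannot tell a birth date from a visit date, which is one reason a trained classifier is a better fit for the real task.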

Standard fields

The majority of datasets that we encounter at CIDRZ contain only a limited number of linkage variables. We identify the following types of columns:

Reference fields

Field       Description                                          Standard format
----------  ---------------------------------------------------  ------------------
full_name   First name and last name                             john smith
birth_date  Date of birth                                        1985-01-31
visit_date  Date of data event, typically a visit to the clinic  2020-12-31
sex         Gender                                               m
identifier  Unique identifier for the patient                    23956b40-2cc9-4e50

Other standard fields

Field       Reference field  Description    Standard format
----------  ---------------  -------------  ---------------
first_name  full_name        First name     john
last_name   full_name        Last name      smith
age         birth_date       Age            35
birth_year  birth_date       Year of birth  1985

We recognize that the current number of standard fields is limited. We are willing to add more upon request. Please reach out to us if you have ideas or suggestions.
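
The relationship between standard fields and reference fields listed above can be written down as a simple lookup. The sketch below is only an illustration: the names REFERENCE_FIELD and covered_reference_fields are hypothetical and are not part of the tool. It shows which reference fields a given set of matched columns informs.

```python
# Each standard field and the reference field it informs, as listed above.
# Reference fields map to themselves.
REFERENCE_FIELD = {
    "full_name": "full_name",
    "birth_date": "birth_date",
    "visit_date": "visit_date",
    "sex": "sex",
    "identifier": "identifier",
    "first_name": "full_name",
    "last_name": "full_name",
    "age": "birth_date",
    "birth_year": "birth_date",
}


def covered_reference_fields(matched_fields):
    """Return the sorted reference fields informed by the matched fields."""
    return sorted({REFERENCE_FIELD[field] for field in matched_fields})


print(covered_reference_fields(["first_name", "last_name", "age", "visit_date"]))
```

A dataset with Name, Surname, Age and Date columns would therefore carry information about three reference fields: full_name, birth_date and visit_date.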

Usage

$ e2elink step match

Edit matching

The schema matching tool will do its best to match columns to standard columns. The mapping will be stored in a file that looks like this:

match.yml
# standard fields
standard:
 - first_name: "Name"
 - last_name: "Surname"
 - age: "Age"
 - visit_date: "Date"

You can manually edit this file if you are not satisfied with it. The preprocessing step is going to read it.
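
To make the file layout concrete, here is a minimal sketch of how such a mapping could be applied to an input table. It is not the tool's preprocessing code. The match dictionary is what a YAML parser would return for the file above, the function names are hypothetical, and dropping unmatched columns is an assumption made for the example.

```python
import csv
import io

# Parsed form of the match.yml shown above: a "standard" key holding
# a list of one-entry mappings {standard field: input column}.
match = {
    "standard": [
        {"first_name": "Name"},
        {"last_name": "Surname"},
        {"age": "Age"},
        {"visit_date": "Date"},
    ]
}


def rename_map(match):
    """Flatten the mapping into {input column: standard field}."""
    return {
        column: field
        for entry in match.get("standard", [])
        for field, column in entry.items()
    }


def standardize_header(csv_text, match):
    """Rename matched columns to standard names; unmatched columns are dropped."""
    renames = rename_map(match)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {renames[col]: val for col, val in row.items() if col in renames}
        for row in reader
    ]


rows = standardize_header(
    "Name,Surname,Age,Date,Notes\nJohn,Smith,35,2020-12-31,follow-up\n",
    match,
)
print(rows)
```

Because each list entry holds a single field, editing the file by hand amounts to changing the quoted column name on the right-hand side of an entry, or adding a new entry for a standard field.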

Predefined matching

If you know the matching beforehand, you can specify it to speed up the process. For example, you can pre-specify the visit date column and let the tool guess the rest:

my_match.yml
# standard fields
standard:
 - visit_date: "Date of visit"

Then, pass this file when running the matching step:

$ e2elink step match --file my_match.yml
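
One plausible way to combine a predefined mapping with guessed entries is sketched below. How the tool actually merges the two is not documented here, so the precedence rule (predefined entries always win, and a column is never assigned twice) is an assumption, and merge_matching is a hypothetical name.

```python
def merge_matching(predefined, guessed):
    """Combine two {standard field: input column} mappings.

    Predefined entries always win. A guessed entry is kept only if neither
    its standard field nor its input column is already taken.
    """
    merged = dict(predefined)
    used_columns = set(predefined.values())
    for field, column in guessed.items():
        if field not in merged and column not in used_columns:
            merged[field] = column
            used_columns.add(column)
    return merged


predefined = {"visit_date": "Date of visit"}
guessed = {
    "visit_date": "Enrolment date",  # ignored: field already predefined
    "birth_date": "Date of visit",   # ignored: column already assigned
    "first_name": "Name",            # kept
}
print(merge_matching(predefined, guessed))
```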
