Synthetic data
Simple generation of realistic synthetic datasets
Introduction
Synthetic data generation is key to record linkage because we most often deal with sensitive data containing personally identifiable information, which hampers data sharing and collaboration.
This tool produces synthetic data that is very similar to the medical datasets that we have encountered in our experience.
Source file: The main file of interest.
Target file: Typically a large file. Source file entries are searched within this file.
Linkage file: Contains pairs of indices corresponding to source and target rows. This file can be considered to be the ground truth (gold standard).
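To make the relationship between the three files concrete, here is a minimal Python sketch. It uses hypothetical in-memory rows and field names (`name`, `source_idx`, `target_idx`) chosen only for illustration; the actual columns and file layout produced by the tool may differ.

```python
# Hypothetical toy data; field names and values are illustrative assumptions,
# not the tool's actual output format.
source = [
    {"name": "Ana Lopez"},
    {"name": "John Smith"},
]
target = [
    {"name": "Jon Smith"},
    {"name": "Maria Garcia"},
    {"name": "Ana Lopes"},
]
# Each linkage entry pairs a source row index with a target row index.
linkage = [
    {"source_idx": 0, "target_idx": 2},
    {"source_idx": 1, "target_idx": 0},
]


def resolve_links(source, target, linkage):
    """Return the (source_row, target_row) pairs described by the linkage entries."""
    return [(source[link["source_idx"]], target[link["target_idx"]]) for link in linkage]


pairs = resolve_links(source, target, linkage)
for src, tgt in pairs:
    print(src["name"], "<->", tgt["name"])
```

Target rows that never appear in the linkage entries (here, the second target row) have no counterpart in the source file.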
Anonymize Personally Identifiable Information (PII)
Lack of gold standards
Small and large files
Test linkage pipelines under different conditions
Educational purposes
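As an illustration of testing a linkage pipeline against a gold standard, the sketch below scores a set of predicted links with precision and recall. The function name `evaluate_links` and the index pairs are hypothetical; this is one common way to score links under those assumptions, not a description of how the tool itself evaluates results.

```python
def evaluate_links(predicted, truth):
    """Compute precision and recall of predicted (source, target) index pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth (as read from a linkage file) and pipeline output.
truth = [(0, 2), (1, 0), (2, 5)]
predicted = [(0, 2), (1, 1), (2, 5), (3, 4)]

precision, recall = evaluate_links(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the synthetic linkage file is known to be correct by construction, the same scoring can be repeated across datasets generated with different sizes or conditions.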
Usage
Run in the command line
The following command will generate a random synthetic dataset in the current working directory.
$ e2elink synthetic
This will generate a realistic dataset that we believe is reasonable. You can pass a configuration file if you prefer:
The parameters file can look like this:
Alternatively, you can see our template parameters file and edit it manually. The following command will produce this file named synthetic_params.yml:
Run in the browser
An easier way to produce customized synthetic data is to use the synthetic data generator app. You can launch it locally with the following command:
An online demo of the synthetic data generator is available here.