Skills: open chemistry datasets
This workshop introduces some basic Python concepts and tools that the remaining sessions build on. We will use them to visualise the distribution of compounds from two ChEMBL datasets by following the steps in the Google Colab notebook.
The code for the workshop has been pre-written in Python, a flexible and user-friendly programming language widely used for machine learning and data analysis. The code is provided in notebook (.ipynb) format.
A number of online chemical databases offer freely available bioactivity screening data and are a valuable resource for ML model training in a drug discovery context. For this example, we’ll use data from the ChEMBL database, a manually curated database of structure-activity datasets.
- The dataset comprises 13,533 molecules.
- The IC50 bioactivity is measured as the concentration exhibiting 50% growth inhibition (pLDH inhibition assay).
- The dataset comprises 1,524 molecules.
- The EC50 bioactivity is measured as the concentration of the drug that gives half-maximal response.
1. Search in ChEMBL for 'plasmodium falciparum' and select 'Assays'.
2. Sort by number of compounds, from most to least.
3. Look for the 'St Jude Malaria Screening' dataset on the first page (ID: CHEMBL730079).
4. Download the molecules for the assay and unzip the file.
5. Rename the file to 'st_jude_3d7.csv'.
6. Drag and drop the file into the 'h3d_ersilia_ai_workshop/data/session1/' folder on Google Drive.
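Once the file is in place, it can be loaded into a table for a quick check. The following is a minimal sketch using pandas, not the workshop notebook's own code. It assumes the ChEMBL export is semicolon-delimited and has a `Smiles` column; check the header of your own download before relying on either. `load_chembl_csv` is a hypothetical helper name, and the rows in the example are made up for illustration.

```python
import io

import pandas as pd


def load_chembl_csv(source, smiles_col="Smiles", sep=";"):
    """Read a ChEMBL export and drop rows that have no structure.

    `smiles_col` and `sep` are assumptions about the export format.
    """
    df = pd.read_csv(source, sep=sep)
    # Rows without a SMILES string cannot be turned into fingerprints later.
    df = df.dropna(subset=[smiles_col]).reset_index(drop=True)
    return df


# Tiny in-memory stand-in for 'st_jude_3d7.csv' (made-up rows, not ChEMBL data).
example = io.StringIO(
    "Molecule ChEMBL ID;Smiles\n"
    "EXAMPLE1;CCO\n"
    "EXAMPLE2;\n"
    "EXAMPLE3;c1ccccc1\n"
)
df = load_chembl_csv(example)
print(df)
```

With the real file, `source` would be the path to 'st_jude_3d7.csv' inside the session folder.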
We need to describe our molecules numerically for computers to understand them. In this section, we make use of the popular Morgan Fingerprint algorithm to describe each molecule as a vector of 2048 numerical descriptors. However, this high-dimensional description is difficult to visualise. We can use ‘dimensionality reduction’ algorithms to reduce the data to two dimensions that can be readily plotted.
Conversion of a molecule to a Morgan Fingerprint
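As an illustration of this conversion, the sketch below uses RDKit's fingerprint generator to turn SMILES strings into 2048-long arrays of 0s and 1s. The radius of 2 is an assumption (the text specifies only the 2048 descriptors), `smiles_to_fingerprint` is a hypothetical helper name, and the three example molecules are chosen purely for illustration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# fpSize=2048 matches the text; radius=2 is an assumed, commonly used setting.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def smiles_to_fingerprint(smiles):
    """Return a 2048-long 0/1 array, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return generator.GetFingerprintAsNumPy(mol)


# Ethanol, benzene and aspirin as small test molecules.
smiles_list = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
fingerprints = [smiles_to_fingerprint(s) for s in smiles_list]

# Stack into a (molecules x descriptors) matrix for later analysis.
X = np.array(fingerprints)
print(X.shape)
```

Each row of `X` describes one molecule, and each column records whether a particular hashed atom environment is present in it.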
The principal component analysis (PCA) algorithm finds new, mutually orthogonal axes through the dataset along which the data points are as dispersed as possible (that is, have the greatest variance). It can be applied to high-dimensional data to linearly transform the data points to a lower-dimensional space for plotting, in our case reducing 2048 descriptors to just 2.
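To make the transformation concrete, here is a minimal sketch of PCA written directly with NumPy's singular value decomposition. It is not the notebook's implementation; `pca_2d` is a hypothetical helper name, and random bits stand in for real fingerprints.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centred data.
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    coords = X_centred @ Vt[:2].T
    # Fraction of the total variance captured by each of the two axes.
    explained = (S[:2] ** 2) / np.sum(S ** 2)
    return coords, explained


rng = np.random.default_rng(0)
# 50 random bit vectors standing in for 2048-bit fingerprints.
X = rng.integers(0, 2, size=(50, 2048))
coords, explained = pca_2d(X)
print(coords.shape)
```

In practice, `sklearn.decomposition.PCA(n_components=2).fit_transform(X)` from scikit-learn is the usual choice and gives the same projection up to the sign of each axis. The two columns of `coords` are the x and y values for a scatter plot.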
Every dot represents a molecule, and structurally related compounds lie closer together. If a dataset is dissimilar to the data used to train an ML model, the model is less likely to provide reliable predictions for it. In this case, additional training data would need to be curated.
PCA is a linear transformation of the data that focuses on preserving the global structure of the dataset. An alternative dimensionality reduction algorithm is UMAP (Uniform Manifold Approximation and Projection), a non-linear transformation that instead emphasises retaining local structure, so that similar data points form clusters. It is useful to have both plots available for a dataset to aid interpretation of chemical space similarity.