Skills: open chemistry datasets
This workshop will introduce some basic Python concepts and tools to facilitate the remaining sessions. We will utilize these concepts to visualise the distribution of compounds from two ChEMBL datasets by following the steps in the Google Colab notebook.
View the step-by-step presentation for this session.
Running Python code
The code for the workshop has been pre-written in Python, a flexible and user-friendly programming language widely used for machine learning and data analysis. The code is provided in notebook (.ipynb) format.
Python notebooks consist of a sequence of cells, each containing either plain text or Python code. The notebooks can be opened through the cloud-based Google Colaboratory service, or Google ‘Colab’ for short.
Datasets
A number of online chemical databases offer freely available bioactivity screening data and are a valuable resource for ML model training in a drug discovery context. For this example, we’ll use data from the ChEMBL database, a manually curated database of structure-activity datasets.
GSK Plasmodium falciparum 3D7
The first dataset contains dose-response data for the activity of anti-malarial compounds against a whole-cell drug-sensitive strain of Plasmodium falciparum (3D7). You can find the dataset at the ChEMBL website and in its associated publication.
The dataset comprises 13,533 molecules.
The IC50 bioactivity is measured as the concentration exhibiting 50% growth inhibition (pLDH inhibition assay).
St Jude Plasmodium falciparum 3D7
The second dataset also contains anti-malarial dose-response data deposited by another virtual screening campaign performed by the St Jude Children’s Research Hospital. The dataset can be found at the ChEMBL website and in its original publication.
The dataset comprises 1,524 molecules.
The EC50 bioactivity is measured as the concentration of the drug that gives half-maximal response.
Guidance for finding the St Jude 3D7 dataset
Search in ChEMBL for 'plasmodium falciparum' and select 'Assays'.
Sort by most-to-least number of compounds.
Look for the 'St Jude Malaria Screening' dataset on the first page (ID: CHEMBL730079).
Download the molecules for the assay and unzip the file.
Rename the file to 'st_jude_3d7.csv'.
Drag and drop the file into the 'h3d_ersilia_ai_workshop/data/session1/' folder on Google Drive.
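Once the file is in place, it can be read with a few lines of Python. The sketch below uses only the standard library and is not taken from the workshop notebook. It assumes that ChEMBL exports are semicolon-delimited with quoted fields and that the header contains the columns 'Molecule ChEMBL ID', 'Smiles' and 'Standard Value'; check these against the header of your downloaded file. The sample rows and their identifiers are made-up placeholders, not real ChEMBL records.

```python
import csv
import io

# Assumed layout of a ChEMBL activity export: semicolon-delimited, quoted fields.
# The column names are assumptions; compare them with your file's header row.
# The rows below are invented placeholders used only to exercise the parser.
SAMPLE_EXPORT = '''"Molecule ChEMBL ID";"Smiles";"Standard Type";"Standard Value";"Standard Units"
"CHEMBL_A";"CCO";"EC50";"120.0";"nM"
"CHEMBL_B";"";"EC50";"85.0";"nM"
"CHEMBL_C";"c1ccccc1O";"EC50";"";"nM"
'''


def read_activities(handle, delimiter=";"):
    """Return (molecule_id, smiles, value) tuples, skipping incomplete rows."""
    records = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        smiles = (row.get("Smiles") or "").strip()
        value = (row.get("Standard Value") or "").strip()
        if not smiles or not value:
            continue  # both a structure and a measurement are required
        records.append((row["Molecule ChEMBL ID"], smiles, float(value)))
    return records


# Parse the in-memory sample; only the first row is complete.
records = read_activities(io.StringIO(SAMPLE_EXPORT))
print(records)
```

To read the real file, open 'st_jude_3d7.csv' with open(..., newline="") and pass the file handle to read_activities in place of the in-memory sample.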
Visualizing Chemical Space
We need to describe our molecules numerically for computers to understand them. In this section, we make use of the popular Morgan Fingerprint algorithm to describe our molecules with 2048 numerical descriptors. However, this high-dimensional description is difficult to visualise. We can use ‘dimensionality reduction’ algorithms to reduce the data to two dimensions that can be readily plotted.
PCA
A common dimensionality reduction method is Principal Component Analysis (PCA), a linear transformation that preserves the global structure of the dataset.
The PCA algorithm finds new axes through the dataset that will cause data points to be as dispersed as possible when mapped onto the new axes. This can be applied to high-dimensional data to linearly transform the data points to a lower-dimensional space for plotting, in our case, transforming our 2048 descriptors to just 2.
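The idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the workshop notebook's code, which may use a different library: the helper name pca_project is hypothetical, and the random 0/1 matrix merely stands in for real 2048-bit fingerprints. The data are centred, the axes of greatest variance are found with a singular value decomposition, and each row is projected onto the first two axes.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their first n_components principal axes."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on mean-centred data
    # SVD of the centred data: the rows of vt are the principal axes,
    # ordered by decreasing variance of the data along them.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    coords = centered @ components.T  # linear map to the new axes
    explained = singular_values**2 / np.sum(singular_values**2)
    return coords, explained[:n_components]


# Placeholder data: 50 random bit vectors of length 2048 (not real molecules).
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(50, 2048))

coords, explained = pca_project(fingerprints)
print(coords.shape)  # one (x, y) pair per row of the input
```

The two columns of coords can be used directly as the x and y values of a scatter plot, and explained gives the fraction of the total variance captured by each of the two axes.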
Results interpretation
Each dot represents a molecule, and structurally related compounds lie closer together. If a dataset is dissimilar to the data used to train an ML model, the model is less likely to provide reliable predictions for it. In this case, additional training data would need to be curated.
UMAP
PCA is a linear transformation of the data that focuses on preserving the global structure of the dataset. An alternative dimensionality reduction algorithm is UMAP (Uniform Manifold Approximation and Projection), a non-linear transformation that instead emphasizes retaining local data structure, so that similar data points form clusters. It is useful to have both plots available for a dataset to aid interpretation of chemical space similarity.
Extension exercises
If you would like to see some additional examples of Python commands or some further usage of the Pandas library, have a look at the session 1 extension notebook.