Breakout: working with chemistry data

For this session, we will have group discussions around three focused topics to prepare for our dive into ML model building. We will form breakout groups of 5-6 participants. During the discussion, flipcharts will be available to take notes for feedback to the larger group afterwards.


Additional Datasets

BindingDB Human PI3K

This dataset contains dose-response data for inhibition of a human kinase, PI3K. Kinases are important for cellular processes and are attractive targets in the antimalarial field. We’ll investigate which of our whole-cell antimalarial screening libraries is better suited to predicting this PI3K dataset from ChEMBL.

  • The dataset comprises 345 molecules.

  • The IC50 bioactivity is measured as the concentration exhibiting 50% inhibition.

ChEMBL Acinetobacter baumannii

A. baumannii is a member of the ESKAPE pathogens, a group of bacteria with a worrying increase in the prevalence of antibiotic resistance. This dataset contains dose-response data for the inhibition of A. baumannii from multiple ChEMBL-curated sources and can be found here.

  • The dataset comprises 10,194 molecules.

  • The IC50 bioactivity is measured as the concentration exhibiting 50% inhibition.
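Both datasets report bioactivity as IC50. For modelling, IC50 values are commonly converted to a negative log scale (pIC50), so that more potent compounds get larger values. The snippet below is a minimal sketch of that conversion; it assumes the IC50 values are in nanomolar, which should be checked against the units column of each dataset, and the example values are made up for illustration.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


# Made-up IC50 values in nM (assumed units), not taken from either dataset
example_ic50_nm = [10.0, 1000.0, 25000.0]
pic50_values = [ic50_nm_to_pic50(v) for v in example_ic50_nm]

for ic50, pic50 in zip(example_ic50_nm, pic50_values):
    print(f"IC50 = {ic50:>8.1f} nM  ->  pIC50 = {pic50:.2f}")
```

On this scale a tenfold increase in potency adds one pIC50 unit, which tends to be easier for a model to learn from than raw concentrations spanning several orders of magnitude.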

Computational tools discussion

Take 10 minutes to get to know each other:

  • Give each person a chance to introduce themselves and their field of expertise.

  • What do you hope to learn about during the workshop?

  • Decide on a group name.

For 20 minutes, discuss the challenges and opportunities for ML in drug discovery:

  • What current computational tools and/or skills, if any, do you use in your research?

  • What challenges/limitations do you face that prevent you from making further use of data science tools?

  • How could some of these gaps be addressed?

Chemical space discussion

Allow 20 minutes to discuss the concept of 'chemical space'.

  • What do you understand by the term 'chemical space'?

  • How does this differ from the concept of 'drug-like molecules'?

  • Why should the data we use to train models be similar to the types of compounds we aim to obtain predictions for?

  • How might our training data requirements vary between a virtual screening campaign for novel chemical hits and a model to score analogues of a particular chemical series that is further along the drug discovery pipeline?

For the next 20 minutes, answer the questions for the following figures:

We will load a figure of an example dataset with various chemical series (sets of related chemical analogues) shown in colour.

  • Which of these series are outside the chemical space of our training data and why?

  • Does this make them more likely or less likely to be accurately predicted by a model trained on this data (the data in black)?
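One possible way to put a number on "outside the chemical space of the training data" is to compute, for each query compound, its highest Tanimoto similarity to any training compound. The sketch below is illustrative only: fingerprints are represented as toy sets of on-bit indices (in practice they would come from a cheminformatics toolkit), and the function names and example data are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_training_similarity(query: frozenset, training: list) -> float:
    """Highest similarity between a query fingerprint and any training one."""
    return max(tanimoto(query, t) for t in training)


# Toy fingerprints (made up): each set lists the bits that are switched on
training_set = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 3, 5, 7}),
]
series_a = frozenset({1, 2, 3, 5})      # shares most bits with training data
series_b = frozenset({3, 20, 21, 22})   # shares almost nothing

for name, fp in [("series A", series_a), ("series B", series_b)]:
    sim = nearest_training_similarity(fp, training_set)
    print(f"{name}: nearest-neighbour Tanimoto = {sim:.2f}")
```

A low nearest-neighbour similarity suggests the compound sits far from anything the model has seen. Where to draw the line between "inside" and "outside" is a judgment call that depends on the fingerprint and the task.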

Now we’ll investigate which of our datasets used in the skills workshop (GSK and St Jude antimalarial screens) would be better suited to screening a set of active human PI3K inhibitors against malaria. After producing the PCAs and UMAPs of the PI3K dataset with each potential training set, answer the following questions:

  • Which dataset would you rather use to train a model to predict PI3K inhibition and why?

  • Are there any concerns with your chosen dataset?

Data cleaning discussion

Initial datasets are not necessarily ready for ML model building. We often first need to clean datasets so that they conform to a common format and anomalies are removed. Datasets should be complete, consistent, accurate and reliable.

In your groups, brainstorm the types of inconsistencies you might need to look out for and address in drug discovery data in order to clean and standardise the dataset for machine learning. Use the ChEMBL A. baumannii dataset as an example of non-standardised data to generate ideas.
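As a starting point for the brainstorm, the sketch below shows three typical cleaning steps on a handful of made-up records: converting mixed units to nanomolar, dropping missing or qualified values (such as ">10"), and merging duplicate molecules by taking the median. The field names, unit labels and records are assumptions for illustration and do not reflect the actual columns of the ChEMBL export; it also assumes the SMILES strings are already canonical, so identical strings mean identical molecules.

```python
import math
from statistics import median

# Conversion factors to nanomolar (assumed unit labels)
UNIT_TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

# Made-up records illustrating common inconsistencies
raw_records = [
    {"smiles": "CCO", "value": "1.5", "units": "uM"},
    {"smiles": "CCO", "value": "2500", "units": "nM"},     # duplicate molecule
    {"smiles": "c1ccccc1", "value": None, "units": "nM"},  # missing value
    {"smiles": "CCN", "value": ">10", "units": "uM"},      # qualified value
    {"smiles": "CC(=O)O", "value": "300", "units": "nM"},
]


def clean(records):
    """Return {smiles: median IC50 in nM}, skipping unusable rows."""
    by_molecule = {}
    for rec in records:
        factor = UNIT_TO_NM.get(rec["units"])
        try:
            value = float(rec["value"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric (e.g. ">10") value
        if factor is None or not math.isfinite(value) or value <= 0:
            continue  # unknown unit or impossible concentration
        by_molecule.setdefault(rec["smiles"], []).append(value * factor)
    # Collapse replicate measurements of the same molecule to one value
    return {smiles: median(vals) for smiles, vals in by_molecule.items()}


cleaned = clean(raw_records)
for smiles, ic50_nm in cleaned.items():
    print(f"{smiles}: {ic50_nm:.1f} nM")
```

Dropping qualified values and taking the median of replicates are only two of several defensible choices. Part of the discussion can be whether such rows should be kept, flagged or treated differently.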

Here is an example of a clean dataset, where each column holds a single standardised type of value and there are no duplicates or missing data:
