AI2050 AI for Drug Discovery
4-day workshop on the use of AI for Drug Discovery
Each day is organised in four main activities:
Keynote Lectures: a presentation by H3D and Ersilia mentors and invited speakers, introducing key concepts that we will put into practice during the day
Skills Workshop: a hands-on training where the course facilitators will walk the participants through a specific ML exercise
Breakout: participants will be divided in small groups to complete an exercise related to the skills workshop
Group Discussion: each group will give a short (10 min) presentation on their breakout activity.
Activity | Session 1 | Session 2 | Session 3 | Session 4 |
---|---|---|---|---|
After the course, the apps will remain active on Streamlit Cloud and also available for local deployment from GitHub. Please keep in mind that these apps were built for the course and may not be directly suitable for a research project. Contact Ersilia for more!
In case you need further information to follow the course, be sure to check the following:
Ersilia Model Hub: list of all available models.
Ersilia Model Hub: self-service on GitHub (requires a GitHub account).
Ersilia Model Hub: local installation documentation (only expert users).
Publications referred to during the workshop.
Datasets required for the workshop.
Everything we do in AI is driven by the data we have available. Often, this data, whether generated in-house or obtained from literature, is inconsistent and includes many sources of noise. It is important to be aware of the causes for noisy data and how to minimize this before feeding the data into data science tools. It is also helpful to understand the chemical space that the model has trained on and how this relates to the chemical space we are interested in. This understanding will contribute to our confidence in the model predictions before using them for prospective screening.
This activity is split into three parts:
A short ice-breaker to get to know your assigned breakout group members.
A data cleaning activity from a real-world public dataset.
A chemical space analysis activity to compare the similarities and differences between chemical libraries.
Take a few minutes to get to know the members of your breakout group. Take turns to cover the following points about yourself:
Your name, research institution and field of research.
How you currently use computational tools for your research, if any, and specifically AI-based tools.
How you think data science tools can contribute to your research going forward.
Lastly, select a scribe for your breakout group, as well as a second person who will give feedback on your group's discussion of the next two tasks during the feedback session.
Task 2.1: Data consistency discussion
As a group, discuss the following properties. Why should data have each of them before we train models on it? Think about what problems might result from a dataset that lacks each of these properties.
Completeness
Consistency
Accuracy
Relevancy
Task 2.2: Data cleaning hands-on example.
During the skills development, we spoke about the need for clean data and some examples of common problems in chemical datasets. Ideally, we need a set of compound structures and corresponding assay outcomes from the same experimental conditions. Now we will identify some examples of data inconsistencies in a dataset from literature.
The Community for Open Antimicrobial Drug Discovery (CO-ADD) has curated a database of compounds that have been tested for activity against a set of infectious bacteria known as the ESKAPE pathogens. Let's say we want to create a clean dataset to predict the activity of compounds against the A. baumannii bacterium. Download the dose-response dataset from this link and answer the following questions:
What is the output of the assay measuring?
Find examples of data inconsistencies that make the dataset incomplete, inconsistent, inaccurate, or irrelevant for modelling A. baumannii activity. How did you find each issue in the dataset?
How could you address each of the points you found in (b) to make the dataset cleaner?
Were there any other checks you performed where the dataset did not have an issue?
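The checks discussed above can be sketched in code. The following is a minimal, hypothetical example using only the Python standard library; the column names (`SMILES`, `ORGANISM`, `MIC`) are illustrative assumptions and will differ from those in the actual CO-ADD export.

```python
from collections import defaultdict

def check_dataset(rows):
    """Run basic completeness/consistency/relevancy checks on assay rows.

    Each row is a dict. Column names here (SMILES, ORGANISM, MIC) are
    hypothetical and should be matched to the real file's headers.
    """
    report = {"missing_smiles": 0, "wrong_organism": 0, "conflicting": []}
    outcomes = defaultdict(set)
    for row in rows:
        smiles = row.get("SMILES", "").strip()
        if not smiles:
            report["missing_smiles"] += 1      # completeness: no structure
            continue
        if row.get("ORGANISM") != "A. baumannii":
            report["wrong_organism"] += 1      # relevancy: different pathogen
            continue
        outcomes[smiles].add(row.get("MIC"))
    # consistency: the same structure reported with different outcomes
    report["conflicting"] = [s for s, vals in outcomes.items() if len(vals) > 1]
    return report

# Toy rows made up for illustration:
rows = [
    {"SMILES": "CCO", "ORGANISM": "A. baumannii", "MIC": "8"},
    {"SMILES": "CCO", "ORGANISM": "A. baumannii", "MIC": "32"},  # conflicting MIC
    {"SMILES": "",    "ORGANISM": "A. baumannii", "MIC": "4"},   # missing structure
    {"SMILES": "CCN", "ORGANISM": "E. coli",      "MIC": "2"},   # wrong organism
]
report = check_dataset(rows)
```

In a real workflow you would also validate the SMILES strings themselves (e.g. with RDKit) and standardize units before comparing outcomes; this sketch only flags the record-level issues.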
Task 3.1: Chemical space discussion
In your group, discuss and answer the following questions:
What do you understand by the term 'chemical space'?
How does this differ from the concept of 'drug-like molecules'?
Why should the compounds we use to train models and compounds we want predictions for be similar?
How might the requirements for our training data vary between the following scenarios?
Virtual screening for novel chemical hits
Ranking closely related analogues within one chemical series
Task 3.2: Chemical space visualization
Go to ChEMBL and download the St Jude 3D7 screening set for malaria (ID: CHEMBL730079). Follow the step-by-step instructions in the breakout session slides. Then upload this dataset to the chemical space visualization app (link) and select the ‘MMV Malaria Box’ and ‘Open Source Malaria’ checkboxes from the list of example libraries. Imagine we use the St Jude 3D7 dataset to train a model to screen for new chemical hits in the MMV Malaria Box and Open Source Malaria datasets. Answer the following questions:
Do you think that a model that has been trained on the St Jude 3D7 dataset would be more predictive for the MMV Malaria Box or the Open Source Malaria libraries? Why?
What steps could you take to make the model more applicable to the library that is more difficult to make predictions for?
How could we improve our model once we’ve experimentally tested the first set of compounds that were selected from the virtual screening?
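The notion of "similarity between libraries" in these questions is usually quantified with the Tanimoto coefficient over molecular fingerprints. The toy sketch below uses hand-made bit sets to show the computation; in practice the fingerprints would come from a cheminformatics toolkit such as RDKit (e.g. Morgan fingerprints), and the bit indices here are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy "fingerprints": sets of on-bit indices, made up for illustration.
train_mol = {1, 4, 7, 9}            # a molecule from the training library
close_analogue = {1, 4, 7, 12}      # shares most substructure bits
distant_scaffold = {2, 5, 11, 20}   # little overlap with the training set

sim_close = tanimoto(train_mol, close_analogue)      # high similarity
sim_distant = tanimoto(train_mol, distant_scaffold)  # low similarity
```

A model trained on molecules like `train_mol` would typically be more reliable for the close analogue than for the distant scaffold, which is the intuition behind comparing chemical spaces before virtual screening.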
A virtual screening cascade allows us to mimic in the computer some of the experimental steps we must take to identify new drug leads. By filtering out molecules with predicted low activities, or undesired side effects, we can lower the cost and time of finding new drug candidates. Ideally, we can build virtual screening cascades based on our own data, but for many assays we do not have readily available experimental data. In these situations, we can leverage models developed by third parties and apply them to our problem.
Virtual screening cascades are not meant to substitute experimental testing, but act as a decision-making support tool.
In this activity, we will replicate the work described in Liu et al., 2023, in which the authors built an ML model to identify novel A. baumannii inhibitors and used it to filter the Drug Repurposing Hub.
During the skills development session, we did a deep dive into ML model building using the A. baumannii model as an example. Now, download the list of compounds available in the Drug Repurposing Hub and continue the "virtual screening" along the lines of the original authors' work. To that end, we suggest running predictions of A. baumannii activity and a few accessory models available through Ersilia to select the best candidates. In short, the steps to follow are:
Download the Drug Repurposing Hub data for your group from this link.
Look at the Ersilia Model Hub models available online (select Online in the left menu of the website).
Select which models would be relevant to your exercise and why, and take notes on model interpretation, expected results, priority level etc.
Run predictions for the Drug Repurposing Hub molecules using the selected models.
Select the best molecule candidates based on your defined filters of activity, ADME properties and other considerations.
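Once the model predictions have been gathered (step 4), the selection in step 5 amounts to applying cut-offs across the merged outputs. The sketch below shows this in plain Python; the prediction values and cut-offs are illustrative assumptions, not recommendations, and should be replaced by the thresholds your group agrees on.

```python
# Hypothetical merged predictions per molecule, keyed by Ersilia model ID.
# All numbers are made up for illustration.
predictions = {
    "mol_A": {"eos3804": 0.91, "eos43at": 0.10, "eos9ei3": 2.5},
    "mol_B": {"eos3804": 0.40, "eos43at": 0.05, "eos9ei3": 3.0},
    "mol_C": {"eos3804": 0.85, "eos43at": 0.80, "eos9ei3": 4.0},
}

def passes_filters(scores):
    """Keep molecules predicted active, non-cardiotoxic and synthesizable.

    Cut-off values here are assumptions for the sketch; check each
    model's documentation for how its output should be interpreted.
    """
    return (scores["eos3804"] >= 0.5       # A. baumannii activity
            and scores["eos43at"] < 0.5    # cardiotoxicity risk
            and scores["eos9ei3"] <= 6.0)  # synthetic accessibility score

selected = [mol for mol, scores in predictions.items() if passes_filters(scores)]
```

Here `mol_B` is filtered out for low predicted activity and `mol_C` for high cardiotoxicity risk, leaving only `mol_A`; with the real Drug Repurposing Hub subsets the same pattern applies at a larger scale.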
To simplify the exercise, we have prepared 5 subsets of data from the Drug Repurposing Hub. Each group should use its assigned dataset only.
To keep the exercise manageable, please limit your screening to the following models:
A. baumannii Activity: eos3804
General Antibiotic Activity: eos4e40
Cardiotoxicity: eos43at
Synthetic Accessibility: eos9ei3
ADME properties: eos7d58
Natural Product Likeness: eos9yui
It is best to select only a few models but really understand how to use them rather than running predictions for all the models but not knowing how to interpret the outcomes.
For each model, think about the following questions:
What type of model is it (classification or regression)?
What is the training dataset? (refer to the original publication listed above)
What is the interpretation of the model outcome?
What cut-off, if any, should we use for that particular model?
In addition, think about the following concepts:
Does the outcome of the model make sense? If it does not make sense, perhaps we have the wrong interpretation of the model output.
Is the cut-off I have selected too stringent (i.e., am I losing too many molecules and should I be more permissive)?
Is this model very relevant for the current dataset (e.g., is antibacterial activity equally as important as natural product likeness)?
Molecule selection
Use the predicted values to select the 10 molecules that you would take forward for experimental testing. To guide your choice, think about:
What are the most important activities you want to optimize?
What are strict no-go points?
What are the activities that are easiest to optimize at the lead stage?
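One simple way to combine these priorities into a ranking is a weighted score. In the sketch below, the property scores and weights are illustrative assumptions (scores are normalized to [0, 1] with higher being better); your group's weighting should reflect the priorities you discussed.

```python
# Hypothetical weights reflecting the priorities discussed above.
weights = {"activity": 0.6, "admet": 0.3, "synthesis": 0.1}

# Hypothetical normalized scores in [0, 1], higher is better.
candidates = {
    "mol_A": {"activity": 0.9, "admet": 0.6, "synthesis": 0.8},
    "mol_B": {"activity": 0.7, "admet": 0.9, "synthesis": 0.9},
    "mol_C": {"activity": 0.5, "admet": 0.5, "synthesis": 0.4},
}

def score(props):
    """Weighted sum of the property scores."""
    return sum(weights[key] * props[key] for key in weights)

# Rank candidates by score and take the top N (10 in the actual exercise).
ranked = sorted(candidates, key=lambda mol: score(candidates[mol]), reverse=True)
top = ranked[:2]
```

A weighted sum is only one option; a strict no-go point is better expressed as a hard filter applied before ranking, since no amount of activity should compensate for it.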
Finally, prepare a short presentation for the rest of the participants. This should cover:
Which models did you choose and why?
What selection strategy did you decide on?
Which were your selected molecules?
Extension: if you finish the proposed activity, have a look at what else is available in the Ersilia Model Hub and think about what you would like to see deployed online!
Sampling the chemical space is a fundamental task in computational drug discovery. Recently, generative AI methods have significantly expanded our sampling capabilities. As a result, traditionally laborious steps, such as hit-to-lead optimization, can now be supported by computational tools that generate new chemical matter as starting points for further analysis. In this session, we will experiment with some of these tools to sample the chemical space around a seed molecule.
You will be presented with four hits obtained from an experimental screening against Burkholderia cenocepacia (growth inhibition assay). These will serve as your starting points. As a team, agree on one hit (seed) molecule for further analysis. There is no right or wrong answer.
Next, you will have the opportunity to experiment with different “samplers” from the Ersilia Model Hub. Read the model descriptions and discuss the following:
Some samplers perform a similarity search against a large chemical library. Can you identify these samplers? Do you expect them to give deterministic results?
Other samplers generate genuinely new molecules. Of these, two are fragment-based and two are deep-learning (AI) based. Can you classify them?
What are the expected advantages and disadvantages of each type of model?
Some samplers may not yield results. What might be the reason for this?
The best way to understand the behavior of sampling methods is to run them and explore the results, both visually and quantitatively. Feel free to sample compounds multiple times, trying different methods. There are filtering options in the app, as well as a simple display of the molecular structures.
Ultimately, you should create a wishlist of 100-500 molecules, ideally obtained using multiple models. To create this wishlist, consider the following questions:
What is a reasonable range of Tanimoto similarities to the seed molecule?
Of the auxiliary properties (e.g., molecular weight, logP, and QED), which is the most relevant at this stage?
Pro tip: open an Excel spreadsheet and copy-paste your molecules of interest into it. Once you are satisfied with the list, copy-paste it into the relevant box in the app.
The Ersilia Model Hub contains a variety of activity and property prediction models that can be used to further assess your wishlist. In this case, we have selected three activity prediction models (eos4e40, eos5xng, and eos9f6t) and a multi-output ADMET model (eos7d58). Discuss the following with your team:
Which activity prediction model is best suited for this task?
Which 5 ADMET properties would you prioritize? There is no right or wrong answer.
What is more important at this stage: activity or ADMET properties?
ADMET values will be given in percentiles. How do you interpret these values?
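A percentile tells you where a molecule sits relative to a reference set rather than giving an absolute value: the 90th percentile means the molecule's predicted value is higher than that of 90% of the reference molecules. The sketch below shows the computation with made-up numbers; the reference distribution is an illustrative assumption.

```python
def percentile_rank(value, reference):
    """Percentage of reference values at or below `value`."""
    below = sum(1 for ref_value in reference if ref_value <= value)
    return 100.0 * below / len(reference)

# Toy reference distribution (e.g. a predicted ADMET endpoint computed
# over a reference library; numbers are made up for illustration).
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
p = percentile_rank(0.5, reference)  # this molecule sits at the 50th percentile
```

Whether a high percentile is good or bad depends on the endpoint: for solubility, higher is usually desirable; for a toxicity endpoint, lower is better. Always check the direction of each property before applying a cut-off.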
Once you have made your decision, you can calculate your desired endpoints for all the molecules in your wishlist.
Explore results on two levels:
General level: Assess whether wishlist molecules have high predicted activities (as desirable) and acceptable ADMET properties.
Individual level: Select 3-5 molecules and discuss whether they are good leads.
Send your selected 3-5 compounds to miquel@ersilia.io. The title of your email should be "AI2050 Group N". Just copy-paste the SMILES strings of the molecules into the email text.
Prepare a short presentation for the rest of the participants. The presentation should cover the following:
What seed molecule did you choose and why?
Which sampler models did you find most useful?
What was your rationale for assembling your wishlist?
Which lead molecules did you select and why?