AI2050 AI for Drug Discovery

4-day workshop on the use of AI for Drug Discovery

Course overview

Each day is organised in four main activities:

  • Keynote Lectures: a presentation by H3D and Ersilia mentors and invited speakers, introducing key concepts that we will put into practice during the day

  • Skills Workshop: a hands-on training where the course facilitators will walk the participants through a specific ML exercise

  • Breakout: participants will be divided into small groups to complete an exercise related to the skills workshop

  • Group Discussion: each group will give a short (10 min) presentation on their breakout activity.

Course Contents

In case you need further information to follow the course, be sure to check the following:

  • Ersilia Model Hub: list of all available models.

  • Ersilia Model Hub: self-service on GitHub (requires a GitHub account; allows you to run online predictions for all models).

  • Ersilia Model Hub: fast online deployment for selected models.

  • Ersilia Model Hub: local installation documentation (expert users only).

  • Publications referred to during the workshop.

  • Datasets required for the workshop.

Breakout Sessions

Breakout session day 1

Pre-activity 1: Ice-breaker

Take a few minutes to get to know the members of your breakout group. Take turns to cover the following points about yourself:

  1. Your name, research institution and field of research.

  2. How you currently use computational tools for your research, if any, and specifically AI-based tools.

  3. How you think data science tools can contribute to your research going forward.

Lastly, select a scribe for your breakout group as well as someone else who will provide some feedback on your group’s discussion for the next two tasks during the feedback session.

Task 1: Data Cleaning

The Community for Open-Antimicrobial Drug Discovery (CO-ADD) has curated a database of compounds that have been tested for activity against a set of infectious bacteria known as the ESKAPE pathogens. This is a great source of data for training models that can predict a compound’s activity against bacteria. However, we will need to clean the raw data first.

This activity is split into three parts:

  1. Follow the task 1 guidance (day 1 breakout slides) to download a dataset from CO-ADD.

  2. Answer questions 1 to 5.

  3. Deliverable: Compile the list of data issues (question 4) that your group identified in the raw dataset and that need to be addressed to clean it.

Questions:

  1. What are the assay(s) in this dataset measuring?

  2. Why should datasets have each of the following properties before we use them to train models?

    1. Completeness

    2. Consistency

    3. Accuracy

    4. Relevancy

  3. What problem(s) might result from a dataset that does not have each of these properties when used to train a model?

  4. Deliverable: Find examples of data issues that prevent the dataset from being complete, consistent, accurate, or relevant for modelling A. baumannii activity. How could you address each of the issues you found to clean the dataset?

  5. Were there any other problems that you thought to look for but were not a problem in this dataset?

Deliverable:

Place your answers to question 4 in a Word document. You will email these answers after adding the task 2 deliverable to this document.
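As a starting point for question 4, the sketch below shows one way to flag common issues in a table of assay records using only the Python standard library. The column names (`smiles`, `organism`, `value`, `unit`) and the example rows are hypothetical and do not reflect the actual CO-ADD export format; adapt them to the file you downloaded.

```python
# Hypothetical records; real CO-ADD column names and values will differ.
rows = [
    {"smiles": "CCO", "organism": "Acinetobacter baumannii", "value": "32", "unit": "ug/mL"},
    {"smiles": "CCO", "organism": "Acinetobacter baumannii", "value": "32", "unit": "ug/mL"},
    {"smiles": "", "organism": "Acinetobacter baumannii", "value": "16", "unit": "ug/mL"},
    {"smiles": "c1ccccc1", "organism": "A. baumannii", "value": ">64", "unit": "uM"},
    {"smiles": "CCN", "organism": "Escherichia coli", "value": "8", "unit": "ug/mL"},
]


def find_issues(rows, target_organism, expected_unit):
    """Return a dict mapping an issue name to the indices of affected rows."""
    issues = {
        "missing_field": [],      # completeness
        "duplicate": [],          # exact repeats of an earlier row
        "unit_mismatch": [],      # consistency
        "non_numeric_value": [],  # censored or malformed values such as ">64"
        "other_organism": [],     # relevancy, or inconsistent organism naming
    }
    seen = set()
    for i, row in enumerate(rows):
        if any(not str(v).strip() for v in row.values()):
            issues["missing_field"].append(i)
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate"].append(i)
        seen.add(key)
        if row["unit"] != expected_unit:
            issues["unit_mismatch"].append(i)
        try:
            float(row["value"])
        except ValueError:
            issues["non_numeric_value"].append(i)
        if row["organism"] != target_organism:
            issues["other_organism"].append(i)
    return issues


issues = find_issues(rows, "Acinetobacter baumannii", "ug/mL")
for name, indices in issues.items():
    print(name, indices)
```

Note that the organism check flags both genuinely different species and alternative spellings of the target species, so flagged rows still need manual review before being dropped.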

Task 2: Chemical Space Exploration

Let’s imagine we have trained a model to predict antiplasmodial activity and we have two compound libraries that we could now virtually screen for new chemical hits. We want to start by selecting just one of these two libraries based on how relevant the training data of our model is to each library.

Steps to complete this activity:

  1. Follow the task 2 guidance (day 1 breakout slides) to download a dataset from ChEMBL.

  2. Answer questions 1 to 6.

  3. Deliverable (question 5): a) A screenshot of your UMAP chemical space, b) the library you have chosen to screen and a short reason why you chose this library.

Questions:

  1. What do you understand by the term ‘chemical space’?

  2. How does this differ from the concept of ‘drug-like molecules’?

  3. Why should the compounds we use to train models be similar to the compounds we want to obtain predictions for?

  4. What type of training data is needed for each of the following scenarios?

    1. Virtual screening of broad chemical space for novel chemical hits?

    2. Ranking closely related analogues within a chemical series to prioritize compounds for synthesis?

  5. Deliverable: Follow the guidance for task 2 to download and plot the chemical space of several antiplasmodial screening libraries. Do you think that a model that has been trained on the St Jude 3D7 dataset would be better at predicting activity in the MMV Malaria Box or the Open Source Malaria libraries? Why?

  6. How could we improve our model once we’ve experimentally tested the first set of compounds that were selected from the virtual screening?

Deliverable:

Add to your deliverable document from Task 1 a) a screenshot of your UMAP chemical space, b) the library you have chosen to screen and a short reason why you chose this library. Then email this document to the facilitator by the end of the breakout session.
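Question 3 can be made concrete with a simple coverage measure: for each library compound, find its most similar training compound, then average those nearest-neighbour similarities. The sketch below uses Tanimoto similarity on sets of integer feature identifiers as stand-ins for molecular fingerprints. The feature sets and library names are invented for illustration; in practice the fingerprints would come from a cheminformatics toolkit, and this measure is only one possible complement to the UMAP plot.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_nearest_similarity(library, training):
    """Average, over library compounds, of the similarity to the closest training compound."""
    nearest = [max(tanimoto(mol, ref) for ref in training) for mol in library]
    return sum(nearest) / len(nearest)


# Toy fingerprints (sets of "on" features); not real molecules.
training = [{1, 2, 3, 4}, {2, 3, 5}]
library_a = [{1, 2, 3}, {2, 3, 5, 6}]
library_b = [{7, 8, 9}, {1, 8, 9}]

score_a = mean_nearest_similarity(library_a, training)
score_b = mean_nearest_similarity(library_b, training)
print(f"library A: {score_a:.2f}, library B: {score_b:.2f}")
```

A higher score suggests that the library sits closer to the training data, which is where the model's predictions are more likely to be reliable.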

Breakout session day 2

A virtual screening cascade allows us to mimic in the computer some of the experimental steps we must carry out to identify new drug leads. By filtering out molecules with low predicted activity or undesired side effects, we can lower the cost and time needed to find new drug candidates. Ideally, we can build virtual screening cascades based on our own data, but for many assays we do not have readily available experimental data. In these situations, we can leverage models developed by third parties and apply them to our problem.

A. baumannii activity prediction

In this activity, we will replicate the work described in Liu et al., 2023, where the authors build an ML model to identify novel A. baumannii inhibitors and use it to filter the Drug Repurposing Hub.

During the skills development session, we did a deep dive into ML model building using the A. baumannii model as an example. Now, you have to download the list of compounds available in the Drug Repurposing Hub and continue the "virtual screening" in a way similar to the original authors' work. To that end, we suggest running predictions for A. baumannii activity and a few accessory models available through Ersilia to select the best candidates. In short, the steps to follow are:

  1. Download the Drug Repurposing Hub data for your group from this link.

  2. Look at the pre-selected Ersilia Model Hub models available online in our platform.

  3. Select which models would be relevant to your exercise and why, and take notes on model interpretation, expected results, priority level etc.

  4. Run predictions for the Drug Repurposing Hub molecules using the selected models.

  5. Select the best molecule candidates based on your defined filters of activity, ADME properties and other considerations.

To simplify the exercise, we have prepared 5 subsets of data from the Drug Repurposing Hub. Each group should use only its assigned dataset.

Task 1: Model selection

To keep the exercise manageable, please limit your screening to the following models:

  • A. baumannii Activity: eos3804

  • General Antibiotic Activity: eos4e40

  • Cardiotoxicity: eos43at

  • Synthetic Accessibility: eos9ei3

  • ADME properties: eos7d58

  • Natural Product Likeness: eos9yui

For each model, think about the following questions:

  • What type of model is it (classification or regression)?

  • What is the training dataset? (refer to the original publication listed above)

  • What is the interpretation of the model outcome?

  • What cut-off, if any, should we use for that particular model?

In addition, think about the following concepts:

  • Does the outcome of the model make sense? If it does not make sense, perhaps we have the wrong interpretation of the model output.

  • Is the cut-off I have selected too stringent (i.e., am I losing too many molecules, and should I be more permissive)?

  • Is this model very relevant for the current dataset (e.g., is malaria activity as important as natural product likeness)?
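To judge whether a cut-off is too stringent, it can help to count how many molecules survive at several candidate thresholds before committing to one. The sketch below assumes a model that outputs a probability-like score between 0 and 1 where higher is better. The scores and thresholds are invented; the actual output type and direction of each Ersilia model must be checked in its documentation.

```python
def survivors_by_cutoff(scores, cutoffs):
    """Map each cut-off to the number of molecules scoring at or above it."""
    return {cutoff: sum(1 for s in scores if s >= cutoff) for cutoff in cutoffs}


# Hypothetical predicted scores for seven molecules.
scores = [0.05, 0.12, 0.33, 0.48, 0.51, 0.76, 0.91]

counts = survivors_by_cutoff(scores, [0.3, 0.5, 0.9])
for cutoff, n in counts.items():
    print(f"cut-off {cutoff}: {n} of {len(scores)} molecules kept")
```

For models where a lower value is better (for example, a toxicity score), the comparison would need to be reversed.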

Deliverable: fill in the following Excel table and send it to gemma[at]ersilia.io

Model information can be found in the metadata of the model available through Ersilia's GitHub repository, but it's best to read the original publication. Publications are available in this folder with highlighted sections to facilitate model understanding.

Task 2: Molecule prioritization

Next, let's use the models we have discussed to run some predictions!

  1. Download the compound library corresponding to your group from here.

  2. Go to the Ersilia GUI and run evaluations.

  3. Download all CSV files into a working directory. In case a model evaluation fails, feel free to download precalculations from here.

  4. Use this simple app to merge your CSV files or, alternatively, use Excel to concatenate the CSV files.

  5. Open the downloaded Excel file and use the Legends tab to learn more about each column.

  6. Select up to 5 compounds, based on bioactivity, ADME and toxicity profiles.

  7. Make a short slide deck. Use this template:

    • Model relevance and why: did you use all models? Which ones or which columns from each model do you consider more relevant?

    • Overview of the screening results. Were the results as expected?

    • Selected compounds and rationale. Which compounds did you choose, and why? Did you identify an optimal compound? What would you do next?

  8. Present! Send your slides to miquel@ersilia.io. Name your file: ai_workshop_green.pptx, ai_workshop_yellow.pptx, etc.

Deliverable: list of top 5 candidates
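For groups who prefer scripting to Excel, the merge and selection steps above could be sketched as follows using only the Python standard library. The CSV contents, the column names (`smiles`, `activity_score`, `cardiotox_score`) and the thresholds are hypothetical; the files downloaded from the Ersilia GUI will have different columns, and the sketch assumes every molecule appears in every file.

```python
import csv
import io

# Hypothetical CSV contents standing in for two downloaded model outputs.
activity_csv = "smiles,activity_score\nCCO,0.82\nCCN,0.10\nc1ccccc1O,0.67\n"
toxicity_csv = "smiles,cardiotox_score\nCCO,0.20\nc1ccccc1O,0.90\nCCN,0.05\n"


def merge_on_smiles(*csv_texts):
    """Combine rows from several CSV texts into one row per SMILES string."""
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            merged.setdefault(row["smiles"], {}).update(row)
    return list(merged.values())


def prioritize(rows, min_activity, max_tox, top_n=5):
    """Keep rows passing both filters, ranked by decreasing activity score."""
    kept = [
        r for r in rows
        if float(r["activity_score"]) >= min_activity
        and float(r["cardiotox_score"]) <= max_tox
    ]
    kept.sort(key=lambda r: float(r["activity_score"]), reverse=True)
    return kept[:top_n]


table = merge_on_smiles(activity_csv, toxicity_csv)
top = prioritize(table, min_activity=0.5, max_tox=0.5)
for row in top:
    print(row["smiles"], row["activity_score"], row["cardiotox_score"])
```

Merging on the SMILES string only works if all files write each molecule with exactly the same string, which is worth checking before trusting the merged table.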

Feel free to use these tools:

Breakout session day 3

Sampling the chemical space is a fundamental task in computational drug discovery. Recently, generative AI methods have significantly expanded our sampling capabilities. As a result, traditionally laborious steps, such as hit-to-lead optimization, can now be supported by computational tools that generate new chemical matter as starting points for further analysis. In this session, we will experiment with some of these tools to sample the chemical space around a seed molecule.

Step 1: Sample the chemical space around a seed molecule

You will be presented with four hits obtained from an experimental screening against Burkholderia cenocepacia (growth inhibition assay). These will serve as your starting points. As a team, agree on one hit (seed) molecule for further analysis. There is no right or wrong answer.

Next, you will have the opportunity to experiment with different “samplers” from the Ersilia Model Hub. Read the model descriptions and discuss the following:

  • Some samplers perform a similarity search against a large chemical library. Can you identify these samplers? Do you expect them to give deterministic results?

  • Other samplers generate genuinely new molecules. Of these, two are fragment-based and two are deep-learning (AI) based. Can you classify them?

  • What are the expected advantages and disadvantages of each type of model?

  • Some samplers may not yield results. What might be the reason for this?

Step 2: Create a wishlist of molecules

The best way to understand the behavior of sampling methods is to run them and explore the results, both visually and quantitatively. Feel free to sample compounds multiple times, trying different methods. There are filtering options in the app, as well as a simple display of the molecular structures.

Ultimately, you should create a wishlist of 100-500 molecules, ideally obtained using multiple models. To create this wishlist, consider the following questions:

  • What is a reasonable range of Tanimoto similarities to the seed molecule?

  • Of the auxiliary properties (e.g., molecular weight, logP, and QED), which is the most relevant at this stage?

Pro tip: open an Excel spreadsheet and copy-paste your molecules of interest into it. Once you are satisfied with them, copy-paste them into the relevant box in the app.
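If you collect sampler outputs outside the app, a similarity window and duplicate removal can be applied in a few lines. The sketch below assumes each candidate already comes with a Tanimoto similarity to the seed; the molecules, similarity values and the 0.3 to 0.9 window are illustrative only, not a recommendation.

```python
def build_wishlist(candidates, min_sim, max_sim):
    """Keep unique SMILES whose similarity to the seed lies within [min_sim, max_sim]."""
    seen = set()
    wishlist = []
    for smiles, similarity in candidates:
        if smiles in seen:
            continue  # already on the wishlist
        if min_sim <= similarity <= max_sim:
            seen.add(smiles)
            wishlist.append(smiles)
    return wishlist


# Hypothetical (SMILES, similarity to seed) pairs pooled from several samplers.
candidates = [
    ("CCO", 1.00),      # identical to the seed
    ("CCCO", 0.62),
    ("CCCO", 0.62),     # same molecule returned by a second sampler
    ("CCCCN", 0.35),
    ("c1ccccc1", 0.08),
]

wishlist = build_wishlist(candidates, min_sim=0.3, max_sim=0.9)
print(wishlist)
```

Duplicates are detected by exact string match, so the same molecule written as two different SMILES strings would not be caught without canonicalising the strings first.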

Step 3: Predict properties of your molecule wishlist

The Ersilia Model Hub contains a variety of activity and property prediction models that can be used to further assess your wishlist. In this case, we have selected three activity prediction models (eos4e40, eos5xng, and eos9f6t) and a multi-output ADMET model (eos7d58). Discuss the following with your team:

  • Which activity prediction model is best suited for this task?

  • Which 5 ADMET properties would you prioritize? There is no right or wrong answer.

  • What is more important at this stage: activity or ADMET properties?

  • ADMET values will be given in percentiles. How do you interpret these values?
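One common reading of a percentile is the share of a reference set whose value is at or below the molecule's value. The sketch below assumes that definition and uses a made-up reference distribution; which reference set the ADMET model actually uses, and whether a high percentile is desirable for a given property, should be checked in the model documentation.

```python
from bisect import bisect_right


def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical reference values for one property (for example, logP).
reference_values = [0.5, 1.2, 1.9, 2.4, 3.1, 3.8, 4.4, 5.0]

rank = percentile_rank(3.1, reference_values)
print(f"percentile: {rank:.1f}")
```

Under this definition, a percentile near 50 means the molecule is typical of the reference set for that property, while values near 0 or 100 mark it as unusually low or high.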

Once you have made your decision, you can calculate your desired endpoints for all the molecules in your wishlist.

Explore results on two levels:

  1. General level: Assess whether wishlist molecules have high predicted activities (as desirable) and acceptable ADMET properties.

  2. Individual level: Select up to 10 molecules and discuss whether they are good leads.

Final step: be ready to discuss!

We will discuss our selection criteria together. Be ready to answer the following questions:

  • What seed molecule did you choose and why?

  • Which sampler models did you find most useful?

  • What was your rationale for assembling your wishlist?

  • Which lead molecules did you select and why?
