Breakout: the Ersilia Model Hub

For this activity, we will put ourselves in the shoes of a scientist who has access to a library of compounds for which only the structure (SMILES) is available. They want to identify a few hits for an anti-infective drug but only have the capacity to test 10 molecules in the first round.

The goals of this session are:

To learn how to use the Ersilia Model Hub as a Python package.
To learn how to analyse the predictions we obtain from an open source ML model.
To understand how we can perform virtual screening of molecular libraries.

Material

The material for this activity includes:

Introductory presentation
Exercise Notebook
Medicines for Malaria Venture (MMV) dataset

Introduction

A virtual screening cascade allows us to mimic in the computer some of the experimental steps needed to identify new drug leads. By filtering out molecules with low predicted activity or undesired side effects, we can lower the cost and time required to find new drug candidates.

Ideally, we would build virtual screening cascades based on our own data, but for many assays we do not have readily available experimental data. In these situations, we can leverage models developed by third parties and apply them to our problem.

Virtual screening cascades are not meant to substitute experimental testing, but to act as a decision-making support tool.

The Ersilia Model Hub

The Ersilia Model Hub is a repository of pre-trained, ready-to-use open source ML models for drug discovery. It is constantly growing with models collected from the literature and models developed by Ersilia's team:

Browse available models.
Check in-depth documentation on model usage and model contribution.
Suggest new models to be added.

The Ersilia Model Hub is licensed under a GPLv3 License. Each model is licensed according to its original authors. Please check the restrictions before using them, particularly for commercial ventures.

MMV Malaria Box

In this activity, we will use the MMV Malaria Box as an exemplar molecular library that we want to sort based on different bioactivity profiles.

The MMV Malaria Box is an already optimised small library, so we expect certain requirements (e.g., synthetic availability) to be fulfilled by most molecules.

Malaria Activity Prediction

We will use an antimalarial activity prediction model (eos2gth, or maip-malaria-surrogate) as an example of how to fetch and use models from the Ersilia Model Hub. This will serve as the starting point for the rest of the breakout activity.

More information is shared on the slides but, in short, the Ersilia Model Hub can be used as a Python package in a Jupyter notebook:

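from ersilia import ErsiliaModel

# Download the model (the leading "!" runs a shell command from a notebook cell)
!ersilia fetch eos2gth

# `smiles` is assumed to hold the input molecules as SMILES strings
model = ErsiliaModel("eos2gth")
model.serve()
output = model.predict(input=smiles, output="pandas")
model.close()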

The output generated by the model is a table with the following columns:

| key | input | score |
| --- | --- | --- |
| ALGPHOUNWIZIOQ-UHFFFAOYSA-N | COc1ccccc1CNC(=O)CCn1c(=O)[nH]c2ccsc2c1=O | 6.886159 |
| QFVDKARCPMTZCS-UHFFFAOYSA-N | CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc. | 15.483176 |
| HKNNPGWJKJDXCN-UHFFFAOYSA-N | Cc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)nc | 27.288107 |

Each molecule is represented by its InChIKey and SMILES, and the model output is in the third column, labeled "score". A useful first step is to plot the distribution of the model predictions.
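As a rough illustration of that first look at the predictions, the sketch below bins scores and prints a text histogram using only the standard library. The score values and the bin width of 10 are made up for the example; in the notebook, the values would come from the "score" column of the model output, and a plotting library would normally draw the figure.

```python
# Hypothetical scores standing in for the "score" column (made-up values).
scores = [6.89, 15.48, 27.29, 3.10, 42.75, 55.20, 61.03, 18.40, 33.90, 48.12]

BIN_WIDTH = 10.0  # assumed bin width, chosen only for illustration


def histogram(values, bin_width):
    """Count how many values fall in each [start, start + bin_width) bin."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


bins = histogram(scores, BIN_WIDTH)

# Print one row per bin, with one "#" per molecule in that bin.
for start, count in bins.items():
    print(f"{start:5.1f} - {start + BIN_WIDTH:5.1f} | {'#' * count}")
```

A distribution with most molecules piled up at one end, or with two clear peaks, is usually the first hint about where a sensible cut-off might lie.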

There are a few questions we need to ask ourselves in order to interpret this output:

Is the model a classification or a regression?
What were the exact measurements used to train the model (IC50, % Inhibition...)?
What can we consider a good cut-off for the activity?
Is the chemical space of the training data relevant to our problem?

In this case, we need to go back to the original publication of the model and its associated documentation to learn more about what the "score" represents. Thanks to excellent modelling and documentation work, we can easily establish that:

The model is a regression.
It was trained on several large datasets of molecules and their associated activity against malaria.
The higher the score, the higher the activity against P. falciparum.
The threshold depends on the user, but the authors provide several examples to help users understand their results.

Guidance

We will continue the "virtual screening" of the MMV Malaria Box in small groups, and discuss our findings with the rest of the participants at the end.

Please take into account the following expected timings:

Breakout introduction: 30 min
Running predictions and evaluating each model output: 1h
Selecting best molecules: 15 min
Preparing a short presentation for the rest of the groups (make sure to leave enough time to prepare this presentation; analyse fewer models if need be).

Tools

We will use the following tools:

Ersilia Model Hub: we have pre-selected a few models to facilitate the discussion
Google Colab: we will use the implementation of the Ersilia Model Hub in Google Colab as we have exemplified with the antimalarial model
Google Drive: model predictions will be stored in your drive under the DataScience_Workshop/data folder.

The model outputs can be analysed directly in Google Colab (if you are familiar with Python) or simply as an Excel file. If not all participants in the group are familiar with Python, please use Excel.
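One way to make several model outputs easy to inspect, in Python or in Excel, is to merge them into a single table with one row per molecule. The sketch below assumes each model output has been reduced to a mapping from InChIKey to score; the keys, values, and column names are hypothetical, and the CSV is written to an in-memory buffer rather than to the Drive folder.

```python
import csv
import io

# Hypothetical per-model outputs keyed by InChIKey (placeholder keys, made-up values).
malaria = {"KEY-A": 27.3, "KEY-B": 6.9, "KEY-C": 15.5}
herg = {"KEY-A": 0.12, "KEY-B": 0.81, "KEY-C": 0.35}


def merge_by_key(named_outputs):
    """Combine {model_name: {key: score}} into one row per molecule."""
    keys = sorted(set().union(*(scores.keys() for scores in named_outputs.values())))
    rows = []
    for key in keys:
        row = {"key": key}
        for name, scores in named_outputs.items():
            row[name] = scores.get(key)  # None if a model has no value for this molecule
        rows.append(row)
    return rows


rows = merge_by_key({"malaria": malaria, "herg": herg})

# Write the merged table as CSV text, a format Excel opens directly.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["key", "malaria", "herg"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To save a real file, the buffer can be swapped for a file opened in write mode at a location of your choice.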

Models

To keep the exercise manageable, please limit your screening to the following models:

Malaria Activity: eos2gth / maip-malaria-surrogate
Tuberculosis Activity: eos46ev / chemtb
Antibiotic Activity: eos4e40 / chemprop-antibiotic
Cardiotoxicity (hERG): eos43at / molgrad-herg
Retrosynthetic Accessibility: eos2r5a / retrosynthetic-accessibility
Aqueous Solubility: eos6oli / soltrannet-aqueous-solubility
Natural Product Likeness: eos9yui / natural-product-likeness

It is best to select only a few models but really understand how to use them rather than running predictions for all the models but not knowing how to interpret the outcomes.

To speed up model predictions, you can first leave the notebook running to fetch all the models, and then serve and run predictions only for the ones that are relevant to your group.
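The fetching step described above could be scripted roughly as follows. The sketch only relies on the `ersilia fetch` command shown earlier and on the identifiers listed in this section; the `RUN_FETCH` switch and the `fetch_commands` helper are illustrative names, not part of Ersilia.

```python
import subprocess

# Model identifiers taken from the list of pre-selected models.
MODEL_IDS = ["eos2gth", "eos46ev", "eos4e40", "eos43at", "eos2r5a", "eos6oli", "eos9yui"]

# Hypothetical switch: set to True in the notebook to actually download the models.
RUN_FETCH = False


def fetch_commands(model_ids):
    """Build one `ersilia fetch <id>` command per model."""
    return [["ersilia", "fetch", model_id] for model_id in model_ids]


commands = fetch_commands(MODEL_IDS)

for command in commands:
    if RUN_FETCH:
        # Requires the Ersilia CLI to be installed; stops at the first failure.
        subprocess.run(command, check=True)
    else:
        # Dry run: only show what would be executed.
        print(" ".join(command))
```

Keeping the dry run as the default makes it easy to check the list of models before committing to a long download.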

Steps

Step 1: Model prediction & interpretation

For each model, think about the following questions:

What type of model is it (classification or regression)?
What is the training dataset? (refer to the original publication listed above)
What is the interpretation of the model outcome?
What cut-off, if any, should we use for that particular model?

In addition, think about the following concepts:

Does the outcome of the model make sense (e.g., malaria activity is predicted to be high for most molecules, since the library is optimized for malaria activity)? If it does not make sense, perhaps we have the wrong interpretation of the model output.
Is the cut-off I have selected too stringent (i.e., am I losing too many molecules, and should I be more permissive)?
Is this model very relevant for the current dataset (e.g., is natural product likeness as important as malaria activity)?
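To judge whether a cut-off is too stringent, it can help to count how many molecules survive at a few candidate values. This is a minimal sketch assuming a regression score where higher is better, as in the malaria model; the scores and candidate cut-offs are made up.

```python
# Hypothetical regression scores for a small library (made-up values).
scores = [6.9, 15.5, 27.3, 3.1, 42.8, 55.2, 61.0, 18.4, 33.9, 48.1]

# Candidate cut-offs to compare (assumed values, for illustration only).
CUTOFFS = (10, 30, 50)


def count_retained(values, cutoff):
    """Number of molecules whose score is at or above the cut-off."""
    return sum(1 for value in values if value >= cutoff)


retained = {cutoff: count_retained(scores, cutoff) for cutoff in CUTOFFS}

for cutoff, count in retained.items():
    share = 100 * count / len(scores)
    print(f"cut-off {cutoff:>3}: {count:>2} of {len(scores)} molecules kept ({share:.0f}%)")
```

If a cut-off leaves fewer molecules than the number you want to test, or almost all of them, it is probably worth revisiting.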

Step 2: molecule selection

Use the predicted values to select the 10 molecules that you would take for experimental testing if you had to choose. To that end, you can think of:

What are the most important activities you want to optimize?
What are the strict no-go points?
Which activities are easiest to optimize at the lead stage?
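A selection strategy along these lines can be sketched as a no-go filter followed by a ranking. The example assumes a table with a malaria score (higher is better) and a hERG score where higher means higher predicted risk; that direction, the 0.5 limit, and all values are assumptions to be checked against each model's documentation.

```python
# Hypothetical predictions, one row per molecule (placeholder keys, made-up values).
molecules = [
    {"key": "MOL-01", "malaria": 61.0, "herg": 0.10},
    {"key": "MOL-02", "malaria": 55.2, "herg": 0.80},
    {"key": "MOL-03", "malaria": 48.1, "herg": 0.30},
    {"key": "MOL-04", "malaria": 42.8, "herg": 0.45},
    {"key": "MOL-05", "malaria": 33.9, "herg": 0.20},
    {"key": "MOL-06", "malaria": 27.3, "herg": 0.90},
]

HERG_MAX = 0.5  # assumed no-go limit for predicted hERG risk
N_SELECT = 3    # 10 in the actual exercise; 3 here to keep the example small


def select_hits(rows, n, herg_max):
    """Drop molecules above the no-go limit, then keep the n most active ones."""
    allowed = [row for row in rows if row["herg"] <= herg_max]
    ranked = sorted(allowed, key=lambda row: row["malaria"], reverse=True)
    return ranked[:n]


selected = select_hits(molecules, N_SELECT, HERG_MAX)

for row in selected:
    print(row["key"], row["malaria"], row["herg"])
```

Treating toxicity as a hard filter and activity as the ranking criterion is only one possible design; a weighted score across several models is an equally valid choice to discuss in the group.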

Step 3: prepare the presentation

Prepare a short presentation for the rest of the participants. This should cover:

Which models did you choose, and why?
What selection strategy did you decide on?
Which molecules did you select?

