Breakout: the Ersilia Model Hub
For this activity, we will put ourselves in the shoes of a scientist who has access to a library of compounds for which only the structures (SMILES) are available. She/he/they want to identify a few hits for an anti-infective drug, but only have the capacity to test 10 molecules in the first round.
The goals of this session are:
To learn how to use the Ersilia Model Hub as a Python package.
To learn how to analyse the predictions we obtain from an open-source ML model.
To understand how we can perform virtual screening of molecular libraries.
The material for this activity includes:
Introductory slides
An exercise notebook (Google Colab)
The Medicines for Malaria Venture (MMV) Malaria Box
A virtual screening cascade allows us to mimic in the computer some of the experimental steps we must do to identify new drug leads. By filtering out molecules with predicted low activities, or undesired side effects, we can lower the cost and time to find new drug candidates.
Ideally, we can build virtual screening cascades based on our own data, but for many assays we do not have readily available experimental data. In these situations, we can leverage models developed by third parties and apply them to our problem.
Virtual screening cascades are not meant to substitute experimental testing, but to act as a decision-making support tool.
The Ersilia Model Hub is a repository of pre-trained, ready-to-use, open-source ML models for drug discovery. It is constantly growing with models collected from the literature and models developed by Ersilia's team:
The Ersilia Model Hub is licensed under a GPLv3 License. Each model is licensed according to its original authors. Please check the restrictions before using them, particularly for commercial ventures.
In this activity, we will use the MMV Malaria Box as an exemplar molecular library that we want to sort based on different bioactivity profiles.
We will use an antimalarial activity prediction model (eos2gth, also known as maip-malaria-surrogate) as an example of how to fetch and use models from the Ersilia Model Hub. This will serve as the starting point for the rest of the breakout activity.
More information is shared on the slides, but, in short, the Ersilia Model Hub can be used as a Python package in a Jupyter notebook:
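A minimal sketch of this workflow is shown below, assuming the ErsiliaModel Python API and a placeholder input file name; exact method signatures may differ slightly between Ersilia versions, so refer to the Hub documentation for the canonical usage:

```python
# Install the Ersilia Model Hub in the Colab / Jupyter session (run in a notebook cell):
#   !pip install ersilia
#   !ersilia fetch eos2gth

from ersilia import ErsiliaModel

# Serve the antimalarial activity model (eos2gth / maip-malaria-surrogate)
model = ErsiliaModel("eos2gth")
model.serve()

# Run predictions on a CSV of SMILES and save the results;
# "mmv_malaria_box.csv" is a placeholder name for the input library
model.run(input="mmv_malaria_box.csv", output="predictions_eos2gth.csv")

# Stop the model server once we are done
model.close()
```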
The output generated by the model is a table with the following columns:
InChIKey | SMILES | score
ALGPHOUNWIZIOQ-UHFFFAOYSA-N | COc1ccccc1CNC(=O)CCn1c(=O)[nH]c2ccsc2c1=O | 6.886159
QFVDKARCPMTZCS-UHFFFAOYSA-N | CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc... | 15.483176
HKNNPGWJKJDXCN-UHFFFAOYSA-N | Cc1ccc(-c2cc3c(SCC(=O)Nc4cc(C(F)(F)F)ccc4Cl)nc... | 27.288107
Each molecule is represented by its InChIKey and SMILES, and the model output is the third column, labeled as "score". A useful first step is to plot the distribution of the model predictions:
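For example, assuming the predictions were written to a CSV file with a score column as in the table above, a quick histogram can be produced with pandas and matplotlib (file and column names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the model output (InChIKey, SMILES, score) produced in the previous step
df = pd.read_csv("predictions_eos2gth.csv")

# Plot the distribution of predicted scores across the library
plt.hist(df["score"], bins=50)
plt.xlabel("Predicted antimalarial score")
plt.ylabel("Number of molecules")
plt.title("Score distribution for the MMV Malaria Box")
plt.show()
```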
There are a few questions we need to ask ourselves in order to interpret this output:
Is the model a classification or a regression?
What were the exact measurements used to train the model (IC50, % Inhibition...)?
What can we consider a good cut-off for the activity?
Is the chemical space of the training data relevant to our problem?
In this case, we need to go back to the documentation of the model and its associated publication to learn more about what the "score" represents. Thanks to the excellent modelling and documentation work of the original authors, we can easily establish the following:
The model is a regression
It was trained on several large datasets of molecules and their associated activity against malaria
The higher the score, the higher the predicted activity against P. falciparum
The threshold depends on the user, but the authors provide several examples to help users interpret their results
We will continue the "virtual screening" of the MMV Malaria Box in small groups, and discuss our findings with the rest of the participants at the end.
Please take into account the following expected timings:
Breakout introduction: 30 min
Running predictions and evaluating each model output: 1h
Selecting best molecules: 15 min
Preparing a short presentation for the rest of the groups (make sure to leave enough time to prepare this presentation; analyse fewer models if need be).
We will use the following tools:
Ersilia Model Hub: we have pre-selected a few models to facilitate the discussion
Google Colab: we will use the implementation of the Ersilia Model Hub in Google Colab as we have exemplified with the antimalarial model
Google Drive: model predictions will be stored in your drive under the DataScience_Workshop/data folder.
The model outputs can be analysed directly in Google Colab (if you are familiar with Python) or simply as an Excel file. If not all participants in the group are familiar with Python, please resort to using Excel.
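For the Python route, a typical pattern is to mount Google Drive in Colab and load the prediction files with pandas; the file name below is a placeholder to adapt to the models your group ran:

```python
import pandas as pd
from google.colab import drive

# Mount Google Drive inside the Colab session
drive.mount("/content/drive")

# Read one of the prediction files saved under the workshop data folder
# (the file name is a placeholder; adjust it to the model you ran)
data_dir = "/content/drive/MyDrive/DataScience_Workshop/data"
preds = pd.read_csv(f"{data_dir}/predictions_eos2gth.csv")
preds.head()
```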
To keep the exercise focused, please restrict your screening to the following models (a sketch for running them in a loop is shown after the list):
Malaria Activity: eos2gth / maip-malaria-surrogate
Tuberculosis Activity: eos46ev / chemtb
Antibiotic Activity: eos4e40 / chemprop-antibiotic
Cardiotoxicity (hERG): eos43at / molgrad-herg
Retrosynthetic Accessibility: eos2r5a / retrosynthetic-accessibility
Aqueous Solubility: eos6oli / soltrannet-aqueous-solubility
Natural Product Likeness: eos9yui / natural-product-likeness
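One possible way to run several of these models over the same library, assuming the same ErsiliaModel API sketched earlier (model identifiers taken from the list above; file names are placeholders):

```python
from ersilia import ErsiliaModel

# Pre-selected models from the list above, keyed by a short label
MODELS = {
    "malaria": "eos2gth",
    "tuberculosis": "eos46ev",
    "antibiotic": "eos4e40",
    "herg": "eos43at",
    "retrosynthesis": "eos2r5a",
    "solubility": "eos6oli",
    "np_likeness": "eos9yui",
}

library = "mmv_malaria_box.csv"  # placeholder path to the SMILES library

for label, model_id in MODELS.items():
    model = ErsiliaModel(model_id)
    model.serve()
    # Each model writes its own CSV so the outputs can be merged later
    model.run(input=library, output=f"predictions_{label}.csv")
    model.close()
```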
For each model, think about the following questions:
What type of model is it (classification or regression)?
What is the training dataset? (refer to the original publication listed above)
What is the interpretation of the model outcome?
What cut-off, if any, should we use for that particular model?
In addition, think about the following concepts:
Does the outcome of the model make sense? (e.g., malaria activity should be predicted as high for most molecules, since this is a library optimized for malaria activity). If it does not make sense, perhaps we have the wrong interpretation of the model output.
Is the cut-off I have selected too stringent (i.e., am I losing too many molecules and should I be more permissive)?
Is this model very relevant for the current dataset (e.g., is malaria activity as important as natural product likeness)?
Use the predicted values to select the 10 molecules that you would take for experimental testing if you had to choose (a sketch for combining the predictions follows this list). To that end, you can think of:
What are the most important activities you want to optimize
What are strict no-go points
What are activities that are easiest to optimize at lead stage
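A pandas sketch of how such a selection could be made is shown below; the column names, cut-off values, and file names are assumptions that your group should replace with the models and thresholds you actually agreed on:

```python
import pandas as pd

# Load per-model prediction files produced earlier (placeholder file and column names)
malaria = pd.read_csv("predictions_malaria.csv").rename(columns={"score": "malaria_score"})
herg = pd.read_csv("predictions_herg.csv").rename(columns={"score": "herg_score"})
solubility = pd.read_csv("predictions_solubility.csv").rename(columns={"score": "logS"})

# Merge on the molecule key so each row carries all predictions for one compound
merged = (
    malaria[["key", "input", "malaria_score"]]
    .merge(herg[["key", "herg_score"]], on="key")
    .merge(solubility[["key", "logS"]], on="key")
)

# Illustrative filters for strict no-go points (cut-offs must be decided by the group),
# then rank by the activity we care most about and keep the top 10 candidates
candidates = merged[(merged["herg_score"] < 0.5) & (merged["logS"] > -5)]
top10 = candidates.sort_values("malaria_score", ascending=False).head(10)
print(top10)
```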
Prepare a short presentation for the rest of the participants. This should cover:
Which models did you choose and why
What selection strategy did you decide on
Which were your selected molecules
Browse models.
Check the in-depth documentation on model usage and model contribution.
Suggest a model to be added.