Breakout: generative models

For this activity, we will continue with the exercise started in Session 2. We have a list of selected hits with potentially high antimalarial activity, but we would like to diversify them and obtain more hit candidates, ideally with even better predicted properties.

Generative models are a subset of unsupervised machine learning models that automatically discover and learn patterns in an input dataset and use those patterns to propose new examples that could plausibly have been part of the original dataset.

There are several types of generative models applied to chemistry. In the Keynote session, we discussed how Ersilia applied a Reinforcement Learning algorithm (REINVENT 2.0) to generate new hits with predicted high activity against P. falciparum and S. aureus.


The materials for this session are:

  • Presentation slides

  • Guide Notebook

  • Selected hits from the MMV Malaria Box

Each group will work with its own list of selected hits, which should have been saved in the Google Drive folder h3d_ersilia_ai_workshop/data/session2.

In this exercise, we will use a similarity search model as a surrogate for generating new molecules. These models search a virtually generated library to find the 100 molecules closest to your input. They are not generative ML models per se, but they rely on a prior step of molecular generation, and the process is much faster and simpler than running a real generative model.
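To make "closest to your input" concrete, here is a minimal sketch of a nearest-neighbor search over fingerprints. It assumes Tanimoto similarity on sets of "on" bits, which is a common choice in cheminformatics; the metric actually used by the servers behind these models is not stated here, and the fingerprints below are toy values rather than real molecules.

```python
import heapq


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbors(query: frozenset, library: dict, k: int = 100) -> list:
    """Return up to k (similarity, name) pairs from the library, best first."""
    scored = ((tanimoto(query, bits), name) for name, bits in library.items())
    return heapq.nlargest(k, scored)


# Toy fingerprints: hypothetical bit indices, not derived from real molecules.
library = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 9}),
    "mol_c": frozenset({7, 8}),
}
query = frozenset({1, 2, 3})

top = nearest_neighbors(query, library, k=2)
for similarity, name in top:
    print(f"{name}: {similarity:.2f}")
```

In the exercise, k is 100 and the library is the full database, so the search is delegated to a server instead of being run locally.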

We will use and compare the results we obtain with two different similarity search models:

  • GDBChEMBL: searches for the 100 nearest neighbors of the input molecule in a database of 166 billion compounds (Bühlmann et al., 2020). It is identified in the Ersilia Model Hub by the code eos4b8j.

  • GDBMedChem: searches for the 100 nearest neighbors of the input molecule in a database of 10 million compounds curated from GDBChEMBL, with reduced complexity and higher synthetic accessibility (Awale et al., 2019). It is identified in the Ersilia Model Hub by the code eos7jlv.

The similarity search queries the databases through an online server. Keep this in mind before running similarity searches with IP-sensitive molecules, since the input structures leave your machine.
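The Guide Notebook is the reference for running these models. As a rough sketch only, the snippet below prepares a one-molecule input file and builds the command sequence for one model. It assumes the Ersilia command-line tool is installed and exposes the `fetch`, `serve`, `run -i ... -o ...` and `close` subcommands, and that the input is a CSV with a `smiles` column; the file names and the example SMILES (ethanol) are placeholders, not workshop data.

```python
import csv
import subprocess
from pathlib import Path

# Model identifiers as listed above for the Ersilia Model Hub.
MODEL_IDS = {"GDBChEMBL": "eos4b8j", "GDBMedChem": "eos7jlv"}


def write_input(smiles: str, path: Path) -> Path:
    """Write a single-molecule CSV input (assumed format: one 'smiles' column)."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles"])
        writer.writerow([smiles])
    return path


def ersilia_commands(model_id: str, input_csv, output_csv) -> list:
    """Build the assumed CLI sequence: fetch, serve, run, close."""
    return [
        ["ersilia", "fetch", model_id],
        ["ersilia", "serve", model_id],
        ["ersilia", "run", "-i", str(input_csv), "-o", str(output_csv)],
        ["ersilia", "close"],
    ]


def run_similarity_search(model_id: str, input_csv, output_csv) -> None:
    """Execute the commands in order; raises if any step fails."""
    for command in ersilia_commands(model_id, input_csv, output_csv):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so nothing is sent to the server.
for name, model_id in MODEL_IDS.items():
    for command in ersilia_commands(model_id, "top_hit.csv", f"{name}_hits.csv"):
        print(" ".join(command))
```

To actually run a search, call `write_input("CCO", Path("top_hit.csv"))` with your own top hit in place of the placeholder, then `run_similarity_search(...)` for each model. Remember that this sends the input molecule to an online server.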


Find the 100 compounds from each database that are most similar to the top hit of your selection of molecules from the MMV Malaria Box, and try to answer the following questions:

  • Are the hits obtained from each database different?

  • Are hits from GDBMedChem synthetically more accessible than hits from GDBChEMBL?

  • Do we have any molecule with predicted higher antimalarial potential than the original hit?

  • Which molecules would you select for further screening?

  • Are there any ADMET considerations you are taking into account for the selection, after what we reviewed in Session 3?

To answer some of these questions, use the tools we learned in Sessions 2 and 3, such as models predicting retrosynthetic accessibility, malaria activity and ADMET properties.
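Once you have both result lists and some predicted scores, the first and third questions reduce to simple set and filter operations. The sketch below is one possible way to do it; the SMILES strings and scores are made-up placeholders, and the helper names are hypothetical. Note that the same molecule can be written as different SMILES strings, so canonicalize them (for example with RDKit) before comparing.

```python
def overlap(hits_a, hits_b) -> dict:
    """Summarize how two hit lists differ, assuming canonical SMILES strings."""
    a, b = set(hits_a), set(hits_b)
    shared, union = a & b, a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


def better_than_reference(scores: dict, reference_score: float) -> list:
    """Return (smiles, score) pairs scoring above the reference, best first."""
    better = [(smi, s) for smi, s in scores.items() if s > reference_score]
    return sorted(better, key=lambda pair: pair[1], reverse=True)


# Placeholder data: not real search results or real predictions.
gdbchembl_hits = ["CCO", "CCN", "CCC"]
gdbmedchem_hits = ["CCN", "CCCl"]
predicted_activity = {"CCO": 0.2, "CCN": 0.8, "CCC": 0.6}
original_hit_score = 0.5

summary = overlap(gdbchembl_hits, gdbmedchem_hits)
candidates = better_than_reference(predicted_activity, original_hit_score)
print(summary)
print(candidates)
```

The same filter can be applied to synthetic accessibility or ADMET predictions by swapping the score dictionary and, where lower values are better, inverting the comparison.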

Also remember you can use RDKit to draw molecules from SMILES.
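As a reminder, a minimal sketch of drawing a grid of molecules with RDKit could look like this. It assumes RDKit is installed in your environment; `legend_for` and `draw_grid` are hypothetical helper names introduced for illustration, and RDKit is imported inside the function so that the rest of the code still loads where it is missing.

```python
def legend_for(name: str, score: float) -> str:
    """Build a short caption such as 'hit_1 (0.80)' for one molecule."""
    return f"{name} ({score:.2f})"


def draw_grid(smiles_list, legends=None, mols_per_row=4):
    """Draw the valid SMILES in a grid; invalid SMILES are skipped."""
    # Imported here so this module can be loaded without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import Draw

    labels = legends if legends is not None else list(smiles_list)
    mols, kept_labels = [], []
    for smiles, label in zip(smiles_list, labels):
        mol = Chem.MolFromSmiles(smiles)  # returns None if parsing fails
        if mol is not None:
            mols.append(mol)
            kept_labels.append(label)
    return Draw.MolsToGridImage(
        mols, molsPerRow=mols_per_row, subImgSize=(200, 200), legends=kept_labels
    )
```

In a notebook, ending a cell with `draw_grid(["CCO", "CCN"], legends=[legend_for("hit_1", 0.8), legend_for("hit_2", 0.6)])` should display the image inline; the SMILES and scores here are placeholders.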
