Sampling the chemical space

We present ChemSampler, a simple tool to sample molecules in a given region of the chemical space.

This page is 👷 work in progress 👷!

ChemSampler acts as a framework that fetches generative AI models from the Ersilia Model Hub and runs iterative rounds of generation to provide the user with a pool of new or sampled existing molecules for downstream filtering using bioactivity predictors and/or docking approaches.

Basic structure

ChemSampler accepts a single molecule (seed_smiles) as input. The seed_smiles is passed to each of the Ersilia Model Hub generative models (samplers) specified in the params.json file (see below). All results of this first round are collated in a single output .csv file, and a random sample of 100 molecules from the generated list is provided to the user for ease of evaluation.

ChemSampler runs as many rounds as specified in the params.json file, or until the maximum number of desired molecules has been reached. In each round, ChemSampler evaluates whether the current input (initially the seed_smiles) is still providing enough good candidates (new unique molecules) or whether a new input_smiles needs to be selected from the pool of generated molecules. Newly generated molecules are ranked by similarity search using the molecular descriptors specified by the user. If the user is interested in small modifications of the seed_smiles, 2D descriptors should be prioritized. Conversely, for scaffold hopping exercises, 3D and pharmacophore descriptors should be prioritized. The molecule most similar to the seed_smiles according to the selected descriptors is used as the input_smiles for the next round.

[Figure: schema of the ChemSampler workflow]
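The round logic described above can also be sketched in code. This is a minimal illustration and not ChemSampler's actual implementation: `run_rounds`, `toy_sampler` and `toy_similarity` are hypothetical names, the toy sampler only appends characters to a string (it is not chemistry-aware), and skipping molecules that were already used as input is an assumption added so the sketch cannot loop on the same input. The per-round time budget is not modeled.

```python
from typing import Callable, Dict, List, Set, Tuple

Sampler = Callable[[str], List[str]]       # input SMILES -> generated SMILES
Similarity = Callable[[str, str], float]   # (seed, candidate) -> score, higher = closer


def run_rounds(
    seed_smiles: str,
    samplers: List[Sampler],
    similarity: Similarity,
    num_samples: int,
    max_rounds: int,
    saturation_number: int,
) -> Tuple[Set[str], List[Dict]]:
    """Hypothetical sketch of the iterative sampling loop."""
    pool: Set[str] = set()            # all unique molecules generated so far
    used_inputs = {seed_smiles}       # assumption: never reuse an input
    input_smiles = seed_smiles
    history: List[Dict] = []

    for round_idx in range(max_rounds):
        # Pass the current input to every sampler and collate the results.
        generated = [smi for sampler in samplers for smi in sampler(input_smiles)]
        new = set(generated) - pool - {seed_smiles}
        pool |= new
        history.append({
            "round": round_idx,
            "input_smiles": input_smiles,
            "generated": len(generated),
            "new": len(new),
        })

        # Stop once enough candidates exist (the pool may overshoot num_samples).
        if len(pool) >= num_samples:
            break

        # Too few new molecules: switch to the pooled molecule closest to the seed.
        if len(new) < saturation_number:
            candidates = pool - used_inputs
            if not candidates:
                break
            # sorted() makes tie-breaking deterministic.
            input_smiles = max(sorted(candidates),
                               key=lambda smi: similarity(seed_smiles, smi))
            used_inputs.add(input_smiles)

    return pool, history


def toy_sampler(smiles: str) -> List[str]:
    # Stand-in for a real generative model.
    return [smiles + "C", smiles + "O"]


def toy_similarity(seed: str, candidate: str) -> float:
    # Stand-in for a descriptor-based similarity: closer length = more similar.
    return -abs(len(seed) - len(candidate))


pool, history = run_rounds("CC", [toy_sampler], toy_similarity,
                           num_samples=5, max_rounds=10, saturation_number=3)
for entry in history:
    print(entry)
```

With `saturation_number=3` and a sampler that yields only two molecules per call, every round is considered saturated, so the input changes each round while similarity is always measured against the original seed.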

The parameters file

The parameters file is a .json file that allows the user to specify:

{
    "seed_smiles": <smiles of origin used for the generation of new candidates>,
    "keep_smiles": <smiles of a substructure to maintain>,
    "avoid_smiles": <smiles of a substructure to discard in the generative process>,
    "samplers": [ <list of generative models (ersilia identifier code)> ],
    "descriptors": [ <list of molecular descriptors (ersilia identifier code)> ],
    "num_samples": <maximum number of candidates to generate>,
    "max_rounds": <maximum number of rounds>,
    "time_budget_sec": <maximum time spent per round>,
    "saturation_number": <minimum number of new candidates obtained from the input smiles to continue using the same input in the next round>,
    "output_folder": <folder where results will be stored>
}

* If keep_smiles or avoid_smiles is specified, all molecules that fail either criterion are removed, which can lead to a low number of generated candidates.
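As an illustration, a filled-in parameters file could be produced as follows. All values are examples only: the seed (aspirin), the two substructure SMILES, the numeric settings and the folder name are assumptions, while the sampler and descriptor identifiers are taken from the tables on this page. The page does not state how to leave keep_smiles or avoid_smiles unset, so the sketch simply fills both. `check_params` is a hypothetical helper, not part of ChemSampler.

```python
import json

# Illustrative values only; adapt them to your own project.
params = {
    "seed_smiles": "CC(=O)Oc1ccccc1C(=O)O",   # aspirin, used as an example seed
    "keep_smiles": "c1ccccc1",                # example: keep the benzene ring
    "avoid_smiles": "[N+](=O)[O-]",           # example: discard nitro groups
    "samplers": ["eos8fma", "eos1noy"],       # stoned-sampler, chembl-sampler
    "descriptors": ["eos4wt0"],               # morgan-fps
    "num_samples": 1000,
    "max_rounds": 5,
    "time_budget_sec": 600,
    "saturation_number": 50,
    "output_folder": "results",
}

REQUIRED = ("seed_smiles", "samplers", "descriptors")
POSITIVE_INTS = ("num_samples", "max_rounds", "time_budget_sec", "saturation_number")


def check_params(p: dict) -> list:
    """Hypothetical sanity check; returns a list of problems (empty if none)."""
    problems = []
    for key in REQUIRED:
        if not p.get(key):
            problems.append(f"missing or empty: {key}")
    for key in ("samplers", "descriptors"):
        if not isinstance(p.get(key, []), list):
            problems.append(f"{key} must be a list")
    for key in POSITIVE_INTS:
        value = p.get(key)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{key} must be a positive integer")
    return problems


print(check_params(params))
text = json.dumps(params, indent=4)   # content to save as params.json
print(text)
```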

Input

A seed_smiles, a list of samplers and a list of descriptors. All of those must be specified in the parameters file.

Output

  • A list of generated/sampled molecules and their similarity to the seed_smiles. Similarity is calculated using the molecular descriptors and either Euclidean distance or Tanimoto similarity.

  • An info.json file storing information about each round (how many molecules were generated, how many of those were new, which was the input smiles...).

  • A sample of 100 molecules randomly selected to be drawn for the user to evaluate the performance of each round
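The two similarity measures mentioned in the first item can be sketched in plain Python. This is a generic illustration, not ChemSampler's code: the function names and the 8-bit example vectors are made up, and real descriptors would come from the models listed below.

```python
import math
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity for binary vectors: shared bits / bits set in either."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance for continuous descriptors (smaller = more similar)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up 8-bit fingerprints standing in for real descriptors.
seed_fp = [1, 1, 0, 1, 0, 0, 1, 0]
candidates = {
    "A": [1, 1, 0, 1, 0, 0, 0, 0],
    "B": [0, 1, 1, 0, 0, 1, 1, 0],
}

# Rank candidates from most to least similar to the seed.
ranked = sorted(
    ((name, tanimoto(seed_fp, fp)) for name, fp in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
print(euclidean([0.0, 0.0], [3.0, 4.0]))
```

Tanimoto similarity suits binary fingerprints, whereas Euclidean distance applies to continuous descriptors and embeddings; note that the two rank in opposite directions (higher similarity versus lower distance).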

Available Samplers

| EOS ID | Slug | Description |
| --- | --- | --- |
| eos1d7r | small-world-zinc | Small World is an index of chemical space containing more than 230B molecular substructures. Here we use the Small World API to post a query to the SmallWorld server. We sample 100 molecules within a distance of 10 specifically for the ZINC map, not the entire SmallWorld domain. Please check other small-world models available in our hub. |
| eos3kcw | small-world-wuxi | Small World is an index of chemical space containing more than 230B molecular substructures. Here we use the Small World API to post a query to the SmallWorld server. We sample 100 molecules within a distance of 10 specifically for the Wuxi map, not the entire SmallWorld domain. Please check other small-world models available in our hub. |
| eos9ueu | small-world-enamine-real | Small World is an index of chemical space containing more than 230B molecular substructures. Here we use the Small World API to post a query to the SmallWorld server. We sample 100 molecules within a distance of 10 specifically for the Enamine REAL map, not the entire SmallWorld domain. Please check other small-world models available in our hub. |
| eos1noy | chembl-sampler | A simple sampler of the ChEMBL database using their API. It looks for molecules similar to the input molecule and returns a list of 100 molecules by default. This model has been developed by Ersilia and posts queries to an online server. |
| eos2hzy | pubchem-sampler | A simple sampler of the PubChem database using their API. It looks for molecules similar to the input molecule and returns a list of 100 molecules by default. This model has been developed by Ersilia and posts queries to an online server. |
| eos8fma | stoned-sampler | The STONED sampler uses small modifications to molecules represented as SELFIES to perform a search of the chemical space and generate new molecules. The use of string modifications in the SELFIES molecular representation bypasses the need for large amounts of data while maintaining a performance comparable to deep generative models. |
| eos4qda | fasmifra | FasmiFra is a molecular generator based on (Deep)SMILES fragments. The authors use DeepSMILES to ensure the generated molecules are syntactically valid, and by working on string operations they are able to obtain high performance (>340,000 molecules/s). Here, we use 100k compounds from ChEMBL to sample fragments. Only assembled molecules containing one of the fragments of the input molecule are retained. |
| eos9taz | moler-enamine-fragments | MoLeR is a graph-based generative model that combines fragment-based and atom-by-atom generation of new molecules with scaffold-constrained optimization. It does not depend on generation history and therefore MoLeR is able to complete arbitrary scaffolds. The model has been trained on the GuacaMol dataset. Here we sample a fragment library from Enamine. |
| eos633t | moler-enamine-blocks | MoLeR is a graph-based generative model that combines fragment-based and atom-by-atom generation of new molecules with scaffold-constrained optimization. It does not depend on generation history and therefore MoLeR is able to complete arbitrary scaffolds. The model has been trained on the GuacaMol dataset. Here we sample the 300k building blocks library from Enamine. |

Available molecular descriptors

| EOS ID | Slug | Description |
| --- | --- | --- |
| eos4wt0 | morgan-fps | The Morgan fingerprints are one of the most widely used molecular representations. They are circular representations (from an atom, search the atoms around it within a radius n) and can have thousands of features. This implementation uses the RDKit package and is done with radius 3 and 2048 dimensions, providing a binary vector as output. For Morgan counts, see eos5axz. |
| eos5axz | morgan-counts | The Morgan fingerprints, or extended connectivity fingerprints (ECFP), are one of the most widely used molecular representations. They are circular representations (from an atom, search the atoms around it within a radius n) and can have thousands of features. This implementation uses the RDKit package and is done with radius 3 and 2048 dimensions. |
| eos8a4x | rdkit-descriptors | A set of 200 physicochemical descriptors available from RDKit, including molecular weight, solubility and druggability parameters. We have used the Descriptastorus selection of RDKit descriptors for simplicity. |
| eos7jio | rdkit-fingerprint | Path-based fingerprints calculated with the RDKit function Chem.RDKFingerprint. It is inspired by the Daylight fingerprint. As explained in the RDKit Book, the fingerprinting algorithm identifies all subgraphs in the molecule within a particular range of sizes, hashes each subgraph to generate a raw bit ID, mods that raw bit ID to fit in the assigned fingerprint size, and then sets the corresponding bit. |
| eos78ao | mordred | A set of ca. 1,800 chemical descriptors, including both RDKit and original modules. It is comparable to the well-known PaDEL-Descriptors (see eos7asg), but has shorter calculation times and can process larger molecules. |
| eos4u6p | cc-signaturizer | A set of 25 Chemical Checker bioactivity signatures (including 2D and 3D fingerprints, scaffold, binding, crystals, side effects, cell bioassays, etc.) to capture properties of compounds beyond their structures. Each signature has 128 dimensions, for a total of 3200 dimensions. The signaturizer is periodically updated. We use the 2020-02 version of the signaturizer. |
| eos7w6n | grover-embedding | GROVER is a self-supervised graph neural network for molecular representation, pre-trained on 10 million unlabelled molecules from ChEMBL and ZINC15 (the model provided is GROVERlarge). GROVER has then been fine-tuned to predict several activities from the MoleculeNet benchmark, consistently outperforming other state-of-the-art methods on several benchmark datasets. |
| eos3ae6 | whales-descriptor | Weighted Holistic Atom Localization and Entity Shape (WHALES) is a descriptor based on 3D structure to facilitate natural product featurization. It is aimed at scaffold hopping exercises from natural products to synthetic compounds. |
| eos4x30 | pmapper-3d | The pharmacophore mapper (pmapper) identifies common 3D pharmacophores of active compounds against a specific target and uniquely encodes them with hashes suitable for fast identification of identical pharmacophores. The obtained signatures are amenable to downstream ML tasks. |
| eos2gw4 | eosce | Bioactivity-aware chemical embeddings for small molecules. Using transfer learning, we have created a fast network that produces embeddings of 1024 features condensing physicochemical as well as bioactivity information. The network has been trained using the FS-Mol and ChEMBL datasets, and GROVER, Mordred and ECFP descriptors. |
| eos7asg | padel | PaDEL is a commonly used molecular descriptor. It calculates 1875 molecular descriptors (1444 1D and 2D descriptors, 431 3D descriptors) and 12 types of fingerprints for small molecule representation. |
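The hashing step described for rdkit-fingerprint (hash each subgraph to a raw bit ID, fold it into the fingerprint size, set the bit) and the difference between a binary vector and a count vector can be illustrated with a toy sketch. This is not RDKit code: the string features stand in for subgraphs, SHA-1 is an arbitrary stand-in for RDKit's internal hash, and all function names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def raw_id(feature: str) -> int:
    """Deterministic raw ID for a feature (SHA-1 used only for illustration)."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def folded_bits(features: Iterable[str], n_bits: int = 2048) -> List[int]:
    """Binary fingerprint: a position is 1 if any feature hashes to it."""
    bits = [0] * n_bits
    for feature in features:
        bits[raw_id(feature) % n_bits] = 1
    return bits


def folded_counts(features: Iterable[str], n_bits: int = 2048) -> List[int]:
    """Count fingerprint: a position stores how many features hash to it."""
    counts = [0] * n_bits
    for feature in features:
        counts[raw_id(feature) % n_bits] += 1
    return counts


# Toy "subgraphs"; note that "C-C" occurs twice.
features = ["C-C", "C-O", "C-C", "C-C-O"]
bits = folded_bits(features)
counts = folded_counts(features)
print(sum(bits), sum(counts))
```

The count vector keeps how often a feature occurs, while the binary vector only records presence. In both cases different features can collide on the same position after folding, which is why larger fingerprint sizes lose less information.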
