Inputs
Here we describe the types of valid inputs while running Ersilia models
Compound inputs
Small molecule structures are typically expressed in the well known SMILES format. Below are the SMILES strings of a few drug molecules:
Artemisin
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
Isoniazid
C1=CN=CC=C1C(=O)NN
Tenofovir
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
Aspirin
CC(=O)OC1=CC=CC=C1C(=O)O
Ibuprofen
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
Remdesivir
CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
Cephalotaxin
COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
In the most common case, each input sample corresponds to a single molecule. However, some models expect a list of molecules as input, and some even expect more complex inputs such as pairs of lists of molecules. Multiple molecules (lists) can be serialized to strings in the SMILES notation with the dot (.
) character.
Valid input file formats are comma-separated (.csv
), tab-separated (.tsv
) and JSON (.json
). Ersilia automatically recognizes these formats. Inputs can also be passed as Python instances through the Ersilia Python API.
It is possible to run Ersilia models for one input as well as multiple inputs. Ersilia automatically detects the if one or multiple inputs are passed.
Single molecules
One input
This is the simplest case where one single molecule is passed as input.
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
Multiple inputs
This is a common case too, where multiple single molecules are passed as input. The model will run predictions/calculations for each molecule independently.
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
C1=CN=CC=C1C(=O)NN
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
List of molecules
One input
Here the molecule expects a list of molecules as input, therefore, the one prediction/calculation will be done based on the list as a whole. Some generative models, for example, require multiple molecules as a starting point for one generation round.
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
C1=CN=CC=C1C(=O)NN
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
Note that, in this case, the compound_list.csv
file is the same as the compound_singles.csv
file, corresponding to the multiple single molecules. However, the model will treat these files differently. In the current case, the full list corresponds to one input, whereas in the previous case each single molecule was an independent input. In the Ersilia Model Hub, the Input Shape field is labelled as Single or List, correspondingly.
Multiple inputs
Here, multiple lists are passed as input, and each list is treated independently by the model.
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O.C1=CN=CC=C1C(=O)NN.CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O.CC(C)CC1=CC=C(C=C1)C(C)C(=O)O.CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C.COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
Pair of lists of molecules
One input
This corresponds to a less common case where, one input is expressed as a pair of lists. An example would be a model comparing two sets of molecules and returning an overall similarity values (one float number) between the two sets.
smiles_1,smiles_2
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O,CC(=O)OC1=CC=CC=C1C(=O)O
C1=CN=CC=C1C(=O)NN,CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O,CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
,COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
Please note that the compound_pair_of_lists.csv
file contains two columns, one for each set. The first set has three molecules, and the second set has four molecules. Therefore, the first column contains one empty row.
Multiple inputs
Multiple pairs of lists can be passed to obtain multiple predictions/calculations, one for each pair. Like in the case of multiple lists, molecules can be separated with a dot character in a tabular file.
smiles_1,smiles_2
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O.C1=CN=CC=C1C(=O)NN,CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O.CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C.COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
Real-world sets of molecules
You can generate inputs of arbitrary size for your model of interest with the following command. In this case, we generate 1,000 inputs for the chemprop-antibiotic
model and store them as a .csv
file.
ersilia example chemprop-antibiotic -n 1000 -f input.csv
This command simply samples drug molecules from the attached table.
Small molecule databases
Many small molecule databases exist in the public domain. If you want to look for a molecule of interest, consider the following resources:
Last updated
Was this helpful?