Inputs
This page describes the types of input that are valid when running Ersilia models.
Small molecule structures are typically expressed in the well-known SMILES format. Below are the SMILES strings of a few drug molecules:
Drug | SMILES |
---|---|
Artemisin | CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O |
Isoniazid | C1=CN=CC=C1C(=O)NN |
Tenofovir | CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O |
Aspirin | CC(=O)OC1=CC=CC=C1C(=O)O |
Ibuprofen | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |
Remdesivir | CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C |
Cephalotaxin | COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5 |
In the most common case, each input sample corresponds to a single molecule. However, some models expect a list of molecules as input, and some even expect more complex inputs such as pairs of lists of molecules. Multiple molecules (lists) can be serialized to strings in the SMILES notation with the dot (.) character.
Valid input file formats are comma-separated (.csv), tab-separated (.tsv) and JSON (.json). Ersilia automatically recognizes these formats. Inputs can also be passed as Python instances through the Ersilia Python API.
It is possible to run Ersilia models for one input as well as for multiple inputs. Ersilia automatically detects whether one or multiple inputs are passed.
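To make the three file formats concrete, the sketch below reads SMILES strings from a .csv, .tsv or .json file using only the Python standard library, and always returns a list, whether the file holds one input or several. This is not Ersilia's own detection logic: load_smiles is a hypothetical helper, and it assumes a header row with the SMILES in the first column of tabular files.

```python
import csv
import json
from pathlib import Path


def load_smiles(path):
    """Return the SMILES strings stored in a .csv, .tsv or .json file as a list.

    Hypothetical helper for illustration; assumes tabular files have a header
    row and keep the SMILES in their first column.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        data = json.loads(path.read_text())
        # A bare JSON string is a single input; wrap it so callers get a list.
        return [data] if isinstance(data, str) else list(data)
    if suffix in (".csv", ".tsv"):
        delimiter = "," if suffix == ".csv" else "\t"
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle, delimiter=delimiter))
        # Skip the header row and keep the first column of each non-empty row.
        return [row[0] for row in rows[1:] if row]
    raise ValueError(f"Unsupported input format: {suffix}")
```

Returning a list in every case means downstream code does not need to branch on whether one or several molecules were supplied.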
This is the simplest case where one single molecule is passed as input.
CSV
JSON
Python
compound_single.csv
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
compound_single.json
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O"
smiles = "CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O"
This is a common case too, where multiple single molecules are passed as input. The model will run predictions/calculations for each molecule independently.
CSV
JSON
Python
compound_singles.csv
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
C1=CN=CC=C1C(=O)NN
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
compound_singles.json
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
smiles = [
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
]
The majority of Ersilia chemistry models take single molecules as input.
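The CSV, JSON and Python forms shown above carry the same information. As a minimal sketch, assuming only the Python standard library, a Python list of SMILES can be written to both file formats as follows. The file names mirror the examples above, and a temporary directory is used as an arbitrary output location.

```python
import csv
import json
import tempfile
from pathlib import Path

smiles = [
    "CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
    "C1=CN=CC=C1C(=O)NN",
    "CC(=O)OC1=CC=CC=C1C(=O)O",
]

# Arbitrary output location for this sketch.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "compound_singles.csv"
json_path = out_dir / "compound_singles.json"

# CSV: a header row followed by one molecule per row.
with csv_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["smiles"])
    writer.writerows([s] for s in smiles)

# JSON: a flat array of strings.
json_path.write_text(json.dumps(smiles, indent=2))
```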
Here the model expects a list of molecules as input; therefore, a single prediction/calculation is made for the list as a whole. Some generative models, for example, require multiple molecules as a starting point for one generation round.
CSV
JSON
Python
compound_list.csv
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
C1=CN=CC=C1C(=O)NN
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
compound_list.json
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
smiles_list = [
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
]
Note that, in this case, the compound_list.csv file is the same as the compound_singles.csv file, corresponding to the multiple single molecules. However, the model will treat these files differently. In the current case, the full list corresponds to one input, whereas in the previous case each single molecule was an independent input. In the Ersilia Model Hub, the Input Shape field is labelled as Single or List, correspondingly.
Here, multiple lists are passed as input, and each list is treated independently by the model.
CSV
JSON
Python
compound_lists.csv
smiles
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O.C1=CN=CC=C1C(=O)NN.CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O.CC(C)CC1=CC=C(C=C1)C(C)C(=O)O.CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C.COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
compound_lists.json
[
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
],
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
]
smiles_lists = [
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
],
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
],
]
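In the CSV example above, each row holds one list whose molecules are joined with a dot. A minimal sketch of that conversion, using only the Python standard library and a shortened version of the lists, could look like this:

```python
import csv
import io

smiles_lists = [
    ["CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O", "C1=CN=CC=C1C(=O)NN"],
    ["CC(=O)OC1=CC=CC=C1C(=O)O", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"],
]

# Write one row per list, joining the molecules of each list with a dot.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["smiles"])
for molecules in smiles_lists:
    writer.writerow([".".join(molecules)])
csv_text = buffer.getvalue()

# Splitting each data row on the dot recovers the original lists.
recovered = [line.split(".") for line in csv_text.splitlines()[1:]]
```

Note that splitting on the dot treats every dot-separated fragment as its own molecule, so a multi-component SMILES such as a salt would be split into its components.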
To specify a list in a single column in a .csv file, molecules can be separated with a dot (.). The type of delimiter is specific to the input type. The SMILES notation naturally accepts the dot as a separator for multiple molecules.
This corresponds to a less common case where one input is expressed as a pair of lists. An example would be a model comparing two sets of molecules and returning an overall similarity value (one float number) between the two sets.
CSV
JSON
Python
compound_pair_of_lists.csv
smiles_1,smiles_2
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O,CC(=O)OC1=CC=CC=C1C(=O)O
C1=CN=CC=C1C(=O)NN,CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O,CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
,COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
compound_pair_of_lists.json
[
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
],
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
]
smiles_pair_of_lists = (
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN",
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
],
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
],
)
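In the CSV form above, each list occupies its own column, and the shorter list leaves an empty cell. The sketch below, which assumes only the Python standard library, writes a pair of lists of unequal length in that layout with itertools.zip_longest and reads it back.

```python
import csv
import io
from itertools import zip_longest

smiles_pair_of_lists = (
    [
        "CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
        "C1=CN=CC=C1C(=O)NN",
        "CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
    ],
    [
        "CC(=O)OC1=CC=CC=C1C(=O)O",
        "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
        "CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
        "COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
    ],
)

first, second = smiles_pair_of_lists

# Pair the two lists row by row, padding the shorter one with empty cells.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["smiles_1", "smiles_2"])
writer.writerows(zip_longest(first, second, fillvalue=""))
csv_text = buffer.getvalue()

# Reading back: drop the header, then ignore empty cells in each column.
rows = list(csv.reader(io.StringIO(csv_text)))[1:]
first_back = [row[0] for row in rows if row[0]]
second_back = [row[1] for row in rows if row[1]]
```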
Please note that the compound_pair_of_lists.csv file contains two columns, one for each set. The first set has three molecules, and the second set has four molecules. Therefore, the first column contains one empty cell.
Multiple pairs of lists can be passed to obtain multiple predictions/calculations, one for each pair. As in the case of multiple lists, molecules can be separated with a dot character in a tabular file.
CSV
JSON
Python
compound_pairs_of_lists.csv
smiles_1,smiles_2
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O.C1=CN=CC=C1C(=O)NN,CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O.CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C.COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
compound_pairs_of_lists.json
[
[
[
"CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
"C1=CN=CC=C1C(=O)NN"
],
[
"CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"
]
],
[
[
"CC(=O)OC1=CC=CC=C1C(=O)O",
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
],
[
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5"
]
]
]
smiles_pairs_of_lists = [
[
["CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O", "C1=CN=CC=C1C(=O)NN"],
["CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"],
],
[
["CC(=O)OC1=CC=CC=C1C(=O)O", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"],
[
"CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
"COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
],
],
]
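Going in the other direction, the two-column CSV above can be turned into the nested Python structure by splitting each cell on the dot. This is a minimal sketch with the standard library; the column names smiles_1 and smiles_2 are taken from the example file.

```python
import csv
import io

csv_text = """smiles_1,smiles_2
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O.C1=CN=CC=C1C(=O)NN,CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
CC(=O)OC1=CC=CC=C1C(=O)O.CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C.COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
"""

# Each row is one pair; each cell is one dot-separated list of molecules.
reader = csv.DictReader(io.StringIO(csv_text))
smiles_pairs_of_lists = [
    [row["smiles_1"].split("."), row["smiles_2"].split(".")]
    for row in reader
]
```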
You can generate inputs of arbitrary size for your model of interest with the following command. In this case, we generate 1,000 inputs for the chemprop-antibiotic model and store them as a .csv file.
ersilia example chemprop-antibiotic -n 1000 -f input.csv
This command simply samples drug molecules from the attached table.
drug_molecules.tsv
Tab-separated file containing drug molecules from the Drug Repurposing Hub. InChIKeys, SMILES and names are provided.
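Sampling molecules from a tab-separated table of this kind can be sketched as follows. This only illustrates the idea and is not the implementation behind the ersilia example command. The sample_smiles helper, the column names (smiles, name) and the three-row table are assumptions made for the example; the real file may use different headers.

```python
import csv
import io
import random

# Hypothetical stand-in for a table of drug molecules; column names are assumed.
table = (
    "smiles\tname\n"
    "CC(=O)OC1=CC=CC=C1C(=O)O\tAspirin\n"
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O\tIbuprofen\n"
    "C1=CN=CC=C1C(=O)NN\tIsoniazid\n"
)


def sample_smiles(tsv_text, n, seed=None):
    """Draw up to n distinct SMILES at random from a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    smiles = [row["smiles"] for row in reader]
    rng = random.Random(seed)
    # Never ask for more rows than the table contains.
    return rng.sample(smiles, min(n, len(smiles)))


sampled = sample_smiles(table, 2, seed=0)
```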
Ersilia will automatically detect the SMILES column and the format in an input file, so it is acceptable to pass the drug_molecules.tsv file as is, or a chunk of it.
Many small molecule databases exist in the public domain. If you want to look for a molecule of interest, consider the following resources: