Inputs
Here we describe the types of valid inputs when running Ersilia models.
Small molecule structures are typically expressed in the well-known SMILES format. Below are the SMILES strings of a few drug molecules:
Artemisin
CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O
Isoniazid
C1=CN=CC=C1C(=O)NN
Tenofovir
CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O
Aspirin
CC(=O)OC1=CC=CC=C1C(=O)O
Ibuprofen
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
Remdesivir
CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C
Cephalotaxin
COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5
In the most common case, each input sample corresponds to a single molecule. However, some models expect a list of molecules as input, and some even expect more complex inputs such as pairs of lists of molecules. Multiple molecules (lists) can be serialized to a single string in SMILES notation by separating them with the dot (.) character.
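As a minimal sketch of the dot-separated serialization described above, the snippet below joins three of the SMILES strings listed earlier into one string and splits it back. The helper names `join_smiles` and `split_smiles` are hypothetical and are not part of Ersilia.

```python
# Three molecules taken from the list of examples above.
isoniazid = "C1=CN=CC=C1C(=O)NN"
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
ibuprofen = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"


def join_smiles(smiles_list):
    """Serialize a list of SMILES strings into one dot-separated string."""
    return ".".join(smiles_list)


def split_smiles(joined):
    """Recover the individual SMILES strings from a dot-separated string."""
    return joined.split(".")


molecule_list = [isoniazid, aspirin, ibuprofen]
serialized = join_smiles(molecule_list)
print(serialized)
```

Splitting on the dot only recovers the original list when none of the individual SMILES strings contains a dot itself, as is the case for the molecules used here.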
Valid input file formats are comma-separated (.csv), tab-separated (.tsv) and JSON (.json). Ersilia automatically recognizes these formats. Inputs can also be passed as Python instances through the Ersilia Python API.
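To make the three file formats concrete, here is a small sketch that writes the same two molecules as .csv, .tsv and .json files using only the Python standard library. The `smiles` column header, the flat JSON list layout, and the `write_inputs` helper are assumptions made for illustration; check the Ersilia documentation for the exact layout each model expects.

```python
import csv
import json
import tempfile
from pathlib import Path

smiles = ["C1=CN=CC=C1C(=O)NN", "CC(=O)OC1=CC=CC=C1C(=O)O"]


def write_inputs(folder, smiles_list, header="smiles"):
    """Write the same inputs as .csv, .tsv and .json (hypothetical helper)."""
    folder = Path(folder)
    paths = {}
    # Tabular formats: one header row, then one molecule per row.
    for ext, delimiter in ((".csv", ","), (".tsv", "\t")):
        path = folder / f"inputs{ext}"
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter=delimiter)
            writer.writerow([header])
            writer.writerows([s] for s in smiles_list)
        paths[ext] = path
    # JSON format: assumed here to be a flat list of SMILES strings.
    json_path = folder / "inputs.json"
    json_path.write_text(json.dumps(smiles_list))
    paths[".json"] = json_path
    return paths


tmp_dir = tempfile.mkdtemp()
paths = write_inputs(tmp_dir, smiles)
print(sorted(p.name for p in paths.values()))
```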
It is possible to run Ersilia models on a single input as well as on multiple inputs. Ersilia automatically detects whether one or multiple inputs are passed.
This is the simplest case where one single molecule is passed as input.
This is a common case too, where multiple single molecules are passed as input. The model will run predictions/calculations for each molecule independently.
Here the model expects a list of molecules as input, so a single prediction/calculation is made based on the list as a whole. Some generative models, for example, require multiple molecules as a starting point for one generation round.
Here, multiple lists are passed as input, and each list is treated independently by the model.
This corresponds to a less common case where one input is expressed as a pair of lists. An example would be a model that compares two sets of molecules and returns an overall similarity value (one float number) between the two sets.
Please note that the compound_pair_of_lists.csv file contains two columns, one for each set. The first set has three molecules, and the second set has four molecules. Therefore, the first column contains one empty cell.
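The two-column layout with unequal set sizes can be sketched as follows. The column headers `set_1` and `set_2`, the helper name, and the choice to place the empty cell in the last row are assumptions for illustration; the actual compound_pair_of_lists.csv file may differ.

```python
import csv
import io
from itertools import zip_longest

# First set: three molecules. Second set: four molecules.
set_a = [
    "C1=CN=CC=C1C(=O)NN",
    "CC(=O)OC1=CC=CC=C1C(=O)O",
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
]
set_b = [
    "CC1C2C(CC3(C=CC(=O)C(=C3C2OC1=O)C)C)O",
    "CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
    "CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
    "COC1=CC23CCCN2CCC4=CC5=C(C=C4C3C1O)OCO5",
]


def pair_of_lists_to_csv(first, second, headers=("set_1", "set_2")):
    """Write two lists side by side, padding the shorter one with empty cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    for a, b in zip_longest(first, second, fillvalue=""):
        writer.writerow([a, b])
    return buffer.getvalue()


csv_text = pair_of_lists_to_csv(set_a, set_b)
print(csv_text)
```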
Multiple pairs of lists can be passed to obtain multiple predictions/calculations, one for each pair. Like in the case of multiple lists, molecules can be separated with a dot character in a tabular file.
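One way to picture multiple pairs of lists in a tabular file is one row per pair, with each cell holding a dot-separated list. The sketch below assumes that layout; the helper names are hypothetical and do not come from Ersilia.

```python
# Each pair is (first list, second list); each list holds SMILES strings.
pairs = [
    (
        ["C1=CN=CC=C1C(=O)NN", "CC(=O)OC1=CC=CC=C1C(=O)O"],
        ["CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"],
    ),
    (
        ["CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O"],
        ["CC(=O)OC1=CC=CC=C1C(=O)O", "C1=CN=CC=C1C(=O)NN"],
    ),
]


def pairs_to_rows(pairs_of_lists):
    """Turn each pair of lists into a two-cell row of dot-separated SMILES."""
    return [[".".join(first), ".".join(second)] for first, second in pairs_of_lists]


def rows_to_pairs(rows):
    """Invert pairs_to_rows by splitting each cell on the dot character."""
    return [(first.split("."), second.split(".")) for first, second in rows]


rows = pairs_to_rows(pairs)
for row in rows:
    print("\t".join(row))
```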
You can generate inputs of arbitrary size for your model of interest with the following command. In this case, we generate 1,000 inputs for the chemprop-antibiotic model and store them as a .csv file.
This command simply samples drug molecules from the attached table.
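The idea of sampling inputs from a table can be illustrated with the standard library alone. This is only a sketch of the concept, not Ersilia's implementation: the table below is just the handful of molecules listed on this page, and `sample_inputs` is a hypothetical helper.

```python
import random

# A tiny stand-in table built from the example molecules on this page.
drug_table = {
    "Isoniazid": "C1=CN=CC=C1C(=O)NN",
    "Tenofovir": "CC(CN1C=NC2=C(N=CN=C21)N)OCP(=O)(O)O",
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    "Remdesivir": "CC1(OC2C(OC(C2O1)(C#N)C3=CC=C4N3N=CN=C4N)CO)C",
}


def sample_inputs(table, n, seed=0):
    """Draw n distinct molecules from the table, reproducibly for a given seed."""
    rng = random.Random(seed)
    names = rng.sample(sorted(table), n)
    return [table[name] for name in names]


sampled = sample_inputs(drug_table, 3)
print(sampled)
```

Sampling here is without replacement, so `n` cannot exceed the size of the table.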
Note that, in the single-list case, the compound_list.csv file is the same as the compound_singles.csv file used for multiple single molecules. However, the model treats these files differently. In the list case, the full list corresponds to one input, whereas in the single-molecule case each molecule is an independent input. The Input Shape field of each model is labelled Single or List accordingly.
Many small molecule databases exist in the public domain. If you want to look for a molecule of interest, consider the following resources:
as a go-to search tool to obtain generalistic chemical information.
as a search tool for bioactivity data of medicinal chemistry compounds.
to obtain comprehensive information about drug molecules.
to search for commercially-available libraries of compounds.