Model template
This page provides a deep dive into the structure of the model template used to incorporate new models.
Model incorporation has evolved at Ersilia. While the current workflow is largely based on its predecessor, there are some key differences between the legacy version and the workflow described here. The instructions below lay out the steps of the current workflow for incorporating a model into the Ersilia Model Hub, and the last section points out the differences between the two versions.
Anatomy of the Ersilia Model Template
Each model in the Ersilia Model Hub is contained within an individual GitHub repository. Each model repository is created using the Ersilia Model Template upon approval of the Model Request issue. When the new repository is created, please fork it and work on modifying the template from your own user. Open a pull request when the model is ready.
When you have finished the model incorporation, please delete the fork from your own GitHub user. This will prevent abuses of the Git-LFS quota and outdated versions of the models.
Below, we describe the main files you will find in the newly created model repository. Note that some of them, such as README.md, are automatically updated and you do not have to modify them.
The eos identifier
Each model in the Ersilia Model Hub has an Ersilia Open Source (EOS) identifier. This identifier determines the name of the GitHub repository containing the model:
The eos identifier follows this regular expression: eos[1-9][a-z0-9]{3}. That is:
The eos prefix, plus...
one digit (1-9) (the 0 is reserved for test models), plus...
three alphanumeric (a-z and 0-9) characters.
eos identifiers are automatically assigned at repository creation. Please do not modify them.
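As a quick illustration of the regular expression above, the following minimal sketch checks whether a string has the shape of an eos identifier. The function name and the example identifiers are hypothetical and only serve the illustration; this is not Ersilia's own validation code.

```python
import re

# Pattern quoted in the text: "eos", one digit 1-9, then three characters in a-z or 0-9.
EOS_PATTERN = re.compile(r"eos[1-9][a-z0-9]{3}")


def is_valid_eos_id(identifier: str) -> bool:
    """Return True if the whole string matches the eos identifier pattern."""
    # fullmatch rejects any extra characters before or after the identifier.
    return EOS_PATTERN.fullmatch(identifier) is not None


print(is_valid_eos_id("eos1abc"))  # hypothetical identifier with a valid shape
print(is_valid_eos_id("eos0abc"))  # leading 0 is reserved for test models
```

Using fullmatch rather than search keeps strings such as a longer repository path from being accepted just because they contain a valid identifier.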
The metadata.yml file
The metadata.yml file is where all the model information can be found. This is the only place where you should modify or update the model description, interpretation, etc. The Airtable backend, the browsable Model Hub and the README file will automatically be updated from the metadata.yml upon merge of the Pull Request.
The YAML fields are constrained by certain parameters. If they do not adhere to the minimal quality standards, the Pull Request will be rejected and an explanatory message will be available on the GitHub Action. Below we try to provide a comprehensive overview of the metadata accepted:
Identifier: the eos identifier described above. It will be automatically filled in. Do not modify.
Slug: a one-word or multi-word (linked by a hyphen) human-readable identifier, stored as a string, to be used as an alternative to the EOS ID. It will be filled in from the Model Request issue. It can be modified afterwards.
Title: a self-descriptive model title (less than 70 characters)
Description: minimum information about model type, results and the training dataset. We require that all models have a description of at least 200 characters.
Some contributors may find it difficult to come up with a good description for the model. You can find some inspiration in Semantic Scholar. This portal provides an AI-based TL;DR short description of many indexed papers.
Task: the ML task performed by the model. This field is typically a list with one or more entries. The only accepted tasks are: Regression, Classification, Generative, Representation, Similarity, Clustering and Dimensionality reduction.
Mode: mode of training of the models: Pretrained (the checkpoints were downloaded directly from a third party), Retrained (the model was trained again using the same or a new dataset), In-house (if the model has been developed from scratch by Ersilia's contributors) or Online (if the model sends queries to an external server). This field is a string.
Input: data format required by the model. Most chemistry related models, for example, will require compounds as input. Currently, the only accepted inputs by Ersilia are Compound, Protein or Text. This field is a list containing one or more entries. At present Ersilia only works with models with Compound inputs.
Input Shape: format of the input data. It can be Single (one compound), Pair (for example, two compounds), a List, a Pair of Lists or a List of Lists. Please note this refers to the minimum shape for the model to work. If a model predicts, for example, the antimalarial potential of a small molecule, the input shape is Single, regardless of the fact that you can pass several compounds in a list.
Output: description of the model result. It is important to choose the right description. Is the model providing a probability? Is it a score? Is it a new compound? The only accepted output formats are: Boolean, Compound, Descriptor, Distance, Experimental value, Image, Other value, Probability, Protein, Score, Text. This field is a list with one or more acceptable values.
Output Type: the only accepted output types are String, Float or Integer. More than one type can be added as a list if necessary. This field is typically a list with one or more acceptable values.
Output Shape: similar to the input shape, in what format is the endpoint returned? The only accepted output shapes are: Single, List, Flexible List, Matrix or Serializable Object. This field is a string with only a single accepted value.
Interpretation: provide a brief description of how to interpret the model results. For example, in the case of a binary classification model for antimalarial activity based on experimental IC50, indicate the experimental settings (time of incubation, strain of parasite...) and the selected cut-off for the classification.
Tag: labels to facilitate model search. For example, a model that predicts activity against malaria could have P.falciparum as tag. This field is a list with one or more accepted values, since models can have more than one tag. Select between one and five relevant tags from the following categories:
Disease: AIDS, Alzheimer, Cancer, Cardiotoxicity, Cytotoxicity, COVID19, Dengue, Malaria, Neglected tropical disease, Schistosomiasis, Tuberculosis.
Organism: A.baumannii, E.coli, E.faecium, HBV, HIV, Human, K.pneumoniae, Mouse, M.tuberculosis, P.aeruginosa, P.falciparum, Rat, Sars-CoV-2, S.aureus, ESKAPE.
Target: BACE, CYP450, GPCR, hERG.
Experiment: Fraction bound, IC50, Half-life, LogD, LogP, LogS, MIC90, Molecular weight, Papp, pKa.
Application: ADME, Antimicrobial activity, Antiviral activity, Bioactivity profile, Lipophilicity, Metabolism, Microsomal stability, Natural product, Price, Quantum properties, Side effects, Solubility, Synthetic accessibility, Target identification, Therapeutic indication, Toxicity.
Dataset: ChEMBL, DrugBank, MoleculeNet, Tox21, ToxCast, ZINC, TDCommons.
Chemoinformatics: Chemical graph model, Chemical language model, Chemical notation, Chemical synthesis, Compound generation, Descriptor, Drug-likeness, Embedding, Fingerprint, Similarity.
Publication: link to the original publication. Please refer to the journal page whenever possible, instead of PubMed, ResearchGate or other secondary websites. This field is a string with only one accepted value.
Source Code: link to the original code repository of the model. If this is an in-house model, please add here the link of the ML package used to train the model. This field is a string with only one accepted value.
License: the license of the original code. We have included the following OS licenses: MIT, GPL-3.0, LGPL-3.0, AGPL-3.0, Apache-2.0, BSD-2.0, BSD-3.0, Mozilla, CC. You can also select Proprietary or Non-commercial if the authors have included their own license notice (for example restricting commercial usage). If the code was released without a license, please add None in this field. Make sure to abide by the requirements of the original license when re-licensing or sub-licensing third-party author code (such as adding the license file together with the original code). This field is a string with only one accepted value.
If the predetermined fields are not sufficient for your use case, you can open a pull request to include new ones to our repository. Please do so only if strictly necessary (for example, if a disease is not already in the Tag field).
Ersilia maintainers will review and approve / reject PRs for additions to the existing lists of approved items.
Note that these fields are filled in as Python strings, therefore misspellings or lower / uppercases will affect their recognition as valid values.
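To make the constraints above concrete, here is a minimal sketch of how a contributor could pre-check a few metadata fields before opening a Pull Request. It is not the check run by the GitHub Action. The function name is hypothetical, the dictionary keys simply follow the field names as written on this page (the exact key spelling in metadata.yml is an assumption), and only a subset of fields is covered.

```python
# Accepted values copied from the field descriptions on this page (subset only).
ACCEPTED_VALUES = {
    "Task": {"Regression", "Classification", "Generative", "Representation",
             "Similarity", "Clustering", "Dimensionality reduction"},
    "Mode": {"Pretrained", "Retrained", "In-house", "Online"},
    "Input": {"Compound", "Protein", "Text"},
    "Input Shape": {"Single", "Pair", "List", "Pair of Lists", "List of Lists"},
    "Output Type": {"String", "Float", "Integer"},
    "Output Shape": {"Single", "List", "Flexible List", "Matrix",
                     "Serializable Object"},
}


def check_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field, accepted in ACCEPTED_VALUES.items():
        if field not in metadata:
            continue
        value = metadata[field]
        # Some fields hold a list and others a single string; handle both.
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            # Exact, case-sensitive comparison: "regression" is not "Regression".
            if entry not in accepted:
                problems.append(f"{field}: unexpected value {entry!r}")
    if len(metadata.get("Title", "")) >= 70:
        problems.append("Title: must be shorter than 70 characters")
    if "Description" in metadata and len(metadata["Description"]) < 200:
        problems.append("Description: must be at least 200 characters")
    return problems


print(check_metadata({"Task": ["regression"], "Mode": "Pretrained"}))
```

The comparison is deliberately strict, which mirrors the note above: a misspelling or a change of case is reported as an unexpected value.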
The README file
The README.md file is where we give basic information about the model. It reads from the metadata.yml file and it will be automatically updated thanks to a GitHub Action once the Pull Request is approved.
Please do not modify it manually.
The LICENSE file
By default, all code written in contribution to Ersilia should be licensed under a GPLv3 License. The main LICENSE file of the repository, therefore, will be a GPLv3 as specified by GitHub.
However, the license notices for code developed by third parties must be kept in the respective folders where the third-party code is found.
The install.yml file
Ersilia uses an install.yml file to specify installation instructions. The YAML syntax is used because it is easy to read and maintain. This file specifies which Python version to use to build a conda environment or a Docker image for the model.
This dependency configuration file has two top-level keys, namely python and commands. The dependencies are to be specified in the following manner:
The python key expects a string value denoting a Python version (e.g. "3.10").
The commands key expects a list of values, each of which is a list on its own, denoting the dependencies required by the model. Currently, dependencies from pip and conda are supported.
pip dependencies are expected to be three-element lists in the format ["pip", "library", "version"].
conda dependencies are expected to be four-element lists in the format ["conda", "library", "version", "channel"], where channel is the conda channel from which to install the required library.
When the model is run from source, Ersilia always defaults to creating a conda environment for the model to provide isolation. However, when the model is Dockerized, whether conda is used in that process depends entirely on there being conda dependencies in this file.
The install.yml available in the Ersilia Model Template is the following:
In this case, when running the model from source, a Conda environment will be used to isolate the model. Additionally, a conda environment will also be used inside the Docker image of the model. This example provides installation instructions for an environment using Python 3.10.
In this example, rdkit-pypi==2022.3.1b1 will be installed using pip, while pandas=1.3.5 will be installed using conda through the default package channel on conda.
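The structure described above can be sketched in Python as plain data, together with a small check of the three-element and four-element command formats. This is only an illustration of the format, not Ersilia's parser. The function name is hypothetical, and the channel string "default" is an assumption based on the phrase "default package channel" in the text.

```python
def parse_command(command: list) -> dict:
    """Validate one entry of the commands list and return its parts."""
    if not command:
        raise ValueError("empty command")
    tool = command[0]
    if tool == "pip":
        if len(command) != 3:
            raise ValueError('pip commands need ["pip", "library", "version"]')
        _, library, version = command
        # pip pins versions with a double equals sign.
        return {"tool": "pip", "spec": f"{library}=={version}", "channel": None}
    if tool == "conda":
        if len(command) != 4:
            raise ValueError('conda commands need ["conda", "library", "version", "channel"]')
        _, library, version, channel = command
        # conda pins versions with a single equals sign.
        return {"tool": "conda", "spec": f"{library}={version}", "channel": channel}
    raise ValueError(f"unsupported tool: {tool!r}")


# Python-dict rendering of the two keys described in the text.
install = {
    "python": "3.10",
    "commands": [
        ["pip", "rdkit-pypi", "2022.3.1b1"],
        ["conda", "pandas", "1.3.5", "default"],  # channel name is an assumption
    ],
}

parsed = [parse_command(c) for c in install["commands"]]
for entry in parsed:
    print(entry["tool"], entry["spec"], entry["channel"])
```

Keeping the version as its own list element, rather than embedding it in the library name, is what makes it easy to enforce that every dependency is pinned.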
The install.yml file can contain as many commands as necessary. Please limit the packages to the bare minimum required. Sometimes models have additional packages for extra functionalities that are not required to run the model, and it is good practice to trim the package dependencies to the minimum to avoid conflicts. Always pin the version of each package so that the installation is reproducible.
The install.yml file contains the installation instructions of the model. Therefore, the content of this file can vary considerably, since each model will have its own dependencies.
The model folder
The model folder is the most important one. It contains two sub-folders:
framework: In this folder, we keep all the necessary code to run the model (assuming dependencies are already installed).
checkpoints: In this folder, we store the model data (pretrained model parameters, scaling data, etc).
The model folder should not contain anything other than the framework and checkpoints subfolders. When the Ersilia CLI eventually fetches the model, it reorganizes the code and the only subfolders it keeps are these two. Any other file or folder at the model/ directory level will be overlooked.
Often, the separation between framework and checkpoints is not easy to determine. Sometimes, models obtained from third parties have model data embedded within the code or as part of the repository. In these cases, it is perfectly fine to keep model data in the framework subfolder, and leave the checkpoints subfolder empty.
The framework subfolder contains at least one Bash file, named run.sh. This file will run as follows:
Unless strictly necessary, the run.sh file should accept three and only three arguments, namely FRAMEWORK_DIR, DATA_FILE and OUTPUT_FILE. In the current template, we provide the following example:
In this case, a Python file located in the [FRAMEWORK_DIR]/code folder is executed, taking as input (-i) the DATA_FILE and giving as output (-o) the OUTPUT_FILE.
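The three-argument convention can be sketched from the caller's side in Python. This is a minimal illustration, assuming the arguments are passed positionally in the order FRAMEWORK_DIR, DATA_FILE, OUTPUT_FILE and that run.sh sits directly inside the framework folder. The function names and example paths are hypothetical.

```python
import os
import subprocess


def build_run_command(framework_dir: str, data_file: str, output_file: str) -> list:
    """Build the argument list used to call run.sh with its three arguments."""
    run_sh = os.path.join(framework_dir, "run.sh")
    # Assumed positional order: FRAMEWORK_DIR, DATA_FILE, OUTPUT_FILE.
    return ["bash", run_sh, framework_dir, data_file, output_file]


def run_model(framework_dir: str, data_file: str, output_file: str) -> None:
    """Execute run.sh and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_run_command(framework_dir, data_file, output_file), check=True)


# Only builds and prints the command; nothing is executed here.
print(build_run_command("model/framework", "input.csv", "output.csv"))
```

Passing the arguments as a list instead of a single shell string avoids quoting problems when a path contains spaces.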
To understand this further, we now need to inspect the main.py file from the step above in more detail. The current template proposes the following script:
In this case, the model simply calculates the molecular weight and adds a number to it.
The important steps of the script are:
Load model parameters.
Read input file.
Run predictions using the input file and the model parameters.
Write the output.
Most of the work of the model contributor will be on this or similar scripts. In the template, we provide a dummy model (i.e. add a fixed value to the molecular weight). This dummy model can be defined within the script itself (my_model). However, in real-world cases, the model will most likely be loaded from a third-party Python library, or from a (cloned) repository placed in the same directory.
To summarize, in the template, we provide a structure that follows this logic:
The run.sh script executes the Python main.py script.
The main.py script:
Defines the model code.
Loads parameters from checkpoints.
Reads an input file containing SMILES (with header).
Runs a model that calculates molecular weight and adds an integer defined by the parameters.
Writes an output file containing one column corresponding to the output value (with a header).
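The logic summarized above can be sketched as a small script with the same four steps. This is not the template's main.py. To keep the sketch free of external dependencies, the molecular weight calculation is swapped for a stand-in (the length of the SMILES string), and the params.json file name, the "offset" key and the "value" header are hypothetical.

```python
import csv
import json
import os


def load_params(checkpoints_dir):
    # Step 1: load model parameters (hypothetical params.json file).
    with open(os.path.join(checkpoints_dir, "params.json")) as f:
        return json.load(f)


def read_smiles(input_file):
    # Step 2: read a one-column CSV, skipping the header row.
    with open(input_file, newline="") as f:
        reader = csv.reader(f)
        next(reader)
        return [row[0] for row in reader if row]


def my_model(smiles_list, params):
    # Step 3: stand-in prediction. The template computes molecular weight;
    # the SMILES string length is used here only to avoid an RDKit dependency.
    return [len(smi) + params["offset"] for smi in smiles_list]


def write_output(output_file, values):
    # Step 4: write one output column with a header.
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        for v in values:
            writer.writerow([v])


def main(input_file, output_file, checkpoints_dir):
    params = load_params(checkpoints_dir)
    smiles_list = read_smiles(input_file)
    outputs = my_model(smiles_list, params)
    # Sanity check: one output per input molecule.
    assert len(outputs) == len(smiles_list)
    write_output(output_file, outputs)
```

In a real incorporation, my_model would be replaced by a call into the third-party library or cloned repository, while the reading and writing steps usually stay close to this shape.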
In the template, the example provided is very simple. Depending on the model being incorporated, the logic may be different. For example, many third party models already contain a command-line option, with a specific syntax. In these cases, you may want to write scripts to adapt the input and the output, and then execute the model as-is.
Each script will be one main.py file; we can create as many as necessary and rename them appropriately (see below for examples).
The .gitattributes file
We use Git LFS to store large files (over 100 MB). Typically, these files are model parameters. Files to be stored in Git LFS should be specified in the .gitattributes file. The current file will store in Git LFS all files in csv, h5, joblib, pkl, pt and tsv format.
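As a small illustration, the following sketch checks whether a file name falls under the formats listed above. It assumes the .gitattributes patterns match on file extension (for example, a pattern of the form *.pkl), which is an assumption about the file's contents. The function name and file paths are hypothetical.

```python
import os

# Formats listed in the text as stored in Git LFS by the current .gitattributes.
LFS_EXTENSIONS = {"csv", "h5", "joblib", "pkl", "pt", "tsv"}


def is_lfs_tracked(path: str) -> bool:
    """Return True if the file extension is one of the listed LFS formats."""
    extension = os.path.splitext(path)[1].lstrip(".")
    # The comparison is case-sensitive, so "model.PKL" is not treated as tracked.
    return extension in LFS_EXTENSIONS


print(is_lfs_tracked("model/checkpoints/weights.pt"))
print(is_lfs_tracked("model/framework/code/main.py"))
```

A check like this can help spot a large checkpoint saved in a format that is not yet listed, in which case the .gitattributes file would need a new entry.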
The Legacy Model Template
At the time of writing this tutorial, the Ersilia Model Hub has approximately 150 models developed with this template. As mentioned above, this legacy template served as an inspiration for the current workflow; however, there are several key differences that should be called out. These differences largely pertain to how dependencies are specified in these models, the tools used to create a model server, and the source files facilitating that process.
The Dockerfile file
Ersilia uses a Dockerfile file to specify installation instructions. The reason for this is that Docker provides the maximum level of isolation possible (i.e. a container), which may be needed to run models in some systems. However, in most practical scenarios, a Docker container will not be necessary and a Conda environment, or even a Virtualenv environment, will suffice. The Ersilia CLI will decide which isolation level to provide depending on the content of the Dockerfile:
The Dockerfile available in the Ersilia Model Template is the following:
The first line of the Dockerfile indicates that this Conda environment will have BentoML 0.11.0 installed on Python 3.10. In this example, the rdkit library will be installed using conda, and joblib will be installed using pip.
The Dockerfile can contain as many RUN commands as necessary, between the MAINTAINER and the WORKDIR lines. Please limit the packages to the bare minimum required. Sometimes models have additional packages for extra functionalities that are not required to run the model, and it is good practice to trim the package dependencies to the minimum to avoid conflicts. Whenever possible, pin the version of the package.
The Dockerfile contains the installation instructions of the model. Therefore, the content of this file can vary considerably, since each model will have its own dependencies.
The service file
The service file is located in src/service.py. It contains the necessary code to facilitate model bundling with BentoML.
There are three main classes in the service file, namely Model, Artifact and Service.
The Model class
This class is simply a wrapper for the AI/ML model. Typically, when incorporating external (type 1) models, the run.sh script will already capture the logic within the Model class, in which case the Model class is simply redundant. However, when incorporating internally developed (types 2 and 3) models into the hub, we can make use of the artifacts for standard modeling frameworks (e.g. sklearn, PyTorch, and Keras) provided by BentoML, and the Model class becomes necessary for BentoML compatibility. Hence, the Model class enables generalization between these types of model additions.
Typically, the central method of the Model class is the run method.
In this case, the model takes as input a list of molecules represented as SMILES strings. This is the standard input type for models focused on chemistry data as input.
Models incorporated with this workflow do not allow for multiple endpoints, or "methods". All models developed in this manner only have a single run API.
In its simplest form, the Model class just points Ersilia to the model directory and then creates a Bash file to execute the necessary commands to run the model. It is actually a very simple class, although it may look overwhelming at first. We break it down below:
First, a temporary directory is created:
Then, a data file is created in the temporary directory. In this example, it is simply a one-column csv file having a header (smiles) and a list of molecules in SMILES format (one per row):
Now we already have the input file of the run.sh script, located in the model/framework/ directory, as specified above. The following creates a dummy Bash script in the temporary directory and runs the command from there. The output is saved in the temporary directory too. Remember that the run.sh script expects three arguments, FRAMEWORK_DIR, DATA_FILE and OUTPUT_FILE.
The last step is to read from the output in the temporary directory and return it in a JSON-serializable format. The output in the example is a csv table, with one or multiple columns, containing numeric data. The table has a header, which is read and saved as metadata.
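The steps described above (temporary directory, input file, wrapper Bash script, output parsing) can be sketched as follows. This is a simplified illustration and not the code in src/service.py. The function names, the file names inside the temporary directory and the keys of the returned dictionary are all hypothetical, and the positional order of the run.sh arguments is assumed.

```python
import csv
import os
import shutil
import subprocess
import tempfile


def write_input(data_file, smiles_list):
    # One-column CSV: a "smiles" header followed by one molecule per row.
    with open(data_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["smiles"])
        for smi in smiles_list:
            writer.writerow([smi])


def read_output(output_file):
    # The header is kept as metadata; every other row becomes a list of floats.
    with open(output_file, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        values = [[float(x) for x in row] for row in reader if row]
    return {"header": header, "values": values}


def run(framework_dir, smiles_list):
    tmp_folder = tempfile.mkdtemp(prefix="eos-")
    try:
        data_file = os.path.join(tmp_folder, "input.csv")
        output_file = os.path.join(tmp_folder, "output.csv")
        run_file = os.path.join(tmp_folder, "run.sh")
        write_input(data_file, smiles_list)
        # Wrapper script that calls the model's run.sh with its three arguments.
        with open(run_file, "w") as f:
            f.write('bash "{0}/run.sh" "{0}" "{1}" "{2}"\n'.format(
                framework_dir, data_file, output_file))
        subprocess.run(["bash", run_file], check=True)
        return read_output(output_file)
    finally:
        # Clean up the temporary directory whether or not the model succeeded.
        shutil.rmtree(tmp_folder, ignore_errors=True)
```

The returned dictionary contains only lists, strings and floats, so it can be passed directly to json.dumps.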
You will see that, in the template, pointers to potential edits are highlighted with the tag # EDIT. Necessary edits relate to the format of the input data, or the serialization to JSON format from the output data.
Advanced contributors may want to modify the Model class to load a model in-place (for example, a Scikit-Learn model) instead of executing a Bash command in the model/framework/ directory.
The Artifact class
This class mirrors BentoML artifacts. It simply contains load, save, get and pack functionalities:
You don't have to modify this class.
The Service class
This class is used to create the service. The service exposes the run API:
By default, Ersilia works with JSON inputs, which in this case are deserialized as a SMILES list inside the API. The default API is run. As a general rule, do not modify it.