Model template

This pages provides a deep dive into the structure of the model template for new model incorporation.

Model Incorporation has evolved at Ersilia, and while this workflow is largely based on its predecessor, there are some key differences in the legacy version and the workflow listed here. The instructions below lay out the steps from the current workflow for incorporating a model in the Ersilia Model Hub, with the last section pointing out the differences between the two versions.

Anatomy of the Ersilia Model Template

Each model in the Ersilia Model Hub is contained within an individual GitHub repository. Each model repository is created using the Ersilia Model Template upon approval of the Model Request issue. When the new repository is created, please fork it and work on modifying the template from your own user. Open a pull request when the model is ready.

When you have finished the model incorporation, please delete the fork from your own GitHub user. This will prevent abuses of the Git-LFS quota and outdated versions of the models.

Below, we describe the main files you will find in the newly created model repository. Note that some of them are automatically updated and you do not have to modify them, like the README.MD.

The eos identifier

Each model in the Ersilia Model Hub has an Ersilia Open Source (EOS) identifier. This identifier determines the name of the GitHub repository containing the model:

https://github.com/ersilia-os/[EOS_IDENTIFIER]

The eos identifier follows this regular expression: eos[1-9][a-z0-9]{3}. That is:

  • The eos prefix, plus...

  • one digit (1-9) (the 0 is reserved for test models), plus...

  • three alphanumeric (a-z and 0-9) characters.

eos identifiers are automatically assigned at repository creation. Please do not modify them.

The metadata.yml file

The metadata.yml file is where all the model information can be found. This is the only place where you should modify or update the model description, interpretation etc. The Airtable backend, the browsable Model Hub and the README file will automatically be updated from the metadata.yml upon merge of the Pull Request.

The YAML fields are constrained by certain parameters. If they do not adhere to the minimal quality standards, the Pull Request will be rejected and an explanatory message will be available on the GitHub Action. Below we try to provide a comprehensive overview of the metadata accepted:

Identifier: the eos identifier described above. It will be automatically filled in. Do not modify.

Slug: a one-word or multi-word (linked by a hypen) human-readable identifier, stored as a string, to be used as an alternative to the EOS ID. It will be filled in from the Model Request issue. it can be modified afterwards.

Title: a self-descriptive model title (less than 70 characters)

Description: minimum information about model type, results and the training dataset. We require that all models have a description of minimum 200 characters.

Some contributors may find it difficult to come up with a good description for the model. You can find some inspiration in Semantic Scholar. This portal provides an AI-based TL;DR short description of many indexed papers.

Task: the ML task performed by the model. This field is typically a list, with one ore more than one entries. The only accepted tasks are: Regression, Classification, Generative, Representation, Similarity, Clustering and Dimensionality reduction.

Mode: mode of training of the models: Pretrained (the checkpoints were downloaded directly from a third party), Retrained (the model was trained again using the same or a new dataset), In-house (if the model has been developed from scratch by Ersilia's contributors) or Online (if the model sends queries to an external server). This field is a string.

Input: data format required by the model. Most chemistry related models, for example, will require compounds as input. Currently, the only accepted inputs by Ersilia are Compound, Protein or Text. This field is a list containing one or more entries. At present Ersilia only works with models with Compound inputs.

Input Shape: format of the input data. It can be Single (one compound), Pair (for example, two compounds), a List, a Pair of Lists or a List of Lists. Please note this refers to the minimum shape for the model to work. If a model predicts, for example, the antimalarial potential of a small molecule, the input shape is Single, regardless of the fact that you can pass several compounds in a list.

Output: description of the model result. It is important to choose the right description. Is the model providing a probability? Is it a score? Is it a new compound? The only accepted output formats are: Boolean, Compound, Descriptor, Distance, Experimental value, Image, Other value, Probability, Protein, Score, Text. This field is a list with one or more acceptable values.

Output Type: the only accepted output types are String, Float or Integer. More than one type can be added as a list if necessary. This field is typically a list with one or more acceptable values.

Output Shape: similar to the input shape, in what format is the endpoint returned? The only accepted output shapes are: Single, List, Flexible List, Matrix or Serializable Object. This field is a string with only a single accepted value.

Interpretation: provide a brief description of how to interpret the model results. For example, in the case of a binary classification model for antimalarial activity based on experimental IC50, indicate the experimental settings (time of incubation, strain of parasite...) and the selected cut-off for the classification.

Tag: labels to facilitate model search. For example, a model that predicts activity against malaria could have P.falciparum as tag. This field is a list with one or more accepted values since models can have more than one tag. Select between one and five relevant from the following categories:

  • Disease: AIDS, Alzheimer, Cancer, Cardiotoxicity, Cytotoxicity, COVID19, Dengue, Malaria, Neglected tropical disease, Schistosomiasis, Tuberculosis.

  • Organism: A.baumannii, E.coli, E.faecium, HBV, HIV, Human, K.pneumoniae, Mouse, M.tuberculosis, P.aeruginosa, P.falciparum, Rat, Sars-CoV-2, S.aureus, ESKAPE.

  • Target: BACE, CYP450, GPCR, hERG.

  • Experiment: Fraction bound, IC50, Half-life, LogD, LogP, LogS, MIC90, Molecular weight, Papp, pKa.

  • Application: ADME, Antimicrobial activity, Antiviral activity, Bioactivity profile, Lipophilicity, Metabolism, Microsomal stability, Natural product, Price, Quantum properties, Side effects, Solubility, Synthetic accessibility, Target identification, Therapeutic indication, Toxicity.

  • Dataset: ChEMBL, DrugBank, MoleculeNet, Tox21, ToxCast, ZINC, TDCommons.

  • Chemoinformatics: Chemical graph model, Chemical language model, Chemical notation, Chemical synthesis, Compound generation, Descriptor, Drug-likeness, Embedding, Fingerprint, Similarity.

Publication: link to the original publication. Please refer to the journal page whenever possible, instead of Pubmed, Researchgate or other secondary webs. This field is a string with only one accepted value.

Source Code: link to the original code repository of the model. If this is an in-house model, please add here the link of the ML package used to train the model. This field is a string with only one accepted value.

License: the License of the original code. We have included the following OS licences: MIT, GPL-3.0, LGPL-3.0, AGPL-3.0, Apache-2.0, BSD-2.0, BSD-3.0, Mozilla, CC. You can also select Proprietary or Non-commercial if the authors have included their own license notice (for example restricting commercial usage). If the code was released without a license, please add None in this field. Make sure to abide by requirements of the original license when re-licensing or sub-licensing third-party author code (such as adding the license file together with the original code). This field is a string with only one accepted value.

If the predetermined fields are not sufficient for your use case, you can open a pull request to include new ones to our repository. Please do so only if strictly necessary (for example, if a disease is not already in the Tag field).

Ersilia maintainers will review and approve / reject PRs for additions to the existing lists of approved items.

Note that these fields are filled in as Python strings, therefore misspellings or lower / uppercases will affect their recognition as valid values.

The README file

The README.md file is where we give basic information about the model. It reads from the metadata.json file and it will be automatically updated thanks to a GitHub Action once the Pull Request is approved.

Please do not modify it manually.

The LICENSE file

By default, all code written in contribution to Ersilia should be licensed under a GPLv3 License. The main LICENSE file of the repository, therefore, will be a GPLv3 as specified by GitHub.

However, the license notices for code developed by third parties must be kept in the respective folders where the third-party code is found.

The install.yml file

Ersilia uses an install.yml file to specify installation instructions. The YAML syntax is used because it is easy to read and maintain. This file specifies which Python version to use to build a conda environment, or a Docker image for the model.

This dependency configuration file has two top level keys, namely, python, and commands. They dependencies are to be specified in the following manner:

  • python key expects a string value denoting a python version (eg "3.10")

  • commands key expects a list of values, each of which is a list on its own, denoting the dependencies required by the model. Currently, dependencies frompip and conda are supported.

  • pip dependencies are expected to be three element lists in the format ["pip", "library", "version"]

  • conda dependencies are expected to be four element lists in the format ["conda", "library", "version", "channel"], where channel is the conda channel to install the required library.

  • When the model is run from source, Ersilia always defaults to creating a conda environment for the model to provide isolation. However, when the model is Dockerized, whether conda is used in that process depends entirely on there being conda dependencies in this file.

The install.yml available in the Ersilia Model Template is the following:

install.yml
python: "3.10"
commands:
    - ["pip", "rdkit-pypi", "2022.3.1b1"]
    - ["conda", "pandas", "1.3.5", "default"]

In this case, when running the model from source, a Conda environment will be used to isolate the model. Additionally, a conda environment will also be used inside the Docker image of the mode. This example demonstrates an installation instructions for an environment using Python 3.10.

In this example, the rdkit-pypi==2022.3.1b1 will be installed using pip, while pandas=1.3.5 will be installed using conda through the default package channel on conda.

The install.yml file can contain as many commands as necessary. Please limit the packages to the bare minimum required, sometimes models have additional packages for extra functionalities that are not required to run the model. It is good practice to trim to the minimum the package dependencies to avoid conflicts. Always pin the version of the package to make sure it is always reproducible.

The install.yml file contains the installation instructions of the model. Therefore, the content of this file can be very variable, since each model will have its own dependencies.

The model folder

The model folder is the most important one. It contains two sub-folders:

  • framework: In this folder, we keep all the necessary code to run the model (assuming dependencies are already installed).

  • checkpoints: In this folder, we store the model data (pretrained model parameters, scaling data, etc).

The model folder should not contain anything other than the framework and checkpoints subfolder. When the Ersilia CLI eventually fetches the model, it does a reorganization of the code and the only subfolders it keeps are these two. Any other file or folder at the model/ directory level will be overlooked.

Often, the separation between framework and checkpoints is not easy to determine. Sometimes, models obtained from third parties have model data embedded within the code or as part of the repository. In these cases, it is perfectly fine to keep model data in the framework subfolder, and leave the checkpoints subfolder empty.

The framework subfolder contains at least one Bash file, named run.sh. This file will run as follows:

bash run.sh [FRAMEWORK_DIR] [DATA_FILE] [OUTPUT_FILE]

Unless strictly necessary, the run.sh file should accept three and only three arguments, namely FRAMEWORK_DIR, DATA_FILE and OUTPUT_FILE. In the current template, we provide the following example:

run.sh
python $1/code/main.py -i $2 -o $3

In this case, a Python file located in the [FRAMEWORK_DIR]/code folder is executed, taking as input (-i) the DATA_FILE and giving as output (-o) the OUTPUT_FILE.

To understand this further, we now need to inspect the step main.py file in the step above, in more detail. The current template proposes the following script:

code/main.py
# imports
import os
import csv
import joblib
import sys
from rdkit import Chem
from rdkit.Chem.Descriptors import MolWt

# parse arguments
input_file = sys.argv[1]
output_file = sys.argv[2]

# current file directory
root = os.path.dirname(os.path.abspath(__file__))

# checkpoints directory
checkpoints_dir = os.path.abspath(os.path.join(root, "..", "..", "checkpoints"))

# read checkpoints (here, simply an integer number: 42)
ckpt = joblib.load(os.path.join(checkpoints_dir, "checkpoints.joblib"))

# model to be run (here, calculate the Molecular Weight and add ckpt (42) to it)
def my_model(smiles_list, ckpt):
    return [MolWt(Chem.MolFromSmiles(smi))+ckpt for smi in smiles_list]
    
# read SMILES from .csv file, assuming one column with header
with open(input_file, "r") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    smiles_list = [r[0] for r in reader]
    
# run model
outputs = my_model(smiles_list, ckpt)

# write output in a .csv file
with open(output_file, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"]) # header
    for o in outputs:
        writer.writerow([o])

In this case, the model simply calculates the molecular weight and adds a number to it.

The important steps of the script are:

  1. Load model parameters.

  2. Read input file.

  3. Run predictions using the input file and the model parameters.

  4. Write the output.

Most of the work of the model contributor will be to work on this or similar scripts. In the template, we provide a dummy model (i.e. add a fixed value to the molecular weight). This dummy model can can be already defined within the script (my_model). However, in real world cases, the model will most likely be loaded from a third party Python library, or from a (cloned) repository placed in the same directory.

To summarize, in the template, we provide a structure that follows this logic:

  1. The run.sh script executes the Python main.py script.

  2. The main.py script:

    • Defines the model code.

    • Loads parameters from checkpoints.

    • Reads an input file containing SMILES (with header).

    • Runs a model that calculates molecular weight and adds an integer defined by the parameters.

    • Writes an output file containing one column corresponding to the output value (with a header).

In the template, the example provided is very simple. Depending on the model being incorporated, the logic may be different. For example, many third party models already contain a command-line option, with a specific syntax. In these cases, you may want to write scripts to adapt the input and the output, and then execute the model as-is.

Each script will be one main.py file, we can create as many as necessary and rename them appropriately (see below for examples)

The .gitattributes file

We use Git LFS to store large files (over 100 MB). Typically, these files are model parameters. Files to be stored in Git LFS should be specified in the .gitattributes file. The current file will store in Git LFS all files in csv, h5, joblib, pkl, pt and tsv format.

*.csv filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text

The Legacy Model Template

At the time of writing this tutorial, the Ersilia Model Hub has approximately 150 models developed with this template. As mentioned above, this legacy template served as an inspiration for the current workflow, however there are several key differences that should be called out. These differences largely pertain to how dependencies are specified in these models, the tools used to create a model server, and source files facilitating that process.

The Dockerfile file

Ersilia uses a Dockerfile file to specify installation instructions. The reason for this is that Docker provides the maximum level of isolation possible (i.e. a container), which may be needed to run models in some systems. However, in most practical scenarios, a Docker container will not be necessary and a Conda environment, or even a Virtualenv environment, will suffice. The Ersilia CLI will decide which isolation level to provide depending on the content of the Dockerfile:

The Dockerfile available in the Ersilia Model Template is the following:

Dockerfile 

FROM bentoml/model-server:0.11.0-py310 
MAINTAINER ersilia

RUN pip install rdkit==23. 
RUN pip install joblib==1.1.0

WORKDIR /repo COPY . /repo

The first line of the Dockerfile indicates that this Conda environment will have BentoML 0.11.0 installed on Python 3.10. In this example, the rdkit library will be installed using conda, and joblib will be installed using pip.

The Dockerfile can contain as many RUN commands as necessary, between the MAINTAINER and the WORKDIR lines. Please limit the packages to the bare minimmum required, sometimes models have additional packages for extra functionalities that are not required to run the model. It is good practice to trim to the minimmum the package dependencies to avoid conflicts. Whenever possible, pin the version of the package.

The Dockerfile contains the installation instructions of the model. Therefore, the content of this file can be very variable, since each model will have its own dependencies.

The service file

The service file is located in src/service.py. It contains the necessary code to facilitate model bundling with BentoML.

There are three main classes in the service file, namely Model, Artifact and Service.

The Model class

This class is simply a wrapper for the AI/ML model. Typically, when incorporating external (type 1) models, the run.sh script will already capture the logic within the Model class, in which case the Model class is simply redundant. However, when incorporating internally developed (types 2 and 3) models into the hub, we can make use of the artifacts for standard modeling frameworks (e.g. sklearn, PyTorch, and Keras) provided by BentoML, and the Model class becomes necessary for BentoML compatibility. Hence, the Model class enables generalization between these types of model additions.

Typically, the central method of the Model class is the run method.

class Model (object):
    ...
    def run(self, smiles_list):
        ...

In this case, the model takes as input a list of molecules represented as SMILES strings. This is the standard input type for models focused on chemistry data as input.

Models incorporated with this workflow do not allow for multiple endpoints, or "methods". All models developed in this manner only have a single run API.

In its simplest form, the Model class just points Ersilia to the model directory and then creates a Bash file to execute the necessary commands to run the model. It is actually a very simple class, although it may look overwhelming at first. We break it down below:

First, a temporary directory is created:

 class Model(object):
    ...
    def run(self, smiles_list):
        tmp_folder = tempfile.mkdtemp(prefix="eos-")
        data_file = os.path.join(tmp_folder, self.DATA_FILE)
        output_file = os.path.join(tmp_folder, self.OUTPUT_FILE)
        log_file = os.path.join(tmp_folder, self.LOG_FILE)
        ...

Then, a data file is created in the temporary directory. In this example, it is simply a one-column csv file having a header (smiles) and a list of molecules in SMILES format (one per row):

class Model(object):
    ...
    def run(self, smiles_list):
        ...
        with open(data_file, "w") as f:
            f.write("smiles"+os.linesep)
            for smiles in smiles_list:
                f.write(smiles+os.linesep)
        ...

Now we already have the input file of the run.shscript, located in the model/framework/ directory, as specified above. The following creates a dummy Bash script in the temporary directory and runs the command from there. The output is saved in the temporary directory too. Remember that the run.sh script expects three arguments, FRAMEWORK_DIR, DATA_FILE and OUTPUT_FILE.

class Model(object):
    ...
    def run(self, smiles_list):
        ...
        run_file = os.path.join(tmp_folder, self.RUN_FILE)
        with open(run_file, "w") as f:
            lines = [
                "bash {0}/run.sh {0} {1} {2}".format(
                    self.framework_dir,
                    data_file,
                    output_file
                )
            ]
            f.write(os.linesep.join(lines))
        cmd = "bash {0}".format(run_file)
        with open(log_file, "w") as fp:
            subprocess.Popen(
                cmd, stdout=fp, stderr=fp, shell=True, env=os.environ
            ).wait()
        ...

The last step is to read from the output in the temporary directory and return it in a JSON-serializable format. The output in the example is a csv table, with one or multiple columns, containing numeric data. The table has a header, which is read and saved as metadata.

class Model(object):
    ...
    def run(self, smiles_list):
        ...
        with open(output_file, "r") as f:
            reader = csv.reader(f)
            h = next(reader)
            R = []
            for r in reader:
                R += [{"outcome": [Float(x) for x in r]}]
        meta = {
            "outcome": h
        }
        result = {
            "result": R,
            "meta": meta
        }
        shutil.rmtree(tmp_folder)
        return result

You will see that, in the template, pointers to potential edits are highlighted with the tag # EDIT . Necessary edits relate to the the format of the input data, or the serialization to JSON format from the output data.

Advanced contributors may want to modify the Model class to load a model in-place (for example, a Scikit-Learn model) instead of executing a Bash command in the model/framework/ directory.

The Artifact class

This class mirrors BentoML artifacts. It simply contains load, save, get and pack functionalities:

class Artifact(BentoServiceArtifact):
    ...
    def pack(self, model):
        ...
    def load(self, path):
        ...
    def get(self):
        ...
    def save(self, dst):
        ...

You don't have to modify this class.

The Service class

This class is used to create the service. The service exposes the run API:

@artifacts([Artifact("model")])
class Service(BentoService):
    @api(input=JsonInput(), batch=True)
    def run(self, input: List[JsonSerializable]):
        input = input[0]
        smiles_list = [inp["input"] for inp in input]
        output = self.artifacts.model.run(smiles_list)
        return [output]

By default, Ersilia works with JSON inputs, which are deserialized as a SMILES list inside the API, in this case. The deafult API is run. The general rule is, do not modify it.

Last updated