BioModels Annotation

This page describes how to annotate Ersilia models in the BioModels tool, contributing towards their FAIRness.

Background

Sharing machine learning (ML) models, particularly in the field of drug discovery, is key to creating a FAIReR (Findable, Accessible, Interoperable, Reusable, and Reproducible) collection of models that are easier to reproduce and reuse. This reduces the need to rebuild models from scratch, increases their usefulness in various applications, and speeds up progress in drug discovery.

Making ML models FAIR and shareable involves following standards and protocols, which include sharing the model training code, dataset information, reproduced figures, model evaluation metrics, trained models, Docker files, and model metadata, as well as FAIR dissemination. Here, we focus on the model metadata and its annotation.

Model Metadata

To share ML models effectively, it is important to provide relevant information about them. This information is the metadata: organised information that describes, explains, or helps find, use, or manage a resource. In the context of Ersilia models, metadata is data about the model, and it is classified into three categories: Biological Metadata, Computational Metadata, and Descriptive Metadata. The metadata enables other researchers and modellers to find and access models based on their specific characteristics.

  • Biological Metadata

    The biological relevance of a model is one of its most important aspects. Biological metadata covers the model's bioactivity, the biological processes it explains, the biological system it was trained on, the tissue or cell type involved, the assay type, the biological entity, the Ersilia model theme (ranging from infectious disease to ADME property), and the compartment in which the biological process happens.

  • Computational Metadata

    This metadata identifies the model based on its specific characteristics, such as the type of ML algorithm used, the modelling approach, its evaluation metrics, and functional properties such as the input data type and the model output.

  • Descriptive Metadata

    A model is described by its publication, a code base such as GitHub, a data repository such as Zenodo, and its deployment, which could be in the form of a web server.

Why is Annotation Important?

An annotation is an association between a piece of metadata and an ontology term. Annotating a model consists of mapping the identified model metadata to terms from controlled vocabularies and entries in data resources.

Annotating model metadata is crucial to:

  • Precisely identify model categories

  • Improve understanding of the model's structure

  • Make it easier to compare different models

  • Simplify model integration

  • Enable efficient searches

  • Add meaningful context to the model

  • Enhance understanding of the biology behind the model

  • Allow conversion and reuse of the model

  • Facilitate integration of the model with biological knowledge

Controlled Vocabularies and Ontology

Controlled vocabularies contain set terms that describe concepts in a specific domain. These terms have definitions that help us understand and agree on their meanings, and they also help with indexing and easy retrieval. Ontologies use controlled vocabularies to describe concepts and how they are related in a structured and computable format.

The following ontologies are preferred;

How do we Annotate?

Annotating means curating an annotation file with all essential information. Each piece of metadata is linked to the right ontology term through a cross-reference, and values and qualifiers are added to explicitly define the relationship between the model metadata and the linked resources.

  1. Metadata is data about data. Here, the data being described is the model, which is referred to as the entity in annotation.

    • Entity (e.g. Model)

  2. Extract all information available for each metadata category.

  3. Each annotation is linked to external data resources and values, e.g. EDAM or STATO. An external data resource could be a database of ontologies or an ontology itself.

    • This improves the model quality

    • Essential for the model search criteria

  4. A value makes the metadata accessible and integrates it with other data resources using a compact identifier.

    • Metadata (e.g. Machine learning)

    • Ontology (Bioinformatics Concept EDAM: edam:topic_3474)

    • Value (https://identifiers.org/bptl/edam:topic_3474)

  5. Append qualifiers to each annotation. Qualifiers explain the relationship between a piece of metadata and the model itself.

    • Qualifier (e.g. bqbiol and bqmodel)

    • A relationship is either biological (bqbiol) or computational (bqmodel).

    The following qualifiers are used to describe relationships in the annotation:

    • bqbiol:hasTaxon - describes a relationship between a model and organism

    • bqbiol:occursIn - a compartment where a process occurs

    • bqbiol:hasProperty - general biological property

    • bqmodel:hasProperty - all model properties

    • bqmodel:isDescribedBy - the model resources

    • bqbiol:hasInput - model input data

    • bqbiol:hasDataset - the model training data

    • bqbiol:hasOutput - biological output of the model

  6. DOME annotation gives more context to the computational metadata:

    • D - Data (the data source or the type of input data)

    • O - Optimization (the algorithm used by the model)

    • M - Model (the model source code and its executable form)

    • E - Evaluation metrics (how the model's performance is evaluated)
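The pieces listed in the steps above (entity, metadata, ontology, value, qualifier, and an optional DOME label) can be pictured as one record per annotation. The sketch below is only an illustration: the class name `Annotation`, its field names, and the choice of `bqmodel:hasProperty` for the "Machine learning" example are assumptions, not a format defined by BioModels or Ersilia.

```python
from dataclasses import dataclass
from typing import Optional

# Qualifiers listed in step 5 above.
KNOWN_QUALIFIERS = {
    "bqbiol:hasTaxon", "bqbiol:occursIn", "bqbiol:hasProperty",
    "bqmodel:hasProperty", "bqmodel:isDescribedBy",
    "bqbiol:hasInput", "bqbiol:hasDataset", "bqbiol:hasOutput",
}


@dataclass(frozen=True)
class Annotation:
    """One annotation record (hypothetical field names)."""
    entity: str
    qualifier: str
    ontology: str
    value: str
    metadata: str
    dome: Optional[str] = None  # only meaningful for computational metadata

    def __post_init__(self):
        # Reject qualifiers that are not in the list from step 5.
        if self.qualifier not in KNOWN_QUALIFIERS:
            raise ValueError(f"unknown qualifier: {self.qualifier}")


# The "Machine learning" example from step 4; the qualifier is an assumption.
example = Annotation(
    entity="Model",
    qualifier="bqmodel:hasProperty",
    ontology="EDAM",
    value="https://identifiers.org/bptl/edam:topic_3474",
    metadata="Machine learning",
)
```

Validating the qualifier when the record is created catches typos such as a missing namespace before they reach the annotation file.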

Tools for Annotation

OLS (Ontology Lookup Service) is a search and visualisation service that hosts 260+ biological and biomedical ontologies in one place.

Zooma is an ontology mapping tool that can be used to automatically map free text to ontology terms.

Steps to Annotating a Model - An Example

1. Identify a Model and its associated Information

Associated information includes the model publication, repository and its source code.

Read the Publication

All model metadata can be found in the model's publication, so it is important to read it to understand the biological or chemical processes the model addresses, its bioactivity, its algorithm, its training data, and the validation performed, among other information. Go through the repository to validate the information in the publication.

Special Scenario

Some models are built by fine-tuning other large models on different datasets to perform several tasks. In this case, it is important to understand the base models and all their properties.

Case-study

As an example, we work with an antimalarial model with the tag eos4zfy from the Ersilia Model Hub. This model comes from a collaborative project between EMBL-EBI and several pharmaceutical companies and institutes. Individual QSAR models, each built on a private dataset to identify novel molecules that may have antimalarial properties, were merged to develop MAIP, a free web platform for mass prediction of potential malaria-inhibiting compounds.

2. Assign the Metadata Entity

The metadata entity is the source of the metadata, that is, the thing the metadata describes, which here is the model. Adding the metadata entity makes the table look like this:

| Entity |
|--------|
| Model |
| Model |
| Model |

3. Extract all information available for each metadata category.

To identify all the metadata associated with this model, we go through the publication, the web platform (to understand its pipeline), its source code, and the Ersilia implementation process. This template can be adapted for individual use, and a sample is shown below.

Biological Metadata

These metadata are extracted from the abstract of the publication and include the disease, causative agents, data classification, and the biological output of the model.

Computational Metadata

These include the algorithm of the model, its evaluation method, and its data type.

Descriptive Metadata

Each model is described by a publication, source code and its implementation.

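| Entity | Model Metadata Categories | Metadata |
|--------|---------------------------|----------|
| Model | Biological Metadata | Homo sapiens |
| Model | Biological Metadata | Plasmodium falciparum |
| Model | Biological Metadata | Malaria |
| Model | Biological Metadata | Antimalarial properties |
| Model | Biological Metadata | Active |
| Model | Biological Metadata | Inactive |
| Model | Biological Metadata | Antimalarial compounds prediction |
| Model | Computational Metadata | Classification models |
| Model | Computational Metadata | Naïve Bayesian model |
| Model | Computational Metadata | AUC–ROC |
| Model | Computational Metadata | 5-fold cross validation |
| Model | Computational Metadata | SMILES descriptors |
| Model | Computational Metadata | Malaria dataset |
| Model | Descriptive Metadata | MAIP web platform - Source code |
| Model | Descriptive Metadata | Ersilia Incorporation URL |
| Model | Descriptive Metadata | PubMed URL |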

These are sample metadata from this model, together with their classification.

Note: Column 2 (Model Metadata Categories) is for descriptive purposes only. It is not part of the annotation.
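To make the extraction step concrete, the sample metadata for eos4zfy can be written down as a plain Python dictionary keyed by category and then flattened into entity/metadata pairs. This is only a sketch of one possible working format; the variable names `extracted` and `rows` are illustrative and not part of any Ersilia tooling.

```python
# Sample metadata for eos4zfy, grouped by category (copied from the table above).
extracted = {
    "Biological Metadata": [
        "Homo sapiens",
        "Plasmodium falciparum",
        "Malaria",
        "Antimalarial properties",
        "Active",
        "Inactive",
        "Antimalarial compounds prediction",
    ],
    "Computational Metadata": [
        "Classification models",
        "Naïve Bayesian model",
        "AUC–ROC",
        "5-fold cross validation",
        "SMILES descriptors",
        "Malaria dataset",
    ],
    "Descriptive Metadata": [
        "MAIP web platform - Source code",
        "Ersilia Incorporation URL",
        "PubMed URL",
    ],
}

# The category is only a working aid, so the annotation rows keep just
# the entity ("Model") and the metadata term.
rows = [("Model", term) for terms in extracted.values() for term in terms]
```

Keeping the category as a dictionary key rather than a row field mirrors the note above: it helps the curator, but it is dropped from the annotation itself.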

Special Scenario

Some models are validated experimentally after they are built, either in vivo or in vitro: a compound predicted by the model is tested experimentally to confirm its bioactivity. Where such validation exists, it is important and should be annotated for the model.

4. Map the Metadata to the right Ontology

This is the main step of annotation: associating each piece of metadata with the right ontology term. Ontologies can be identified through the Ontology Lookup Service (OLS), a repository for biomedical ontologies that aims to provide a single point of access to the latest ontology versions.

Using OLS to identify relevant ontologies ensures standardization and interoperability, and helps annotate the model components accurately. Search for your term, for example "Machine Learning", in the search bar, and select the right term in the preferred ontology. If it is not found in the preferred ontology, look through the other available options for a term with the right meaning.

Sometimes, the exact term isn't found in the OLS, and in this case, the closest term can be used to replace the metadata.
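The same search can also be scripted. The sketch below only builds a search URL and reads candidate terms out of a response that has already been downloaded. It assumes that OLS exposes a search endpoint at `https://www.ebi.ac.uk/ols4/api/search` accepting `q` and `ontology` parameters, and that results arrive under `response` → `docs` with `label`, `obo_id` and `ontology_prefix` fields. These details are assumptions to check against the current OLS documentation, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed OLS search endpoint; verify against the OLS documentation.
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"


def ols_search_url(term, ontology=None):
    """Build a search URL for a free-text term, optionally within one ontology."""
    params = {"q": term}
    if ontology:
        params["ontology"] = ontology
    return f"{OLS_SEARCH}?{urlencode(params)}"


def candidate_terms(payload):
    """Extract (label, obo_id, ontology_prefix) tuples from a decoded JSON payload.

    The payload layout (response -> docs) is an assumption about the API.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [(d.get("label"), d.get("obo_id"), d.get("ontology_prefix")) for d in docs]


url = ols_search_url("Machine Learning", ontology="edam")
```

The URL can be opened in a browser or fetched with any HTTP client. The curator still has to judge which returned term carries the right meaning.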

In choosing the right ontology for a piece of metadata, there are two important things to consider:

  1. Which ontology term best captures the meaning of the metadata

  2. Whether the term belongs to a preferred ontology, for better indexing
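These two considerations can be expressed as a small selection rule: among candidate terms that a curator has already judged to have the right meaning, take one from a preferred ontology if available, and otherwise fall back to the best remaining match. The function name `pick_term`, the tuple layout, and the placeholder accessions are assumptions made for this sketch.

```python
def pick_term(candidates, preferred):
    """Choose a term from candidates ordered best match first.

    candidates: list of (label, compact_id, ontology_prefix) tuples that
                already carry the right meaning (judged by the curator).
    preferred:  set of ontology prefixes considered preferred.
    Returns the first candidate from a preferred ontology, otherwise the
    best overall match, or None if there are no candidates.
    """
    for candidate in candidates:
        if candidate[2] in preferred:
            return candidate
    return candidates[0] if candidates else None


# Placeholder candidates; the accessions are made up for illustration.
candidates = [
    ("example term", "EX:0000001", "EX"),
    ("example term", "EFO:0000000", "EFO"),
]
chosen = pick_term(candidates, preferred={"EFO"})
```

Here `chosen` is the EFO candidate even though it is listed second, because belonging to a preferred ontology takes priority among terms of equal meaning.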

After mapping the metadata to the right ontology, we have a table like this:

| Entity | Preferred Ontology | Metadata |
|--------|--------------------|----------|
| Model | NCBI Taxonomy | Homo sapiens |
| Model | NCBI Taxonomy | Plasmodium falciparum |
| Model | Experimental Factor Ontology (EFO) | Malaria |
| Model | NCI Thesaurus OBO Edition (NCIT) | Antimalarial properties |
| Model | NCI Thesaurus OBO Edition (NCIT) | Active |
| Model | NCI Thesaurus OBO Edition (NCIT) | Inactive |
| Model | Chemical Entities of Biological Interest (CHEBI) | Antimalarial compounds prediction |
| Model | STATO: the statistical methods ontology | Classification models |
| Model | STATO: the statistical methods ontology | Naïve Bayesian model |
| Model | STATO: the statistical methods ontology | AUC–ROC |
| Model | Ontology for Biomedical Investigations (OBI) | 5-fold cross validation |
| Model | Chemical information ontology (cheminf) | SMILES descriptors |
| Model | NCI Thesaurus OBO Edition (NCIT) | Malaria dataset |
| Model | Special cases | MAIP web platform - Source code |
| Model | Special cases | Ersilia Incorporation URL |
| Model | Special cases | PubMed URL |

Note that descriptive metadata, such as the PubMed URL of the model or the specific Ersilia GitHub repository where the model is hosted, does not have an ontology (special cases).

5. Link each Ontology Term to its Value

Values enhance accessibility and integrate metadata with other data resources in the form of a URL (Uniform Resource Locator).

Each ontology term has an accession identifier, and a value is formed by appending the term's compact identifier (prefix:accession) to https://identifiers.org/, a resolution service that provides consistent access to the term.

  1. Value = https://identifiers.org/ + NCIT:C176231
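The formula above is plain string concatenation, which a short helper makes explicit. This is a minimal sketch: the function name `make_value` is hypothetical, and the only check it performs is that the compact identifier has the `prefix:accession` shape.

```python
IDENTIFIERS_ORG = "https://identifiers.org/"


def make_value(compact_id):
    """Turn a compact identifier such as 'NCIT:C176231' into a value URL."""
    prefix, separator, accession = compact_id.partition(":")
    if not (prefix and separator and accession):
        raise ValueError(f"expected 'prefix:accession', got {compact_id!r}")
    return IDENTIFIERS_ORG + compact_id


value = make_value("NCIT:C176231")
# value == "https://identifiers.org/NCIT:C176231"
```

The shape check does not confirm that the term exists. Opening the resulting URL is still the simplest way to verify that it resolves to the intended term.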

Each ontology term is linked to its respective value using the formula above, and the table looks like this:

| Entity | Preferred Ontology | Values | Metadata |
|--------|--------------------|--------|----------|
| Model | NCBI Taxonomy | | Homo sapiens |
| Model | NCBI Taxonomy | | Plasmodium falciparum |
| Model | Experimental Factor Ontology (EFO) | | Malaria |
| Model | NCI Thesaurus OBO Edition (NCIT) | | Antimalarial properties |
| Model | NCI Thesaurus OBO Edition (NCIT) | | Active |
| Model | NCI Thesaurus OBO Edition (NCIT) | | Inactive |
| Model | Chemical Entities of Biological Interest (CHEBI) | | Antimalarial compounds prediction |
| Model | STATO: the statistical methods ontology | | Classification models |
| Model | STATO: the statistical methods ontology | | Naïve Bayesian model |
| Model | STATO: the statistical methods ontology | | AUC–ROC |
| Model | Ontology for Biomedical Investigations (OBI) | | 5-fold cross validation |
| Model | Chemical information ontology (cheminf) | | SMILES descriptors |
| Model | NCI Thesaurus OBO Edition (NCIT) | | Malaria dataset |
| Model | Online Web server | | MAIP web platform - Source code |
| Model | Ersilia Model Hub | | Ersilia Incorporation URL |
| Model | PubMed Identification Number (PMID) | | PubMed URL |

6. Associate the right qualifier to each annotation

As previously explained, each piece of metadata is a biological, computational, or descriptive component of the model. Here, we assign a qualifier to each piece of metadata based on the category it falls into.

For example:

| Metadata Category | Metadata | Qualifier |
|-------------------|----------|-----------|
| Biological Metadata | Malaria | bqbiol:hasProperty |
| Computational Metadata | Naïve Bayesian model | bqmodel:hasProperty |
| Descriptive Metadata | Ersilia Incorporation URL | bqmodel:isDescribedBy |
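One way to read this step is as a default qualifier per category, with more specific qualifiers overriding the default for particular metadata. The sketch below follows that reading; the overrides are taken from the completed table later in this step, while the function and dictionary names are illustrative assumptions.

```python
# Default qualifier for each metadata category (from the example above).
DEFAULT_QUALIFIER = {
    "Biological Metadata": "bqbiol:hasProperty",
    "Computational Metadata": "bqmodel:hasProperty",
    "Descriptive Metadata": "bqmodel:isDescribedBy",
}

# More specific qualifiers used for particular metadata in this case study.
SPECIFIC_QUALIFIER = {
    "Homo sapiens": "bqbiol:hasTaxon",
    "Plasmodium falciparum": "bqbiol:hasTaxon",
    "Antimalarial compounds prediction": "bqbiol:hasOutput",
    "SMILES descriptors": "bqbiol:hasInput",
    "Malaria dataset": "bqbiol:hasDataset",
}


def qualifier_for(category, metadata):
    """Return the specific qualifier if one is defined, else the category default."""
    return SPECIFIC_QUALIFIER.get(metadata, DEFAULT_QUALIFIER[category])


taxon = qualifier_for("Biological Metadata", "Homo sapiens")  # specific override
disease = qualifier_for("Biological Metadata", "Malaria")     # category default
```

A lookup like this keeps qualifier choices consistent across models, although unusual metadata may still need a curator's judgement.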

After adding qualifiers to each piece of metadata, the table looks like this:

| Entity | Qualifiers | Preferred Ontology | Values | Metadata |
|--------|------------|--------------------|--------|----------|
| Model | bqbiol:hasTaxon | NCBI Taxonomy | | Homo sapiens |
| Model | bqbiol:hasTaxon | NCBI Taxonomy | | Plasmodium falciparum |
| Model | bqbiol:hasProperty | Experimental Factor Ontology (EFO) | | Malaria |
| Model | bqbiol:hasProperty | NCI Thesaurus OBO Edition (NCIT) | | Antimalarial properties |
| Model | bqbiol:hasProperty | NCI Thesaurus OBO Edition (NCIT) | | Active |
| Model | bqbiol:hasProperty | NCI Thesaurus OBO Edition (NCIT) | | Inactive |
| Model | bqbiol:hasOutput | Chemical Entities of Biological Interest (CHEBI) | | Antimalarial compounds prediction |
| Model | bqmodel:hasProperty | STATO: the statistical methods ontology | | Classification models |
| Model | bqmodel:hasProperty | STATO: the statistical methods ontology | | Naïve Bayesian model |
| Model | bqmodel:hasProperty | STATO: the statistical methods ontology | | AUC–ROC |
| Model | bqmodel:hasProperty | Ontology for Biomedical Investigations (OBI) | | 5-fold cross validation |
| Model | bqbiol:hasInput | Chemical information ontology (cheminf) | | SMILES descriptors |
| Model | bqbiol:hasDataset | NCI Thesaurus OBO Edition (NCIT) | | Malaria dataset |
| Model | bqmodel:isDescribedBy | Online Web server | | MAIP web platform - Source code |
| Model | bqmodel:isDescribedBy | Ersilia Model Hub | | Ersilia Incorporation URL |
| Model | bqmodel:isDescribedBy | PubMed Identification Number (PMID) | | PubMed URL |
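Once every row has an entity, qualifier, ontology, value and metadata term, the table can be saved as an annotation file. The exact file format expected for submission is not specified on this page, so the tab-separated layout below is purely an assumption, as are the function name `write_annotations` and the column labels. The value fields are left blank in the two sample rows and would hold the identifiers.org URL of the chosen term.

```python
import csv
import io

COLUMNS = ["Entity", "Qualifier", "Preferred Ontology", "Value", "Metadata"]


def write_annotations(rows, handle):
    """Write annotation rows as tab-separated text with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        writer.writerow(row)


# Two sample rows from the table above; values intentionally left blank.
rows = [
    ("Model", "bqbiol:hasTaxon", "NCBI Taxonomy", "", "Homo sapiens"),
    ("Model", "bqmodel:isDescribedBy", "Ersilia Model Hub", "", "Ersilia Incorporation URL"),
]

buffer = io.StringIO()
write_annotations(rows, buffer)
annotation_tsv = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of file paths. Passing an open file handle instead would write the same content to disk.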

7. Contextualize the Computational Metadata by adding DOME

The DOME annotation provides more context for the computational metadata by identifying which part of the modelling process each piece of metadata belongs to.

Adding DOME to the table shows this:

| Metadata | DOME |
|----------|------|
| Classification models | Optimization-Algorithm |
| Naïve Bayesian model | Optimization-Algorithm |
| AUC–ROC | Evaluation-Performance Measure |
| 5-fold cross validation | Evaluation-Method |
| MAIP web platform - Source code | Model-Executable form |
| Ersilia Incorporation URL | Model-Executable form |
| SMILES descriptors | Data-Input |
| Malaria dataset | Data-Source |
| Predictions of potential antimalarial compounds | Model-Output; Classification |
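The DOME labels follow a "Section-Detail" pattern, so the section (Data, Optimization, Model or Evaluation) can be recovered from the text before the first hyphen. The sketch below groups the case-study metadata by section as a quick completeness check. The dictionary name `DOME_LABELS` and the grouping approach are illustrative assumptions.

```python
from collections import defaultdict

# DOME labels for the case-study metadata (copied from the table above).
DOME_LABELS = {
    "Classification models": "Optimization-Algorithm",
    "Naïve Bayesian model": "Optimization-Algorithm",
    "AUC–ROC": "Evaluation-Performance Measure",
    "5-fold cross validation": "Evaluation-Method",
    "MAIP web platform - Source code": "Model-Executable form",
    "Ersilia Incorporation URL": "Model-Executable form",
    "SMILES descriptors": "Data-Input",
    "Malaria dataset": "Data-Source",
    "Predictions of potential antimalarial compounds": "Model-Output; Classification",
}

# Group metadata by DOME section, i.e. the text before the first hyphen.
by_section = defaultdict(list)
for metadata, label in DOME_LABELS.items():
    section = label.split("-", 1)[0]
    by_section[section].append(metadata)

missing = {"Data", "Optimization", "Model", "Evaluation"} - set(by_section)
```

An empty `missing` set indicates that all four DOME sections are covered by at least one piece of metadata.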

