BioModels annotation

This page describes how to annotate Ersilia models in the BioModels tool, contributing to their FAIRness.

Background

Sharing machine learning (ML) models, particularly in the field of drug discovery, helps create a FAIReR (Findable, Accessible, Interoperable, Reusable, and Reproducible) collection of models that are easier to reproduce and reuse. This reduces the need to rebuild models from scratch, increases their usefulness in various applications, and speeds up progress in drug discovery.

Making ML models FAIR and shareable involves following standards and protocols, which include sharing the model training code, dataset information, reproduced figures, model evaluation metrics, trained models, Docker files, and model metadata, as well as FAIR dissemination. Here, we focus on the model metadata and its annotation.

Model Metadata

To share ML models effectively, it is important to provide relevant information about them. This information is the metadata: organised information that describes, explains, or helps find, use, or manage a resource. In the context of Ersilia models, metadata is data about the model, and it is classified into three categories: Biological Metadata, Computational Metadata, and Descriptive Metadata. The metadata lets other researchers and modellers find and access models based on their specific characteristics.

Biological Metadata
The biological relevance of a model is one of its most important aspects. Biological metadata covers the model's bioactivity, the biological processes it explains, the biological system it was trained on, the tissue or cell type involved, the assay type, the biological entity, the Ersilia model theme (ranging from infectious disease to ADME properties), and the compartment in which the biological process happens.
Computational Metadata
Computational metadata identifies the model by its specific characteristics, such as the type of ML algorithm used, the modelling approach, its evaluation metrics, and functional properties such as the input data type and the model output.
Descriptive Metadata
A model is described by its publication, a code base such as a GitHub repository, a data repository such as Zenodo, and its deployment, which could be in the form of a web server.

Why is Annotation Important?

An annotation associates a piece of metadata with an ontology term. Annotating a model consists of mapping the identified model metadata to terms from controlled vocabularies and entries in data resources.

Annotating model metadata is crucial to:

- Precisely identify model categories
- Improve understanding of the model's structure
- Make it easier to compare different models
- Simplify model integration
- Enable efficient searches
- Add meaningful context to the model
- Enhance understanding of the biology behind the model
- Allow conversion and reuse of the model
- Facilitate integration of the model with biological knowledge

Controlled Vocabularies and Ontology

Controlled vocabularies contain set terms that describe concepts in a specific domain. These terms have definitions that help us understand and agree on their meanings, and they also help with indexing and easy retrieval. Ontologies use controlled vocabularies to describe concepts and how they are related in a structured and computable format.

The following ontologies are preferred;

How do we Annotate?

Annotation is the process of curating an annotation file that holds all essential information. Each piece of metadata is linked to the right ontology term (a cross-reference), and values and qualifiers are added to define explicitly the relationship between the model metadata and the linked resources.

Metadata is data about data. Here, the data being described is the model, which is referred to as the entity in an annotation.
- Entity (e.g. Model)
Extract all information available for each metadata category.
Each annotation is linked to an external data resource (e.g. EDAM, STATO) and a value. An external data resource can be a database of ontologies or the ontology itself.
- This improves the model quality
- Essential for the model search criteria
A value enhances accessibility and integrates a piece of metadata with other data resources using a compact identifier.
- Metadata (e.g. Machine learning)
- Ontology (Bioinformatics Concept EDAM: edam:topic_3474)
- Value (https://identifiers.org/bptl/edam:topic_3474)
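
The pieces listed so far (entity, metadata, ontology, and value) can be pictured as one record per annotation. The sketch below is only an illustration: the class name `Annotation` and its field names are hypothetical and are not part of any Ersilia or BioModels tooling. The example values are the ones given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation: a piece of metadata linked to an ontology term.

    Hypothetical structure, used here only to illustrate the concepts.
    """
    entity: str    # what is being described, e.g. "Model"
    metadata: str  # the free-text metadata, e.g. "Machine learning"
    ontology: str  # the ontology the term comes from, e.g. "EDAM"
    value: str     # resolvable URL built from the compact identifier


example = Annotation(
    entity="Model",
    metadata="Machine learning",
    ontology="EDAM",
    value="https://identifiers.org/bptl/edam:topic_3474",
)
print(example)
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch, so that a curated annotation is not changed by accident once it is created.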
Append qualifiers to each annotation. Qualifiers explain the relationship between a piece of metadata and the model itself.
- Qualifier (e.g. bqbiol and bqmodel)
- A relationship is either biological (bqbiol) or computational (bqmodel).
The following qualifiers are used to describe relationships in the annotation:
- bqbiol:hasTaxon - describes a relationship between a model and organism
- bqbiol:occursIn - a compartment where a process occurs
- bqbiol:hasProperty - general biological property
- bqmodel:hasProperty - all model properties
- bqmodel:isDescribedBy - the model resources
- bqbiol:hasInput - model input data
- bqbiol:hasDataset - the model training data
- bqbiol:hasOutput - biological output of the model
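
As a minimal sketch of how the qualifier list above could be checked in a script, the snippet below stores the qualifiers from this page in a dictionary and reads the prefix to tell a biological relationship from a computational one. The function name `qualifier_kind` is hypothetical.

```python
# Qualifiers and descriptions exactly as listed on this page.
QUALIFIERS = {
    "bqbiol:hasTaxon": "relationship between a model and organism",
    "bqbiol:occursIn": "a compartment where a process occurs",
    "bqbiol:hasProperty": "general biological property",
    "bqmodel:hasProperty": "all model properties",
    "bqmodel:isDescribedBy": "the model resources",
    "bqbiol:hasInput": "model input data",
    "bqbiol:hasDataset": "the model training data",
    "bqbiol:hasOutput": "biological output of the model",
}

PREFIX_KIND = {"bqbiol": "biological", "bqmodel": "computational"}


def qualifier_kind(qualifier: str) -> str:
    """Return 'biological' or 'computational' for a known qualifier."""
    if qualifier not in QUALIFIERS:
        raise ValueError(f"Unknown qualifier: {qualifier!r}")
    prefix = qualifier.split(":", 1)[0]
    return PREFIX_KIND[prefix]


print(qualifier_kind("bqbiol:hasTaxon"))
print(qualifier_kind("bqmodel:isDescribedBy"))
```

Rejecting unknown qualifiers early helps catch typos such as `bqbiol:hasTaxa` before they reach the annotation file.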
DOME annotation gives more context to the computational metadata:
- D - Data (the data source or the type of input data)
- O - Optimization (the algorithm each model uses)
- M - Model (the model source code and its executable form)
- E - Evaluation (how model performance is evaluated, and with which metrics)

Tools for Annotation

Ontology Lookup Service

OLS is a search and visualisation service that hosts 260+ biological and biomedical ontologies in one place.

Zooma

Zooma is an ontology mapping tool that can be used to automatically map free text to ontology terms.

Steps to Annotating a Model - An Example

1. Identify a Model and its associated Information

Associated information includes the model publication, repository and its source code.

Read the Publication

All model metadata is contained in the publication, so it is important to read it to understand the biological or chemical processes the model addresses, its bioactivity, its algorithm, its training data, and the validation performed, among other information. Go through the repository to validate the information in the publication.

Special Scenario

Some models are built by fine-tuning other large models on different datasets to perform several tasks. In this case, it is important to understand the base models and all their properties.

Case-study

As an example, we use the antimalarial model with the identifier eos4zfy from the Ersilia Model Hub. The model comes from a collaborative project between EMBL-EBI and several pharmaceutical companies and research institutes. Individual QSAR models, each built on a private dataset to identify novel molecules that may have antimalarial properties, were merged to develop MAIP, a free web platform for mass prediction of potential malaria-inhibiting compounds.

2. Assign the Metadata Entity

The metadata entity is the source of the metadata, that is, the thing the metadata describes: the model. After adding the metadata entity, the table looks like this:

| Entity |
| --- |
| Model |

3. Extract all information available for each metadata category.

To identify all metadata associated with this model, we go through the publication, the web platform (to understand its pipeline), its source code, and the Ersilia implementation process. This template can be adapted to individual use; a sample is shown below.

Biological Metadata

These metadata are extracted from the abstract of the publication and include the disease, causative agents, data classification, and the biological output of the model.

Computational Metadata

These include the algorithm of the model, its evaluation method, and the data type.

Descriptive Metadata

Each model is described by a publication, source code and its implementation.

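| Entity | Model Metadata Categories | Metadata |
| --- | --- | --- |
| Model | Biological Metadata | Homo sapiens |
| Model | Biological Metadata | Plasmodium falciparum |
| Model | Biological Metadata | Malaria |
| Model | Biological Metadata | Antimalarial properties |
| Model | Biological Metadata | Active |
| Model | Biological Metadata | Inactive |
| Model | Biological Metadata | Antimalarial compounds prediction |
| Model | Computational Metadata | Classification models |
| Model | Computational Metadata | Naïve Bayesian model |
| Model | Computational Metadata | AUC–ROC |
| Model | Computational Metadata | 5-fold cross validation |
| Model | Computational Metadata | Smiles descriptors |
| Model | Computational Metadata | malaria dataset |
| Model | Descriptive Metadata | MAIP web platform - Source code |
| Model | Descriptive Metadata | Ersilia Incorporation URL |
| Model | Descriptive Metadata | PubMed URL |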

These are sample metadata from this model, together with their categories, given as an example.

Note: Column 2 (Model Metadata Categories) is for descriptive purposes only. It is not part of the annotation.
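
For readers who prefer to keep the extracted metadata in a script rather than a spreadsheet, the sample above can be sketched as a dictionary keyed by category. The names `EXTRACTED_METADATA` and `to_rows` are hypothetical, and the category is kept only as a descriptive helper, as noted above.

```python
# Sample metadata for model eos4zfy, grouped by category as on this page.
EXTRACTED_METADATA = {
    "Biological Metadata": [
        "Homo sapiens",
        "Plasmodium falciparum",
        "Malaria",
        "Antimalarial properties",
        "Active",
        "Inactive",
        "Antimalarial compounds prediction",
    ],
    "Computational Metadata": [
        "Classification models",
        "Naïve Bayesian model",
        "AUC–ROC",
        "5-fold cross validation",
        "Smiles descriptors",
        "malaria dataset",
    ],
    "Descriptive Metadata": [
        "MAIP web platform - Source code",
        "Ersilia Incorporation URL",
        "PubMed URL",
    ],
}


def to_rows(extracted, entity="Model"):
    """Flatten the grouped metadata into (entity, category, metadata) rows."""
    return [
        (entity, category, item)
        for category, items in extracted.items()
        for item in items
    ]


rows = to_rows(EXTRACTED_METADATA)
print(len(rows), "metadata items extracted")
```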

Special Scenario

Some models are validated experimentally after they are built, either in vivo or in vitro: a compound predicted by the model is tested experimentally to confirm its bioactivity. When a model has undergone such validation, it is important information and should be annotated for the model.

4. Map the Metadata to the right Ontology

This is the main step of annotation: associating each piece of metadata with the right ontology term. Ontologies can be identified through the Ontology Lookup Service (OLS), a repository for biomedical ontologies that aims to provide a single point of access to the latest ontology versions.

To ensure standardization and interoperability, it is crucial to identify relevant ontologies through OLS, since they help annotate the model components accurately. Search for your term (for example, Machine Learning) in the search bar and select the right term in the preferred ontology. If it is not found in the preferred ontology, look through the other available options for a term with the right meaning.

Sometimes the exact term is not found in OLS. In this case, the closest term can be used instead.
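
The search described above is done in the OLS web interface, but it can also be scripted. The sketch below is an assumption-laden illustration: it assumes the OLS REST search endpoint at `https://www.ebi.ac.uk/ols4/api/search` with `q`, `ontology`, and `rows` parameters, and a response that lists hits under `response.docs` with `label`, `obo_id`, and `ontology_prefix` fields. Check the current OLS API documentation before relying on any of these names. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed OLS search endpoint; verify against the OLS API documentation.
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(term, ontology=None, rows=5):
    """Build the search URL for a free-text term, optionally within one ontology."""
    params = {"q": term, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return f"{OLS_SEARCH}?{urlencode(params)}"


def search_ols(term, ontology=None, rows=5):
    """Query OLS and return candidate terms (requires network access)."""
    with urlopen(build_search_url(term, ontology, rows), timeout=30) as resp:
        payload = json.load(resp)
    # The field names below are assumptions about the response layout.
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "label": doc.get("label"),
            "obo_id": doc.get("obo_id"),
            "ontology": doc.get("ontology_prefix"),
        }
        for doc in docs
    ]


print(build_search_url("Machine Learning", ontology="edam"))
```

Calling `search_ols("Machine Learning", ontology="edam")` would then return a short list of candidates to review by hand. The script only narrows the search; choosing the term with the right meaning remains a curation decision.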

When choosing the right ontology for a piece of metadata, there are two important things to consider:

- Which ontology term best matches the meaning of the metadata
- Whether the term comes from a preferred ontology, for better indexing
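
The two considerations above can be sketched as a simple ranking: meaning first, preferred ontology second. Everything in this snippet is illustrative. The `PREFERRED` tuple is an assumption (it lists the ontologies used in the worked example on this page, not an official list), the candidate terms are made up, and the `meaning_matches` flag stands for a judgement made by the curator.

```python
# Assumed preference set: the ontologies used in this page's worked example.
PREFERRED = ("NCBITAXON", "EFO", "NCIT", "CHEBI", "STATO", "OBI", "CHEMINF")


def pick_term(candidates, preferred=PREFERRED):
    """Pick the candidate with the best meaning, favouring preferred ontologies.

    Each candidate is a dict with 'ontology', 'label', and 'meaning_matches'
    (True when the curator judges that the term means the same as the metadata).
    """
    if not candidates:
        return None

    def rank(candidate):
        # False sorts before True, so matching meaning wins first,
        # then membership in the preferred set.
        return (
            not candidate["meaning_matches"],
            candidate["ontology"] not in preferred,
        )

    return min(candidates, key=rank)


# Hypothetical candidates for the metadata "Malaria".
candidates = [
    {"ontology": "XYZ", "label": "malaria", "meaning_matches": True},
    {"ontology": "EFO", "label": "malaria", "meaning_matches": True},
    {"ontology": "NCIT", "label": "malaria vaccine", "meaning_matches": False},
]
print(pick_term(candidates)["ontology"])
```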

After mapping the metadata to the right ontology, we have a table like this:

| Entity | Preferred Ontology | Metadata |
| --- | --- | --- |
| Model | NCBI Taxonomy | Homo sapiens |
| Model | NCBI Taxonomy | Plasmodium falciparum |
| Model | Experimental Factor Ontology EFO | Malaria |
| Model | NCI Thesaurus OBO Edition NCIT | Antimalarial properties |
| Model | NCI Thesaurus OBO Edition NCIT | Active |
| Model | NCI Thesaurus OBO Edition NCIT | Inactive |
| Model | Chemical Entities of Biological Interest CHEBI | Antimalarial compounds prediction |
| Model | STATO: the statistical methods ontology | Classification models |
| Model | STATO: the statistical methods ontology | Naïve Bayesian model |
| Model | STATO: the statistical methods ontology | AUC–ROC |
| Model | Ontology for Biomedical Investigations OBI | 5-fold cross validation |
| Model | Chemical information ontology (cheminf) | Smiles descriptors |
| Model | NCI Thesaurus OBO Edition NCIT | malaria dataset |
| Model | Special cases | MAIP web platform - Source code |
| Model | Special cases | Ersilia Incorporation URL |
| Model | Special cases | PubMed URL |

Note that descriptive metadata, such as the PubMed URL of the model or the specific Ersilia GitHub repository where the model is hosted, do not have an ontology (special cases).

5. Link the Ontology to their values

Values enhance accessibility and integrate metadata with other data resources in the form of a URL (Uniform Resource Locator).

Each ontology term has an accession identifier, and a value is formed by appending this identifier, written as a compact identifier, to the URL of a resolution service. Identifiers.org is a resolution service that provides consistent access to such identifiers, in the form https://identifiers.org/ followed by the compact identifier.

Value = https://identifiers.org/ + NCIT:C176231
Value = https://identifiers.org/NCIT:C176231
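
The formula above is a plain string concatenation, so it can be sketched in a few lines. The function name `build_value` is hypothetical, and the check that the identifier has the `PREFIX:ACCESSION` shape is an added safeguard, not something this page requires.

```python
def build_value(compact_id, resolver="https://identifiers.org/"):
    """Return the value URL for a compact identifier such as 'NCIT:C176231'."""
    prefix, sep, accession = compact_id.partition(":")
    if not (prefix and sep and accession):
        raise ValueError(f"Expected PREFIX:ACCESSION, got {compact_id!r}")
    # Normalise the resolver so exactly one slash separates it from the identifier.
    return resolver.rstrip("/") + "/" + compact_id


print(build_value("NCIT:C176231"))
```

With the identifier from the formula above, this prints `https://identifiers.org/NCIT:C176231`.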

Each ontology is linked to its respective value using the formula above, and the table looks like this:

| Entity | Preferred Ontology | Values | Metadata |
| --- | --- | --- | --- |
| Model | NCBI Taxonomy | | Homo sapiens |
| Model | NCBI Taxonomy | | Plasmodium falciparum |
| Model | Experimental Factor Ontology EFO | | Malaria |
| Model | NCI Thesaurus OBO Edition NCIT | | Antimalarial properties |
| Model | NCI Thesaurus OBO Edition NCIT | | Active |
| Model | NCI Thesaurus OBO Edition NCIT | | Inactive |
| Model | Chemical Entities of Biological Interest CHEBI | | Antimalarial compounds prediction |
| Model | STATO: the statistical methods ontology | | Classification models |
| Model | STATO: the statistical methods ontology | | Naïve Bayesian model |
| Model | STATO: the statistical methods ontology | | AUC–ROC |
| Model | Ontology for Biomedical Investigations OBI | | 5-fold cross validation |
| Model | Chemical information ontology (cheminf) | | Smiles descriptors |
| Model | NCI Thesaurus OBO Edition NCIT | | malaria dataset |
| Model | Online Web server | | MAIP web platform - Source code |
| Model | Ersilia Model Hub | | Ersilia Incorporation URL |
| Model | PubMed Identification Number PMID | | PubMed URL |

6. Associate the right qualifier to each annotation

As explained previously, each piece of metadata is a biological, computational, or descriptive component of the model. Here, we annotate the metadata according to the category it falls into.

For example:

| Metadata Category | Metadata | Qualifier |
| --- | --- | --- |
| Biological Metadata | Malaria | bqbiol:hasProperty |
| Computational Metadata | naïve Bayesian model | bqmodel:hasProperty |
| Descriptive Metadata | Ersilia Incorporation URL | bqmodel:isDescribedBy |
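
The category-to-qualifier pattern in this example can be sketched as a lookup with a general default per category and a more specific qualifier when the role of the metadata is known. The role names (`organism`, `compartment`, `input`, `dataset`, `output`) and the function name are hypothetical labels introduced for this sketch. The qualifiers themselves are the ones listed on this page.

```python
# General qualifier for each metadata category, following the example above.
DEFAULT_QUALIFIER = {
    "Biological Metadata": "bqbiol:hasProperty",
    "Computational Metadata": "bqmodel:hasProperty",
    "Descriptive Metadata": "bqmodel:isDescribedBy",
}

# More specific qualifiers, keyed by a hypothetical role label.
ROLE_QUALIFIER = {
    "organism": "bqbiol:hasTaxon",
    "compartment": "bqbiol:occursIn",
    "input": "bqbiol:hasInput",
    "dataset": "bqbiol:hasDataset",
    "output": "bqbiol:hasOutput",
}


def choose_qualifier(category, role=None):
    """Return the specific qualifier for a known role, else the category default."""
    if role is not None:
        return ROLE_QUALIFIER[role]
    return DEFAULT_QUALIFIER[category]


print(choose_qualifier("Biological Metadata"))
print(choose_qualifier("Biological Metadata", role="organism"))
```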

After adding qualifiers to each piece of metadata, the table looks like this:

| Entity | Qualifiers | Preferred Ontology | Values | Metadata |
| --- | --- | --- | --- | --- |
| Model | bqbiol:hasTaxon | NCBI Taxonomy | | Homo sapiens |
| Model | bqbiol:hasTaxon | NCBI Taxonomy | | Plasmodium falciparum |
| Model | bqbiol:hasProperty | Experimental Factor Ontology EFO | | Malaria |
| Model | bqbiol:hasProperty | NCI Thesaurus OBO Edition NCIT | | Antimalarial properties |
| Model | bqbiol:hasProperty | NCI Thesaurus OBO Edition NCIT | | Active |
| Model | bqbiol:hasProperty | NCI Thesaurus OBO Edition NCIT | | Inactive |
| Model | bqbiol:hasOutput | Chemical Entities of Biological Interest CHEBI | | Antimalarial compounds prediction |
| Model | bqmodel:hasProperty | STATO: the statistical methods ontology | | Classification models |
| Model | bqmodel:hasProperty | STATO: the statistical methods ontology | | Naïve Bayesian model |
| Model | bqmodel:hasProperty | STATO: the statistical methods ontology | | AUC–ROC |
| Model | bqmodel:hasProperty | Ontology for Biomedical Investigations OBI | | 5-fold cross validation |
| Model | bqbiol:hasInput | Chemical information ontology (cheminf) | | Smiles descriptors |
| Model | bqbiol:hasDataset | NCI Thesaurus OBO Edition NCIT | | malaria dataset |
| Model | bqmodel:isDescribedBy | Online Web server | | MAIP web platform - Source code |
| Model | bqmodel:isDescribedBy | Ersilia Model Hub | | Ersilia Incorporation URL |
| Model | bqmodel:isDescribedBy | PubMed Identification Number PMID | | PubMed URL |
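
Since annotation is described on this page as curating an annotation file, a table like this one could be written out as a tab-separated file. The sketch below is an assumption, not a required format: the column order simply follows the table, only three sample rows are included, and the Values column is left empty because the accession identifiers for these terms are not listed on this page.

```python
import csv
import io

COLUMNS = ["Entity", "Qualifiers", "Preferred Ontology", "Values", "Metadata"]

# Three sample rows; the empty string is a placeholder for the value URL.
ROWS = [
    ("Model", "bqbiol:hasTaxon", "NCBI Taxonomy", "", "Homo sapiens"),
    ("Model", "bqbiol:hasProperty", "Experimental Factor Ontology EFO", "", "Malaria"),
    ("Model", "bqmodel:hasProperty", "STATO: the statistical methods ontology", "",
     "Naïve Bayesian model"),
]


def to_tsv(rows):
    """Serialise annotation rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


print(to_tsv(ROWS))
```

To save the result, write the returned text to a file opened with `encoding="utf-8"`, so that characters such as the "ï" in "Naïve" are preserved.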

7. Contextualize the Computational Metadata by adding DOME

The DOME annotation provides more context to the computational metadata by identifying which section of the modelling process each piece of metadata belongs to.

Adding DOME to the table gives this:

| Metadata | DOME |
| --- | --- |
| classification models, naïve Bayesian model, AUC–ROC, 5-fold cross validation | Optimization-Algorithm, Evaluation-Performance Measure, Evaluation-Method |
| MAIP web platform - Source code, Ersilia Incorporation URL, Smiles descriptors, malaria dataset, predictions of potential Antimalarial compounds | Model-Executable form, Data-Input, Data-Source, Model-Output; Classification |
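
The DOME labels in this table follow a "Section-Detail" pattern, where the section is one of Data, Optimization, Model, or Evaluation. A minimal sketch of checking that pattern is shown below. The function name `parse_dome` is hypothetical, and the sketch validates labels only. It does not decide which label belongs to which piece of metadata.

```python
DOME_SECTIONS = {"Data", "Optimization", "Model", "Evaluation"}


def parse_dome(label):
    """Split a DOME label such as 'Evaluation-Method' into (section, detail)."""
    section, sep, detail = label.partition("-")
    section = section.strip()
    if not sep or section not in DOME_SECTIONS:
        raise ValueError(f"Not a valid DOME label: {label!r}")
    return section, detail.strip()


# The labels used in the table above.
LABELS = [
    "Optimization-Algorithm",
    "Evaluation-Performance Measure",
    "Evaluation-Method",
    "Model-Executable form",
    "Data-Input",
    "Data-Source",
    "Model-Output; Classification",
]

for label in LABELS:
    print(parse_dome(label))
```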
