BioModels Annotation
This page describes how to annotate Ersilia models in the BioModels Tool to make them more FAIR.
Background
Sharing machine learning (ML) models, particularly in drug discovery, is key to building a FAIReR (Findable, Accessible, Interoperable, Reusable, and Reproducible) collection of models that are easier to reproduce and reuse. This reduces the need to rebuild models from scratch, increases their usefulness across applications, and speeds up progress in drug discovery.
Making ML models FAIR and shareable involves following standards and protocols that include sharing the model training code, dataset information, reproduced figures, model evaluation metrics, trained models, Docker files, and model metadata, as well as FAIR dissemination. Here, we focus on the model metadata and its annotation.
Model Metadata
To share ML models effectively, it's important to provide relevant information about them: the metadata. Metadata is organised information that describes, explains, or helps find, use, or manage a resource. In the context of Ersilia models, metadata is data about the model, classified into three categories: Biological Metadata, Computational Metadata, and Descriptive Metadata. Metadata lets other researchers and modellers find and access models based on their specific characteristics.
- Biological Metadata
The biological relevance of a model is one of its most important aspects. It covers the model's bioactivity, the biological processes the model explains, the biological system the model was trained on, the tissue or cell type involved, the assay type, the biological entity, the Ersilia model theme (ranging from infectious disease to ADME properties), and the compartment in which the biological process happens.
Why is Annotation Important?
An annotation is an association between a piece of metadata and an ontology term. Annotating a model consists of mapping the identified model metadata to terms from controlled vocabularies and entries in data resources.
Annotating model metadata is crucial to:
Precisely identify model categories
Improve understanding of the model's structure
Make it easier to compare different models
Simplify model integration
Enable efficient searches
Add meaningful context to the model
Enhance understanding of the biology behind the model
Allow conversion and reuse of the model
Facilitate integration of the model with biological knowledge
Controlled Vocabularies and Ontology
Controlled vocabularies contain fixed sets of terms that describe concepts in a specific domain. These terms have definitions that help us understand and agree on their meanings, and they also help with indexing and retrieval. Ontologies use controlled vocabularies to describe concepts and how they are related in a structured, computable format.
The following ontologies are preferred;
How do we Annotate?
Annotating means curating an annotation file with all the essential information. Each piece of metadata is linked to the right ontology term (a cross-reference), together with a value and a qualifier that explicitly define the relationship between the model metadata and the linked resources.
Metadata is data about data. Here, the data being described is the model, which is referred to as the entity in the annotation.
Entity (eg. Model)
Extract all information available for each metadata category.
Each annotation is linked to external data resources and values, e.g. EDAM, STATO. An external data resource can be a database of ontologies or an ontology itself.
This improves the model quality
Essential for the model search criteria
The value enhances accessibility and integrates the metadata with other data resources using a compact identifier.
Metadata (eg. Machine learning)
Ontology (Bioinformatics Concept EDAM: edam:topic_3474)
Value (https://identifiers.org/bptl/edam:topic_3474)
Append a qualifier to each annotation. Qualifiers explain the relationship between a piece of metadata and the model itself.
Qualifier (eg. bqbiol and bqmodel)
A relationship is either biological - bqbiol, or computational - bqmodel.
The following qualifiers are used to describe relationships in the annotation:
bqbiol:hasTaxon - describes a relationship between a model and organism
bqbiol:occursIn - a compartment where a process occurs
bqbiol:hasProperty - general biological property
bqmodel:hasProperty - all model properties
bqmodel:isDescribedBy - the model resources
bqbiol:hasInput - model input data
bqbiol:hasDataset - the model training data
bqbiol:hasOutput - biological output of the model
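The qualifier list above can be captured as a small lookup. The sketch below is illustrative only: the `QUALIFIERS` table and the `relationship_kind` helper are hypothetical names, not part of any BioModels or Ersilia tooling, and the biological/computational split simply follows the namespace rule stated above.

```python
# Qualifiers listed on this page, grouped by namespace.
# Illustrative only: this is not an exhaustive list of BioModels qualifiers.
QUALIFIERS = {
    "bqbiol": {"hasTaxon", "occursIn", "hasProperty", "hasInput", "hasDataset", "hasOutput"},
    "bqmodel": {"hasProperty", "isDescribedBy"},
}

# Namespace-to-relationship rule taken from the text above.
KINDS = {"bqbiol": "biological", "bqmodel": "computational"}


def relationship_kind(qualifier: str) -> str:
    """Return 'biological' or 'computational' for a qualifier listed on this page."""
    namespace, _, name = qualifier.partition(":")
    if name not in QUALIFIERS.get(namespace, set()):
        raise ValueError(f"Unknown qualifier: {qualifier!r}")
    return KINDS[namespace]


print(relationship_kind("bqbiol:hasTaxon"))
print(relationship_kind("bqmodel:isDescribedBy"))
```

Rejecting unknown qualifiers early helps catch typos before they reach the annotation file.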
DOME annotation gives more context to the computational metadata
D - Data (the data source or the type of input data)
O - Optimization (the algorithm each model uses)
M - Model (the model source code and its executable form)
E - Evaluation (how the model's performance is evaluated)
Tools for Annotation
OLS is a search and visualisation service that hosts 260+ biological and biomedical ontologies in one place.
Zooma is an ontology mapping tool that can be used to automatically map free text to ontology terms.
Steps to Annotating a Model - An Example
1. Identify a Model and its associated Information
Associated information includes the model publication, repository and its source code.
Read the Publication
All of the model metadata is contained in its publication, so it's important to read the publication to understand the biological or chemical processes the model captures, its bioactivity, its algorithm, its training data, and the validation performed, among other information. Go through the repository to validate the information in the publication.
Special Scenario
Some models are built by fine-tuning other large models with different datasets to perform several tasks. In this case, it's important to understand the base model and all its properties.
Case-study
As an example, we work with an antimalarial model with the tag eos4zfy from the Ersilia Model Hub. This model comes from a collaborative project between EMBL-EBI and several pharmaceutical companies and institutes. Individual QSAR models, each built on a private dataset to identify novel molecules that may have antimalarial properties, were merged to develop MAIP, a free web platform for mass prediction of potential malaria-inhibiting compounds.
2. Assign the Metadata Entity
The metadata entity is the source of the metadata, i.e. the thing the metadata describes, which here is the model. Adding the metadata entity makes the table look like this:
Entity |
---|
Model |
Model |
Model |
3. Extract all information available for each metadata category.
To identify all the metadata associated with this model, we go through the publication, the web platform (to understand its pipeline), its source code, and the Ersilia implementation process. This template can be adapted to individual use, and a sample visual can be seen below.
Biological Metadata
These metadata are extracted from the abstract of the publication and include the disease, causative agents, data classification, and the biological output of the model.
Computational Metadata
These include the algorithm of the model, its evaluation method, and the data type.
Descriptive Metadata
Each model is described by a publication, source code and its implementation.
Entity | Model Metadata Categories | Metadata |
---|---|---|
Model | Biological Metadata |
|
Model | Computational Metadata |
|
Model | Descriptive Metadata |
|
For the purpose of example, these are sample Metadata from this model and its classification.
P.S: Column 2 (Model Metadata Categories) is just for descriptive purpose. It's not part of the annotation
Special Scenario
Some models are validated experimentally after they are built, either in vivo or in vitro: a compound predicted by the model is tested experimentally to confirm its bioactivity. Where such validation exists, it is important and should be annotated for the model.
4. Map the Metadata to the right Ontology
This is the main step of annotation: associating each piece of metadata with the right ontology term. Ontologies can be identified through the Ontology Lookup Service (OLS), a repository for biomedical ontologies that aims to provide a single point of access to the latest ontology versions.
To ensure standardization and interoperability, it's crucial to identify relevant ontologies through OLS. These ontologies help in annotating the model components accurately.
Here, you search for each piece of metadata on the OLS website and identify the ontology term that best suits it. Terms are searched for in the preferred ontologies first. Some terms are best described by ontologies outside the preferred list, and such ontologies can be used.
A brief description of the Image above.
Step 1 - Input the metadata in the search space and search
Step 2 - Look for the term in the preferred ontology
Step 3 - If not found in the preferred ontology, look through other available options with the right meaning.
P.S: Sometimes the exact term isn't found in OLS. In this case, the closest term can be used to replace the metadata.
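The same search can be scripted instead of typed into the website. The sketch below only builds the request URL and does not contact the service. It assumes OLS exposes a REST search endpoint at `https://www.ebi.ac.uk/ols4/api/search` that accepts `q` and `ontology` query parameters, so check the current OLS API documentation before relying on it. `build_ols_search_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed OLS REST search endpoint; verify against the current OLS documentation.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_ols_search_url(term: str, ontology: Optional[str] = None) -> str:
    """Build a search URL for a metadata term, optionally limited to one ontology."""
    params = {"q": term}
    if ontology:
        # Step 2: restrict the search to a preferred ontology first.
        params["ontology"] = ontology
    return f"{OLS_SEARCH_URL}?{urlencode(params)}"


# Steps 1 and 2: search for the term within a preferred ontology.
print(build_ols_search_url("Machine learning", ontology="edam"))
# Step 3: fall back to searching every ontology.
print(build_ols_search_url("Machine learning"))
```

Keeping URL construction separate from the network call makes the helper easy to check without access to the service.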
When choosing the right ontology for a piece of metadata, there are important things to consider:
The ontology whose term best matches the meaning of the metadata
Whether it is a preferred ontology, for better indexing
Using the image above as an example, the metadata Machine Learning appears in several ontologies, with the three shown above being the most preferred.
Looking at the important things to consider:
Two of these are preferred ontologies, and EDAM has the best meaning. EDAM is therefore preferred for the metadata "Machine Learning".
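The two considerations above can be sketched as a selection rule. Judging whether a term has the right meaning remains a curator's decision, so the sketch takes it as an input flag. The `choose_candidate` helper, the dictionary keys, and the `OTHER` ontology entry are all hypothetical and used only for illustration.

```python
from typing import Optional


def choose_candidate(candidates: list, preferred: list) -> Optional[dict]:
    """Pick a term: the meaning must match, and preferred ontologies win ties."""
    # Consideration 1: keep only terms the curator judged to carry the right meaning.
    matching = [c for c in candidates if c["matches_meaning"]]
    # Consideration 2: favour preferred ontologies, in the order they are listed.
    for ontology in preferred:
        for candidate in matching:
            if candidate["ontology"] == ontology:
                return candidate
    # Otherwise fall back to any ontology with the right meaning, if there is one.
    return matching[0] if matching else None


candidates = [
    {"ontology": "OTHER", "term_id": "OTHER:0000001", "matches_meaning": True},  # hypothetical entry
    {"ontology": "EDAM", "term_id": "edam:topic_3474", "matches_meaning": True},
]
chosen = choose_candidate(candidates, preferred=["EDAM"])
print(chosen["term_id"])
```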
After mapping the metadata to the right ontology, we have a table like this;
Entity | Preferred Ontology | Metadata |
---|---|---|
Model |
|
|
Model |
|
|
Model | Special cases |
|
Some metadata have no ontology; these are the descriptive metadata. What describes them serves as their ontology. In the special case, we'd have:
MAIP web platform - Source code
Ersilia Incorporation URL
PubMed URL
Online Web server
Ersilia Model Hub
PubMed Identification Number PMID
5. Link the Ontology to their values
Values enhance accessibility and integrate metadata with other data resources in the form of a URL (Uniform Resource Locator).
Each ontology term has an accession identifier, and a value is formed by appending that identifier to the Identifiers.org resolver URL.
For example: Value = resolver URL + accession identifier
Identifiers.org is a resolution service that provides consistent access to identifiers in the form https://identifiers.org/. The prefixed accession (e.g. NCIT:C176231) is what it calls a compact identifier.
Value = https://identifiers.org/ + NCIT:C176231
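The formula above amounts to simple string concatenation. A minimal sketch follows, assuming the accession is already written in `PREFIX:ID` form; `build_value` is a hypothetical helper name, not part of any existing tool.

```python
IDENTIFIERS_ORG = "https://identifiers.org/"


def build_value(accession: str) -> str:
    """Join the resolver URL and a prefixed accession such as 'NCIT:C176231'."""
    accession = accession.strip()
    prefix, sep, local_id = accession.partition(":")
    # Assumption: accessions are written as PREFIX:ID with both parts present.
    if not (prefix and sep and local_id):
        raise ValueError(f"Expected 'PREFIX:ID', got {accession!r}")
    return IDENTIFIERS_ORG + accession


print(build_value("NCIT:C176231"))  # https://identifiers.org/NCIT:C176231
```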
Each ontology term is linked to its value using the formula above, and the table looks like this:
Entity | Preferred Ontology | Values | Metadata |
---|---|---|---|
Model |
|
| |
Model |
|
| |
Model |
|
|
6. Associate the right qualifier to each annotation
As previously explained, each piece of metadata is a biological, computational, or descriptive component of the model. Here, we annotate the metadata based on the category it falls into.
For example;
Metadata Category | Metadata | Qualifier |
---|---|---|
Biological Metadata | Malaria | bqbiol:hasProperty |
Computational Metadata | naïve Bayesian model | bqmodel:hasProperty |
Descriptive Metadata | Ersilia Incorporation URL | bqmodel:isDescribedBy |
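The example table can be read as a default mapping from metadata category to qualifier. The sketch below encodes only the three rows shown. More specific qualifiers (for instance bqbiol:hasTaxon for an organism) may fit better in practice, so treat the result as a starting point. `CATEGORY_TO_QUALIFIER` and `default_qualifier` are hypothetical names.

```python
# Default qualifier per metadata category, taken from the example table above.
CATEGORY_TO_QUALIFIER = {
    "Biological Metadata": "bqbiol:hasProperty",
    "Computational Metadata": "bqmodel:hasProperty",
    "Descriptive Metadata": "bqmodel:isDescribedBy",
}


def default_qualifier(category: str) -> str:
    """Return the default qualifier for a category, or raise if it is unknown."""
    try:
        return CATEGORY_TO_QUALIFIER[category]
    except KeyError:
        raise ValueError(f"Unknown metadata category: {category!r}") from None


for category, metadata in [
    ("Biological Metadata", "Malaria"),
    ("Computational Metadata", "naïve Bayesian model"),
    ("Descriptive Metadata", "Ersilia Incorporation URL"),
]:
    print(metadata, "->", default_qualifier(category))
```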
After adding qualifiers to each metadata, the table looks like this;
Entity | Qualifiers | Preferred Ontology | Values | Metadata |
---|---|---|---|---|
Model |
|
|
| |
Model |
|
|
| |
Model |
|
|
|
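A table with these five columns can be assembled programmatically. The sketch below writes one row as CSV using Python's standard `csv` module. The CSV format, the `to_csv` helper, and the choice of `bqmodel:hasProperty` for this row are assumptions made for illustration, since this page does not specify the annotation file format. The row reuses the Machine learning example from earlier on this page.

```python
import csv
import io

# Column names taken from the table above.
COLUMNS = ["Entity", "Qualifiers", "Preferred Ontology", "Values", "Metadata"]


def to_csv(rows: list) -> str:
    """Serialise annotation rows (dicts keyed by COLUMNS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "Entity": "Model",
        "Qualifiers": "bqmodel:hasProperty",  # assumed: a computational property
        "Preferred Ontology": "edam:topic_3474",
        "Values": "https://identifiers.org/bptl/edam:topic_3474",
        "Metadata": "Machine learning",
    }
]
print(to_csv(rows))
```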
7. Contextualize the Computational Metadata by adding DOME
The DOME annotation provides more context for the computational metadata by identifying which part of the modelling process each piece of metadata belongs to.
Adding DOME to the table shows this;
Metadata | DOME |
---|---|
classification models | Optimization-Algorithm |
naïve Bayesian model | Optimization-Algorithm |
AUC-ROC | Evaluation-Performance Measure |
5-fold cross validation | Evaluation-Method |
MAIP web platform - Source code | Model-Executable form |
Ersilia Incorporation URL | Model-Executable form |
Smiles descriptors | Data-Input |
malaria dataset | Data-Source |
predictions of potential Antimalarial compounds | Model-Output; Classification |
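The DOME tags in this table can be grouped by their top-level section (Data, Optimization, Model, Evaluation) to check that every section is covered. The sketch below assumes each tag is written as `Section-Detail`, as in the table. `DOME_TAGS` and `group_by_dome_section` are hypothetical names.

```python
from collections import defaultdict

# Metadata-to-DOME tags copied from the table above.
DOME_TAGS = {
    "classification models": "Optimization-Algorithm",
    "naïve Bayesian model": "Optimization-Algorithm",
    "AUC-ROC": "Evaluation-Performance Measure",
    "5-fold cross validation": "Evaluation-Method",
    "MAIP web platform - Source code": "Model-Executable form",
    "Ersilia Incorporation URL": "Model-Executable form",
    "Smiles descriptors": "Data-Input",
    "malaria dataset": "Data-Source",
    "predictions of potential Antimalarial compounds": "Model-Output; Classification",
}


def group_by_dome_section(tags: dict) -> dict:
    """Group metadata by the DOME section that precedes the first hyphen of each tag."""
    grouped = defaultdict(list)
    for metadata, tag in tags.items():
        section = tag.split("-", 1)[0]
        grouped[section].append(metadata)
    return dict(grouped)


for section, items in group_by_dome_section(DOME_TAGS).items():
    print(section, items)
```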
The process above is how to completely annotate a model to meet the FAIR Standard.