
Glossary

Dynamic glossary of terms introduced during the workshop

Day 1

Bioavailability: fraction of the active form of a drug that reaches systemic circulation unaltered

Chemical space: virtual space spanned by all possible molecules and chemical compounds adhering to a given set of construction principles and boundary conditions. The term is also often used to refer to a group of molecules within a particular subspace, characterized by shared physicochemical and structural features.

Dimensionality reduction: transformation of data from a high-dimensional space (for example, a vector of 10,000 values) to a low-dimensional space (for example, 2 values) so that the low-dimensional representation still retains some meaningful properties of the original data.
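
As an illustration, the sketch below reduces each vector to 2 numbers with principal component analysis (PCA), which is only one of several dimensionality reduction techniques. It uses plain Python and power iteration to keep it self-contained; the function names and the `fingerprints` data are hypothetical, not taken from the workshop material.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-column mean so every feature has mean zero."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def top_direction(rows, n_iter=200, seed=0):
    """Power iteration on X^T X: approximates the direction of largest variance."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.random() + 0.1 for _ in range(dim)]
    for _ in range(n_iter):
        scores = [dot(r, v) for r in rows]                                 # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project each high-dimensional row onto its first two principal directions."""
    x = center(rows)
    v1 = top_direction(x, seed=0)
    s1 = [dot(r, v1) for r in x]
    # Remove the first component, then look for the next direction in what remains.
    residual = [[a - s * b for a, b in zip(r, v1)] for r, s in zip(x, s1)]
    v2 = top_direction(residual, seed=1)
    s2 = [dot(r, v2) for r in residual]
    return list(zip(s1, s2))


# Hypothetical 6-dimensional descriptors for 4 molecules.
fingerprints = [
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0],
]
coords = pca_2d(fingerprints)  # one (x, y) pair per molecule
```

Each molecule is now described by 2 numbers that can be drawn on a scatter plot.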

Day 2

Classification: type of supervised ML modelling where the model learns to categorize each input into a specific class.

Contingency table: a table describing how observations are distributed across the categories of two or more variables. For a binary classification ML model, the contingency table (also called a confusion matrix) counts molecules across 2 variables, the actual class and the predicted class, giving four cells: true positives, false positives, false negatives and true negatives.
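
A minimal sketch of how such a table can be counted for a binary classifier, assuming classes are encoded as 1 (positive, for example active) and 0 (negative). The function name and the example labels are hypothetical.

```python
def contingency_table(y_true, y_pred):
    """Count molecules by actual class (y_true) and predicted class (y_pred)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1  # actually positive, predicted positive
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1  # actually negative, predicted positive
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1  # actually positive, predicted negative
        else:
            counts["TN"] += 1  # actually negative, predicted negative
    return counts


# Hypothetical labels for 6 molecules.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
table = contingency_table(y_true, y_pred)
# table == {"TP": 2, "FP": 1, "FN": 1, "TN": 2}
```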

Featurization: in cheminformatics, the process of converting a molecule (usually represented by its SMILES string) into a vector of numbers that can be passed to the algorithm. The better we are able to represent a molecule as a numerical vector (i.e., without losing information), the more informative our model will be.
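
To make the idea concrete, here is a deliberately simplistic toy featurizer that turns a SMILES string into a vector of symbol counts. It is an assumption made for illustration only: it ignores bracket atoms, charges and ring structure, and the `VOCABULARY` list is hypothetical. Real workflows typically rely on a cheminformatics library such as RDKit to compute fingerprints or descriptors.

```python
import re

# Hypothetical, fixed vocabulary of SMILES symbols; one vector position per symbol.
VOCABULARY = ["C", "c", "N", "n", "O", "o", "Cl", "Br", "=", "#"]

# Two-letter atoms are listed first so "Cl" is not read as "C" followed by "l".
TOKEN_PATTERN = re.compile(r"Cl|Br|[A-Za-z]|[=#]")


def featurize(smiles):
    """Toy featurization: count how often each vocabulary symbol appears."""
    tokens = TOKEN_PATTERN.findall(smiles)
    return [tokens.count(symbol) for symbol in VOCABULARY]


ethanol = featurize("CCO")
aspirin = featurize("CC(=O)Oc1ccccc1C(=O)O")
# ethanol == [2, 0, 0, 0, 1, 0, 0, 0, 0, 0]
# aspirin == [3, 6, 0, 0, 4, 0, 0, 0, 2, 0]
```

Every molecule yields a vector of the same length, which is what allows a dataset of molecules to be passed to an ML algorithm as a table of numbers. Note how much information this toy representation discards: two different molecules with the same atom counts would get identical vectors.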

Input: data you pass to an ML model (by convention represented by X)

Output: data you get from an ML model (by convention represented by Y)

Precision: performance measure for a classification ML model. It is the proportion of predicted positives that are actually positive: true positives / (true positives + false positives).

Probability: likelihood that a proposition is true. Applied to the output of a classification ML model, it is how likely a molecule is to belong to a specific class.

Recall: performance measure for a classification ML model. It is the proportion of actual positives that the model correctly identifies: true positives / (true positives + false negatives).
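
Precision and recall can both be computed from counts of true positives (TP), false positives (FP) and false negatives (FN). The sketch below assumes those counts are already available; the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not a universal rule.

```python
def precision(tp, fp):
    """Of all molecules predicted positive, what fraction is actually positive?"""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Of all actually positive molecules, what fraction did the model find?"""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)  # 0.8
r = recall(tp=8, fn=8)     # 0.5
```

In this example the model is right most of the time when it predicts a positive (high precision) but misses half of the actual positives (lower recall).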

Reinforcement Learning: a subcategory of machine learning where the algorithm (agent) learns through a process of trial and error, guided by rewards. In cheminformatics, a generative model that is rewarded for proposing new molecules with desired properties is an example of an RL method.

Regression: type of supervised machine learning where the algorithm learns to predict a continuous outcome.
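
As a minimal sketch, the simplest regression model is a straight line fitted by ordinary least squares. The function names and the example numbers are hypothetical; they stand in for one numerical descriptor per molecule (`xs`) and a measured continuous property (`ys`).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a continuous value for a new input."""
    return slope * x + intercept


# Hypothetical training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
new_value = predict(5.0, slope, intercept)  # 11.0
```

Unlike a classifier, the model returns a number on a continuous scale rather than a class label.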

ROC curve: graph that shows the performance of a classification ML model at all classification thresholds. Each point on the ROC curve plots the True Positive Rate (TPR, y-axis) against the False Positive Rate (FPR, x-axis) obtained at one specific threshold.
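
The sketch below shows how the points of a ROC curve can be computed by sweeping a threshold over predicted probabilities. It assumes labels encoded as 1 and 0, at least one molecule of each class, and a molecule being predicted positive when its score is greater than or equal to the threshold. The function name and the example data are hypothetical.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) point per threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        # A molecule is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and predicted probabilities for 4 molecules.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.2]
curve = roc_points(y_true, scores, thresholds=[1.0, 0.8, 0.65, 0.5, 0.0])
# curve == [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Lowering the threshold moves along the curve from (0, 0), where nothing is predicted positive, to (1, 1), where everything is.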

Unsupervised machine learning: subcategory of machine learning where the algorithm is trained on unlabelled data with the goal of identifying patterns in it. For example, building a UMAP representation of a chemistry dataset is an unsupervised machine learning task.

Supervised machine learning: subcategory of machine learning where the algorithm is trained on a dataset of input-output pairs (labelled data) and learns to map each input to its output, so that it can predict the output for inputs it has not seen.
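
A minimal sketch of learning from input-output pairs, using a 1-nearest-neighbour classifier as a stand-in for any supervised algorithm: a new input receives the label of the most similar training input. The function name, the 2-dimensional feature vectors and the labels are hypothetical.

```python
def predict_1nn(train_x, train_y, query):
    """Return the label of the training input closest to `query`."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    distances = [squared_distance(x, query) for x in train_x]
    nearest = distances.index(min(distances))
    return train_y[nearest]


# Labelled data: inputs X (feature vectors) paired with outputs Y (classes).
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 1, 1]

label_a = predict_1nn(train_x, train_y, [0.2, 0.3])  # 0
label_b = predict_1nn(train_x, train_y, [5.5, 5.1])  # 1
```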


Last updated 2 years ago