Glossary

Dynamic glossary of terms introduced during the workshop

Day 1

Bioavailability: fraction of the active form of a drug that reaches systemic circulation unaltered

Chemical space: virtual space spanned by all possible molecules and chemical compounds adhering to a given set of construction principles and boundary conditions. The term is often also used to refer to a group of molecules belonging to a particular subspace characterized by their physicochemical and structural features.

Dimensionality reduction: transformation of data from a high-dimensional space (for example, a vector of 10000 values) to a low-dimensional space (for example, 2 numbers) so that the low-dimensional representation still retains some meaningful properties of the original data.
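
As an illustration, the sketch below reduces a few 4-dimensional vectors to 2 numbers each using principal component analysis (PCA), approximated with power iteration. PCA is only one of several possible techniques and is chosen here as an assumption; the data and all function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    # Subtract the mean of each column so the data is centered at the origin.
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_component(rows, iters=200, seed=0):
    # Power iteration on X^T X: approximates the direction of largest variance.
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in rows[0]]
    for _ in range(iters):
        scores = [dot(row, v) for row in rows]  # X v
        v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(len(v))]
        norm = dot(v, v) ** 0.5
        v = [x / norm for x in v]
    return v

def pca_2d(rows):
    X = center(rows)
    pc1 = top_component(X)
    # Remove the first direction from the data, then look for the next one.
    X2 = [[x - dot(row, pc1) * p for x, p in zip(row, pc1)] for row in X]
    pc2 = top_component(X2, seed=1)
    return [(dot(row, pc1), dot(row, pc2)) for row in X]

# Hypothetical toy data: 4 samples, each described by 4 numbers.
data = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [2, 2, 0, 1],
    [3, 3, 1, 0],
]
coords = pca_2d(data)  # one (x, y) pair per sample
```

Each sample is now described by 2 numbers that could be drawn on a scatter plot. The sketch assumes the data varies in at least two independent directions; otherwise the second component is undefined.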

Day 2

Classification: type of supervised ML modelling where the model learns to categorize each input into a specific class.

Contingency table: a table describing the distribution of observations across the categories of two or more variables. For a binary classification ML model, a contingency table counts the molecules across 2 variables (real values and predicted values), giving the numbers of true positives, false positives, true negatives and false negatives.
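
A minimal sketch of how such a table could be counted for a binary classifier, assuming labels are encoded as 1 (positive) and 0 (negative). The labels and the function name are hypothetical.

```python
from collections import Counter

def contingency_table(y_true, y_pred):
    # Count how often each (real, predicted) pair occurs.
    counts = Counter(zip(y_true, y_pred))
    return {
        "TP": counts[(1, 1)],  # real positive, predicted positive
        "FN": counts[(1, 0)],  # real positive, predicted negative
        "FP": counts[(0, 1)],  # real negative, predicted positive
        "TN": counts[(0, 0)],  # real negative, predicted negative
    }

y_true = [1, 1, 0, 0, 1, 0]  # hypothetical real values
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
table = contingency_table(y_true, y_pred)
print(table)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
```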

Featurization: in cheminformatics, the process of converting a molecule (usually represented by its SMILES string) into a vector of numbers that can be passed to the algorithm. The better we are able to represent a molecule as a numerical vector (i.e., without losing information), the more informative our model will be.
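
To make the idea concrete, here is a deliberately simplified sketch that turns a SMILES string into a vector of atom counts. This is a toy featurizer written for illustration only: the vocabulary and function name are hypothetical, and real workflows would normally rely on a cheminformatics toolkit and far richer descriptors.

```python
import re

# Hypothetical toy vocabulary; it ignores most elements, bonds and rings.
ATOMS = ["C", "N", "O", "Cl", "Br"]
# Two-letter symbols come first so that "Cl" is not read as "C".
TOKEN = re.compile(r"Cl|Br|[CNOcno]")

def featurize(smiles):
    counts = {atom: 0 for atom in ATOMS}
    for tok in TOKEN.findall(smiles):
        # Lowercase letters denote aromatic atoms in SMILES; count them as the same element.
        key = tok if tok in ("Cl", "Br") else tok.upper()
        counts[key] += 1
    return [counts[atom] for atom in ATOMS]

print(featurize("CCO"))         # ethanol -> [2, 0, 1, 0, 0]
print(featurize("c1ccccc1Cl"))  # chlorobenzene -> [6, 0, 0, 1, 0]
```

Note how much information this vector loses: two molecules with the same atoms but a different structure get identical vectors, which is exactly the kind of loss a good featurization tries to avoid.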

Input: data you pass to an ML model (by convention represented by X)

Output: data you get from an ML model (by convention represented by Y)

Precision: measure of the performance of a classification ML model. It measures the proportion of predicted positives that are actually positive.
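
In terms of counts, precision is TP / (TP + FP). A minimal sketch, assuming labels encoded as 1 (positive) and 0 (negative), with hypothetical data:

```python
def precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Returning 0.0 when nothing is predicted positive is a convention chosen here.
    return tp / (tp + fp) if (tp + fp) else 0.0

y_true = [1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0]  # 4 predicted positives, 2 of them correct
print(precision(y_true, y_pred))  # 0.5
```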

Probability: likelihood that a proposition is true. Applied to the output of a classification ML model, it expresses how likely it is that a molecule belongs to a specific class.

Recall: measure of the performance of a classification ML model. It measures the proportion of actual positives that the model correctly identified.
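
In terms of counts, recall is TP / (TP + FN). A minimal sketch, assuming labels encoded as 1 (positive) and 0 (negative), with hypothetical data:

```python
def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Returning 0.0 when there are no real positives is a convention chosen here.
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 1, 1, 1, 0]  # 4 real positives
y_pred = [1, 0, 1, 1, 0]  # 3 of them were found
print(recall(y_true, y_pred))  # 0.75
```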

Reinforcement Learning: a subcategory of machine learning where the algorithm (agent) learns through a process of trial and error, guided by rewards. In cheminformatics, a generative model that is rewarded for proposing new molecules with desired properties is an example of an RL method.

Regression: type of supervised machine learning where the algorithm learns to predict a continuous outcome.
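
As a small illustration, the sketch below fits a straight line by ordinary least squares, one of the simplest regression models. The choice of model, the data and the variable meanings are assumptions made for the example.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: one numerical descriptor (X) and a continuous property (Y).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Predict the continuous outcome for an unseen input.
prediction = slope * 5 + intercept
```

Here the fitted line is y = 2x + 1, so the prediction for x = 5 is 11.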

ROC curve: graph that shows the performance of a classification ML model at all threshold levels. Each point on the ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) for a specific threshold.
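
The points of a ROC curve can be sketched as follows, assuming the model outputs a probability for the positive class and that a molecule is predicted positive when its probability is at or above the threshold. The data and function name are hypothetical.

```python
def roc_points(y_true, scores, thresholds):
    pos = sum(y_true)          # number of real positives
    neg = len(y_true) - pos    # number of real negatives
    points = []
    for thr in thresholds:
        pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
points = roc_points(y_true, scores, thresholds=[0.0, 0.5, 1.0])
print(points)  # [(1.0, 1.0), (0.0, 0.5), (0.0, 0.0)]
```

A threshold of 0 labels every molecule as positive (top right corner of the plot), while a threshold of 1 labels none of them as positive (bottom left corner).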

Unsupervised machine learning: subcategory of machine learning where the algorithm is trained on unlabelled data with the goal of identifying patterns in it. For example, computing a UMAP representation of a chemistry dataset is an unsupervised machine learning task.

Supervised machine learning: subcategory of machine learning where the algorithm is trained on a dataset of input-output pairs (labelled data) and learns to map an input to a specific output based on the training set.
