# Glossary

## Day 1

**Bioavailability:** fraction of the active form of a drug that reaches systemic circulation unaltered

**Chemical space:** virtual space spanned by all possible molecules and chemical compounds adhering to a given set of construction principles and boundary conditions. The term is also often used to refer to a group of molecules belonging to a particular subspace characterized by their physicochemical and structural features.

**Dimensionality reduction:** transformation of data from a high-dimensional space (for example, a vector of 10,000 values) to a low-dimensional space (for example, 2 values) so that the low-dimensional representation still retains some meaningful properties of the original data.
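As an illustration, the sketch below reduces 3-dimensional points to a single coordinate by projecting them onto their first principal component, found by power iteration. This is only one of many dimensionality reduction techniques, the data points are invented for the example, and the function names are hypothetical.

```python
import math
import random

def first_principal_component(rows, n_iter=100, seed=0):
    """Return (mean, direction) of the first principal component.

    Uses power iteration on the (unnormalised) covariance matrix.
    Assumes the rows are not all identical.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        # Multiply v by X^T X without building the matrix explicitly.
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Map each high-dimensional row to one coordinate along `direction`."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

# Invented data: four 3-dimensional points lying on a straight line.
points = [[0, 0, 0], [1, 1, 0], [2, 2, 0], [3, 3, 0]]
mean, direction = first_principal_component(points)
coords = project(points, mean, direction)  # one number per point
```

Because these points lie on a line, a single coordinate retains all of their spread. Real datasets lose some information in the projection.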

## Day 2

**Classification:** type of supervised ML modelling where the model learns to categorize each input into a specific class.

**Contingency table:** a table describing how observations are distributed across the categories of two or more variables. For a binary classification ML model, the contingency table (also called a confusion matrix) counts molecules by two variables, the real value and the predicted value, giving the numbers of true positives, false positives, false negatives, and true negatives.
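A minimal sketch of how such a table could be counted for a binary classifier, assuming labels are encoded as 1 (positive) and 0 (negative). The label lists are invented and the function name is hypothetical.

```python
def contingency_table(y_true, y_pred):
    """Count the four real/predicted combinations for binary labels (1 or 0)."""
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for real, predicted in zip(y_true, y_pred):
        if real == 1 and predicted == 1:
            table["TP"] += 1   # real positive, predicted positive
        elif real == 0 and predicted == 1:
            table["FP"] += 1   # real negative, predicted positive
        elif real == 1 and predicted == 0:
            table["FN"] += 1   # real positive, predicted negative
        else:
            table["TN"] += 1   # real negative, predicted negative
    return table

# Invented example: 8 molecules with real and predicted activity labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
table = contingency_table(y_true, y_pred)
# table == {"TP": 3, "FP": 1, "FN": 1, "TN": 3}
```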

**Featurization:** in cheminformatics, the process of converting a molecule (usually represented by its SMILES string) into a vector of numbers that can be passed to the algorithm. The better we are able to represent a molecule as a numerical vector (i.e., without losing information), the more informative our model will be.
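To make the idea concrete, here is a deliberately crude toy featurizer that turns a SMILES string into a vector of symbol counts. It is not a real molecular fingerprint or descriptor set (those are normally computed with a cheminformatics toolkit), and the `SYMBOLS` list is an arbitrary assumption made for illustration.

```python
# Hypothetical, illustrative symbol list. Real featurizers are far richer.
SYMBOLS = ["C", "N", "O", "c", "n", "=", "#", "(", "1"]

def featurize(smiles):
    """Toy featurization: count how often each symbol occurs in the string.

    Caveat: this counts characters, not atoms, so "Cl" also counts as "C".
    """
    return [smiles.count(symbol) for symbol in SYMBOLS]

ethanol = featurize("CCO")        # [2, 0, 1, 0, 0, 0, 0, 0, 0]
benzene = featurize("c1ccccc1")   # [0, 0, 0, 6, 0, 0, 0, 0, 2]
```

Such a vector discards almost all structural information (for example, which atoms are bonded to which). This is the kind of information loss that better featurizations try to avoid.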

**Input:** data you pass to an ML model (by convention represented by X)

**Output:** data you get from an ML model (by convention represented by Y)

**Precision:** performance measure for a classification ML model. It measures the proportion of predicted positives that are actually positive: TP / (TP + FP).

**Probability:** likelihood that a proposition is true. Applied to the output of a classification ML model, it expresses how likely it is that a molecule belongs to a specific class.
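In practice, a predicted probability is often turned into a class label by comparing it with a threshold. The sketch below assumes a binary model and a default threshold of 0.5. The probabilities are invented and the function name is hypothetical.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Invented predicted probabilities of belonging to the positive class.
probs = [0.92, 0.35, 0.50, 0.08]
labels = to_class_labels(probs)                  # [1, 0, 1, 0]
strict = to_class_labels(probs, threshold=0.9)   # [1, 0, 0, 0]
```

Raising the threshold makes the model more conservative about calling a molecule positive.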

**Recall:** performance measure for a classification ML model. It measures the proportion of actual positives that the model correctly identifies: TP / (TP + FN). It is also called the True Positive Rate (TPR).
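Precision and recall can both be computed from the counts of true positives (TP), false positives (FP), and false negatives (FN). The following is a small sketch on invented labels, assuming 1 marks the positive class. The function name is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels encoded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # of the predicted positives, how many are real
    recall = tp / (tp + fn)      # of the real positives, how many were found
    return precision, recall

# Invented example: 4 real positives, of which the model finds 2,
# plus 1 false alarm among the negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
# precision is 2/3, recall is 0.5
```

This sketch does not guard against division by zero, which occurs when the model predicts no positives or the data contains none.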

**Reinforcement Learning:** a subcategory of machine learning where the algorithm (agent) learns through a process of trial and error, guided by rewards for its actions. In cheminformatics, a generative model that learns to propose new molecules by being rewarded for desirable properties is an example of an RL method.

**Regression:** type of supervised machine learning where the algorithm learns to predict a continuous outcome.
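As a minimal illustration, the sketch below fits a straight line to input-output pairs by ordinary least squares and uses it to predict a continuous value. The data are invented (a hypothetical single descriptor `x` and property `y`), and real regression models usually take many features.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Invented training data that follow y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept   # continuous prediction for a new input
```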

**ROC curve:** graph that shows the performance of a classification ML model across all classification thresholds. Each point on the ROC curve gives the True Positive Rate (TPR, y-axis) and the False Positive Rate (FPR, x-axis) at a specific threshold.
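The points of a ROC curve can be sketched by sweeping the threshold over the predicted scores and recomputing TPR and FPR each time. The example below uses invented labels and scores and assumes that both classes are present. The function name is hypothetical.

```python
def roc_points(y_true, scores):
    """Return a list of (FPR, TPR) points, one per distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

# Invented example: real labels and predicted probabilities for 6 molecules.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
# The curve starts at (0.0, 0.0) and ends at (1.0, 1.0).
```

Plotting the second value of each pair against the first gives the curve.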

**Unsupervised machine learning:** subcategory of machine learning where the algorithm is trained on unlabelled data with the goal of identifying patterns in it. For example, building a UMAP representation of a chemistry dataset is an unsupervised machine learning task.

**Supervised machine learning:** subcategory of machine learning where the algorithm is trained on a dataset of input-output pairs (labelled data) and it learns to map an input to a specific output based on the training set.
