# Glossary

Dynamic glossary of terms introduced during the workshop.

**Bioavailability:** fraction of the active form of a drug that reaches systemic circulation unaltered.

**Chemical space:** virtual space spanned by all possible molecules and chemical compounds that adhere to a given set of construction principles and boundary conditions. The term is also often used for a group of molecules belonging to a particular subspace, characterized by their physicochemical and structural features.

**Dimensionality reduction:** transformation of data from a high-dimensional space (for example, a vector of 10,000 values) to a low-dimensional space (for example, 2 values) so that the low-dimensional representation still retains some meaningful properties of the original data.

**Classification:** type of supervised ML modelling where the model learns to categorize each input into a specific class.

**Contingency table:** a table describing the distribution of variables across two or more categories. For a binary classification ML model, a contingency table counts how the molecules are distributed across two variables, the real values and the predicted values, which gives the numbers of true positives, false positives, false negatives and true negatives.
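
As a minimal sketch of the idea, the four cells of a binary contingency table can be counted from two lists of labels. The function name `contingency_table`, the example labels and the convention that 1 means "active" and 0 means "inactive" are assumptions made for illustration.

```python
from collections import Counter


def contingency_table(y_true, y_pred):
    """Count (real, predicted) label pairs for a binary classifier.

    Assumes labels are encoded as 1 (positive) and 0 (negative).
    """
    counts = Counter(zip(y_true, y_pred))
    return {
        "TP": counts[(1, 1)],  # real positive, predicted positive
        "FN": counts[(1, 0)],  # real positive, predicted negative
        "FP": counts[(0, 1)],  # real negative, predicted positive
        "TN": counts[(0, 0)],  # real negative, predicted negative
    }


# Hypothetical real and predicted labels for eight molecules
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

table = contingency_table(y_true, y_pred)
print(table)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```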

**Featurization:** in cheminformatics, the process of converting a molecule (usually represented by its SMILES string) into a vector of numbers that can be passed to the algorithm. The better we are able to represent a molecule as a numerical vector (i.e., the less information we lose), the more informative our model will be.
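
The following toy sketch only illustrates the "molecule in, vector of numbers out" idea. It counts a few element symbols in a SMILES string and is not a real featurizer: in practice one would typically use fingerprints or descriptors from a cheminformatics toolkit. The names `toy_featurize` and `ELEMENTS` are hypothetical, and the simple tokenizer does not handle bracket atoms such as `[Na+]` correctly.

```python
import re

# Hypothetical, deliberately small set of element symbols to count
ELEMENTS = ["C", "N", "O", "S", "CL", "BR"]

# Match two-letter halogens first so "Cl" is not read as a carbon
TOKEN_PATTERN = re.compile(r"Cl|Br|[A-Za-z]")


def toy_featurize(smiles):
    """Turn a SMILES string into a vector of element counts (toy example)."""
    # Upper-casing merges aromatic atoms (c, n, o, s) with aliphatic ones
    tokens = [tok.upper() for tok in TOKEN_PATTERN.findall(smiles)]
    return [tokens.count(symbol) for symbol in ELEMENTS]


ethanol = toy_featurize("CCO")
pyridine = toy_featurize("c1ccncc1")
chloroform = toy_featurize("ClC(Cl)Cl")

print(ethanol)     # [2, 0, 1, 0, 0, 0]
print(pyridine)    # [5, 1, 0, 0, 0, 0]
print(chloroform)  # [1, 0, 0, 0, 3, 0]
```

Note how much information this vector loses: it says nothing about bonds or ring structure, so very different molecules can map to the same vector.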

**Input:** data you pass to an ML model (by convention represented by X).

**Output:** data you get from an ML model (by convention represented by Y).

**Precision:** performance measure for a classification ML model. It is the proportion of predicted positives that are actually positive: TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.
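
A minimal sketch of this calculation, assuming labels encoded as 1 (positive) and 0 (negative). The function name `precision`, the example labels and the choice to return 0.0 when nothing is predicted positive are illustrative assumptions.

```python
def precision(y_true, y_pred):
    """Proportion of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return 0.0  # assumed convention when the model predicts no positives
    return tp / (tp + fp)


# Hypothetical labels: three predicted positives, two of them correct
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0]

precision_value = precision(y_true, y_pred)
print(round(precision_value, 3))  # 0.667
```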

**Probability:** likelihood that a proposition is true. Applied to the output of a classification ML model, it is how likely it is that a molecule belongs to a specific class.
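
To illustrate how a predicted probability becomes a class label, the sketch below applies a cutoff to some example probabilities. The function name `assign_class`, the default threshold of 0.5 and the probability values are assumptions for illustration, not the output of a specific model.

```python
def assign_class(probability, threshold=0.5):
    """Return 1 (positive class) if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical predicted probabilities of being active for four molecules
probabilities = [0.92, 0.35, 0.50, 0.08]

default_labels = [assign_class(p) for p in probabilities]
strict_labels = [assign_class(p, threshold=0.9) for p in probabilities]

print(default_labels)  # [1, 0, 1, 0]
print(strict_labels)   # [1, 0, 0, 0]
```

Raising the threshold makes the model more conservative about calling a molecule positive.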

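**Recall:** performance measure for a classification ML model. It is the proportion of actual positives that the model correctly identified: TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.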
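
A minimal sketch of this calculation, assuming labels encoded as 1 (positive) and 0 (negative). The function name `recall`, the example labels and the choice to return 0.0 when there are no real positives are illustrative assumptions.

```python
def recall(y_true, y_pred):
    """Proportion of real positives that the model identified."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        return 0.0  # assumed convention when the data contain no positives
    return tp / (tp + fn)


# Hypothetical labels: four real positives, three of them found by the model
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 0, 1, 1, 1]

recall_value = recall(y_true, y_pred)
print(recall_value)  # 0.75
```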

**Reinforcement Learning:** a subcategory of machine learning where the algorithm (agent) learns through a process of trial and error, guided by the rewards it receives for its actions. In cheminformatics, a generative model that learns to propose new molecules by being rewarded for desirable ones is an example of an RL method.

**Regression:** type of supervised machine learning where the algorithm learns to predict a continuous outcome.
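
As a minimal sketch of predicting a continuous outcome, the example below fits a straight line to a handful of points with ordinary least squares. The function name `fit_line` and the data are hypothetical; real regression models on molecules usually take many features as input rather than a single number.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical single descriptor (x) and measured continuous property (y)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict for a new, unseen input

print(slope, intercept, prediction)  # 2.0 1.0 11.0
```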

**ROC curve:** graph that shows the performance of a classification ML model at all classification thresholds. Each point in the ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) obtained at a specific threshold.
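
The sketch below computes the (FPR, TPR) point for a few thresholds, assuming labels encoded as 1 and 0 and a model that outputs a score per molecule. The function name `roc_points`, the scores and the chosen thresholds are illustrative assumptions; it also assumes the data contain at least one positive and one negative.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    A molecule is predicted positive when its score is >= the threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical real labels and predicted scores for six molecules
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = roc_points(y_true, scores, thresholds=[1.0, 0.75, 0.5, 0.25, 0.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0), where nothing is predicted positive, to (1, 1), where everything is.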

**Unsupervised machine learning:** subcategory of machine learning where the algorithm is trained on unlabelled data with the goal of identifying patterns in it. For example, computing a UMAP representation of a chemistry dataset is an unsupervised machine learning method.

**Supervised machine learning:** subcategory of machine learning where the algorithm is trained on a dataset of input-output pairs (labelled data) and learns to map an input to a specific output based on the training set.
