# Skills: building an ML model for chemistry

This hands-on workshop will review the basic steps to train a Machine Learning model for chemistry datasets. We will go through each step together following the Google Colab notebook for the session.

*This pipeline is an example prepared with curated data and should not be reproduced with the student's own datasets. The goal of this workshop is purely academic and does not represent a real research case study.*

For this example, we will use the already curated datasets available in the Therapeutics Data Commons (TDC) initiative. They are very convenient for practice because they have been benchmarked and pre-prepared for ML modelling, and they already provide the necessary train-test splits.

The first dataset we will use identifies cardiotoxicity caused by blockade of the hERG potassium channel. Cardiotoxicity is one of the major adverse drug reactions that affect the later stages of drug discovery, and predicting early whether a molecule will induce it can avoid costly late-stage failures in the drug discovery pipeline. You can read more about the dataset on the TDC website and in its original publication.

- The dataset is composed of 648 molecules
- The cardiotoxic activity has been binarized (0: inactive, 1: active)

The second dataset corresponds to a lethal dose evaluation (the dose that kills 50% of the test subjects). You can read more about the dataset on the TDC website and in its original publication.

- The dataset is composed of 7,385 molecules
- The bioactivity is measured as Lethal Dose 50 (LD50)

As we have reviewed in the Keynote session, molecules need to be converted to an input amenable for Machine Learning. This is typically a numeric vector or an image.

We will use the Chemical Checker, which transforms each SMILES into a *signature* that encodes both structural and bioactivity information for the molecule. You can read more about the Chemical Checker in its associated publication.

In the workshop we will explore only supervised machine learning algorithms (where the training data is already labelled).

We will use different functions from the Python scikit-learn package to train the ML models.

The goal of this course is not to review the mathematical concepts behind each type of model. We will therefore simply use the default parameters that scikit-learn provides for each model.
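As a minimal sketch of this step, the snippet below fits a scikit-learn classifier with its default parameters. The data are synthetic (generated with `make_classification`) and only stand in for molecular signatures and binary activity labels. The choice of `RandomForestClassifier` and the random split are assumptions made for illustration, not necessarily what the workshop notebook uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real inputs: X plays the role of the
# molecular signatures, y the role of the binary activity labels.
X, y = make_classification(n_samples=200, n_features=32, random_state=0)

# TDC already provides train-test splits; we split here only because
# this toy data has none.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Default hyperparameters; random_state is fixed only for reproducibility.
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predicted class (0 or 1) for each held-out sample.
y_pred = clf.predict(X_test)
print(y_pred[:10])
```

Swapping in another scikit-learn estimator only changes the line that creates `clf`, because all estimators share the same `fit` and `predict` interface.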

A classifier identifies the input with a specific category. In our example, we have a binary classification of hERG cardiotoxicity:

- Inactive: not cardiotoxic (0)
- Active: cardiotoxic (1)

The training data (for example, IC50) has been binarized with a specific activity threshold (for example, 10 µM). Knowing the original measurement and its threshold can be very helpful when interpreting the results.

A binary classifier gives two outputs:

- Probability of 0: the probability that a molecule is inactive in the given assay
- Probability of 1: the probability that a molecule is active in the given assay

To convert the probabilities into a binary classification (a molecule must be given either a 0 or a 1), we must select a probability threshold. By default, the cut-off is 0.5, but it can be adapted to the particular assay needs.
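The thresholding step can be sketched with a few lines of NumPy. The probability values below are made up for illustration; in practice they would come from a fitted classifier's `predict_proba` method, which returns one column per class. The helper name `binarize` is hypothetical.

```python
import numpy as np

# Hypothetical classifier output: one row per molecule,
# columns are [probability of 0, probability of 1].
proba = np.array([
    [0.90, 0.10],
    [0.65, 0.35],
    [0.45, 0.55],
    [0.20, 0.80],
])
p_active = proba[:, 1]  # probability that each molecule is active

def binarize(p_active, threshold=0.5):
    """Assign 1 when the probability of being active reaches the threshold."""
    return (p_active >= threshold).astype(int)

default_labels = binarize(p_active)        # cut-off 0.5 -> [0, 0, 1, 1]
lenient_labels = binarize(p_active, 0.3)   # cut-off 0.3 -> [0, 1, 1, 1]
print(default_labels, lenient_labels)
```

Lowering the cut-off flags more molecules as active, which tends to catch more true actives at the cost of more false alarms.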

You can experiment in the notebook and observe how the results change when we modify the activity cut-off.

**Confusion matrix:** distributes the predictions into True Negatives, True Positives, False Negatives and False Positives. By default it shows the classification using a 0.5 probability threshold. It lets us visualize how good our classifier is, and provides the values needed to calculate:

- Precision: of the molecules predicted as positive, how many are actually positive
- Recall: of the molecules that are actually positive, how many we are able to identify
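The following sketch computes a confusion matrix, precision and recall with `sklearn.metrics`. The true and predicted labels are invented for illustration and do not come from the hERG dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for ten molecules (1 = active, 0 = inactive).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
print(precision, recall)
```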

**ROC Curve:** displays the performance of the model at all classification thresholds. The axes indicate:

- X-axis: false positive rate (1 - specificity)
- Y-axis: true positive rate (sensitivity or recall)
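As a small sketch, scikit-learn's `roc_curve` returns the false positive rate and true positive rate at each candidate threshold, and `roc_auc_score` summarizes the curve as a single number (the area under it). The labels and scores below are toy values chosen for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: true labels and predicted probabilities of being active.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold; the curve runs from (0, 0) to (1, 1).
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 0.5 is random ranking, 1.0 is perfect ranking.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

Here three of the four active/inactive pairs are ranked correctly, which is why the area under the curve is 0.75.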

A regression model predicts the value of a continuous output, for example the IC50 of a compound. To understand the output, it is important to know what the model was trained on and in which units.

In regression models, the output is easier to interpret. Each molecule is associated with a predicted value, LD50 in our example.

Sometimes the input has been transformed (for example, by logarithm) and therefore the output will also be in log.
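A short sketch of what this means in practice: if the training values were transformed with a base-10 logarithm, a prediction must be back-transformed before it can be read in the original units. The numbers and the choice of `log10` are assumptions for illustration; the actual transformation depends on the dataset.

```python
import numpy as np

# Hypothetical measurements in their original units.
values = np.array([1.0, 10.0, 1000.0])

# Log-transformed values, as a model might see them during training.
log_values = np.log10(values)  # [0., 1., 3.]

# A hypothetical model prediction, expressed in log10 units.
pred_log = 2.0

# Undo the transformation to read the prediction in the original units.
pred_original = 10.0 ** pred_log  # 100.0
print(log_values, pred_original)
```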

Several metrics measure the performance of a regression model on the test data:

- Mean Absolute Error (MAE): average of the absolute difference between actual and predicted values. It is easy to understand because its units are the same as those of the output values.
- Mean Squared Error (MSE): average of the squared difference between actual and predicted values. Because the errors are squared, it penalizes large errors more heavily than the MAE.
- R-squared (R²): proportion of the variance in the actual values that is explained by the regression model. It is scale-free.
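These three metrics are available in `sklearn.metrics`. The sketch below applies them to a handful of invented actual and predicted values, not to the LD50 dataset.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values for four molecules.
y_true = [3.0, 2.0, 4.0, 5.0]
y_pred = [2.5, 2.0, 5.0, 4.5]

mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
r2 = r2_score(y_true, y_pred)              # 1 - 1.5 / 5.0 = 0.7
print(mae, mse, r2)
```

The single error of 1.0 contributes half of the MAE total but two thirds of the MSE total, which illustrates how squaring emphasizes large errors.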