# Breakout: working with chemistry data

For this session, we will have group discussions around three focused topics to prepare for our dive into ML model building. We will form breakout groups of 5-6 participants. During the discussion, flipcharts will be available to take notes for feedback to the larger group afterwards.

## Materials

* Session 1 Breakout [presentation](https://github.com/ersilia-os/event-fund-ai-drug-discovery/blob/main/presentations/session1_breakout.pptx)
* Exercise [Notebook](https://github.com/ersilia-os/event-fund-ai-drug-discovery/blob/main/notebooks/session1_breakout.ipynb)
* [Datasets](https://github.com/ersilia-os/event-fund-ai-drug-discovery/tree/main/data/session1) from the Session 1 Skills Notebook

### Additional Datasets

#### **BindingDB Human PI3K**

This dataset contains dose-response data for inhibition of a human kinase, PI3K. Kinases are important for cellular processes and are attractive targets in the anti-malarial field. We’ll investigate which of our whole-cell anti-malarial screening libraries is better suited to predicting activity for this PI3K dataset from [ChEMBL](https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL3706369/).

* The dataset comprises 345 molecules.
* The IC50 bioactivity is measured as the concentration exhibiting 50% inhibition.

#### **ChEMBL *Acinetobacter baumannii***

*A. baumannii* is one of the ESKAPE pathogens, a group of bacteria with a worrying increase in the prevalence of antibiotic resistance. This dataset contains dose-response data for the inhibition of *A. baumannii* from multiple [ChEMBL](https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/target_chembl_id%3ACHEMBL614425%20AND%20standard_type%3A\(%22MIC%22\)) curated sources and can be found [here](https://github.com/ersilia-os/event-fund-ai-drug-discovery/blob/main/data/day1/Chembl_A_baumannii.csv?raw=true).

* The dataset comprises 10,194 molecules.
* The IC50 bioactivity is measured as the concentration exhibiting 50% inhibition.

## Computational tools discussion

Take 10 minutes to get to know each other:

* Give each person a chance to introduce themselves and their field of expertise.
* What do you hope to learn about during the workshop?
* Decide on a group name.

For 20 minutes, discuss the challenges and opportunities for ML in drug discovery:

* What current computational tools and/or skills, if any, do you use in your research?
* What challenges/limitations do you face that prevent you from making further use of data science tools?
* How could some of these gaps be addressed?

## Chemical space discussion

Allow for 20 minutes of discussion around concepts concerning ‘chemical space’.

* What do you understand by the term 'chemical space'?
* How does this differ from the concept of 'drug-like molecules'?
* Why should the data we use to train models be similar to the types of compounds we aim to obtain predictions for?
* How might our training data requirements vary between a virtual screening campaign for novel chemical hits and a model to score analogues of a particular chemical series that is further along the drug discovery pipeline?

For the next 20 minutes, answer the questions for the following figures:

We will load a figure of an example dataset in which various chemical series (sets of related chemical analogues) are shown in colour.

* Which of these series are outside the chemical space of our training data and why?
* Does this make them more likely or less likely to be accurately predicted by a model trained on this data (the data in black)?
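
To make "outside the chemical space of the training data" more concrete, one option (not necessarily the method used for the workshop figure) is to compute, for each molecule in a series, its highest Tanimoto similarity to any molecule in the training set. The sketch below is a minimal illustration in plain Python. It assumes fingerprints are already available as sets of on-bit indices; the bit values shown are made up, and in practice the fingerprints would be generated with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each given as a set of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def nearest_training_similarity(query_fp, training_fps):
    """Highest Tanimoto similarity between one query molecule
    and any molecule in the training set."""
    return max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)


# Toy fingerprints: made-up bit indices, not real molecules.
training_fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 3, 5, 7}]
series_a = {1, 2, 3, 5}       # shares most of its bits with the training set
series_b = {3, 10, 11, 12}    # shares only one bit with the training set

sim_a = nearest_training_similarity(series_a, training_fps)
sim_b = nearest_training_similarity(series_b, training_fps)
print(f"series A: {sim_a:.2f}, series B: {sim_b:.2f}")
```

A series whose molecules all have low nearest-neighbour similarity sits far from the training data, so predictions for it deserve less trust. What counts as "low" depends on the fingerprint type and the dataset, so no cut-off is suggested here.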

Now we’ll investigate which of our datasets used in the skills workshop (GSK and St Jude antimalarial screens) would be better suited to screening a set of active human PI3K inhibitors against malaria. After producing the PCAs and UMAPs of the PI3K dataset with each potential training set, answer the following questions:

* Which dataset would you rather use to train a model to predict PI3K inhibition and why?
* Are there any concerns with your chosen dataset?

## Data cleaning discussion

Initial datasets are not necessarily ready for ML model building. We often first need to clean datasets so that they conform to a common format and to remove anomalies. Datasets should be complete, consistent, accurate and reliable.

In your groups, brainstorm the types of inconsistencies you might need to look out for and address in drug discovery data in order to clean and standardise the dataset for machine learning. Use the ChEMBL *A. baumannii* dataset as an example of non-standardised data to generate ideas.

Here is an example of a clean dataset, where each column holds a single type of standardised value and there are no duplicates or missing data:

<figure><img src="/files/PzsNPfea9O4jdcx37DAq" alt=""><figcaption><p>Example of a standardised dataset</p></figcaption></figure>
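
As a starting point for the brainstorm, the sketch below shows what a few cleaning steps could look like in plain Python. It is an illustration under stated assumptions, not the workshop's cleaning procedure: the keys `smiles`, `value`, `units` and `relation` are hypothetical and do not match the real column names of the ChEMBL export, all values are assumed to be strings (as returned by `csv.DictReader`), and only molar units are converted. Duplicates are matched on the raw SMILES string, so the same molecule written with two different SMILES would not be merged; in practice the structures would first be standardised with a cheminformatics toolkit.

```python
from statistics import median

# Conversion factors to micromolar; any other unit is treated as unconvertible.
TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "mM": 1e3}


def clean_records(records):
    """Return one row per molecule with its activity in micromolar.

    `records` is a list of dicts with the hypothetical keys
    'smiles', 'value', 'units' and 'relation' (all strings).
    """
    per_molecule = {}
    for row in records:
        smiles = (row.get("smiles") or "").strip()
        raw_value = (row.get("value") or "").strip()
        units = (row.get("units") or "").strip()
        relation = (row.get("relation") or "=").strip()

        # Drop rows with a missing structure or a missing measurement.
        if not smiles or not raw_value:
            continue
        # Keep exact measurements only; censored values ('>', '<') need separate handling.
        if relation != "=":
            continue
        # Drop rows whose units cannot be converted to the common unit.
        if units not in TO_MICROMOLAR:
            continue
        # Drop rows whose value is not a number.
        try:
            value = float(raw_value)
        except ValueError:
            continue
        per_molecule.setdefault(smiles, []).append(value * TO_MICROMOLAR[units])

    # Collapse repeated measurements of the same molecule to their median.
    return [
        {"smiles": smiles, "value_uM": median(values)}
        for smiles, values in per_molecule.items()
    ]


# Made-up records showing each kind of problem.
raw = [
    {"smiles": "CCO", "value": "500", "units": "nM", "relation": "="},
    {"smiles": "CCO", "value": "1.5", "units": "uM", "relation": "="},      # duplicate molecule
    {"smiles": "c1ccccc1", "value": "", "units": "uM", "relation": "="},    # missing value
    {"smiles": "CCN", "value": "10", "units": "ug.mL-1", "relation": "="},  # unconvertible unit
    {"smiles": "CCC", "value": "2", "units": "mM", "relation": ">"},        # censored value
    {"smiles": "CC(=O)O", "value": "2", "units": "mM", "relation": "="},
]
cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```

Each choice here (dropping censored values, taking the median of duplicates, discarding mass-based units rather than converting them with the molecular weight) is one option among several and is worth debating in your group.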

