Outreachy Winter 2023
This page describes the contribution guidelines for the interns interested in participating in the Outreachy round from December 2022 to March 2023.
The Ersilia Model Hub is an open source platform of ready-to-use AI/ML models for biomedical research. With it, scientists can browse a collection of models, select the ones relevant to their research and run predictions! For example, a scientist can predict whether a molecule will be active against malaria.
The internship project will be focused on increasing the collection of available models in the Hub.
The contribution period is especially important to Ersilia. We list below a number of tasks that must be completed in order. We will not judge interns on how many tasks they complete, but on the quality of each contribution and on their interest in learning, participating and helping others in the community. The mentors are there and willing to help, but they review contributions in a dedicated timeslot, so please be patient if it takes a few hours to get back to you.
The contribution period runs from October 6th to November 4th. During this time, interested applicants are welcome to contribute to Ersilia's project following the guidelines in this document.
We will be having tons of interactions during the contribution period, so it is best to get to know your peers, your mentors and a bit more about Ersilia.
We use Slack as our main communication platform, both for the contribution period and afterwards to work with the selected interns. If you have never used this tool, don't worry, it is quite intuitive!
- 2.Introduce yourself in the #general channel. The general channel is for random questions, interactions with other fellow contributors, writing tips and suggestions...
- 3.Use the dedicated channels for questions about specific topics. For example, the #colab channel will be used to discuss Google Colaboratory issues.
- 4.Contribute to your peers' questions. This is about helping each other, and we really value interns who work with the community.
- 📖 Getting familiar with the repository structure
- 🐛 Checking the issues to see what the community has been working on
- 👀 Watching the repository to receive notifications if you are mentioned
- ⭐ Starring the repository if you like the work Ersilia is doing!
We will host an open community call on Wednesday 12th of October at 17:00 SAST. During the call we will introduce a bit more about Ersilia's work, go over the contribution period tasks and answer any questions you might have!
The link to the call will be shared via Slack.
We apologize in advance if the call is in a difficult time zone for you. Attendance is not compulsory and will not be taken into account when selecting the interns.
We will be using the Ersilia Model Hub throughout the internship. The software is in beta testing, so you might encounter errors while running it. Don't worry, this is why we are here!
Please follow the installation instructions. If you have a UNIX machine (Linux or macOS) you can install Ersilia directly. If you are using a Windows machine you will need a Virtual Machine or the Windows Subsystem for Linux (WSL).
We will first make sure ersilia works by running the following commands:
ersilia --help #this should output the command options for ersilia
Once we are sure ersilia is recognised in the CLI, we will test a very simple model:
ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v api calculate -i "CCCC"
This calculates the molecular weight of the input molecule. The output should be printed in your CLI and look like:
Your second task for the contribution period is to help us debug any issues when running models with the Ersilia Command Line Interface. You can find more information about it on the Model Usage guide.
- 1.Use one of the empty <username> column fields
- 2.Add the system you are using on the first row.
- 3.Colour the cells of your column corresponding to the models you have tested:
- 1.Green, if they run without a problem
- 2.Red, if there was an issue
- 4.If you ran the model successfully:
- 1.Write the fetch and predict times in the Excel cell.
- 2.Go on to the next model, until you have tested the models assigned to you.
- 5.If you encountered a problem:
- 1.Go to Ersilia's GitHub repository and check whether a Bug issue already exists for this model. If it does, add your error to the same thread.
- 2.If not, open a Bug issue (use the provided template) and add the model identifier (eosxxxx) as the issue title.
- 3.Add the log of the error, which can be obtained by adding > my.log 2>&1 at the end of the command, for example: ersilia -v fetch modelname > my.log 2>&1
- 4.Let's work on debugging this issue together before moving on!
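If you are working from a notebook or a script rather than a terminal, the same kind of log can be captured from Python. The following is only a minimal sketch of what the `> my.log 2>&1` redirection does (standard output and standard error written to a single file). The helper name `run_and_log` is hypothetical and not part of Ersilia.

```python
import subprocess


def run_and_log(command, log_path):
    """Run a command and write both stdout and stderr to one log file.

    This mirrors the shell redirection `> my.log 2>&1`.
    `run_and_log` is an illustrative helper, not an Ersilia function.
    """
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            command,
            stdout=log_file,           # normal output goes to the log
            stderr=subprocess.STDOUT,  # errors go to the same log
        )
    return result.returncode


# Hypothetical usage; replace modelname with the identifier you are testing:
# run_and_log(["ersilia", "-v", "fetch", "modelname"], "my.log")
```

A non-zero return value usually means the command failed, in which case the log file is what you would attach to the Bug issue.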
We have provided a prepared list of test molecules in .csv format. Download it and use it for model testing. Store the output of the model in a .csv file whose name contains the model identifier and the date of the prediction. Upload the output .csv file to the shared folder.
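One way to keep output file names consistent is to build them programmatically. The sketch below assumes a `<model identifier>_<YYYY-MM-DD>.csv` pattern; the guidelines only require that the identifier and the date appear in the name, so both the exact format and the helper name `output_filename` are assumptions.

```python
from datetime import date


def output_filename(model_id, run_date=None):
    """Build an output file name containing the model identifier and a date.

    The `<model_id>_<YYYY-MM-DD>.csv` pattern is an assumption; any name that
    includes both pieces of information satisfies the guideline.
    """
    run_date = run_date or date.today()  # default to the day the prediction is run
    return f"{model_id}_{run_date.isoformat()}.csv"


print(output_filename("eos3b5e", date(2022, 10, 12)))  # eos3b5e_2022-10-12.csv
```

Using the ISO date format means the files sort chronologically when listed by name.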
Once you have successfully tested 5 models, go on and try to repeat the exercise in Google Colaboratory. You can also test one model in the CLI and in Google Colab in parallel. You can use the Colaboratory Template Notebook from Ersilia's repository.
Google Colaboratory uses Google's servers (Linux machines) to run the code, which is very convenient to bypass installation issues. To run Google Colab, you only need a Google Account. If you do not have a Google Account and do not wish to open one, we can skip this step.
Once you have:
- Successfully installed the Ersilia Model Hub on your computer
- Successfully run predictions for at least 3 models using the command line AND debugged any issues you might have.
- Successfully run the same five models in Google Colaboratory using the Ersilia PyPI package
We are ready to continue onto the next stage of the contribution period 🎉
For this period, we will use Ersilia's automated ML modelling packages:
We will leverage the datasets from the excellent initiative Therapeutics Data Commons (TDC). You can read more about it in its associated publication. TDC has prepared biomedical-related datasets for ML modelling, and provides benchmarks of performance. We will use those to test our automated ML libraries and add the resulting models in the Ersilia Model Hub.
For this part, please do not launch directly into modelling. Wait for the mentors' review of your previous work:
- 1.Once you have completed all the above steps, mention the mentors on the Slack channel #stage1-contributions and explain all the steps done in the first phase. Add your GitHub handle in the issue!
- 2.The mentors will open a GitHub issue with the specific modelling exercise and assign it to you.
- 4.Answer these questions (and more that you might have) about your data and models:
- 1.What are we trying to predict? (do some reading!)
- 2.Is this a classification or a regression problem?
- 3.How many datapoints do we have in total and in each set (train / validation / test)?
- 4.How many actives and inactives do we have?
- 5.Can you show in a plot the distribution of actives/inactives, or the values for the regression?
- 6.What is the performance of your model? What metrics are you using and why?
- 7.How do you interpret the ROC Curve you got?
- 8.How could we improve the model?
- 5.Finally, comment on at least TWO of your peers' models, discussing their results and suggesting ideas or improvements if needed.
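As a starting point for the questions about actives/inactives and model performance, here is a minimal sketch in plain Python. It assumes a binary classification task with labels 1 (active) and 0 (inactive). The toy labels and scores are made up for illustration, and the helper names `class_balance` and `roc_auc` are hypothetical. In practice you would more likely use a library function such as scikit-learn's `roc_auc_score`; the version here only shows what the number means.

```python
from collections import Counter


def class_balance(labels):
    """Count actives (1) and inactives (0) in a list of binary labels."""
    counts = Counter(labels)
    return {"actives": counts.get(1, 0), "inactives": counts.get(0, 0)}


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    A pair counts 1 if the active has the higher score and 0.5 on a tie.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy data, invented for this example only
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.5, 0.1, 0.8, 0.3, 0.6]

print(class_balance(y_true))               # {'actives': 3, 'inactives': 5}
print(round(roc_auc(y_true, y_score), 3))  # 0.867
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from inactives, which is a useful frame when interpreting your ROC curve. With strongly imbalanced classes, consider reporting an additional metric and explaining why you chose it.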