Outreachy Summer 2025

Please find here the guidelines for the Outreachy contribution period, which runs from 17th March to 15th April 2025.

Internship Project

The Ersilia Model Hub is an open source platform of ready-to-use AI/ML models for biomedical research. With it, scientists can browse a collection of models, select the ones relevant to their research and run predictions, for example to predict whether a molecule will be active against malaria. The internship project will focus on increasing the collection of available models in the Hub.

The contribution period is especially important to Ersilia. We list below a number of tasks that must be completed in order. We will not judge interns on how many tasks they complete, but on the quality of each contribution and on their interest in learning, participating and helping others in the community. The mentors are there and willing to help, but they also have a dedicated time slot for reviewing contributions, so please be patient if it takes a few hours to get back to you. Almost all the information you will need to successfully complete the contribution period is in this document; please read it all before asking questions.

Transparency note: this internship requires strong knowledge of the Python programming language. If you are not yet an expert programmer, we recommend you look for other projects to contribute to, where your expertise might be better suited.

Please note that we have a zero tolerance policy for plagiarism. All text written by interns will be reviewed to ensure that no plagiarism and no AI writing tools have been involved. While AI tools can be helpful in certain circumstances, here we need to learn about you in your own words. Using ChatGPT or similar tools to write letters of interest, summaries of publications or similar texts will immediately disqualify you from further participation in the program.

Signing up for Outreachy

Interested applicants must be accepted in the Outreachy internship program. Please go to Outreachy's website and be sure to fill in the application according to timelines.

Ersilia can only accept interns who have been approved by Outreachy and who comply with the necessary requirements. Please check that your availability for the internship period fulfills Outreachy's requirements.

Contribution Period

The contribution period runs from March 17th to April 15th. During this time, interested applicants are welcome to contribute to Ersilia's project following the guidelines in this document.

The contribution period is organised into 4 weeks. Each week has a set of specific goals, so that mentors can evaluate each intern's experience, interest in the community and team-building work. Once the week's objectives have been met, please focus on:

Improving your contribution (there are always more publications to read, better bug reports to write, etc.)
Helping out other contributors (we really value group work)

We will be using GitHub issues to track the work of each contributor.

📆 WEEK 1: Get to know the community

The first week is focused on getting to know the Ersilia community, our mission and how we work to achieve it. We will be interacting a lot during the contribution period, so it is best to get to know your peers, your mentors and a bit more about Ersilia.

Task 1: Join the communication channels

Slack: we use Slack as our main communication platform, both for the contribution period and afterwards to work with the selected interns. If you have never used this tool, don't worry, it is quite intuitive!

Sign up using your preferred email and name. It is helpful to add your GitHub user name in brackets if it does not match your name, for example: John Doe (jdgithub)
Introduce yourself in the #intros channel.
Use the dedicated channels for questions about specific topics. If you feel more channels are needed, feel free to ask the mentors for them.
Answer your peers' questions. This is about helping each other, and we really value interns who work with the community.

We try to work as openly as possible, so we encourage all contributors to post in the open channels rather than in private conversations.

Please use a Slack name that is easy to identify with your GitHub handle to make it easy for mentors to review contributions and tag people.

GitHub: we will be working based on the Ersilia Model Hub main repository, which is hosted on GitHub. You can start by:

📖 Getting familiar with the repository structure
🐛 Checking the issues to see what the community has been working on
👀 Watching the repository to receive notifications if you are mentioned
⭐ Starring the repository if you like the work Ersilia is doing!

Ersilia is an Open Source community with active contributing members. Please respect the work of others.

We will be using GitHub issues a lot, so if you have never worked with GitHub before, make sure you understand how issues work!

Community call: we will hold a community call on March 20th at 17:00 CET to go over the contribution period tasks and answer any questions you might have! The link will be shared via Slack. Attendance is not compulsory. We have tried to find a time that is acceptable for most time zones, and we apologize in advance if it means an early start or a late end to your day.

Code of Conduct: Ersilia adheres to the Contributor Covenant Code of Conduct. Any breach of the code of conduct, especially harassment or lack of respect for fellow contributors, will mean disqualification as an applicant.

Task 2: Install the Ersilia Model Hub

We will be using the Ersilia Model Hub throughout the internship. Please follow the installation instructions. If you have a UNIX machine (Linux or macOS) you can install Ersilia directly. If you are using a Windows machine you will need a virtual machine or the Windows Subsystem for Linux (WSL).

For Windows users, we recommend using WSL, with Visual Studio Code to access it.

A common mistake is to forget to install Git-LFS, which is required for many models. Please make sure you install it! We also prioritize working with Dockerised models, so make sure to install Docker and Docker Desktop.

Testing that Ersilia works: we will first make sure ersilia works by running the following commands:

ersilia --help #this should output the command options for ersilia

Once we are sure ersilia is recognised in the CLI, we will test a very simple model:

ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v run -i "CCCC"

This model calculates the molecular weight of the input molecules. The output should be printed in your CLI and look like:

{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "mw": 58.123999999999995
    }
}
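
As a sanity check on the expected value, the molecular weight of butane (SMILES CCCC, formula C4H10) can be recomputed by hand. The following is a minimal Python sketch, not part of Ersilia: the function name is hypothetical, and the atomic masses for carbon and hydrogen are the standard average values rounded to three decimals.

```python
import math

# Standard average atomic masses in g/mol (assumed values, rounded to 3 decimals).
ATOMIC_MASS = {"C": 12.011, "H": 1.008}


def molecular_weight(atom_counts):
    """Sum atomic masses for a formula given as {element: count}."""
    return sum(ATOMIC_MASS[element] * count for element, count in atom_counts.items())


# Butane, written "CCCC" in SMILES, has the formula C4H10.
butane_mw = molecular_weight({"C": 4, "H": 10})

# Compare with the "mw" value reported in the expected output (58.124 after rounding).
assert math.isclose(butane_mw, 58.124, abs_tol=1e-3)
print(round(butane_mw, 3))
```

If the value returned by the model differs noticeably from this hand calculation, it is worth describing the discrepancy in your GitHub issue.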

These tests do not work, what now?! Write down the challenges you are facing in your GitHub issue, and ask your peers for support through the Slack channel.

Ersilia models can be fetched from a whole host of places, namely S3 buckets, GitHub, DockerHub, and even a local repository of the model!

To ensure model dependencies are self-contained, Ersilia models are "dockerized", and Ersilia fetches them through DockerHub by default if you have Docker installed. To complete this task, make sure you have Docker installed, or install it first if you do not.

Pull a model image. Here we use another simple model from the Hub:

docker pull ersiliaos/eos4wt0:latest

Activate the environment where you have installed ersilia, and test this model that you have just fetched from DockerHub:

ersilia serve eos4wt0 # Notice that you don't have to fetch it through ersilia here.
ersilia -v run -i "CCCC"

This generates the Morgan Fingerprints for a molecule and the output should be printed in your shell like this:

{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "outcome": [
            0.0,
            0.0,
            0.0,
            0.0,
            ...,
        ]
    }
}

Task 3: Motivation statement

Write, in a thread in your issue, your motivation for joining Outreachy and, in particular, why you are interested in working at Ersilia. A good motivation letter will explain which of your current skills are relevant to Ersilia, your reasons for working on Ersilia's project, how this would advance your career and what your plans are during and after the internship.

Task 4: Obtain approval of the introductory tasks to continue contributing.

If what you have seen and learnt in this first week is appealing and you want to continue working with us, please open an issue on the outreachy-contributions repository and:

Detail which operating system you are using
Describe your tests of Ersilia and Docker: demonstrate that you are able to run models and get predictions. Examples of thorough testing include understanding the different output formats available for Ersilia models, the types of model available, etc.
Include your motivation statement for working at Ersilia

We will not be assigning issues to, or reviewing code contributions from, applicants who have not received the OK to continue working on their application after completing the first week's assignments.

📆 WEEKS 2 and 3: Apply Ersilia Models to a modelling task

Once you have successfully completed all the entry-level tasks and received the OK from your mentors to continue contributing, go ahead to the outreachy-contributions repository to start working! Please make sure to follow this documentation step by step to succeed in this modelling exercise. We have tried to indicate what we are looking for in each step:

1. Download a dataset of interest

There are plenty of datasets for drug discovery exercises. Here, we suggest using a dataset from the Therapeutics Data Commons, whose datasets are already prepared for ML modelling. When choosing your dataset, consider:

Classifiers are more easily modelled than regressors. We strongly suggest selecting a classification problem.
Understanding the background data. Could I follow the original data collection protocol? Do I understand what the endpoint is, and could I explain it in my own words?
My computational capacity. Large datasets will occupy more space once data is featurised.

Once you are sure which dataset you will model, download it using the Therapeutics Data Commons Python package. Keep all code in notebooks or scripts and save the data in the /data folder.
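
The download step could be sketched as follows. This is only an illustration under stated assumptions, not the required approach: it assumes the PyTDC package is installed, picks the hERG toxicity dataset purely as an example of a classification task, and uses hypothetical function names and file naming (`split_file`, `download_dataset`, `hERG_train.csv`).

```python
from pathlib import Path


def split_file(data_dir, dataset_name, split_name):
    """Return the CSV path for one split, e.g. data/hERG_train.csv."""
    return Path(data_dir) / f"{dataset_name}_{split_name}.csv"


def download_dataset(dataset_name="hERG", data_dir="data"):
    """Download a TDC toxicity dataset and save each split as a CSV file."""
    # Imported here so the helper above works without PyTDC installed.
    from tdc.single_pred import Tox

    Path(data_dir).mkdir(parents=True, exist_ok=True)
    data = Tox(name=dataset_name, path=data_dir)
    # get_split() returns a dict of pandas DataFrames keyed by split name.
    splits = data.get_split()
    for split_name, frame in splits.items():
        frame.to_csv(split_file(data_dir, dataset_name, split_name), index=False)
    return splits
```

Calling `download_dataset()` from a notebook or script in the repository root would then leave one CSV file per split in the data folder. Datasets from other TDC groups use a different loader class, so check the TDC documentation for the dataset you choose.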

At this point, you should start preparing the documentation for your project. How do I install and run it? Which notebooks/scripts do I need to run, and in which order? What will I find in the folders? Use the README file for documentation.

Evaluated tasks:

Basic comprehension of drug discovery tasks
Installing and running a third party Python package
Documentation
Working with GitHub: forks, issues and more

2. Featurise the data

The first step in ML modelling is to featurise the data (i.e. convert the molecules to numerical vector representations). There are several ways of doing so. In this project, we ask you to browse the Ersilia Model Hub and select a featuriser from the models available in the Hub:

Look at the "Representation" labelled models in Ersilia
Select one featuriser and explain why

Again, this section should be reproducible by following the instructions in the README file and the code available.
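
To keep the featurisation step reproducible, the Ersilia commands can be driven from a script. The sketch below is a hedged illustration: the function names are hypothetical, the single `smiles` column header is an assumption about the input file, and the `-o` flag for writing results to a file is also an assumption that should be confirmed with `ersilia run --help`. The fetch, serve and run commands otherwise mirror the ones used in Task 2.

```python
import csv
import subprocess


def write_smiles_csv(smiles_list, path):
    """Write one SMILES string per row under a single 'smiles' header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles"])
        writer.writerows([smiles] for smiles in smiles_list)


def featurise_commands(model_id, input_csv, output_csv):
    """Build the fetch, serve and run commands for one Ersilia model."""
    return [
        ["ersilia", "-v", "fetch", model_id],
        ["ersilia", "serve", model_id],
        # The -o flag (write results to a file) is an assumption here.
        ["ersilia", "-v", "run", "-i", input_csv, "-o", output_csv],
    ]


def featurise(model_id, input_csv, output_csv):
    """Run the commands in order, stopping at the first failure."""
    for command in featurise_commands(model_id, input_csv, output_csv):
        subprocess.run(command, check=True)
```

For example, `featurise("eos4wt0", "data/train_smiles.csv", "data/train_features.csv")` would featurise a file of SMILES with the Morgan fingerprint model used earlier, provided Ersilia is installed in the active environment. The file names here are placeholders.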

Evaluated tasks:

Basic comprehension of molecular featurisation
Installing and running the Ersilia Model Hub
Documentation

3. Build an ML model

Using the selected dataset and featuriser, build a simple ML model. We suggest using one of the following packages:

XGBoost
FLAML
scikit-learn

Again, use scripts or Python notebooks and make sure all steps are reproducible. Apply a train-test split, then validate the model and discuss the results. For the latter, please use matplotlib to create easy-to-interpret graphs. Document everything in the README file.
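
To make the validation step concrete, here is a dependency-free Python sketch of a seeded train-test split and a few common classifier metrics. The function names are hypothetical and the code is for illustration only. In practice the suggested packages provide equivalents, for example `train_test_split` and `roc_auc_score` in scikit-learn.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def threshold_metrics(y_true, y_score, threshold=0.5):
    """Compute accuracy, precision and recall at a fixed score threshold."""
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Fixing the seed keeps the split identical between runs, which matters for reproducibility. Reporting more than one metric helps when the classes are imbalanced, since accuracy alone can look good for a model that always predicts the majority class.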

Evaluated tasks:

Basic comprehension of ML frameworks
Evaluation of ML models
Plotting with python libraries
Documentation

4. Prepare your code for review

Once you are happy with your modelling exercise, it is time to evaluate it critically:

Is my documentation thorough? (Think as if someone were seeing this code for the first time: would they be able to reproduce it?)
Is the evaluation of my model good enough? Could I do something to improve it?

If all of the above checks out, update your issue to ask for review and feedback!

Evaluated tasks:

Consistency and thoroughness

5. Stretch tasks

There are always ways to improve our work. In this particular project, you might want to:

Explore other featurisers
Try different ML architectures
Try the model on public data (for example from ChEMBL)

📆 WEEK 4: Submit your Final Application

Focus the last week ONLY on writing your final application to Outreachy. Mentors will not review any contributions during the last week, only final applications, to ensure we can provide feedback on them.

The Outreachy internship project will differ from the contribution period. The contribution period allows mentors to evaluate the interns' skills and select those most likely to succeed during the internship period. Please look at the original project description on the Outreachy webpage to understand what you will work on during the internship.

Week 4 is the time to go to the main Ersilia Model Hub repository and look at the work we are doing! This will give you a better idea of the type of tasks you will complete. In the final application, we want to understand what your learning goals are and how you think you can achieve them. You can include things like how and when you will report on tasks, which aspects of the Ersilia Model Hub backend development motivate you most, where you think your skills are best suited, etc.

Final applications must be submitted through the Outreachy website on time. We will not be able to provide help or support for last-minute internet connection problems, late submissions and other issues. Please make sure you fill in the application in good time.
