Drug discovery with Deep Learning under 10 lines of code

This post was originally published by Kexin Huang at Towards Data Science

DeepPurpose Framework

Now that we have a conceptual overview of drug-target interaction (DTI) prediction and DeepPurpose, I will introduce the DeepPurpose programming framework. The framework consists of several steps, each of which takes one line of code:

  • Data loading
  • Encoder specification
  • Data encoding and split
  • Model configuration generation
  • Model initialization
  • Model training
  • Repurposing/Screening
  • Model saving and loading

Let’s go over each step! For a better learning experience, it is recommended to follow these steps in your own Jupyter notebook or to go through the blog post notebook. To run DeepPurpose, you can use the DeepPurpose Binder cloud notebook (simply click the link!) or set up a local DeepPurpose environment. The installation instructions can be found here, where you can also find video installation tutorials.

Data loading

DeepPurpose takes in a NumPy array of drug SMILES strings (X_drugs), a NumPy array of target protein amino acid sequences (X_targets), and a NumPy array of labels (y), which can be either binary 0/1 values indicating the interaction outcome or real numbers indicating the binding affinity. The input drug and target arrays must be paired, i.e. y[0] is the score for X_drugs[0] and X_targets[0]. DeepPurpose automatically chooses between binary classification and regression depending on the dataset.
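To make the expected layout concrete, here is a minimal sketch of the three paired inputs, built with plain Python lists for brevity (in practice they would be NumPy arrays). The toy sequences and scores are made up, and the `infer_task` helper is hypothetical: it only illustrates the idea of choosing a task from the labels and is not DeepPurpose's own detection logic, which may differ.

```python
# Toy paired inputs: index i of each list describes one drug-target pair.
X_drugs = ["CCO", "CC(=O)O", "c1ccccc1"]      # SMILES strings
X_targets = ["MKVLAA", "MKVLAA", "GSHMLE"]    # toy amino acid sequences
y_affinity = [5.2, 7.9, 6.1]                  # real-valued binding affinities
y_binary = [0, 1, 0]                          # 0/1 interaction outcomes


def infer_task(labels):
    """Hypothetical helper: guess the task type from the label values."""
    if all(label in (0, 1) for label in labels):
        return "classification"
    return "regression"


# The three inputs must have the same length so that they stay paired.
assert len(X_drugs) == len(X_targets) == len(y_affinity)

print(infer_task(y_affinity))  # regression
print(infer_task(y_binary))    # classification
```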

Besides building these NumPy arrays yourself, you can use the benchmark dataset loaders that DeepPurpose provides (DAVIS/KIBA/BindingDB) to ease preprocessing. For example, in this post, we will use the DAVIS dataset:
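A minimal sketch of the loading step follows. It assumes the `dataset.load_process_DAVIS` loader and the argument names shown in the DeepPurpose README at the time of writing; both may differ between versions, so treat them as assumptions. The wrapper name `load_davis` is ours, and the import sits inside the function only so that the snippet can be pasted without triggering the download right away.

```python
def load_davis(data_dir="./data"):
    """Download (if needed) and load DAVIS as paired drug/target/label arrays."""
    # Lazy import: DeepPurpose is only required when the function is called.
    from DeepPurpose import dataset

    # Assumed arguments: binary=False keeps real-valued affinities (regression),
    # convert_to_log=True log-transforms the affinity values, and threshold is
    # the cut-off that would apply if binary labels were requested instead.
    X_drugs, X_targets, y = dataset.load_process_DAVIS(
        path=data_dir, binary=False, convert_to_log=True, threshold=30
    )
    return X_drugs, X_targets, y
```

Calling `X_drugs, X_targets, y = load_davis()` would then give the three paired arrays described above.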

You can also load your own dataset from a txt file using the dataset.read_file_training_dataset_drug_target_pairs function, where each line contains a drug SMILES string, a target amino acid sequence, and the binding score.
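The file layout can be illustrated with a small stand-in parser. This sketch assumes the three fields on each line are separated by whitespace; `parse_pairs` is a hypothetical function written for illustration and is not the library's reader, so check the DeepPurpose documentation for the exact format it expects.

```python
import io


def parse_pairs(handle):
    """Parse lines of '<SMILES> <sequence> <score>' into three paired lists."""
    X_drugs, X_targets, y = [], [], []
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        smiles, sequence, score = fields
        X_drugs.append(smiles)
        X_targets.append(sequence)
        y.append(float(score))
    return X_drugs, X_targets, y


# A two-line toy file; real protein sequences are far longer.
toy_file = io.StringIO("CCO MKVLAA 5.2\nCC(=O)O GSHMLE 7.9\n")
X_drugs, X_targets, y = parse_pairs(toy_file)
print(X_drugs, X_targets, y)
```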

Encoder specification

Once the data is in the required format, we first need to specify the encoders to use for the drug and the protein. Here we try MPNN for the drug and CNN for the target. Note that you can switch encoders just by changing the encoding name. The full list of encoders is available here.

(If you are running this blog post on a CPU, you will find that MPNN and CNN are a bit large. You can try smaller encoders instead, such as Morgan for drugs and Conjoint_triad for proteins.)
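Since an encoder is selected purely by its name, the choice can be written as a pair of strings. The sketch below uses the four encoder names mentioned above; the `pick_encoders` helper and its `use_gpu` flag are hypothetical conveniences for this post, not part of DeepPurpose.

```python
def pick_encoders(use_gpu):
    """Return (drug_encoding, target_encoding) names for the available hardware."""
    if use_gpu:
        return "MPNN", "CNN"             # larger neural encoders
    return "Morgan", "Conjoint_triad"    # smaller encoders suggested for CPU runs


drug_encoding, target_encoding = pick_encoders(use_gpu=False)
print(drug_encoding, target_encoding)  # Morgan Conjoint_triad
```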

Data encoding and split

Now, we need to prepare the data encoding scheme for the chosen encoders, using the utils.data_process function. In this function, we can specify the train/validation/test split fractions, a random seed that makes the data split reproducible, and a data splitting method. Methods such as cold_drug and cold_protein split on drugs or proteins, so that the model is evaluated on drugs or proteins it has not seen during training, which tests its robustness. The function outputs the train, validation, and test sets as Pandas data frames.
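To show what a cold-drug split means, here is a minimal pure-Python sketch. It is an illustration of the idea only, not DeepPurpose's implementation: `cold_drug_split` is a hypothetical helper, it produces just a train and a test set for brevity, and the toy data are made up.

```python
import random


def cold_drug_split(X_drugs, X_targets, y, test_frac=0.2, seed=1):
    """Split pairs so that no drug in the test set appears in the training set."""
    unique_drugs = sorted(set(X_drugs))
    rng = random.Random(seed)  # a fixed seed gives a reproducible split
    rng.shuffle(unique_drugs)
    n_test = max(1, int(len(unique_drugs) * test_frac))
    held_out = set(unique_drugs[:n_test])

    train, test = [], []
    for pair in zip(X_drugs, X_targets, y):
        (test if pair[0] in held_out else train).append(pair)
    return train, test


X_drugs = ["CCO", "CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCN"]
X_targets = ["MKV", "GSH", "MKV", "GSH", "MKV", "GSH"]
y = [5.2, 6.0, 7.9, 6.1, 4.8, 5.5]

train, test = cold_drug_split(X_drugs, X_targets, y)
train_drugs = {drug for drug, _, _ in train}
test_drugs = {drug for drug, _, _ in test}
print(train_drugs & test_drugs)  # empty set: every test drug is unseen in training
```

A cold-protein split would work the same way with the roles of drugs and targets swapped.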

Model configuration generation

Now, we generate the configuration for the model. In this function, you can modify almost any hyper-parameter (e.g., learning rate, number of epochs, batch size) and model parameter (e.g., hidden dimensions, filter size). The supported configurations are listed here.

For this blog post, we set the number of epochs to 3 so that the training runs quickly on both CPUs and GPUs and you can proceed to the next steps. For reference parameters, check out the notebooks in the DEMO folder.

Model initialization

Next, we initialize a model using the above configuration.

Model training

Now the model is ready to train: simply call the model.train function!

The training loss curve is printed out automatically. If the task is binary classification, the test set ROC-AUC and PR-AUC curves are printed out as well.
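Put together, the encoding, configuration, initialization, and training steps might look like the sketch below. It assumes the function and argument names from the DeepPurpose README at the time of writing (`utils.data_process`, `utils.generate_config`, `models.model_initialize`, and the `DTI` module imported as `models`); these may differ between versions. The split fractions, learning rate, and batch size are illustrative values, and the wrapper `train_dti_model` is our own name.

```python
def train_dti_model(X_drugs, X_targets, y, drug_encoding="MPNN", target_encoding="CNN"):
    """Encode and split the data, build a config, then initialize and train a model."""
    # Lazy imports: DeepPurpose is only required when the function is called.
    from DeepPurpose import utils
    from DeepPurpose import DTI as models

    # Data encoding and split: 70% train, 10% validation, 20% test.
    train, val, test = utils.data_process(
        X_drugs, X_targets, y,
        drug_encoding, target_encoding,
        split_method="random", frac=[0.7, 0.1, 0.2], random_seed=1,
    )

    # Model configuration: 3 epochs so that the run finishes quickly.
    config = utils.generate_config(
        drug_encoding=drug_encoding,
        target_encoding=target_encoding,
        train_epoch=3,
        LR=0.001,
        batch_size=128,
    )

    # Model initialization and training.
    model = models.model_initialize(**config)
    model.train(train, val, test)
    return model
```

With the paired arrays from the data loading step, `model = train_dti_model(X_drugs, X_targets, y)` would run the whole sequence.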

Automatically generated loss curve. Image by authors.


Repurposing/Screening

After training the model, we can repurpose and screen using the models.repurpose and models.virtual_screening functions.

For example, suppose we want to do repurposing from a set of antiviral drugs for a COVID-19 target 3CL protease. We have provided the corresponding data in the dataset wrappers.
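A sketch of this repurposing call is shown below. The dataset wrapper names (`load_antiviral_drugs`, `load_SARS_CoV2_Protease_3CL`), their return values, and the keyword arguments of `models.repurpose` are recalled from the DeepPurpose repository and should be treated as assumptions to verify against the installed version. The wrapper name `repurpose_antivirals` is ours.

```python
def repurpose_antivirals(model):
    """Score a library of antiviral drugs against the SARS-CoV-2 3CL protease."""
    # Lazy imports: DeepPurpose is only required when the function is called.
    from DeepPurpose import dataset
    from DeepPurpose import DTI as models

    # Assumed wrappers: drug SMILES with names and compound IDs,
    # plus the target's amino acid sequence and display name.
    X_repurpose, drug_names, drug_cids = dataset.load_antiviral_drugs()
    target, target_name = dataset.load_SARS_CoV2_Protease_3CL()

    # Score every candidate drug against the single target.
    return models.repurpose(
        X_repurpose=X_repurpose,
        target=target,
        model=model,
        drug_names=drug_names,
        target_name=target_name,
    )
```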

A ranked list of drug candidates is automatically generated and printed out:

Automatically generated ranked list of repurposing results. Image by authors.

Next, we showcase virtual screening on a sample of data from the BindingDB dataset, using the virtual_screening function to generate a list of drug-target pairs that have high binding affinities. If no drug/target names are provided, the index in the drug/target list is used instead. A similar-looking ranked list is generated.
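The ranking idea behind such a list can be sketched in a few lines of plain Python. This is an illustration only, assuming that a higher predicted score means stronger binding; `rank_pairs` is a hypothetical helper, the scores are made up, and the library's own output format may differ.

```python
def rank_pairs(scores, drug_names=None, target_names=None, top_k=3):
    """Rank drug-target pairs by predicted binding score, highest first."""
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    results = []
    for index, score in ranked[:top_k]:
        # Fall back to the pair's index when no names are supplied.
        drug = drug_names[index] if drug_names else f"Drug {index}"
        target = target_names[index] if target_names else f"Target {index}"
        results.append((drug, target, score))
    return results


predicted = [5.1, 7.8, 6.3, 4.9]  # made-up scores for four drug-target pairs
for rank, (drug, target, score) in enumerate(rank_pairs(predicted), start=1):
    print(rank, drug, target, score)
```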

Model saving and loading

Finally, saving and loading models is also easy. The loading function automatically detects whether the model was trained on multiple GPUs. An example to save and load the model we just trained:
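The sketch below assumes the `save_model` method and the `models.model_pretrained(path_dir=...)` loader as documented in the DeepPurpose README at the time of writing; the directory name and the wrapper `save_and_reload` are illustrative choices of ours.

```python
def save_and_reload(model, model_dir="./tutorial_model"):
    """Save a trained model to a directory and load it back."""
    # Lazy import: DeepPurpose is only required when the function is called.
    from DeepPurpose import DTI as models

    model.save_model(model_dir)  # store the trained model under model_dir
    return models.model_pretrained(path_dir=model_dir)
```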

We also provide a list of pretrained models; all available ones can be found in the list. For example, to load an MPNN+CNN model pretrained on the BindingDB Kd dataset:
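A sketch of that call follows. The `model` keyword and the identifier `MPNN_CNN_BindingDB` are recalled from the DeepPurpose README and are assumptions here; the exact pretrained model names depend on the library version, so check the list of available models before relying on this one.

```python
def load_pretrained_mpnn_cnn():
    """Load an MPNN+CNN model pretrained on BindingDB Kd (assumed identifier)."""
    # Lazy import: DeepPurpose is only required when the function is called.
    from DeepPurpose import DTI as models

    # The name below is an assumption; pick one from the pretrained model list.
    return models.model_pretrained(model="MPNN_CNN_BindingDB")
```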

That’s it! You can now train a state-of-the-art deep learning model for the drug-target interaction prediction task 👏!

DeepPurpose supports many more functionalities. For example, this demo shows how to use the Ax platform to apply recent hyperparameter tuning methods, such as Bayesian optimization, to DeepPurpose.
