Assignment 6: Use of docking methods for drug design

Part B: Combining molecular graph neural networks and structural docking to select potent PD-1/PD-L1 small-molecule inhibitors

Introduction

The Programmed Cell Death Protein 1 / Programmed Death-Ligand 1 (PD-1/PD-L1) interaction is an immune checkpoint exploited by cancer cells to suppress the immune response. There is a pressing need for small-molecule drugs, which are faster acting, cheaper, and more bioavailable than antibodies. Unfortunately, blindly synthesizing and validating large libraries of small molecules as PD-1/PD-L1 inhibitors is both time-consuming and expensive. Here we use a machine learning method called EGNN, which combines graph neural networks (GNNs) with docking scores; we call these the local and global features, respectively. GNNs are models that pass messages between the nodes and edges of a graph. These graphs can represent almost anything; in our case, the nodes are the atoms and the edges are the bonds of a molecule. Each node retains a state that aggregates information from its neighbours up to an arbitrary depth, which we call the sub-graph radius. The docking scores, in turn, serve as global features that capture the interactions of each molecule with the target. The EGNN model therefore takes both local and global features to select potent small-molecule inhibitors.
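To make the idea concrete, the sketch below shows the core pattern in PyTorch: a molecular vector learned by message passing (the local features) is concatenated with the docking-score vector (the global features) before classification. This is an illustration of the concept only, not the actual EGNN implementation; all layer sizes and the toy inputs are arbitrary.

import torch
import torch.nn as nn

class ToyEGNN(nn.Module):
    """Illustrative only: combine message-passed atom states with docking scores."""
    def __init__(self, atom_dim=10, dock_dim=96, hidden=32):
        super().__init__()
        self.msg = nn.Linear(atom_dim, atom_dim)        # one message-passing step
        self.classify = nn.Sequential(
            nn.Linear(atom_dim + dock_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                       # active / inactive logits

    def forward(self, atom_feats, adj, dock_scores):
        # Local features: each atom aggregates its bonded neighbours' states
        h = torch.relu(atom_feats + adj @ self.msg(atom_feats))
        mol_vec = h.mean(dim=0)                         # pool atoms into one molecular vector
        # Global features: the 96 docking scores for this molecule
        return self.classify(torch.cat([mol_vec, dock_scores]))

# Toy 5-atom chain: random atom features, chain adjacency, 96 docking scores
atoms = torch.randn(5, 10)
adj = torch.tensor([[0, 1, 0, 0, 0],
                    [1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 0]], dtype=torch.float)
docks = torch.randn(96)
logits = ToyEGNN()(atoms, adj, docks)
print(torch.softmax(logits, dim=0))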

Ref: Graph Neural Networks: A Review of Methods and Applications

Objective

This assignment will teach you how to use machine learning and molecular modeling, specifically graph neural networks in combination with molecular docking, to select potent small-molecule PD-1/PD-L1 inhibitors.

Procedure

Step 01.a: Downloading required files and scripts

All the files and scripts required for the assignment can be downloaded from here. Make a subdirectory in your working directory for this assignment, create the folders fullmodel/, input/, test/, and train/ inside it, and copy all the scripts into the subdirectory. You should now have four folders and the downloaded scripts in your working directory. Place the data.txt and test.txt files in the input/ directory. The data.txt file contains SMILES strings, 96 CANDOCK docking scores against the PD-L1 homodimer, and each molecule's active/inactive status; the test.txt file likewise contains SMILES strings and docking scores. The information in data.txt will be used to train the machine learning model (EGNN), and the trained model will then be used to predict the activity of the molecular library in test.txt.
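For example (assignment6/ is a hypothetical name for the subdirectory, and the source path for the scripts is a placeholder for wherever you saved the download):

mkdir assignment6
cd assignment6
mkdir fullmodel input test train
cp ~/Downloads/scripts/* .    # placeholder path: copy the downloaded scripts here
mv data.txt test.txt input/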

Step 01.b: Creating a Python environment and installing required packages

Run the commands below to create the Python environment and install the required packages:

module load anaconda
rcac-conda-env create -n egnn -j
source activate egnn
conda install pytorch -c pytorch
conda install -c rdkit rdkit
conda install -c anaconda scikit-learn
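
You can optionally verify the installation with a quick import check:

python -c "import torch, rdkit, sklearn; print('environment ok')"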

Step 02: Preprocessing data

This step preprocesses the given data: it reads both data.txt and test.txt and saves the data separately in the form required by the training step. The sub-graph radius can be changed within the preprocess_data.sh script; it is set to 2 by default. The following command runs this step.

bash preprocess_data.sh

If the script runs successfully, you will see the prepared files in the train/ folder.
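
To get an intuition for what the sub-graph radius means, here is a small illustrative sketch (not part of the assignment scripts) that uses RDKit to enumerate the radius-2 bond environment around each atom of a toy molecule:

from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol as a toy example
radius = 2
for atom in mol.GetAtoms():
    # Indices of the bonds reachable within `radius` bonds of this atom
    env = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom.GetIdx())
    print(atom.GetSymbol(), atom.GetIdx(), list(env))

Each atom's state in the GNN summarizes exactly this kind of neighbourhood.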

Step 03: Run training

Open the run_training.sh script and inspect its parameters. Use the same sub-graph radius that you used when preprocessing the data, and keep the other parameters at their defaults. You can use the following command to start training the model.

bash run_training.sh

After training finishes, you will see a model file and a text file in the fullmodel/ directory. The text file records the epoch, time, training loss, validation loss, area under the curve (AUC), precision, recall, and F1 score.
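Since several of the questions below ask for plots of these metrics, here is a minimal parsing-and-plotting sketch. The file name fullmodel/results.txt and the column layout (whitespace-separated, epoch first, F1 last) are assumptions; adjust them to match the actual output file:

import matplotlib.pyplot as plt

epochs, f1 = [], []
with open("fullmodel/results.txt") as fh:   # assumed file name
    for line in fh:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip header or blank lines
        epochs.append(int(fields[0]))       # assumed: epoch is the first column
        f1.append(float(fields[-1]))        # assumed: F1 score is the last column

plt.plot(epochs, f1)
plt.xlabel("epoch")
plt.ylabel("F1 score")
plt.savefig("f1_vs_epoch.png")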

Step 04: Train the full model with bootstrapping and make predictions for the test set

Use the following command to start training with bootstrapping. This will train the model one hundred times with one hundred different random seeds and record all the data in the bootstrapping_results directory. Keep the same parameters that you used in Step 03.

sbatch train_full.sh
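
The sketch below illustrates bootstrapping in the statistical sense, resampling the training set with replacement under a different seed per run (train_full.sh may instead vary only the random seed; this is an illustration of the idea, not the actual script):

import random

def bootstrap_samples(dataset, n_runs=100):
    """Yield one resampled copy of the dataset per random seed."""
    for seed in range(n_runs):
        rng = random.Random(seed)  # a different seed for every run
        # Sample with replacement: some molecules repeat, others are left out
        yield seed, [rng.choice(dataset) for _ in range(len(dataset))]

# Toy usage with (SMILES, label) pairs standing in for the training set
data = [("CCO", 1), ("c1ccccc1", 0), ("CCN", 1)]
for seed, sample in bootstrap_samples(data, n_runs=3):
    print(seed, sample)

Training one model per resampled set yields a distribution of predictions for every test molecule, rather than a single point estimate.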

After training with bootstrapping is done, create a subdirectory called bootstrapping_results/ in the working directory. Then use the following command to make predictions. Again, keep the same parameters.

bash run_test.sh

A CSV file with all the predictions for the test set will be generated in the working directory. The take_counts.py script can be used to calculate the fraction of bootstrapped models that predict each molecule as active. This number can then be combined with the main results using the combining_smiles_bootstrapping_average_countsover0.5.py script. The scripts can be used as follows.
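The exact arguments depend on the provided scripts; assuming they take none, the invocations would be:

python take_counts.py
python combining_smiles_bootstrapping_average_countsover0.5.py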

Questions

  1. Change the dim (molecular vector dimension) parameter in the run_training.sh script and run it for six different values (e.g., dim = 6, 7, 8, 9, 10, 15). Then plot the average F1 score against epoch for these dim values, all in the same plot.
  2. Generate similar plots for the radius parameter (radius = 2 and 3) and the hidden_layer parameter. Based on the plots you have generated, determine the most suitable parameters to use when training the EGNN on this data.
  3. Calculate and report the average training F1 score, average validation F1 score, average training AUC, and average validation AUC for the EGNN model over five folds.
  4. Give the structures of the top five predictions in your final results, with their average softmax scores and standard deviations.
  5. What is the effect of bootstrapping in EGNN?