[DL Edition] T038: Protein Ligand Interaction Prediction #290

Merged
merged 14 commits into DL_edition from rj-038-dti
Apr 11, 2023

Conversation

Old-Shatterhand
Collaborator

@Old-Shatterhand Old-Shatterhand commented Dec 8, 2022

Description

Proof of concept for GNN-based protein-ligand interaction prediction in talktorial T038.

Todos

  • Find an appropriate dataset
  • Fix bug in code to be able to train on batches larger than one sample

Questions

None

Status

Initial draft for further discussion

@Old-Shatterhand changed the title from "T038 first draft with a proof of concept" to "[DL Edition] T038 first draft with a proof of concept" on Dec 8, 2022
@Old-Shatterhand changed the title from "[DL Edition] T038 first draft with a proof of concept" to "[DL Edition] T038: Protein Ligand Interaction Prediction" on Dec 8, 2022
@Old-Shatterhand added the "new talktorial" label on Dec 8, 2022
@Old-Shatterhand
Collaborator Author

Old-Shatterhand commented Dec 9, 2022

Pull request update

After some minor updates to the notebook, the following work is left.

TODOS:

  • Find an appropriate dataset for training
  • Discussion: Put plots and explanations of training based on selected dataset
  • Questions: Come up with questions about the content

@Old-Shatterhand
Collaborator Author

Old-Shatterhand commented Dec 19, 2022

Talktorial review

Pull request for the talktorial on GNN-based protein-ligand interaction prediction.

Details

  • Talktorial ID: T038
  • Title: Protein-Ligand Interaction Prediction
  • Original authors: Roman Joeres
  • Reviewer(s): Andrea Volkamer, tbd
  • Date of review: DD-MM-YYYY (tbd)

Content

  • One line summary: Introduction of GNN-based protein-ligand interaction prediction
  • Potential labels or categories (e.g. machine learning, small molecules, online APIs): Kinases, Machine Learning, Graph Neural Networks
  • Time it took to execute (approx.): 1 hour
  • I have used the talktorial template and followed the content and formatting suggestions there
  • Packages must be open-sourced and should be installable from conda-forge. If you are adding new packages to the TeachOpenCADD environment, please check if already installed packages can perform the same functionality and if not leave a sentence explaining why the new addition is needed. If the new package is not on conda-forge, please list them and their intended usage here.
    • biotite, pypdb, chembl-webresource-client, rdkit: Already in TeachOpenCADD
    • torch 1.10.1, torch-geometric 2.2.0, torch-cluster 1.6.0, torch-scatter 2.0.9, torch-sparse 0.6.13, torch-spline-conv 1.2.1, cpuonly 2.0: From "dubious" sources (either their own conda channel (pytorch) or installed as a pip wheel). All of them are CPU-only.
  • Data must be publicly available, preferably accessible via a webserver or downloadable via a URL. Please list the data resources that you use and how to access them:

Content style

  • Talktorial includes cross-references to other talktorials if applicable
  • The table of contents reflects the talktorial story-line; order of #, ##, ### headers is correct
  • URLs are linked with meaningful words, instead of pasting the URL directly or linking words like here.
  • I have spell-checked the notebook
  • Images have enough resolution to be rendered with quality, without being too heavy.
  • All figures have a description
  • Markdown cell content is still in-line with code cell output (whenever results are discussed)
  • I have checked that cell outputs are not incredibly long (this applies also to DataFrames)
  • Formatting looks correct on the Sphinx render (bold, italics, figure placing)

Code style

  • Variable and function names follow snake case rules (e.g. a_variable_name vs aVariableName)
  • Spacing follows PEP8 (run Black on the code cells if needed)
  • Code lines are under 99 characters each (run black-nb -l 99)
  • Comments are useful and well placed
  • There are no unpythonic idioms like for i in range(len(list)) (see slides)
  • All 3rd party dependencies are listed at the top of the notebook
  • I have marked all code cells with output referenced in markdown cells with the label # NBVAL_CHECK_OUTPUT
  • I have identified potential candidates for a code refactor / useful functions
  • All import ... lines are at the top (practice part) cell, ordered by standard library / 3rd party packages / our own (teachopencadd.*)
  • I have used absolute paths instead of relative paths
    HERE = Path(_dh[-1])
    DATA = HERE / "data"
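
As an illustration of the "unpythonic idioms" item in the checklist above, here is a minimal sketch contrasting an index-based loop with `enumerate` and `zip`. The ligand names and labels are made-up example data, not taken from the talktorial.

```python
# Hypothetical example data, for illustration only.
ligands = ["aspirin", "imatinib", "caffeine"]
labels = [0, 1, 0]

# Unpythonic: loop over indices and look each item up again.
indexed = []
for i in range(len(ligands)):
    indexed.append((i, ligands[i]))

# Idiomatic: enumerate yields (index, item) pairs directly.
enumerated = list(enumerate(ligands))

# Idiomatic: zip walks several sequences in parallel.
paired = list(zip(ligands, labels))
```

Both loops build the same list of pairs; the `enumerate` version avoids the manual index lookup.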

Website

We present our talktorials on our TeachOpenCADD website (https://projects.volkamerlab.org/teachopencadd/), so we have to check as well if the Jupyter notebook renders nicely there.

  • If this PR adds a new talktorial, please follow these steps:
    • Add your talktorial to the complete list of talktorials here (at the end).
    • Add your talktorial to one or multiple of the collections here. Or propose a new collection section in your PR.
    • Add your talktorial's nblink file by running python generate_nblinks.py from within the directory teachopencadd/docs/talktorials.
    • Please compile the website following the instructions here.
  • Check the rendering of the talktorial of this PR.
  • Is your talktorial listed in the talktorial list?
  • Is your talktorial listed in the talktorial collections?
    • Add a picture for your talktorial in the collection view by following these instructions.

@gerritgr
Collaborator

  • "I will" (and elsewhere) -> "We will"? (I think the other notebooks use the first person plural; I have not checked, though.)
  • "field of protein ligand interaction prediction" -> "protein-ligand"
  • Maybe explain the terms protein and ligand very shortly in the intro.
  • Titles are lowercase (Sentence Case) in the other notebooks.
  • "..., I'll link to this otherwise, I'll explain new things below." -> "... , I will link to this. Otherwise, I will explain new things below."
  • "one wants to" -> wants
  • There are some other typos, but Grammarly or so can catch these I guess.
  • simple Feed-forward Neural Network (FNN): do you mean MLP? GNNs are also technically feed-forward networks.
  • State in the beginning of the workflow that this is a binary classification task.
  • " from the PDB entry with ID 4O75.": link to an earlier talktorial that explains the PDB.
  • explain C_alpha
  • In the "Technical background", say that we use the same GNN architecture for both ligands and proteins.
  • I think the BCE explanation should be clarified. What are the negative and positive samples here (binding and non-binding?)?
  • suppress DtypeWarning and add a comment explaining what kiba_preprocessing is doing
  • "Storing and representing data in PLI-prediction is a bit different from other neural networks. ": You mean the input data (i.e., the graphs), right? Maybe clarify this. Also, even though it is obvious, PLI was not introduced as an abbreviation.
  • Maybe merge together with the "Data Points" subsection.
  • I have not checked the code yet.
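
To make the BCE point above concrete, here is a minimal sketch of binary cross-entropy in plain Python. It assumes that label 1 means a binding protein-ligand pair and label 0 a non-binding pair; that assignment, the function name, and the example probabilities are illustrative assumptions, and the notebook may define them differently.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over a batch; y_pred holds predicted probabilities for label 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Assumed labelling: 1 = binding pair (positive), 0 = non-binding pair (negative).
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # predictions that agree with the labels
uncertain = [0.5, 0.5, 0.5, 0.5]  # predictions that carry no information

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
```

Predictions that agree with the labels give a lower loss than uninformative ones; for a constant prediction of 0.5 the loss equals ln 2.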

@Old-Shatterhand
Collaborator Author

I implemented Gerrit's comments and uploaded a new notebook.

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

  • Describe interaction a bit more. Is it only about binding? If so maybe mention classical approaches to the same problem.
  • For example (missing comma)
  • You could motivate the model with virtual screening of compound libraries, for example
  • TBC in the end.

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

  • Maybe change 'workflow' to 'model' or something similar. The workflow would include data prep, training, and so on, I presume.
  • Simplify the second sentence.
  • "We will only use the information if an interaction exists or not," -> This sounds a bit strange. Just say right away that you are transforming the task into a classification task.
  • introduction of FNN acronym is missing
  • link to rcsb.org for 4O75

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

  • remove 'in protein ligand interaction prediction' in second sentence
  • typo in C_{alpha} (last sentence above fig.)
  • fig 2 caption: 'protein structures as graphs' is maybe better; typo: representations
  • missing figure 3?

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

  • second sentence has 'use' twice
  • third sentence: missing comma?

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

  • clarify if epsilon is a (hyper-)parameter
  • not sure if onehot or one hot
  • last par: final element to finalize our -> final element to our GNN
  • full stop after pooling function.
  • "For simplicity reasons" sounds wrong. Maybe just "For simplicity, " or "For the sake of simplicity"


Collaborator Author

Regarding the second point: I guess "one-hot" is the spelling Grammarly would suggest. I changed the occurrences in the text to "one-hot".

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

Line #25.            with open(os.path.join(self.folder_name, "tables", "ligands.tsv"), "r") as data:

use path library for consistency
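
A minimal sketch of what this suggestion could look like, assuming the dataset folder is held as a `Path`. The folder name `data/kiba` is a hypothetical placeholder; only the `tables/ligands.tsv` part comes from the reviewed line.

```python
import os
from pathlib import Path

# Hypothetical dataset folder, standing in for self.folder_name.
folder_name = Path("data") / "kiba"

# os.path style, as in the reviewed line.
old_style = os.path.join(str(folder_name), "tables", "ligands.tsv")

# pathlib style: the / operator joins path segments.
ligands_file = folder_name / "tables" / "ligands.tsv"

# A Path can also open the file directly, for example:
# with ligands_file.open("r") as data:
#     ...
```

Both variants point to the same file; the pathlib version matches the `HERE = Path(...)` convention from the PR checklist.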


Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

Line #32.                (filename[:-4], pdb_to_graph(os.path.join(os.path.join(self.folder_name, "proteins", filename)))) for filename in os.listdir(os.path.join(self.folder_name, "proteins"))

pathlib (see above)
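
For the directory listing, a pathlib version could look roughly like the sketch below. `pdb_to_graph_stub` is a hypothetical stand-in for the notebook's `pdb_to_graph`, and the temporary folder with two dummy files (one of them with the made-up name `1ABC.pdb`) only exists to make the example runnable.

```python
import tempfile
from pathlib import Path


def pdb_to_graph_stub(path):
    """Hypothetical stand-in for the notebook's pdb_to_graph; returns the file size."""
    return path.stat().st_size


with tempfile.TemporaryDirectory() as tmp:
    # Dummy "proteins" folder with two placeholder files.
    proteins_dir = Path(tmp) / "proteins"
    proteins_dir.mkdir()
    for name in ("4O75.pdb", "1ABC.pdb"):
        (proteins_dir / name).write_text("HEADER\n")

    # path.stem drops the suffix, replacing the filename[:-4] slicing.
    graphs = [
        (path.stem, pdb_to_graph_stub(path))
        for path in sorted(proteins_dir.glob("*.pdb"))
    ]
```

Using `glob("*.pdb")` also skips files with other extensions, which `os.listdir` plus slicing would not.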


Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

Line #62.                # then split the data and store them for later reuse without running the preprocessing pipeline
  • consider a torch data splitter?


Collaborator Author

Would be too complex here, I'd say. Here, you easily see what's happening and how and why.
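
For reference, an explicit split of the kind discussed here can be sketched with the standard library alone. The function name, the 70/10/20 ratios, and the seed are assumptions for illustration and are not the notebook's values; `torch.utils.data.random_split` would be the library alternative the reviewer alludes to.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then cut into train/val/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left
    return train, val, test


train, val, test = split_dataset(range(100))
```

Every sample ends up in exactly one subset, and the fixed seed makes the split reproducible across runs.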

Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

Line #1.    class Encoder(torch.nn.Module):

Maybe use encoding in fig 1 for consistency


Collaborator

@mbackenkoehler mbackenkoehler Mar 10, 2023

Describe what the training does. I know it's obvious, but maybe it helps. Remind the reader of the BCE loss, mention that Adam is a standard choice, etc.

Explain why there is only a single epoch; maybe encourage the reader to try higher values.



@gerritgr gerritgr merged commit d5a4eaf into DL_edition Apr 11, 2023
@mbackenkoehler mbackenkoehler deleted the rj-038-dti branch January 29, 2024 10:03