This repository contains the code used to gather the information required for analysis of Reactome failed queries.
Extraction of MeSH terms requires a UMLS license/account. If you do not have an account, register at https://utslogin.nlm.nih.gov/cas/login and set the credentials in the configuration YAML file.
- Python - Reactome_PMID_Metadata_Extraction, generates reactome_pmid_metadata.tsv, which contains metadata of PMIDs present in Reactome.
- Python - Reactome_Failed_Query_Analysis, generates failed_query_analysis_output.tsv, which contains details regarding the failed query terms.
- R - Reactome_Analysis, performs the analysis using the files generated above. If those files are not available, they will be downloaded.
The MTI Web API is used to get MeSH terms through its batch processing. Its client code is written in Java, so pyjnius is used to run the JAR files, which are present in /lib.
These JAR files can be found in the ziy/skr-webapi repository.
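As a rough illustration of how pyjnius can be pointed at the JAR files, the sketch below collects every JAR in a directory and puts it on the classpath before the JVM starts. The helper names `jar_classpath` and `start_jvm_with_jars` are hypothetical and not taken from the notebooks, and `java.lang.System` is only a stand-in for the MTI client classes.

```python
import glob
import os


def jar_classpath(lib_dir="lib"):
    """Return the sorted list of JAR files found directly inside lib_dir."""
    return sorted(glob.glob(os.path.join(lib_dir, "*.jar")))


def start_jvm_with_jars(lib_dir="lib"):
    """Put every JAR on the classpath, then load a Java class through pyjnius.

    The classpath has to be configured before the first `import jnius`,
    because importing jnius starts the JVM.
    """
    # Imported lazily so jar_classpath() stays usable without pyjnius installed.
    import jnius_config

    jnius_config.set_classpath(*jar_classpath(lib_dir))
    from jnius import autoclass

    # Stand-in class that exists in every JVM (assumption for illustration);
    # classes shipped in the JARs would be loaded with autoclass the same way.
    return autoclass("java.lang.System")
```

Calling `jar_classpath("lib")` on its own is a quick way to confirm that the expected JAR files are in place before starting the notebooks.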
The following files are generated by the Python notebooks. If you only want to perform the analysis using the R code, they will be downloaded automatically from the links below:
File | Generated by | Source |
---|---|---|
reactome_pmid_metadata.tsv | Reactome_PMID_Metadata_Extraction.ipynb | Link |
failed_query_analysis_output.tsv | Reactome_Failed_Query_Analysis.ipynb | Link |
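The download-if-missing behaviour described above can be sketched in Python as follows. This is only an illustration of the idea, not the code used in the Rmd file; `ensure_file` and `fetch_missing` are hypothetical helpers, and the URLs passed in would be the links from the table above.

```python
import os
import urllib.request


def ensure_file(path, url):
    """Download url to path only when path does not already exist.

    Returns True if a download happened, False if the local file was reused.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True


def fetch_missing(sources, directory="."):
    """Fetch every file in sources ({file name: URL}) that is absent from directory.

    Returns the names of the files that were downloaded.
    """
    downloaded = []
    for name, url in sources.items():
        if ensure_file(os.path.join(directory, name), url):
            downloaded.append(name)
    return downloaded
```

The check only tests whether a file exists, so a partially written TSV would be reused as-is; such files need to be deleted first.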
- Make a copy of parameters_sample.yml named parameters.yml and set the configurations in it. The following parameters must be changed in the YML file:
  - MTI Credentials, register at https://utslogin.nlm.nih.gov/cas/login
        mti:
          email_id : "example@example.com"
          username : "username"
          password : "password"
  - INDRA Database REST URL
        indra_db_rest_url : "SET_INDRA_DB_URL"
  - Reactome Parameters
        reactome_organism: "Homo sapiens"
  - User Query
        query: "MATN2"
Please note: If you want to skip metadata file creation and only run the analysis, skip steps 3 and 4 and continue from step 5. The required files will be downloaded automatically.
- Execute Reactome_PMID_Metadata_Extraction.ipynb, this will generate the reactome_pmid_metadata.tsv file, which is used in step 5.
- Execute Reactome_Failed_Query_Analysis.ipynb, this will generate the failed_query_analysis_output.tsv file, which is required in step 5.
Do NOT perform step 5 with partially generated output files from steps 3 and 4. If you have partial files, delete them; the Rmd code will download the missing files, which are pre-processed, if required.
Please note: This step requires the complete TSV files generated by steps 3 and 4. If these files are not present in your directory, or you skipped steps 3 and 4, they will be downloaded.
In the RStudio Console, enter the following:
rmarkdown::render('Reactome_Analysis.Rmd', output_file = 'analysis_output.nb.html')
OR
Open Reactome_Analysis.Rmd in RStudio and run all the chunks with Ctrl + Alt + R, or follow the image below, to generate the analysis.
Output Files:
- indra_output.html
  Contains Statements from INDRA containing interactions for the query term.
- analysis_output.nb.html
  Contains the analysis performed using the Rmd file.
  This file will not be generated if you use the 'Run All' approach in the previous step. To get the HTML output, follow the image below.
- Installation (required when run without Docker)
pip install --no-cache-dir -r ./dependencies/requirements.txt
R -e 'source("./dependencies/installPackages.R")'
- Make a copy of parameters_sample.yml named parameters.yml and set the configurations in it. The following parameters must be changed in the YML file:
  - MTI Credentials, register at https://utslogin.nlm.nih.gov/cas/login
        mti:
          email_id : "example@example.com"
          username : "username"
          password : "password"
  - INDRA Database REST URL
        indra_db_rest_url : "SET_INDRA_DB_URL"
  - Reactome Parameters
        reactome_organism: "Homo sapiens"
  - User Query
        query: "MATN2"
- Execute the Python Notebooks and R file
bash startup.sh path/to/parameters.yml
Output Files:
- indra_output.html
  Contains Statements from INDRA containing interactions for the query term.
- analysis_output.nb.html
  Contains the analysis performed using the Rmd file.
How to run locally using Docker Image pritishaw/reactome-failed-query-analysis
- Pull Docker Image
docker pull pritishaw/reactome-failed-query-analysis:latest
- Start Notebooks
docker run --name reactome-failed-query-analysis pritishaw/reactome-failed-query-analysis:latest
- Follow the sequence of execution as mentioned above
How to run locally using jupyter/repo2docker (Docker)
- Installation
pip install jupyter-repo2docker
- Build and Start Notebooks
jupyter-repo2docker https://github.com/cannin/enhance_nlp_interaction_network_gsoc2020
Note: Docker needs to be running on the local machine.
- A URL with a token will be printed in the terminal. You can access Jupyter Notebooks and RStudio using that link as follows:
  Jupyter Notebooks : Open the link directly, all notebooks will be visible at /notebooks
  RStudio : Go to /rstudio to open RStudio
- Follow the sequence of execution as mentioned above
A sample file can be found here: parameters_sample.yml. The following configurations can be made using the file. For testing the Python notebooks, you can use the template parameters_test.yml, which is configured to process a small subset of the query terms.
# PYTHON NOTEBOOK PARAMETERS ----
# Register at https://utslogin.nlm.nih.gov/cas/login for MTI credentials
mti:
email_id : "example@example.com"
username : "username"
password : "password"
pmid_threshold : 20
indra_db_rest_url : "SET_INDRA_DB_URL"
reactome_failed_terms_link : "https://gist.githubusercontent.com/PritiShaw/03ce10747835390ec8a755fed9ea813d/raw/cc72cb5479f09b574e03ed22c8d4e3147e09aa0c/Reactome.csv"
failed_query_threshold : null # null Indicates all terms will be processed
failed_query_hits_threshold : 10
reactome_pmid_url : "https://reactome.org/download/current/ReactionPMIDS.txt"
failed_query_output_file_path : "failed_query_analysis_output.tsv"
pmid_chunk_limit : 0
pmid_metadata_output_path : "reactome_pmid_metadata.tsv"
# R NOTEBOOK (Rmd) PARAMETERS ----
# Notebook
max_dt_table_display : 100
# Python environment
python_virtualenv : "/srv/venv"
# General
min_failed_search_hits : 10
# Rank Terms
top_n_reactome_journals : 10
min_indra_query_term_count : 0
min_indra_statement_count : 0
min_pmc_citation_count : 0
min_oc_citation_count : 0
# Reactome Parameters
reactome_organism: "Homo sapiens"
# User Query
query: "MATN2"
# Output
all_mesh_by_top_level_pathways_file : "all_mesh_by_top_level_pathways_full.txt"
top_level_pathways_file : "top_level_pathways.txt"
indra_stmt_html_file : "indra_output.html"
indra_stmt_json_file : "indra_output.json"
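Before a long run, it may help to confirm that the sample placeholders have actually been replaced. The sketch below is one possible check, not part of the repository: `unset_parameters` and `load_parameters` are hypothetical helpers, the placeholder values are copied from the sample configuration above, and loading the file assumes PyYAML is installed. `reactome_organism` and `query` are not checked because their sample values are already usable.

```python
# Sample values from parameters_sample.yml that must not be left unchanged.
PLACEHOLDERS = {
    ("mti", "email_id"): "example@example.com",
    ("mti", "username"): "username",
    ("mti", "password"): "password",
    ("indra_db_rest_url",): "SET_INDRA_DB_URL",
}


def unset_parameters(params):
    """Return dotted names of keys that are missing or still hold the sample value."""
    problems = []
    for key_path, placeholder in PLACEHOLDERS.items():
        value = params
        # Walk down nested dictionaries, e.g. params["mti"]["username"].
        for key in key_path:
            value = value.get(key) if isinstance(value, dict) else None
        if value is None or value == placeholder:
            problems.append(".".join(key_path))
    return problems


def load_parameters(path):
    """Read the YAML configuration into a dictionary (assumes PyYAML is available)."""
    import yaml

    with open(path) as handle:
        return yaml.safe_load(handle)
```

A typical use would be `unset_parameters(load_parameters("parameters.yml"))`, which returns an empty list once every placeholder has been replaced.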
Papermill is used to parameterize the Python notebooks. To use it, follow the steps below:
- Install from requirements.txt
pip install --no-cache-dir -r ./dependencies/requirements.txt
- Set up the config YAML file
Create a copy of parameters_sample.yml and make the changes.
- Run the notebooks
papermill Reactome_Failed_Query_Analysis.ipynb failed_query_analysis.ipynb --log-output -k python3 -f PATH/TO/CONFIG/FILE.yml
papermill Reactome_PMID_Metadata_Extraction.ipynb pmid_metadata.ipynb --log-output -k python3 -f PATH/TO/CONFIG/FILE.yml
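The two commands above can also be driven from a small Python script, which may be convenient when the same configuration file is reused. This is a minimal sketch that simply shells out to the `papermill` CLI with the same flags; `papermill_command` and `run_notebooks` are hypothetical helper names, and the `runner` argument exists only so the command construction can be checked without executing anything.

```python
import subprocess

# Input notebook and executed-output notebook, in the order used above.
NOTEBOOK_JOBS = [
    ("Reactome_Failed_Query_Analysis.ipynb", "failed_query_analysis.ipynb"),
    ("Reactome_PMID_Metadata_Extraction.ipynb", "pmid_metadata.ipynb"),
]


def papermill_command(notebook, output, config_path, kernel="python3"):
    """Build the papermill argument list for one notebook."""
    return ["papermill", notebook, output, "--log-output", "-k", kernel, "-f", config_path]


def run_notebooks(config_path, runner=subprocess.run):
    """Run each notebook in turn, stopping at the first one that fails."""
    for notebook, output in NOTEBOOK_JOBS:
        # check=True makes subprocess.run raise if papermill exits with an error.
        runner(papermill_command(notebook, output, config_path), check=True)
```

For example, `run_notebooks("parameters.yml")` executes both notebooks with the given configuration file.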