Skip to content

almussel/pubMunch-docker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pubMunch-docker

This respository contains a docker container for the Literature Searching pipeline for BRCA Exchange. The pipeline does the following:

  • Download data files

  • Get a list of PMIDs from PubMed for papers that mention "BRCA" in the title or abstract

  • Run pubMunch on the PMIDs that have not previously been processed:

    • The crawler downloads HTML and PDF files for each PMID via publication APIs
    • The converter scrapes the HTML and PDFS files, creating raw text
    • The mutation finder uses regular expressions to find gene names (BRCA1 and BRCA2 in this case) and variant descriptions.
  • Get a current list of BRCA variants from BRCA Exchange via its GA4GH instance

  • Match variants found by pubMunch to the variants in BRCA Exchange

  • Outputs a JSON file with the list of papers containing each variant, with their PMID, title, abstract, and other information.

To run the pipeline, docker must be installed on your sysetem. You can build the container by running

make

The docker container can be invoked with the below command, where <username> and <password> are Synapse credentials that have write access to the BRCA Exchange Literature Searching folder

docker run quay.io/almussel/pubmunch-docker -u <username> -p <password>

In order to get full functionality of the pipeline, it should be run on a server with institutional credentials to allow access to publications. A proxy can also be passed to docker run with the option

--env http_proxy="http://user:password@host:port"

The container can also be run with Synapse credentials so the results are uploaded to the BRCA Exchange Literature Searching folder, as follows:

docker run quay.io/almussel/pubmunch-docker -u <username> -p <password>

Without these credentials, the output of the pipeline will be stored locally.

To run the pipeline on a small sample of 20 pmids, use the -t option.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published