diff --git a/.editorconfig b/.editorconfig new file mode 100644 index 00000000..779d3d44 --- /dev/null +++ b/.editorconfig @@ -0,0 +1,24 @@ +root = true + +[*] +charset = utf-8 +end_of_line = lf +insert_final_newline = true +trim_trailing_whitespace = false +indent_size = 4 +indent_style = space + +[*.{md,yml,yaml,html,css,scss,js}] +indent_size = 2 + +# These files are edited and tested upstream in nf-core/modules +[/modules/nf-core/**] +charset = unset +end_of_line = unset +insert_final_newline = unset +trim_trailing_whitespace = unset +indent_style = unset +indent_size = unset + +[/assets/email*] +indent_size = unset diff --git a/.gitattributes b/.gitattributes index 7fe55006..050bb120 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1 +1,3 @@ *.config linguist-language=nextflow +modules/nf-core/** linguist-generated +subworkflows/nf-core/** linguist-generated diff --git a/.github/.dockstore.yml b/.github/.dockstore.yml index 030138a0..191fabd2 100644 --- a/.github/.dockstore.yml +++ b/.github/.dockstore.yml @@ -3,3 +3,4 @@ version: 1.2 workflows: - subclass: nfl primaryDescriptorPath: /nextflow.config + publish: True diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 570aa1de..b2084868 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -15,11 +15,10 @@ Contributions to the code are even more welcome ;) If you'd like to write some code for nf-core/pgdb, the standard workflow is as follows: -1. Check that there isn't already an issue about your idea in the [nf-core/pgdb issues](https://github.com/nf-core/pgdb/issues) to avoid duplicating work - * If there isn't one already, please create one so that others know you're working on this +1. Check that there isn't already an issue about your idea in the [nf-core/pgdb issues](https://github.com/nf-core/pgdb/issues) to avoid duplicating work. If there isn't one already, please create one so that others know you're working on this 2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/pgdb repository](https://github.com/nf-core/pgdb) to your GitHub account 3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions) -4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). +4. Use `nf-core schema build` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). 5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/). @@ -49,9 +48,9 @@ These tests are run both with the latest available version of `Nextflow` and als :warning: Only in the unlikely and regretful event of a release happening with a bug. -* On your own fork, make a new branch `patch` based on `upstream/master`. -* Fix the bug, and bump version (X.Y.Z+1). -* A PR should be made on `master` from patch to directly this particular bug. +- On your own fork, make a new branch `patch` based on `upstream/master`. +- Fix the bug, and bump version (X.Y.Z+1). +- A PR should be made on `master` from patch to directly address this particular bug.
## Getting help @@ -68,26 +67,23 @@ If you wish to contribute a new step, please use the following coding standards: 1. Define the corresponding input channel into your new process from the expected previous process channel 2. Write the process block (see below). 3. Define the output channel if needed (see below). -4. Add any new flags/options to `nextflow.config` with a default (see below). -5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`) -6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter). -7. Add sanity checks for all relevant parameters. -8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`. -9. Do local tests that the new code works properly and as expected. -10. Add a new test command in `.github/workflow/ci.yaml`. -11. If applicable add a [MultiQC](https://https://multiqc.info/) module. -12. Update MultiQC config `assets/multiqc_config.yaml` so relevant suffixes, name clean up, General Statistics Table column order, and module figures are in the right order. -13. Optional: Add any descriptions of MultiQC report sections and output files to `docs/output.md`. +4. Add any new parameters to `nextflow.config` with a default (see below). +5. Add any new parameters to `nextflow_schema.json` with help text (via the `nf-core schema build` tool). +6. Add sanity checks and validation for all relevant parameters. +7. Perform local tests to validate that the new code works as expected. +8. If applicable, add a new test command in `.github/workflow/ci.yml`. +9. Update MultiQC config `assets/multiqc_config.yml` so relevant suffixes, file name clean up and module plots are in the appropriate order. If applicable, add a [MultiQC](https://multiqc.info/) module. +10. Add a description of the output files and if relevant any appropriate images from the MultiQC report to `docs/output.md`. ### Default values Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope. -Once there, use `nf-core schema build .` to add to `nextflow_schema.json`. +Once there, use `nf-core schema build` to add to `nextflow_schema.json`. ### Default processes resource requirements -Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. +Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline.
A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block. @@ -95,34 +91,13 @@ The process resources can be passed on to the tool dynamically within the proces Please use the following naming schemes, to make it easy to understand what is going where. -* initial process channel: `ch_output_from_` -* intermediate and terminal channels: `ch__for_` +- initial process channel: `ch_output_from_` +- intermediate and terminal channels: `ch__for_` ### Nextflow version bumping If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]` -### Software version reporting - -If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process. - -Add to the script block of the process, something like the following: - -```bash - --version &> v_.txt 2>&1 || true -``` - -or - -```bash - --help | head -n 1 &> v_.txt 2>&1 || true -``` - -You then need to edit the script `bin/scrape_software_versions.py` to: - -1. Add a Python regex for your tool's `--version` output (as in stored in the `v_.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1` -2. Add a HTML entry to the `OrderedDict` for formatting in MultiQC. - ### Images and figures For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines). diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md deleted file mode 100644 index 550bc402..00000000 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ /dev/null @@ -1,64 +0,0 @@ ---- -name: Bug report -about: Report something that is broken or incorrect -labels: bug ---- - - - -## Check Documentation - -I have checked the following places for your error: - -- [ ] [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting) -- [ ] [nf-core/pgdb pipeline documentation](https://nf-co.re/nf-core/pgdb/usage) - -## Description of the bug - - - -## Steps to reproduce - -Steps to reproduce the behaviour: - -1. Command line: -2. 
See error: - -## Expected behaviour - - - -## Log files - -Have you provided the following extra information/files: - -- [ ] The command used to run the pipeline -- [ ] The `.nextflow.log` file - -## System - -- Hardware: -- Executor: -- OS: -- Version - -## Nextflow Installation - -- Version: - -## Container engine - -- Engine: -- version: -- Image tag: - -## Additional context - - diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 00000000..7da9375b --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,50 @@ +name: Bug report +description: Report something that is broken or incorrect +labels: bug +body: + - type: markdown + attributes: + value: | + Before you post this issue, please check the documentation: + + - [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting) + - [nf-core/pgdb pipeline documentation](https://nf-co.re/pgdb/usage) + + - type: textarea + id: description + attributes: + label: Description of the bug + description: A clear and concise description of what the bug is. + validations: + required: true + + - type: textarea + id: command_used + attributes: + label: Command used and terminal output + description: Steps to reproduce the behaviour. Please paste the command you used to launch the pipeline and the output from your terminal. + render: console + placeholder: | + $ nextflow run ... + + Some output where something broke + + - type: textarea + id: files + attributes: + label: Relevant files + description: | + Please drag and drop the relevant files here. Create a `.zip` archive if the extension is not allowed. + Your verbose log file `.nextflow.log` is often useful _(this is a hidden file in the directory where you launched the pipeline)_ as well as custom Nextflow configuration files. + + - type: textarea + id: system + attributes: + label: System information + description: | + * Nextflow version _(eg. 21.10.3)_ + * Hardware _(eg. HPC, Desktop, Cloud)_ + * Executor _(eg. slurm, local, awsbatch)_ + * Container engine: _(e.g. Docker, Singularity, Conda, Podman, Shifter or Charliecloud)_ + * OS _(eg. CentOS Linux, macOS, Linux Mint)_ + * Version of nf-core/pgdb _(eg. 1.1, 1.5, 1.8.2)_ diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml index 092879e9..4d05380f 100644 --- a/.github/ISSUE_TEMPLATE/config.yml +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -1,4 +1,3 @@ -blank_issues_enabled: false contact_links: - name: Join nf-core url: https://nf-co.re/join diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md deleted file mode 100644 index 3340f8f1..00000000 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ /dev/null @@ -1,32 +0,0 @@ ---- -name: Feature request -about: Suggest an idea for the nf-core website -labels: enhancement ---- - - - -## Is your feature request related to a problem? 
Please describe - - - - - -## Describe the solution you'd like - - - -## Describe alternatives you've considered - - - -## Additional context - - diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100644 index 00000000..c9c43676 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,11 @@ +name: Feature request +description: Suggest an idea for the nf-core/pgdb pipeline +labels: enhancement +body: + - type: textarea + id: description + attributes: + label: Description of feature + description: Please describe your suggestion for a new feature. It might help to describe a problem or use case, plus any alternatives that you have considered. + validations: + required: true diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 6f6c41ca..649c8bae 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -15,11 +15,10 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/nf-core/pgdb - [ ] This comment contains a description of changes (with reason). - [ ] If you've fixed a bug or added code that should be tested, add tests! - - [ ] If you've added a new tool - add to the software_versions process and a regex to `scrape_software_versions.py` - - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/pgdb/tree/master/.github/CONTRIBUTING.md) - - [ ] If necessary, also make a PR on the nf-core/pgdb _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. -- [ ] Make sure your code lints (`nf-core lint .`). -- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`). + - [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/nf-core/pgdb/tree/master/.github/CONTRIBUTING.md) + - [ ] If necessary, also make a PR on the nf-core/pgdb _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. +- [ ] Make sure your code lints (`nf-core lint`). +- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir `). - [ ] Usage Documentation in `docs/usage.md` is updated. - [ ] Output Documentation in `docs/output.md` is updated. - [ ] `CHANGELOG.md` is updated. diff --git a/.github/markdownlint.yml b/.github/markdownlint.yml deleted file mode 100644 index 8d7eb53b..00000000 --- a/.github/markdownlint.yml +++ /dev/null @@ -1,12 +0,0 @@ -# Markdownlint configuration file -default: true -line-length: false -no-duplicate-header: - siblings_only: true -no-inline-html: - allowed_elements: - - img - - p - - kbd - - details - - summary diff --git a/.github/workflows/awsfulltest.yml b/.github/workflows/awsfulltest.yml index 7a9e38f3..c429e8ad 100644 --- a/.github/workflows/awsfulltest.yml +++ b/.github/workflows/awsfulltest.yml @@ -1,43 +1,27 @@ name: nf-core AWS full size tests # This workflow is triggered on published releases. -# It can be additionally triggered manually with GitHub actions workflow dispatch. +# It can be additionally triggered manually with GitHub actions workflow dispatch button. 
# It runs the -profile 'test_full' on AWS batch on: - workflow_run: - workflows: ["nf-core Docker push (release)"] - types: [completed] + release: + types: [published] workflow_dispatch: - jobs: - run-awstest: + run-tower: name: Run AWS full tests if: github.repository == 'nf-core/pgdb' runs-on: ubuntu-latest steps: - - name: Setup Miniconda - uses: conda-incubator/setup-miniconda@v2 + - name: Launch workflow via tower + uses: nf-core/tower-action@v3 with: - auto-update-conda: true - python-version: 3.7 - - name: Install awscli - run: conda install -c conda-forge awscli - - name: Start AWS batch job - # TODO nf-core: You can customise AWS full pipeline tests as required - # Add full size test data (but still relatively small datasets for few samples) - # on the `test_full.config` test runs with only one set of parameters - # Then specify `-profile test_full` instead of `-profile test` on the AWS batch command - env: - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }} - AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }} - AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }} - AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }} - run: | - aws batch submit-job \ - --region eu-west-1 \ - --job-name nf-core-pgdb \ - --job-queue $AWS_JOB_QUEUE \ - --job-definition $AWS_JOB_DEFINITION \ - --container-overrides '{"command": ["nf-core/pgdb", "-r '"${GITHUB_SHA}"' -profile test --outdir s3://'"${AWS_S3_BUCKET}"'/pgdb/results-'"${GITHUB_SHA}"' -w s3://'"${AWS_S3_BUCKET}"'/pgdb/work-'"${GITHUB_SHA}"' -with-tower"], "environment": [{"name": "TOWER_ACCESS_TOKEN", "value": "'"$TOWER_ACCESS_TOKEN"'"}]}' + workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }} + access_token: ${{ secrets.TOWER_ACCESS_TOKEN }} + compute_env: ${{ secrets.TOWER_COMPUTE_ENV }} + workdir: s3://${{ secrets.AWS_S3_BUCKET }}/work/pgdb/work-${{ github.sha }} + parameters: | + { + "outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/pgdb/results-${{ github.sha }}" + } + profiles: test_full,aws_tower diff --git a/.github/workflows/awstest.yml b/.github/workflows/awstest.yml index 7ec9f445..833f821e 100644 --- a/.github/workflows/awstest.yml +++ b/.github/workflows/awstest.yml @@ -1,39 +1,25 @@ name: nf-core AWS test -# This workflow is triggered on push to the master branch. -# It can be additionally triggered manually with GitHub actions workflow dispatch. -# It runs the -profile 'test' on AWS batch. +# This workflow can be triggered manually with the GitHub actions workflow dispatch button. 
+# It runs the -profile 'test' on AWS batch on: workflow_dispatch: - jobs: - run-awstest: + run-tower: name: Run AWS tests if: github.repository == 'nf-core/pgdb' runs-on: ubuntu-latest steps: - - name: Setup Miniconda - uses: conda-incubator/setup-miniconda@v2 + # Launch workflow using Tower CLI tool action + - name: Launch workflow via tower + uses: nf-core/tower-action@v3 with: - auto-update-conda: true - python-version: 3.7 - - name: Install awscli - run: conda install -c conda-forge awscli - - name: Start AWS batch job - # TODO nf-core: You can customise CI pipeline run tests as required - # For example: adding multiple test runs with different parameters - # Remember that you can parallelise this by using strategy.matrix - env: - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - TOWER_ACCESS_TOKEN: ${{ secrets.AWS_TOWER_TOKEN }} - AWS_JOB_DEFINITION: ${{ secrets.AWS_JOB_DEFINITION }} - AWS_JOB_QUEUE: ${{ secrets.AWS_JOB_QUEUE }} - AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }} - run: | - aws batch submit-job \ - --region eu-west-1 \ - --job-name nf-core-pgdb \ - --job-queue $AWS_JOB_QUEUE \ - --job-definition $AWS_JOB_DEFINITION \ - --container-overrides '{"command": ["nf-core/pgdb", "-r '"${GITHUB_SHA}"' -profile test --outdir s3://'"${AWS_S3_BUCKET}"'/pgdb/results-'"${GITHUB_SHA}"' -w s3://'"${AWS_S3_BUCKET}"'/pgdb/work-'"${GITHUB_SHA}"' -with-tower"], "environment": [{"name": "TOWER_ACCESS_TOKEN", "value": "'"$TOWER_ACCESS_TOKEN"'"}]}' + workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }} + access_token: ${{ secrets.TOWER_ACCESS_TOKEN }} + compute_env: ${{ secrets.TOWER_COMPUTE_ENV }} + workdir: s3://${{ secrets.AWS_S3_BUCKET }}/work/pgdb/work-${{ github.sha }} + parameters: | + { + "outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/pgdb/results-test-${{ github.sha }}" + } + profiles: test,aws_tower diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml index e10c3dfe..d86970b0 100644 --- a/.github/workflows/branch.yml +++ b/.github/workflows/branch.yml @@ -13,8 +13,7 @@ jobs: - name: Check PRs if: github.repository == 'nf-core/pgdb' run: | - { [[ ${{github.event.pull_request.head.repo.full_name}} == nf-core/pgdb ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] - + { [[ ${{github.event.pull_request.head.repo.full_name }} == nf-core/pgdb ]] && [[ $GITHUB_HEAD_REF = "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] # If the above check failed, post a comment on the PR explaining the failure # NOTE - this doesn't currently work if the PR is coming from a fork, due to limitations in GitHub actions secrets @@ -23,15 +22,23 @@ jobs: uses: mshick/add-pr-comment@v1 with: message: | + ## This PR is against the `master` branch :x: + + * Do not close this PR + * Click _Edit_ and change the `base` to `dev` + * This CI test will remain failed until you push a new commit + + --- + Hi @${{ github.event.pull_request.user.login }}, - It looks like this pull-request is has been made against the ${{github.event.pull_request.head.repo.full_name}} `master` branch. + It looks like this pull-request has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `master` branch. The `master` branch on nf-core repositories should always contain code from the latest release. - Because of this, PRs to `master` are only allowed if they come from the ${{github.event.pull_request.head.repo.full_name}} `dev` branch.
+ Because of this, PRs to `master` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch. You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page. + Note that even after this, the test will continue to show as failing until you push a new commit. Thanks again for your contribution! repo-token: ${{ secrets.GITHUB_TOKEN }} allow-repeats: false - diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index c857ba1b..530d34c0 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -8,53 +8,51 @@ on: release: types: [published] +env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + jobs: test: - name: Run workflow tests + name: Run pipeline with test data # Only run on push if this is the nf-core dev branch (merged PRs) - if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/pgdb') }} + if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/pgdb') }}" runs-on: ubuntu-latest - env: - NXF_VER: ${{ matrix.nxf_ver }} - NXF_ANSI_LOG: false - COSMIC_USERNAME: ${{ secrets.COSMIC_USERNAME }} - COSMIC_PASSWORD: ${{ secrets.COSMIC_PASSWORD }} - strategy: matrix: - # Nextflow versions: check pipeline minimum and current latest - nxf_ver: ['20.04.0', ''] + # Nextflow versions + include: + # Test pipeline minimum Nextflow version + - NXF_VER: "21.10.3" + NXF_EDGE: "" + # Test latest edge release of Nextflow + - NXF_VER: "" + NXF_EDGE: "1" steps: - name: Check out pipeline code uses: actions/checkout@v2 - - name: Check if Dockerfile or Conda environment changed - uses: technote-space/get-diff-action@v4 - with: - FILES: | - Dockerfile - environment.yml - - name: Build new docker image - if: env.MATCHED_FILES - run: docker build --no-cache . 
-t nfcore/pgdb:dev - - name: Pull docker image - if: ${{ !env.MATCHED_FILES }} - run: | - docker pull nfcore/pgdb:dev - docker tag nfcore/pgdb:dev nfcore/pgdb:dev + - name: Install Nextflow env: - CAPSULE_LOG: none + NXF_VER: ${{ matrix.NXF_VER }} + # Uncomment only if the edge release is more recent than the latest stable release + # See https://github.com/nextflow-io/nextflow/issues/2467 + # NXF_EDGE: ${{ matrix.NXF_EDGE }} run: | wget -qO- get.nextflow.io | bash sudo mv nextflow /usr/local/bin/ + - name: Run pipeline with test data - # TODO nf-core: You can customise CI pipeline run tests as required # For example: adding multiple test runs with different parameters # Remember that you can parallelise this by using strategy.matrix - run: nextflow run ${GITHUB_WORKSPACE} -profile test,docker - - name: Run pipeline with cosmic test - # TODO nf-core: You can customise CI pipeline run tests as required - # For example: adding multiple test runs with different parameters - # Remember that you can parallelise this by using strategy.matrix - run: nextflow run ${GITHUB_WORKSPACE} -profile test_full,docker --cosmic_user_name ${COSMIC_USERNAME} --cosmic_password ${COSMIC_PASSWORD} + run: | + nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results + - name: Run pipeline with COSMIC Cell Lines and cBioPortal data + env: + COSMIC_USERNAME: ${{ secrets.COSMIC_USERNAME }} + COSMIC_PASSWORD: ${{ secrets.COSMIC_PASSWORD }} + # For example: adding multiple test runs with different parameters + # Remember that you can parallelise this by using strategy.matrix + run: | + nextflow run ${GITHUB_WORKSPACE} -profile test_cosmic_cbio,docker --outdir ./results --cosmic_user_name $COSMIC_USERNAME --cosmic_password $COSMIC_PASSWORD diff --git a/.github/workflows/fix-linting.yml b/.github/workflows/fix-linting.yml new file mode 100644 index 00000000..2007a371 --- /dev/null +++ b/.github/workflows/fix-linting.yml @@ -0,0 +1,55 @@ +name: Fix linting from a comment +on: + issue_comment: + types: [created] + +jobs: + deploy: + # Only run if comment is on a PR with the main repo, and if it contains the magic keywords + if: > + contains(github.event.comment.html_url, '/pull/') && + contains(github.event.comment.body, '@nf-core-bot fix linting') && + github.repository == 'nf-core/pgdb' + runs-on: ubuntu-latest + steps: + # Use the @nf-core-bot token to check out so we can push later + - uses: actions/checkout@v3 + with: + token: ${{ secrets.nf_core_bot_auth_token }} + + # Action runs on the issue comment, so we don't get the PR by default + # Use the gh cli to check out the PR + - name: Checkout Pull Request + run: gh pr checkout ${{ github.event.issue.number }} + env: + GITHUB_TOKEN: ${{ secrets.nf_core_bot_auth_token }} + + - uses: actions/setup-node@v2 + + - name: Install Prettier + run: npm install -g prettier @prettier/plugin-php + + # Check that we actually need to fix something + - name: Run 'prettier --check' + id: prettier_status + run: | + if prettier --check ${GITHUB_WORKSPACE}; then + echo "::set-output name=result::pass" + else + echo "::set-output name=result::fail" + fi + + - name: Run 'prettier --write' + if: steps.prettier_status.outputs.result == 'fail' + run: prettier --write ${GITHUB_WORKSPACE} + + - name: Commit & push changes + if: steps.prettier_status.outputs.result == 'fail' + run: | + git config user.email "core@nf-co.re" + git config user.name "nf-core-bot" + git config push.default upstream + git add . 
+ git status + git commit -m "[automated] Fix linting with Prettier" + git push diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml index bef81e61..77358dee 100644 --- a/.github/workflows/linting.yml +++ b/.github/workflows/linting.yml @@ -1,6 +1,7 @@ name: nf-core linting # This workflow is triggered on pushes and PRs to the repository. -# It runs the `nf-core lint` and markdown lint tests to ensure that the code meets the nf-core guidelines +# It runs the `nf-core lint` and markdown lint tests to ensure +# that the code meets the nf-core guidelines. on: push: pull_request: @@ -8,32 +9,35 @@ on: types: [published] jobs: - Markdown: + EditorConfig: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - - uses: actions/setup-node@v1 - with: - node-version: '10' - - name: Install markdownlint - run: npm install -g markdownlint-cli - - name: Run Markdownlint - run: markdownlint ${GITHUB_WORKSPACE} -c ${GITHUB_WORKSPACE}/.github/markdownlint.yml - YAML: + + - uses: actions/setup-node@v2 + + - name: Install editorconfig-checker + run: npm install -g editorconfig-checker + + - name: Run ECLint check + run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile') + + Prettier: runs-on: ubuntu-latest steps: - - uses: actions/checkout@v1 - - uses: actions/setup-node@v1 - with: - node-version: '10' - - name: Install yaml-lint - run: npm install -g yaml-lint - - name: Run yaml-lint - run: yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml") + - uses: actions/checkout@v2 + + - uses: actions/setup-node@v2 + + - name: Install Prettier + run: npm install -g prettier + + - name: Run Prettier --check + run: prettier --check ${GITHUB_WORKSPACE} + nf-core: runs-on: ubuntu-latest steps: - - name: Check out pipeline code uses: actions/checkout@v2 @@ -44,10 +48,10 @@ jobs: wget -qO- get.nextflow.io | bash sudo mv nextflow /usr/local/bin/ - - uses: actions/setup-python@v1 + - uses: actions/setup-python@v3 with: - python-version: '3.6' - architecture: 'x64' + python-version: "3.6" + architecture: "x64" - name: Install dependencies run: | @@ -59,7 +63,7 @@ jobs: GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }} - run: nf-core -l lint_log.txt lint ${GITHUB_WORKSPACE} --markdown lint_results.md + run: nf-core -l lint_log.txt lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md - name: Save PR number if: ${{ always() }} @@ -69,9 +73,8 @@ jobs: if: ${{ always() }} uses: actions/upload-artifact@v2 with: - name: linting-log-file + name: linting-logs path: | lint_log.txt lint_results.md PR_number.txt - diff --git a/.github/workflows/linting_comment.yml b/.github/workflows/linting_comment.yml index 90f03c6f..04758f61 100644 --- a/.github/workflows/linting_comment.yml +++ b/.github/workflows/linting_comment.yml @@ -1,4 +1,3 @@ - name: nf-core linting comment # This workflow is triggered after the linting action is complete # It posts an automated comment to the PR, even if the PR is coming from a fork @@ -15,6 +14,7 @@ jobs: uses: dawidd6/action-download-artifact@v2 with: workflow: linting.yml + workflow_conclusion: completed - name: Get PR number id: pr_number @@ -26,4 +26,3 @@ jobs: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} number: ${{ steps.pr_number.outputs.pr_number }} path: linting-logs/lint_results.md - diff --git 
a/.github/workflows/push_dockerhub_dev.yml b/.github/workflows/push_dockerhub_dev.yml deleted file mode 100644 index d0a78c08..00000000 --- a/.github/workflows/push_dockerhub_dev.yml +++ /dev/null @@ -1,28 +0,0 @@ -name: nf-core Docker push (dev) -# This builds the docker image and pushes it to DockerHub -# Runs on nf-core repo releases and push event to 'dev' branch (PR merges) -on: - push: - branches: - - dev - -jobs: - push_dockerhub: - name: Push new Docker image to Docker Hub (dev) - runs-on: ubuntu-latest - # Only run for the nf-core repo, for releases and merged PRs - if: ${{ github.repository == 'nf-core/pgdb' }} - env: - DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }} - DOCKERHUB_PASS: ${{ secrets.DOCKERHUB_PASS }} - steps: - - name: Check out pipeline code - uses: actions/checkout@v2 - - - name: Build new docker image - run: docker build --no-cache . -t nfcore/pgdb:dev - - - name: Push Docker image to DockerHub (dev) - run: | - echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin - docker push nfcore/pgdb:dev diff --git a/.github/workflows/push_dockerhub_release.yml b/.github/workflows/push_dockerhub_release.yml deleted file mode 100644 index b326404f..00000000 --- a/.github/workflows/push_dockerhub_release.yml +++ /dev/null @@ -1,29 +0,0 @@ -name: nf-core Docker push (release) -# This builds the docker image and pushes it to DockerHub -# Runs on nf-core repo releases and push event to 'dev' branch (PR merges) -on: - release: - types: [published] - -jobs: - push_dockerhub: - name: Push new Docker image to Docker Hub (release) - runs-on: ubuntu-latest - # Only run for the nf-core repo, for releases and merged PRs - if: ${{ github.repository == 'nf-core/pgdb' }} - env: - DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }} - DOCKERHUB_PASS: ${{ secrets.DOCKERHUB_PASS }} - steps: - - name: Check out pipeline code - uses: actions/checkout@v2 - - - name: Build new docker image - run: docker build --no-cache . 
-t nfcore/pgdb:latest - - - name: Push Docker image to DockerHub (release) - run: | - echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin - docker push nfcore/pgdb:latest - docker tag nfcore/pgdb:latest nfcore/pgdb:${{ github.event.release.tag_name }} - docker push nfcore/pgdb:${{ github.event.release.tag_name }} diff --git a/.gitignore b/.gitignore index 9fbe9b86..5124c9ac 100644 --- a/.gitignore +++ b/.gitignore @@ -1,11 +1,8 @@ .nextflow* work/ data/ -results* +results/ .DS_Store -tests/ testing/ testing* *.pyc -.idea/ -/fasta_database.fa diff --git a/.gitpod.yml b/.gitpod.yml new file mode 100644 index 00000000..85d95ecc --- /dev/null +++ b/.gitpod.yml @@ -0,0 +1,14 @@ +image: nfcore/gitpod:latest + +vscode: + extensions: # based on nf-core.nf-core-extensionpack + - codezombiech.gitignore # Language support for .gitignore files + # - cssho.vscode-svgviewer # SVG viewer + - esbenp.prettier-vscode # Markdown/CommonMark linting and style checking for Visual Studio Code + - eamodio.gitlens # Quickly glimpse into whom, why, and when a line or code block was changed + - EditorConfig.EditorConfig # override user/workspace settings with settings found in .editorconfig files + - Gruntfuggly.todo-tree # Display TODO and FIXME in a tree view in the activity bar + - mechatroner.rainbow-csv # Highlight columns in csv files in different colors + # - nextflow.nextflow # Nextflow syntax highlighting + - oderwat.indent-rainbow # Highlight indentation level + - streetsidesoftware.code-spell-checker # Spelling checker for source code diff --git a/.idea/.gitignore b/.idea/.gitignore new file mode 100644 index 00000000..73f69e09 --- /dev/null +++ b/.idea/.gitignore @@ -0,0 +1,8 @@ +# Default ignored files +/shelf/ +/workspace.xml +# Datasource local storage ignored files +/dataSources/ +/dataSources.local.xml +# Editor-based HTTP Client requests +/httpRequests/ diff --git a/.nf-core.yml b/.nf-core.yml new file mode 100644 index 00000000..778ae193 --- /dev/null +++ b/.nf-core.yml @@ -0,0 +1,4 @@ +repository_type: pipeline +lint: + files_exist: + - conf/igenomes.config diff --git a/.prettierignore b/.prettierignore new file mode 100644 index 00000000..d0e7ae58 --- /dev/null +++ b/.prettierignore @@ -0,0 +1,9 @@ +email_template.html +.nextflow* +work/ +data/ +results/ +.DS_Store +testing/ +testing* +*.pyc diff --git a/.prettierrc.yml b/.prettierrc.yml new file mode 100644 index 00000000..c81f9a76 --- /dev/null +++ b/.prettierrc.yml @@ -0,0 +1 @@ +printWidth: 120 diff --git a/CHANGELOG.md b/CHANGELOG.md index 1eb7aefc..78474f96 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,14 +3,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## v1.0dev - [date] +## 1.0.0 Initial release of nf-core/pgdb, created with the [nf-core](https://nf-co.re/) template. ### `Added` -### `Fixed` +The initial version of the pipeline features the following steps: -### `Dependencies` +- _(optional)_ ENSEMBL Reference proteomes included in final proteome +- Convert a Variant genome database like COSMIC or CBioPortal to proteomes +- Convert provided VCF to proteome database +- _(optional)_ Generate the decoy database and attach it to the final proteome -### `Deprecated` +### `Known issues` + +If you experience nextflow running forever after a failed step, try setting `errorStrategy = terminate`. See the corresponding [nextflow issue](https://github.com/nextflow-io/nextflow/issues/1457). 
diff --git a/CITATIONS.md b/CITATIONS.md new file mode 100644 index 00000000..12311663 --- /dev/null +++ b/CITATIONS.md @@ -0,0 +1,52 @@ +# nf-core/pgdb: Citations + +## [pgdb](https://pubmed.ncbi.nlm.nih.gov/34904638/) + +> Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol, Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides, Bioinformatics, Volume 38, Issue 5, 1 March 2022, Pages 1470–1472, https://doi.org/10.1093/bioinformatics/btab838 + +## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/) + +> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031. + +## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/) + +> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. + +## Pipeline tools + +- [pypgatk](https://pubmed.ncbi.nlm.nih.gov/34904638/) + + > Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol, Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides, Bioinformatics, Volume 38, Issue 5, 1 March 2022, Pages 1470–1472, https://doi.org/10.1093/bioinformatics/btab838 + +## Data sources + +- [ENSEMBL](https://pubmed.ncbi.nlm.nih.gov/31691826/) + + > Yates, A. D., Achuthan, P., Akanni, W., Allen, J., Allen, J., Alvarez-Jarreta, J., ... & Flicek, P. (2020). Ensembl 2020. Nucleic acids research, 48(D1), D682-D688. + +- [COSMIC](https://pubmed.ncbi.nlm.nih.gov/15188009/) + + > Bamford, S., Dawson, E., Forbes, S., Clements, J., Pettett, R., Dogan, A., ... & Wooster, R. (2004). The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. British journal of cancer, 91(2), 355-358. + +- [cBioPortal](https://pubmed.ncbi.nlm.nih.gov/23550210/) + + > Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... & Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science signaling, 6(269), pl1-pl1. + +## Software packaging/containerisation tools + +- [Anaconda](https://anaconda.com) + + > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web. + +- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/) + + > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506. + +- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/) + + > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. 
+ +- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241) + +- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/) + > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675. diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 405fb1bf..f4fd052f 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -1,46 +1,111 @@ -# Contributor Covenant Code of Conduct +# Code of Conduct at nf-core (v1.0) ## Our Pledge -In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation. +In the interest of fostering an open, collaborative, and welcoming environment, we as contributors and maintainers of nf-core, pledge to making participation in our projects and community a harassment-free experience for everyone, regardless of: -## Our Standards +- Age +- Body size +- Familial status +- Gender identity and expression +- Geographical location +- Level of experience +- Nationality and national origins +- Native language +- Physical and neurological ability +- Race or ethnicity +- Religion +- Sexual identity and orientation +- Socioeconomic status -Examples of behavior that contributes to creating a positive environment include: +Please note that the list above is alphabetised and is therefore not ranked in any order of preference or importance. -* Using welcoming and inclusive language -* Being respectful of differing viewpoints and experiences -* Gracefully accepting constructive criticism -* Focusing on what is best for the community -* Showing empathy towards other community members +## Preamble -Examples of unacceptable behavior by participants include: +> Note: This Code of Conduct (CoC) has been drafted by the nf-core Safety Officer and been edited after input from members of the nf-core team and others. "We", in this document, refers to the Safety Officer and members of the nf-core core team, both of whom are deemed to be members of the nf-core community and are therefore required to abide by this Code of Conduct. This document will be amended periodically to keep it up-to-date, and in case of any dispute, the most current version will apply. -* The use of sexualized language or imagery and unwelcome sexual attention or advances -* Trolling, insulting/derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or electronic address, without explicit permission -* Other conduct which could reasonably be considered inappropriate in a professional setting +An up-to-date list of members of the nf-core core team can be found [here](https://nf-co.re/about). Our current safety officer is Renuka Kudva. + +nf-core is a young and growing community that welcomes contributions from anyone with a shared vision for [Open Science Policies](https://www.fosteropenscience.eu/taxonomy/term/8). Open science policies encompass inclusive behaviours and we strive to build and maintain a safe and inclusive environment for all individuals.
+ +We have therefore adopted this code of conduct (CoC), which we require all members of our community and attendees in nf-core events to adhere to in all our workspaces at all times. Workspaces include but are not limited to Slack, meetings on Zoom, Jitsi, YouTube live etc. + +Our CoC will be strictly enforced and the nf-core team reserve the right to exclude participants who do not comply with our guidelines from our workspaces and future nf-core activities. + +We ask all members of our community to help maintain a supportive and productive workspace and to avoid behaviours that can make individuals feel unsafe or unwelcome. Please help us maintain and uphold this CoC. + +Questions, concerns or ideas on what we can include? Contact safety [at] nf-co [dot] re ## Our Responsibilities -Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. +The safety officer is responsible for clarifying the standards of acceptable behavior and is expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour. + +The safety officer in consultation with the nf-core core team have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. + +Members of the core team or the safety officer who violate the CoC will be required to recuse themselves pending investigation. They will not have access to any reports of the violations and be subject to the same actions as others in violation of the CoC. + +## When and where does this Code of Conduct apply? + +Participation in the nf-core community is contingent on following these guidelines in all our workspaces and events. This includes but is not limited to the following listed alphabetically and therefore in no order of preference: + +- Communicating with an official project email address. +- Communicating with community members within the nf-core Slack channel. +- Participating in hackathons organised by nf-core (both online and in-person events). +- Participating in collaborative work on GitHub, Google Suite, community calls, mentorship meetings, email correspondence. +- Participating in workshops, training, and seminar series organised by nf-core (both online and in-person events). This applies to events hosted on web-based platforms such as Zoom, Jitsi, YouTube live etc. +- Representing nf-core on social media. This includes both official and personal accounts. + +## nf-core cares 😊 + +nf-core's CoC and expectations of respectful behaviours for all participants (including organisers and the nf-core team) include but are not limited to the following (listed in alphabetical order): + +- Ask for consent before sharing another community member’s personal information (including photographs) on social media. +- Be respectful of differing viewpoints and experiences. We are all here to learn from one another and a difference in opinion can present a good learning opportunity. +- Celebrate your accomplishments at events! (Get creative with your use of emojis 🎉 🥳 💯 🙌 !) +- Demonstrate empathy towards other community members. (We don’t all have the same amount of time to dedicate to nf-core.
If tasks are pending, don’t hesitate to gently remind members of your team. If you are leading a task, ask for help if you feel overwhelmed.) +- Engage with and enquire after others. (This is especially important given the geographically remote nature of the nf-core community, so let’s do this the best we can) +- Focus on what is best for the team and the community. (When in doubt, ask) +- Graciously accept constructive criticism, yet be unafraid to question, deliberate, and learn. +- Introduce yourself to members of the community. (We’ve all been outsiders and we know that talking to strangers can be hard for some, but remember we’re interested in getting to know you and your visions for open science!) +- Show appreciation and **provide clear feedback**. (This is especially important because we don’t see each other in person and it can be harder to interpret subtleties. Also remember that not everyone understands a certain language to the same extent as you do, so **be clear in your communications to be kind.**) +- Take breaks when you feel like you need them. +- Using welcoming and inclusive language. (Participants are encouraged to display their chosen pronouns on Zoom or in communication on Slack.) + +## nf-core frowns on 😕 + +The following behaviours from any participants within the nf-core community (including the organisers) will be considered unacceptable under this code of conduct. Engaging or advocating for any of the following could result in expulsion from nf-core workspaces. + +- Deliberate intimidation, stalking or following and sustained disruption of communication among participants of the community. This includes hijacking shared screens through actions such as using the annotate tool in conferencing software such as Zoom. +- “Doxing” i.e. posting (or threatening to post) another person’s personal identifying information online. +- Spamming or trolling of individuals on social media. +- Use of sexual or discriminatory imagery, comments, or jokes and unwelcome sexual attention. +- Verbal and text comments that reinforce social structures of domination related to gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, age, religion or work experience. + +### Online Trolling + +The majority of nf-core interactions and events are held online. Unfortunately, holding events online comes with the added issue of online trolling. This is unacceptable, reports of such behaviour will be taken very seriously, and perpetrators will be excluded from activities immediately. + +All community members are required to ask members of the group they are working within for explicit consent prior to taking screenshots of individuals during video calls. + +## Procedures for Reporting CoC violations -Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. +If someone makes you feel uncomfortable through their behaviours or actions, report it as soon as possible. -## Scope +You can reach out to members of the [nf-core core team](https://nf-co.re/about) and they will forward your concerns to the safety officer(s). -This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. 
Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. +Issues directly concerning members of the core team will be dealt with by other members of the core team and the safety manager, and possible conflicts of interest will be taken into account. nf-core is also in discussions about having an ombudsperson, and details will be shared in due course. -## Enforcement +All reports will be handled with utmost discretion and confidentiality. -Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team on [Slack](https://nf-co.re/join/slack). The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. +## Attribution and Acknowledgements -Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. +- The [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4) +- The [OpenCon 2017 Code of Conduct](http://www.opencon2017.org/code_of_conduct) (CC BY 4.0 OpenCon organisers, SPARC and Right to Research Coalition) +- The [eLife innovation sprint 2020 Code of Conduct](https://sprint.elifesciences.org/code-of-conduct/) +- The [Mozilla Community Participation Guidelines v3.1](https://www.mozilla.org/en-US/about/governance/policies/participation/) (version 3.1, CC BY-SA 3.0 Mozilla) -## Attribution +## Changelog -This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct/][version] +### v1.0 - March 12th, 2021 -[homepage]: https://contributor-covenant.org -[version]: https://www.contributor-covenant.org/version/1/4/code-of-conduct/ +- Complete rewrite from original [Contributor Covenant](http://contributor-covenant.org/) CoC. diff --git a/Dockerfile b/Dockerfile deleted file mode 100644 index 26138738..00000000 --- a/Dockerfile +++ /dev/null @@ -1,17 +0,0 @@ -FROM nfcore/base:1.12.1 -LABEL authors="Husen M.
Umer & Yasset Perez-Riverol" \ - description="Docker image containing all software requirements for the nf-core/pgdb pipeline" - -# Install the conda environment -COPY environment.yml / -RUN conda env create --quiet -f /environment.yml && conda clean -a - -# Add conda installation dir to PATH (instead of doing 'conda activate') -ENV PATH /opt/conda/envs/nf-core-pgdb-1.0dev/bin:$PATH - -# Dump the details of the installed packages to a file for posterity -RUN conda env export --name nf-core-pgdb-1.0dev > nf-core-pgdb-1.0dev.yml - -# Instruct R processes to use these empty files instead of clashing with a local version -RUN touch .Rprofile -RUN touch .Renviron diff --git a/README.md b/README.md index 2331e8bf..1d3f0f7b 100644 --- a/README.md +++ b/README.md @@ -1,43 +1,58 @@ -# ![nf-core/pgdb](docs/images/nf-core-pgdb_logo.png) +# ![nf-core/pgdb](docs/images/nf-core-pgdb_logo_light.png#gh-light-mode-only) ![nf-core/pgdb](docs/images/nf-core-pgdb_logo_dark.png#gh-dark-mode-only) -The ProteoGenomics database generation workflow (**pgdb**) use the [pypgatk](https://github.com/bigbio/py-pgatk) and [nextflow](https://www.nextflow.io/) to create different protein databases for ProteoGenomics data analysis. +[![GitHub Actions CI Status](https://github.com/nf-core/pgdb/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/pgdb/actions?query=workflow%3A%22nf-core+CI%22) +[![GitHub Actions Linting Status](https://github.com/nf-core/pgdb/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/pgdb/actions?query=workflow%3A%22nf-core+linting%22) +[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?logo=Amazon%20AWS)](https://nf-co.re/pgdb/results) +[![Cite with Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.4722662.svg)](https://doi.org/10.5281/zenodo.4722662) -[![GitHub Actions CI Status](https://github.com/nf-core/pgdb/workflows/nf-core%20CI/badge.svg)](https://github.com/bigbio/pgdb/actions) -[![GitHub Actions Linting Status](https://github.com/nf-core/pgdb/workflows/nf-core%20linting/badge.svg)](https://github.com/bigbio/pgdb/actions) -[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.04.0-brightgreen.svg)](https://www.nextflow.io/) +[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A521.10.3-23aa62.svg)](https://www.nextflow.io/) +[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?logo=anaconda)](https://docs.conda.io/en/latest/) +[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?logo=docker)](https://www.docker.com/) +[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg)](https://sylabs.io/docs/) +[![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/nf-core/pgdb) -[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](https://bioconda.github.io/) -[![Docker](https://img.shields.io/docker/automated/nfcore/pgdb.svg)](https://hub.docker.com/r/nfcore/pgdb) [![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23pgdb-4A154B?logo=slack)](https://nfcore.slack.com/channels/pgdb) +[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?logo=twitter)](https://twitter.com/nf_core) +[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?logo=youtube)](https://www.youtube.com/c/nf-core) ## Introduction - -**nf-core/pgdb** is a bioinformatics best-practise analysis 
pipeline for +**nf-core/pgdb** is a bioinformatics pipeline to generate proteogenomics databases. pgdb allows users to create proteogenomics databases using ENSEMBL as the reference proteome database. Three different major databases can be attached to the final proteogenomics database: + +- The reference proteome (ENSEMBL Reference proteome) +- Non-canonical proteins: pseudo-genes, sORFs, lncRNA. +- Variants: COSMIC, cBioPortal, GNOMAD variants + +The pipeline allows users to estimate decoy proteins with different methods and attach them to the final proteogenomics database. The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible. ## Quick Start -1. Install [`nextflow`](https://nf-co.re/usage/installation) +1. Install [`Nextflow`](https://www.nextflow.io/docs/latest/getstarted.html#installation) (`>=21.10.3`) -2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`Podman`](https://podman.io/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_ +2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/) (you can follow [this tutorial](https://singularity-tutorial.github.io/01-installation/)), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(you can use [`Conda`](https://conda.io/miniconda.html) both to install Nextflow itself and also to manage software within pipelines. Please only use it within pipelines as a last resort; see [docs](https://nf-co.re/usage/configuration#basic-configuration-profiles))_. 3. Download the pipeline and test it on a minimal dataset with a single command: - ```bash - nextflow run nf-core/pgdb -profile test, - ``` + ```console + nextflow run nf-core/pgdb -profile test,YOURPROFILE --outdir + ``` + + Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string. - > Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile ` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment. + > - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter`, `charliecloud` and `conda` which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`. + > - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile ` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
+ > - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs. + > - If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs. 4. Start running your own analysis! - + ```bash + nextflow run nf-core/pgdb -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --ncrna true --pseudogenes true --altorfs true + ``` - ```bash - nextflow run nf-core/pgdb -profile --ensembl_name homo_sapines --ensembl false - ``` + > This will create a proteogenomics database with the ENSEMBL reference proteome and non-canonical proteins such as pseudogenes, non-coding RNAs or alternative open reading frames. See [usage docs](https://nf-co.re/pgdb/usage) for all of the available options when running the pipeline. @@ -47,25 +62,18 @@ By default, the pipeline currently performs the following: ![ProteoGenomics Database](/docs/images/pgdb-databases.png) -* Download protein databases from ENSEMBL -* Translate from Genomics Variant databases into ProteoGenomics Databases (`COSMIC`, `GNOMAD`) -* Add to a Reference proteomics database, non-coding RNAs + pseudogenes. -* Compute Decoy for a proteogenomics databases +- Download protein databases from ENSEMBL +- Translate genomic variant databases (`COSMIC`, `gnomAD`) into proteogenomics databases +- Add non-coding RNAs and pseudogenes to the reference proteome database +- Compute decoy sequences for the proteogenomics database ## Documentation The nf-core/pgdb pipeline comes with documentation about the pipeline: [usage](https://nf-co.re/pgdb/usage) and [output](https://nf-co.re/pgdb/output). - - ## Credits -nf-core/pgdb was originally written by Husen M. Umer & Yasset Perez-Riverol. - -We thank the following people for their extensive assistance in the development -of this pipeline: - - +nf-core/pgdb was originally written by Husen M. Umer (Karolinska Institute) & Yasset Perez-Riverol (EMBL-EBI). ## Contributions and Support @@ -75,8 +83,15 @@ For further information or help, don't hesitate to get in touch on the [Slack `# ## Citations - - +The pgdb pipeline should be cited using the following citation: + +> Umer HM, Audain E, Zhu Y, Pfeuffer J, Sachsenberg T, Lehtiö J, Branca R, Perez-Riverol Y. Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. +> +> _Bioinformatics_. 2021 Dec 14;38(5):1470–2. doi: [10.1093/bioinformatics/btab838](https://dx.doi.org/10.1093/bioinformatics/btab838). Epub ahead of print. PMID: 34904638; PMCID: PMC8825679. + +Additionally, you can cite the pipeline directly with the following DOI: [10.5281/zenodo.4722662](https://doi.org/10.5281/zenodo.4722662) + +An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. You can cite the `nf-core` publication as follows: @@ -85,8 +100,3 @@ You can cite the `nf-core` publication as follows: > Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. > > _Nat Biotechnol._ 2020 Feb 13. 
doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). -> ReadCube: [Full Access Link](https://rdcu.be/b1GjZ) - -In addition, references of tools and data used in this pipeline are as follows: - - diff --git a/assets/email_template.html b/assets/email_template.html index 25eac41b..1c6536c1 100644 --- a/assets/email_template.html +++ b/assets/email_template.html @@ -1,11 +1,10 @@ - - + nf-core/pgdb Pipeline Report diff --git a/assets/multiqc_config.yaml b/assets/multiqc_config.yaml deleted file mode 100644 index 2d261aca..00000000 --- a/assets/multiqc_config.yaml +++ /dev/null @@ -1,11 +0,0 @@ -report_comment: > - This report has been generated by the nf-core/pgdb - analysis pipeline. For information about how to interpret these results, please see the - documentation. -report_section_order: - software_versions: - order: -1000 - nf-core-pgdb-summary: - order: -1001 - -export_plots: true diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml new file mode 100644 index 00000000..5bb746d9 --- /dev/null +++ b/assets/multiqc_config.yml @@ -0,0 +1,11 @@ +report_comment: > + This report has been generated by the nf-core/pgdb + analysis pipeline. For information about how to interpret these results, please see the + documentation. +report_section_order: + software_versions: + order: -1000 + "nf-core-pgdb-summary": + order: -1001 + +export_plots: true diff --git a/assets/nf-core-pgdb_logo.png b/assets/nf-core-pgdb_logo.png deleted file mode 100644 index d5895cc6..00000000 Binary files a/assets/nf-core-pgdb_logo.png and /dev/null differ diff --git a/assets/nf-core-pgdb_logo_light.png b/assets/nf-core-pgdb_logo_light.png new file mode 100644 index 00000000..2e87a3e7 Binary files /dev/null and b/assets/nf-core-pgdb_logo_light.png differ diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv new file mode 100644 index 00000000..5f653ab7 --- /dev/null +++ b/assets/samplesheet.csv @@ -0,0 +1,3 @@ +sample,fastq_1,fastq_2 +SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz +SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz, diff --git a/assets/schema_input.json b/assets/schema_input.json new file mode 100644 index 00000000..ce2b8f57 --- /dev/null +++ b/assets/schema_input.json @@ -0,0 +1,36 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema", + "$id": "https://raw.githubusercontent.com/nf-core/pgdb/master/assets/schema_input.json", + "title": "nf-core/pgdb pipeline - params.input schema", + "description": "Schema for the file provided with params.input", + "type": "array", + "items": { + "type": "object", + "properties": { + "sample": { + "type": "string", + "pattern": "^\\S+$", + "errorMessage": "Sample name must be provided and cannot contain spaces" + }, + "fastq_1": { + "type": "string", + "pattern": "^\\S+\\.f(ast)?q\\.gz$", + "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" + }, + "fastq_2": { + "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'", + "anyOf": [ + { + "type": "string", + "pattern": "^\\S+\\.f(ast)?q\\.gz$" + }, + { + "type": "string", + "maxLength": 0 + } + ] + } + }, + "required": ["sample", "fastq_1"] + } +} diff --git a/assets/sendmail_template.txt b/assets/sendmail_template.txt index 69d74ec0..a477b901 100644 --- a/assets/sendmail_template.txt +++ b/assets/sendmail_template.txt @@ -12,18 +12,18 @@ 
$email_html Content-Type: image/png;name="nf-core-pgdb_logo.png" Content-Transfer-Encoding: base64 Content-ID: -Content-Disposition: inline; filename="nf-core-pgdb_logo.png" +Content-Disposition: inline; filename="nf-core-pgdb_logo_light.png" -<% out << new File("$projectDir/assets/nf-core-pgdb_logo.png"). - bytes. - encodeBase64(). - toString(). - tokenize( '\n' )*. - toList()*. - collate( 76 )*. - collect { it.join() }. - flatten(). - join( '\n' ) %> +<% out << new File("$projectDir/assets/nf-core-pgdb_logo_light.png"). + bytes. + encodeBase64(). + toString(). + tokenize( '\n' )*. + toList()*. + collate( 76 )*. + collect { it.join() }. + flatten(). + join( '\n' ) %> <% if (mqcFile){ @@ -37,15 +37,15 @@ Content-ID: Content-Disposition: attachment; filename=\"${mqcFileObj.getName()}\" ${mqcFileObj. - bytes. - encodeBase64(). - toString(). - tokenize( '\n' )*. - toList()*. - collate( 76 )*. - collect { it.join() }. - flatten(). - join( '\n' )} + bytes. + encodeBase64(). + toString(). + tokenize( '\n' )*. + toList()*. + collate( 76 )*. + collect { it.join() }. + flatten(). + join( '\n' )} """ }} %> diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py new file mode 100755 index 00000000..3652c63c --- /dev/null +++ b/bin/check_samplesheet.py @@ -0,0 +1,260 @@ +#!/usr/bin/env python + + +"""Provide a command line tool to validate and transform tabular samplesheets.""" + + +import argparse +import csv +import logging +import sys +from collections import Counter +from pathlib import Path + + +logger = logging.getLogger() + + +class RowChecker: + """ + Define a service that can validate and transform each given row. + + Attributes: + modified (list): A list of dicts, where each dict corresponds to a previously + validated and transformed row. The order of rows is maintained. + + """ + + VALID_FORMATS = ( + ".fq.gz", + ".fastq.gz", + ) + + def __init__( + self, + sample_col="sample", + first_col="fastq_1", + second_col="fastq_2", + single_col="single_end", + **kwargs, + ): + """ + Initialize the row checker with the expected column names. + + Args: + sample_col (str): The name of the column that contains the sample name + (default "sample"). + first_col (str): The name of the column that contains the first (or only) + FASTQ file path (default "fastq_1"). + second_col (str): The name of the column that contains the second (if any) + FASTQ file path (default "fastq_2"). + single_col (str): The name of the new column that will be inserted and + records whether the sample contains single- or paired-end sequencing + reads (default "single_end"). + + """ + super().__init__(**kwargs) + self._sample_col = sample_col + self._first_col = first_col + self._second_col = second_col + self._single_col = single_col + self._seen = set() + self.modified = [] + + def validate_and_transform(self, row): + """ + Perform all validations on the given row and insert the read pairing status. + + Args: + row (dict): A mapping from column headers (keys) to elements of that row + (values). + + """ + self._validate_sample(row) + self._validate_first(row) + self._validate_second(row) + self._validate_pair(row) + self._seen.add((row[self._sample_col], row[self._first_col])) + self.modified.append(row) + + def _validate_sample(self, row): + """Assert that the sample name exists and convert spaces to underscores.""" + assert len(row[self._sample_col]) > 0, "Sample input is required." + # Sanitize samples slightly. 
+ row[self._sample_col] = row[self._sample_col].replace(" ", "_") + + def _validate_first(self, row): + """Assert that the first FASTQ entry is non-empty and has the right format.""" + assert len(row[self._first_col]) > 0, "At least the first FASTQ file is required." + self._validate_fastq_format(row[self._first_col]) + + def _validate_second(self, row): + """Assert that the second FASTQ entry has the right format if it exists.""" + if len(row[self._second_col]) > 0: + self._validate_fastq_format(row[self._second_col]) + + def _validate_pair(self, row): + """Assert that read pairs have the same file extension. Report pair status.""" + if row[self._first_col] and row[self._second_col]: + row[self._single_col] = False + assert ( + Path(row[self._first_col]).suffixes[-2:] == Path(row[self._second_col]).suffixes[-2:] + ), "FASTQ pairs must have the same file extensions." + else: + row[self._single_col] = True + + def _validate_fastq_format(self, filename): + """Assert that a given filename has one of the expected FASTQ extensions.""" + assert any(filename.endswith(extension) for extension in self.VALID_FORMATS), ( + f"The FASTQ file has an unrecognized extension: {filename}\n" + f"It should be one of: {', '.join(self.VALID_FORMATS)}" + ) + + def validate_unique_samples(self): + """ + Assert that the combination of sample name and FASTQ filename is unique. + + In addition to the validation, also rename the sample if more than one sample, + FASTQ file combination exists. + + """ + assert len(self._seen) == len(self.modified), "The pair of sample name and FASTQ must be unique." + if len({pair[0] for pair in self._seen}) < len(self._seen): + counts = Counter(pair[0] for pair in self._seen) + seen = Counter() + for row in self.modified: + sample = row[self._sample_col] + seen[sample] += 1 + if counts[sample] > 1: + row[self._sample_col] = f"{sample}_T{seen[sample]}" + + +def read_head(handle, num_lines=10): + """Read the specified number of lines from the current position in the file.""" + lines = [] + for idx, line in enumerate(handle): + if idx == num_lines: + break + lines.append(line) + return "".join(lines) + + +def sniff_format(handle): + """ + Detect the tabular format. + + Args: + handle (text file): A handle to a `text file`_ object. The read position is + expected to be at the beginning (index 0). + + Returns: + csv.Dialect: The detected tabular format. + + .. _text file: + https://docs.python.org/3/glossary.html#term-text-file + + """ + peek = read_head(handle) + handle.seek(0) + sniffer = csv.Sniffer() + if not sniffer.has_header(peek): + logger.critical(f"The given sample sheet does not appear to contain a header.") + sys.exit(1) + dialect = sniffer.sniff(peek) + return dialect + + +def check_samplesheet(file_in, file_out): + """ + Check that the tabular samplesheet has the structure expected by nf-core pipelines. + + Validate the general shape of the table, expected columns, and each row. Also add + an additional column which records whether one or two FASTQ reads were found. + + Args: + file_in (pathlib.Path): The given tabular samplesheet. The format can be either + CSV, TSV, or any other format automatically recognized by ``csv.Sniffer``. + file_out (pathlib.Path): Where the validated and transformed samplesheet should + be created; always in CSV format. 
+ + Example: + This function checks that the samplesheet follows the following structure, + see also the `viral recon samplesheet`_:: + + sample,fastq_1,fastq_2 + SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz + SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz + SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz, + + .. _viral recon samplesheet: + https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv + + """ + required_columns = {"sample", "fastq_1", "fastq_2"} + # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`. + with file_in.open(newline="") as in_handle: + reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle)) + # Validate the existence of the expected header columns. + if not required_columns.issubset(reader.fieldnames): + logger.critical(f"The sample sheet **must** contain the column headers: {', '.join(required_columns)}.") + sys.exit(1) + # Validate each row. + checker = RowChecker() + for i, row in enumerate(reader): + try: + checker.validate_and_transform(row) + except AssertionError as error: + logger.critical(f"{str(error)} On line {i + 2}.") + sys.exit(1) + checker.validate_unique_samples() + header = list(reader.fieldnames) + header.insert(1, "single_end") + # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`. + with file_out.open(mode="w", newline="") as out_handle: + writer = csv.DictWriter(out_handle, header, delimiter=",") + writer.writeheader() + for row in checker.modified: + writer.writerow(row) + + +def parse_args(argv=None): + """Define and immediately parse command line arguments.""" + parser = argparse.ArgumentParser( + description="Validate and transform a tabular samplesheet.", + epilog="Example: python check_samplesheet.py samplesheet.csv samplesheet.valid.csv", + ) + parser.add_argument( + "file_in", + metavar="FILE_IN", + type=Path, + help="Tabular input samplesheet in CSV or TSV format.", + ) + parser.add_argument( + "file_out", + metavar="FILE_OUT", + type=Path, + help="Transformed output samplesheet in CSV format.", + ) + parser.add_argument( + "-l", + "--log-level", + help="The desired log level (default WARNING).", + choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"), + default="WARNING", + ) + return parser.parse_args(argv) + + +def main(argv=None): + """Coordinate argument parsing and program execution.""" + args = parse_args(argv) + logging.basicConfig(level=args.log_level, format="[%(levelname)s] %(message)s") + if not args.file_in.is_file(): + logger.error(f"The given input file {args.file_in} was not found!") + sys.exit(2) + args.file_out.parent.mkdir(parents=True, exist_ok=True) + check_samplesheet(args.file_in, args.file_out) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/bin/markdown_to_html.py b/bin/markdown_to_html.py deleted file mode 100755 index a26d1ff5..00000000 --- a/bin/markdown_to_html.py +++ /dev/null @@ -1,91 +0,0 @@ -#!/usr/bin/env python -from __future__ import print_function -import argparse -import markdown -import os -import sys -import io - - -def convert_markdown(in_fn): - input_md = io.open(in_fn, mode="r", encoding="utf-8").read() - html = markdown.markdown( - "[TOC]\n" + input_md, - extensions=["pymdownx.extra", "pymdownx.b64", "pymdownx.highlight", "pymdownx.emoji", "pymdownx.tilde", "toc"], - extension_configs={ - "pymdownx.b64": {"base_path": os.path.dirname(in_fn)}, - "pymdownx.highlight": {"noclasses": True}, - "toc": {"title": "Table of 
Contents"}, - }, - ) - return html - - -def wrap_html(contents): - header = """ - - - - - -
- """ - footer = """ -
- - - """ - return header + contents + footer - - -def parse_args(args=None): - parser = argparse.ArgumentParser() - parser.add_argument("mdfile", type=argparse.FileType("r"), nargs="?", help="File to convert. Defaults to stdin.") - parser.add_argument( - "-o", "--out", type=argparse.FileType("w"), default=sys.stdout, help="Output file name. Defaults to stdout." - ) - return parser.parse_args(args) - - -def main(args=None): - args = parse_args(args) - converted_md = convert_markdown(args.mdfile.name) - html = wrap_html(converted_md) - args.out.write(html) - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/bin/scrape_software_versions.py b/bin/scrape_software_versions.py deleted file mode 100755 index 0bde138a..00000000 --- a/bin/scrape_software_versions.py +++ /dev/null @@ -1,54 +0,0 @@ -#!/usr/bin/env python -from __future__ import print_function -from collections import OrderedDict -import re - -# TODO nf-core: Add additional regexes for new tools in process get_software_versions -regexes = { - "nf-core/pgdb": ["v_pipeline.txt", r"(\S+)"], - "Nextflow": ["v_nextflow.txt", r"(\S+)"], - "FastQC": ["v_fastqc.txt", r"FastQC v(\S+)"], - "MultiQC": ["v_multiqc.txt", r"multiqc, version (\S+)"], -} -results = OrderedDict() -results["nf-core/pgdb"] = 'N/A' -results["Nextflow"] = 'N/A' -results["FastQC"] = 'N/A' -results["MultiQC"] = 'N/A' - -# Search each file using its regex -for k, v in regexes.items(): - try: - with open(v[0]) as x: - versions = x.read() - match = re.search(v[1], versions) - if match: - results[k] = "v{}".format(match.group(1)) - except IOError: - results[k] = False - -# Remove software set to false in results -for k in list(results): - if not results[k]: - del results[k] - -# Dump to YAML -print( - """ -id: 'software_versions' -section_name: 'nf-core/pgdb Software Versions' -section_href: 'https://github.com/nf-core/pgdb' -plot_type: 'html' -description: 'are collected at run time from the software output.' -data: | -
-"""
-)
-for k, v in results.items():
-    print("        <dt>{}</dt><dd><samp>{}</samp></dd>".format(k, v))
-print("    </dl>
") - -# Write out regexes as csv file: -with open("software_versions.csv", "w") as f: - for k, v in results.items(): - f.write("{}\t{}\n".format(k, v)) diff --git a/conf/assemblies_conf.json b/conf/assemblies_conf.json index 09c801fc..d18bc8fb 100644 --- a/conf/assemblies_conf.json +++ b/conf/assemblies_conf.json @@ -49111,4 +49111,4 @@ "base_count": 20864403, "assembly_ucsc": null } -] \ No newline at end of file +] diff --git a/conf/base.config b/conf/base.config index d6265dd9..3c859501 100644 --- a/conf/base.config +++ b/conf/base.config @@ -1,51 +1,55 @@ /* - * ------------------------------------------------- - * nf-core/pgdb Nextflow base config file - * ------------------------------------------------- - * A 'blank slate' config file, appropriate for general - * use on most high performance compute environments. - * Assumes that all software is installed and available - * on the PATH. Runs in `local` mode - all jobs will be - * run on the logged in environment. - */ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + nf-core/pgdb Nextflow base config file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + A 'blank slate' config file, appropriate for general use on most high performance + compute environments. Assumes that all software is installed and available on + the PATH. Runs in `local` mode - all jobs will be run on the logged in environment. +---------------------------------------------------------------------------------------- +*/ process { - // TODO nf-core: Check the defaults for all processes - cpus = { check_max( 1 * task.attempt, 'cpus' ) } - memory = { check_max( 7.GB * task.attempt, 'memory' ) } - time = { check_max( 4.h * task.attempt, 'time' ) } - - errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' } - maxRetries = 1 - maxErrors = '-1' - - // Process-specific resource requirements - // NOTE - Only one of the labels below are used in the fastqc process in the main script. - // If possible, it would be nice to keep the same label naming convention when - // adding in your processes. - // TODO nf-core: Customise requirements for specific processes. - // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors - withLabel:process_low { - cpus = { check_max( 2 * task.attempt, 'cpus' ) } + cpus = { check_max( 2 * task.attempt, 'cpus' ) } memory = { check_max( 6.GB * task.attempt, 'memory' ) } - time = { check_max( 6.h * task.attempt, 'time' ) } - } - withLabel:process_medium { - cpus = { check_max( 6 * task.attempt, 'cpus' ) } - memory = { check_max( 12.GB * task.attempt, 'memory' ) } - time = { check_max( 8.h * task.attempt, 'time' ) } - } - withLabel:process_high { - cpus = { check_max( 12 * task.attempt, 'cpus' ) } - memory = { check_max( 32.GB * task.attempt, 'memory' ) } - time = { check_max( 10.h * task.attempt, 'time' ) } - } - withLabel:process_long { - time = { check_max( 20.h * task.attempt, 'time' ) } - } - withName:get_software_versions { - cache = false - } + time = { check_max( 4.h * task.attempt, 'time' ) } + + errorStrategy = { task.exitStatus in [143,137,104,134,139, 130] ? 'retry' : 'finish' } + maxRetries = 1 + maxErrors = '-1' + // Process-specific resource requirements + // NOTE - Please try and re-use the labels below as much as possible. + // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. 
+ // If possible, it would be nice to keep the same label naming convention when + // adding in your local modules too. + // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors + withLabel:process_low { + cpus = { check_max( 2 * task.attempt, 'cpus' ) } + memory = { check_max( 12.GB * task.attempt, 'memory' ) } + time = { check_max( 4.h * task.attempt, 'time' ) } + } + withLabel:process_medium { + cpus = { check_max( 6 * task.attempt, 'cpus' ) } + memory = { check_max( 36.GB * task.attempt, 'memory' ) } + time = { check_max( 8.h * task.attempt, 'time' ) } + } + withLabel:process_high { + cpus = { check_max( 12 * task.attempt, 'cpus' ) } + memory = { check_max( 72.GB * task.attempt, 'memory' ) } + time = { check_max( 16.h * task.attempt, 'time' ) } + } + withLabel:process_long { + time = { check_max( 20.h * task.attempt, 'time' ) } + } + withLabel:process_high_memory { + memory = { check_max( 200.GB * task.attempt, 'memory' ) } + } + withLabel:error_ignore { + errorStrategy = 'ignore' + } + withLabel:error_retry { + errorStrategy = 'retry' + maxRetries = 2 + } } diff --git a/conf/cbioportal_config.yaml b/conf/cbioportal_config.yaml index 7adba59a..1cf4a4c4 100644 --- a/conf/cbioportal_config.yaml +++ b/conf/cbioportal_config.yaml @@ -4,7 +4,7 @@ cbioportal_data_downloader: cbioportal_api: base_url: https://www.cbioportal.org/webservice.do cancer_studies: cmd=getCancerStudies - cbioportal_download_url: http://download.cbioportal.org + cbioportal_download_url: https://cbioportal-datahub.s3.amazonaws.com logger: formatters: DEBUG: "%(asctime)s [%(levelname)7s][%(name)48s][%(module)32s, %(lineno)4s] %(message)s" @@ -12,10 +12,8 @@ cbioportal_data_downloader: loglevel: DEBUG multithreading: True proteindb: - filter_info: - filter_column: 'Tumor_Sample_Barcode' - accepted_values: 'all' - split_by_filter_column: False - clinical_sample_file: '' - - \ No newline at end of file + filter_info: + filter_column: "CANCER_TYPE" + accepted_values: "all" + split_by_filter_column: False + clinical_sample_file: "" diff --git a/conf/cosmic_config.yaml b/conf/cosmic_config.yaml index ea17c211..5441959f 100644 --- a/conf/cosmic_config.yaml +++ b/conf/cosmic_config.yaml @@ -8,17 +8,16 @@ cosmic_data: mutations_cellline_url: cosmic/file_download/GRCh38/cell_lines/v92 mutations_cellline_file: CosmicCLP_MutantExport.tsv.gz all_celllines_genes_file: All_CellLines_Genes.fasta.gz - cosmic_user: '' - cosmic_password: '' + cosmic_user: "" + cosmic_password: "" logger: formatters: DEBUG: "%(asctime)s [%(levelname)7s][%(name)48s][%(module)32s, %(lineno)4s] %(message)s" INFO: "%(asctime)s [%(levelname)7s][%(name)48s] %(message)s" loglevel: DEBUG proteindb: - filter_info: - filter_column: 'Primary site' - accepted_values: 'all' - split_by_filter_column: False - clinical_sample_file: '' - \ No newline at end of file + filter_info: + filter_column: "Histology subtype 1" + accepted_values: "all" + split_by_filter_column: False + clinical_sample_file: "" diff --git a/conf/ensembl_config.yaml b/conf/ensembl_config.yaml index fab2bdd7..ce5c03f2 100644 --- a/conf/ensembl_config.yaml +++ b/conf/ensembl_config.yaml @@ -1,27 +1,27 @@ ensembl_translation: translation_table: 1 - proteindb_output_file: 'peptide-database.fa' + proteindb_output_file: "peptide-database.fa" ensembl_translation: mito_translation_table: 2 var_prefix: "var" report_ref_seq: False - annotation_field_name: 'CSQ' - af_field: '' + annotation_field_name: "CSQ" + af_field: "" af_threshold: 0.01 transcript_index: 3 consequence_index: 1 - 
exclude_biotypes: '' - exclude_consequences: 'downstream_gene_variant, upstream_gene_variant, intergenic_variant, intron_variant, synonymous_variant, regulatory_region_variant' + exclude_biotypes: "" + exclude_consequences: "downstream_gene_variant, upstream_gene_variant, intergenic_variant, intron_variant, synonymous_variant, regulatory_region_variant" skip_including_all_cds: False - include_biotypes: 'protein_coding,polymorphic_pseudogene,non_stop_decay,nonsense_mediated_decay,IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene,TEC' - include_consequences: 'all' - biotype_str: 'transcript_biotype' + include_biotypes: "protein_coding,polymorphic_pseudogene,non_stop_decay,nonsense_mediated_decay,IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene,TEC" + include_consequences: "all" + biotype_str: "transcript_biotype" num_orfs: 3 num_orfs_complement: 0 expression_str: "" expression_thresh: 5.0 ignore_filters: False - accepted_filters: '' + accepted_filters: "" logger: formatters: DEBUG: "%(asctime)s [%(levelname)7s][%(name)48s][%(module)32s, %(lineno)4s] %(message)s" diff --git a/conf/ensembl_downloader_config.yaml b/conf/ensembl_downloader_config.yaml index c820a03a..9ee7caf7 100644 --- a/conf/ensembl_downloader_config.yaml +++ b/conf/ensembl_downloader_config.yaml @@ -6,23 +6,24 @@ ensembl_data_downloader: skip_cds: False skip_cdna: False skip_ncrna: False + skip_dna: False skip_vcf: False ensembl_ftp: base_url: ftp://ftp.ensembl.org/pub - rewrite_local_path_ensembl_repo: 'False' + rewrite_local_path_ensembl_repo: "False" ensembl_file_names: protein_sequence_file: file_type: pep file_suffixes: - - all - - abinitio + - all + - abinitio file_extension: fa gtf_file: file_suffixes: - - '' - - abinitio. - - chr. - - chr_patch_hapl_scaff. + - "" + - abinitio. + - chr. + - chr_patch_hapl_scaff. 
file_extension: gtf ensembl_api: server: http://rest.ensembl.org @@ -31,4 +32,4 @@ ensembl_data_downloader: formatters: DEBUG: "%(asctime)s [%(levelname)7s][%(name)48s][%(module)32s, %(lineno)4s] %(message)s" INFO: "%(asctime)s [%(levelname)7s][%(name)48s] %(message)s" - loglevel: DEBUG \ No newline at end of file + loglevel: DEBUG diff --git a/conf/igenomes.config b/conf/igenomes.config deleted file mode 100644 index 31b7ee61..00000000 --- a/conf/igenomes.config +++ /dev/null @@ -1,421 +0,0 @@ -/* - * ------------------------------------------------- - * Nextflow config file for iGenomes paths - * ------------------------------------------------- - * Defines reference genomes, using iGenome paths - * Can be used by any config that customises the base - * path using $params.igenomes_base / --igenomes_base - */ - -params { - // illumina iGenomes reference file paths - genomes { - 'GRCh37' { - fasta = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt" - mito_name = "MT" - macs_gsize = "2.7e9" - blacklist = "${projectDir}/assets/blacklists/GRCh37-blacklist.bed" - } - 'GRCh38' { - fasta = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed" - mito_name = "chrM" - macs_gsize = "2.7e9" - blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" - } - 'GRCm38' { - fasta = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/README.txt" - mito_name = "MT" - macs_gsize = "1.87e9" - blacklist = "${projectDir}/assets/blacklists/GRCm38-blacklist.bed" - } - 'TAIR10' { - fasta = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BWAIndex/genome.fa" - bowtie2 = 
"${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/README.txt" - mito_name = "Mt" - } - 'EB2' { - fasta = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/README.txt" - } - 'UMD3.1' { - fasta = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/README.txt" - mito_name = "MT" - } - 'WBcel235' { - fasta = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.bed" - mito_name = "MtDNA" - macs_gsize = "9e7" - } - 'CanFam3.1' { - fasta = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.gtf" - bed12 = 
"${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/README.txt" - mito_name = "MT" - } - 'GRCz10' { - fasta = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.bed" - mito_name = "MT" - } - 'BDGP6' { - fasta = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.bed" - mito_name = "M" - macs_gsize = "1.2e8" - } - 'EquCab2' { - fasta = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/README.txt" - mito_name = "MT" - } - 'EB1' { - fasta = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/README.txt" - } - 'Galgal4' { - fasta = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/Bowtie2Index/" - star = 
"${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.bed" - mito_name = "MT" - } - 'Gm01' { - fasta = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/README.txt" - } - 'Mmul_1' { - fasta = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/README.txt" - mito_name = "MT" - } - 'IRGSP-1.0' { - fasta = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.bed" - mito_name = "Mt" - } - 'CHIMP2.1.4' { - fasta = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/README.txt" - mito_name = "MT" - } - 'Rnor_6.0' { - fasta = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa" - bwa = 
"${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.bed" - mito_name = "MT" - } - 'R64-1-1' { - fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed" - mito_name = "MT" - macs_gsize = "1.2e7" - } - 'EF2' { - fasta = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/README.txt" - mito_name = "MT" - macs_gsize = "1.21e7" - } - 'Sbi1' { - fasta = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/README.txt" - } - 'Sscrofa10.2' { - fasta = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.gtf" - bed12 = 
"${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/README.txt" - mito_name = "MT" - } - 'AGPv3' { - fasta = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.bed" - mito_name = "Mt" - } - 'hg38' { - fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed" - mito_name = "chrM" - macs_gsize = "2.7e9" - blacklist = "${projectDir}/assets/blacklists/hg38-blacklist.bed" - } - 'hg19' { - fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/README.txt" - mito_name = "chrM" - macs_gsize = "2.7e9" - blacklist = "${projectDir}/assets/blacklists/hg19-blacklist.bed" - } - 'mm10' { - fasta = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/README.txt" - mito_name = "chrM" - macs_gsize = "1.87e9" - blacklist = "${projectDir}/assets/blacklists/mm10-blacklist.bed" - } - 'bosTau8' { - fasta = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/STARIndex/" - bismark = 
"${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.bed" - mito_name = "chrM" - } - 'ce10' { - fasta = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/README.txt" - mito_name = "chrM" - macs_gsize = "9e7" - } - 'canFam3' { - fasta = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/README.txt" - mito_name = "chrM" - } - 'danRer10' { - fasta = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.bed" - mito_name = "chrM" - macs_gsize = "1.37e9" - } - 'dm6' { - fasta = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.bed" - mito_name = "chrM" - macs_gsize = "1.2e8" - } - 'equCab2' { - fasta = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BWAIndex/genome.fa" - bowtie2 = 
"${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/README.txt" - mito_name = "chrM" - } - 'galGal4' { - fasta = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/README.txt" - mito_name = "chrM" - } - 'panTro4' { - fasta = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/README.txt" - mito_name = "chrM" - } - 'rn6' { - fasta = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.bed" - mito_name = "chrM" - } - 'sacCer3' { - fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/WholeGenomeFasta/genome.fa" - bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/" - readme = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/README.txt" - mito_name = "chrM" - macs_gsize = "1.2e7" - } - 'susScr3' { - fasta = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/WholeGenomeFasta/genome.fa" - bwa = 
"${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/genome.fa" - bowtie2 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/" - star = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/STARIndex/" - bismark = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/" - gtf = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.gtf" - bed12 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.bed" - readme = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/README.txt" - mito_name = "chrM" - } - } -} diff --git a/conf/modules.config b/conf/modules.config new file mode 100644 index 00000000..21d42161 --- /dev/null +++ b/conf/modules.config @@ -0,0 +1,22 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Config file for defining DSL2 per module options and publishing paths +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Available keys to override module options: + ext.args = Additional arguments appended to command in module. + ext.args2 = Second set of arguments appended to command in module (multi-tool modules). + ext.args3 = Third set of arguments appended to command in module (multi-tool modules). + ext.prefix = File name prefix for output files. +---------------------------------------------------------------------------------------- +*/ + +process { + + publishDir = [ + path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + + +} diff --git a/conf/protein_decoy.yaml b/conf/protein_decoy.yaml index 1eee726f..c024d1da 100644 --- a/conf/protein_decoy.yaml +++ b/conf/protein_decoy.yaml @@ -2,9 +2,14 @@ proteindb_decoy: output: protein-decoy.fa cleavage_sites: KR cleavage_position: c - anti_cleavage_sites: '' + max_missed_cleavages: 2 + anti_cleavage_sites: "" min_peptide_length: 5 max_iterations: 100 + keep_target_hits: true + enzyme: trypsin + method: decoypyrat + max_peptide_length: 100 do_not_shuffle: False do_not_switch: False decoy_prefix: DECOY_ diff --git a/conf/test.config b/conf/test.config index c1a7c822..e89e0feb 100644 --- a/conf/test.config +++ b/conf/test.config @@ -1,25 +1,30 @@ /* - * ------------------------------------------------- - * Nextflow config file for running tests - * ------------------------------------------------- - * Defines bundled input files and everything required - * to run a fast and simple test. Use as follows: - * nextflow run nf-core/pgdb -profile test, - */ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running minimal tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a fast and simple pipeline test. 
+ + Use as follows: + nextflow run nf-core/pgdb -profile test, --outdir + +---------------------------------------------------------------------------------------- +*/ params { - config_profile_name = 'Test profile' - config_profile_description = 'Minimal test dataset to check pipeline function' - // Limit resources so that this can run on GitHub Actions - max_cpus = 2 - max_memory = 6.GB - max_time = 48.h + config_profile_name = 'Test profile' + config_profile_description = 'Minimal test dataset to check pipeline function' + + // Limit resources so that this can run on GitHub Actions + max_cpus = 2 + max_memory = 6.GB + max_time = 48.h - single_end = false - ensembl_name = 'homo_sapiens' - ensembl = false - gnomad = false - cosmic = false - cosmic_celllines = false - cbioportal = false + ensembl_name = 'meleagris_gallopavo' + ensembl = false + gnomad = false + cosmic = false + cosmic_celllines = false + cbioportal = false + decoy = true + clean_database = true } diff --git a/conf/test_cbioportal.config b/conf/test_cbioportal.config new file mode 100644 index 00000000..e4151d8f --- /dev/null +++ b/conf/test_cbioportal.config @@ -0,0 +1,31 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running full-size tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a full size pipeline test. + + Use as follows: + nextflow run nf-core/pgdb -profile test_cbioportal, --outdir + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Full test profile' + config_profile_description = 'Full test cBioPortal generation' + + // Limit resources so that this can run on GitHub Actions + max_cpus = 2 + max_memory = 6.GB + max_time = 48.h + + // Input data for full size test + ensembl_name = 'homo_sapiens' + ensembl = false + add_reference = false + gnomad = false + cosmic = false + cosmic_celllines = false + cbioportal = true + decoy = true +} diff --git a/conf/test_cosmic_cbio.config b/conf/test_cosmic_cbio.config new file mode 100644 index 00000000..d170bfba --- /dev/null +++ b/conf/test_cosmic_cbio.config @@ -0,0 +1,31 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running full-size tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a full size pipeline test. 
+ + Use as follows: + nextflow run nf-core/pgdb -profile test_cosmic, --outdir + +---------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Full test profile' + config_profile_description = 'Full test COSMIC Cell lines and cBioPortal generation' + + // Limit resources so that this can run on GitHub Actions + max_cpus = 2 + max_memory = 6.GB + max_time = 48.h + + // Input data for full size test + ensembl_name = 'homo_sapiens' + ensembl = false + add_reference = false + gnomad = false + cosmic = false + cosmic_celllines = true + cbioportal = true + decoy = true +} diff --git a/conf/test_full.config b/conf/test_full.config index 4fe60805..59016146 100644 --- a/conf/test_full.config +++ b/conf/test_full.config @@ -1,28 +1,30 @@ /* - * ------------------------------------------------- - * Nextflow config file for running full-size tests - * ------------------------------------------------- - * Defines bundled input files and everything required - * to run a full size pipeline test. Use as follows: - * nextflow run nf-core/pgdb -profile test_full, - */ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running full-size tests +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a full size pipeline test. + + Use as follows: + nextflow run nf-core/pgdb -profile test_full, --outdir + +---------------------------------------------------------------------------------------- +*/ params { - config_profile_name = 'Full test profile' - config_profile_description = 'Full test COSMIC generation' + config_profile_name = 'Full test profile' + config_profile_description = 'Full test COSMIC generation' - // Input data for full size test - // TODO nf-core: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA) - // TODO nf-core: Give any required params for the test so that command line flags are not needed - single_end = false - max_cpus = 2 - max_memory = 6.GB - max_time = 48.h + // Limit resources so that this can run on GitHub Actions + //max_cpus = 2 + //max_memory = 6.GB + //max_time = 48.h - ensembl_name = 'homo_sapiens' - ensembl = false - gnomad = false - cosmic = true - cosmic_celllines = false - cbioportal = false + // Input data for full size test + ensembl_name = 'homo_sapiens' + ensembl = false + gnomad = false + cosmic = true + cosmic_celllines = false + cbioportal = false + decoy = true } diff --git a/docs/README.md b/docs/README.md index 4ed33963..c50da6d2 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,9 +2,9 @@ The nf-core/pgdb documentation is split into the following pages: -* [Usage](usage.md) - * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. -* [Output](output.md) - * An overview of the different results produced by the pipeline and how to interpret them. +- [Usage](usage.md) + - An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. +- [Output](output.md) + - An overview of the different results produced by the pipeline and how to interpret them. 
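For reference, the minimal and full-size test profiles defined above are launched like any other nf-core profile. A hypothetical invocation is sketched below; Docker is assumed as the container engine and the output directory is arbitrary:

```bash
# Minimal test profile (small proteome, decoy generation and database cleaning enabled)
nextflow run nf-core/pgdb -profile test,docker --outdir ./results

# Full-size cBioPortal test profile
nextflow run nf-core/pgdb -profile test_cbioportal,docker --outdir ./results
```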
You can find a lot more documentation about installing, configuring and running nf-core pipelines on the website: [https://nf-co.re](https://nf-co.re) diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png new file mode 100755 index 00000000..361d0e47 Binary files /dev/null and b/docs/images/mqc_fastqc_adapter.png differ diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png new file mode 100755 index 00000000..cb39ebb8 Binary files /dev/null and b/docs/images/mqc_fastqc_counts.png differ diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png new file mode 100755 index 00000000..a4b89bf5 Binary files /dev/null and b/docs/images/mqc_fastqc_quality.png differ diff --git a/docs/images/nf-core-pgdb_logo.png b/docs/images/nf-core-pgdb_logo.png deleted file mode 100644 index 08232c49..00000000 Binary files a/docs/images/nf-core-pgdb_logo.png and /dev/null differ diff --git a/docs/images/nf-core-pgdb_logo_dark.png b/docs/images/nf-core-pgdb_logo_dark.png new file mode 100644 index 00000000..b94eef17 Binary files /dev/null and b/docs/images/nf-core-pgdb_logo_dark.png differ diff --git a/docs/images/nf-core-pgdb_logo_light.png b/docs/images/nf-core-pgdb_logo_light.png new file mode 100644 index 00000000..52bca4ce Binary files /dev/null and b/docs/images/nf-core-pgdb_logo_light.png differ diff --git a/docs/output.md b/docs/output.md index aa44a34f..a154bdc4 100644 --- a/docs/output.md +++ b/docs/output.md @@ -1,63 +1,53 @@ # nf-core/pgdb: Output -## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/pgdb/output](https://nf-co.re/pgdb/output) - -> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._ - ## Introduction -This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. +This document describes the output produced by the pipeline. The main output of the pgdb pipeline is the protein sequence database. Protein databases are used for peptide and protein [identification algorithms](https://pubmed.ncbi.nlm.nih.gov/27975215/). In most proteomics experiments, researchers try to quantify peptides and proteins using canonical protein databases such as ENSEMBL or UniProt. -The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. +[Proteogenomics](https://www.nature.com/articles/nmeth.3144) is the field of research at the interface of proteomics and genomics. In this approach, "customized" protein sequence databases generated using genomic and transcriptomic information are used to help identify "novel" peptides (not present in reference/canonical protein sequence databases) from mass spectrometry–based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry–based proteomics, the pace of proteogenomic research has greatly accelerated. - +pgdb allows researchers to create custom proteogenomic databases using different sources such as COSMIC, cBioPortal, ENSEMBL variants or gnomAD.
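To make the parameter mapping concrete, the sketch below shows a hypothetical command that combines the canonical ENSEMBL proteome with cBioPortal variants and decoy sequences; the flags mirror parameters that appear in the test configuration files of this pipeline (`ensembl_name`, `cbioportal`, `decoy`), while the profile and output directory are placeholders. COSMIC and gnomAD sources are switched on with analogous flags.

```bash
# Hypothetical example: human canonical proteome plus cBioPortal variants,
# with decoy sequences appended to the final database
nextflow run nf-core/pgdb -profile docker \
    --ensembl_name homo_sapiens \
    --cbioportal \
    --decoy \
    --outdir ./results
```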
## Pipeline overview -The pipeline is built using [Nextflow](https://www.nextflow.io/) -and processes data using the following steps: +The pipeline is built using [Nextflow](https://www.nextflow.io/) and aims to create a final protein database by adding different protein sequences depending on the options provided by the user/researcher. The pipeline handles the downloads from the different sources and performs operations on the data, such as translating genome and transcript sequences into protein sequences. -* [FastQC](#fastqc) - Read quality control -* [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline -* [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution +The main source of canonical protein sequences in pgdb is ENSEMBL. The user can then attach protein variants from COSMIC or cBioPortal to the proteogenomic database. In addition, pseudogenes, lncRNAs and other novel translation events can be configured to add novel protein sequences. The main sources of sequences are: -## FastQC +- [Ensembl](#ensembl) - Download the Ensembl protein databases and add canonical proteins. +- [Ensembl non-canonical](#ensembl-non-canonical) - ENSEMBL non-canonical proteins +- [Variants](#variants) - Add the COSMIC and cBioPortal variant databases. +- [Decoy](#decoy) - Add decoy proteins to the final database. +- [Output](#output-files) - Output results including clean databases and decoy generation -[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. +## Pipeline modes -For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). +### Ensembl -**Output files:** +The pipeline will download the ENSEMBL reference proteome, which will be added to the final protein database. The protein database is downloaded from the [ENSEMBL FTP](http://www.ensembl.org/info/data/ftp/index.html). -* `fastqc/` - * `*_fastqc.html`: FastQC report containing quality metrics for your untrimmed raw fastq files. -* `fastqc/zips/` - * `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images. +### Ensembl non-canonical -> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. +The Ensembl non-canonical sequences include pseudogenes, lncRNAs, etc. The accession prefix for each kind of novel protein is predefined by the [pypgatk tool](https://github.com/bigbio/py-pgatk). -## MultiQC +- `ncRNA_ENST00000456688` - non-coding RNA transcript +- `altorf_ENST00000310473` - alternative open reading frame +- `pseudo_ENST00000436135` - pseudogene translation -[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. +### Variants -The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. +The COSMIC or cBioPortal variants are downloaded automatically from these resources.
The accessions of those proteins are: -For more information about how to use MultiQC reports, see [https://multiqc.info](https://multiqc.info). +- `COSMIC:ANXA3_ENST00000503570:p.A67T:Substitution-Missense` - The protein accession includes the position of the amino acid variant. -**Output files:** +### Decoy -* `multiqc/` - * `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser. - * `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline. - * `multiqc_plots/`: directory containing static images from the report in various formats. +Decoys can be added to the final database. Decoy accessions are prefixed with `DECOY_` by default, but the prefix can be configured by the user. -## Pipeline information +## Output files -[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. +The nf-core/pgdb pipeline produces a single output file: `/final_proteinDB.fa` _(or whatever `params.final_database_protein` is set to)_. -**Output files:** +This FASTA database includes all of the protein sequences: the reference proteome, variants, pseudogenes, etc. -* `pipeline_info/` - * Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. - * Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.csv`. - * Documentation for interpretation of results in HTML format: `results_description.html`. +A directory called `pipeline_info` is also created with logs and reports from the pipeline execution. diff --git a/docs/usage.md b/docs/usage.md index a1b0e42d..db9c473e 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -6,16 +6,24 @@ ## Introduction - +General usage: -## Running the pipeline +```bash +nextflow run nf-core/pgdb -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --ensembl_name homo_sapiens --decoy +``` + +This command will download the ENSEMBL human proteome and append the decoy sequences to it. -The typical command for running the pipeline is as follows: +## Adding non-canonical proteins + +The main purpose of the pgdb pipeline is to add non-canonical proteins to the database, including variants, ncRNAs and altORFs: ```bash -nextflow run nf-core/pgdb --input '*_R{1,2}.fastq.gz' -profile docker +nextflow run nf-core/pgdb --ensembl_name homo_sapiens --altorfs --decoy -profile docker ``` +Please see the [parameter documentation](https://nf-co.re/pgdb/parameters) for the full list of available options. + This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: @@ -27,6 +35,10 @@ results # Finished results (configurable, see below) # Other nextflow hidden files, eg. history of pipeline runs and old logs. ``` +## Pipeline full documentation and examples + +The full documentation of the pipeline can be found [here](https://pgatk.readthedocs.io/), including examples of generating databases from COSMIC or cBioPortal. + ### Updating the pipeline When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version.
When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: @@ -51,7 +63,7 @@ This version number will be logged in reports when you run the pipeline, so that Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. -Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Conda) - see below. +Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. > We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. @@ -62,22 +74,28 @@ They are loaded in sequence, so later profiles can overwrite earlier profiles. If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended. -* `docker` - * A generic configuration profile to be used with [Docker](https://docker.com/) - * Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) -* `singularity` - * A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) - * Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) -* `podman` - * A generic configuration profile to be used with [Podman](https://podman.io/) - * Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) -* `conda` - * Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity or Podman. - * A generic configuration profile to be used with [Conda](https://conda.io/docs/) - * Pulls most software from [Bioconda](https://bioconda.github.io/) -* `test` - * A profile with a complete configuration for automated testing - * Includes links to test data so needs no other parameters +- `docker` + - A generic configuration profile to be used with [Docker](https://docker.com/) + - Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) +- `singularity` + - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) + - Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) +- `podman` + - A generic configuration profile to be used with [Podman](https://podman.io/) + - Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) +- `shifter` + - A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/) + - Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) +- `charliecloud` + - A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/) + - Pulls software from Docker Hub: [`nfcore/pgdb`](https://hub.docker.com/r/nfcore/pgdb/) +- `conda` + - Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. 
+ - A generic configuration profile to be used with [Conda](https://conda.io/docs/) + - Pulls most software from [Bioconda](https://bioconda.github.io/) +- `test` + - A profile with a complete configuration for automated testing + - Includes links to test data so needs no other parameters ### `-resume` @@ -95,21 +113,39 @@ Each step in the pipeline has a default set of requirements for number of CPUs, Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process `star` 32GB of memory, you could use the following config: -```nextflow -process { - withName: star { - memory = 32.GB - } -} -``` + ```nextflow + process { + withName: PANGOLIN { + container = 'https://depot.galaxyproject.org/singularity/pangolin:3.0.5--pyhdfd78af_0' + } + } + ``` + +To find the exact name of a process you wish to modify the compute resources, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: `Error executing process > 'bwa'`. In this case the name to specify in the custom config file is `bwa`. See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information. -If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter (see definition above). You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile. +- For Conda: + + ```nextflow + process { + withName: PANGOLIN { + conda = 'bioconda::pangolin=3.0.5' + } + } + ``` + +> **NB:** If you wish to periodically update individual tool-specific results (e.g. Pangolin) generated by the pipeline then you must ensure to keep the `work/` directory otherwise the `-resume` ability of the pipeline will be compromised and it will restart from scratch. + +### nf-core/configs + +In most cases, you will only need to create a custom config as a one-off but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter. You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile. + +See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information about creating your own configuration files. 
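The container and Conda snippets above override per-process settings; the same `withName` mechanism can be used for the CPU/memory/time requirements mentioned at the start of this section. A minimal sketch is shown below, assuming a hypothetical process name; check the Nextflow log or error message for the real process names in this pipeline:

```nextflow
process {
    // Hypothetical process name; replace with the name reported by Nextflow
    withName: ENSEMBL_FASTA {
        cpus   = 4
        memory = 32.GB
        time   = 8.h
    }
}
```

Save such a snippet to a file and supply it with the `-c` parameter described above.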
If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs). -### Running in the background +## Running in the background Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. @@ -118,11 +154,11 @@ The Nextflow `-bg` flag launches Nextflow in the background, detached from your Alternatively, you can use `screen` / `tmux` or similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted your job scheduler (from where it submits more jobs). -#### Nextflow memory requirements +## Nextflow memory requirements In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~./bash_profile`): -```bash +```console NXF_OPTS='-Xms1g -Xmx4g' ``` diff --git a/environment.yml b/environment.yml deleted file mode 100644 index faefad15..00000000 --- a/environment.yml +++ /dev/null @@ -1,17 +0,0 @@ -# You can use this file to create a conda environment for this pipeline: -# conda env create -f environment.yml -name: nf-core-pgdb-1.0dev -channels: - - conda-forge - - bioconda - - defaults -dependencies: - - conda-forge::python=3.9.1 - - conda-forge::markdown=3.3.3 - - conda-forge::pymdown-extensions=8.1 - - conda-forge::pygments=2.7.4 - # All the dependencies of the pipeline. - - bioconda::pypgatk=0.0.10 - - conda-forge::coreutils=8.31 - - conda-forge::git-lfs=2.13.2 - - conda-forge::git=2.30.0 diff --git a/lib/Headers.groovy b/lib/Headers.groovy new file mode 100644 index 00000000..15d1d388 --- /dev/null +++ b/lib/Headers.groovy @@ -0,0 +1,43 @@ +/* + * This file holds several functions used to render the nf-core ANSI header. + */ + +class Headers { + + private static Map log_colours(Boolean monochrome_logs) { + Map colorcodes = [:] + colorcodes['reset'] = monochrome_logs ? '' : "\033[0m" + colorcodes['dim'] = monochrome_logs ? '' : "\033[2m" + colorcodes['black'] = monochrome_logs ? '' : "\033[0;30m" + colorcodes['green'] = monochrome_logs ? '' : "\033[0;32m" + colorcodes['yellow'] = monochrome_logs ? '' : "\033[0;33m" + colorcodes['yellow_bold'] = monochrome_logs ? '' : "\033[1;93m" + colorcodes['blue'] = monochrome_logs ? '' : "\033[0;34m" + colorcodes['purple'] = monochrome_logs ? '' : "\033[0;35m" + colorcodes['cyan'] = monochrome_logs ? '' : "\033[0;36m" + colorcodes['white'] = monochrome_logs ? '' : "\033[0;37m" + colorcodes['red'] = monochrome_logs ? 
'' : "\033[1;91m" + return colorcodes + } + + static String dashed_line(monochrome_logs) { + Map colors = log_colours(monochrome_logs) + return "-${colors.dim}----------------------------------------------------${colors.reset}-" + } + + static String nf_core(workflow, monochrome_logs) { + Map colors = log_colours(monochrome_logs) + String.format( + """\n + ${dashed_line(monochrome_logs)} + ${colors.green},--.${colors.black}/${colors.green},-.${colors.reset} + ${colors.blue} ___ __ __ __ ___ ${colors.green}/,-._.--~\'${colors.reset} + ${colors.blue} |\\ | |__ __ / ` / \\ |__) |__ ${colors.yellow}} {${colors.reset} + ${colors.blue} | \\| | \\__, \\__/ | \\ |___ ${colors.green}\\`-._,-`-,${colors.reset} + ${colors.green}`._,._,\'${colors.reset} + ${colors.purple} ${workflow.manifest.name} v${workflow.manifest.version}${colors.reset} + ${dashed_line(monochrome_logs)} + """.stripIndent() + ) + } +} diff --git a/lib/NfcoreSchema.groovy b/lib/NfcoreSchema.groovy new file mode 100644 index 00000000..b3d092f8 --- /dev/null +++ b/lib/NfcoreSchema.groovy @@ -0,0 +1,529 @@ +// +// This file holds several functions used to perform JSON parameter validation, help and summary rendering for the nf-core pipeline template. +// + +import org.everit.json.schema.Schema +import org.everit.json.schema.loader.SchemaLoader +import org.everit.json.schema.ValidationException +import org.json.JSONObject +import org.json.JSONTokener +import org.json.JSONArray +import groovy.json.JsonSlurper +import groovy.json.JsonBuilder + +class NfcoreSchema { + + // + // Resolve Schema path relative to main workflow directory + // + public static String getSchemaPath(workflow, schema_filename='nextflow_schema.json') { + return "${workflow.projectDir}/${schema_filename}" + } + + // + // Function to loop over all parameters defined in schema and check + // whether the given parameters adhere to the specifications + // + /* groovylint-disable-next-line UnusedPrivateMethodParameter */ + public static void validateParameters(workflow, params, log, schema_filename='nextflow_schema.json') { + def has_error = false + //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~// + // Check for nextflow core params and unexpected params + def json = new File(getSchemaPath(workflow, schema_filename=schema_filename)).text + def Map schemaParams = (Map) new JsonSlurper().parseText(json).get('definitions') + def nf_params = [ + // Options for base `nextflow` command + 'bg', + 'c', + 'C', + 'config', + 'd', + 'D', + 'dockerize', + 'h', + 'log', + 'q', + 'quiet', + 'syslog', + 'v', + 'version', + + // Options for `nextflow run` command + 'ansi', + 'ansi-log', + 'bg', + 'bucket-dir', + 'c', + 'cache', + 'config', + 'dsl2', + 'dump-channels', + 'dump-hashes', + 'E', + 'entry', + 'latest', + 'lib', + 'main-script', + 'N', + 'name', + 'offline', + 'params-file', + 'pi', + 'plugins', + 'poll-interval', + 'pool-size', + 'profile', + 'ps', + 'qs', + 'queue-size', + 'r', + 'resume', + 'revision', + 'stdin', + 'stub', + 'stub-run', + 'test', + 'w', + 'with-charliecloud', + 'with-conda', + 'with-dag', + 'with-docker', + 'with-mpi', + 'with-notification', + 'with-podman', + 'with-report', + 'with-singularity', + 'with-timeline', + 'with-tower', + 'with-trace', + 'with-weblog', + 'without-docker', + 'without-podman', + 'work-dir' + ] + def unexpectedParams = [] + + // Collect expected parameters from the schema + def expectedParams = [] + def enums = [:] + for (group in schemaParams) { + for (p in group.value['properties']) { + 
expectedParams.push(p.key) + if (group.value['properties'][p.key].containsKey('enum')) { + enums[p.key] = group.value['properties'][p.key]['enum'] + } + } + } + + for (specifiedParam in params.keySet()) { + // nextflow params + if (nf_params.contains(specifiedParam)) { + log.error "ERROR: You used a core Nextflow option with two hyphens: '--${specifiedParam}'. Please resubmit with '-${specifiedParam}'" + has_error = true + } + // unexpected params + def params_ignore = params.schema_ignore_params.split(',') + 'schema_ignore_params' + def expectedParamsLowerCase = expectedParams.collect{ it.replace("-", "").toLowerCase() } + def specifiedParamLowerCase = specifiedParam.replace("-", "").toLowerCase() + def isCamelCaseBug = (specifiedParam.contains("-") && !expectedParams.contains(specifiedParam) && expectedParamsLowerCase.contains(specifiedParamLowerCase)) + if (!expectedParams.contains(specifiedParam) && !params_ignore.contains(specifiedParam) && !isCamelCaseBug) { + // Temporarily remove camelCase/camel-case params #1035 + def unexpectedParamsLowerCase = unexpectedParams.collect{ it.replace("-", "").toLowerCase()} + if (!unexpectedParamsLowerCase.contains(specifiedParamLowerCase)){ + unexpectedParams.push(specifiedParam) + } + } + } + + //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~// + // Validate parameters against the schema + InputStream input_stream = new File(getSchemaPath(workflow, schema_filename=schema_filename)).newInputStream() + JSONObject raw_schema = new JSONObject(new JSONTokener(input_stream)) + + // Remove anything that's in params.schema_ignore_params + raw_schema = removeIgnoredParams(raw_schema, params) + + Schema schema = SchemaLoader.load(raw_schema) + + // Clean the parameters + def cleanedParams = cleanParameters(params) + + // Convert to JSONObject + def jsonParams = new JsonBuilder(cleanedParams) + JSONObject params_json = new JSONObject(jsonParams.toString()) + + // Validate + try { + schema.validate(params_json) + } catch (ValidationException e) { + println '' + log.error 'ERROR: Validation of pipeline parameters failed!' 
+ JSONObject exceptionJSON = e.toJSON() + printExceptions(exceptionJSON, params_json, log, enums) + println '' + has_error = true + } + + // Check for unexpected parameters + if (unexpectedParams.size() > 0) { + Map colors = NfcoreTemplate.logColours(params.monochrome_logs) + println '' + def warn_msg = 'Found unexpected parameters:' + for (unexpectedParam in unexpectedParams) { + warn_msg = warn_msg + "\n* --${unexpectedParam}: ${params[unexpectedParam].toString()}" + } + log.warn warn_msg + log.info "- ${colors.dim}Ignore this warning: params.schema_ignore_params = \"${unexpectedParams.join(',')}\" ${colors.reset}" + println '' + } + + if (has_error) { + System.exit(1) + } + } + + // + // Beautify parameters for --help + // + public static String paramsHelp(workflow, params, command, schema_filename='nextflow_schema.json') { + Map colors = NfcoreTemplate.logColours(params.monochrome_logs) + Integer num_hidden = 0 + String output = '' + output += 'Typical pipeline command:\n\n' + output += " ${colors.cyan}${command}${colors.reset}\n\n" + Map params_map = paramsLoad(getSchemaPath(workflow, schema_filename=schema_filename)) + Integer max_chars = paramsMaxChars(params_map) + 1 + Integer desc_indent = max_chars + 14 + Integer dec_linewidth = 160 - desc_indent + for (group in params_map.keySet()) { + Integer num_params = 0 + String group_output = colors.underlined + colors.bold + group + colors.reset + '\n' + def group_params = params_map.get(group) // This gets the parameters of that particular group + for (param in group_params.keySet()) { + if (group_params.get(param).hidden && !params.show_hidden_params) { + num_hidden += 1 + continue; + } + def type = '[' + group_params.get(param).type + ']' + def description = group_params.get(param).description + def defaultValue = group_params.get(param).default != null ? " [default: " + group_params.get(param).default.toString() + "]" : '' + def description_default = description + colors.dim + defaultValue + colors.reset + // Wrap long description texts + // Loosely based on https://dzone.com/articles/groovy-plain-text-word-wrap + if (description_default.length() > dec_linewidth){ + List olines = [] + String oline = "" // " " * indent + description_default.split(" ").each() { wrd -> + if ((oline.size() + wrd.size()) <= dec_linewidth) { + oline += wrd + " " + } else { + olines += oline + oline = wrd + " " + } + } + olines += oline + description_default = olines.join("\n" + " " * desc_indent) + } + group_output += " --" + param.padRight(max_chars) + colors.dim + type.padRight(10) + colors.reset + description_default + '\n' + num_params += 1 + } + group_output += '\n' + if (num_params > 0){ + output += group_output + } + } + if (num_hidden > 0){ + output += colors.dim + "!! 
Hiding $num_hidden params, use --show_hidden_params to show them !!\n" + colors.reset + } + output += NfcoreTemplate.dashedLine(params.monochrome_logs) + return output + } + + // + // Groovy Map summarising parameters/workflow options used by the pipeline + // + public static LinkedHashMap paramsSummaryMap(workflow, params, schema_filename='nextflow_schema.json') { + // Get a selection of core Nextflow workflow options + def Map workflow_summary = [:] + if (workflow.revision) { + workflow_summary['revision'] = workflow.revision + } + workflow_summary['runName'] = workflow.runName + if (workflow.containerEngine) { + workflow_summary['containerEngine'] = workflow.containerEngine + } + if (workflow.container) { + workflow_summary['container'] = workflow.container + } + workflow_summary['launchDir'] = workflow.launchDir + workflow_summary['workDir'] = workflow.workDir + workflow_summary['projectDir'] = workflow.projectDir + workflow_summary['userName'] = workflow.userName + workflow_summary['profile'] = workflow.profile + workflow_summary['configFiles'] = workflow.configFiles.join(', ') + + // Get pipeline parameters defined in JSON Schema + def Map params_summary = [:] + def params_map = paramsLoad(getSchemaPath(workflow, schema_filename=schema_filename)) + for (group in params_map.keySet()) { + def sub_params = new LinkedHashMap() + def group_params = params_map.get(group) // This gets the parameters of that particular group + for (param in group_params.keySet()) { + if (params.containsKey(param)) { + def params_value = params.get(param) + def schema_value = group_params.get(param).default + def param_type = group_params.get(param).type + if (schema_value != null) { + if (param_type == 'string') { + if (schema_value.contains('$projectDir') || schema_value.contains('${projectDir}')) { + def sub_string = schema_value.replace('\$projectDir', '') + sub_string = sub_string.replace('\${projectDir}', '') + if (params_value.contains(sub_string)) { + schema_value = params_value + } + } + if (schema_value.contains('$params.outdir') || schema_value.contains('${params.outdir}')) { + def sub_string = schema_value.replace('\$params.outdir', '') + sub_string = sub_string.replace('\${params.outdir}', '') + if ("${params.outdir}${sub_string}" == params_value) { + schema_value = params_value + } + } + } + } + + // We have a default in the schema, and this isn't it + if (schema_value != null && params_value != schema_value) { + sub_params.put(param, params_value) + } + // No default in the schema, and this isn't empty + else if (schema_value == null && params_value != "" && params_value != null && params_value != false) { + sub_params.put(param, params_value) + } + } + } + params_summary.put(group, sub_params) + } + return [ 'Core Nextflow options' : workflow_summary ] << params_summary + } + + // + // Beautify parameters for summary and return as string + // + public static String paramsSummaryLog(workflow, params) { + Map colors = NfcoreTemplate.logColours(params.monochrome_logs) + String output = '' + def params_map = paramsSummaryMap(workflow, params) + def max_chars = paramsMaxChars(params_map) + for (group in params_map.keySet()) { + def group_params = params_map.get(group) // This gets the parameters of that particular group + if (group_params) { + output += colors.bold + group + colors.reset + '\n' + for (param in group_params.keySet()) { + output += " " + colors.blue + param.padRight(max_chars) + ": " + colors.green + group_params.get(param) + colors.reset + '\n' + } + output += '\n' + } + } + output 
+= "!! Only displaying parameters that differ from the pipeline defaults !!\n" + output += NfcoreTemplate.dashedLine(params.monochrome_logs) + return output + } + + // + // Loop over nested exceptions and print the causingException + // + private static void printExceptions(ex_json, params_json, log, enums, limit=5) { + def causingExceptions = ex_json['causingExceptions'] + if (causingExceptions.length() == 0) { + def m = ex_json['message'] =~ /required key \[([^\]]+)\] not found/ + // Missing required param + if (m.matches()) { + log.error "* Missing required parameter: --${m[0][1]}" + } + // Other base-level error + else if (ex_json['pointerToViolation'] == '#') { + log.error "* ${ex_json['message']}" + } + // Error with specific param + else { + def param = ex_json['pointerToViolation'] - ~/^#\// + def param_val = params_json[param].toString() + if (enums.containsKey(param)) { + def error_msg = "* --${param}: '${param_val}' is not a valid choice (Available choices" + if (enums[param].size() > limit) { + log.error "${error_msg} (${limit} of ${enums[param].size()}): ${enums[param][0..limit-1].join(', ')}, ... )" + } else { + log.error "${error_msg}: ${enums[param].join(', ')})" + } + } else { + log.error "* --${param}: ${ex_json['message']} (${param_val})" + } + } + } + for (ex in causingExceptions) { + printExceptions(ex, params_json, log, enums) + } + } + + // + // Remove an element from a JSONArray + // + private static JSONArray removeElement(json_array, element) { + def list = [] + int len = json_array.length() + for (int i=0;i + if(raw_schema.keySet().contains('definitions')){ + raw_schema.definitions.each { definition -> + for (key in definition.keySet()){ + if (definition[key].get("properties").keySet().contains(ignore_param)){ + // Remove the param to ignore + definition[key].get("properties").remove(ignore_param) + // If the param was required, change this + if (definition[key].has("required")) { + def cleaned_required = removeElement(definition[key].required, ignore_param) + definition[key].put("required", cleaned_required) + } + } + } + } + } + if(raw_schema.keySet().contains('properties') && raw_schema.get('properties').keySet().contains(ignore_param)) { + raw_schema.get("properties").remove(ignore_param) + } + if(raw_schema.keySet().contains('required') && raw_schema.required.contains(ignore_param)) { + def cleaned_required = removeElement(raw_schema.required, ignore_param) + raw_schema.put("required", cleaned_required) + } + } + return raw_schema + } + + // + // Clean and check parameters relative to Nextflow native classes + // + private static Map cleanParameters(params) { + def new_params = params.getClass().newInstance(params) + for (p in params) { + // remove anything evaluating to false + if (!p['value']) { + new_params.remove(p.key) + } + // Cast MemoryUnit to String + if (p['value'].getClass() == nextflow.util.MemoryUnit) { + new_params.replace(p.key, p['value'].toString()) + } + // Cast Duration to String + if (p['value'].getClass() == nextflow.util.Duration) { + new_params.replace(p.key, p['value'].toString().replaceFirst(/d(?!\S)/, "day")) + } + // Cast LinkedHashMap to String + if (p['value'].getClass() == LinkedHashMap) { + new_params.replace(p.key, p['value'].toString()) + } + } + return new_params + } + + // + // This function tries to read a JSON params file + // + private static LinkedHashMap paramsLoad(String json_schema) { + def params_map = new LinkedHashMap() + try { + params_map = paramsRead(json_schema) + } catch (Exception e) { + println "Could not 
read parameters settings from JSON. $e" + params_map = new LinkedHashMap() + } + return params_map + } + + // + // Method to actually read in JSON file using Groovy. + // Group (as Key), values are all parameters + // - Parameter1 as Key, Description as Value + // - Parameter2 as Key, Description as Value + // .... + // Group + // - + private static LinkedHashMap paramsRead(String json_schema) throws Exception { + def json = new File(json_schema).text + def Map schema_definitions = (Map) new JsonSlurper().parseText(json).get('definitions') + def Map schema_properties = (Map) new JsonSlurper().parseText(json).get('properties') + /* Tree looks like this in nf-core schema + * definitions <- this is what the first get('definitions') gets us + group 1 + title + description + properties + parameter 1 + type + description + parameter 2 + type + description + group 2 + title + description + properties + parameter 1 + type + description + * properties <- parameters can also be ungrouped, outside of definitions + parameter 1 + type + description + */ + + // Grouped params + def params_map = new LinkedHashMap() + schema_definitions.each { key, val -> + def Map group = schema_definitions."$key".properties // Gets the property object of the group + def title = schema_definitions."$key".title + def sub_params = new LinkedHashMap() + group.each { innerkey, value -> + sub_params.put(innerkey, value) + } + params_map.put(title, sub_params) + } + + // Ungrouped params + def ungrouped_params = new LinkedHashMap() + schema_properties.each { innerkey, value -> + ungrouped_params.put(innerkey, value) + } + params_map.put("Other parameters", ungrouped_params) + + return params_map + } + + // + // Get maximum number of characters across all parameter names + // + private static Integer paramsMaxChars(params_map) { + Integer max_chars = 0 + for (group in params_map.keySet()) { + def group_params = params_map.get(group) // This gets the parameters of that particular group + for (param in group_params.keySet()) { + if (param.size() > max_chars) { + max_chars = param.size() + } + } + } + return max_chars + } +} diff --git a/lib/NfcoreTemplate.groovy b/lib/NfcoreTemplate.groovy new file mode 100644 index 00000000..2fc0a9b9 --- /dev/null +++ b/lib/NfcoreTemplate.groovy @@ -0,0 +1,258 @@ +// +// This file holds several functions used within the nf-core pipeline template. +// + +import org.yaml.snakeyaml.Yaml + +class NfcoreTemplate { + + // + // Check AWS Batch related parameters have been specified correctly + // + public static void awsBatch(workflow, params) { + if (workflow.profile.contains('awsbatch')) { + // Check params.awsqueue and params.awsregion have been set if running on AWSBatch + assert (params.awsqueue && params.awsregion) : "Specify correct --awsqueue and --awsregion parameters on AWSBatch!" + // Check outdir paths to be S3 buckets if running on AWSBatch + assert params.outdir.startsWith('s3:') : "Outdir not on S3 - specify S3 Bucket to run on AWSBatch!" + } + } + + // + // Warn if a -profile or Nextflow config has not been provided to run the pipeline + // + public static void checkConfigProvided(workflow, log) { + if (workflow.profile == 'standard' && workflow.configFiles.size() <= 1) { + log.warn "[$workflow.manifest.name] You are attempting to run the pipeline without any custom configuration!\n\n" + + "This will be dependent on your local compute environment but can be achieved via one or more of the following:\n" + + " (1) Using an existing pipeline profile e.g. 
`-profile docker` or `-profile singularity`\n" + + " (2) Using an existing nf-core/configs for your Institution e.g. `-profile crick` or `-profile uppmax`\n" + + " (3) Using your own local custom config e.g. `-c /path/to/your/custom.config`\n\n" + + "Please refer to the quick start section and usage docs for the pipeline.\n " + } + } + + // + // Construct and send completion email + // + public static void email(workflow, params, summary_params, projectDir, log, multiqc_report=[]) { + + // Set up the e-mail variables + def subject = "[$workflow.manifest.name] Successful: $workflow.runName" + if (!workflow.success) { + subject = "[$workflow.manifest.name] FAILED: $workflow.runName" + } + + def summary = [:] + for (group in summary_params.keySet()) { + summary << summary_params[group] + } + + def misc_fields = [:] + misc_fields['Date Started'] = workflow.start + misc_fields['Date Completed'] = workflow.complete + misc_fields['Pipeline script file path'] = workflow.scriptFile + misc_fields['Pipeline script hash ID'] = workflow.scriptId + if (workflow.repository) misc_fields['Pipeline repository Git URL'] = workflow.repository + if (workflow.commitId) misc_fields['Pipeline repository Git Commit'] = workflow.commitId + if (workflow.revision) misc_fields['Pipeline Git branch/tag'] = workflow.revision + misc_fields['Nextflow Version'] = workflow.nextflow.version + misc_fields['Nextflow Build'] = workflow.nextflow.build + misc_fields['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp + + def email_fields = [:] + email_fields['version'] = workflow.manifest.version + email_fields['runName'] = workflow.runName + email_fields['success'] = workflow.success + email_fields['dateComplete'] = workflow.complete + email_fields['duration'] = workflow.duration + email_fields['exitStatus'] = workflow.exitStatus + email_fields['errorMessage'] = (workflow.errorMessage ?: 'None') + email_fields['errorReport'] = (workflow.errorReport ?: 'None') + email_fields['commandLine'] = workflow.commandLine + email_fields['projectDir'] = workflow.projectDir + email_fields['summary'] = summary << misc_fields + + // On success try attach the multiqc report + def mqc_report = null + try { + if (workflow.success) { + mqc_report = multiqc_report.getVal() + if (mqc_report.getClass() == ArrayList && mqc_report.size() >= 1) { + if (mqc_report.size() > 1) { + log.warn "[$workflow.manifest.name] Found multiple reports from process 'MULTIQC', will use only one" + } + mqc_report = mqc_report[0] + } + } + } catch (all) { + if (multiqc_report) { + log.warn "[$workflow.manifest.name] Could not attach MultiQC report to summary email" + } + } + + // Check if we are only sending emails on failure + def email_address = params.email + if (!params.email && params.email_on_fail && !workflow.success) { + email_address = params.email_on_fail + } + + // Render the TXT template + def engine = new groovy.text.GStringTemplateEngine() + def tf = new File("$projectDir/assets/email_template.txt") + def txt_template = engine.createTemplate(tf).make(email_fields) + def email_txt = txt_template.toString() + + // Render the HTML template + def hf = new File("$projectDir/assets/email_template.html") + def html_template = engine.createTemplate(hf).make(email_fields) + def email_html = html_template.toString() + + // Render the sendmail template + def max_multiqc_email_size = params.max_multiqc_email_size as nextflow.util.MemoryUnit + def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: 
"$projectDir", mqcFile: mqc_report, mqcMaxSize: max_multiqc_email_size.toBytes() ] + def sf = new File("$projectDir/assets/sendmail_template.txt") + def sendmail_template = engine.createTemplate(sf).make(smail_fields) + def sendmail_html = sendmail_template.toString() + + // Send the HTML e-mail + Map colors = logColours(params.monochrome_logs) + if (email_address) { + try { + if (params.plaintext_email) { throw GroovyException('Send plaintext e-mail, not HTML') } + // Try to send HTML e-mail using sendmail + [ 'sendmail', '-t' ].execute() << sendmail_html + log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Sent summary e-mail to $email_address (sendmail)-" + } catch (all) { + // Catch failures and try with plaintext + def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ] + if ( mqc_report.size() <= max_multiqc_email_size.toBytes() ) { + mail_cmd += [ '-A', mqc_report ] + } + mail_cmd.execute() << email_html + log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Sent summary e-mail to $email_address (mail)-" + } + } + + // Write summary e-mail HTML to a file + def output_d = new File("${params.outdir}/pipeline_info/") + if (!output_d.exists()) { + output_d.mkdirs() + } + def output_hf = new File(output_d, "pipeline_report.html") + output_hf.withWriter { w -> w << email_html } + def output_tf = new File(output_d, "pipeline_report.txt") + output_tf.withWriter { w -> w << email_txt } + } + + // + // Print pipeline summary on completion + // + public static void summary(workflow, params, log) { + Map colors = logColours(params.monochrome_logs) + if (workflow.success) { + if (workflow.stats.ignoredCount == 0) { + log.info "-${colors.purple}[$workflow.manifest.name]${colors.green} Pipeline completed successfully${colors.reset}-" + } else { + log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} Pipeline completed successfully, but with errored process(es) ${colors.reset}-" + } + } else { + log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} Pipeline completed with errors${colors.reset}-" + } + } + + // + // ANSII Colours used for terminal logging + // + public static Map logColours(Boolean monochrome_logs) { + Map colorcodes = [:] + + // Reset / Meta + colorcodes['reset'] = monochrome_logs ? '' : "\033[0m" + colorcodes['bold'] = monochrome_logs ? '' : "\033[1m" + colorcodes['dim'] = monochrome_logs ? '' : "\033[2m" + colorcodes['underlined'] = monochrome_logs ? '' : "\033[4m" + colorcodes['blink'] = monochrome_logs ? '' : "\033[5m" + colorcodes['reverse'] = monochrome_logs ? '' : "\033[7m" + colorcodes['hidden'] = monochrome_logs ? '' : "\033[8m" + + // Regular Colors + colorcodes['black'] = monochrome_logs ? '' : "\033[0;30m" + colorcodes['red'] = monochrome_logs ? '' : "\033[0;31m" + colorcodes['green'] = monochrome_logs ? '' : "\033[0;32m" + colorcodes['yellow'] = monochrome_logs ? '' : "\033[0;33m" + colorcodes['blue'] = monochrome_logs ? '' : "\033[0;34m" + colorcodes['purple'] = monochrome_logs ? '' : "\033[0;35m" + colorcodes['cyan'] = monochrome_logs ? '' : "\033[0;36m" + colorcodes['white'] = monochrome_logs ? '' : "\033[0;37m" + + // Bold + colorcodes['bblack'] = monochrome_logs ? '' : "\033[1;30m" + colorcodes['bred'] = monochrome_logs ? '' : "\033[1;31m" + colorcodes['bgreen'] = monochrome_logs ? '' : "\033[1;32m" + colorcodes['byellow'] = monochrome_logs ? '' : "\033[1;33m" + colorcodes['bblue'] = monochrome_logs ? '' : "\033[1;34m" + colorcodes['bpurple'] = monochrome_logs ? 
'' : "\033[1;35m" + colorcodes['bcyan'] = monochrome_logs ? '' : "\033[1;36m" + colorcodes['bwhite'] = monochrome_logs ? '' : "\033[1;37m" + + // Underline + colorcodes['ublack'] = monochrome_logs ? '' : "\033[4;30m" + colorcodes['ured'] = monochrome_logs ? '' : "\033[4;31m" + colorcodes['ugreen'] = monochrome_logs ? '' : "\033[4;32m" + colorcodes['uyellow'] = monochrome_logs ? '' : "\033[4;33m" + colorcodes['ublue'] = monochrome_logs ? '' : "\033[4;34m" + colorcodes['upurple'] = monochrome_logs ? '' : "\033[4;35m" + colorcodes['ucyan'] = monochrome_logs ? '' : "\033[4;36m" + colorcodes['uwhite'] = monochrome_logs ? '' : "\033[4;37m" + + // High Intensity + colorcodes['iblack'] = monochrome_logs ? '' : "\033[0;90m" + colorcodes['ired'] = monochrome_logs ? '' : "\033[0;91m" + colorcodes['igreen'] = monochrome_logs ? '' : "\033[0;92m" + colorcodes['iyellow'] = monochrome_logs ? '' : "\033[0;93m" + colorcodes['iblue'] = monochrome_logs ? '' : "\033[0;94m" + colorcodes['ipurple'] = monochrome_logs ? '' : "\033[0;95m" + colorcodes['icyan'] = monochrome_logs ? '' : "\033[0;96m" + colorcodes['iwhite'] = monochrome_logs ? '' : "\033[0;97m" + + // Bold High Intensity + colorcodes['biblack'] = monochrome_logs ? '' : "\033[1;90m" + colorcodes['bired'] = monochrome_logs ? '' : "\033[1;91m" + colorcodes['bigreen'] = monochrome_logs ? '' : "\033[1;92m" + colorcodes['biyellow'] = monochrome_logs ? '' : "\033[1;93m" + colorcodes['biblue'] = monochrome_logs ? '' : "\033[1;94m" + colorcodes['bipurple'] = monochrome_logs ? '' : "\033[1;95m" + colorcodes['bicyan'] = monochrome_logs ? '' : "\033[1;96m" + colorcodes['biwhite'] = monochrome_logs ? '' : "\033[1;97m" + + return colorcodes + } + + // + // Does what is says on the tin + // + public static String dashedLine(monochrome_logs) { + Map colors = logColours(monochrome_logs) + return "-${colors.dim}----------------------------------------------------${colors.reset}-" + } + + // + // nf-core logo + // + public static String logo(workflow, monochrome_logs) { + Map colors = logColours(monochrome_logs) + String.format( + """\n + ${dashedLine(monochrome_logs)} + ${colors.green},--.${colors.black}/${colors.green},-.${colors.reset} + ${colors.blue} ___ __ __ __ ___ ${colors.green}/,-._.--~\'${colors.reset} + ${colors.blue} |\\ | |__ __ / ` / \\ |__) |__ ${colors.yellow}} {${colors.reset} + ${colors.blue} | \\| | \\__, \\__/ | \\ |___ ${colors.green}\\`-._,-`-,${colors.reset} + ${colors.green}`._,._,\'${colors.reset} + ${colors.purple} ${workflow.manifest.name} v${workflow.manifest.version}${colors.reset} + ${dashedLine(monochrome_logs)} + """.stripIndent() + ) + } +} diff --git a/lib/Utils.groovy b/lib/Utils.groovy new file mode 100644 index 00000000..28567bd7 --- /dev/null +++ b/lib/Utils.groovy @@ -0,0 +1,40 @@ +// +// This file holds several Groovy functions that could be useful for any Nextflow pipeline +// + +import org.yaml.snakeyaml.Yaml + +class Utils { + + // + // When running with -profile conda, warn if channels have not been set-up appropriately + // + public static void checkCondaChannels(log) { + Yaml parser = new Yaml() + def channels = [] + try { + def config = parser.load("conda config --show channels".execute().text) + channels = config.channels + } catch(NullPointerException | IOException e) { + log.warn "Could not verify conda channel configuration." 
+ return + } + + // Check that all channels are present + def required_channels = ['conda-forge', 'bioconda', 'defaults'] + def conda_check_failed = !required_channels.every { ch -> ch in channels } + + // Check that they are in the right order + conda_check_failed |= !(channels.indexOf('conda-forge') < channels.indexOf('bioconda')) + conda_check_failed |= !(channels.indexOf('bioconda') < channels.indexOf('defaults')) + + if (conda_check_failed) { + log.warn "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" + + " There is a problem with your Conda configuration!\n\n" + + " You will need to set-up the conda-forge and bioconda channels correctly.\n" + + " Please refer to https://bioconda.github.io/user/install.html#set-up-channels\n" + + " NB: The order of the channels matters!\n" + + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + } + } +} diff --git a/lib/WorkflowMain.groovy b/lib/WorkflowMain.groovy new file mode 100644 index 00000000..63ca42d3 --- /dev/null +++ b/lib/WorkflowMain.groovy @@ -0,0 +1,93 @@ +// +// This file holds several functions specific to the main.nf workflow in the nf-core/pgdb pipeline +// + +class WorkflowMain { + + // + // Citation string for pipeline + // + public static String citation(workflow) { + return "If you use ${workflow.manifest.name} for your analysis please cite:\n\n" + + //"* The pipeline\n" + + //" https://doi.org/10.5281/zenodo.XXXXXXX\n\n" + + "* The nf-core framework\n" + + " https://doi.org/10.1038/s41587-020-0439-x\n\n" + + "* Software dependencies\n" + + " https://github.com/${workflow.manifest.name}/blob/master/CITATIONS.md" + } + + // + // Print help to screen if required + // + public static String help(workflow, params, log) { + def command = "nextflow run ${workflow.manifest.name} --input samplesheet.csv --genome GRCh37 -profile docker" + def help_string = '' + help_string += NfcoreTemplate.logo(workflow, params.monochrome_logs) + help_string += NfcoreSchema.paramsHelp(workflow, params, command) + help_string += '\n' + citation(workflow) + '\n' + help_string += NfcoreTemplate.dashedLine(params.monochrome_logs) + return help_string + } + + // + // Print parameter summary log to screen + // + public static String paramsSummaryLog(workflow, params, log) { + def summary_log = '' + summary_log += NfcoreTemplate.logo(workflow, params.monochrome_logs) + summary_log += NfcoreSchema.paramsSummaryLog(workflow, params) + summary_log += '\n' + citation(workflow) + '\n' + summary_log += NfcoreTemplate.dashedLine(params.monochrome_logs) + return summary_log + } + + // + // Validate parameters and print summary to screen + // + public static void initialise(workflow, params, log) { + // Print help to screen if required + if (params.help) { + log.info help(workflow, params, log) + System.exit(0) + } + + // Validate workflow parameters via the JSON schema + if (params.validate_params) { + NfcoreSchema.validateParameters(workflow, params, log) + } + + // Print parameter summary log to screen + log.info paramsSummaryLog(workflow, params, log) + + // Check that a -profile or Nextflow config has been provided to run the pipeline + NfcoreTemplate.checkConfigProvided(workflow, log) + + // Check that conda channels are set-up correctly + if (params.enable_conda) { + Utils.checkCondaChannels(log) + } + + // Check AWS batch settings + NfcoreTemplate.awsBatch(workflow, params) + + // Check input has been provided + if (!params.input) { + log.error "Please provide an input samplesheet to the 
pipeline e.g. '--input samplesheet.csv'" + System.exit(1) + } + } + + // + // Get attribute from genome config file e.g. fasta + // + public static String getGenomeAttribute(params, attribute) { + def val = '' + if (params.genomes && params.genome && params.genomes.containsKey(params.genome)) { + if (params.genomes[ params.genome ].containsKey(attribute)) { + val = params.genomes[ params.genome ][ attribute ] + } + } + return val + } +} diff --git a/lib/WorkflowPgdb.groovy b/lib/WorkflowPgdb.groovy new file mode 100644 index 00000000..84fb1db9 --- /dev/null +++ b/lib/WorkflowPgdb.groovy @@ -0,0 +1,59 @@ +// +// This file holds several functions specific to the workflow/pgdb.nf in the nf-core/pgdb pipeline +// + +class WorkflowPgdb { + + // + // Check and validate parameters + // + public static void initialise(params, log) { +// genomeExistsError(params, log) +// +// if (!params.fasta) { +// log.error "Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file." +// System.exit(1) +// } + } + + // + // Get workflow summary for MultiQC + // + public static String paramsSummaryMultiqc(workflow, summary) { + String summary_section = '' + for (group in summary.keySet()) { + def group_params = summary.get(group) // This gets the parameters of that particular group + if (group_params) { + summary_section += "

    <p style=\"font-size:110%\"><b>$group</b></p>
\n" + summary_section += "    <dl class=\"dl-horizontal\">
\n" + for (param in group_params.keySet()) { + summary_section += "        <dt>$param</dt><dd><samp>${group_params.get(param) ?: 'N/A'}</samp></dd>
\n" + } + summary_section += "    </dl>
\n" + } + } + + String yaml_file_text = "id: '${workflow.manifest.name.replace('/','-')}-summary'\n" + yaml_file_text += "description: ' - this information is collected when the pipeline is started.'\n" + yaml_file_text += "section_name: '${workflow.manifest.name} Workflow Summary'\n" + yaml_file_text += "section_href: 'https://github.com/${workflow.manifest.name}'\n" + yaml_file_text += "plot_type: 'html'\n" + yaml_file_text += "data: |\n" + yaml_file_text += "${summary_section}" + return yaml_file_text + } + + // + // Exit pipeline if incorrect --genome key provided + // +// private static void genomeExistsError(params, log) { +// if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) { +// log.error "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" + +// " Genome '${params.genome}' not found in any config files provided to the pipeline.\n" + +// " Currently, the available genome keys are:\n" + +// " ${params.genomes.keySet().join(", ")}\n" + +// "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" +// System.exit(1) +// } +// } +} diff --git a/lib/nfcore_external_java_deps.jar b/lib/nfcore_external_java_deps.jar new file mode 100644 index 00000000..805c8bb5 Binary files /dev/null and b/lib/nfcore_external_java_deps.jar differ diff --git a/main.nf b/main.nf index 36ec2660..d675f843 100644 --- a/main.nf +++ b/main.nf @@ -1,976 +1,55 @@ #!/usr/bin/env nextflow - /* -======================================================================================== - nf-core/pgdb -======================================================================================== - nf-core/pgdb Analysensembl_downloader_configis Pipeline. - #### Homepage / Documentation - https://github.com/nf-core/pgdb +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + nf-core/pgdb +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Github : https://github.com/nf-core/pgdb + Website: https://nf-co.re/pgdb + Slack : https://nfcore.slack.com/channels/pgdb ---------------------------------------------------------------------------------------- */ -def helpMessage() { - log.info nfcoreHeader() - log.info """ - - Usage: - - The typical command for running the pipeline is as follows: - - nextflow run nf-core/pgdb --ensembl_name homo_sapines --ensembl false --gnomad false --cosmic false --cosmic_celllines false --cbioportal false - - Main arguments: - --final_database_protein Output file name for the final database protein fasta file under the outdir/ directory. 
- --help Print this help document - - Process flags: - --ncrna Generate protein database from non-coding RNAs [true | false] (default: false) - --pseudogenes Generate protein database from pseudogenes [true | false] (default: false) - --altorfs Generate alternative ORFs from canonical proteins [true | false] (default: false) - --cbioportal Download cBioPortal studies and genrate protein database [true | false] (default: false) - --cosmic Download COSMIC mutation files and generate protein database [true | false] (default: false) - --cosmic_celllines Download COSMIC cell line files and generate protein database [true | false] (default: false) - --ensembl Download ENSEMBL variants and generate protein database [true | false] (default: false) - --gnomad Download gnomAD files and generate protein database [true | false] (default: false) - --decoy Append the decoy proteins to the database [true | false] (default: false) - --add_reference Add the reference proteome to the file [true | false ] (default: false) - - Configuration files: By default all config files are located in the configs directory. - --ensembl_downloader_config Path to configuration file for ENSEMBL download parameters - --ensembl_config Path to configuration file for parameters in generating - protein databases from ENSMEBL sequences - --cosmic_config Path to configuration file for parameters in generating - protein databases from COSMIC mutations - --cbioportal_config Path to configuration file for parameters in generating - protein databases from cBioPortal mutations - --protein_decoy_config Path to configuration file for parameters used in generating - decoy databases - - Database parameters: - --taxonomy Taxonomy (Taxon ID) for the species to download ENSEMBL data, - default is 9606 for humans. For the list of supported taxonomies see: - https://www.ensembl.org/info/about/species.html - - --ensembl_name Ensembl Name is used to find the specific name in ENSEMBL for the taxonomy for download - The list can be found here: configs/ensembl_species.txt - - --cosmic_tissue_type Specify a tissue type to limit the COSMIC mutations to a particular caner type - (by default all tumor types are used) - - --cosmic_cellline_name Specify a sample name to limit the COSMIC cell line mutations to - a particular cell line (by default all cell lines are used) - - --cbioportal_tissue_type Specify a tissue type to limit the cBioPortal mutations to - a particular caner type (by default all tumor types are used) - --af_field Allele frequency identifier string in VCF Info column, if no AF info is given set it to empty. - For human VCF files from ENSEMBL the default is set to MAF - - Data download parameters: - --cosmic_user_name User name (or email) for COSMIC account - --cosmic_password Password for COSMIC account - In order to be able to download COSMIC data, the user should - provide a user and password. Please first register in COSMIC - database (https://cancer.sanger.ac.uk/cosmic/register). - - --gencode_url URL for downloading GENCODE datafiles: gencode.v19.pc_transcripts.fa.gz and - gencode.v19.annotation.gtf.gz - --gnomad_file_url URL for downloading gnomAD VCF file(s) - - - Output parameters: - --decoy_prefix String to be used as prefix for the generated decoy sequences - --publish_dir_mode [str] Mode for publishing results in the output directory. 
Available: - symlink, rellink, link, copy, copyNoFollow, move (Default: copy) - --outdir Output folder for the results by default is $baseDir/result - --email [email] Set this parameter to your e-mail address to get a summary e-mail with - details of the run sent to you when the workflow exits - --email_on_fail [email] Same as --email, except only send mail if the workflow is not successful - -name [str] Name for the pipeline run. If not specified, Nextflow will automatically generate a random - - AWSBatch options: - --awsqueue [str] The AWSBatch JobQueue that needs to be set when running on AWSBatch - --awsregion [str] The AWS Region for your AWS Batch job to run on - --awscli [str] Path to the AWS CLI tool - """.stripIndent() -} - -// Show help message -if (params.help){ - helpMessage() - exit 0 -} +nextflow.enable.dsl = 2 /* - * SET UP CONFIGURATION VARIABLES - */ - -// Has the run name been specified by the user? -// this has the bonus effect of catching both -name and --name -custom_runName = params.name -if (!(workflow.runName ==~ /[a-z]+_[a-z]+/)) { - custom_runName = workflow.runName -} - -// Check AWS batch settings -if (workflow.profile.contains('awsbatch')) { - // AWSBatch sanity checking - if (!params.awsqueue || !params.awsregion) exit 1, "Specify correct --awsqueue and --awsregion parameters on AWSBatch!" - // Check outdir paths to be S3 buckets if running on AWSBatch - // related: https://github.com/nextflow-io/nextflow/issues/813 - if (!params.outdir.startsWith('s3:')) exit 1, "Outdir not on S3 - specify S3 Bucket to run on AWSBatch!" - // Prevent trace files to be stored on S3 since S3 does not support rolling files. - if (params.tracedir.startsWith('s3:')) exit 1, "Specify a local tracedir or run without trace! S3 cannot be used for tracefiles." -} - -// Stage config files -ch_output_docs = file("$baseDir/docs/output.md", checkIfExists: true) -ch_output_docs_images = file("$baseDir/docs/images/", checkIfExists: true) -ensembl_downloader_config = file(params.ensembl_downloader_config, checkIfExists: true) -ensembl_config = file(params.ensembl_config) -cosmic_config = file(params.cosmic_config) -cbioportal_config = file(params.cbioportal_config) -protein_decoy_config = file(params.protein_decoy_config) - -if (params.split_by_filter_column){ - split_by_filter_column = "--split_by_filter_column" -} - -af_field = params.af_field -if (params.ensembl_name == "homo_sapiens"){ - af_field = "MAF" -} - -// Pipeline checks -if ((params.cosmic || params.cosmic_celllines) && (params.cosmic_user_name=="" || params.cosmic_password=="")){ - exit 1, "User name and password has to be provided. In order to be able to download COSMIC data, the user should provide a user and password. Please first register in COSMIC database (https://cancer.sanger.ac.uk/cosmic/register)." -} - -// Pipeline OS-specific commands -ZCAT = (System.properties['os.name'] == 'Mac OS X' ? 'gzcat' : 'zcat') - - -/** - * Download data from ensembl for the particular species. 
- */ -process ensembl_fasta_download{ - - when: - params.add_reference || params.ensembl || params.altorfs || params.ncrna || params.pseudogenes - - input: - file ensembl_downloader_config - - output: - file "database_ensembl/*.gz" into ensembl_fasta_gz_databases - - script: - """ - pypgatk_cli.py ensembl-downloader --config_file ${ensembl_downloader_config} --ensembl_name ${params.ensembl_name} -sv -sc - """ -} - -/** - * Decompress all the data downloaded from ENSEMBL - */ -process gunzip_ensembl_files{ - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - input: - file(fasta_file) from ensembl_fasta_gz_databases - - output: - file '*.pep.all.fa' into ensembl_protein_database_sub - file '*cdna.all.fa' into ensembl_cdna_database, ensembl_cdna_database_sub - file '*ncrna.fa' into ensembl_ncrna_database, ensembl_ncrna_database_sub - file '*.gtf' into gtf - - script: - """ - gunzip -d -f ${fasta_file} - """ -} - -process add_reference_proteome{ - - when: - params.add_reference - - input: - file reference_proteome from ensembl_protein_database_sub - - output: - file 'reference_proteome.fa' into ensembl_protein_database - - script: - """ - cat ${reference_proteome} >> reference_proteome.fa - """ - -} - -/** - * Concatenate cDNA and ncRNA databases - **/ -process merge_cdnas{ - - input: - file a from ensembl_cdna_database_sub.collect() - file b from ensembl_ncrna_database_sub.collect() - - output: - file 'total_cdnas.fa' into total_cdnas - - script: - """ - cat ${a} >> total_cdnas.fa - cat ${b} >> total_cdnas.fa - """ -} - -/** - * Creates the ncRNA protein database - */ -process add_ncrna{ - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.ncrna - - input: - file x from total_cdnas - file ensembl_config - - output: - file 'ncRNAs_proteinDB.fa' into optional_ncrna - - script: - """ - pypgatk_cli.py dnaseq-to-proteindb --config_file "${ensembl_config}" --input_fasta ${x} --output_proteindb ncRNAs_proteinDB.fa --include_biotypes "${params.biotypes['ncRNA']}" --skip_including_all_cds --var_prefix ncRNA_ - """ -} - -merged_databases = ensembl_protein_database.mix(optional_ncrna) - -/** - * Creates the pseudogenes protein database - */ -process add_pseudogenes { - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.pseudogenes - - input: - file x from total_cdnas - file ensembl_config - - output: - file 'pseudogenes_proteinDB.fa' into optional_pseudogenes - - script: - """ - pypgatk_cli.py dnaseq-to-proteindb --config_file "${ensembl_config}" --input_fasta "${x}" --output_proteindb pseudogenes_proteinDB.fa --include_biotypes "${params.biotypes['pseudogene']}" --skip_including_all_cds --var_prefix pseudo_ - """ -} - -merged_databases = merged_databases.mix(optional_pseudogenes) - -/** - * Creates the altORFs protein database - */ -process add_altorfs { - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.altorfs - - input: - file x from ensembl_cdna_database - file ensembl_config - - output: - file('altorfs_proteinDB.fa') into optional_altorfs - - script: - """ - pypgatk_cli.py dnaseq-to-proteindb --config_file "${ensembl_config}" --input_fasta "${x}" --output_proteindb altorfs_proteinDB.fa --include_biotypes "${params.biotypes['protein_coding']}'" --skip_including_all_cds --var_prefix altorf_ - """ -} - -merged_databases = merged_databases.mix(optional_altorfs) - -/* Mutations to proteinDB */ - -/** - * Download COSMIC Mutations - */ -process cosmic_download { - - when: - params.cosmic - - 
input: - file cosmic_config - - output: - file "database_cosmic/*.gz" into cosmic_files - - script: - """ - pypgatk_cli.py cosmic-downloader --config_file "${cosmic_config}" --username ${params.cosmic_user_name} --password ${params.cosmic_password} - """ -} - -/** - * Decompress the data downloaded from COSMIC - */ -process gunzip_cosmic_files{ - - when: - params.cosmic - - input: - file(data_file) from cosmic_files - - output: - file "All_COSMIC_Genes.fasta" into cosmic_genes - file "CosmicMutantExport.tsv" into cosmic_mutations - file "All_CellLines_Genes.fasta" into cosmic_celllines_genes - file "CosmicCLP_MutantExport.tsv" into cosmic_celllines_mutations - - script: - """ - gunzip -d -f ${data_file} - """ -} - -/** - * Generate proteindb from cosmic mutations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + VALIDATE & PRINT PARAMETER SUMMARY +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */ -process cosmic_proteindb{ - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.cosmic - - input: - file g from cosmic_genes - file m from cosmic_mutations - file cosmic_config - - output: - file 'cosmic_proteinDB*.fa' into cosmic_proteindbs - - script: - """ - pypgatk_cli.py cosmic-to-proteindb --config_file "${cosmic_config}" --input_mutation ${m} --input_genes ${g} --filter_column 'Primary site' --accepted_values ${params.cosmic_tissue_type} ${split_by_filter_column} --output_db cosmic_proteinDB.fa - """ -} - -merged_databases = merged_databases.mix(cosmic_proteindbs) - -/** - * Generate proteindb from cosmic cell lines mutations -*/ -process cosmic_celllines_proteindb{ - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.cosmic_celllines - - input: - file g from cosmic_celllines_genes - file m from cosmic_celllines_mutations - file cosmic_config - - output: - file 'cosmic_celllines_proteinDB*.fa' into cosmic_celllines_proteindbs - - script: - """ - pypgatk_cli.py cosmic-to-proteindb --config_file "${cosmic_config}" --input_mutation ${m} --input_genes ${g} --filter_column 'Sample name' --accepted_values ${params.cosmic_cellline_name} ${split_by_filter_column} --output_db cosmic_celllines_proteinDB.fa - """ -} - -merged_databases = merged_databases.mix(cosmic_celllines_proteindbs) - -/** - * Download VCF files from ensembl for the particular species. 
- */ -process ensembl_vcf_download{ - - when: - params.ensembl - - input: - file ensembl_downloader_config - - output: - file "database_ensembl/*.vcf.gz" into ensembl_vcf_gz_files - - script: - """ - pypgatk_cli.py ensembl-downloader --config_file ${ensembl_downloader_config} --ensembl_name ${params.ensembl_name} -sg -sp -sc -sd -sn - """ -} -/** - * Decompress vcf files downloaded from ENSEMBL - */ -process gunzip_vcf_ensembl_files{ - - label 'process_medium' - label 'process_single_thread' - - when: - params.ensembl - - input: - file vcf_file from ensembl_vcf_gz_files.flatten().map{ file(it) } - - output: - file "*.vcf" into ensembl_vcf_files - - script: - """ - gunzip -d -f $vcf_file - """ -} - -process check_ensembl_vcf{ - - label 'process_medium' - label 'process_single_thread' - - when: - params.ensembl - - input: - file vcf_file from ensembl_vcf_files - - output: - file "checked_*.vcf" into ensembl_vcf_files_checked - - script: - """ - awk 'BEGIN{FS=OFS="\t"}{if(\$1~"#" || (\$5!="" && \$4!="")) print}' $vcf_file > checked_$vcf_file - """ -} - -/** - * Generate protein database(s) from ENSEMBL vcf file(s) - */ -process ensembl_vcf_proteinDB { - - label 'process_medium' - label 'process_single_thread' - - when: - params.ensembl - - input: - file v from ensembl_vcf_files_checked - file f from total_cdnas - file g from gtf - file e from ensembl_config - - output: - file "${v}_proteinDB.fa" into proteinDB_vcf - - script: - """ - pypgatk_cli.py vcf-to-proteindb --config_file ${e} --af_field "${af_field}" --include_biotypes "${params.biotypes['protein_coding']}" --input_fasta ${f} --gene_annotations_gtf ${g} --vep_annotated_vcf ${v} --output_proteindb "${v}_proteinDB.fa" --var_prefix ensvar - """ -} - -//concatenate all ensembl proteindbs into one -proteinDB_vcf - .collectFile(name: 'ensembl_proteindb.fa', newLine: false, storeDir: "${baseDir}/result") - .set {proteinDB_vcf_final} - -merged_databases = merged_databases.mix(proteinDB_vcf_final) - -/****** gnomAD variatns *****/ - -/** - * Download gencode files (fasta and gtf) - */ -process gencode_download{ - - when: - params.gnomad - - input: - val g from params.gencode_url - - output: - file("gencode.v19.pc_transcripts.fa") into gencode_fasta - file("gencode.v19.annotation.gtf") into gencode_gtf - - script: - """ - wget ${g}/gencode.v19.pc_transcripts.fa.gz - wget ${g}/gencode.v19.annotation.gtf.gz - gunzip *.gz - """ -} - -/** - * Download gnomAD variants (VCF) - requires gsutil - */ -process gnomad_download{ - - when: - params.gnomad - - input: - val g from params.gnomad_file_url - - output: - file "*.vcf.bgz" into gnomad_vcf_bgz - - script: - """ - gsutil cp ${g} . 
- """ -} - -/** - * Extract gnomAD VCF - */ -process extract_gnomad_vcf{ - - when: - params.gnomad - - input: - file g from gnomad_vcf_bgz.flatten().map{ file(it) } - - output: - file "*.vcf" into gnomad_vcf_files - - script: - """ - zcat ${g} > ${g}.vcf - """ -} - -/** - * Generate gmomAD proteinDB - */ -process gnomad_proteindb{ - - when: - params.gnomad - - input: - file v from gnomad_vcf_files - file f from gencode_fasta - file g from gencode_gtf - file e from ensembl_config - - output: - file "${v}_proteinDB.fa" into gnomad_vcf_proteindb - - script: - """ - pypgatk_cli.py vcf-to-proteindb --config_file ${e} --vep_annotated_vcf ${v} --input_fasta ${f} --gene_annotations_gtf ${g} --output_proteindb "${v}_proteinDB.fa" --af_field controls_AF --transcript_index 6 --biotype_str transcript_type --annotation_field_name vep --var_prefix gnomadvar - """ -} - -//concatenate all gnomad proteindbs into one -gnomad_vcf_proteindb - .collectFile(name: 'gnomad_proteindb.fa', newLine: false, storeDir: "${baseDir}/result") - .set {gnomad_vcf_proteindb_final} - -merged_databases = merged_databases.mix(gnomad_vcf_proteindb_final) - -/****** cBioPortal mutations *****/ -/** - * Download GRCh37 CDS file from ENSEMBL release 75 - */ -process cds_GRCh37_download{ - - when: - params.cbioportal - - output: - file("Homo_sapiens.GRCh37.75.cds.all.fa") into GRCh37_cds - - script: - """ - wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz - gunzip *.gz - """ -} - -/** - * Download all cBioPortal studies using git-lfs -*/ - process download_all_cbioportal { - - when: - params.cbioportal - - output: - file('cbioportal_allstudies_data_mutations_mskcc.txt') into cbio_mutations - file('cbioportal_allstudies_data_clinical_sample.txt') into cbio_samples - - script: - """ - git clone https://github.com/cBioPortal/datahub.git - cd datahub - git lfs install --local --skip-smudge - git lfs pull -I public --include "data*clinical*sample.txt" - git lfs pull -I public --include "data_mutations_mskcc.txt" - cd .. 
- cat datahub/public/*/data_mutations_mskcc.txt > cbioportal_allstudies_data_mutations_mskcc.txt - cat datahub/public/*/*data*clinical*sample.txt | awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; for(i=1;i<=NF;i++){if(\$i=="CANCER_TYPE_DETAILED") j=1; if(\$i=="CANCER_TYPE") s=1;} if(j==1 && s==0){gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE");} print;}' > cbioportal_allstudies_data_clinical_sample.txt - """ - } - -/** - * Generate proteinDB from cBioPortal mutations - */ - process cbioportal_proteindb{ - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.cbioportal - - input: - file g from GRCh37_cds - file m from cbio_mutations - file s from cbio_samples - file cbioportal_config - - output: - file 'cbioPortal_proteinDB*.fa' into cBioportal_proteindb - - script: - """ - pypgatk_cli.py cbioportal-to-proteindb --config_file "${cbioportal_config}" --input_mutation ${m} --input_cds ${g} --clinical_sample_file ${s} --filter_column 'Tumor_Sample_Barcode' --accepted_values ${params.cbioportal_tissue_type} ${split_by_filter_column} --output_db cbioPortal_proteinDB.fa - """ -} - -merged_databases = merged_databases.mix(cBioportal_proteindb) - -/** - * Concatenate all generated databases from merged_databases channel to the final_database_protein file - */ -process merge_proteindbs { - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - input: - file("proteindb*") from merged_databases.collect() - - output: - file "${params.final_database_protein}" into protiendbs - file "${params.final_database_protein}" into proteindb_result - - script: - """ - cat proteindb* > ${params.final_database_protein} - """ -} - -/** - * Create the decoy database using DecoyPYrat - * Decoy sequences will have "_DECOY" prefix tag to the protein accession. 
- */ -process decoy { - - publishDir "${params.outdir}", mode: 'copy', overwrite: true - - when: - params.decoy - - input: - file f from protiendbs - file protein_decoy_config - - output: - file "${params.decoy_prefix}${params.final_database_protein}" into fasta_decoy_db - - script: - """ - pypgatk_cli.py generate-decoy --config_file ${protein_decoy_config} --input $f --decoy_prefix "${params.decoy_prefix}" --output "${params.decoy_prefix}${params.final_database_protein}" - """ -} - -/** Write the final results to S3 bucket**/ - -if (params.decoy) { - fasta_decoy_db.subscribe { results -> results.copyTo("${params.result_file}")} -} else { - proteindb_result.subscribe { results -> results.copyTo("${params.result_file}") } -} - - -//--------------------------------------------------------------- // -//---------------------- Nextflow specifics --------------------- // -//--------------------------------------------------------------- // - - -// Header log info -log.info nfcoreHeader() -def summary = [:] -if (workflow.revision) summary['Pipeline Release'] = workflow.revision -summary['Run Name'] = custom_runName ?: workflow.runName -summary['Max Resources'] = "$params.max_memory memory, $params.max_cpus cpus, $params.max_time time per job" -if (workflow.containerEngine) summary['Container'] = "$workflow.containerEngine - $workflow.container" -summary['Output dir'] = params.outdir -summary['Launch dir'] = workflow.launchDir -summary['Working dir'] = workflow.workDir -summary['Script dir'] = workflow.projectDir -summary['User'] = workflow.userName -if (workflow.profile.contains('awsbatch')) { - summary['AWS Region'] = params.awsregion - summary['AWS Queue'] = params.awsqueue - summary['AWS CLI'] = params.awscli -} -summary['Config Profile'] = workflow.profile -if (params.config_profile_description) summary['Config Profile Description'] = params.config_profile_description -if (params.config_profile_contact) summary['Config Profile Contact'] = params.config_profile_contact -if (params.config_profile_url) summary['Config Profile URL'] = params.config_profile_url -summary['Config Files'] = workflow.configFiles.join(', ') -if (params.email || params.email_on_fail) { - summary['E-mail Address'] = params.email - summary['E-mail on failure'] = params.email_on_fail -} -log.info summary.collect { k,v -> "${k.padRight(18)}: $v" }.join("\n") -log.info "-\033[2m--------------------------------------------------\033[0m-" - -// Check the hostnames against configured profiles -checkHostname() - -Channel.from(summary.collect{ [it.key, it.value] }) - .map { k,v -> "
<dt>$k</dt><dd><samp>${v ?: 'N/A'}</samp></dd>
" } - .reduce { a, b -> return [a, b].join("\n ") } - .map { x -> """ - id: 'nf-core-proteomicslfq-summary' - description: " - this information is collected when the pipeline is started." - section_name: 'nf-core/proteomicslfq Workflow Summary' - section_href: 'https://github.com/nf-core/proteomicslfq' - plot_type: 'html' - data: | -
- $x -
- """.stripIndent() } - .set { ch_workflow_summary } +WorkflowMain.initialise(workflow, params, log) /* - * Parse software version numbers - */ -process get_software_versions { - publishDir "${params.outdir}/pipeline_info", mode: params.publish_dir_mode, - saveAs: { filename -> - if (filename.indexOf(".csv") > 0) filename - else null - } +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + NAMED WORKFLOW FOR PIPELINE +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ - output: - file 'software_versions_mqc.yaml' into ch_software_versions_yaml - file "software_versions.csv" +include { PGDB } from './workflows/pgdb' - script: - """ - echo $workflow.manifest.version > v_pipeline.txt - echo $workflow.nextflow.version > v_nextflow.txt - scrape_software_versions.py &> software_versions_mqc.yaml - """ +// +// WORKFLOW: Run main nf-core/pgdb analysis pipeline +// +workflow NFCORE_PGDB { + PGDB () } /* - * STEP 3 - Output Description HTML - */ -process output_documentation { - - publishDir "${params.outdir}/pipeline_info", mode: params.publish_dir_mode - - input: - file output_docs from ch_output_docs - file images from ch_output_docs_images - - output: - file "results_description.html" +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + RUN ALL WORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ - script: - """ - markdown_to_html.py $output_docs -o results_description.html - """ +// +// WORKFLOW: Execute a single named workflow for the pipeline +// See: https://github.com/nf-core/rnaseq/issues/619 +// +workflow { + NFCORE_PGDB () } /* - * Completion e-mail notification - */ -workflow.onComplete { - - // Set up the e-mail variables - def subject = "[nf-core/pgdb] Successful: $workflow.runName" - if (!workflow.success) { - subject = "[nf-core/pgdb] FAILED: $workflow.runName" - } - def email_fields = [:] - email_fields['version'] = workflow.manifest.version - email_fields['runName'] = custom_runName ?: workflow.runName - email_fields['success'] = workflow.success - email_fields['dateComplete'] = workflow.complete - email_fields['duration'] = workflow.duration - email_fields['exitStatus'] = workflow.exitStatus - email_fields['errorMessage'] = (workflow.errorMessage ?: 'None') - email_fields['errorReport'] = (workflow.errorReport ?: 'None') - email_fields['commandLine'] = workflow.commandLine - email_fields['projectDir'] = workflow.projectDir - email_fields['summary'] = summary - email_fields['summary']['Date Started'] = workflow.start - email_fields['summary']['Date Completed'] = workflow.complete - email_fields['summary']['Pipeline script file path'] = workflow.scriptFile - email_fields['summary']['Pipeline script hash ID'] = workflow.scriptId - if (workflow.repository) email_fields['summary']['Pipeline repository Git URL'] = workflow.repository - if (workflow.commitId) email_fields['summary']['Pipeline repository Git Commit'] = workflow.commitId - if (workflow.revision) email_fields['summary']['Pipeline Git branch/tag'] = workflow.revision - email_fields['summary']['Nextflow Version'] = workflow.nextflow.version - email_fields['summary']['Nextflow Build'] = workflow.nextflow.build - email_fields['summary']['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp - - // TODO nf-core: If not using MultiQC, strip out this code (including params.max_multiqc_email_size) - // On success try attach the multiqc report - //def 
mqc_report = null - //try { - // if (workflow.success) { - // mqc_report = ch_multiqc_report.getVal() - // if (mqc_report.getClass() == ArrayList) { - // log.warn "[nf-core/pgdb] Found multiple reports from process 'multiqc', will use only one" - // mqc_report = mqc_report[0] - // } - // } - //} catch (all) { - // log.warn "[nf-core/pgdb] Could not attach MultiQC report to summary email" - //} - - // Check if we are only sending emails on failure - email_address = params.email - if (!params.email && params.email_on_fail && !workflow.success) { - email_address = params.email_on_fail - } - - // Render the TXT template - def engine = new groovy.text.GStringTemplateEngine() - def tf = new File("$projectDir/assets/email_template.txt") - def txt_template = engine.createTemplate(tf).make(email_fields) - def email_txt = txt_template.toString() - - // Render the HTML template - def hf = new File("$projectDir/assets/email_template.html") - def html_template = engine.createTemplate(hf).make(email_fields) - def email_html = html_template.toString() - - // Render the sendmail template - def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ] - def sf = new File("$projectDir/assets/sendmail_template.txt") - def sendmail_template = engine.createTemplate(sf).make(smail_fields) - def sendmail_html = sendmail_template.toString() - - // Send the HTML e-mail - if (email_address) { - try { - if (params.plaintext_email) { throw GroovyException('Send plaintext e-mail, not HTML') } - // Try to send HTML e-mail using sendmail - [ 'sendmail', '-t' ].execute() << sendmail_html - log.info "[nf-core/pgdb] Sent summary e-mail to $email_address (sendmail)" - } catch (all) { - // Catch failures and try with plaintext - def mail_cmd = [ 'mail', '-s', subject, '--content-type=text/html', email_address ] - if ( mqc_report.size() <= params.max_multiqc_email_size.toBytes() ) { - mail_cmd += [ '-A', mqc_report ] - } - mail_cmd.execute() << email_html - log.info "[nf-core/pgdb] Sent summary e-mail to $email_address (mail)" - } - } - - // Write summary e-mail HTML to a file - def output_d = new File("${params.outdir}/pipeline_info/") - if (!output_d.exists()) { - output_d.mkdirs() - } - def output_hf = new File(output_d, "pipeline_report.html") - output_hf.withWriter { w -> w << email_html } - def output_tf = new File(output_d, "pipeline_report.txt") - output_tf.withWriter { w -> w << email_txt } - - c_green = params.monochrome_logs ? '' : "\033[0;32m"; - c_purple = params.monochrome_logs ? '' : "\033[0;35m"; - c_red = params.monochrome_logs ? '' : "\033[0;31m"; - c_reset = params.monochrome_logs ? '' : "\033[0m"; - - if (workflow.stats.ignoredCount > 0 && workflow.success) { - log.info "-${c_purple}Warning, pipeline completed, but with errored process(es) ${c_reset}-" - log.info "-${c_red}Number of ignored errored process(es) : ${workflow.stats.ignoredCount} ${c_reset}-" - log.info "-${c_green}Number of successfully ran process(es) : ${workflow.stats.succeedCount} ${c_reset}-" - } - - if (workflow.success) { - log.info "-${c_purple}[nf-core/pgdb]${c_green} Pipeline completed successfully${c_reset}-" - } else { - checkHostname() - log.info "-${c_purple}[nf-core/pgdb]${c_red} Pipeline completed with errors${c_reset}-" - } - -} - - -def nfcoreHeader() { - // Log colors ANSI codes - c_black = params.monochrome_logs ? '' : "\033[0;30m"; - c_blue = params.monochrome_logs ? 
'' : "\033[0;34m"; - c_cyan = params.monochrome_logs ? '' : "\033[0;36m"; - c_dim = params.monochrome_logs ? '' : "\033[2m"; - c_green = params.monochrome_logs ? '' : "\033[0;32m"; - c_purple = params.monochrome_logs ? '' : "\033[0;35m"; - c_reset = params.monochrome_logs ? '' : "\033[0m"; - c_white = params.monochrome_logs ? '' : "\033[0;37m"; - c_yellow = params.monochrome_logs ? '' : "\033[0;33m"; - - return """ -${c_dim}--------------------------------------------------${c_reset}- - ${c_green},--.${c_black}/${c_green},-.${c_reset} - ${c_blue} ___ __ __ __ ___ ${c_green}/,-._.--~\'${c_reset} - ${c_blue} |\\ | |__ __ / ` / \\ |__) |__ ${c_yellow}} {${c_reset} - ${c_blue} | \\| | \\__, \\__/ | \\ |___ ${c_green}\\`-._,-`-,${c_reset} - ${c_green}`._,._,\'${c_reset} - ${c_purple} nf-core/pgdb v${workflow.manifest.version}${c_reset} - -${c_dim}--------------------------------------------------${c_reset}- - """.stripIndent() -} - -def checkHostname() { - def c_reset = params.monochrome_logs ? '' : "\033[0m" - def c_white = params.monochrome_logs ? '' : "\033[0;37m" - def c_red = params.monochrome_logs ? '' : "\033[1;91m" - def c_yellow_bold = params.monochrome_logs ? '' : "\033[1;93m" - if (params.hostnames) { - def hostname = "hostname".execute().text.trim() - params.hostnames.each { prof, hnames -> - hnames.each { hname -> - if (hostname.contains(hname) && !workflow.profile.contains(prof)) { - log.error "====================================================\n" + - " ${c_red}WARNING!${c_reset} You are running with `-profile $workflow.profile`\n" + - " but your machine hostname is ${c_white}'$hostname'${c_reset}\n" + - " ${c_yellow_bold}It's highly recommended that you use `-profile $prof${c_reset}`\n" + - "============================================================" - } - } - } - } -} +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + THE END +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ diff --git a/modules.json b/modules.json new file mode 100644 index 00000000..a004fd67 --- /dev/null +++ b/modules.json @@ -0,0 +1,17 @@ +{ + "name": "nf-core/pgdb", + "homePage": "https://github.com/nf-core/pgdb", + "repos": { + "nf-core/modules": { + "custom/dumpsoftwareversions": { + "git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d" + }, + "fastqc": { + "git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d" + }, + "multiqc": { + "git_sha": "e745e167c1020928ef20ea1397b6b4d230681b4d" + } + } + } +} diff --git a/modules/local/cbioportalmutations/cbioportal_proteindb.nf b/modules/local/cbioportalmutations/cbioportal_proteindb.nf new file mode 100644 index 00000000..cff18b78 --- /dev/null +++ b/modules/local/cbioportalmutations/cbioportal_proteindb.nf @@ -0,0 +1,36 @@ +/** + * Generate proteinDB from cBioPortal mutations + */ +process CBIOPORTAL_PROTEINDB { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.cbioportal + + input: + file g + file m + file s + file cbioportal_config + val cbioportal_filter_column + val cbioportal_accepted_values + + output: + path 'cbioPortal_proteinDB*.fa' ,emit: cBioportal_proteindb + + script: + """ + pypgatk_cli.py cbioportal-to-proteindb \\ + --config_file $cbioportal_config \\ + --input_mutation $m \\ + --input_cds $g \\ + --clinical_sample_file $s \\ + --filter_column $cbioportal_filter_column \\ + --accepted_values $cbioportal_accepted_values \\ + --output_db cbioPortal_proteinDB.fa + """ +} diff --git a/modules/local/cbioportalmutations/cds_GRCh37_download.nf b/modules/local/cbioportalmutations/cds_GRCh37_download.nf new file mode 100644 index 00000000..be3b7bc1 --- /dev/null +++ b/modules/local/cbioportalmutations/cds_GRCh37_download.nf @@ -0,0 +1,18 @@ +/** + * Download GRCh37 CDS file from ENSEMBL release 75 + */ +process CDS_GRCH37_DOWNLOAD { + + when: + params.cbioportal + + output: + path("Homo_sapiens.GRCh37.75.cds.all.fa") ,emit:ch_GRCh37_cds + + script: + """ + wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz + gunzip *.gz + """ +} + diff --git a/modules/local/cbioportalmutations/download_all_cbioportal.nf b/modules/local/cbioportalmutations/download_all_cbioportal.nf new file mode 100644 index 00000000..7820f757 --- /dev/null +++ b/modules/local/cbioportalmutations/download_all_cbioportal.nf @@ -0,0 +1,66 @@ +/** + * Download all cBioPortal studies using git-lfs +*/ +process DOWNLOAD_ALL_CBIOPORTAL { + + // conda (params.enable_conda ? "bioconda::pypgatk=0.0.19 conda-forge::git-lfs=2.13.2 conda-forge::git=2.30.0" : null) + // container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ // 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' ; + // 'https://depot.galaxyproject.org/singularity/git-lfs_1.5.2--0'; + // 'https://containers.biocontainers.pro/s3/singularity/git_1' : + // 'quay.io/biocontainers/pypgatk:0.0.19--py_0'; + // 'quay.io/biocontainers/git-lfs:1.5.2--0'; + // 'bitnami/git:2.30.2' }" + container "nfcore/pgdb:latest" + + when: + params.cbioportal + + input: + file cbioportal_config + val cbioportal_study_id + + output: + path('cbioportal_allstudies_data_mutations.txt') ,emit: cbio_mutations + path('cbioportal_allstudies_data_clinical_sample.txt') ,emit: cbio_samples + + script: + if (cbioportal_study_id == "all") + """ + git clone https://github.com/cBioPortal/datahub.git + git lfs install --local --skip-smudge + git lfs pull -I public --include "data*clinical*sample.txt" + git lfs pull -I public --include "data_mutations.txt" + cat datahub/public/*/data_mutations.txt > cbioportal_allstudies_data_mutations.txt + cat datahub/public/*/*data*clinical*sample.txt | \\ + awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | \\ + awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; \\ + for(i=1;i<=NF;i++){ \\ + if(\$i=="CANCER_TYPE_DETAILED") j=1; \\ + if(\$i=="CANCER_TYPE") s=1; \\ + } \\ + if(j==1 && s==0){ \\ + gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE"); \\ + } \\ + print; \\ + }' \\ + > cbioportal_allstudies_data_clinical_sample.txt + """ + else + """ + pypgatk_cli.py cbioportal-downloader \\ + --config_file "$cbioportal_config" \\ + -d "$cbioportal_study_id" + + tar -xzvf database_cbioportal/${cbioportal_study_id}.tar.gz + cat ${cbioportal_study_id}/data_mutations.txt > cbioportal_allstudies_data_mutations.txt + cat ${cbioportal_study_id}/data_clinical_sample.txt | \\ + awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | \\ + awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; \\ + for(i=1;i<=NF;i++){ \\ + if(\$i=="CANCER_TYPE_DETAILED") j=1; if(\$i=="CANCER_TYPE") s=1; \\ + } \\ + if(j==1 && s==0){gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE");} print;}' \\ + > cbioportal_allstudies_data_clinical_sample.txt + """ +} diff --git a/modules/local/clean_protein_database.nf b/modules/local/clean_protein_database.nf new file mode 100644 index 00000000..9462c81d --- /dev/null +++ b/modules/local/clean_protein_database.nf @@ -0,0 +1,42 @@ +/** + * clean the database for stop codons, and unwanted AA like: *, also remove proteins with less than 6 AA + */ +process CLEAN_PROTEIN_DATABASE { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + publishDir "${params.outdir}/", mode: params.publish_dir_mode, + // Final step if not creating a decoy database - save output to params.final_database_protein + saveAs: { filename -> + params.decoy ? 
null : params.final_database_protein + } + + when: + params.clean_database + + input: + file file + file ensembl_config + val minimum_aa + + output: + path 'database_clean.fa' ,emit: clean_database_sh + + script: + stop_codons = '' + if (params.add_stop_codons){ + stop_codons = "--add_stop_codons" + } + + """ + pypgatk_cli.py ensembl-check \\ + -in "$file" \\ + --config_file $ensembl_config \\ + -out database_clean.fa \\ + --num_aa "$minimum_aa" \\ + "$stop_codons" + """ +} diff --git a/modules/local/cosmicmutations/cosmic_celllines_download.nf b/modules/local/cosmicmutations/cosmic_celllines_download.nf new file mode 100644 index 00000000..45706542 --- /dev/null +++ b/modules/local/cosmicmutations/cosmic_celllines_download.nf @@ -0,0 +1,21 @@ +/** + * Download COSMIC Cell Lines Mutations + */ +process COSMIC_CELLLINES_DOWNLOAD { + + when: + params.cosmic_celllines + + output: + path "database_cosmic/All_CellLines_Genes.fasta" ,emit: cosmic_celllines_genes + path "database_cosmic/CosmicCLP_MutantExport.tsv" ,emit:cosmic_celllines_mutations + + script: + base64 = '' + """ + base64=`echo "$params.cosmic_user_name:$params.cosmic_password" | base64` + curl -o database_cosmic/All_CellLines_Genes.fasta.gz --create-dirs `curl -H "Authorization: \$base64" https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cell_lines/$params.cosmic_dababase_version/All_CellLines_Genes.fasta.gz |grep -Po 'url[" :]+\\K[^"]+'` + curl -o database_cosmic/CosmicCLP_MutantExport.tsv.gz --create-dirs `curl -H "Authorization: \$base64" https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cell_lines/$params.cosmic_dababase_version/CosmicCLP_MutantExport.tsv.gz |grep -Po 'url[" :]+\\K[^"]+'` + gunzip database_cosmic/*.gz + """ +} diff --git a/modules/local/cosmicmutations/cosmic_celllines_proteindb.nf b/modules/local/cosmicmutations/cosmic_celllines_proteindb.nf new file mode 100644 index 00000000..daa775b1 --- /dev/null +++ b/modules/local/cosmicmutations/cosmic_celllines_proteindb.nf @@ -0,0 +1,33 @@ +/** + * Generate proteindb from cosmic cell lines mutations +*/ +process COSMIC_CELLLINES_PROTEINDB { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.cosmic_celllines + + input: + file g + file m + file cosmic_config + val cosmic_cellline_name + + output: + path 'cosmic_celllines_proteinDB*.fa' ,emit:cosmic_celllines_proteindbs + + script: + """ + pypgatk_cli.py cosmic-to-proteindb \\ + --config_file "$cosmic_config" \\ + --input_mutation $m \\ + --input_genes $g \\ + --filter_column 'Sample name' \\ + --accepted_values $cosmic_cellline_name \\ + --output_db cosmic_celllines_proteinDB.fa + """ +} diff --git a/modules/local/cosmicmutations/cosmic_celllines_proteindb_local.nf b/modules/local/cosmicmutations/cosmic_celllines_proteindb_local.nf new file mode 100644 index 00000000..03a7f4a9 --- /dev/null +++ b/modules/local/cosmicmutations/cosmic_celllines_proteindb_local.nf @@ -0,0 +1,33 @@ +/** + * Generate proteindb from local cosmic cell lines mutations +*/ +process COSMIC_CELLLINES_PROTEINDB_LOCAL { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.cosmiccelllines_genes && params.cosmiccelllines_mutations + + input: + file cosmic_config + val cosmic_cellline_name + file cosmiccelllines_mutations + file cosmiccelllines_genes + + output: + file 'cosmic_celllines_proteinDB*.fa' into cosmic_celllines_proteindbs_uselocal + + script: + """ + pypgatk_cli.py cosmic-to-proteindb \\ + --config_file "$cosmic_config" \\ + --input_mutation $cosmiccelllines_mutations \\ + --input_genes $cosmiccelllines_genes \\ + --filter_column 'Sample name' \\ + --accepted_values $cosmic_cellline_name \\ + --output_db cosmic_celllines_proteinDB.fa + """ +} diff --git a/modules/local/cosmicmutations/cosmic_download.nf b/modules/local/cosmicmutations/cosmic_download.nf new file mode 100644 index 00000000..da5ded31 --- /dev/null +++ b/modules/local/cosmicmutations/cosmic_download.nf @@ -0,0 +1,21 @@ +/** + * Download COSMIC Mutations + */ +process COSMIC_DOWNLOAD { + + when: + params.cosmic + + output: + path "database_cosmic/All_COSMIC_Genes.fasta" ,emit: cosmic_genes + path "database_cosmic/CosmicMutantExport.tsv" ,emit:cosmic_mutations + + script: + base64 = '' + """ + base64=`echo "$params.cosmic_user_name:$params.cosmic_password" | base64` + curl -o database_cosmic/All_COSMIC_Genes.fasta.gz --create-dirs `curl -H "Authorization: \$base64" https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/$params.cosmic_dababase_version/All_COSMIC_Genes.fasta.gz |grep -Po 'url[" :]+\\K[^"]+'` + curl -o database_cosmic/CosmicMutantExport.tsv.gz --create-dirs `curl -H "Authorization: \$base64" https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/$params.cosmic_dababase_version/CosmicMutantExport.tsv.gz |grep -Po 'url[" :]+\\K[^"]+'` + gunzip database_cosmic/*.gz + """ +} diff --git a/modules/local/cosmicmutations/cosmic_proteindb.nf b/modules/local/cosmicmutations/cosmic_proteindb.nf new file mode 100644 index 00000000..21eb166b --- /dev/null +++ b/modules/local/cosmicmutations/cosmic_proteindb.nf @@ -0,0 +1,32 @@ +/** + * Generate proteindb from cosmic mutations +*/ +process COSMIC_PROTEINDB { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.cosmic + + input: + file g + file m + file cosmic_config + val cosmic_cancer_type + + output: + path 'cosmic_proteinDB*.fa' ,emit: cosmic_proteindbs + + script: + """ + pypgatk_cli.py cosmic-to-proteindb \\ + --config_file "$cosmic_config" \\ + --input_mutation $m --input_genes $g \\ + --filter_column 'Histology subtype 1' \\ + --accepted_values $cosmic_cancer_type \\ + --output_db cosmic_proteinDB.fa + """ +} diff --git a/modules/local/cosmicmutations/cosmic_proteindb_local.nf b/modules/local/cosmicmutations/cosmic_proteindb_local.nf new file mode 100644 index 00000000..bf5ec108 --- /dev/null +++ b/modules/local/cosmicmutations/cosmic_proteindb_local.nf @@ -0,0 +1,32 @@ +/** + * Generate proteindb from local cosmic mutations +*/ +process COSMIC_PROTEINDB_LOCAL { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.cosmicgenes && params.cosmicmutations + + input: + file cosmic_config + val cosmic_cancer_type + file cosmicmutations + file cosmicgenes + + output: + file 'cosmic_proteinDB*.fa' into cosmic_proteindbs_uselocal + + script: + """ + pypgatk_cli.py cosmic-to-proteindb \\ + --config_file "$cosmic_config" \\ + --input_mutation $cosmicmutations --input_genes $cosmicgenes \\ + --filter_column 'Histology subtype 1' \\ + --accepted_values $cosmic_cancer_type \\ + --output_db cosmic_proteinDB.fa + """ +} diff --git a/modules/local/customvcf/gtf_to_fasta.nf b/modules/local/customvcf/gtf_to_fasta.nf new file mode 100644 index 00000000..8d9464d2 --- /dev/null +++ b/modules/local/customvcf/gtf_to_fasta.nf @@ -0,0 +1,25 @@ +/** + * Generate protein databse for a given VCF + */ +process GTF_TO_FASTA { + + conda (params.enable_conda ? "bioconda::gffread=0.12.1" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/gffread_0.12.1--h8b12597_0' : + 'quay.io/biocontainers/gffread:0.12.1--h8b12597_0' }" + + when: + params.vcf + + input: + file g + file f + + output: + path "transcripts.fa" ,emit:gtf_transcripts_fasta + + script: + """ + gffread -w transcripts.fa -g $f $g + """ +} diff --git a/modules/local/customvcf/vcf_proteinDB.nf b/modules/local/customvcf/vcf_proteinDB.nf new file mode 100644 index 00000000..a176dae7 --- /dev/null +++ b/modules/local/customvcf/vcf_proteinDB.nf @@ -0,0 +1,35 @@ +process VCF_PROTEINDB { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.vcf + + input: + file v + file f + file g + file e + val af_field + + output: + path "*_proteinDB.fa" ,emit: proteinDB_custom_vcf + + script: + """ + awk 'BEGIN{FS=OFS="\t"}{if(\$1=="chrM") \$1="MT"; gsub("chr","",\$1); print}' \\ + $v > ${v.baseName}_changedChrNames.vcf + + pypgatk_cli.py vcf-to-proteindb \\ + --config_file $e \\ + --af_field "$af_field" \\ + --input_fasta $f \\ + --gene_annotations_gtf $g \\ + --vcf ${v.baseName}_changedChrNames.vcf \\ + --output_proteindb ${v.baseName}_proteinDB.fa \\ + --annotation_field_name '' + """ +} diff --git a/modules/local/decoy.nf b/modules/local/decoy.nf new file mode 100644 index 00000000..89b0f118 --- /dev/null +++ b/modules/local/decoy.nf @@ -0,0 +1,39 @@ +/** + * Create the decoy database using DecoyPYrat + * Decoy sequences will have "DECOY_" prefix tag to the protein accession. + */ +process DECOY { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + publishDir "${params.outdir}/", mode: params.publish_dir_mode, + saveAs: { filename -> params.final_database_protein } + + when: + params.decoy + + input: + file f + file protein_decoy_config + val decoy_method + val decoy_enzyme + val decoy_prefix + + output: + path 'decoy_database.fa' ,emit: fasta_decoy_db_ch + + script: + """ + pypgatk_cli.py generate-decoy \\ + --method "$decoy_method" \\ + --enzyme "$decoy_enzyme" \\ + --config_file $protein_decoy_config \\ + --input_database $f \\ + --decoy_prefix "$decoy_prefix" \\ + --output_database decoy_database.fa + """ +} + diff --git a/modules/local/get_software_versions.nf b/modules/local/get_software_versions.nf new file mode 100644 index 00000000..c48fbd3e --- /dev/null +++ b/modules/local/get_software_versions.nf @@ -0,0 +1,17 @@ +/* + * Parse software version numbers + */ +process GET_SOFTWARE_VERSIONS { + + output: + path "v_pipeline.txt" ,emit: v_pipeline + path "v_nextflow.txt" ,emit: v_nextflow + path "pypgatk_version.txt" ,emit: pypgatk_version + + script: + """ + echo $workflow.manifest.version > v_pipeline.txt + echo $workflow.nextflow.version > v_nextflow.txt + pip list |grep pypgatk > pypgatk_version.txt + """ +} diff --git a/modules/local/gnomadvariatns/extract_gnomad_vcf.nf b/modules/local/gnomadvariatns/extract_gnomad_vcf.nf new file mode 100644 index 00000000..03613267 --- /dev/null +++ b/modules/local/gnomadvariatns/extract_gnomad_vcf.nf @@ -0,0 +1,19 @@ +/** + * Extract gnomAD VCF + */ +process EXTRACT_GNOMAD_VCF { + + when: + params.gnomad + + input: + file g + + output: + path "*.vcf" ,emit: gnomad_vcf_files + + script: + """ + zcat $g > ${g}.vcf + """ +} diff --git a/modules/local/gnomadvariatns/gencode_download.nf b/modules/local/gnomadvariatns/gencode_download.nf new file mode 100644 index 00000000..176d362a --- /dev/null +++ b/modules/local/gnomadvariatns/gencode_download.nf @@ -0,0 +1,22 @@ +/** + * Download gencode files (fasta and gtf) + */ +process GENCODE_DOWNLOAD { + + when: + params.gnomad + + input: + val g + + output: + path("gencode.v19.pc_transcripts.fa") ,emit:gencode_fasta + path("gencode.v19.annotation.gtf") ,emit:gencode_gtf + + script: + """ + wget ${g}/gencode.v19.pc_transcripts.fa.gz + wget ${g}/gencode.v19.annotation.gtf.gz + gunzip *.gz + """ +} diff --git a/modules/local/gnomadvariatns/gnomad_download.nf b/modules/local/gnomadvariatns/gnomad_download.nf new file mode 100644 index 00000000..e6f12f7a --- /dev/null +++ b/modules/local/gnomadvariatns/gnomad_download.nf @@ -0,0 +1,21 @@ +/** + * Download gnomAD variants (VCF) - requires gsutil + */ +process GNOMAD_DOWNLOAD { + + conda (params.enable_conda ? "conda-forge::gsutil=5.10" : null) + + when: + params.gnomad + + input: + val g + + output: + path "*.vcf.bgz" ,emit:gnomad_vcf_bgz + + script: + """ + gsutil cp $g . + """ +} diff --git a/modules/local/gnomadvariatns/gnomad_proteindb.nf b/modules/local/gnomadvariatns/gnomad_proteindb.nf new file mode 100644 index 00000000..eaa18c89 --- /dev/null +++ b/modules/local/gnomadvariatns/gnomad_proteindb.nf @@ -0,0 +1,36 @@ +/** + * Generate gmomAD proteinDB + */ +process GNOMAD_PROTEINDB { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.gnomad + + input: + file v + file f + file g + file e + + output: + path "${v}_proteinDB.fa" ,emit:gnomad_vcf_proteindb + + script: + """ + pypgatk_cli.py vcf-to-proteindb \\ + --config_file $e \\ + --vcf $v \\ + --input_fasta $f \\ + --gene_annotations_gtf $g \\ + --output_proteindb "${v}_proteinDB.fa" \\ + --af_field controls_AF \\ + --transcript_index 6 \\ + --annotation_field_name vep \\ + --var_prefix gnomadvar + """ +} diff --git a/modules/local/merge_proteindbs.nf b/modules/local/merge_proteindbs.nf new file mode 100644 index 00000000..5f9f3664 --- /dev/null +++ b/modules/local/merge_proteindbs.nf @@ -0,0 +1,22 @@ +/** + * Concatenate all generated databases from merged_databases channel to the final_database_protein file + */ +process MERGE_PROTEINDBS { + + publishDir "${params.outdir}/", mode: params.publish_dir_mode, + // Final step if not cleaning or creating a decoy database - save output to params.final_database_protein + saveAs: { filename -> + params.clean_database || params.decoy ? null : params.final_database_protein + } + + input: + file("proteindb*") + + output: + path 'merged_databases.fa' ,emit: to_clean_ch + + script: + """ + cat proteindb* > merged_databases.fa + """ +} diff --git a/modules/local/output_documentation.nf b/modules/local/output_documentation.nf new file mode 100644 index 00000000..afcc4cc1 --- /dev/null +++ b/modules/local/output_documentation.nf @@ -0,0 +1,19 @@ +/* + * Output Description HTML + */ +process OUTPUT_DOCUMENTATION { + + publishDir "${params.outdir}/pipeline_info", mode: params.publish_dir_mode + + input: + file output_docs + file images + + output: + path 'results_description.html' + + script: + """ + markdown_to_html.py $output_docs -o results_description.html + """ +} diff --git a/modules/local/proteomes/add_altorfs.nf b/modules/local/proteomes/add_altorfs.nf new file mode 100644 index 00000000..679014a4 --- /dev/null +++ b/modules/local/proteomes/add_altorfs.nf @@ -0,0 +1,31 @@ +/** + * Creates the altORFs protein database + */ +process ADD_ALTORFS { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.altorfs + + input: + file x + file ensembl_config + + output: + path 'altorfs_proteinDB.fa' ,emit:optional_altorfs + + script: + """ + pypgatk_cli.py dnaseq-to-proteindb \\ + --config_file "$ensembl_config" \\ + --input_fasta "$x" \\ + --output_proteindb altorfs_proteinDB.fa \\ + --include_biotypes "${params.biotypes['protein_coding']}'" \\ + --skip_including_all_cds \\ + --var_prefix altorf_ + """ +} diff --git a/modules/local/proteomes/add_ncrna.nf b/modules/local/proteomes/add_ncrna.nf new file mode 100644 index 00000000..3b9684d3 --- /dev/null +++ b/modules/local/proteomes/add_ncrna.nf @@ -0,0 +1,31 @@ +/** + * Creates the ncRNA protein database + */ +process ADD_NCRNA { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.ncrna + + input: + file x + file ensembl_config + + + output: + path 'ncRNAs_proteinDB.fa',emit: optional_ncrna + + script: + """ + pypgatk_cli.py dnaseq-to-proteindb \\ + --config_file "$ensembl_config" \\ + --input_fasta $x \\ + --output_proteindb ncRNAs_proteinDB.fa \\ + --include_biotypes "${params.biotypes['ncRNA']}" \\ + --skip_including_all_cds --var_prefix ncRNA_ + """ +} diff --git a/modules/local/proteomes/add_pseudogenes.nf b/modules/local/proteomes/add_pseudogenes.nf new file mode 100644 index 00000000..762d16b4 --- /dev/null +++ b/modules/local/proteomes/add_pseudogenes.nf @@ -0,0 +1,32 @@ +/** + * Creates the pseudogenes protein database + */ +process ADD_PSEUDOGENES { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.pseudogenes + + input: + file x + file ensembl_config + + output: + path 'pseudogenes_proteinDB.fa' ,emit: optional_pseudogenes + + script: + """ + pypgatk_cli.py dnaseq-to-proteindb \\ + --config_file "$ensembl_config" \\ + --input_fasta "$x" \\ + --output_proteindb pseudogenes_proteinDB.fa \\ + --include_biotypes "${params.biotypes['pseudogene']}" \\ + --skip_including_all_cds \\ + --var_prefix pseudo_ + """ +} + diff --git a/modules/local/proteomes/add_reference_proteome.nf b/modules/local/proteomes/add_reference_proteome.nf new file mode 100644 index 00000000..c6875568 --- /dev/null +++ b/modules/local/proteomes/add_reference_proteome.nf @@ -0,0 +1,19 @@ +/** + * Add reference proteome + */ +process ADD_REFERENCE_PROTEOME { + + when: + params.add_reference + + input: + file reference_proteome + + output: + path 'reference_proteome.fa', emit: ensembl_protein_database + + script: + """ + cat $reference_proteome >> reference_proteome.fa + """ +} diff --git a/modules/local/proteomes/ensembl_fasta_download.nf b/modules/local/proteomes/ensembl_fasta_download.nf new file mode 100644 index 00000000..2c16228b --- /dev/null +++ b/modules/local/proteomes/ensembl_fasta_download.nf @@ -0,0 +1,32 @@ +/** + * Download data from ensembl for the particular species. + */ +process ENSEMBL_FASTA_DOWNLOAD { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.add_reference || params.ensembl || params.altorfs || params.ncrna || params.pseudogenes || params.vcf + + input: + file ensembl_downloader_config + val ensembl_name + + output: + path 'database_ensembl/*.pep.all.fa', emit: ensembl_protein_database_sub + path 'database_ensembl/*cdna.all.fa', emit: ensembl_cdna_database_sub + path 'database_ensembl/*ncrna.fa', emit: ensembl_ncrna_database_sub + path 'database_ensembl/*.dna*.fa' ,emit: genome_fasta + path 'database_ensembl/*.gtf' ,emit: gtf + + script: + """ + pypgatk_cli.py ensembl-downloader \\ + --config_file $ensembl_downloader_config \\ + --ensembl_name $ensembl_name \\ + -sv -sc + """ +} diff --git a/modules/local/proteomes/merge_cdnas.nf b/modules/local/proteomes/merge_cdnas.nf new file mode 100644 index 00000000..2cf26123 --- /dev/null +++ b/modules/local/proteomes/merge_cdnas.nf @@ -0,0 +1,18 @@ +/** + * Concatenate cDNA and ncRNA databases + **/ +process MERGE_CDNAS { + + input: + file a + file b + + output: + path 'total_cdnas.fa',emit: total_cdnas + + script: + """ + cat $a >> total_cdnas.fa + cat $b >> total_cdnas.fa + """ +} diff --git a/modules/local/samplesheet_check.nf b/modules/local/samplesheet_check.nf new file mode 100644 index 00000000..fcacf567 --- /dev/null +++ b/modules/local/samplesheet_check.nf @@ -0,0 +1,28 @@ +process SAMPLESHEET_CHECK { + tag "$samplesheet" + + conda (params.enable_conda ? "conda-forge::python=3.8.3" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/python:3.8.3' : + 'quay.io/biocontainers/python:3.8.3' }" + + input: + path samplesheet + + output: + path '*.csv' , emit: csv + path "versions.yml", emit: versions + + script: // This script is bundled with the pipeline, in nf-core/pgdb/bin/ + """ + check_samplesheet.py \\ + $samplesheet \\ + samplesheet.valid.csv + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + python: \$(python --version | sed 's/Python //g') + END_VERSIONS + """ +} + diff --git a/modules/local/vcf/check_ensembl_vcf.nf b/modules/local/vcf/check_ensembl_vcf.nf new file mode 100644 index 00000000..fe91c3f9 --- /dev/null +++ b/modules/local/vcf/check_ensembl_vcf.nf @@ -0,0 +1,22 @@ +/** + * Check VCF files from ensembl for the particular species + */ +process CHECK_ENSEMBL_VCF { + + label 'process_medium' + label 'process_single_thread' + + when: + params.ensembl + + input: + file vcf_file + + output: + path "checked_*.vcf" ,emit:ensembl_vcf_files_checked + + script: + """ + awk 'BEGIN{FS=OFS="\t"}{if(\$1~"#" || (\$5!="" && \$4!="")) print}' $vcf_file > checked_$vcf_file + """ +} diff --git a/modules/local/vcf/ensembl_vcf_download.nf b/modules/local/vcf/ensembl_vcf_download.nf new file mode 100644 index 00000000..0b8fc1f3 --- /dev/null +++ b/modules/local/vcf/ensembl_vcf_download.nf @@ -0,0 +1,29 @@ +/** + * Download VCF files from ensembl for the particular species. + */ +process ENSEMBL_VCF_DOWNLOAD { + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.ensembl + + input: + file ensembl_downloader_config + val ensembl_name + + output: + path "database_ensembl/*.vcf" ,emit: ensembl_vcf_files + + script: + """ + pypgatk_cli.py ensembl-downloader \\ + --config_file $ensembl_downloader_config \\ + --ensembl_name $ensembl_name \\ + -sg -sp -sc -sd -sn + """ +} + diff --git a/modules/local/vcf/ensembl_vcf_proteindb.nf b/modules/local/vcf/ensembl_vcf_proteindb.nf new file mode 100644 index 00000000..043b5e20 --- /dev/null +++ b/modules/local/vcf/ensembl_vcf_proteindb.nf @@ -0,0 +1,39 @@ +/** + * Generate protein database(s) from ENSEMBL vcf file(s) + */ +process ENSEMBL_VCF_PROTEINDB { + + label 'process_medium' + label 'process_single_thread' + + conda (params.enable_conda ? "bioconda::pypgatk=0.0.19" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pypgatk_0.0.19--py_0' : + 'quay.io/biocontainers/pypgatk:0.0.19--py_0' }" + + when: + params.ensembl + + input: + file v + file f + file g + file e + val af_field + + output: + path "${v}_proteinDB.fa" ,emit: proteinDB_vcf + + script: + """ + pypgatk_cli.py vcf-to-proteindb \\ + --config_file $e \\ + --af_field $af_field \\ + --input_fasta $f \\ + --gene_annotations_gtf $g \\ + --vcf $v \\ + --output_proteindb "${v}_proteinDB.fa" \\ + --var_prefix ensvar \\ + --annotation_field_name 'CSQ' + """ +} diff --git a/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf b/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf new file mode 100644 index 00000000..327d5100 --- /dev/null +++ b/modules/nf-core/modules/custom/dumpsoftwareversions/main.nf @@ -0,0 +1,24 @@ +process CUSTOM_DUMPSOFTWAREVERSIONS { + label 'process_low' + + // Requires `pyyaml` which does not have a dedicated container but is in the MultiQC container + conda (params.enable_conda ? "bioconda::multiqc=1.11" : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+        'https://depot.galaxyproject.org/singularity/multiqc:1.11--pyhdfd78af_0' :
+        'quay.io/biocontainers/multiqc:1.11--pyhdfd78af_0' }"
+
+    input:
+    path versions
+
+    output:
+    path "software_versions.yml"    , emit: yml
+    path "software_versions_mqc.yml", emit: mqc_yml
+    path "versions.yml"             , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    template 'dumpsoftwareversions.py'
+}
diff --git a/modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml b/modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml
new file mode 100644
index 00000000..60b546a0
--- /dev/null
+++ b/modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml
@@ -0,0 +1,34 @@
+name: custom_dumpsoftwareversions
+description: Custom module used to dump software versions within the nf-core pipeline template
+keywords:
+  - custom
+  - version
+tools:
+  - custom:
+      description: Custom module used to dump software versions within the nf-core pipeline template
+      homepage: https://github.com/nf-core/tools
+      documentation: https://github.com/nf-core/tools
+      licence: ["MIT"]
+input:
+  - versions:
+      type: file
+      description: YML file containing software versions
+      pattern: "*.yml"
+
+output:
+  - yml:
+      type: file
+      description: Standard YML file containing software versions
+      pattern: "software_versions.yml"
+  - mqc_yml:
+      type: file
+      description: MultiQC custom content YML file containing software versions
+      pattern: "software_versions_mqc.yml"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+
+authors:
+  - "@drpatelh"
+  - "@grst"
diff --git a/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py b/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py
new file mode 100644
index 00000000..d1390392
--- /dev/null
+++ b/modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py
@@ -0,0 +1,89 @@
+#!/usr/bin/env python
+
+import yaml
+import platform
+from textwrap import dedent
+
+
+def _make_versions_html(versions):
+    html = [
+        dedent(
+            """\\
+            <style>
+            #nf-core-versions tbody:nth-child(even) {
+                background-color: #f2f2f2;
+            }
+            </style>
+            <table class="table" style="width:100%" id="nf-core-versions">
+                <thead>
+                    <tr>
+                        <th> Process Name </th>
+                        <th> Software </th>
+                        <th> Version </th>
+                    </tr>
+                </thead>
+            """
+        )
+    ]
+    for process, tmp_versions in sorted(versions.items()):
+        html.append("<tbody>")
+        for i, (tool, version) in enumerate(sorted(tmp_versions.items())):
+            html.append(
+                dedent(
+                    f"""\\
+                    <tr>
+                        <td><samp>{process if (i == 0) else ''}</samp></td>
+                        <td><samp>{tool}</samp></td>
+                        <td><samp>{version}</samp></td>
+                    </tr>
+                    """
+                )
+            )
+        html.append("</tbody>")
+    html.append("</table>")
+    return "\\n".join(html)
+
+
+versions_this_module = {}
+versions_this_module["${task.process}"] = {
+    "python": platform.python_version(),
+    "yaml": yaml.__version__,
+}
+
+with open("$versions") as f:
+    versions_by_process = yaml.load(f, Loader=yaml.BaseLoader) | versions_this_module
+
+# aggregate versions by the module name (derived from fully-qualified process name)
+versions_by_module = {}
+for process, process_versions in versions_by_process.items():
+    module = process.split(":")[-1]
+    try:
+        assert versions_by_module[module] == process_versions, (
+            "We assume that software versions are the same between all modules. "
+            "If you see this error-message it means you discovered an edge-case "
+            "and should open an issue in nf-core/tools. "
+        )
+    except KeyError:
+        versions_by_module[module] = process_versions
+
+versions_by_module["Workflow"] = {
+    "Nextflow": "$workflow.nextflow.version",
+    "$workflow.manifest.name": "$workflow.manifest.version",
+}
+
+versions_mqc = {
+    "id": "software_versions",
+    "section_name": "${workflow.manifest.name} Software Versions",
+    "section_href": "https://github.com/${workflow.manifest.name}",
+    "plot_type": "html",
+    "description": "are collected at run time from the software output.",
+    "data": _make_versions_html(versions_by_module),
+}
+
+with open("software_versions.yml", "w") as f:
+    yaml.dump(versions_by_module, f, default_flow_style=False)
+with open("software_versions_mqc.yml", "w") as f:
+    yaml.dump(versions_mqc, f, default_flow_style=False)
+
+with open("versions.yml", "w") as f:
+    yaml.dump(versions_this_module, f, default_flow_style=False)
diff --git a/modules/nf-core/modules/fastqc/main.nf b/modules/nf-core/modules/fastqc/main.nf
new file mode 100644
index 00000000..ed6b8c50
--- /dev/null
+++ b/modules/nf-core/modules/fastqc/main.nf
@@ -0,0 +1,47 @@
+process FASTQC {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda (params.enable_conda ? "bioconda::fastqc=0.11.9" : null)
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
+        'quay.io/biocontainers/fastqc:0.11.9--0' }"
+
+    input:
+    tuple val(meta), path(reads)
+
+    output:
+    tuple val(meta), path("*.html"), emit: html
+    tuple val(meta), path("*.zip") , emit: zip
+    path "versions.yml"            , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    // Add soft-links to original FastQs for consistent naming in pipeline
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    if (meta.single_end) {
+        """
+        [ ! -f ${prefix}.fastq.gz ] && ln -s $reads ${prefix}.fastq.gz
+        fastqc $args --threads $task.cpus ${prefix}.fastq.gz
+
+        cat <<-END_VERSIONS > versions.yml
+        "${task.process}":
+            fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
+        END_VERSIONS
+        """
+    } else {
+        """
+        [ ! -f ${prefix}_1.fastq.gz ] && ln -s ${reads[0]} ${prefix}_1.fastq.gz
+        [ !
-f ${prefix}_2.fastq.gz ] && ln -s ${reads[1]} ${prefix}_2.fastq.gz + fastqc $args --threads $task.cpus ${prefix}_1.fastq.gz ${prefix}_2.fastq.gz + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" ) + END_VERSIONS + """ + } +} diff --git a/modules/nf-core/modules/fastqc/meta.yml b/modules/nf-core/modules/fastqc/meta.yml new file mode 100644 index 00000000..4da5bb5a --- /dev/null +++ b/modules/nf-core/modules/fastqc/meta.yml @@ -0,0 +1,52 @@ +name: fastqc +description: Run FastQC on sequenced reads +keywords: + - quality control + - qc + - adapters + - fastq +tools: + - fastqc: + description: | + FastQC gives general quality metrics about your reads. + It provides information about the quality score distribution + across your reads, the per base sequence content (%A/C/G/T). + You get information about adapter contamination and other + overrepresented sequences. + homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ + documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ + licence: ["GPL-2.0-only"] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - reads: + type: file + description: | + List of input FastQ files of size 1 and 2 for single-end and paired-end data, + respectively. +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test', single_end:false ] + - html: + type: file + description: FastQC report + pattern: "*_{fastqc.html}" + - zip: + type: file + description: FastQC report archive + pattern: "*_{fastqc.zip}" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@drpatelh" + - "@grst" + - "@ewels" + - "@FelixKrueger" diff --git a/modules/nf-core/modules/multiqc/main.nf b/modules/nf-core/modules/multiqc/main.nf new file mode 100644 index 00000000..1264aac1 --- /dev/null +++ b/modules/nf-core/modules/multiqc/main.nf @@ -0,0 +1,31 @@ +process MULTIQC { + label 'process_medium' + + conda (params.enable_conda ? 'bioconda::multiqc=1.12' : null) + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/multiqc:1.12--pyhdfd78af_0' : + 'quay.io/biocontainers/multiqc:1.12--pyhdfd78af_0' }" + + input: + path multiqc_files + + output: + path "*multiqc_report.html", emit: report + path "*_data" , emit: data + path "*_plots" , optional:true, emit: plots + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + """ + multiqc -f $args . + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" ) + END_VERSIONS + """ +} diff --git a/modules/nf-core/modules/multiqc/meta.yml b/modules/nf-core/modules/multiqc/meta.yml new file mode 100644 index 00000000..6fa891ef --- /dev/null +++ b/modules/nf-core/modules/multiqc/meta.yml @@ -0,0 +1,40 @@ +name: MultiQC +description: Aggregate results from bioinformatics analyses across many samples into a single report +keywords: + - QC + - bioinformatics tools + - Beautiful stand-alone HTML report +tools: + - multiqc: + description: | + MultiQC searches a given directory for analysis logs and compiles a HTML report. + It's a general use tool, perfect for summarising the output from numerous bioinformatics tools. 
+ homepage: https://multiqc.info/ + documentation: https://multiqc.info/docs/ + licence: ["GPL-3.0-or-later"] +input: + - multiqc_files: + type: file + description: | + List of reports / files recognised by MultiQC, for example the html and zip output of FastQC +output: + - report: + type: file + description: MultiQC report file + pattern: "multiqc_report.html" + - data: + type: dir + description: MultiQC data dir + pattern: "multiqc_data" + - plots: + type: file + description: Plots created by MultiQC + pattern: "*_data" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@abhi18av" + - "@bunop" + - "@drpatelh" diff --git a/nextflow.config b/nextflow.config index 8b52f9a9..39644dfb 100644 --- a/nextflow.config +++ b/nextflow.config @@ -1,85 +1,98 @@ /* - * ------------------------------------------------- - * nf-core/pgdb Nextflow config file - * ------------------------------------------------- - * Default config options for all environments. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + nf-core/pgdb Nextflow config file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Default config options for all compute environments +---------------------------------------------------------------------------------------- */ // Global default params, used in configs params { - // Nf-core lint required params that are unused - input = 'foobarbaz' - - // Workflow flags - root_folder = '' - local_input_type = '' - // process flag variables - ncrna = true - pseudogenes = true - altorfs = true - cbioportal = true - cosmic = true - cosmic_celllines = true - ensembl = true - gnomad = true - decoy = false - push_s3 = false + input = 'dump_input.csv' + enable_conda = false + ncrna = false + pseudogenes = false + altorfs = false + vcf = false + cbioportal = false + cosmic = false + cosmic_celllines = null + cosmic_dababase_version = 'latest' + ensembl = false + gnomad = false add_reference = true + //local COSMIC files + cosmicgenes = null + cosmicmutations = null + cosmiccelllines_genes = null + cosmiccelllines_mutations = null + // Output results - result_file = 'fasta_database.fa' - outdir = "$baseDir/result" + outdir = './results' - //data download variables - cosmic_user_name = "" - cosmic_password = "" + // Clean database options + clean_database = false + minimum_aa = 6 + add_stop_codons = true - //config files - ensembl_downloader_config = "$baseDir/conf/ensembl_downloader_config.yaml" - ensembl_config = "$baseDir/conf/ensembl_config.yaml" - cosmic_config = "$baseDir/conf/cosmic_config.yaml" - cbioportal_config = "$baseDir/conf/cbioportal_config.yaml" - protein_decoy_config = "$baseDir/conf/protein_decoy.yaml" + // data download variables + cosmic_user_name = null + cosmic_password = null - //ENSEMBL parameters - taxonomy = "9606" + // config files + ensembl_downloader_config = "$projectDir/conf/ensembl_downloader_config.yaml" + ensembl_config = "$projectDir/conf/ensembl_config.yaml" + cosmic_config = "$projectDir/conf/cosmic_config.yaml" + cbioportal_config = "$projectDir/conf/cbioportal_config.yaml" + protein_decoy_config = "$projectDir/conf/protein_decoy.yaml" + + // ENSEMBL parameters + ensembl_name = 'homo_sapiens' /* Biotype groups according to: * https://www.ensembl.org/Help/Faq?id=468 and * http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html */ - biotypes = [ 'protein_coding': 
"protein_coding,polymorphic_pseudogene,non_stop_decay,nonsense_mediated_decay,IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene,TEC", - 'pseudogene': "pseudogene,IG_C_pseudogene,IG_J_pseudogene,IG_V_pseudogene,IG_pseudogene,TR_V_pseudogene,TR_J_pseudogene,processed_pseudogene,rRNA_pseudogene,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_unprocessed_pseudogene,unitary_pseudogene,unprocessed_pseudogene,translated_processed_pseudogene", - 'ncRNA': "lncRNA,Mt_rRNA,Mt_tRNA,miRNA,misc_RNA,rRNA,retained_intron,ribozyme,sRNA,scRNA,scaRNA,snRNA,snoRNA,vaultRNA", + biotypes = [ + 'protein_coding': "protein_coding,polymorphic_pseudogene,non_stop_decay,nonsense_mediated_decay,IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene,TEC", + 'pseudogene': "pseudogene,IG_C_pseudogene,IG_J_pseudogene,IG_V_pseudogene,IG_pseudogene,TR_V_pseudogene,TR_J_pseudogene,processed_pseudogene,rRNA_pseudogene,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_unprocessed_pseudogene,unitary_pseudogene,unprocessed_pseudogene,translated_processed_pseudogene", + 'ncRNA': "lncRNA,Mt_rRNA,Mt_tRNA,miRNA,misc_RNA,rRNA,retained_intron,ribozyme,sRNA,scRNA,scaRNA,snRNA,snoRNA,vaultRNA", ] - //vcf-to-proteindb parameters - cosmic_tissue_type = 'all' + // vcf-to-proteindb parameters + vcf = false + vcf_file = null + cosmic_cancer_type = 'all' cosmic_cellline_name = 'all' - cbioportal_tissue_type = 'all' - split_by_filter_column = false - split_by_filter_column = "" - af_field = "" //set to empty when AF_field does not exist in the INFO filed or filtering on AF is not desired + cbioportal_study_id = 'all' + cbioportal_accepted_values = 'all' + cbioportal_filter_column = 'CANCER_TYPE' + af_field = null // set to empty when AF_field does not exist in the INFO filed or filtering on AF is not desired - //Output parameters + // Output parameters final_database_protein = "final_proteinDB.fa" - decoy_prefix = "decoy_" - //gencode download parameters + // Decoy options + decoy_prefix = "Decoy_" + decoy = false + decoy_method = "decoypyrat" + decoy_enzyme = "Trypsin" + + // gencode download parameters gencode_url = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19" gnomad_file_url = "gs://gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.*.vcf.bgz" - params.gnomad_file_url = "gs://gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz" //use for testing the pipeline, smaller file - only exomes + gnomad_file_url = "gs://gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz" // use for testing the pipeline, smaller file - only exomes publish_dir_mode = 'copy' // Boilerplate options - name = false - email = false - email_on_fail = false + email = null + email_on_fail = null max_multiqc_email_size = 25.MB plaintext_email = false monochrome_logs = false @@ -87,10 +100,14 @@ params { tracedir = "${params.outdir}/pipeline_info" custom_config_version = 'master' custom_config_base = "https://raw.githubusercontent.com/nf-core/configs/${params.custom_config_version}" - hostnames = false - config_profile_description = false - config_profile_contact = false - config_profile_url = false + hostnames = null + config_profile_name = null + config_profile_description = null + config_profile_contact = null + config_profile_url = null + validate_params = true + show_hidden_params = false + schema_ignore_params = 
'biotypes' // Defaults only, expecting to be overwritten max_memory = 128.GB @@ -99,117 +116,151 @@ params { } -// Container slug. Stable releases should specify release tag! -// Developmental code should specify :dev -process.container = 'nfcore/pgdb:dev' - // Load base.config by default for all pipelines includeConfig 'conf/base.config' // Load nf-core custom profiles from different Institutions try { - includeConfig "${params.custom_config_base}/nfcore_custom.config" + includeConfig "${params.custom_config_base}/nfcore_custom.config" } catch (Exception e) { - System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/nfcore_custom.config") + System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/nfcore_custom.config") } +// Load nf-core/pgdb custom profiles from different institutions. +// Warning: Uncomment only if a pipeline-specific instititutional config already exists on nf-core/configs! +// try { +// includeConfig "${params.custom_config_base}/pipeline/pgdb.config" +// } catch (Exception e) { +// System.err.println("WARNING: Could not load nf-core/config/pgdb profiles: ${params.custom_config_base}/pipeline/pgdb.config") +// } + + profiles { - conda { - process.conda = "$baseDir/environment.yml" - conda.createTimeout = '1 h' - } - debug { process.beforeScript = 'echo $HOSTNAME' } - docker { - docker.enabled = true - // Avoid this error: - // WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap. - // Testing this in nf-core after discussion here https://github.com/nf-core/tools/pull/351 - // once this is established and works well, nextflow might implement this behavior as new default. - docker.runOptions = '-u \$(id -u):\$(id -g)' - } - podman { - podman.enabled = true - } - singularity { - singularity.enabled = true - singularity.autoMounts = true - } - lsf { - process.executor = 'lsf' - } - test { includeConfig 'conf/test.config' } - test_localize { includeConfig 'conf/test_localize.config' } - test_full { includeConfig 'conf/test_full.config' } - test_speccount { includeConfig 'conf/test_speccount.config' } - dev { includeConfig 'conf/dev.config' } + debug { process.beforeScript = 'echo $HOSTNAME' } + conda { + conda.enabled = true + docker.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + } + docker { + docker.enabled = true + docker.userEmulation = true + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + } + singularity { + singularity.enabled = true + singularity.autoMounts = true + docker.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + } + podman { + podman.enabled = true + docker.enabled = false + singularity.enabled = false + shifter.enabled = false + charliecloud.enabled = false + } + shifter { + shifter.enabled = true + docker.enabled = false + singularity.enabled = false + podman.enabled = false + charliecloud.enabled = false + } + charliecloud { + charliecloud.enabled = true + docker.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + } + test { includeConfig 'conf/test.config' } + test_full { includeConfig 'conf/test_full.config' } + test_cosmic_cbio { includeConfig 'conf/test_cosmic_cbio.config' } } // Export these variables to prevent local Python/R libraries from conflicting with 
those in the container +// The JULIA depot path has been adjusted to a fixed path `/usr/local/share/julia` that needs to be used for packages in the container. +// See https://apeltzer.github.io/post/03-julia-lang-nextflow/ for details on that. Once we have a common agreement on where to keep Julia packages, this is adjustable. + env { - PYTHONNOUSERSITE = 1 - R_PROFILE_USER = "/.Rprofile" - R_ENVIRON_USER = "/.Renviron" - } + PYTHONNOUSERSITE = 1 + R_PROFILE_USER = "/.Rprofile" + R_ENVIRON_USER = "/.Renviron" + JULIA_DEPOT_PATH = "/usr/local/share/julia" +} // Capture exit codes from upstream processes when piping process.shell = ['/bin/bash', '-euo', 'pipefail'] +def trace_timestamp = new java.util.Date().format( 'yyyy-MM-dd_HH-mm-ss') timeline { - enabled = true - file = "${params.tracedir}/execution_timeline.html" - } + enabled = true + file = "${params.tracedir}/execution_timeline_${trace_timestamp}.html" +} report { - enabled = true - file = "${params.tracedir}/execution_report.html" - } + enabled = true + file = "${params.tracedir}/execution_report_${trace_timestamp}.html" +} trace { - enabled = true - file = "${params.tracedir}/execution_trace.txt" - } + enabled = true + file = "${params.tracedir}/execution_trace_${trace_timestamp}.txt" +} dag { - enabled = true - file = "${params.tracedir}/pipeline_dag.svg" - } + enabled = true + file = "${params.tracedir}/pipeline_dag_${trace_timestamp}.html" +} manifest { - name = 'nf-core/pgdb' - author = 'Husen M. Umer & Yasset Perez-Riverol' - homePage = 'https://github.com/nf-core/pgdb' - description = 'Proteogenomics database creation workflow using pypgatk framework. ' - mainScript = 'main.nf' - nextflowVersion = '>=20.04.0' - version = '1.0dev' + name = 'nf-core/pgdb' + author = 'Husen M. Umer & Yasset Perez-Riverol' + homePage = 'https://github.com/nf-core/pgdb' + description = 'Proteogenomics database creation workflow using pypgatk framework. ' + mainScript = 'main.nf' + nextflowVersion = '!>=21.10.3' + version = '1.1dev' } +// Load modules.config for DSL2 module specific options +includeConfig 'conf/modules.config' + // Function to ensure that resource requirements don't go beyond // a maximum limit def check_max(obj, type) { - if (type == 'memory') { - try { - if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1) - return params.max_memory as nextflow.util.MemoryUnit - else - return obj - } catch (all) { - println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj" - return obj - } - } else if (type == 'time') { - try { - if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1) - return params.max_time as nextflow.util.Duration - else - return obj - } catch (all) { - println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj" - return obj + if (type == 'memory') { + try { + if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1) + return params.max_memory as nextflow.util.MemoryUnit + else + return obj + } catch (all) { + println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj" + return obj + } + } else if (type == 'time') { + try { + if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1) + return params.max_time as nextflow.util.Duration + else + return obj + } catch (all) { + println " ### ERROR ### Max time '${params.max_time}' is not valid! 
Using default value: $obj"
+            return obj
+        }
+    } else if (type == 'cpus') {
+        try {
+            return Math.min( obj, params.max_cpus as int )
+        } catch (all) {
+            println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
+            return obj
+        }
+    }
-  } else if (type == 'cpus') {
-    try {
-      return Math.min( obj, params.max_cpus as int )
-    } catch (all) {
-      println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
-      return obj
-    }
-}
 }
diff --git a/nextflow_schema.json b/nextflow_schema.json
index b5e204e5..b022b57b 100644
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -5,19 +5,261 @@
     "description": "Proteogenomics database creation workflow using pypgatk framework. ",
     "type": "object",
     "definitions": {
+        "ensembl_canonical_proteomes": {
+            "title": "ENSEMBL canonical proteomes",
+            "type": "object",
+            "description": "Add ENSEMBL canonical proteomes",
+            "default": "",
+            "properties": {
+                "add_reference": {
+                    "type": "boolean",
+                    "default": true,
+                    "description": "Add the reference proteome to the file"
+                },
+                "ensembl_downloader_config": {
+                    "type": "string",
+                    "description": "Path to configuration file for ENSEMBL download parameters"
+                },
+                "ensembl_config": {
+                    "type": "string",
+                    "description": "Path to configuration file for parameters in generating protein databases from ENSEMBL sequences"
+                },
+                "gencode_url": {
+                    "type": "string",
+                    "default": "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19",
+                    "description": "URL for downloading GENCODE datafiles"
+                },
+                "ensembl_name": {
+                    "type": "string",
+                    "default": "homo_sapiens",
+                    "description": "Taxonomic term for the species to download from ENSEMBL"
+                }
+            }
+        },
+        "non_canonical_proteome_parameters": {
+            "title": "Non canonical proteome parameters",
+            "type": "object",
+            "description": "Non-canonical protein generation parameters",
+            "default": "",
+            "properties": {
+                "ncrna": {
+                    "type": "boolean",
+                    "description": "Generate protein database from non-coding RNA"
+                },
+                "pseudogenes": {
+                    "type": "boolean",
+                    "description": "Generate protein database from pseudogenes"
+                },
+                "altorfs": {
+                    "type": "boolean",
+                    "description": "Generate alternative ORFs from canonical proteins"
+                },
+                "ensembl": {
+                    "type": "boolean",
+                    "description": "Download ENSEMBL variants and generate protein database"
+                }
+            }
+        },
+        "custom_vcf_based_variant_proteomes": {
+            "title": "Custom VCF-based variant proteomes",
+            "type": "object",
+            "description": "Proteins generated using an input VCF",
+            "default": "",
+            "properties": {
+                "vcf": {
+                    "type": "boolean",
+                    "description": "Enable translation of a given VCF file"
+                },
+                "vcf_file": {
+                    "type": "string",
+                    "description": "VCF file path to be translated. Generate variant proteins by modifying sequences of affected transcripts."
+                },
+                "af_field": {
+                    "type": "string",
+                    "description": "Allele frequency identifier string in the VCF INFO column. If no AF info is given, set it to empty."
+                }
+            }
+        },
+        "cbioportal_variant_proteomes": {
+            "title": "cBioPortal variant proteomes",
+            "type": "object",
+            "description": "cBioPortal variant parameters",
+            "default": "",
+            "properties": {
+                "cbioportal": {
+                    "type": "boolean",
+                    "description": "Download cBioPortal studies and generate protein database"
+                },
+                "cbioportal_accepted_values": {
+                    "type": "string",
+                    "default": "all",
+                    "description": "Specify a tissue type to limit the cBioPortal mutations to a particular cancer type"
+                },
+                "cbioportal_filter_column": {
+                    "type": "string",
+                    "default": "CANCER_TYPE",
+                    "description": "Specify a column from the clinical sample file to be used for filtering records"
+                },
+                "cbioportal_study_id": {
+                    "type": "string",
+                    "description": "Download mutations from a specific study in cBioPortal; the default 'all' downloads mutations from all studies"
+                },
+                "cbioportal_config": {
+                    "type": "string",
+                    "description": "cBioPortal configuration file"
+                }
+            }
+        },
+        "cosmic_variant_proteomes": {
+            "title": "COSMIC variant proteomes",
+            "type": "object",
+            "description": "COSMIC variant proteins parameters",
+            "default": "",
+            "properties": {
+                "cosmic_dababase_version": {
+                    "type": "string",
+                    "default": "latest",
+                    "description": "COSMIC database version to download"
+                },
+                "cosmic": {
+                    "type": "boolean",
+                    "description": "Download COSMIC mutation files and generate protein database"
+                },
+                "cosmic_celllines": {
+                    "type": "boolean",
+                    "description": "Download COSMIC cell line files and generate protein database"
+                },
+                "cosmic_user_name": {
+                    "type": "string",
+                    "description": "Username for the COSMIC account"
+                },
+                "cosmic_password": {
+                    "type": "string",
+                    "description": "Password for the COSMIC account"
+                },
+                "cosmic_config": {
+                    "type": "string",
+                    "description": "Path to configuration file for parameters used in generating protein databases from COSMIC mutations"
+                },
+                "cosmic_cancer_type": {
+                    "type": "string",
+                    "default": "all",
+                    "description": "Specify a tissue type to limit the COSMIC mutations to a particular cancer type"
+                },
+                "cosmic_cellline_name": {
+                    "type": "string",
+                    "default": "all",
+                    "description": "Specify a sample name to limit the COSMIC cell line mutations to a particular cell line"
+                },
+                "cosmicgenes": {
+                    "type": "string",
+                    "description": "Path to a local COSMIC genes file"
+                },
+                "cosmicmutations": {
+                    "type": "string",
+                    "description": "Path to a local COSMIC mutations file"
+                },
+                "cosmiccelllines_genes": {
+                    "type": "string",
+                    "description": "Path to a local COSMIC cell line genes file"
+                },
+                "cosmiccelllines_mutations": {
+                    "type": "string",
+                    "description": "Path to a local COSMIC cell line mutations file"
+                }
+            }
+        },
+        "gnomad_variant_proteomes": {
+            "title": "GNOMAD variant proteomes",
+            "type": "object",
+            "description": "gnomAD variant parameters",
+            "default": "",
+            "properties": {
+                "gnomad": {
+                    "type": "boolean",
+                    "description": "Add gnomAD variants to the database"
+                },
+                "gnomad_file_url": {
+                    "type": "string",
+                    "default": "gs://gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz",
+                    "description": "gnomAD VCF file URL"
+                }
+            }
+        },
+        "decoy_generation": {
+            "title": "Decoy generation",
+            "type": "object",
+            "description": "Generate decoy proteins and attach them to the final protein database",
+            "default": "",
+            "properties": {
+                "decoy": {
+                    "type": "boolean",
+                    "description": "Append the decoy proteins to the database"
+                },
+                "decoy_prefix": {
+                    "type": "string",
+                    "default": "Decoy_",
+                    "description": "String to be used as prefix for the generated decoy sequences"
+                },
+                "decoy_method": {
+                    "type": "string",
+                    "default": "decoypyrat",
+                    "description": "Method used to generate the decoy database",
+                    "enum": ["protein-reverse", "protein-shuffle", "decoypyrat"]
+                },
+                "decoy_enzyme": {
+                    "type": "string",
+                    "default": "Trypsin",
+                    "description": "Enzyme used to generate the decoy"
+                },
+                "protein_decoy_config": {
+                    "type": "string",
+                    "description": "Configuration file used to perform the decoy generation"
+                }
+            }
+        },
+        "clean_and_process_database": {
+            "title": "Clean and process database",
+            "type": "object",
+            "description": "Clean and process the resulting database",
+            "default": "",
+            "properties": {
+                "clean_database": {
+                    "type": "boolean",
+                    "description": "Clean the database for stop codons and short protein sequences"
+                },
+                "minimum_aa": {
+                    "type": "integer",
+                    "default": 6,
+                    "description": "Minimum number of amino acids for a protein to be included in the database"
+                },
+                "add_stop_codons": {
+                    "type": "boolean",
+                    "description": "If a stop codon is found, create two proteins from it"
+                }
+            }
+        },
         "input_output_options": {
             "title": "Input/output options",
             "type": "object",
             "fa_icon": "fas fa-terminal",
             "description": "Define where the pipeline should find input data and save output data.",
-            "required": [
-                "input"
-            ],
+            "required": ["outdir"],
             "properties": {
+                "input": {
+                    "type": "string",
+                    "format": "file-path",
+                    "mimetype": "text/csv",
+                    "pattern": "^\\S+\\.csv$",
+                    "schema": "assets/schema_input.json",
+                    "description": "Path to comma-separated file containing information about the samples in the experiment.",
+                    "help_text": "You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. See [usage docs](https://nf-co.re/pgdb/usage#samplesheet-input).",
+                    "fa_icon": "fas fa-file-csv"
+                },
                 "outdir": {
                     "type": "string",
-                    "description": "The output directory where the results will be saved.",
-                    "default": "./results",
+                    "format": "directory-path",
+                    "description": "The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.",
                     "fa_icon": "fas fa-folder-open"
                 },
                 "email": {
@@ -26,87 +268,65 @@
                     "fa_icon": "fas fa-envelope",
                     "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.",
                     "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$"
+                },
+                "final_database_protein": {
+                    "type": "string",
+                    "default": "final_proteinDB.fa",
+                    "description": "Filename for the final protein database"
+                }
             }
         },
-        "reference_genome_options": {
-            "title": "Reference genome options",
-            "type": "object",
-            "fa_icon": "fas fa-dna",
-            "description": "Options for the reference genome indices used to align reads."
- }, - "generic_options": { - "title": "Generic options", + "institutional_config_options": { + "title": "Institutional config options", "type": "object", - "fa_icon": "fas fa-file-import", - "description": "Less common options for the pipeline, typically set in a config file.", - "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", + "fa_icon": "fas fa-university", + "description": "Parameters used to describe centralised config profiles. These should not be edited.", + "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline.", "properties": { - "help": { - "type": "boolean", - "description": "Display help text.", + "hostnames": { + "type": "string", + "description": "Institutional configs hostname.", "hidden": true, - "fa_icon": "fas fa-question-circle" + "fa_icon": "fas fa-users-cog" }, - "publish_dir_mode": { + "custom_config_version": { "type": "string", - "default": "copy", - "hidden": true, - "description": "Method used to save pipeline results to output directory.", - "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", - "fa_icon": "fas fa-copy", - "enum": [ - "symlink", - "rellink", - "link", - "copy", - "copyNoFollow", - "move" - ] - }, - "name": { - "type": "string", - "description": "Workflow name.", - "fa_icon": "fas fa-fingerprint", + "description": "Git commit id for Institutional configs.", + "default": "master", "hidden": true, - "help_text": "A custom name for the pipeline run. Unlike the core nextflow `-name` option with one hyphen this parameter can be reused multiple times, for example if using `-resume`. Passed through to steps such as MultiQC and used for things like report filenames and titles." + "fa_icon": "fas fa-users-cog" }, - "email_on_fail": { + "custom_config_base": { "type": "string", - "description": "Email address for completion summary, only when pipeline fails.", - "fa_icon": "fas fa-exclamation-triangle", - "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$", + "description": "Base directory for Institutional configs.", + "default": "https://raw.githubusercontent.com/nf-core/configs/master", "hidden": true, - "help_text": "This works exactly as with `--email`, except emails are only sent if the workflow is not successful." + "help_text": "If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. 
If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.", + "fa_icon": "fas fa-users-cog" }, - "plaintext_email": { - "type": "boolean", - "description": "Send plain-text email instead of HTML.", - "fa_icon": "fas fa-remove-format", + "config_profile_name": { + "type": "string", + "description": "Institutional config name.", "hidden": true, - "help_text": "Set to receive plain-text e-mails instead of HTML formatted." + "fa_icon": "fas fa-users-cog" }, - "max_multiqc_email_size": { + "config_profile_description": { "type": "string", - "description": "File size limit when attaching MultiQC reports to summary emails.", - "default": "25.MB", - "fa_icon": "fas fa-file-upload", + "description": "Institutional config description.", "hidden": true, - "help_text": "If file generated by pipeline exceeds the threshold, it will not be attached." + "fa_icon": "fas fa-users-cog" }, - "monochrome_logs": { - "type": "boolean", - "description": "Do not use coloured log outputs.", - "fa_icon": "fas fa-palette", + "config_profile_contact": { + "type": "string", + "description": "Institutional config contact information.", "hidden": true, - "help_text": "Set to disable colourful command line output and live life in monochrome." + "fa_icon": "fas fa-users-cog" }, - "tracedir": { + "config_profile_url": { "type": "string", - "description": "Directory to keep pipeline Nextflow logs and reports.", - "default": "${params.outdir}/pipeline_info", - "fa_icon": "fas fa-cogs", - "hidden": true + "description": "Institutional config URL link.", + "hidden": true, + "fa_icon": "fas fa-users-cog" } } }, @@ -119,7 +339,7 @@ "properties": { "max_cpus": { "type": "integer", - "description": "Maximum number of CPUs that can be requested for any single job.", + "description": "Maximum number of CPUs that can be requested for any single job.", "default": 16, "fa_icon": "fas fa-microchip", "hidden": true, @@ -130,6 +350,7 @@ "description": "Maximum amount of memory that can be requested for any single job.", "default": "128.GB", "fa_icon": "fas fa-memory", + "pattern": "^\\d+(\\.\\d+)?\\.?\\s*(K|M|G|T)?B$", "hidden": true, "help_text": "Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. `--max_memory '8.GB'`" }, @@ -138,203 +359,128 @@ "description": "Maximum amount of time that can be requested for any single job.", "default": "240.h", "fa_icon": "far fa-clock", + "pattern": "^(\\d+\\.?\\s*(s|m|h|day)\\s*)+$", "hidden": true, "help_text": "Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. `--max_time '2.h'`" } } }, - "institutional_config_options": { - "title": "Institutional config options", + "generic_options": { + "title": "Generic options", "type": "object", - "fa_icon": "fas fa-university", - "description": "Parameters used to describe centralised config profiles. These should not be edited.", - "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. 
You should not need to change these values when you run a pipeline.", + "fa_icon": "fas fa-file-import", + "description": "Less common options for the pipeline, typically set in a config file.", + "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", "properties": { - "custom_config_version": { - "type": "string", - "description": "Git commit id for Institutional configs.", - "default": "master", + "enable_conda": { + "type": "boolean", + "description": "Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter.", "hidden": true, - "fa_icon": "fas fa-users-cog", - "help_text": "Provide git commit id for custom Institutional configs hosted at `nf-core/configs`. This was implemented for reproducibility purposes. Default: `master`.\n\n```bash\n## Download and use config file with following git commit id\n--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96\n```" + "fa_icon": "fas fa-bacon" }, - "custom_config_base": { - "type": "string", - "description": "Base directory for Institutional configs.", - "default": "https://raw.githubusercontent.com/nf-core/configs/master", - "hidden": true, - "help_text": "If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the `custom_config_base` option. For example:\n\n```bash\n## Download and unzip the config files\ncd /path/to/my/configs\nwget https://github.com/nf-core/configs/archive/master.zip\nunzip master.zip\n\n## Run the pipeline\ncd /path/to/my/data\nnextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/\n```\n\n> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.", - "fa_icon": "fas fa-users-cog" + "help": { + "type": "boolean", + "description": "Display help text.", + "fa_icon": "fas fa-question-circle", + "hidden": true }, - "hostnames": { + "publish_dir_mode": { "type": "string", - "description": "Institutional configs hostname.", - "hidden": true, - "fa_icon": "fas fa-users-cog" + "default": "copy", + "description": "Method used to save pipeline results to output directory.", + "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. 
See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", + "fa_icon": "fas fa-copy", + "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"], + "hidden": true }, - "config_profile_description": { + "email_on_fail": { "type": "string", - "description": "Institutional config description.", - "hidden": true, - "fa_icon": "fas fa-users-cog" + "description": "Email address for completion summary, only when pipeline fails.", + "fa_icon": "fas fa-exclamation-triangle", + "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$", + "help_text": "An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.", + "hidden": true }, - "config_profile_contact": { + "plaintext_email": { + "type": "boolean", + "description": "Send plain-text email instead of HTML.", + "fa_icon": "fas fa-remove-format", + "hidden": true + }, + "max_multiqc_email_size": { "type": "string", - "description": "Institutional config contact information.", - "hidden": true, - "fa_icon": "fas fa-users-cog" + "description": "File size limit when attaching MultiQC reports to summary emails.", + "pattern": "^\\d+(\\.\\d+)?\\.?\\s*(K|M|G|T)?B$", + "default": "25.MB", + "fa_icon": "fas fa-file-upload", + "hidden": true }, - "config_profile_url": { + "monochrome_logs": { + "type": "boolean", + "description": "Do not use coloured log outputs.", + "fa_icon": "fas fa-palette", + "hidden": true + }, + "tracedir": { "type": "string", - "description": "Institutional config URL link.", + "description": "Directory to keep pipeline Nextflow logs and reports.", + "default": "${params.outdir}/pipeline_info", + "fa_icon": "fas fa-cogs", + "hidden": true + }, + "validate_params": { + "type": "boolean", + "description": "Boolean whether to validate parameters against the schema at runtime", + "default": true, + "fa_icon": "fas fa-check-square", + "hidden": true + }, + "show_hidden_params": { + "type": "boolean", + "fa_icon": "far fa-eye-slash", + "description": "Show all params when using `--help`", "hidden": true, - "fa_icon": "fas fa-users-cog" + "help_text": "By default, parameters set as _hidden_ in the schema are not shown on the command line when a user runs with `--help`. Specifying this option will tell the pipeline to show all parameters." 
} } } }, "allOf": [ { - "$ref": "#/definitions/input_output_options" + "$ref": "#/definitions/ensembl_canonical_proteomes" }, { - "$ref": "#/definitions/reference_genome_options" + "$ref": "#/definitions/non_canonical_proteome_parameters" }, { - "$ref": "#/definitions/generic_options" + "$ref": "#/definitions/custom_vcf_based_variant_proteomes" }, { - "$ref": "#/definitions/max_job_request_options" + "$ref": "#/definitions/cbioportal_variant_proteomes" }, { - "$ref": "#/definitions/institutional_config_options" - } - ], - "properties": { - "root_folder": { - "type": "string" - }, - "local_input_type": { - "type": "string" - }, - "ncrna": { - "type": "string", - "default": "true" - }, - "pseudogenes": { - "type": "string", - "default": "true" - }, - "altorfs": { - "type": "string", - "default": "true" - }, - "cbioportal": { - "type": "string", - "default": "true" - }, - "cosmic": { - "type": "string", - "default": "true" - }, - "cosmic_celllines": { - "type": "string", - "default": "true" - }, - "ensembl": { - "type": "string", - "default": "true" - }, - "gnomad": { - "type": "string", - "default": "true" - }, - "decoy": { - "type": "string", - "default": "true" - }, - "push_s3": { - "type": "string" - }, - "add_reference": { - "type": "string", - "default": "true" - }, - "result_file": { - "type": "string", - "default": "fasta_database.fa" - }, - "cosmic_user_name": { - "type": "string" - }, - "cosmic_password": { - "type": "string" - }, - "ensembl_downloader_config": { - "type": "string", - "default": "/Users/yperez/IdeaProjects/github-repo/BDP/pgdb/configs/ensembl_downloader_config.yaml" - }, - "ensembl_config": { - "type": "string", - "default": "/Users/yperez/IdeaProjects/github-repo/BDP/pgdb/configs/ensembl_config.yaml" - }, - "cosmic_config": { - "type": "string", - "default": "/Users/yperez/IdeaProjects/github-repo/BDP/pgdb/configs/cosmic_config.yaml" + "$ref": "#/definitions/cosmic_variant_proteomes" }, - "cbioportal_config": { - "type": "string", - "default": "/Users/yperez/IdeaProjects/github-repo/BDP/pgdb/configs/cbioportal_config.yaml" - }, - "protein_decoy_config": { - "type": "string", - "default": "/Users/yperez/IdeaProjects/github-repo/BDP/pgdb/configs/protein_decoy.yaml" - }, - "taxonomy": { - "type": "integer", - "default": 9606 - }, - "biotypes": { - "type": "string", - "default": "[protein_coding:'protein_coding,polymorphic_pseudogene,non_stop_decay,nonsense_mediated_decay,IG_C_gene,IG_D_gene,IG_J_gene,IG_V_gene,TR_C_gene,TR_D_gene,TR_J_gene,TR_V_gene,TEC', pseudogene:'pseudogene,IG_C_pseudogene,IG_J_pseudogene,IG_V_pseudogene,IG_pseudogene,TR_V_pseudogene,TR_J_pseudogene,processed_pseudogene,rRNA_pseudogene,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_unprocessed_pseudogene,unitary_pseudogene,unprocessed_pseudogene,translated_processed_pseudogene', ncRNA:'lncRNA,Mt_rRNA,Mt_tRNA,miRNA,misc_RNA,rRNA,retained_intron,ribozyme,sRNA,scRNA,scaRNA,snRNA,snoRNA,vaultRNA']" - }, - "cosmic_tissue_type": { - "type": "string", - "default": "all" - }, - "cosmic_cellline_name": { - "type": "string", - "default": "all" - }, - "cbioportal_tissue_type": { - "type": "string", - "default": "all" - }, - "split_by_filter_column": { - "type": "string" + { + "$ref": "#/definitions/gnomad_variant_proteomes" }, - "af_field": { - "type": "string" + { + "$ref": "#/definitions/decoy_generation" }, - "final_database_protein": { - "type": "string", - "default": "final_proteinDB.fa" + { + "$ref": 
"#/definitions/clean_and_process_database" }, - "decoy_prefix": { - "type": "string", - "default": "decoy_" + { + "$ref": "#/definitions/input_output_options" }, - "gencode_url": { - "type": "string", - "default": "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19" + { + "$ref": "#/definitions/institutional_config_options" }, - "gnomad_file_url": { - "type": "string", - "default": "gs://gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz" + { + "$ref": "#/definitions/max_job_request_options" }, - "input": { - "type": "string", - "default": "foobarbaz" + { + "$ref": "#/definitions/generic_options" } - } -} \ No newline at end of file + ] +} diff --git a/subworkflows/local/input_check.nf b/subworkflows/local/input_check.nf new file mode 100644 index 00000000..0aecf87f --- /dev/null +++ b/subworkflows/local/input_check.nf @@ -0,0 +1,44 @@ +// +// Check input samplesheet and get read channels +// + +include { SAMPLESHEET_CHECK } from '../../modules/local/samplesheet_check' + +workflow INPUT_CHECK { + take: + samplesheet // file: /path/to/samplesheet.csv + + main: + SAMPLESHEET_CHECK ( samplesheet ) + .csv + .splitCsv ( header:true, sep:',' ) + .map { create_fastq_channel(it) } + .set { reads } + + emit: + reads // channel: [ val(meta), [ reads ] ] + versions = SAMPLESHEET_CHECK.out.versions // channel: [ versions.yml ] +} + +// Function to get list of [ meta, [ fastq_1, fastq_2 ] ] +def create_fastq_channel(LinkedHashMap row) { + // create meta map + def meta = [:] + meta.id = row.sample + meta.single_end = row.single_end.toBoolean() + + // add path(s) of the fastq file(s) to the meta map + def fastq_meta = [] + if (!file(row.fastq_1).exists()) { + exit 1, "ERROR: Please check input samplesheet -> Read 1 FastQ file does not exist!\n${row.fastq_1}" + } + if (meta.single_end) { + fastq_meta = [ meta, [ file(row.fastq_1) ] ] + } else { + if (!file(row.fastq_2).exists()) { + exit 1, "ERROR: Please check input samplesheet -> Read 2 FastQ file does not exist!\n${row.fastq_2}" + } + fastq_meta = [ meta, [ file(row.fastq_1), file(row.fastq_2) ] ] + } + return fastq_meta +} diff --git a/workflows/pgdb.nf b/workflows/pgdb.nf new file mode 100644 index 00000000..3bfb28d3 --- /dev/null +++ b/workflows/pgdb.nf @@ -0,0 +1,251 @@ +#!/usr/bin/env nextflow + +/* +======================================================================================== + nf-core/pgdb +======================================================================================== + nf-core/pgdb Proteogenomics database generation + #### Homepage / Documentation + https://github.com/nf-core/pgdb +---------------------------------------------------------------------------------------- +*/ + +def summary_params = NfcoreSchema.paramsSummaryMap(workflow, params) + +// Validate input parameters +WorkflowPgdb.initialise(params, log) + +// Check input path parameters to see if they exist +def checkPathParamList = [ ] +for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } } + +// if (params.validate_params) { +// NfcoreSchema.validateParameters(params, log) +// } + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CONFIG FILES +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ +ensembl_downloader_config = file(params.ensembl_downloader_config, checkIfExists: true) +ensembl_config = file(params.ensembl_config) +cosmic_config = file(params.cosmic_config) +if 
(params.cosmicgenes && params.cosmicmutations) { + cosmicgenes = file(params.cosmicgenes) + cosmicmutations = file(params.cosmicmutations) +} +if (params.cosmiccelllines_genes && params.cosmiccelllines_mutations) { + cosmiccelllines_genes = file(params.cosmiccelllines_genes) + cosmiccelllines_mutations = file(params.cosmiccelllines_mutations) +} +cbioportal_config = file(params.cbioportal_config) +protein_decoy_config = file(params.protein_decoy_config) + +af_field = params.af_field +if (af_field == null) { + af_field = "" +} +if (params.ensembl_name == "homo_sapiens") { + af_field = "MAF" +} +// ch_multiqc_config = file("$projectDir/assets/multiqc_config.yml", checkIfExists: true) +// ch_multiqc_custom_config = params.multiqc_config ? Channel.fromPath(params.multiqc_config) : Channel.empty() + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT LOCAL MODULES/SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// SUBWORKFLOW: Consisting of a mix of local and nf-core/modules +// +//include { INPUT_CHECK } from '../subworkflows/local/input_check' + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT NF-CORE MODULES/SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// Don't overwrite global params.modules, create a copy instead and use that within the main script. +//def modules = params.modules.clone() + +//include { GET_SOFTWARE_VERSIONS } from '../modules/local/get_software_versions' + +include { ENSEMBL_FASTA_DOWNLOAD } from '../modules/local/proteomes/ensembl_fasta_download' +include { ADD_REFERENCE_PROTEOME } from '../modules/local/proteomes/add_reference_proteome' +include { MERGE_CDNAS } from '../modules/local/proteomes/merge_cdnas' +include { ADD_NCRNA } from '../modules/local/proteomes/add_ncrna' +include { ADD_PSEUDOGENES } from '../modules/local/proteomes/add_pseudogenes' +include { ADD_ALTORFS } from '../modules/local/proteomes/add_altorfs' + +include { COSMIC_DOWNLOAD } from '../modules/local/cosmicmutations/cosmic_download' +include { COSMIC_CELLLINES_DOWNLOAD } from '../modules/local/cosmicmutations/cosmic_celllines_download' +include { COSMIC_PROTEINDB } from '../modules/local/cosmicmutations/cosmic_proteindb' +include { COSMIC_CELLLINES_PROTEINDB } from '../modules/local/cosmicmutations/cosmic_celllines_proteindb' +include { COSMIC_PROTEINDB_LOCAL } from '../modules/local/cosmicmutations/cosmic_proteindb_local' +include { COSMIC_CELLLINES_PROTEINDB_LOCAL } from '../modules/local/cosmicmutations/cosmic_celllines_proteindb_local' + +include { ENSEMBL_VCF_DOWNLOAD } from '../modules/local/vcf/ensembl_vcf_download' +include { CHECK_ENSEMBL_VCF } from '../modules/local/vcf/check_ensembl_vcf' +include { ENSEMBL_VCF_PROTEINDB } from '../modules/local/vcf/ensembl_vcf_proteindb' + +include { GTF_TO_FASTA } from '../modules/local/customvcf/gtf_to_fasta' +include { VCF_PROTEINDB } from '../modules/local/customvcf/vcf_proteinDB' + +include { GENCODE_DOWNLOAD } from '../modules/local/gnomadvariatns/gencode_download' +include { GNOMAD_DOWNLOAD } from '../modules/local/gnomadvariatns/gnomad_download' +include { EXTRACT_GNOMAD_VCF } from '../modules/local/gnomadvariatns/extract_gnomad_vcf' +include { GNOMAD_PROTEINDB } from '../modules/local/gnomadvariatns/gnomad_proteindb' + +include { CDS_GRCH37_DOWNLOAD } from '../modules/local/cbioportalmutations/cds_GRCh37_download' +include { 
DOWNLOAD_ALL_CBIOPORTAL } from '../modules/local/cbioportalmutations/download_all_cbioportal' +include { CBIOPORTAL_PROTEINDB } from '../modules/local/cbioportalmutations/cbioportal_proteindb' + +include { MERGE_PROTEINDBS } from '../modules/local/merge_proteindbs' +include { CLEAN_PROTEIN_DATABASE } from '../modules/local/clean_protein_database' +include { DECOY } from '../modules/local/decoy' +//include { OUTPUT_DOCUMENTATION } from '../modules/local/output_documentation' + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + RUN MAIN WORKFLOW +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +workflow PGDB { + + // Parse software version numbers + //GET_SOFTWARE_VERSIONS() + + // Download data from ensembl for the particular species + ENSEMBL_FASTA_DOWNLOAD(ensembl_downloader_config,params.ensembl_name) + + ADD_REFERENCE_PROTEOME(ENSEMBL_FASTA_DOWNLOAD.out.ensembl_protein_database_sub) + + //Concatenate cDNA and ncRNA databases + MERGE_CDNAS(ENSEMBL_FASTA_DOWNLOAD.out.ensembl_cdna_database_sub.collect(),ENSEMBL_FASTA_DOWNLOAD.out.ensembl_ncrna_database_sub.collect()) + + //Creates the ncRNA protein database + ADD_NCRNA(MERGE_CDNAS.out.total_cdnas,ensembl_config) + merged_databases = ADD_REFERENCE_PROTEOME.out.ensembl_protein_database.mix(ADD_NCRNA.out.optional_ncrna) + + //Creates the pseudogenes protein database + ADD_PSEUDOGENES(MERGE_CDNAS.out.total_cdnas,ensembl_config) + merged_databases = merged_databases.mix(ADD_PSEUDOGENES.out.optional_pseudogenes) + + //Creates the altORFs protein database + ADD_ALTORFS(ENSEMBL_FASTA_DOWNLOAD.out.ensembl_cdna_database_sub,ensembl_config) + merged_databases = merged_databases.mix(ADD_ALTORFS.out.optional_altorfs) + + /* Mutations to proteinDB */ + + //Download COSMIC Mutations + COSMIC_DOWNLOAD() + + //Download COSMIC Cell Lines Mutations + COSMIC_CELLLINES_DOWNLOAD() + + //Generate proteindb from cosmic mutations + COSMIC_PROTEINDB(COSMIC_DOWNLOAD.out.cosmic_genes,COSMIC_DOWNLOAD.out.cosmic_mutations,cosmic_config,params.cosmic_cancer_type) + merged_databases = merged_databases.mix(COSMIC_PROTEINDB.out.cosmic_proteindbs) + + //Generate proteindb from local cosmic mutations + if (params.cosmicgenes && params.cosmicmutations) { + COSMIC_PROTEINDB_LOCAL(cosmic_config,params.cosmic_cancer_type,cosmicmutations,cosmicgenes) + merged_databases = merged_databases.mix(COSMIC_PROTEINDB_LOCAL.out.cosmic_proteindbs_uselocal) + } + + //Generate proteindb from cosmic cell lines mutations + COSMIC_CELLLINES_PROTEINDB(COSMIC_CELLLINES_DOWNLOAD.out.cosmic_celllines_genes,COSMIC_CELLLINES_DOWNLOAD.out.cosmic_celllines_mutations,cosmic_config,params.cosmic_cellline_name) + merged_databases = merged_databases.mix(COSMIC_CELLLINES_PROTEINDB.out.cosmic_celllines_proteindbs) + + //Generate proteindb from local cosmic cell lines mutations + if (params.cosmiccelllines_genes && params.cosmiccelllines_mutations) { + COSMIC_CELLLINES_PROTEINDB_LOCAL(cosmic_config,params.cosmic_cellline_name,cosmiccelllines_mutations,cosmiccelllines_genes) + merged_databases = merged_databases.mix(COSMIC_CELLLINES_PROTEINDB_LOCAL.out.cosmic_celllines_proteindbs_uselocal) + } + + //Download VCF files from ensembl for the particular species + ENSEMBL_VCF_DOWNLOAD(ensembl_downloader_config,params.ensembl_name) + CHECK_ENSEMBL_VCF(ENSEMBL_VCF_DOWNLOAD.out.ensembl_vcf_files) + + //Generate protein database(s) from ENSEMBL vcf file(s)
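+ // Note on the step below: ENSEMBL_VCF_PROTEINDB emits one protein FASTA per VCF it processes, and collectFile() concatenates every item on that channel into a single file published via storeDir. + // As a minimal illustration of the same operator (hypothetical records, not pipeline data): Channel.of('>p1\nPEPTIDE', '>p2\nPROTEIN').collectFile(name: 'all.fa', newLine: true) yields one 'all.fa' containing both entries.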
+ ENSEMBL_VCF_PROTEINDB(CHECK_ENSEMBL_VCF.out.ensembl_vcf_files_checked,MERGE_CDNAS.out.total_cdnas,ENSEMBL_FASTA_DOWNLOAD.out.gtf,ensembl_config,af_field) + + //Concatenate all ENSEMBL proteindbs into one + ENSEMBL_VCF_PROTEINDB.out.proteinDB_vcf.collectFile(name: 'ensembl_proteindb.fa', newLine: false, storeDir: "${projectDir}/result") + .set {proteinDB_vcf_final} + merged_databases = merged_databases.mix(proteinDB_vcf_final) + + /* Custom VCF */ + + //Generate protein database for a given VCF + GTF_TO_FASTA(ENSEMBL_FASTA_DOWNLOAD.out.gtf,ENSEMBL_FASTA_DOWNLOAD.out.genome_fasta) + vcf_file = params.vcf_file ? Channel.fromPath(params.vcf_file, checkIfExists: true) : Channel.empty() + + VCF_PROTEINDB(vcf_file,GTF_TO_FASTA.out.gtf_transcripts_fasta,ENSEMBL_FASTA_DOWNLOAD.out.gtf,ensembl_config,af_field) + merged_databases = merged_databases.mix(VCF_PROTEINDB.out.proteinDB_custom_vcf) + + /* gnomAD variants */ + + //Download gencode files (fasta and gtf) + GENCODE_DOWNLOAD(params.gencode_url) + + //Download gnomAD variants (VCF) - requires gsutil + GNOMAD_DOWNLOAD(params.gnomad_file_url) + + //Extract gnomAD VCF + EXTRACT_GNOMAD_VCF(GNOMAD_DOWNLOAD.out.gnomad_vcf_bgz.flatten().map{ file(it) }) + + //Generate gnomAD proteinDB + GNOMAD_PROTEINDB(EXTRACT_GNOMAD_VCF.out.gnomad_vcf_files,GENCODE_DOWNLOAD.out.gencode_fasta,GENCODE_DOWNLOAD.out.gencode_gtf,params.ensembl_config) + + //Concatenate all gnomAD proteindbs into one + GNOMAD_PROTEINDB.out.gnomad_vcf_proteindb.collectFile(name: 'gnomad_proteindb.fa', newLine: false, storeDir: "${projectDir}/result") + .set {gnomad_vcf_proteindb_final} + merged_databases = merged_databases.mix(gnomad_vcf_proteindb_final) + + /* cBioPortal mutations */ + + //Download GRCh37 CDS file from ENSEMBL release 75 + CDS_GRCH37_DOWNLOAD() + + //Download all cBioPortal studies using git-lfs + DOWNLOAD_ALL_CBIOPORTAL(cbioportal_config,params.cbioportal_study_id) + + //Generate proteinDB from cBioPortal mutations + CBIOPORTAL_PROTEINDB(CDS_GRCH37_DOWNLOAD.out.ch_GRCh37_cds,DOWNLOAD_ALL_CBIOPORTAL.out.cbio_mutations,DOWNLOAD_ALL_CBIOPORTAL.out.cbio_samples,cbioportal_config,params.cbioportal_filter_column,params.cbioportal_accepted_values) + merged_databases = merged_databases.mix(CBIOPORTAL_PROTEINDB.out.cBioportal_proteindb) + + //Concatenate all generated databases from the merged_databases channel into the final_database_protein file + MERGE_PROTEINDBS(merged_databases.collect()) + + //Clean the database: remove stop codons and unwanted amino acids (e.g. '*'), and drop proteins with fewer than params.minimum_aa amino acids + CLEAN_PROTEIN_DATABASE(MERGE_PROTEINDBS.out.to_clean_ch,ensembl_config,params.minimum_aa) + + to_protein_decoy_ch = params.clean_database ? CLEAN_PROTEIN_DATABASE.out.clean_database_sh : MERGE_PROTEINDBS.out.to_clean_ch + + //Create the decoy database using DecoyPYrat + //Decoy sequences are tagged by prepending the configured prefix (params.decoy_prefix) to the protein accession + DECOY(to_protein_decoy_ch,protein_decoy_config,params.decoy_method,params.decoy_enzyme,params.decoy_prefix) + + //Output Description HTML + //OUTPUT_DOCUMENTATION(ch_output_docs,ch_output_docs_images) + +} + +workflow.onComplete { + if (params.email || params.email_on_fail) { + NfcoreTemplate.email(workflow, params, summary_params, projectDir, log, multiqc_report) + } + NfcoreTemplate.summary(workflow, params, log) +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + THE END +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/
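
Note on the channel pattern used throughout workflows/pgdb.nf above: every optional proteome source appends its FASTA output to the merged_databases channel with mix(), and MERGE_PROTEINDBS then receives all of them in one invocation via collect(). The standalone sketch below reproduces that pattern in miniature so it can be tried outside the pipeline; the process name DEMO_MERGE and the inputs a.fa and b.fa are hypothetical and are assumed to exist in the launch directory.

nextflow.enable.dsl = 2

// Concatenate whatever FASTA files arrive on the input channel into a single database
process DEMO_MERGE {
    input:
    path fastas

    output:
    path 'merged.fa'

    script:
    """
    cat $fastas > merged.fa
    """
}

workflow {
    // Each source contributes its database on its own channel (two local files here)
    reference_db = Channel.fromPath('a.fa')
    variant_db   = Channel.fromPath('b.fa')

    // mix() interleaves the source channels, exactly as merged_databases is built up above;
    // collect() then hands every emitted file to a single DEMO_MERGE task
    merged_databases = reference_db.mix(variant_db)
    DEMO_MERGE(merged_databases.collect())
}

Because collect() only emits once all upstream channels have completed, the merge step runs exactly once over the full set of databases, which is the behaviour MERGE_PROTEINDBS relies on in the workflow above.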