diff --git a/docs/output.md b/docs/output.md index ca828e9fc..1c0a0018f 100644 --- a/docs/output.md +++ b/docs/output.md @@ -18,6 +18,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [Introduction](#introduction) - [Pipeline overview](#pipeline-overview) - [Preprocessing](#preprocessing) + - [fq lint](#fq-lint) - [cat](#cat) - [FastQC](#fastqc) - [UMI-tools extract](#umi-tools-extract) @@ -61,6 +62,18 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d ## Preprocessing +### fq lint + +
+Output files + +- `fq_lint/` + - `*.fq_lint.txt`: If `--save_linting_log` is specified, fq lint logs for each sample will be placed in this directory. + +
+ +[fq lint](https://github.com/stjude-rust-labs/fq#lint) is a tool to validate both single-end and paired-end FastQ files. + ### cat
diff --git a/docs/usage.md b/docs/usage.md index 851e20c7d..1ca6d2e71 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -100,6 +100,10 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p > **NB:** The `group` and `replicate` columns were replaced with a single `sample` column as of v3.1 of the pipeline. The `sample` column is essentially a concatenation of the `group` and `replicate` columns, however it now also offers more flexibility in instances where replicate information is not required e.g. when sequencing clinical samples. If all values of `sample` have the same number of underscores, fields defined by these underscore-separated names may be used in the PCA plots produced by the pipeline, to regain the ability to represent different groupings. +## FASTQ validation + +[fq lint](https://github.com/stjude-rust-labs/fq#lint) is a tool to validate FastQ files, and can be used to check input FastQ files before continuing with the rest of the pipeline. fq lint can be run on either single-end or paired-end FastQ files. By default, FastQ files are validated using all validators provided by fq lint, and the validator will panic on the first error it encounters. If an error is encountered, the pipeline will exit. Linting can be skipped by setting the `--skip_linting` parameter to `true`. + ## FASTQ sampling If you would like to reduce the number of reads used in the analysis, for example to test pipeline operation with limited resource usage, you can make use of the FASTP option for trimming (see below). FASTP has an option to take the first `n` reads of input FASTQ file(s), so this can be used to reduce the reads passed to subsequent steps. 
For example, to pass only the first 10,000 reads for trimming you would set input paramters like: @@ -153,9 +157,9 @@ The `--umitools_grouping_method` parameter affects [how similar, but non-identic | UMI type | Source | Pipeline parameters | | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | In read name | [Illumina BCL convert >3.7.5](https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl_convert/bcl-convert-v3-7-5-software-guide-1000000163594-00.pdf) | `--with_umi --skip_umi_extract --umitools_umi_separator ":"` | -| In sequence | [Lexogen QuantSeq® 3’ mRNA-Seq V2 FWD](https://www.lexogen.com/quantseq-3mrna-sequencing) + [UMI Second Strand Synthesis Module](https://faqs.lexogen.com/faq/how-can-i-add-umis-to-my-quantseq-libraries) | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P.{6})(?P.{4}).*"` | -| In sequence | [Lexogen CORALL® Total RNA-Seq V1](https://www.lexogen.com/corall-total-rna-seq/)
> _mind [Appendix H](https://www.lexogen.com/wp-content/uploads/2020/04/095UG190V0130_CORALL-Total-RNA-Seq_2020-03-31.pdf) regarding optional trimming_ | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P.{12}).*"`
Optional: `--clip_r2 9 --three_prime_clip_r2 12` | -| In sequence | [Takara Bio SMARTer® Stranded Total RNA-Seq Kit v3](https://www.takarabio.com/documents/User%20Manual/SMARTer%20Stranded%20Total%20RNA/SMARTer%20Stranded%20Total%20RNA-Seq%20Kit%20v3%20-%20Pico%20Input%20Mammalian%20User%20Manual-a_114949.pdf) | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern2 "^(?P.{8})(?P.{6}).*"` | +| In sequence | [Lexogen QuantSeq® 3’ mRNA-Seq V2 FWD](https://www.lexogen.com/quantseq-3mrna-sequencing) + [UMI Second Strand Synthesis Module](https://faqs.lexogen.com/faq/how-can-i-add-umis-to-my-quantseq-libraries) | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P.{6})(?P.{4}).*"` | +| In sequence | [Lexogen CORALL® Total RNA-Seq V1](https://www.lexogen.com/corall-total-rna-seq/)
> _mind [Appendix H](https://www.lexogen.com/wp-content/uploads/2020/04/095UG190V0130_CORALL-Total-RNA-Seq_2020-03-31.pdf) regarding optional trimming_ | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P.{12}).*"`
Optional: `--clip_r2 9 --three_prime_clip_r2 12` | +| In sequence | [Takara Bio SMARTer® Stranded Total RNA-Seq Kit v3](https://www.takarabio.com/documents/User%20Manual/SMARTer%20Stranded%20Total%20RNA/SMARTer%20Stranded%20Total%20RNA-Seq%20Kit%20v3%20-%20Pico%20Input%20Mammalian%20User%20Manual-a_114949.pdf) | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern2 "^(?P.{8})(?P.{6}).*"` | | In sequence | [Watchmaker mRNA Library Prep Kit](https://watchmakergenomics.com/wp-content/uploads/2023/11/M223_mRNA-Library-Prep-Kit-_UG_WMUG214_v1-1-0823.pdf) with [Twist UMI Adapter System](https://www.twistbioscience.com/sites/default/files/resources/2023-03/DOC-001337_TechNote-ProcessingSequencingDataUtilizingUMI-REV1-singles.pdf) | `--with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P.{5})(?P.{2}).*" --umitools_bc_pattern2 "^(?P.{5})(?P.{2}).*"` | > _No warranty for the accuracy or completeness of the parameters is implied_ diff --git a/modules.json b/modules.json index 9667d1b6b..e9ce3bd3d 100644 --- a/modules.json +++ b/modules.json @@ -57,6 +57,11 @@ "git_sha": "666652151335353eef2fcd58880bcef5bc2928e1", "installed_by": ["fastq_fastqc_umitools_fastp", "fastq_fastqc_umitools_trimgalore"] }, + "fq/lint": { + "branch": "master", + "git_sha": "a1abf90966a2a4016d3c3e41e228bfcbd4811ccc", + "installed_by": ["modules"] + }, "fq/subsample": { "branch": "master", "git_sha": "a1abf90966a2a4016d3c3e41e228bfcbd4811ccc", diff --git a/modules/nf-core/fq/lint/environment.yml b/modules/nf-core/fq/lint/environment.yml new file mode 100644 index 000000000..74b146083 --- /dev/null +++ b/modules/nf-core/fq/lint/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::fq=0.12.0 diff --git a/modules/nf-core/fq/lint/main.nf b/modules/nf-core/fq/lint/main.nf new file mode 100644 index 000000000..6d301d4a0 --- /dev/null +++ b/modules/nf-core/fq/lint/main.nf @@ -0,0 +1,39 @@ +process FQ_LINT { + tag "$meta.id" + label 
'process_low' + errorStrategy 'terminate' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/fq:0.12.0--h9ee0642_0': + 'biocontainers/fq:0.12.0--h9ee0642_0' }" + + input: + tuple val(meta), path(fastq) + + output: + tuple val(meta), path("*.fq_lint.txt"), emit: lint + path "versions.yml" , emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + """ + fq lint \\ + $args \\ + $fastq > ${prefix}.fq_lint.txt + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + fq: \$(echo \$(fq lint --version | sed 's/fq-lint //g')) + END_VERSIONS + + if ! tail -n 1 ${prefix}.fq_lint.txt | grep -q 'fq-lint end'; then + echo "ERROR: Linting failure detected for ${meta.id}. See ${prefix}.fq_lint.txt for details." + exit 1 + fi + """ +} diff --git a/modules/nf-core/fq/lint/meta.yml b/modules/nf-core/fq/lint/meta.yml new file mode 100644 index 000000000..7240fb579 --- /dev/null +++ b/modules/nf-core/fq/lint/meta.yml @@ -0,0 +1,43 @@ +name: "fq_lint" +description: fq lint is a FASTQ file pair validator. +keywords: + - lint + - fastq + - validate +tools: + - "fq": + description: "fq is a library to generate and validate FASTQ file pairs." + homepage: "https://github.com/stjude-rust-labs/fq" + documentation: "https://github.com/stjude-rust-labs/fq" + tool_dev_url: "https://github.com/stjude-rust-labs/fq" + licence: ["MIT"] + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test', single_end:false ] + - fastq: + type: file + description: FASTQ file list + pattern: "*.fastq{,.gz}" +output: + - lint: + - meta: + type: file + description: Lint output + pattern: "*.fq_lint.txt" + - "*.fq_lint.txt": + type: file + description: Lint output + pattern: "*.fq_lint.txt" + - versions: + - versions.yml: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@adamrtalbot" +maintainers: + - "@adamrtalbot" diff --git a/modules/nf-core/fq/lint/nextflow.config b/modules/nf-core/fq/lint/nextflow.config new file mode 100644 index 000000000..b7eeca5d2 --- /dev/null +++ b/modules/nf-core/fq/lint/nextflow.config @@ -0,0 +1,13 @@ +if (!params.skip_linting) { + process { + withName: 'FQ_LINT' { + ext.args = "--lint-mode panic" + publishDir = [ + path: { params.save_linting_log ? "${params.outdir}/fastq_lint" : params.outdir }, + path: { params.save_linting_log ? "${params.outdir}/fq_lint" : params.outdir }, + mode: params.publish_dir_mode, + saveAs: { filename -> (filename.endsWith('.fq_lint.txt') && params.save_linting_log) ? 
filename : null } + ] + } + } +} \ No newline at end of file diff --git a/modules/nf-core/fq/lint/tests/main.nf.test b/modules/nf-core/fq/lint/tests/main.nf.test new file mode 100644 index 000000000..ec2eaf8bc --- /dev/null +++ b/modules/nf-core/fq/lint/tests/main.nf.test @@ -0,0 +1,63 @@ +nextflow_process { + + name "Test Process FQ_LINT" + script "../main.nf" + process "FQ_LINT" + + tag "modules" + tag "modules_nfcore" + tag "fq" + tag "fq/lint" + + test("test_fq_lint_success") { + when { + params { + outdir = "$outputDir" + } + process { + """ + input[0] = [ [ id:'test', single_end:false ], // meta map + [ file(params.modules_testdata_base_path + 'genomics/sarscov2/illumina/fastq/test_1.fastq.gz', checkIfExists: true), + file(params.modules_testdata_base_path + 'genomics/sarscov2/illumina/fastq/test_2.fastq.gz', checkIfExists: true) ] + ] + """ + } + } + + then { + assertAll ( + { assert process.success }, + { assert process.out.lint.get(0).get(1) ==~ ".*/test.fq_lint.txt" }, + { assert path(process.out.lint.get(0).get(1)).getText().contains("fq-lint start") }, + { assert path(process.out.lint.get(0).get(1)).getText().contains("read 100 records") }, + { assert path(process.out.lint.get(0).get(1)).getText().contains("fq-lint end") }, + ) + } + + } + + test("test_fq_lint_fail") { + when { + params { + outdir = "$outputDir" + } + process { + """ + input[0] = [ [ id:'test', single_end:false ], // meta map + [ file(params.modules_testdata_base_path + 'genomics/sarscov2/illumina/fastq/test_1.fastq.gz', checkIfExists: true), + file(params.modules_testdata_base_path + 'genomics/prokaryotes/candidatus_portiera_aleyrodidarum/illumina/fastq/test_2.fastq.gz', checkIfExists: true) ] + ] + """ + } + } + + then { + assertAll ( + { assert !process.success }, + { assert snapshot(process.out).match() }, + ) + } + + } + +} diff --git a/modules/nf-core/fq/lint/tests/main.nf.test.snap b/modules/nf-core/fq/lint/tests/main.nf.test.snap new file mode 100644 index 000000000..fec8e5243 
--- /dev/null +++ b/modules/nf-core/fq/lint/tests/main.nf.test.snap @@ -0,0 +1,25 @@ +{ + "test_fq_lint_fail": { + "content": [ + { + "0": [ + + ], + "1": [ + + ], + "lint": [ + + ], + "versions": [ + + ] + } + ], + "meta": { + "nf-test": "0.9.0", + "nextflow": "24.04.4" + }, + "timestamp": "2024-10-19T16:37:02.133847389" + } +} \ No newline at end of file diff --git a/modules/nf-core/fq/lint/tests/tags.yml b/modules/nf-core/fq/lint/tests/tags.yml new file mode 100644 index 000000000..9c9c323f8 --- /dev/null +++ b/modules/nf-core/fq/lint/tests/tags.yml @@ -0,0 +1,2 @@ +fq/lint: + - modules/nf-core/fq/lint/** diff --git a/nextflow.config b/nextflow.config index 468792d0f..3b2dba074 100644 --- a/nextflow.config +++ b/nextflow.config @@ -82,6 +82,8 @@ params { unstranded_threshold = 0.1 // QC + skip_linting = false + save_linting_log = true skip_qc = false skip_bigwig = false skip_stringtie = false diff --git a/nextflow_schema.json b/nextflow_schema.json index 802209b42..4e30c81e7 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -10,7 +10,10 @@ "type": "object", "fa_icon": "fas fa-terminal", "description": "Define where the pipeline should find input data and save output data.", - "required": ["input", "outdir"], + "required": [ + "input", + "outdir" + ], "properties": { "input": { "type": "string", @@ -224,7 +227,10 @@ "default": "trimgalore", "description": "Specifies the trimming tool to use - available options are 'trimgalore' and 'fastp'.", "fa_icon": "fas fa-cut", - "enum": ["trimgalore", "fastp"] + "enum": [ + "trimgalore", + "fastp" + ] }, "extra_trimgalore_args": { "type": "string", @@ -338,7 +344,13 @@ "default": "directional", "fa_icon": "far fa-object-ungroup", "description": "Method to use to determine read groups by subsuming those with similar UMIs. 
All methods start by identifying the reads with the same mapping position, but treat similar yet nonidentical UMIs differently.", - "enum": ["unique", "percentile", "cluster", "adjacency", "directional"] + "enum": [ + "unique", + "percentile", + "cluster", + "adjacency", + "directional" + ] }, "umitools_dedup_stats": { "type": "boolean", @@ -360,13 +372,20 @@ "default": "star_salmon", "description": "Specifies the alignment algorithm to use - available options are 'star_salmon', 'star_rsem' and 'hisat2'.", "fa_icon": "fas fa-map-signs", - "enum": ["star_salmon", "star_rsem", "hisat2"] + "enum": [ + "star_salmon", + "star_rsem", + "hisat2" + ] }, "pseudo_aligner": { "type": "string", "description": "Specifies the pseudo aligner to use - available options are 'salmon'. Runs in addition to '--aligner'.", "fa_icon": "fas fa-hamburger", - "enum": ["salmon", "kallisto"] + "enum": [ + "salmon", + "kallisto" + ] }, "pseudo_aligner_kmer_size": { "type": "integer", @@ -475,6 +494,12 @@ "description": "Additional output files produces as intermediates that can be saved", "default": "", "properties": { + "save_linting_log": { + "type": "boolean", + "fa_icon": "fas fa-save", + "description": "Save logs from FastQ file validation step.", + "default": true + }, "save_merged_fastq": { "type": "boolean", "fa_icon": "fas fa-save", @@ -556,7 +581,10 @@ "type": "string", "description": "Tool to use for detecting contaminants in unaligned reads - available options are 'kraken2' and 'kraken2_bracken'", "fa_icon": "fas fa-virus-slash", - "enum": ["kraken2", "kraken2_bracken"] + "enum": [ + "kraken2", + "kraken2_bracken" + ] }, "kraken_db": { "type": "string", @@ -570,7 +598,15 @@ "fa_icon": "fas fa-tree", "description": "Taxonomic level for Bracken abundance estimations.", "help_text": "First letter of Domain / Phylum / Class / Order / Family / Genus / Species", - "enum": ["D", "P", "C", "O", "F", "G", "S"] + "enum": [ + "D", + "P", + "C", + "O", + "F", + "G", + "S" + ] } } }, @@ -580,6 
+616,11 @@ "fa_icon": "fas fa-fast-forward", "description": "Options to skip various steps within the workflow.", "properties": { + "skip_linting": { + "type": "boolean", + "fa_icon": "fas fa-forward", + "description": "Skip FastQ file validation." + }, "skip_gtf_filter": { "type": "boolean", "fa_icon": "fas fa-forward", @@ -748,7 +789,14 @@ "description": "Method used to save pipeline results to output directory.", "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", "fa_icon": "fas fa-copy", - "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"], + "enum": [ + "symlink", + "rellink", + "link", + "copy", + "copyNoFollow", + "move" + ], "hidden": true }, "email_on_fail": { @@ -863,4 +911,4 @@ "$ref": "#/$defs/generic_options" } ] -} +} \ No newline at end of file diff --git a/subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness/main.nf b/subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness/main.nf index c655af415..252b77d2b 100644 --- a/subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness/main.nf +++ b/subworkflows/nf-core/fastq_qc_trim_filter_setstrandedness/main.nf @@ -4,6 +4,7 @@ include { BBMAP_BBSPLIT } from '../../../modules/nf-core/bbmap include { CAT_FASTQ } from '../../../modules/nf-core/cat/fastq/main' include { SORTMERNA } from '../../../modules/nf-core/sortmerna/main' include { SORTMERNA as SORTMERNA_INDEX } from '../../../modules/nf-core/sortmerna/main' +include { FQ_LINT } from '../../../modules/nf-core/fq/lint/main' include { FASTQ_SUBSAMPLE_FQ_SALMON } from '../fastq_subsample_fq_salmon' include { FASTQ_FASTQC_UMITOOLS_TRIMGALORE } from '../fastq_fastqc_umitools_trimgalore' @@ -106,6 +107,7 @@ workflow FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS { umi_discard_read // integer: 0, 1 or 
2 stranded_threshold // float: The fraction of stranded reads that must be assigned to a strandedness for confident assignment. Must be at least 0.5 unstranded_threshold // float: The difference in fraction of stranded reads assigned to 'forward' and 'reverse' below which a sample is classified as 'unstranded' + skip_linting // boolean: true/false main: @@ -114,6 +116,23 @@ workflow FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS { ch_trim_read_count = Channel.empty() ch_multiqc_files = Channel.empty() + ch_reads + .map { + meta, fastqs -> + return [meta, fastqs.flatten()] + } + .set { ch_fastq_lint } + + // + // MODULE: Lint FastQ files + // + if (!skip_linting) { + FQ_LINT ( + ch_fastq_lint + ) + ch_versions = ch_versions.mix(FQ_LINT.out.versions.first()) + } + ch_reads .branch { meta, fastqs -> diff --git a/workflows/rnaseq/main.nf b/workflows/rnaseq/main.nf index 84bedaeb6..d65912df0 100755 --- a/workflows/rnaseq/main.nf +++ b/workflows/rnaseq/main.nf @@ -156,7 +156,8 @@ workflow RNASEQ { params.with_umi, params.umi_discard_read, params.stranded_threshold, - params.unstranded_threshold + params.unstranded_threshold, + params.skip_linting ) ch_multiqc_files = ch_multiqc_files.mix(FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS.out.multiqc_files) diff --git a/workflows/rnaseq/nextflow.config b/workflows/rnaseq/nextflow.config index 9cbf0cd30..85b2ba0fd 100644 --- a/workflows/rnaseq/nextflow.config +++ b/workflows/rnaseq/nextflow.config @@ -1,4 +1,5 @@ includeConfig "../../modules/local/multiqc_custom_biotype/nextflow.config" +includeConfig "../../modules/nf-core/fq/lint/nextflow.config" includeConfig "../../modules/nf-core/bbmap/bbsplit/nextflow.config" includeConfig "../../modules/nf-core/cat/fastq/nextflow.config" includeConfig "../../modules/nf-core/dupradar/nextflow.config"