#20 clean multiqc output to avoid duplicated name in the general statistics table, add section in readme to explain how to look at the results
Juke34 committed Jan 24, 2025
1 parent ea5cc66 commit 31dec6d
Showing 10 changed files with 136 additions and 31 deletions.
83 changes: 83 additions & 0 deletions README.md
@@ -21,6 +21,7 @@ AliNe is a pipeline written in Nextflow that aims to efficiently align reads aga
* [Singularity](#singularity)
* [Usage and test](#usage)
* [Parameters](#parameters)
* [Comparing output](#comparing-output)
* [Contributing](#contributing)

## Foreword
@@ -350,6 +351,88 @@ On success you should get a message looking like this:
--sublong_options additional options for sublong
```
## Comparing output
In order to compare the output of several aligners you should activate the `--fastqc` parameter. AliNe will then run FastQC on each output file, and all FastQC reports will be gathered into a single HTML file using MultiQC. The resulting file, called `multiqc_report.html`, can be found in `<outdir>/MultiQC` (`<outdir>` is called `alignment_results` by default and can be set with the `--outdir` AliNe parameter).
FastQC collects the following information:
* Sequence Counts
* Sequence Quality
* Per Sequence Quality Scores
* Per Base Sequence Content
* Per Sequence GC Content
* Per Base N Content
* Sequence Length Distribution
* Sequence Duplication Levels
* Overrepresented sequences
* Adapter Content
### General Statistics
Sequence Duplication Levels, Per Sequence GC Content and Sequence Counts are reported at the top of the `multiqc_report.html` file in a table called `General Statistics`, as % Dups, %GC and M Seqs respectively (see below).
<img src="img/multiqc.png" />
To make this `General Statistics` table easier to read, you can export it as TSV using the `Export as CSV...` button and run the following piece of R code on the downloaded `general_stats_table.tsv` file:
```R
# install packages
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
# Load necessary libraries
library(dplyr)
library(stringr)
library(tidyr)
# Read the TSV file
file_path <- "general_stats_table.tsv"
df <- read.delim(file_path, check.names = FALSE)
# sample name as row name
rownames(df) <- df$Sample
# remove Sample column and clean up the column names
tmp <- cbind(ID = rownames(df), stack(df[-1])) |>
transform(ind = as.character(ind) |> stringr::str_remove_all("\\.\\d+"))
# remove NA values
tmp <- tmp[!is.na(tmp$values),]
# replace placeholder "." values with empty strings
tmp$values <- tmp$values |> stringr::str_remove_all("^\\.$")
# pivot data
tmp <- tmp |> pivot_wider(id_cols = ID , names_from = ind, values_from = values,
values_fn = \(x) paste(unique(x), collapse = ""))
# print only the first 4 columns
tmp[1:4]
```
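If you also want to keep the full cleaned table on disk rather than only printing it, you can write it back to a TSV file. This is a minimal sketch using base R; the output file name is arbitrary:
```R
# write the full cleaned table to disk (the file name is arbitrary)
write.table(tmp, file = "general_stats_table_clean.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)
```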
The printed `tmp[1:4]` output will be a table similar to this one:
```
ID Dups GC Seqs
<chr> <chr> <chr> <chr>
1 yeast_R1 73.01 43 0.01
2 yeast_R1_seqkit_trim 69.21661409043114 44 0.007607999999999999
3 yeast_R1_seqkit_trim_STAR_sorted 69.54090258811289 44 0.007689
4 yeast_R1_seqkit_trim_bbmap_sorted 69.21661409043114 44 0.007607999999999999
5 yeast_R1_seqkit_trim_bowtie2_sorted 69.21661409043114 44 0.007607999999999999
6 yeast_R1_seqkit_trim_bwaaln_sorted 69.21661409043114 44 0.007607999999999999
7 yeast_R1_seqkit_trim_bwamem_sorted 69.18933123111286 44 0.007611
8 yeast_R1_seqkit_trim_bwasw_sorted 69.23279033105622 44 0.007612
9 yeast_R1_seqkit_trim_graphmap2_sorted 69.21661409043114 44 0.007607999999999999
10 yeast_R1_seqkit_trim_hisat2_sorted 69.21661409043114 44 0.007607999999999999
11 yeast_R1_seqkit_trim_kallisto_sorted 69.21661409043114 44 0.007607999999999999
12 yeast_R1_seqkit_trim_minimap2_sorted 69.63058976020739 44 0.007715
13 yeast_R1_seqkit_trim_nucmer.fixed_sorted 64.70588235294117 42 0.00016999999999999999
14 yeast_R1_seqkit_trim_sublong 53.23343848580442 44 0.007607999999999999
15 yeast_R1_seqkit_trim_subread 69.21661409043114 44 0.007607999999999999
```
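Because MultiQC exports these values as text, the columns are character vectors. If you want to rank the aligner outputs numerically, for example by duplication level, you can convert them first. This is a sketch assuming the cleaned columns are named `Dups`, `GC` and `Seqs` as in the example above:
```R
# keep the first four columns and convert the metrics to numeric
# (empty values coming from "." placeholders become NA)
cmp <- tmp[1:4]
cmp[-1] <- lapply(cmp[-1], as.numeric)
# order the aligner outputs by duplication level (% Dups)
cmp[order(cmp$Dups), ]
```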
## Contributing
Contributions from the community are welcome! See the [Contributing guidelines](https://github.com/Juke34/aline/blob/main/CONTRIBUTING.md)
3 changes: 2 additions & 1 deletion aline.nf
@@ -509,7 +509,7 @@ workflow align {
// ------------------- QC -----------------
if(params.fastqc){
fastqc_raw(raw_reads, "fastqc/raw", "raw")
logs.mix(fastqc_raw.out)
logs.concat(fastqc_raw.out).set{logs} // save log
}

// ------------------- Standardize Score to Be Phred+33 ----------------
@@ -1088,4 +1088,5 @@ In aline.nf
Think to convert sam to bam if necessary
Think to sort bam output if necessary
Add info in README.md
Add tool in multiqc config file (at least for fastqc)
*/
16 changes: 13 additions & 3 deletions config/multiqc_conf.yml
@@ -1,4 +1,4 @@
title: "Nextflow Alignment pipeline"
title: "AliNe - Nextflow Alignment pipeline (https://github.com/Juke34/AliNe)"

run_modules:
- fastqc
@@ -45,6 +45,10 @@ module_order:
name: "FastQC (graphmap2)"
path_filters:
- "*graphmap2_logs*"
- fastqc:
name: "FastQC (hisat)"
path_filters:
- "*hisat_logs*"
- hisat2
- fastqc:
name: "FastQC (hisat2)"
@@ -73,6 +77,12 @@ module_order:
- "*sublong_logs*"
- star
- fastqc:
name: "FastQC (star)"
name: "FastQC (STAR)"
path_filters:
- "*star_logs*"
- "*STAR_logs*"
- fastqc:
name: "FastQC (STARlong)"
path_filters:
- "*star_logs*"
- "*starlong_logs*"
- "*STARlong_logs*"
3 changes: 2 additions & 1 deletion config/softwares.config
@@ -33,7 +33,8 @@ process {
container = 'quay.io/biocontainers/minimap2:2.26--he4a0461_1'
}
withName: 'multiqc' {
container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0'
//container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0'
container = 'quay.io/biocontainers/multiqc:1.27--pyhdfd78af_0'
}
withLabel: 'mummer4' {
container = 'quay.io/biocontainers/mummer4:4.0.0rc1--pl5321hdbdd923_6'
2 changes: 1 addition & 1 deletion modules/bbmap.nf
@@ -83,7 +83,7 @@ process bbmap {
}

// set fileName
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_bbmap"
"""
${tool} \\
ref=${genome_index} \\
13 changes: 8 additions & 5 deletions modules/bowtie2.nf
@@ -30,7 +30,7 @@ process bowtie2 {

output:
tuple val(sample), path ("*.sam"), emit: tuple_sample_sam
path "*bowtie2.log", emit: bowtie2_summary
path "*.log", emit: bowtie2_summary

script:

@@ -48,22 +48,25 @@ process bowtie2 {
read_orientation = "--ff"
}
}

// catch filename
filename = reads[0].baseName.replace('.fastq','') + "_bowtie2"

if (params.read_type == "short_paired"){
"""
bowtie2 ${params.bowtie2_options} ${read_orientation}\\
-p ${task.cpus} \\
-x ${genome.baseName} \\
-S ${reads[0].baseName.replace('.fastq','')}_bowtie2.sam \\
-1 ${reads[0]} -2 ${reads[1]} 2> ${reads[0].baseName.replace('.fastq','')}_bowtie2.log
-S ${filename}.sam \\
-1 ${reads[0]} -2 ${reads[1]} 2> ${filename}_sorted.log
"""
} else {
"""
bowtie2 ${params.bowtie2_options} ${read_orientation}\\
-p ${task.cpus} \\
-x ${genome.baseName} \\
-S ${reads.baseName.replace('.fastq','')}_bowtie2.sam \\
-U ${reads} 2> ${reads.baseName}_bowtie2.log
-S ${filename}.sam \\
-U ${reads} 2> ${filename}_sorted.log
"""
}

14 changes: 7 additions & 7 deletions modules/hisat2.nf
@@ -34,16 +34,16 @@ process hisat2 {

output:
tuple val(sample), path ("*.sam"), emit: tuple_sample_sam
path "${sample}_splicesite.txt"
path "*hisat2-summary.txt", emit: hisat2_summary
path "*_splicesite.txt"
path "*_summary.txt", emit: hisat2_summary

script:

// catch index basename
index_basename = hisat2_index_files[0].toString() - ~/.\d.ht2l?/

// catch filename
filename = reads[0].baseName.replace('.fastq','')
filename = reads[0].baseName.replace('.fastq','') + "_hisat2"

// deal with library type - default is unstranded.
def lib_strand=""
@@ -82,14 +82,14 @@

if (params.read_type == "short_paired"){
"""
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${sample}_splicesite.txt \\
--new-summary --summary-file ${sample}.hisat2-summary.txt \\
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${filename}_splicesite.txt \\
--new-summary --summary-file ${filename}_sorted_summary.txt \\
-p ${task.cpus} -x $index_basename -1 ${reads[0]} -2 ${reads[1]} > ${filename}.sam
"""
} else {
"""
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${sample}_splicesite.txt \\
--new-summary --summary-file ${sample}.hisat2-summary.txt \\
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${filename}_splicesite.txt \\
--new-summary --summary-file ${filename}_sorted_summary.txt \\
-p ${task.cpus} -x $index_basename -U $reads > ${filename}.sam
"""
}
21 changes: 14 additions & 7 deletions modules/kallisto.nf
@@ -26,21 +26,21 @@ process kallisto_index {
process kallisto {
label 'kallisto'
tag "$sample"
publishDir "${params.outdir}/${outpath}", pattern: "*.bam", mode: 'copy'
publishDir "${params.outdir}/${outpath}", pattern: "${filename}/*.bam", mode: 'copy'

input:
tuple val(sample), path(reads), val(library), val(read_length)
path kallisto_index
val outpath

output:
tuple val(sample), path ("*.bam"), emit: tuple_sample_bam
path "*kallisto.log", emit: kallisto_summary
tuple val(sample), path ("${filename}/*.bam"), emit: tuple_sample_bam
path "*.log", emit: kallisto_summary

script:

// catch filename
filename = reads[0].baseName.replace('.fastq','')
filename = reads[0].baseName.replace('.fastq','') + "_kallisto_sorted"

// deal with library type
def read_orientation=""
@@ -61,7 +61,11 @@ process kallisto {
-t ${task.cpus} \
--pseudobam \
-i ${kallisto_index} \
${reads[0]} ${reads[1]} -o ${filename}.bam 2> ${filename}_kallisto.log
${reads[0]} ${reads[1]} -o ${filename} 2> ${filename}.log
mv ${filename}/pseudoalignments.bam ${filename}/${filename}.bam
# in order that the log file contains the name of the output fastq files (MultiQC)
sed -i 's/${reads[0]}/${filename}.fastq.gz/' ${filename}.log
"""
} else {

@@ -83,8 +87,11 @@ process kallisto {
--pseudobam \
-i ${kallisto_index} \
--single \
${reads} -o ${filename}.bam 2> ${filename}_kallisto.log
${reads} -o ${filename} 2> ${filename}.log
mv ${filename}/pseudoalignments.bam ${filename}/${filename}.bam
# in order that the log file contains the name of the output fastq files (MultiQC)
sed -i 's/${reads[0]}/${filename}.fastq.gz/' ${filename}.log
"""
}

}
8 changes: 4 additions & 4 deletions modules/star.nf
@@ -107,7 +107,7 @@ process star {
--readFilesIn pipedRead1 pipedRead2 \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')} \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_${star_tool}_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate
"""
@@ -121,7 +121,7 @@ process star {
--readFilesIn pipedRead \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')} \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_${star_tool}_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate
"""
@@ -172,7 +172,7 @@ process star2pass{
--readFilesIn pipedRead1 pipedRead2 \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_2pass \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_${star_tool}_2pass_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate \\
--sjdbFileChrStartEnd *SJ.out.tab
@@ -195,7 +195,7 @@ process star2pass{
--readFilesIn pipedRead \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_2pass \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_${star_tool}_2pass_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate \\
--sjdbFileChrStartEnd *SJ.out.tab
4 changes: 2 additions & 2 deletions modules/subread.nf
@@ -53,7 +53,7 @@ process subread {
}

// remove fastq.gz
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_subread"

// prepare index name
def index_prefix = genome.baseName + "_index"
@@ -120,7 +120,7 @@ process sublong {
script:

// remove fastq.gz
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_sublong"

// prepare index name
def index_prefix = genome.baseName + "_index"