Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

Merged
merged 11 commits into from
Feb 21, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
109 changes: 94 additions & 15 deletions tasks/alignment/task_bwa.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,9 @@ task bwa {
String samplename
File? reference_genome
Int cpu = 6
Int memory = 16
Int disk_size = 100
String docker = "us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan"
}
command <<<
# date and version control
Expand All @@ -25,25 +27,98 @@ task bwa {
ref_genome="/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta"
fi

# Map with BWA MEM
echo "Running bwa mem -t ~{cpu} ${ref_genome} ~{read1} ~{read2} | samtools sort | samtools view -F 4 -o ~{samplename}.sorted.bam "
# set cat command based on compression
if [[ "~{read1}" == *".gz" ]] ; then
cat_reads="zcat"
else
cat_reads="cat"
fi

echo -e "\ninput R1 has $(${cat_reads} ~{read1} | grep -c '^@') reads as input"
echo "input R2 has $(${cat_reads} ~{read2} | grep -c '^@') reads as input"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we uncompressing the read files just to output the read number? Seems like an unnecessary overhead on the task, taking longer to run and requiring more computational resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were useful for ensuring the task is doing what we think it is doing. @kapsakcj, maybe we remove?


# Map with BWA MEM; pipe to samtools sort to write sorted SAM file
bwa mem \
-t ~{cpu} \
"${ref_genome}" \
~{read1} ~{read2} |\
samtools sort | samtools view -F 4 -o ~{samplename}.sorted.bam
-t ~{cpu} \
"${ref_genome}" \
~{read1} \
~{read2} | \
samtools sort \
-@ ~{cpu} - \
> ~{samplename}.sorted.sam

# convert SAM to BAM that only includes aligned reads
samtools view \
-@ ~{cpu} \
-F 4 \
-b \
-o ~{samplename}.sorted.bam \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this file be renamed to ~{samplename}.sorted.aligned-reads.bam?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it as the original filename. I agree it would provide a bit more clarity to someone reading the code, but hopefully the comments are enough.

I would prefer not to make further changes unless truly necessary since we'd like to share with our partner asap

~{samplename}.sorted.sam

# convert SAM to BAM that only includes unaligned reads
samtools view \
-@ ~{cpu} \
-f 4 \
-b \
-o ~{samplename}.sorted.unaligned-reads.bam \
~{samplename}.sorted.sam

# see here for "samtools fastq" options: https://www.htslib.org/doc/samtools-fasta.html
# TL;DR is that "samtools fastq -1 R1.fastq -2 R2.fastq" works with paired-end inputs and will output R1 and R2 reads to separate files due to tags in the SAM & BAM file

# AFAIK for single end alignments w/ bwa mem, the output SAM/BAM files do not have tags to differentiate between R1 and R2 reads, so the "samtools fastq -0 R1.fastq" command is used to output all reads to a single file

# if read2 was provided by user, extract both read1 and read2 from aligned and unaligned BAMs
if [[ ! -z "~{read2}" ]]; then
echo "processing paired reads"
samtools fastq -F4 -1 ~{samplename}_R1.fastq.gz -2 ~{samplename}_R2.fastq.gz ~{samplename}.sorted.bam
echo -e "\nGenerating FASTQs for aligned reads"
samtools fastq \
-@ ~{cpu} \
-F 4 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that the bam files were already filtered by samtools view, do we need the -F 4 flag here too? All reads in the bam file should be aligned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be unnecessary to include the -F 4 flag, I'm unsure. But it runs as expected so I'm hesitant to change & test again 😬

-1 ~{samplename}_R1.fastq.gz \
-2 ~{samplename}_R2.fastq.gz \
~{samplename}.sorted.bam
echo "Generating FASTQs for unaligned reads"
# note the lowercase 'f' here is imporant
samtools fastq \
-@ ~{cpu} \
-f 4 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above

-1 ~{samplename}_unaligned_R1.fastq.gz \
-2 ~{samplename}_unaligned_R2.fastq.gz \
~{samplename}.sorted.unaligned-reads.bam
else
echo "processing single-end reads"
samtools fastq -F4 ~{samplename}.sorted.bam | gzip > ~{samplename}_R1.fastq.gz
echo -e "\nGenerating FASTQs for aligned single-end reads\n"
samtools fastq \
-@ ~{cpu} \
-F 4 \
-0 ~{samplename}_R1.fastq.gz \
~{samplename}.sorted.bam
echo -e "Generating FASTQs for unaligned single-end reads\n"
# again, lowercase 'f' is important for getting all unaligned reads
samtools fastq \
-@ ~{cpu} \
-f 4 \
-0 ~{samplename}_unaligned_R1.fastq.gz \
~{samplename}.sorted.unaligned-reads.bam
fi


# index BAMs
samtools index ~{samplename}.sorted.bam
samtools index ~{samplename}.sorted.bam
samtools index ~{samplename}.sorted.unaligned-reads.bam

# count output reads to ensure we are outputting all reads, regardless if the aligned or not
# if read2 does exist as input, count both R1 and R2
if [[ ! -z "~{read2}" ]]; then
echo -e "\noutput R1_aligned has $(zcat ~{samplename}_R1.fastq.gz | grep -c '^@') reads as input"
echo "output R2_aligned has $(zcat ~{samplename}_R2.fastq.gz | grep -c '^@') reads as input"
echo
echo "output R1_unaligned has $(zcat ~{samplename}_unaligned_R1.fastq.gz | grep -c '^@') reads as input"
echo "output R2_unaligned has $(zcat ~{samplename}_unaligned_R2.fastq.gz | grep -c '^@') reads as input"
# else = if read2 does not exist as input, only count R1
else
echo "output R1_aligned has $(zcat ~{samplename}_R1.fastq.gz | grep -c '^@') reads as input"
echo
echo "output R1_unaligned has $(zcat ~{samplename}_unaligned_R1.fastq.gz | grep -c '^@') reads as input"
fi
>>>
output {
String bwa_version = read_string("BWA_VERSION")
Expand All @@ -52,14 +127,18 @@ task bwa {
File sorted_bai = "${samplename}.sorted.bam.bai"
File read1_aligned = "~{samplename}_R1.fastq.gz"
File? read2_aligned = "~{samplename}_R2.fastq.gz"
File read1_unaligned = "~{samplename}_unaligned_R1.fastq.gz"
File? read2_unaligned = "~{samplename}_unaligned_R2.fastq.gz"
File sorted_bam_unaligned = "~{samplename}.sorted.unaligned-reads.bam"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having an existential crisis on if we should rename sorted_bam and sorted_bai as sorted_aligned_bam and sorted_aligned_bai for readability 😅
Additionally, I would rename sorted_bam_unaligned to sorted_unaligned_bam so the variable has the file type at the end. It's a bit cleaner and it goes with or style-guide. Same for sorted_bam_unaligned_bai -> sorted_unaligned_bai

Copy link
Contributor

@jrotieno jrotieno Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I prefer names that tell a story of what has been done to the variable output or file, in which case the bam file was sorted, and then unaligned reads extracted. Happy to hear what others think and this could feed onto the style-guide.

File sorted_bam_unaligned_bai = "~{samplename}.sorted.unaligned-reads.bam.bai"
}
runtime {
docker: "us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan"
memory: "8 GB"
docker: docker
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB" # TES
preemptible: 0
#maxRetries: 3
maxRetries: 3
}
}
4 changes: 4 additions & 0 deletions workflows/theiacov/wf_theiacov_illumina_pe.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -342,6 +342,10 @@ workflow theiacov_illumina_pe {
File? read2_aligned = ivar_consensus.read2_aligned
String aligned_bam = select_first([ivar_consensus.aligned_bam, ""])
String aligned_bai = select_first([ivar_consensus.aligned_bai, ""])
File? read1_unaligned = ivar_consensus.read1_unaligned
File? read2_unaligned = ivar_consensus.read2_unaligned
File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai
# Read Alignment - primer trimming outputs
Float? primer_trimmed_read_percent = ivar_consensus.primer_trimmed_read_percent
String? ivar_version_primtrim = ivar_consensus.ivar_version_primtrim
Expand Down
3 changes: 3 additions & 0 deletions workflows/theiacov/wf_theiacov_illumina_se.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -218,6 +218,9 @@ workflow theiacov_illumina_se {
String? assembly_method = ivar_consensus.assembly_method_nonflu
File? aligned_bam = ivar_consensus.aligned_bam
File? aligned_bai = ivar_consensus.aligned_bai
File? read1_unaligned = ivar_consensus.read1_unaligned
File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai
# Read Alignment - primer trimming outputs
Float? primer_trimmed_read_percent = ivar_consensus.primer_trimmed_read_percent
String? ivar_version_primtrim = ivar_consensus.ivar_version_primtrim
Expand Down
4 changes: 4 additions & 0 deletions workflows/utilities/wf_ivar_consensus.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,10 @@ workflow ivar_consensus {
File? read2_aligned = bwa.read2_aligned
File aligned_bam = select_first([primer_trim.trim_sorted_bam, bwa.sorted_bam, ""]) # set default values for select_first() to avoid workflow failures
File aligned_bai = select_first([primer_trim.trim_sorted_bai, bwa.sorted_bai, ""])
File read1_unaligned = bwa.read1_unaligned
File? read2_unaligned = bwa.read2_unaligned
File sorted_bam_unaligned = bwa.sorted_bam_unaligned
File sorted_bam_unaligned_bai = bwa.sorted_bam_unaligned_bai

# primer trimming outputs
Float? primer_trimmed_read_percent = primer_trim.primer_trimmed_read_percent
Expand Down