output unaligned FASTQ files TheiaCov_Illumina PE and SE #275
Changes from all commits
99ae696
e33e337
4abfd61
3d26d0d
88e1d69
236a943
b5b41dd
714931c
c107e81
de53e79
ffb39a8
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,7 +7,9 @@ task bwa { | |
String samplename | ||
File? reference_genome | ||
Int cpu = 6 | ||
Int memory = 16 | ||
Int disk_size = 100 | ||
String docker = "us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan" | ||
} | ||
command <<< | ||
# date and version control | ||
|
@@ -25,25 +27,98 @@ task bwa { | |
ref_genome="/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta" | ||
fi | ||
|
||
# Map with BWA MEM | ||
echo "Running bwa mem -t ~{cpu} ${ref_genome} ~{read1} ~{read2} | samtools sort | samtools view -F 4 -o ~{samplename}.sorted.bam " | ||
# set cat command based on compression | ||
if [[ "~{read1}" == *".gz" ]] ; then | ||
cat_reads="zcat" | ||
else | ||
cat_reads="cat" | ||
fi | ||
|
||
echo -e "\ninput R1 has $(${cat_reads} ~{read1} | grep -c '^@') reads as input" | ||
echo "input R2 has $(${cat_reads} ~{read2} | grep -c '^@') reads as input" | ||
|
||
# Map with BWA MEM; pipe to samtools sort to write sorted SAM file | ||
bwa mem \ | ||
-t ~{cpu} \ | ||
"${ref_genome}" \ | ||
~{read1} ~{read2} |\ | ||
samtools sort | samtools view -F 4 -o ~{samplename}.sorted.bam | ||
-t ~{cpu} \ | ||
"${ref_genome}" \ | ||
~{read1} \ | ||
~{read2} | \ | ||
samtools sort \ | ||
-@ ~{cpu} - \ | ||
> ~{samplename}.sorted.sam | ||
|
||
# convert SAM to BAM that only includes aligned reads | ||
samtools view \ | ||
-@ ~{cpu} \ | ||
-F 4 \ | ||
-b \ | ||
-o ~{samplename}.sorted.bam \ | ||
Review comment: shouldn't this file be renamed to
Reply: I kept it as the original filename. I agree it would provide a bit more clarity to someone reading the code, but hopefully the comments are enough. I would prefer not to make further changes unless truly necessary since we'd like to share with our partner asap
||
~{samplename}.sorted.sam | ||
|
||
# convert SAM to BAM that only includes unaligned reads | ||
samtools view \ | ||
-@ ~{cpu} \ | ||
-f 4 \ | ||
-b \ | ||
-o ~{samplename}.sorted.unaligned-reads.bam \ | ||
~{samplename}.sorted.sam | ||
|
||
# see here for "samtools fastq" options: https://www.htslib.org/doc/samtools-fasta.html | ||
# TL;DR is that "samtools fastq -1 R1.fastq -2 R2.fastq" works with paired-end inputs and will output R1 and R2 reads to separate files due to tags in the SAM & BAM file | ||
|
||
# AFAIK for single end alignments w/ bwa mem, the output SAM/BAM files do not have tags to differentiate between R1 and R2 reads, so the "samtools fastq -0 R1.fastq" command is used to output all reads to a single file | ||
|
||
# if read2 was provided by user, extract both read1 and read2 from aligned and unaligned BAMs | ||
if [[ ! -z "~{read2}" ]]; then | ||
echo "processing paired reads" | ||
samtools fastq -F4 -1 ~{samplename}_R1.fastq.gz -2 ~{samplename}_R2.fastq.gz ~{samplename}.sorted.bam | ||
echo -e "\nGenerating FASTQs for aligned reads" | ||
samtools fastq \ | ||
-@ ~{cpu} \ | ||
-F 4 \ | ||
Review comment: given that the bam files were already filtered by
Reply: it may be unnecessary to include the -F 4 flag, I'm unsure. But it runs as expected so I'm hesitant to change & test again 😬
||
-1 ~{samplename}_R1.fastq.gz \ | ||
-2 ~{samplename}_R2.fastq.gz \ | ||
~{samplename}.sorted.bam | ||
echo "Generating FASTQs for unaligned reads" | ||
# note the lowercase 'f' here is important | ||
samtools fastq \ | ||
-@ ~{cpu} \ | ||
-f 4 \ | ||
Review comment: Same comment as above
||
-1 ~{samplename}_unaligned_R1.fastq.gz \ | ||
-2 ~{samplename}_unaligned_R2.fastq.gz \ | ||
~{samplename}.sorted.unaligned-reads.bam | ||
else | ||
echo "processing single-end reads" | ||
samtools fastq -F4 ~{samplename}.sorted.bam | gzip > ~{samplename}_R1.fastq.gz | ||
echo -e "\nGenerating FASTQs for aligned single-end reads\n" | ||
samtools fastq \ | ||
-@ ~{cpu} \ | ||
-F 4 \ | ||
-0 ~{samplename}_R1.fastq.gz \ | ||
~{samplename}.sorted.bam | ||
echo -e "Generating FASTQs for unaligned single-end reads\n" | ||
# again, lowercase 'f' is important for getting all unaligned reads | ||
samtools fastq \ | ||
-@ ~{cpu} \ | ||
-f 4 \ | ||
-0 ~{samplename}_unaligned_R1.fastq.gz \ | ||
~{samplename}.sorted.unaligned-reads.bam | ||
fi | ||
|
||
|
||
# index BAMs | ||
samtools index ~{samplename}.sorted.bam | ||
samtools index ~{samplename}.sorted.bam | ||
samtools index ~{samplename}.sorted.unaligned-reads.bam | ||
|
||
# count output reads to ensure we are outputting all reads, regardless of whether they aligned or not | ||
# if read2 does exist as input, count both R1 and R2 | ||
if [[ ! -z "~{read2}" ]]; then | ||
echo -e "\noutput R1_aligned has $(zcat ~{samplename}_R1.fastq.gz | grep -c '^@') reads as input" | ||
echo "output R2_aligned has $(zcat ~{samplename}_R2.fastq.gz | grep -c '^@') reads as input" | ||
echo | ||
echo "output R1_unaligned has $(zcat ~{samplename}_unaligned_R1.fastq.gz | grep -c '^@') reads as input" | ||
echo "output R2_unaligned has $(zcat ~{samplename}_unaligned_R2.fastq.gz | grep -c '^@') reads as input" | ||
# else = if read2 does not exist as input, only count R1 | ||
else | ||
echo "output R1_aligned has $(zcat ~{samplename}_R1.fastq.gz | grep -c '^@') reads as input" | ||
echo | ||
echo "output R1_unaligned has $(zcat ~{samplename}_unaligned_R1.fastq.gz | grep -c '^@') reads as input" | ||
fi | ||
>>> | ||
output { | ||
String bwa_version = read_string("BWA_VERSION") | ||
|
@@ -52,14 +127,18 @@ task bwa { | |
File sorted_bai = "${samplename}.sorted.bam.bai" | ||
File read1_aligned = "~{samplename}_R1.fastq.gz" | ||
File? read2_aligned = "~{samplename}_R2.fastq.gz" | ||
File read1_unaligned = "~{samplename}_unaligned_R1.fastq.gz" | ||
File? read2_unaligned = "~{samplename}_unaligned_R2.fastq.gz" | ||
File sorted_bam_unaligned = "~{samplename}.sorted.unaligned-reads.bam" | ||
Review comment: I'm having an existential crisis on if we should rename
Reply: Personally I prefer names that tell a story of what has been done to the variable output or file, in which case the bam file was sorted, and then unaligned reads extracted. Happy to hear what others think and this could feed onto the style-guide.
||
File sorted_bam_unaligned_bai = "~{samplename}.sorted.unaligned-reads.bam.bai" | ||
} | ||
runtime { | ||
docker: "us-docker.pkg.dev/general-theiagen/staphb/ivar:1.3.1-titan" | ||
memory: "8 GB" | ||
docker: docker | ||
memory: memory + " GB" | ||
cpu: cpu | ||
disks: "local-disk " + disk_size + " SSD" | ||
disk: disk_size + " GB" # TES | ||
preemptible: 0 | ||
#maxRetries: 3 | ||
maxRetries: 3 | ||
} | ||
} |
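For readers following the `-F 4` versus `-f 4` discussion in the review comments above: in the SAM format, bit 0x4 of the FLAG field marks a read as unmapped. `samtools view -f 4` keeps only records with that bit set, and `-F 4` keeps only records without it. The sketch below mimics that decision in plain bash on a few example FLAG integers. The function names `is_unmapped` and `classify_flags` are hypothetical; they are not part of this PR or of samtools.

```shell
#!/usr/bin/env bash
# Illustrative only: mimic the -f 4 / -F 4 decision on SAM FLAG integers.

# Succeed (exit status 0) when bit 0x4 (read unmapped) is set in the FLAG.
is_unmapped() {
  local flag="$1"
  (( (flag & 4) != 0 ))
}

# Label each FLAG passed as an argument as "aligned" or "unaligned".
classify_flags() {
  local flag
  for flag in "$@"; do
    if is_unmapped "${flag}"; then
      echo "${flag} unaligned"   # a record like this passes -f 4
    else
      echo "${flag} aligned"     # a record like this passes -F 4
    fi
  done
}

# A few common paired-end FLAG values, plus the bare 4 and 0 cases.
classify_flags 99 147 77 141 4 0
```

Because the two filters are exact complements on this bit, a BAM produced with `-F 4` should contain no records that a later `-F 4` would drop, which is the redundancy the reviewer is pointing at.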
Review comment: Are we uncompressing the read files just to output the read number? That seems like unnecessary overhead for the task: it takes longer to run and requires more computational resources.
Reply: These were useful for ensuring the task is doing what we think it is doing. @kapsakcj, maybe we remove them?
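On the read-counting question above: apart from the runtime cost, `grep -c '^@'` can overcount, because `@` is a valid Phred+33 quality character and a quality line may begin with it. One hedged alternative, assuming standard four-line FASTQ records with no wrapped sequence lines, is to divide the line count by four. The helper name `count_fastq_reads` is hypothetical and is not part of this PR.

```shell
#!/usr/bin/env bash
# Count reads in a plain or gzipped FASTQ as (number of lines) / 4.
# Assumes four-line records; the function name is illustrative only.
count_fastq_reads() {
  local fastq="$1"
  local lines
  if [[ "${fastq}" == *.gz ]]; then
    lines=$(gzip -dc "${fastq}" | wc -l)
  else
    lines=$(wc -l < "${fastq}")
  fi
  echo $(( lines / 4 ))
}

# Demo: two reads, the second with a quality line that starts with '@'.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n@III\n' > "${demo}"
echo "grep -c '^@' reports: $(grep -c '^@' "${demo}")"
echo "lines / 4 reports:    $(count_fastq_reads "${demo}")"
rm -f "${demo}"
```

This still decompresses the file, so it does not remove the overhead raised in the review; it only makes the count robust to quality lines that start with `@`.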