-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
output unaligned FASTQ files TheiaCov_Illumina PE and SE #275
Conversation
…emory default increased from 8 to 16; broke up one liner to produce intermediate SAM file for downstream usage; added 2 samtools view commands to generate separate BAMs for aligned and unaligned reads; updated cmds for FASTQ extraction from 2 BAMS; added indexing for unaligned BAM; added 4 new outputs: read1&2 unaligned, sorted_bam_unaligned, and sorted_bam_unaligned_bai
…v_illumina_pe workflow; both ran successfully with miniwdl
…, sorted_bam_unaligned, sorted_bam_unaligned_bai. tested fine w miniwdl
… out why so much data is being streamed to STDOUT and not directly to a file
…s to count reads before and after alignment & re-production of FASTQs; broke samtools sort into multiple lines for readability; added helpful echo statements throughout; fixed the samtools fastq commands that were not appropriately outputting FASTQ files
NOTE: this change will impact Freyja_FASTQ workflow as it used the same So, we are planning to skip adding these new outputs to Freyja_FASTQ workflow, but it would be good to do a I am worried that this may change the results/outputs of Freyja_FASTQ too, so it would be good to actually compare results instead of just a functional workflow test |
I launched a series of tests on some 2 sars-cov-2 Illumina samples. hoping to compare outputs between the 2 (version1.3.0 vs this dev branch. and single end:
The BAMs are identical in size, as expected
The aligned FASTQs (like the R1 file) for v1.3.0 are slightly larger than those on the dev branch. I believe this is because for the dev branch, the
All of these QC metrics were identical. Meaning that the sorted BAM used for variant calling/consensus FASTA generation are identical to v1.3.0 and downstream analysis is not impacted by these changes. Woo! 🎉
|
fi | ||
|
||
echo -e "\ninput R1 has $(${cat_reads} ~{read1} | grep -c '^@') reads as input" | ||
echo "input R2 has $(${cat_reads} ~{read2} | grep -c '^@') reads as input" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we uncompressing the read files just to output the read number? Seems like an unnecessary overhead on the task, taking longer to run and requiring more computational resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were useful for ensuring the task is doing what we think it is doing. @kapsakcj, maybe we remove?
-@ ~{cpu} \ | ||
-F 4 \ | ||
-b \ | ||
-o ~{samplename}.sorted.bam \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this file be renamed to ~{samplename}.sorted.aligned-reads.bam
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept it as the original filename. I agree it would provide a bit more clarity to someone reading the code, but hopefully the comments are enough.
I would prefer not to make further changes unless truly necessary since we'd like to share with our partner asap
echo -e "\nGenerating FASTQs for aligned reads" | ||
samtools fastq \ | ||
-@ ~{cpu} \ | ||
-F 4 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that the bam files were already filtered by samtools view
, do we need the -F 4
flag here too? All reads in the bam file should be aligned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be unnecessary to include the -F 4 flag, I'm unsure. But it runs as expected so I'm hesitant to change & test again 😬
# note the lowercase 'f' here is imporant | ||
samtools fastq \ | ||
-@ ~{cpu} \ | ||
-f 4 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
@@ -52,14 +127,18 @@ task bwa { | |||
File sorted_bai = "${samplename}.sorted.bam.bai" | |||
File read1_aligned = "~{samplename}_R1.fastq.gz" | |||
File? read2_aligned = "~{samplename}_R2.fastq.gz" | |||
File read1_unaligned = "~{samplename}_unaligned_R1.fastq.gz" | |||
File? read2_unaligned = "~{samplename}_unaligned_R2.fastq.gz" | |||
File sorted_bam_unaligned = "~{samplename}.sorted.unaligned-reads.bam" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having an existential crisis on if we should rename sorted_bam
and sorted_bai
as sorted_aligned_bam
and sorted_aligned_bai
for readability 😅
Additionally, I would rename sorted_bam_unaligned
to sorted_unaligned_bam
so the variable has the file type at the end. It's a bit cleaner and it goes with or style-guide. Same for sorted_bam_unaligned_bai
-> sorted_unaligned_bai
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer names that tell a story of what has been done to the variable output or file, in which case the bam file was sorted, and then unaligned reads extracted. Happy to hear what others think and this could feed onto the style-guide.
tried running theiavalidate on two runs of Freyja_FASTQ but got stuck due to unrelated issues with this PR. A manual size comparison of the outputted bam files confirmed that the two are of the same size: I would like to get theiavalidate results to properly check file content but given that it's not cooperating, and this is an issue that is high priority, I'm not going to hold my review on it. I'll wait for more feedback on the style-guide related issues I mentioned above before hitting approve but functionally this addition is working as expected for all tested workflows. Nicely done @kapsakcj and @jrotieno! |
FYI @jrotieno @cimendes I've set this PR to draft until we hear feedback from our partner. Let's not merge until they test and provide feedback. If we don't get any feedback in the next week or 2, then I say we press forward with merging if we're satisfied with our own testing. It would be good to incorporate this into the next PHB release |
Closes #242
This dev branch should NOT be deleted after merging, we need to notify Alex E. and recommend switching to
main
branch for further usage before deleting it🛠️ Changes Being Made
changes to
tasks/alignment/task_bwa.wdl
memory
anddocker
as optional inputsecho
statements throughout to help with debugging and troubleshooting (we should keep these, they will be useful in the future)bwa
command to produce a sorted SAM file as output. Also pass incpus
to speed things upsamtools view
command to produce a BAM that ONLY includes aligned readssamtools view
command to produce a BAM that ONLY includes unaligned readssamtools fastq
command to produce paired FASTQ files for aligned readssamtools fastq
command to produced paired FASTQ files for unaligned readssamtools fastq
command to produce FASTQ file for aligned reads (single end)samtools fastq
command to produce FASTQ file for unaligned reads (single end)Changes to
workflows/theiacov/wf_theiacov_illumina_pe.wdl
andworkflows/theiacov/wf_theiacov_illumina_se.wdl
andworkflows/utilities/wf_ivar_consensus.wdl
Impacted Workflows/Tasks
bwa
tasked changed, but no workflow changes) - Would be good to compare the results of the same sample analyzed via v1.3.0 vs this dev branch🧠 Context and Rationale
A user requested that unaligned FASTQ files (i.e. the reads that do not align to the target reference genome, example: Mpox) are output from the TheiaCov_Illumina workflows. The goal is to take these reads that did not align to the reference and perform downstream analysis on them (like Kraken for taxonomic assignment, or other purposes like attempting to assemble a genome out of those reads not aligned)
In the process of addressing this request, I also added the
unaligned_bam
and it's index which is a BAM file that only contains reads that did not align to the reference (see lines 59-64 in BWA WDL task to see where this BAM is created). Not sure how this would be used, but it doesn't hurt to output this as well.📋 Workflow/Task Steps
Inputs
added 2 new optional inputs to bwa task:
memory
anddocker
Outputs
Impacted Outputs
Pre-existing outputs that may be impacted (this excludes the new ones)
sorted_bam
which is used heavily downstream in the workflowread1_aligned
&read2_aligned
. Both existed previously, but command syntax has changed slightly so good to double check these.🧪 Testing
Data
Test data used were as follows
Locally
Worked as expected
Terra
https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/Global_tree_testing/job_history/43118990-5338-4d92-b429-19158452cf78
Scenarios for Reviewer to Test
🔬 Quality checks
Pull Request (PR) checklist: