output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

kapsakcj · 2023-12-19T15:06:45Z

Closes #242

This dev branch should NOT be deleted after merging, we need to notify Alex E. and recommend switching to main branch for further usage before deleting it

🛠️ Changes Being Made

changes to `tasks/alignment/task_bwa.wdl`

expose memory and docker as optional inputs
added functionality to cope with uncompressed FASTQ files used as input
added lots of echo statements throughout to help with debugging and troubleshooting (we should keep these, they will be useful in the future)
IMPORTANT CHANGES
- lines 41-48 Altered the main bwa command to produce a sorted SAM file as output. Also pass in cpus to speed things up
- lines 51-56 added samtools view command to produce a BAM that ONLY includes aligned reads
- lines 59-64 added samtools view command to produce a BAM that ONLY includes unaligned reads
- lines 74-79 Altered samtools fastq command to produce paired FASTQ files for aligned reads
- lines 82-87 Added samtools fastq command to produced paired FASTQ files for unaligned reads
- lines 89-94 altered samtools fastq command to produce FASTQ file for aligned reads (single end)
- lines 95-101 added samtools fastq command to produce FASTQ file for unaligned reads (single end)
- lines 105-106 added commands for indexing aligned BAM and unaligned BAM
re-enabled memory retry feature (it was commented out, likely by accident)

Changes to `workflows/theiacov/wf_theiacov_illumina_pe.wdl` and `workflows/theiacov/wf_theiacov_illumina_se.wdl` and `workflows/utilities/wf_ivar_consensus.wdl`

added 4 (PE) or 3 (SE) new optional outputs to both theiacov_illumina workflows:

    File? read1_unaligned = ivar_consensus.read1_unaligned
    File? read2_unaligned = ivar_consensus.read2_unaligned
    File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
    File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai

Impacted Workflows/Tasks

TheiaCov_Illumina_PE_PHB (iVar subworkflow, NOT Flu track)
TheiaCov_Illumina_SE_PHB (iVar subworkflow, NOT Flu track)
Freyja_FASTQ (bwa tasked changed, but no workflow changes) - Would be good to compare the results of the same sample analyzed via v1.3.0 vs this dev branch

🧠 Context and Rationale

A user requested that unaligned FASTQ files (i.e. the reads that do not align to the target reference genome, example: Mpox) are output from the TheiaCov_Illumina workflows. The goal is to take these reads that did not align to the reference and perform downstream analysis on them (like Kraken for taxonomic assignment, or other purposes like attempting to assemble a genome out of those reads not aligned)

In the process of addressing this request, I also added the unaligned_bam and it's index which is a BAM file that only contains reads that did not align to the reference (see lines 59-64 in BWA WDL task to see where this BAM is created). Not sure how this would be used, but it doesn't hurt to output this as well.

📋 Workflow/Task Steps

Inputs

added 2 new optional inputs to bwa task: memory and docker

Outputs

added 4 new optional outputs to both theiacov_illumina workflows (served up via the ivar subworkflow):

    File? read1_unaligned = ivar_consensus.read1_unaligned
    File? read2_unaligned = ivar_consensus.read2_unaligned
    File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
    File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai

Impacted Outputs

Pre-existing outputs that may be impacted (this excludes the new ones)

sorted_bam which is used heavily downstream in the workflow
read1_aligned & read2_aligned. Both existed previously, but command syntax has changed slightly so good to double check these.

🧪 Testing

Data

Test data used were as follows

Simulated SARS-CoV-2 data. This was expected to map 100% to the reference SARS-CoV-2 genome, i.e. 0 unaligned reads.
Human data. This was expected not to align, but some reads aligned to the the reference SARS-CoV-2 genome.
A spiked dataset from 1. and 2. above. SARS-CoV-2 and some human reads expected to map.
Unaligned reads from 2. This was to test cases when nothing aligns to a reference, in which case unaligned reads file will be created but empty.

Locally

Worked as expected

Terra

https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/Global_tree_testing/job_history/43118990-5338-4d92-b429-19158452cf78

Scenarios for Reviewer to Test

🔬 Quality checks

Pull Request (PR) checklist:

Include a description of what is in this pull request in this message.
The workflow/task has been tested locally and on Terra
The CI/CD has been adjusted and tests are passing
Everything follows the style guide

…emory default increased from 8 to 16; broke up one liner to produce intermediate SAM file for downstream usage; added 2 samtools view commands to generate separate BAMs for aligned and unaligned reads; updated cmds for FASTQ extraction from 2 BAMS; added indexing for unaligned BAM; added 4 new outputs: read1&2 unaligned, sorted_bam_unaligned, and sorted_bam_unaligned_bai

…v_illumina_pe workflow; both ran successfully with miniwdl

…, sorted_bam_unaligned, sorted_bam_unaligned_bai. tested fine w miniwdl

… out why so much data is being streamed to STDOUT and not directly to a file

…s to count reads before and after alignment & re-production of FASTQs; broke samtools sort into multiple lines for readability; added helpful echo statements throughout; fixed the samtools fastq commands that were not appropriately outputting FASTQ files

kapsakcj · 2024-02-07T17:29:06Z

NOTE: this change will impact Freyja_FASTQ workflow as it used the same bwa task

So, we are planning to skip adding these new outputs to Freyja_FASTQ workflow, but it would be good to do a ~~functional~~ validation test of the workflow to ensure it still runs as expected

I am worried that this may change the results/outputs of Freyja_FASTQ too, so it would be good to actually compare results instead of just a functional workflow test

kapsakcj · 2024-02-08T17:32:46Z

I launched a series of tests on some 2 sars-cov-2 Illumina samples. hoping to compare outputs between the 2 (version1.3.0 vs this dev branch.

and single end:

~~TODO:~~

review results, compare these between dev branch and v1.3.0 runs:
- file sizes of aligned BAMs (should be identical or extremely close)

The BAMs are identical in size, as expected

file sizes of aligned FASTQs (should be identical or extremely close)

The aligned FASTQs (like the R1 file) for v1.3.0 are slightly larger than those on the dev branch. I believe this is because for the dev branch, the samtools fastq command for outputting aligned reads also requires that reads have a mate/pair where as v1.3.0 did not. So this is to be expected, but is a minor difference from v1.3.0 behavior

general assembly metrics (number N, assembly_length_unambiguous, VADR alerts,)

All of these QC metrics were identical. Meaning that the sorted BAM used for variant calling/consensus FASTA generation are identical to v1.3.0 and downstream analysis is not impacted by these changes. Woo! 🎉

other things I'm forgetting?

cimendes · 2024-02-09T09:18:02Z

tasks/alignment/task_bwa.wdl

+    fi
+
+    echo -e "\ninput R1 has $(${cat_reads} ~{read1} | grep -c '^@') reads as input"
+    echo "input R2 has $(${cat_reads} ~{read2} | grep -c '^@') reads as input"


Are we uncompressing the read files just to output the read number? Seems like an unnecessary overhead on the task, taking longer to run and requiring more computational resources.

Were useful for ensuring the task is doing what we think it is doing. @kapsakcj, maybe we remove?

cimendes · 2024-02-09T09:19:41Z

tasks/alignment/task_bwa.wdl

+      -@ ~{cpu} \
+      -F 4 \
+      -b \
+      -o ~{samplename}.sorted.bam \


shouldn't this file be renamed to ~{samplename}.sorted.aligned-reads.bam?

I kept it as the original filename. I agree it would provide a bit more clarity to someone reading the code, but hopefully the comments are enough.

I would prefer not to make further changes unless truly necessary since we'd like to share with our partner asap

cimendes · 2024-02-09T09:21:39Z

tasks/alignment/task_bwa.wdl

+      echo -e "\nGenerating FASTQs for aligned reads"
+      samtools fastq \
+        -@ ~{cpu} \
+        -F 4 \


given that the bam files were already filtered by samtools view, do we need the -F 4 flag here too? All reads in the bam file should be aligned

it may be unnecessary to include the -F 4 flag, I'm unsure. But it runs as expected so I'm hesitant to change & test again 😬

cimendes · 2024-02-09T09:21:50Z

tasks/alignment/task_bwa.wdl

+      # note the lowercase 'f' here is imporant
+      samtools fastq \
+        -@ ~{cpu} \
+        -f 4 \


Same comment as above

cimendes · 2024-02-09T09:29:21Z

tasks/alignment/task_bwa.wdl

@@ -52,14 +127,18 @@ task bwa {
    File sorted_bai = "${samplename}.sorted.bam.bai"
    File read1_aligned = "~{samplename}_R1.fastq.gz"
    File? read2_aligned = "~{samplename}_R2.fastq.gz"
+    File read1_unaligned = "~{samplename}_unaligned_R1.fastq.gz"
+    File? read2_unaligned = "~{samplename}_unaligned_R2.fastq.gz"
+    File sorted_bam_unaligned = "~{samplename}.sorted.unaligned-reads.bam"


I'm having an existential crisis on if we should rename sorted_bam and sorted_bai as sorted_aligned_bam and sorted_aligned_bai for readability 😅
Additionally, I would rename sorted_bam_unaligned to sorted_unaligned_bam so the variable has the file type at the end. It's a bit cleaner and it goes with or style-guide. Same for sorted_bam_unaligned_bai -> sorted_unaligned_bai

Personally I prefer names that tell a story of what has been done to the variable output or file, in which case the bam file was sorted, and then unaligned reads extracted. Happy to hear what others think and this could feed onto the style-guide.

cimendes · 2024-02-09T10:08:01Z

Workflows using bwa:

TheiaCoV_Illumina_SE and PE through the wf_ivar_consensus.wdl sub-workflow
- Tested by @kapsakcj
Freyja_FASTQ

PR #323 will implement the use of bwa on TheiaMeta_Illumina_PE (aligned bams are input for the new binning task)

cimendes · 2024-02-09T13:08:53Z

tried running theiavalidate on two runs of Freyja_FASTQ but got stuck due to unrelated issues with this PR.

A manual size comparison of the outputted bam files confirmed that the two are of the same size:

I would like to get theiavalidate results to properly check file content but given that it's not cooperating, and this is an issue that is high priority, I'm not going to hold my review on it.

I'll wait for more feedback on the style-guide related issues I mentioned above before hitting approve but functionally this addition is working as expected for all tested workflows. Nicely done @kapsakcj and @jrotieno!

kapsakcj · 2024-02-09T21:14:40Z

FYI @jrotieno @cimendes I've set this PR to draft until we hear feedback from our partner. Let's not merge until they test and provide feedback.

If we don't get any feedback in the next week or 2, then I say we press forward with merging if we're satisfied with our own testing.

It would be good to incorporate this into the next PHB release

kapsakcj added 8 commits October 31, 2023 14:10

Merge remote-tracking branch 'origin/main' into cjk-unaligned-reads

e33e337

added 4 new file outputs to wf_ivar_consensus subworkflow and theiaco…

4abfd61

…v_illumina_pe workflow; both ran successfully with miniwdl

added 3 new optional outputs to theiacov_illumina_se: read1_unaligned…

3d26d0d

…, sorted_bam_unaligned, sorted_bam_unaligned_bai. tested fine w miniwdl

added in cpu option for samtools fastq commands. still need to figure…

88e1d69

… out why so much data is being streamed to STDOUT and not directly to a file

Merge remote-tracking branch 'origin/main' into cjk-unaligned-reads

236a943

turn on memory retry feature for bwa WDL task

714931c

sage-wright linked an issue Dec 20, 2023 that may be closed by this pull request

[feature request] TheiaCov: output unaligned read in FASTQ & BAM format #242

Closed

jrotieno added 2 commits February 7, 2024 17:13

add a check on read compression and end of line formatting

c107e81

Merge remote-tracking branch 'origin/main' into cjk-unaligned-reads

de53e79

kapsakcj changed the title ~~Cjk unaligned reads~~ output unaligned FASTQ files TheiaCov_Illumina PE and SE Feb 7, 2024

Merge remote-tracking branch 'origin/main' into cjk-unaligned-reads

ffb39a8

jrotieno requested review from kevinlibuit and cimendes February 8, 2024 13:37

kapsakcj marked this pull request as ready for review February 8, 2024 19:13

cimendes reviewed Feb 9, 2024

View reviewed changes

tasks/alignment/task_bwa.wdl

# note the lowercase 'f' here is imporant

samtools fastq \

-@ ~{cpu} \

-f 4 \

Copy link

Member

cimendes Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above

cimendes reviewed Feb 9, 2024

View reviewed changes

cimendes approved these changes Feb 9, 2024

View reviewed changes

kapsakcj marked this pull request as draft February 9, 2024 21:12

kapsakcj marked this pull request as ready for review February 21, 2024 16:45

cimendes merged commit 254035d into main Feb 21, 2024
8 checks passed

kapsakcj deleted the cjk-unaligned-reads branch April 17, 2024 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

kapsakcj commented Dec 19, 2023 •

edited

Loading

kapsakcj commented Feb 7, 2024 •

edited

Loading

kapsakcj commented Feb 8, 2024 •

edited

Loading

cimendes Feb 9, 2024

jrotieno Feb 9, 2024

cimendes Feb 9, 2024

kapsakcj Feb 9, 2024

cimendes Feb 9, 2024

kapsakcj Feb 9, 2024

cimendes Feb 9, 2024

cimendes Feb 9, 2024

jrotieno Feb 9, 2024 •

edited

Loading

cimendes commented Feb 9, 2024 •

edited

Loading

cimendes commented Feb 9, 2024

kapsakcj commented Feb 9, 2024

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

Conversation

kapsakcj commented Dec 19, 2023 • edited Loading

🛠️ Changes Being Made

changes to tasks/alignment/task_bwa.wdl

Changes to workflows/theiacov/wf_theiacov_illumina_pe.wdl and workflows/theiacov/wf_theiacov_illumina_se.wdl and workflows/utilities/wf_ivar_consensus.wdl

Impacted Workflows/Tasks

🧠 Context and Rationale

📋 Workflow/Task Steps

Inputs

Outputs

Impacted Outputs

🧪 Testing

Data

Locally

Terra

Scenarios for Reviewer to Test

🔬 Quality checks

kapsakcj commented Feb 7, 2024 • edited Loading

kapsakcj commented Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jrotieno Feb 9, 2024 • edited Loading

Choose a reason for hiding this comment

cimendes commented Feb 9, 2024 • edited Loading

cimendes commented Feb 9, 2024

kapsakcj commented Feb 9, 2024

kapsakcj commented Dec 19, 2023 •

edited

Loading

changes to `tasks/alignment/task_bwa.wdl`

Changes to `workflows/theiacov/wf_theiacov_illumina_pe.wdl` and `workflows/theiacov/wf_theiacov_illumina_se.wdl` and `workflows/utilities/wf_ivar_consensus.wdl`

kapsakcj commented Feb 7, 2024 •

edited

Loading

kapsakcj commented Feb 8, 2024 •

edited

Loading

jrotieno Feb 9, 2024 •

edited

Loading

cimendes commented Feb 9, 2024 •

edited

Loading