Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

output unaligned FASTQ files TheiaCov_Illumina PE and SE #275

Merged
merged 11 commits into from
Feb 21, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Dec 19, 2023

Closes #242

This dev branch should NOT be deleted after merging, we need to notify Alex E. and recommend switching to main branch for further usage before deleting it

🛠️ Changes Being Made

changes to tasks/alignment/task_bwa.wdl

  • expose memory and docker as optional inputs
  • added functionality to cope with uncompressed FASTQ files used as input
  • added lots of echo statements throughout to help with debugging and troubleshooting (we should keep these, they will be useful in the future)
  • IMPORTANT CHANGES
    • lines 41-48 Altered the main bwa command to produce a sorted SAM file as output. Also pass in cpus to speed things up
    • lines 51-56 added samtools view command to produce a BAM that ONLY includes aligned reads
    • lines 59-64 added samtools view command to produce a BAM that ONLY includes unaligned reads
    • lines 74-79 Altered samtools fastq command to produce paired FASTQ files for aligned reads
    • lines 82-87 Added samtools fastq command to produced paired FASTQ files for unaligned reads
    • lines 89-94 altered samtools fastq command to produce FASTQ file for aligned reads (single end)
    • lines 95-101 added samtools fastq command to produce FASTQ file for unaligned reads (single end)
    • lines 105-106 added commands for indexing aligned BAM and unaligned BAM
  • re-enabled memory retry feature (it was commented out, likely by accident)

Changes to workflows/theiacov/wf_theiacov_illumina_pe.wdl and workflows/theiacov/wf_theiacov_illumina_se.wdl and workflows/utilities/wf_ivar_consensus.wdl

  • added 4 (PE) or 3 (SE) new optional outputs to both theiacov_illumina workflows:
    File? read1_unaligned = ivar_consensus.read1_unaligned
    File? read2_unaligned = ivar_consensus.read2_unaligned
    File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
    File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai

Impacted Workflows/Tasks

  • TheiaCov_Illumina_PE_PHB (iVar subworkflow, NOT Flu track)
  • TheiaCov_Illumina_SE_PHB (iVar subworkflow, NOT Flu track)
  • Freyja_FASTQ (bwa tasked changed, but no workflow changes) - Would be good to compare the results of the same sample analyzed via v1.3.0 vs this dev branch

🧠 Context and Rationale

A user requested that unaligned FASTQ files (i.e. the reads that do not align to the target reference genome, example: Mpox) are output from the TheiaCov_Illumina workflows. The goal is to take these reads that did not align to the reference and perform downstream analysis on them (like Kraken for taxonomic assignment, or other purposes like attempting to assemble a genome out of those reads not aligned)

In the process of addressing this request, I also added the unaligned_bam and it's index which is a BAM file that only contains reads that did not align to the reference (see lines 59-64 in BWA WDL task to see where this BAM is created). Not sure how this would be used, but it doesn't hurt to output this as well.

📋 Workflow/Task Steps

Inputs

added 2 new optional inputs to bwa task: memory and docker

Outputs

  • added 4 new optional outputs to both theiacov_illumina workflows (served up via the ivar subworkflow):
    File? read1_unaligned = ivar_consensus.read1_unaligned
    File? read2_unaligned = ivar_consensus.read2_unaligned
    File? sorted_bam_unaligned = ivar_consensus.sorted_bam_unaligned
    File? sorted_bam_unaligned_bai = ivar_consensus.sorted_bam_unaligned_bai

Impacted Outputs

Pre-existing outputs that may be impacted (this excludes the new ones)

  • sorted_bam which is used heavily downstream in the workflow
  • read1_aligned & read2_aligned. Both existed previously, but command syntax has changed slightly so good to double check these.

🧪 Testing

Data

Test data used were as follows

  1. Simulated SARS-CoV-2 data. This was expected to map 100% to the reference SARS-CoV-2 genome, i.e. 0 unaligned reads.
  2. Human data. This was expected not to align, but some reads aligned to the the reference SARS-CoV-2 genome.
  3. A spiked dataset from 1. and 2. above. SARS-CoV-2 and some human reads expected to map.
  4. Unaligned reads from 2. This was to test cases when nothing aligns to a reference, in which case unaligned reads file will be created but empty.

Locally

Worked as expected

Terra

https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/Global_tree_testing/job_history/43118990-5338-4d92-b429-19158452cf78

Scenarios for Reviewer to Test

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

…emory default increased from 8 to 16; broke up one liner to produce intermediate SAM file for downstream usage; added 2 samtools view commands to generate separate BAMs for aligned and unaligned reads; updated cmds for FASTQ extraction from 2 BAMS; added indexing for unaligned BAM; added 4 new outputs: read1&2 unaligned, sorted_bam_unaligned, and sorted_bam_unaligned_bai
…v_illumina_pe workflow; both ran successfully with miniwdl
…, sorted_bam_unaligned, sorted_bam_unaligned_bai. tested fine w miniwdl
… out why so much data is being streamed to STDOUT and not directly to a file
…s to count reads before and after alignment & re-production of FASTQs; broke samtools sort into multiple lines for readability; added helpful echo statements throughout; fixed the samtools fastq commands that were not appropriately outputting FASTQ files
@kapsakcj kapsakcj changed the title Cjk unaligned reads output unaligned FASTQ files TheiaCov_Illumina PE and SE Feb 7, 2024
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Feb 7, 2024

NOTE: this change will impact Freyja_FASTQ workflow as it used the same bwa task

So, we are planning to skip adding these new outputs to Freyja_FASTQ workflow, but it would be good to do a functional validation test of the workflow to ensure it still runs as expected

I am worried that this may change the results/outputs of Freyja_FASTQ too, so it would be good to actually compare results instead of just a functional workflow test

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Feb 8, 2024

I launched a series of tests on some 2 sars-cov-2 Illumina samples. hoping to compare outputs between the 2 (version1.3.0 vs this dev branch.

and single end:

TODO:

  • review results, compare these between dev branch and v1.3.0 runs:
    • file sizes of aligned BAMs (should be identical or extremely close)

The BAMs are identical in size, as expected

  • file sizes of aligned FASTQs (should be identical or extremely close)

The aligned FASTQs (like the R1 file) for v1.3.0 are slightly larger than those on the dev branch. I believe this is because for the dev branch, the samtools fastq command for outputting aligned reads also requires that reads have a mate/pair where as v1.3.0 did not. So this is to be expected, but is a minor difference from v1.3.0 behavior

  • general assembly metrics (number N, assembly_length_unambiguous, VADR alerts,)

All of these QC metrics were identical. Meaning that the sorted BAM used for variant calling/consensus FASTA generation are identical to v1.3.0 and downstream analysis is not impacted by these changes. Woo! 🎉

  • other things I'm forgetting?

@kapsakcj kapsakcj marked this pull request as ready for review February 8, 2024 19:13
fi

echo -e "\ninput R1 has $(${cat_reads} ~{read1} | grep -c '^@') reads as input"
echo "input R2 has $(${cat_reads} ~{read2} | grep -c '^@') reads as input"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we uncompressing the read files just to output the read number? Seems like an unnecessary overhead on the task, taking longer to run and requiring more computational resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were useful for ensuring the task is doing what we think it is doing. @kapsakcj, maybe we remove?

-@ ~{cpu} \
-F 4 \
-b \
-o ~{samplename}.sorted.bam \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this file be renamed to ~{samplename}.sorted.aligned-reads.bam?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it as the original filename. I agree it would provide a bit more clarity to someone reading the code, but hopefully the comments are enough.

I would prefer not to make further changes unless truly necessary since we'd like to share with our partner asap

echo -e "\nGenerating FASTQs for aligned reads"
samtools fastq \
-@ ~{cpu} \
-F 4 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that the bam files were already filtered by samtools view, do we need the -F 4 flag here too? All reads in the bam file should be aligned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be unnecessary to include the -F 4 flag, I'm unsure. But it runs as expected so I'm hesitant to change & test again 😬

# note the lowercase 'f' here is imporant
samtools fastq \
-@ ~{cpu} \
-f 4 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above

@@ -52,14 +127,18 @@ task bwa {
File sorted_bai = "${samplename}.sorted.bam.bai"
File read1_aligned = "~{samplename}_R1.fastq.gz"
File? read2_aligned = "~{samplename}_R2.fastq.gz"
File read1_unaligned = "~{samplename}_unaligned_R1.fastq.gz"
File? read2_unaligned = "~{samplename}_unaligned_R2.fastq.gz"
File sorted_bam_unaligned = "~{samplename}.sorted.unaligned-reads.bam"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having an existential crisis on if we should rename sorted_bam and sorted_bai as sorted_aligned_bam and sorted_aligned_bai for readability 😅
Additionally, I would rename sorted_bam_unaligned to sorted_unaligned_bam so the variable has the file type at the end. It's a bit cleaner and it goes with or style-guide. Same for sorted_bam_unaligned_bai -> sorted_unaligned_bai

Copy link
Contributor

@jrotieno jrotieno Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I prefer names that tell a story of what has been done to the variable output or file, in which case the bam file was sorted, and then unaligned reads extracted. Happy to hear what others think and this could feed onto the style-guide.

@cimendes
Copy link
Member

cimendes commented Feb 9, 2024

Workflows using bwa:

  • TheiaCoV_Illumina_SE and PE through the wf_ivar_consensus.wdl sub-workflow
  • Freyja_FASTQ

PR #323 will implement the use of bwa on TheiaMeta_Illumina_PE (aligned bams are input for the new binning task)

@cimendes
Copy link
Member

cimendes commented Feb 9, 2024

tried running theiavalidate on two runs of Freyja_FASTQ but got stuck due to unrelated issues with this PR.

A manual size comparison of the outputted bam files confirmed that the two are of the same size:
image
image

I would like to get theiavalidate results to properly check file content but given that it's not cooperating, and this is an issue that is high priority, I'm not going to hold my review on it.

I'll wait for more feedback on the style-guide related issues I mentioned above before hitting approve but functionally this addition is working as expected for all tested workflows. Nicely done @kapsakcj and @jrotieno!

@kapsakcj kapsakcj marked this pull request as draft February 9, 2024 21:12
@kapsakcj
Copy link
Contributor Author

kapsakcj commented Feb 9, 2024

FYI @jrotieno @cimendes I've set this PR to draft until we hear feedback from our partner. Let's not merge until they test and provide feedback.

If we don't get any feedback in the next week or 2, then I say we press forward with merging if we're satisfied with our own testing.

It would be good to incorporate this into the next PHB release

@kapsakcj kapsakcj marked this pull request as ready for review February 21, 2024 16:45
@cimendes cimendes merged commit 254035d into main Feb 21, 2024
8 checks passed
@kapsakcj kapsakcj deleted the cjk-unaligned-reads branch April 17, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature request] TheiaCov: output unaligned read in FASTQ & BAM format
3 participants