Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IRMA bug fixes & improvements; theiacov_illumina_pe wf updates for Flu #468

Merged
merged 24 commits into from
Jun 19, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented May 8, 2024

Starting a draft while doing testing in Terra. Will update this message periodically

This PR closes #412 closes #437 and closes #457

πŸ—‘οΈ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

The aim of this PR is to resolve a few different issues/bugs and make general improvements & upgrades to the TheiaCoV workflows for Flu analysis. Much of these changes impact the IRMA task, used in TheiaCoV_Illumina_PE wf, but some changes impact other workflows like TheiaCoV_ONT for Flu analysis.

πŸ› οΈ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

πŸ“‹ Workflow/Task Step Changes

πŸ”„ Data Processing

Docker/software or software versions changed: cdcgov/irma:v1.1.3 ➑️ "us-docker.pkg.dev/general-theiagen/cdcgov/irma:v1.1.5"

Databases or database versions changed: N/A

Data processing/commands changed: lots of changes to IRMA task & data processing

  • Fixed usage of mafft --thread flag and updated how version is captured
  • added --external-config irma_config.sh to IRMA command to use custom config file created in beginning of task.

File processing changed: lots of changes to output files & file renaming & FASTA header replacement

  • FASTA with all segments now contains headers with ~{samplename}
# example where samplename == ERR11656083-FluB-ONT_B
$ grep '>' 20240509_155808_theiacov_ont/call-irma/work/ERR11656083-FluB-ONT.irma.consensus.fasta 
>ERR11656083-FluB-ONT_B_HA
>ERR11656083-FluB-ONT_B_MP
>ERR11656083-FluB-ONT_B_NA
>ERR11656083-FluB-ONT_B_NP
>ERR11656083-FluB-ONT_B_NS
>ERR11656083-FluB-ONT_B_PA
>ERR11656083-FluB-ONT_B_PB1
>ERR11656083-FluB-ONT_B_PB2
  • Fix for empty HA segment for Flu B (Issue [IRMA] bug in output FASTA file for HA segment of Flu B samples is emptyΒ #437) on line 103. Adjusted sed command to modify file in place instead of redirecting into file.
    • Also applied same change to line 117 for renaming NA FASTA file when it includes the subtype number in output FASTA file name. This only applies to Flu A samples as B do not have subtype numbers int output HA and NA segment FASTA filenames.

Compute resources changed:

  • IRMA task reduced default cpus to 4
    • Additionally - updated irma_config.sh file created in task to include 2 variables for multi-threading IRMA
  • MAFFT task: reduced cpu to 2, memory to 8 and disk_size to 50

➑️ Inputs

  • defaults for docker and cpu were changed

⬅️ Outputs

New IRMA task outputs:

  • added seg_np_assembly & seq_ns_assembly which are the FASTA files corresponding with the NP (Nucleoprotein) and NS (nonstructural) segment
  • String irma_docker
  • String irma_subtype_notes which is either: IRMA does not differentiate Victoria and Yamagata Flu B lineages. See abricate_flu_subtype output column for Flu B samples and blank/empty for non Flu B samples.
    • (EXISTING BEHAVIOR, THIS OUTPUT HAS NOT CHANGED) String irma_subtype will either say H1N1 (or whatever Flu A subtype) for example with Flu A and for Flu B it will say No subtype predicted by IRMA
  • adding these "padded" assemblies as outputs to be passed to VADR and MAFFT (antiviral substitutions tasks). "Padded" in this context means that periods . have been replaced with N's.
    File? irma_assembly_fasta_padded = "~{samplename}.irma.consensus.pad.fasta"
    File? seg_ha_assembly_padded = "~{samplename}_HA.pad.fasta"
    File? seg_na_assembly_padded = "~{samplename}_NA.pad.fasta"
    File? seg_pa_assembly_padded = "~{samplename}_PA.pad.fasta"
    File? seg_pb1_assembly_padded = "~{samplename}_PB1.pad.fasta"
    File? seg_pb2_assembly_padded = "~{samplename}_PB2.pad.fasta"
    File? seg_mp_assembly_padded = "~{samplename}_MP.pad.fasta"
    File? seg_np_assembly_padded = "~{samplename}_NP.pad.fasta"
    File? seg_ns_assembly_padded = "~{samplename}_NS.pad.fasta"

New TheiaCoV_Illumina_PE & ONT outputs:

  • String irma_docker = irma.irma_docker
  • String? irma_subtype_notes = irma.irma_subtype_notes
  • File? irma_ha_segment_fasta = irma.seg_ha_assembly
  • File? irma_na_segment_fasta = irma.seg_na_assembly
  • File? irma_pa_segment_fasta = irma.seg_pa_assembly
  • File? irma_pb1_segment_fasta = irma.seg_pb1_assembly
  • File? irma_pb2_segment_fasta = irma.seg_pb2_assembly
  • File? irma_mp_segment_fasta = irma.seg_mp_assembly
  • File? irma_np_segment_fasta = irma.seg_np_assembly
  • File? irma_ns_segment_fasta = irma.seg_ns_assembly

⚠️ NOTE: I have opted to NOT output the padded FASTA files to the workflow level (i.e. Terra output column) as it would add lots of clutter and likely confuse the user. They are saved as intermediate files during the IRMA task and can be retrieved from the execution directory if necessary.

πŸ§ͺ Testing

Test Dataset

Described below.

Commandline Testing with MiniWDL or Cromwell (optional)

tested lots with miniwdl locally, but don't have output saved.

Terra Testing

Suggested Scenarios for Reviewer to Test

Would be good to test as many types/subtypes as possible and even test with poor quality data to see how the workflow behaves when IRMA cannot produce an assembly.

Need to test ONT as I've primarily been testing the IRMA WDL task updates with Illumina data. I tested 5 ONT samples, but would be good to do more if data is available

Theiagen Version Release Testing (optional)

πŸ”¬ Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

πŸ—‚οΈ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes. Updates to workflow diagram are not necessary

kapsakcj added 21 commits April 11, 2024 12:39
… image on our GAR. tested successfully w miniwdl
… seeing disk space, added many comments throughout, added line to insert samplename into all-segment FASTA file headers, adjusted loop for file renaming to use mv verbose flag instead, fixed cmds for renaming of HA FASTA which was previously resulting in 0 byte FASTA for Flu B HA segment, updated renaming of NA FASTA file, updated sed command for inserting samplename into NA FASTA file, updated if/elif/else block for printing subtype. ran successfully with 2 Flu A samples and 1 Flu B sample
…pe predicted by IRMA, added new string output to state that IRMA doesn't differentiate btwn yamagata and victoria, added mv -v flag for renaming BAM files that have subtype included in filename. tested successfully w miniwdl with 1 Flu A and 1 Flu B
…. Added these 3 outputs as well as the remaining segment assembly files
…m abricate instead of IRMA for determining flu nextclade values
…A unless it cannot predict subtype, then use abricate flu subtype instead
…fixed version capture bug & the usage of mafft --thread flag
…ment for custom config settings; added copies of output assemblies (all segs and individual segs) with periods replaced by Ns; tested successfully with Flu A and B samples
…utions task and VADR task; deleted 2 duplicate outputs for NA and HA FASTA files
# TODO test again and look at .pad.fa files
#echo "ALIGN_AMENDED=1" >> irma_config.sh
#echo "ASSEM_REF=1" >> irma_config.sh
#echo "PADDED_CONSENSUS=1" >> irma_config.sh
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving these lines commented out in case we want to revisit in the future when working on this task

@kapsakcj kapsakcj marked this pull request as ready for review June 12, 2024 18:44
@kapsakcj
Copy link
Contributor Author

OK PR is ready for review.

I'll update the documentation and check the above box when I'm finished.

@sage-wright FYI there was one additional commit made after you forked your branch, you may want to merge that in if you are working on theiacov_ont workflow

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Jun 12, 2024

docs have been updated πŸ‘ (mostly updates to theiacov inputs and outputs table)

@sage-wright
Copy link
Member

TheiaProk_Illumina_PE here success
TheiaProk_ONT here success

@kapsakcj
Copy link
Contributor Author

I relaunched Sage's tests since they were run prior to the last 2 commits. Same samples, call caching off, both run on the cjk-irma-diskcheck branch

⚠️ Need to review outputs, esp. the sample that previously had a - in the output HA segment FASTA file and all segment FASTA file

@cimendes
Copy link
Member

cimendes commented Jun 19, 2024

Code changes look solid! πŸ…

New outputs present as expected:
image

Launching a new set of tests as I don't have access to Sage's workspace:

@cimendes
Copy link
Member

My tests were successful and the workflow is working as expected. @jrotieno you're the main reviewer here but you got my okay!

@cimendes cimendes requested review from jrotieno and cimendes June 19, 2024 10:01
@cimendes cimendes merged commit aabd3b9 into main Jun 19, 2024
12 checks passed
@cimendes cimendes deleted the cjk-irma-diskcheck branch June 19, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants