Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Flu NA Nextclade outputs to theiacov_illumina_pe #406

Merged
merged 2 commits into from
Apr 11, 2024

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Apr 10, 2024

This PR closes #405

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

The TheiaCoV_Illumina_PE_PHB and TheiaCoV_ONT_PHB workflows run Nextclade on the NA segment, which provides a clade for this segment, but this output is not populated back to the Terra data table unfortunately. You will have to look into the output files on the step that is specific to running nextclade on the NA segment.

Additionally, the output files produced by nextclade (auspice_input_json, nextclade_json, nextclade_tsv) are not populated back to the Terra data table either.

And lastly, the nextclade_qc output string is also not populated to the Terra data table.

🛠️ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No

Impacted workflows:

  • TheiaCoV_Illumina_PE_PHB
  • TheiaCoV_ONT_PHB

I don't believe this impacts TheiaCoV_FASTA_PHB workflow, as the user must provide either HA or NA for the optional input called flu_segment. TheiaCoV_FASTA_PHB is not currently set up to run both HA and NA segments through in parallel. It only runs nextclade once. So I have skipped making these changes to that workflow because that will be a much more involved change that requires extensive dev and testing.

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: N/A

Databases or database versions changed: N/A

Data processing/commands changed: N/A

File processing changed: N/A

Compute resources changed: N/A

➡️ Inputs

N/A

⬅️ Outputs

Added 5 new outputs to the TheiaCoV_Illumina_PE_PHB and TheiaCoV_ONT_PHB workflows:

  • String nextclade_json_flu_na = select_first([nextclade_flu_na.nextclade_json, ""])
  • String auspice_json_flu_na = select_first([nextclade_flu_na.auspice_json, ""])
  • String nextclade_tsv_flu_na = select_first([nextclade_flu_na.nextclade_tsv, ""])
  • String nextclade_clade_flu_na = select_first([nextclade_output_parser_flu_na.nextclade_clade, ""])
  • String? nextclade_qc_flu_na = nextclade_output_parser_flu_na.nextclade_qc

🧪 Testing

Test Dataset

I used the validation sets for ILMN PE and ONT workflows. I added 2 new H1N1 ONT datasets since the 5 we have currently available are not the best.

Commandline Testing with MiniWDL or Cromwell (optional)

Skipped testing locally except for TheiaCoV_ONT, which was successful (log not shown)

Terra Testing

See below comments for links and screenshots

Suggested Scenarios for Reviewer to Test

Test the 2 impacted workflows with Flu samples

Theiagen Version Release Testing (optional)

  • Will changes require functional or validation testing (checking outputs etc) during the release? validation
  • Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen. I used 2 H1N1 ONT samples from this publication: https://link.springer.com/article/10.1186/s12879-020-05367-y . They have been added to the correct bucket and Terra data table TSV is shared via slack. We can update the PHB_Validation_TEMPLATE workspace data table during validation efforts
  • Are there any output files that should be checked after running the version release testing? nextclade output TSV, nextclade output JSON, and auspice_input_json

🔬 Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes

@kapsakcj
Copy link
Contributor Author

kapsakcj commented Apr 10, 2024

New outputs are populated properly for TheiaCoV_Illumina_PE with Flu:
image

Successful test that produced these outputs is here: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history/8fd7b653-eaf1-4bb0-86f6-9f5191e343fe

TODO: look at TheiaCoV_ONT test workflow and see what is going on there.

2 outright failures (likely caused by low quality input data?) and the 3 that succeeded look like the quality of the assemblies was not high enough to pass nextclade's default QC parameters. Here's one example:

In sequence #0 'ERR6359498_A_NA_N1': When processing sequence #0 'ERR6359498_A_NA_N1': When calculating seed matches: Unable to align: seed alignment covers 3.88% of the query sequence, which is less than expected 10.00% (configurable using 'min seed cover' CLI flag or dataset property). This is likely due to low quality of the provided sequence, or due to using incorrect reference sequence.. Note that this sequence will not be included in the results.

SO I went looking for higher-quality ONT Flu data. Results to come in the next comment!

@kapsakcj
Copy link
Contributor Author

Hoping these 2 ONT Flu samples will succeed (they did at the commandline w miniwdl), just had to fix the optional inputs (mainly the "organism" input):

https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history/df910abc-1502-46b8-973a-30e59062dfd6

🤞

@kapsakcj kapsakcj marked this pull request as ready for review April 11, 2024 14:12
@kapsakcj
Copy link
Contributor Author

Here's the new outputs for TheiaCoV_ONT. The 2 new H1N1 samples seem to be of higher quality and thus resulting assemblies were better, and ran more smoothly through nextclade compared to our pre-existing validation datasets for ONT Flu samples:

image

@cimendes
Copy link
Member

Looks amazing @kapsakcj!!! 🌟
I visually inspected the code changes and the outputs of your tests, and everything looks amazing!

@cimendes cimendes self-requested a review April 11, 2024 14:32
@cimendes cimendes merged commit 084ddc3 into main Apr 11, 2024
8 checks passed
@cimendes cimendes deleted the cjk-flu-na-outputs branch April 11, 2024 14:33
@cimendes
Copy link
Member

I had ideas on how we could add some more clarity to the output variables. Left my thoughts on #409

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nextclade Flu NA outputs not populated to Terra data table
2 participants