Add Snippy_Variants QC outputs to Snippy_Tree and Snippy_Streamline workflow outputs #592

Merged
merged 16 commits into from
Nov 6, 2024


@jrotieno jrotieno commented Aug 23, 2024

This PR closes #353

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR adds the Snippy_Variants QC output metrics to the Snippy_Tree and Snippy_Streamline workflows. They are useful for assessing the quality of the samples included in the phylogenetic tree as well as the alignment quality.

⚡ Impacted Workflows/Tasks

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

⚙️ Algorithm

The Snippy_Variants workflow produces the following QC outputs:
Int snippy_variants_num_reads_aligned
Int snippy_variants_num_variants
File snippy_variants_coverage_tsv
Float snippy_variants_percent_ref_coverage

The goal was to have all these values on a single line per sample included in Snippy_Tree or Snippy_Streamline.

The snippy_variants_coverage_tsv file looks like this:

#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
AE003852 1 2961182 2376498 2615376 88.322 114.7 34.2 58.6

A typical output of all the Snippy_Variants QC metrics would therefore look like this:

samplename reads_aligned_to_reference variants_total percent_ref_coverage #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
ERR10913488 3504551 84125 84.5073 AE003852 1 2961182 2376498 2615376 88.322 114.7 34.2 58.6

However, for a pathogen with two chromosomes, such as V. cholerae, the snippy_variants_coverage_tsv output looks like this:

#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
AE003852 1 2961182 2376498 2615376 88.322 114.7 34.2 58.6
AE003853 1 1072319 654185 830592 77.4575 87.0957 34.2 59

In such cases, the mapping information for the second chromosome is appended after that of the first chromosome. The implementation handles as many chromosomes as there are in the reference FASTA file used for read mapping:

samplename reads_aligned_to_reference variants_total percent_ref_coverage #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
ERR10913488 3504551 84125 84.5073 AE003852 1 2961182 2376498 2615376 88.322 114.7 34.2 58.6 AE003853 1 1072319 654185 830592 77.4575 87.0957 34.2 59
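
The flattening described above can be sketched in Python as follows. This is an illustration only, not the code from the PR: the function name `flatten_coverage` and its parameters are hypothetical, and the coverage file is assumed to be tab-delimited with a single header line.

```python
import csv
import io


def flatten_coverage(samplename, reads_aligned, variants_total,
                     percent_ref_coverage, coverage_tsv_text):
    """Return (header, row) with every contig's coverage fields appended.

    Hypothetical helper: assumes a tab-delimited coverage table whose
    first non-blank line is the header (starting with '#rname').
    """
    rows = [r for r in csv.reader(io.StringIO(coverage_tsv_text), delimiter="\t") if r]
    cov_header, contigs = rows[0], rows[1:]

    header = ["samplename", "reads_aligned_to_reference",
              "variants_total", "percent_ref_coverage"]
    row = [samplename, str(reads_aligned), str(variants_total),
           str(percent_ref_coverage)]

    # Repeat the coverage header once per contig so that header and row
    # stay the same length, however many chromosomes the reference has.
    for contig in contigs:
        header.extend(cov_header)
        row.extend(contig)
    return header, row
```

With a two-chromosome reference this yields 4 sample-level columns followed by 9 coverage columns per chromosome.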

Because the Snippy_Tree and Snippy_Streamline workflows are set-level, the per-sample QC results are combined into a single TSV file with one sample per row, which allows comparison across samples. For example:

samplename reads_aligned_to_reference variants_total percent_ref_coverage #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
ERR10913488 3504551 84125 84.5073 AE003852 1 2961182 2376498 2615376 88.322 114.7 34.2 58.6 AE003853 1 1072319 654185 830592 77.4575 87.0957 34.2 59
ERR10913507 3103515 48106 88.2036 AE003852 1 2961182 2004119 2660682 89.852 97.4687 34 58.7 AE003853 1 1072319 611866 923528 86.1244 82.1481 34 59.3
ERR11679333 3263308 181 99.0544 AE003852 1 2961182 2385804 2945659 99.4758 118.083 32.9 58.5 AE003853 1 1072319 772214 1052113 98.1157 105.556 32.9 58.7
ERR117609 2729668 133 99.2926 AE003852 1 2961182 1247903 2958059 99.8945 31.5749 39.3 57.2 AE003853 1 1072319 461573 1053057 98.2037 32.251 39.3 57.7
ERR117612 2803563 138 99.3674 AE003852 1 2961182 1665351 2958119 99.8966 42.052 39.2 57.9 AE003853 1 1072319 603311 1055497 98.4313 42.0585 39.2 57.3
ERR3039947 228407 16 70.6079 AE003852 1 2961182 169379 2957678 99.8817 12.5272 34.7 58.4 AE003853 1 1072319 56924 1071811 99.9526 11.7021 34.7 59.2
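
A minimal sketch of this set-level combination step, assuming each per-sample QC file has one header line and one data row, and that all samples were mapped to the same reference (so the headers are identical). The function name `combine_qc_metrics` is hypothetical and is not taken from the workflow code.

```python
def combine_qc_metrics(per_sample_texts):
    """Merge per-sample QC TSV texts into one table with a single header.

    Hypothetical helper: keeps the first header seen and appends the
    data rows of every sample beneath it.
    """
    header = None
    data_lines = []
    for text in per_sample_texts:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue  # skip empty per-sample files
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            # Different headers would mean different references or contig counts.
            raise ValueError("QC headers differ between samples")
        data_lines.extend(lines[1:])
    if header is None:
        return ""
    return "\n".join([header] + data_lines) + "\n"
```

Raising on a header mismatch is a design choice for this sketch; silently concatenating rows with different column counts would make the combined table hard to read.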

➡️ Inputs

⬅️ Outputs

A new output file, snippy_variants_qc_metrics, for the Snippy_Variants workflow, and snippy_combined_qc_metrics for the Snippy_Tree and Snippy_Streamline workflows.

🧪 Testing

Snippy_Variants: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/2d851d39-78cd-444f-952d-60ab77e7db77

Snippy_Streamline: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/bfd8044f-1cfa-4945-bc09-383cfa6bc2d8

Suggested Scenarios for Reviewer to Test

Single chromosome pathogen

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@jrotieno jrotieno marked this pull request as ready for review August 23, 2024 12:48
@jrotieno jrotieno requested a review from a team as a code owner August 23, 2024 12:48
@sage-wright

Please confirm that any documentation updates you made were incorporated into the new docs, and if not, could you please add them? Thanks!

@fraser-combe

fraser-combe commented Oct 17, 2024

Summary of updates

The branch was merged with the main branch so that the docs could be updated.

In snippy_tree.wdl
Ensured the concatenated_file_name includes the .tsv extension by setting it to tree_name_updated + "_combined_qc_metrics.tsv".

Updated the Outputs section in the snippy_streamline and snippy_variants documentation:

Added snippy_combined_qc_metrics:
Included in the outputs table with a detailed description and list of column headers.

In snippy_variants.wdl

Updated code to handle insufficient data: added a condition to handle cases where the coverage file may be empty or have insufficient data.

Ensured percent_reads_aligned is included in the QC output for individual samples and in the combined QC metrics TSV file.
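
One way such a guard could look, as a sketch only. The actual condition in snippy_variants.wdl is not shown in this thread, so the function name `read_coverage_rows` and the rule "fewer than two non-blank lines means insufficient data" are assumptions.

```python
def read_coverage_rows(coverage_tsv_text):
    """Return the per-contig rows of a coverage table, or [] if unusable.

    Assumption: the table is tab-delimited with one header line; an empty
    file or a header-only file counts as insufficient data.
    """
    lines = [ln for ln in coverage_tsv_text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return []
    n_cols = len(lines[0].split("\t"))
    rows = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        # Drop truncated rows instead of emitting a misaligned QC line.
        if len(fields) == n_cols:
            rows.append(fields)
    return rows
```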

Sample-Level Summary Table

Below is an example of the combined QC metrics output:

samplename reads_aligned_to_reference total_reads percent_reads_aligned variants_total percent_ref_coverage
ERR10913488 2,968,531 3,426,094 86.64 84,115 84.49
ERR10913507 2,585,768 3,066,589 84.32 48,092 88.19
ERR11679333 3,158,188 3,263,308 96.78 181 99.05
ERR117609 1,709,494 2,729,668 62.63 133 99.29
ERR117612 2,268,685 2,803,563 80.92 138 99.37
ERR3039947 226,317 228,407 99.08 16 70.61

See terra output for combined qc metrics here - https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/6b595a20-3ed0-40e3-889d-655dc70d07c8/e99c632b-e07a-4afa-a46c-86726144d72d

  • This used the same data as previously run above in PR and includes the new column percent_reads_aligned
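
The values in the table are consistent with percent_reads_aligned being reads_aligned_to_reference divided by total_reads, expressed as a percentage and rounded to two decimals. A small sketch under that assumption (the function name and the zero-read fallback are hypothetical, not taken from the workflow):

```python
def percent_reads_aligned(reads_aligned, total_reads):
    """Percentage of reads aligned to the reference, rounded to 2 decimals.

    Assumption: a sample with no reads reports 0.0 rather than failing.
    """
    if total_reads <= 0:
        return 0.0
    return round(100.0 * reads_aligned / total_reads, 2)


# Values taken from the ERR10913488 row of the table above.
print(percent_reads_aligned(2968531, 3426094))  # 86.64
```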


@sage-wright sage-wright left a comment


This looks great! Could you please:

  • Add the QC TSV output to Snippy_Streamline_FASTA as well?
    • If so, add the output to the Snippy_Streamline_FASTA documentation
  • Add the output description in the Snippy_Variants documentation (snippy_variants.md)

I also want to hear back from Andrew on line 89 in task_snippy_variants.wdl before making a final review!

Great work!

@fraser-combe

Updated with changes

@sage-wright

sage-wright commented Oct 24, 2024

Testing Snippy_Streamline here and Snippy_Streamline_FASTA here

@fraser-combe fraser-combe dismissed sage-wright’s stale review November 5, 2024 16:39

correct docs first


@sage-wright sage-wright left a comment


@sage-wright sage-wright merged commit 4bcf3f2 into main Nov 6, 2024
8 checks passed
@sage-wright sage-wright deleted the jro-snippy-variants-qc-outputs branch November 6, 2024 16:59
Successfully merging this pull request may close these issues.

[Snippy_Streamline] Add Snippy_variants QC outputs
4 participants