Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TheiaProk] Add emmtyper task for Streptococcus pyogenes #524

Merged
merged 6 commits into from
Jul 1, 2024

Conversation

sam-baird
Copy link
Contributor

@sam-baird sam-baird commented Jun 28, 2024

This PR closes #413.

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

This PR adds emmtyper to the TheiaProk workflows to perform emm gene typing on Streptococcus pyogenes assemblies. TheiaProk_Illumina_PE_PHB already contains a similar task using the read-based emmtyping_tool, which is only compatible with paired-end reads. Since emmtyper works on assemblies, it is compatible with the paired-end, single-end, FASTA, and ONT versions of the TheiaProk workflows. Also, there are sometimes differences between the two tools' outputs (see validation test results below), so I think it is useful to make both tools available.

🛠️ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, there are now a few additional outputs that will be filled if the species is identified as Streptococcus pyogenes, but the change will not affect existing results.

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

📋 Workflow/Task Step Changes

task_emmtyper.wdl already existed in this repo, but was not being used in any workflows. I made some changes to the existing task and updated the TheiaProk workflows to use the task.

🔄 Data Processing

Docker/software or software versions changed: None

Databases or database versions changed: None

Data processing/commands changed: The task now parses and outputs the "final" emm type from the output TSV. The task now uses --output-format verbose TSV format instead of the default short format.

File processing changed: None

Compute resources changed: None

➡️ Inputs

None.

⬅️ Outputs

The TheiaProk workflows have these new outputs: emmtyper_emm_type, emmtyper_results_tsv, emmtyper_version, and emmtyper_docker.

🧪 Testing

Test Dataset

The workflows were tested using 32-sample validation set of isolates with known emm types provided by the CDC Strep Lab. The reads were fetched from NCBI using the SRA_Fetch_PHB workflow.

Commandline Testing with MiniWDL or Cromwell (optional)

None.

Terra Testing

I performed Terra testing for the TheiaProk_Illumina_PE_PHB, TheiaProk_Illumina_SE_PHB, and TheiaProk_FASTA_PHB workflows in separate data tables with call caching turned off. For testing TheiaProk_Illumina_SE_PHB, only read1 was used. For testing TheiaProk_FASTA_PHB, the assembly_fasta outputs from TheiaProk_Illumina_PE were used.

image

Comparison table of emm type results

BioSample Expected emm type TheiaProk PE emmtyper TheiaProk PE emmtypingtool TheiaProk SE emmtyper TheiaProk FASTA emmtyper
SAMN26504544 83.1 EMM83.1 emm83.1.sds EMM83.1 EMM83.1
SAMN26504545 76.0 EMM76.0 emm76.0.sds EMM76.0 EMM76.0
SAMN26504546 4.0 EMM4.0 emm4.0.sds EMM4.0 EMM4.0
SAMN26504547 73.7 EMM73.7 emm73 ⚠️ EMM73.7 EMM73.7
SAMN26504548 12.0 EMM12.0 emm12.0.sds EMM12.0 EMM12.0
SAMN26504549 4.0 EMM4.0 emm4.0.sds EMM4.0 EMM4.0
SAMN26504550 183.1 EMM183.1 emm183.1.sds/emm149.0.sds ⚠️ EMM183.1 EMM183.1
SAMN26504551 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504568 28.0 EMM28.0 emm28.0.sds EMM28.0 EMM28.0
SAMN26504569 2.0 EMM2.0 emm2.0.sds EMM2.0 EMM2.0
SAMN26504552 12.0 EMM12.0 emm12.0.sds EMM12.0 EMM12.0
SAMN26504553 89.0 EMM89.0 emm89.0.sds EMM89.0 EMM89.0
SAMN26504554 89.0 EMM89.0 emm89.0.sds EMM89.0 EMM89.0
SAMN26504570 2.0 EMM2.0 emm2.0.sds EMM2.0 EMM2.0
SAMN26504571 2.0 EMM2.0 emm2.0.sds EMM2.0 EMM2.0
SAMN26504555 75.0 EMM75.0 emm75.0.sds EMM75.0 EMM75.0
SAMN26504556 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504557 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504558 92.0 EMM92.0 emm92.0.sds EMM92.0 EMM92.0
SAMN26504559 92.0 EMM92.0 emm92.0.sds EMM92.0 EMM92.0
SAMN26504560 90.2 EMM90.2 emm90.2.sds EMM90.2 EMM90.2
SAMN26504562 59.0 EMM59.0 emm59.0.sds EMM59.0 EMM59.0
SAMN26504563 83.1 EMM83.1 emm83.1.sds EMM83.1 EMM83.1
SAMN26504564 77.0 EMM77.0 emm77.0.sds EMM77.0 EMM77.0
SAMN26504572 11.0 EMM11.0 emm11.0.sds EMM11.0 EMM11.0
SAMN26504573 11.0 EMM11.0 emm11.0.sds EMM11.0 EMM11.0
SAMN26504565 28.0 EMM28.0 emm28.0.sds EMM28.0 EMM28.0
SAMN26504566 92.0 EMM92.0 emm92.0.sds EMM92.0 EMM92.0
SAMN26504567 73.0 EMM73.0 emm73.0.sds EMM73.0 EMM73.0
SAMN26504574 1.0 EMM1.0 emm1.0.sds EMM1.0 EMM1.0
SAMN26504575 6.4 EMM6.4 emm6.4.sds EMM6.4 EMM6.4
SAMN26504576 89.0 EMM89.0 emm89.0.sds EMM89.0 EMM89.0

⚠️: SAMN26504547 and SAMN26504550 showed different results between emmtypingtool and emmtyper. Specifically, emmtypingtool did not report the subtype for SAMN26504547, reporting "73" instead of "73.7". emmtypingtool reported two subtypes "183.1/149.0" instead of "183.1" for SAMN26504550. The SAMN26504547 emmtypingtool BAM alignment showed a 3-bp deletion when aligned to emm73.7. This looks to be an artifact caused by emmtypingtool adding 100bp flanking sequences from a reference genome to the 180bp trimmed fragment. The added flanking sequences appear to have caused a change in alignment for a majority of the reads resulting in the 3 bp deletion at the 3' end of the 180 bp region. For SAMN26504550, the extra emm type 149.0 is a known emm-like gene (according to the source code for emmtyper), so this looks like 149.0 should have been excluded.

At least for this dataset, emmtyper seemed to be more accurate since it matched all of the expected emm types.

Suggested Scenarios for Reviewer to Test

I did not test TheiaProk_ONT_PHB workflow.

Theiagen Version Release Testing (optional)

🔬 Final Developer Checklist

  • The workflow/task has been tested locally and results, including file contents, are as anticipated
  • The workflow/task has been tested on Terra and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
  • Code changes follow the style guide

🎯 Reviewer Checklist

  • All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
  • All reviewer-suggested scenarios have been tested and any additional
  • All changed results have been confirmed to be accurate
  • All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

  • Relevant documentation on the Public Health Resources "PHB Main" has been updated
  • Workflow diagrams have been updated to reflect changes

@sage-wright
Copy link
Member

Hey Sam,

Thanks for this really well written PR! I'm currently running some additional tests on Terra (including TheiaProk_ONT!), and I'm planning on merging this PR upon their successful completion/results!

@sage-wright sage-wright self-requested a review July 1, 2024 15:53
Copy link
Member

@sage-wright sage-wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, Sam! Thank you so much!

@sage-wright sage-wright merged commit 1caf96f into theiagen:main Jul 1, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TheiaProk] Add emmtyper tool to GAS TheiaProk track
2 participants