[TheiaProk] Add emmtyper task for Streptococcus pyogenes #524
This PR closes #413.
🗑️ This dev branch should be deleted after merging to main.
🧠 Aim, Context and Functionality
This PR adds emmtyper to the TheiaProk workflows to perform emm gene typing on Streptococcus pyogenes assemblies. TheiaProk_Illumina_PE_PHB already contains a similar task using the read-based emmtyping_tool, which is only compatible with paired-end reads. Since emmtyper works on assemblies, it is compatible with the paired-end, single-end, FASTA, and ONT versions of the TheiaProk workflows. Also, there are sometimes differences between the two tools' outputs (see validation test results below), so I think it is useful to make both tools available.
🛠️ Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: Yes. There are now a few additional outputs that are populated when the species is identified as Streptococcus pyogenes, but the change does not affect existing results.
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No
📋 Workflow/Task Step Changes
task_emmtyper.wdl already existed in this repo but was not being used in any workflows. I made some changes to the existing task and updated the TheiaProk workflows to use it.
🔄 Data Processing
Docker/software or software versions changed: None
Databases or database versions changed: None
Data processing/commands changed: The task now parses and outputs the "final" emm type from the output TSV. The task now uses the --output-format verbose TSV format instead of the default short format.
File processing changed: None
Compute resources changed: None
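The parsing step described under "Data processing/commands changed" can be sketched roughly as follows. This is an illustration only, not the code in task_emmtyper.wdl: the function name parse_emm_type, the example row, and the assumption that the predicted emm type sits in the fourth tab-separated field of the verbose output are all hypothetical and should be checked against the emmtyper version in use.

```python
import csv
import io

# Assumption (not confirmed by this PR): in emmtyper's verbose TSV output the
# predicted emm type is the fourth tab-separated field, i.e. index 3.
PREDICTED_EMM_TYPE_COLUMN = 3


def parse_emm_type(tsv_text: str, column: int = PREDICTED_EMM_TYPE_COLUMN) -> str:
    """Return the predicted emm type from the first non-empty row of the TSV.

    Returns an empty string when the TSV has no result rows, and raises
    ValueError when the first result row has too few fields.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines so a trailing newline does not count as a result.
        if not any(field.strip() for field in row):
            continue
        if len(row) <= column:
            raise ValueError(f"expected at least {column + 1} fields, got {len(row)}")
        return row[column].strip()
    return ""


# Hypothetical verbose-format row; the values are made up for illustration.
example_tsv = "sample1\t2\t1\tEMM1.0\tpos\t\t\tA-C3\n"
emm_type = parse_emm_type(example_tsv)
```

Keeping the column index as a parameter makes the sketch easy to adapt if the field layout differs from the assumption above.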
➡️ Inputs
None.
⬅️ Outputs
The TheiaProk workflows have these new outputs: emmtyper_emm_type, emmtyper_results_tsv, emmtyper_version, and emmtyper_docker.
🧪 Testing
Test Dataset
The workflows were tested using a 32-sample validation set of isolates with known emm types provided by the CDC Strep Lab. The reads were fetched from NCBI using the SRA_Fetch_PHB workflow.
Commandline Testing with MiniWDL or Cromwell (optional)
None.
Terra Testing
I performed Terra testing for the TheiaProk_Illumina_PE_PHB, TheiaProk_Illumina_SE_PHB, and TheiaProk_FASTA_PHB workflows in separate data tables with call caching turned off. For testing TheiaProk_Illumina_SE_PHB, only read1 was used. For testing TheiaProk_FASTA_PHB, the assembly_fasta outputs from TheiaProk_Illumina_PE were used.
Comparison table of emm type results
At least for this dataset, emmtyper seemed to be more accurate since it matched all of the expected emm types.
Suggested Scenarios for Reviewer to Test
I did not test the TheiaProk_ONT_PHB workflow, so reviewers may want to run it on a Streptococcus pyogenes sample.
Theiagen Version Release Testing (optional)
🔬 Final Developer Checklist
🎯 Reviewer Checklist
🗂️ Associated Documentation (to be completed by Theiagen developer)