[ONT] Remove KMC #578

Merged
merged 2 commits into from
Aug 9, 2024

sage-wright
Member

@sage-wright sage-wright commented Aug 7, 2024

This PR closes #203

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR removes the KMC module from the TheiaCoV and TheiaProk ONT workflows. TheiaProk ONT previously relied on the genome length estimated by KMC; in its place, I have added a default genome length of 5 Mb. This is about 0.7 Mb larger than the average bacterial genome length, estimated as the mean genome length of all bacteria included in this file.

⚡ Impacted Workflows/Tasks

TheiaProk_ONT and TheiaCoV_ONT

This PR may lead to different results in pre-existing outputs: Yes. Generated assemblies may differ because the genome length used in RASUSA downsampling has changed.

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

⚙️ Algorithm

  • 5 Mb is now the default genome size in the read_qc_trim_ont subworkflow.
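To illustrate why the genome length matters for downsampling: coverage-based subsampling keeps roughly `coverage × genome_length` bases, so the assumed genome length directly sets how many reads survive. The sketch below is a minimal illustration of that arithmetic only. The function names and the 100x coverage in the usage note are hypothetical and are not taken from the PHB workflows or from RASUSA's code.

```python
# Sketch only: names and values here are illustrative assumptions,
# not the actual read_qc_trim_ont or RASUSA implementation.
DEFAULT_GENOME_LENGTH = 5_000_000  # 5 Mb default introduced by this PR


def target_bases(coverage: float, genome_length: int = DEFAULT_GENOME_LENGTH) -> int:
    """Number of bases to keep so that depth is roughly `coverage`."""
    return int(coverage * genome_length)


def keep_fraction(total_bases: int, coverage: float,
                  genome_length: int = DEFAULT_GENOME_LENGTH) -> float:
    """Fraction of sequenced bases retained; 1.0 means no downsampling is needed."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return min(1.0, target_bases(coverage, genome_length) / total_bases)
```

With the 5 Mb default and a hypothetical 100x target, 500 Mb of reads are kept. For an organism with a 2.9 Mb genome, that corresponds to an effective depth of roughly 172x rather than 100x, which is why assemblies may differ from runs that used a per-sample estimate.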

➡️ Inputs

No

⬅️ Outputs

KMC outputs have been removed.

🧪 Testing

Test on TheiaProk here

Test on TheiaCoV (flu) here

Suggested Scenarios for Reviewer to Test

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@sage-wright sage-wright marked this pull request as ready for review August 7, 2024 16:58
@sage-wright sage-wright requested a review from a team as a code owner August 7, 2024 16:58
@cimendes cimendes self-requested a review August 8, 2024 13:17
@cimendes
Member

cimendes commented Aug 8, 2024

TheiaProk_ONT_PHB Test:

| entity:theiaprok_ont_nokmc_id | expected_length | expected_taxon | read1 |
| --- | --- | --- | --- |
| ERR8958704 | 5700000 | Klebsiella pneumoniae | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958704_1.fastq.gz |
| ERR8958706 | 5700000 | Klebsiella pneumoniae | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958706_1.fastq.gz |
| ERR8958710 | 5700000 | Klebsiella pneumoniae | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958710_1.fastq.gz |
| ERR8958833 | 2900000 | Staphylococcus aureus | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958833_1.fastq.gz |
| ERR8958851 | 2900000 | Staphylococcus aureus | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958851_1.fastq.gz |
| ERR8958852 | 2900000 | Staphylococcus aureus | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958852_1.fastq.gz |
| ERR8958857 | 6200000 | Pseudomonas aeruginosa | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958857_1.fastq.gz |
| ERR8958858 | 6200000 | Pseudomonas aeruginosa | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/ERR8958858_1.fastq.gz |
| SAMN04961841 | 5000000 | Salmonella enterica | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/SRR19768527_1.fastq.gz |
| SAMN05250424 | 5000000 | Salmonella enterica | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/SRR19768533_1.fastq.gz |
| SAMN05596277 | 5000000 | Salmonella enterica | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/SRR19768539_1.fastq.gz |
| SAMN20849361 | 4800000 | Shigella sonnei | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/SRR18254047_1.fastq.gz |
| SAMN22657138 | 4800000 | Shigella sonnei | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/SRR18254048_1.fastq.gz |
| SAMN22962688 | 4800000 | Shigella sonnei | gs://theiagen-large-public-files-rp/terra/phb-validation/theiaprok-ont-reads/SRR18254049_1.fastq.gz |

@cimendes
Member

cimendes commented Aug 9, 2024

The run succeeded, and I've now compared the assembly lengths produced against the expected genome sizes. The differences are small (5% or less), except for Shigella sonnei, where the assemblies were around 10-12% larger than expected:

| entity:theiaprok_ont_nokmc_id | assembly_length | expected_length | Difference (%) | expected_taxon | gambit_predicted_taxon |
| --- | --- | --- | --- | --- | --- |
| ERR8958704 | 5754162 | 5700000 | 1% | Klebsiella pneumoniae | Klebsiella pneumoniae |
| ERR8958706 | 5725948 | 5700000 | 0% | Klebsiella pneumoniae | Klebsiella pneumoniae |
| ERR8958710 | 5755096 | 5700000 | 1% | Klebsiella pneumoniae | Klebsiella pneumoniae |
| ERR8958833 | 2902609 | 2900000 | 0% | Staphylococcus aureus | Staphylococcus aureus |
| ERR8958851 | 2902596 | 2900000 | 0% | Staphylococcus aureus | Staphylococcus aureus |
| ERR8958852 | 2902610 | 2900000 | 0% | Staphylococcus aureus | Staphylococcus aureus |
| ERR8958857 | 6281842 | 6200000 | 1% | Pseudomonas aeruginosa | Pseudomonas aeruginosa |
| ERR8958858 | 6294596 | 6200000 | 2% | Pseudomonas aeruginosa | Pseudomonas aeruginosa |
| SAMN04961841 | 4833663 | 5000000 | 3% | Salmonella enterica | Salmonella enterica |
| SAMN05250424 | 4744535 | 5000000 | 5% | Salmonella enterica | Salmonella enterica |
| SAMN05596277 | 4786893 | 5000000 | 4% | Salmonella enterica | Salmonella enterica |
| SAMN20849361 | 5275674 | 4800000 | 10% | Shigella sonnei | Shigella sonnei |
| SAMN22657138 | 5356338 | 4800000 | 12% | Shigella sonnei | Shigella sonnei |
| SAMN22962688 | 5295788 | 4800000 | 10% | Shigella sonnei | Shigella sonnei |
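For anyone wanting to reproduce the Difference (%) column: the values appear to be the absolute difference between assembly_length and expected_length, relative to expected_length and rounded to the nearest percent (the Salmonella enterica assemblies are smaller than expected but still show positive values). The helper below is a sketch under that assumption; the function name is hypothetical and the original spreadsheet formula is not shown in this thread.

```python
def percent_difference(assembly_length: int, expected_length: int) -> int:
    """Absolute difference from the expected length, as a whole-number percentage."""
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    return round(abs(assembly_length - expected_length) / expected_length * 100)


# A few rows from the table above: (assembly_length, expected_length)
rows = {
    "ERR8958706": (5725948, 5700000),    # Klebsiella pneumoniae
    "SAMN05250424": (4744535, 5000000),  # Salmonella enterica (smaller than expected)
    "SAMN22657138": (5356338, 4800000),  # Shigella sonnei
}
for sample_id, (assembly_length, expected_length) in rows.items():
    print(sample_id, f"{percent_difference(assembly_length, expected_length)}%")
```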

The reason for this is beyond the scope of this PR. The BUSCO duplication score is not awful.
image

The results are concordant with what was obtained with PHB v2.1.0 (https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_v2-1-0/data)
image

@cimendes
Member

cimendes commented Aug 9, 2024

@sage-wright Can you update the documentation while I view the results from TheiaCoV ONT?

@cimendes
Member

cimendes commented Aug 9, 2024

@sage-wright
Member Author

@cimendes, documentation has been updated!

@cimendes cimendes merged commit a30cccb into main Aug 9, 2024
12 checks passed
@cimendes cimendes deleted the smw-genome-size-dev branch August 9, 2024 13:35
Development

Successfully merging this pull request may close these issues.

Genome Size Estimation - Ways to improve
2 participants