Extremely Small Assembly Size Using Herro-Corrected Reads #49

Open
NancyChoudhary28 opened this issue Jan 23, 2025 · 3 comments

NancyChoudhary28 commented Jan 23, 2025

Hello,
I have ONT R9.4.1 flow cell reads that were Dorado basecalled and Herro corrected. The corrected reads file is approximately 72GB in size and contains around 2 million reads. The estimated genome size is 2.7–2.8 Gb.
When I use the Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024.conf configuration file, the resulting assembly is only 337 kb in size. However, using the same reads with the older Nanopore-May2022.conf file produces a much larger assembly of 2.7 Gb, which matches the estimated genome size.

Here are the associated AssemblySummary reports:

  • Herro_conf_AssemblySummary.html file: Herro.pdf
  • May2022_conf_AssemblySummary.html file: May2022.pdf

I observed that the number of alignments in the read graph is extremely low (only 4393) when using the Herro-specific configuration file.

Why is the assembly size so small when using the Herro-specific configuration file? Are there specific parameters in the configuration file that might be causing this issue? Could you help me resolve this and improve the assembly?
Thanks.

@kokyriakidis
Collaborator

Hey @NancyChoudhary28!

We will release the next version of Shasta in a few days.

Is it possible to share the data you used to test and optimize Shasta?

@colindaven

@NancyChoudhary28 This is older data: R9.4.1 reads are much less accurate than R10 reads before Herro correction (maybe 93-94% vs 97%, though this varies per dataset). To my knowledge, this is why dorado correct will not perform error correction on R9.4.1 data. AFAIK (I'm not a Shasta dev), the newer dorado-corrected R10 reads will be about 99-100% identical to the genome, and so meet Shasta's expectations for input read quality. I would guess your reads have a modal identity of around 96% to the genome.

You can test this by aligning, say, 100k reads against a genome (for example, your assembly from the May2022 config), then using a tool like cramino to check the actual distribution of aligned read identity.
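To make the identity check concrete, here is a minimal Python sketch of the idea. It is not cramino and does not reproduce its output. It assumes the reads were already aligned to the assembly with an aligner that writes PAF (for example minimap2), and it uses the simple ratio of PAF column 10 (matching bases) to column 11 (alignment block length) as the identity. The function names are made up for illustration, and keeping only the longest alignment per read is a simplification.

```python
from collections import Counter
from statistics import median


def read_identities(paf_lines):
    """Return {read_name: identity}, using the longest alignment of each read."""
    best = {}  # read name -> (alignment block length, identity)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        name = fields[0]
        matches = int(fields[9])     # PAF column 10: number of matching bases
        block_len = int(fields[10])  # PAF column 11: alignment block length
        if block_len == 0:
            continue
        if name not in best or block_len > best[name][0]:
            best[name] = (block_len, matches / block_len)
    return {name: identity for name, (_, identity) in best.items()}


def identity_histogram(identities):
    """Count reads per 1% identity bin; each key is the lower edge of a bin."""
    return Counter(int(value * 100) for value in identities)


if __name__ == "__main__":
    # Synthetic PAF records, for illustration only (not real data).
    example = [
        "r1\t10000\t0\t10000\t+\tctg1\t50000\t0\t10000\t9650\t10000\t60",
        "r1\t10000\t0\t2000\t+\tctg2\t50000\t0\t2000\t1800\t2000\t10",
        "r2\t8000\t0\t8000\t-\tctg1\t50000\t100\t8100\t7480\t8000\t60",
    ]
    ids = read_identities(example)
    print("median identity:", median(ids.values()))
    for lower_edge, count in sorted(identity_histogram(ids.values()).items()):
        print(f"{lower_edge}-{lower_edge + 1}%: {count} reads")
```

If most reads fall in the 93-96% bins rather than near 99%, that would support the explanation that the reads are not accurate enough for the Herro-specific configuration.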

Why not continue with your May2022 assembly and just intensively polish the contigs?

@paoloshasta
Owner

@colindaven is correct that the assembly configuration for Herro-corrected reads requires the latest ONT reads, which have much higher accuracy than your old R9 reads. For the same reason, the new Shasta release that is coming up (as mentioned by @kokyriakidis) will not help you.

Your assembly with the Nanopore-May2022 configuration is usable. The low N50 (2.8 Mb) is a consequence of low coverage (75 Gb, or about 25X for a 3 Gb genome).
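As a quick sanity check on the coverage figure, average depth is simply total sequenced bases divided by genome size. The sketch below uses the 75 Gb total and the genome sizes mentioned in this thread; the `coverage` helper is a hypothetical name used only for illustration. With these inputs the depth works out to roughly 25X to 28X.

```python
def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    total_bases = 75e9  # 75 Gb of reads, as quoted in the thread
    # Genome sizes taken from the thread: the 2.7-2.8 Gb estimate and the rounded 3 Gb.
    for genome_size in (2.7e9, 2.8e9, 3.0e9):
        depth = coverage(total_bases, genome_size)
        print(f"{genome_size / 1e9:.1f} Gb genome: {depth:.1f}X")
```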

It seems to me that you have 3 options:

  • Stay with your current assembly with the Nanopore-May2022 assembly configuration, optionally polishing with one of the available polishing tools, as suggested by @colindaven.
  • Try to get more coverage in R9 reads. This is only possible if these reads are already available somewhere, because I don't think ONT supports new R9 sequencing.
  • Repeat sequencing with the latest from ONT. Only in that case can you use the newer assembly configurations.
