
Sudden stop while running a tweaked Herro configuration #48

Open
sivico26 opened this issue Jan 23, 2025 · 22 comments

Comments

@sivico26

Hello there,

I am giving Shasta a try for my repetitive plant genome. I have good coverage (~70x) of Herro-corrected reads, but my reads are far from ultra-long.

Anyway, I have made a couple of successful but disappointing runs, starting from the Herro configuration file and tweaking some parameters. My latest run did not finish and I do not know why. Here is the log file for that one:

stdout.log.txt

Do you know what could be happening?

I would also like to take the opportunity to ask you about a couple more things:

  1. Which parameters would you focus on or adjust for high-quality (Herro-corrected) but shorter reads? So far I am just guessing at what sounds reasonable to me.
  2. Can I turn off the phased assembly feature? I know that, given the length of my reads, I won't get good phasing, but I am mainly interested in a haploid assembly. Is there a way to tell Shasta that this is the case, so it can focus on producing a (likely more contiguous) haploid assembly?
@sivico26 changed the title from "Sudden while running a tweaked Herro configuration" to "Sudden stop while running a tweaked Herro configuration" on Jan 23, 2025
@kokyriakidis
Collaborator

Hey @sivico26!

We will release the next version of Shasta in a few days.

Would it be possible to share the data, so we can use it to test and optimize Shasta?

@sivico26
Author

Hi @kokyriakidis,

Thanks for the offer! The dataset is a bit big, but I think I can arrange it. Where should I send it? To the email on your profile, or do you have another preference?

Cheers

@kokyriakidis
Collaborator

You can send me a link pointing to the data at kokyriakidis@gmail.com

@paoloshasta
Owner

The upcoming Shasta release mentioned by @kokyriakidis will not include an assembly configuration for the non-UL version of the latest ONT reads. But producing such an assembly configuration is high on our priority list, and it will certainly help if @kokyriakidis can get access to your reads.

There is no way to tell the assembler to produce a haploid assembly, but you could try a couple of things to achieve that:

  • Use Bandage to manually remove one haplotype of each diploid bubble. Then, in Bandage, use "Edit/Merge all possible nodes". It may also be possible to write a script to achieve the same.
  • Disable the generation of anchors that have coverage indicative of a single copy (sorry, we have not yet documented the internals of Shasta Mode 3 assembly, which are still fluid). With Shasta 0.13.0, use --Assembly.mode3.minPrimaryCoverage and specify a value roughly equal to 1.5 times your coverage per haplotype (but you may have to experiment with that; see the sketch below). For Shasta 0.14.0 the name of that option changes to --Assembly.mode3.minAnchorCoverage. I have never tried something like this - it might work, but no guarantees.
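
For the second option, a minimal sketch of what such an invocation might look like. The input file name, config path, and the value 50 are illustrative assumptions only, the latter derived from the ~70x total coverage mentioned above (~35x per haplotype, times 1.5):

# Hypothetical Shasta 0.13.0 invocation (untested): raise the minimum anchor coverage
# so that anchors at single-copy (per-haplotype) coverage are suppressed.
# 1.5 * ~35x per-haplotype coverage is roughly 50.
shasta \
    --input herro_corrected_reads.fasta \
    --config tweaked_Herro.conf \
    --Assembly.mode3.minPrimaryCoverage 50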

Your crash is probably a memory problem. For the coverage used in this assembly (217 Gb), I expect that you will need somewhere between 1 and 2 TB. How much memory does the machine you ran on have? If it is a memory problem, and if your system uses SSD storage (as opposed to spinning disk), you can try running with binary data on disk via the following assembly options: --memoryMode filesystem --memoryBacking disk. This will slow down the assembly a lot, but it may allow the run to complete even when the required amount of memory is not available. More information on that here.
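
As a concrete sketch of those options (the input file name and config path are placeholders, not the exact command you used):

# Hypothetical invocation keeping Shasta's binary data on SSD-backed disk instead of
# in memory. Much slower, but it may complete on a machine with less RAM.
shasta \
    --input herro_corrected_reads.fasta \
    --config tweaked_Herro.conf \
    --memoryMode filesystem \
    --memoryBacking disk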

@sivico26
Author

Hi @kokyriakidis, you should already have received an email with a link to download the data. Let me know if you got it.

@paoloshasta, thank you for the recommendations! I will tweak the anchoring and see how it goes.

This job was running on a machine with 2 TB of RAM, so I would be surprised if memory became an issue. It is possible, though.

I will add that I made another, almost identical run where the behavior was certainly not normal compared to the others. As I said, I have made assemblies with this configuration and other tweaks, and a normal run takes between 5 and 9 hours to complete (and around 1.4 TB of RAM). The original run (the log file I attached above) died after 5-6 hours, probably because it ran out of memory, as you said. But when I changed --ReadGraph.strandSeparationMethod to 2, it did not die; instead it ran for a very long time (I let it go until it died at 30 hours), on a single thread for most of that time (probably in the last step it was computing when it died).

Here is the log for that run (in case you are interested): shasta_Herro.o8384138.txt

@paoloshasta
Owner

There is definitely an interaction between strand separation and Mode 3 assembly. @kokyriakidis has made huge improvements in this area recently, and the upcoming release includes a new read graph creation method, written entirely by him, that also does strand separation, in addition to other nice things such as break detection and adaptivity. For now this only works with Herro reads, but that is what you are interested in.

Regarding a haploid version of the assembly: for one of your assemblies that completed, look at the "Total non-trivial bubble chain" figure reported near the end of stdout.log. That gives you an idea of the "haploid" contiguity of your genome and should be longer than the assembly N50. A bubble chain is an uninterrupted sequence of bubbles (usually all diploid).
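
If it helps, one quick way to find that figure in a completed run, assuming the default ShastaRun output directory and that the statistic name appears verbatim in the log:

# Hypothetical: pull the bubble chain contiguity line out of a completed run's log.
grep -i "non-trivial bubble chain" ShastaRun/stdout.log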

@kokyriakidis
Collaborator

@sivico26 Yes, I received your data. I will try to work on them this week and see how we can improve your results.

@paoloshasta
Owner

Shasta 0.14.0 is out. You could try --config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Jan2025 on your reads. It should do much better than 0.13.0, but we have only tested it for Ultra-Long reads so far. Shorter reads are high on our priority list.
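
A minimal sketch of such a run; the input file name and thread count are placeholders, not tested recommendations:

# Hypothetical Shasta 0.14.0 invocation with the new Herro configuration.
shasta \
    --input herro_corrected_reads.fasta \
    --config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Jan2025 \
    --threads 64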

@sivico26
Author

Thanks for the heads up @paoloshasta, and congrats to you and @kokyriakidis! I will give it a try and report back what I find.

@sivico26
Author

sivico26 commented Feb 5, 2025

Hello again @paoloshasta and @kokyriakidis,

I have tried the latest Shasta v0.14 with some modifications of the config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Jan2025. I have not gotten it right yet, so I wanted to ask you about it.

Since the config is intended for ultra-long reads, I figured the main adjustment to make, at least to start, is to relax the length requirements. To me that translated into reducing the required number of aligned markers, --Align.minAlignedMarkerCount. So I tried a couple of runs reducing this number to 250 and 150.

In both instances, I got barely anything assembled:

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
21      21      6       8133    182623  220169  326102  283931  627694  3444702 results_marker150/ShastaRun/Assembly.fasta
17      17      4       1023    196150  238747  326102  310255  627694  2738119 results_marker250/ShastaRun/Assembly.fasta

Here are the logs for both runs:

If I look into them, the critical step seems to be this one (taken from marker150.log, lines 3209-3215):

2025-Jan-31 15:27:45.019206 306700000/306764311
The alignmentTable has 2353 entries.
The alignmentTableNotPassFilter has 304019621 entries.
Number of reads: 12588286
Number of oriented reads: 25176572
Finding strict disjointSets step: Keeping 2136 alignments of 306764311
The read graph has 25176572 vertices and 4272 edges.

Somehow all of those alignments result in very few entries in the alignment table. I imagine they do not meet the minimum criteria and are filtered out. Is this the case? What else can I tweak to prevent this? Should I go further down with the number of markers? It already feels low.

Any thoughts are welcome.

P.S. While writing this I had the epiphany that I can also increase --Kmers.probability. That is what I am testing right now. I will report how that goes.

@paoloshasta
Owner

As you guessed, this means that the alignment criteria are too strict, and it could be a question of length, or accuracy, or both. If length is responsible for that, decreasing --Align.minAlignedMarkerCount is exactly the right thing to try. That configuration sets it to 500, which at marker density 0.05 (--Kmers.probability) corresponds to 10 Kb. This means that for an alignment to be accepted and used, the overlapping portion must be at least 10 Kb long (more likely 12 Kb, because of aligned markers missing due to errors). Given that your read N50 is 20 Kb, the values 250 and 150 you tried are much more reasonable.

Since that by itself did not improve things, it could mean that there is also an accuracy issue. Were you able to assess the accuracy of your reads by independent means, for example by mapping? If the reads have lower accuracy than the ONT dataset, you should experiment with reducing --Align.minAlignedMarkerFraction, which is the minimum fraction of aligned markers required for an alignment to be kept.

As you experiment, keep in mind that a healthy assembly will have 6 or more alignments per read. Below that, the amount of assembled sequence and its N50 start to decrease rapidly. You can also look at the fraction of isolated reads in the read graph, as reported in AssemblySummary.html. For a healthy assembly that should be no more than 20% or 30%.

Experimenting with --Kmers.probability, the marker density, is a reasonable thing to do. If you do that, keep in mind that the average distance between markers also changes, which means that quantities expressed in markers, like --Align.minAlignedMarkerCount, may also need to be readjusted; see the sketch below.
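
To illustrate the rescaling (values are illustrative, not tuned recommendations): doubling the marker density from 0.05 to 0.10 halves the average marker spacing, so the ~5 Kb minimum overlap implied by 250 markers at density 0.05 corresponds to roughly 500 markers at density 0.10. The input file name below is a placeholder.

# Hypothetical example: scale --Align.minAlignedMarkerCount together with the marker
# density so the implied minimum overlap length (~5 Kb here) stays the same.
shasta \
    --input herro_corrected_reads.fasta \
    --config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Jan2025 \
    --Kmers.probability 0.1 \
    --Align.minAlignedMarkerCount 500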

Please continue to report your findings. We will also start a similar optimization process soon for non-UL reads, and hopefully, by combining your work with ours, we can generate a new assembly configuration for non-UL reads to be included in the next release.

@kokyriakidis
Collaborator

@sivico26 Try increasing --ReadGraph.WThreshold to 1e+10, 1e+30, 1e+50, or even higher. This will allow more alignments (from reads with more SNPs) to get into the read graph.

@paoloshasta
Owner

paoloshasta commented Feb 5, 2025

Actually, what I said above is wrong: you have plenty of alignments (at least in the assemblies with the reduced --Align.minAlignedMarkerCount), but most of them are not used to create the read graph. For example, from marker150.log:

Found and stored 306764311 good alignments.
...
Adding alignments for break bridging to connect endNodes in different disjointSets: Keeping 53246 alignments of 306764311

So the problem is in the new read graph creation method we just introduced in Shasta 0.14.0 (--ReadGraph.creationMethod 4). There are a few parameters that control that, but @kokyriakidis, who wrote that code, will be the best person to advise on that.

@paoloshasta
Owner

The fact that --ReadGraph.WThreshold may need adjustment could be an indication that your reads have lower accuracy than the Herro dataset released by ONT in May 2024, which was used to optimize this assembly configuration. If possible, it would be good to assess the accuracy of your dataset by mapping to a reference. It is possible that Herro does not do as well for shorter reads.

@sivico26
Author

sivico26 commented Feb 6, 2025

Thank you both for your feedback. I am already trying some runs adjusting --ReadGraph.WThreshold.

As a disclaimer, my data is not comparable to the ONT May 2024 dataset. First of all, that dataset used an experimental basecaller model that has not been released and is only available upon request. Secondly, my data was generated slightly before the transition from 4 kHz to 5 kHz, which indeed means lower accuracy than the latest R10.4.1 data. Still, I thought I had more than enough depth for Herro to compensate for it.

That being said, I will try to assess the data quality as you suggest and report back what I find.

@kokyriakidis
Collaborator

Hey @sivico26,

Now that I have thought about it more, I think you should keep the default WThreshold and instead change --ReadGraph.epsilon, setting it to a value representative of the quality of the data.

For example, if you have Q25 reads you should set it to 3e-3. Keep the default value of --ReadGraph.WThreshold, or change it minimally to --ReadGraph.WThreshold 1e-7 or --ReadGraph.WThreshold 1e-6, but only after you change --ReadGraph.epsilon.

You can have a look at this image as a reference.

Image
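
For reference, a sketch of how the epsilon value might be chosen, assuming --ReadGraph.epsilon represents the expected per-base error rate, so that a Phred-style quality Q corresponds to 10^(-Q/10) (consistent with the Q25 -> 3e-3 example above); the input file name and WThreshold value are placeholders:

# Hypothetical invocation: set epsilon from an estimated read error rate.
# Q25 -> 10^(-25/10), about 3.2e-3; Q20 -> 1e-2.
shasta \
    --input herro_corrected_reads.fasta \
    --config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Jan2025 \
    --ReadGraph.epsilon 3e-3 \
    --ReadGraph.WThreshold 1e-7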

@paoloshasta
Owner

I generally don't trust quality scores much. I think it is better to do this based on the actual estimated accuracy of the reads, obtained from mapping or in other ways that don't use quality scores.

@sivico26
Author

sivico26 commented Feb 6, 2025

Interesting @kokyriakidis, I will try that as well.

You might still be interested in the results of varying --ReadGraph.WThreshold, so here they are:

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
7697    7409    943     501     1175428 2111731 3420985 2668289 14.75e6 6.579e9 results_marker250_kp10_We10/ShastaRun/Assembly.fasta
7523    7214    914     501     1213377 2178843 3575755 2744046 14.75e6 6.591e9 results_marker250_kp10_We30/ShastaRun/Assembly.fasta
7433    7145    912     501     1216094 2178845 3594848 2749824 14.75e6 6.588e9 results_marker250_kp10_We50/ShastaRun/Assembly.fasta

At least they produce reasonable assemblies now, though still diploid. I am unsure whether they align with your expectations.

@paoloshasta
Owner

2 Mb N50, phased, is not bad for non-UL reads. Can you post a view of the assembly graph in Bandage?

@sivico26
Author

sivico26 commented Feb 6, 2025

@paoloshasta, the last one looks like this. Based on the statistics, the others should be similar.

Image

@kokyriakidis
Collaborator

If you make the change I proposed, it will be a lot better :-)

@sivico26
Author

sivico26 commented Feb 6, 2025

@kokyriakidis, already on their way!
