Sudden stop while running a tweaked Herro configuration #48
Hey @sivico26! We will release the next version of Shasta in a few days. Is it possible to share the data you used to test and optimize Shasta?
Hi @kokyriakidis, Thanks for the offer! It is a bit big, but I think I can arrange it. Where should I send it? The email on your profile, or do you have another preference? Cheers
You can send me a link pointing to the data at kokyriakidis@gmail.com
The upcoming Shasta release mentioned by @kokyriakidis will not include an assembly configuration for the non-UL version of the latest ONT reads. But producing such an assembly configuration is high on our priority list, and it will certainly help if @kokyriakidis can get access to your reads. There is no way to tell the assembler to produce a haploid assembly, but you could try a couple of things to achieve that:
Your crash is probably a memory problem. For the coverage used in this assembly (217 Gb), I expect that you will need somewhere between 1 and 2 TB. How much memory does the machine that you ran on have? If it is a memory problem and your system uses SSD storage (as opposed to spinning disk), you can try running with binary data on disk via the following assembly options:
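As a sketch, an invocation using disk-backed binary data might look like the fragment below. The `--memoryMode` and `--memoryBacking` option names are taken from the Shasta documentation; the input file, config name, and thread count are placeholders, not values from this thread.

```shell
# Hypothetical Shasta run keeping binary data on SSD-backed disk instead
# of in huge-page RAM. Slower, but avoids out-of-memory crashes.
shasta \
    --input reads.fasta \
    --config <your-Herro-based-config> \
    --memoryMode filesystem \
    --memoryBacking disk \
    --threads 32
```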
Hi @kokyriakidis, you should have received an email to download the data already. Let me know if you got it. @paoloshasta, thank you for the recommendations! I will tweak the anchoring and see how it goes. The machine for this job had 2 TB of RAM, so I would be surprised if that became an issue. Possible, though. I will add that I made another, almost identical run where the behavior was certainly not normal compared to the others. As I said, I have made assemblies with this configuration and other tweaks, and a normal run takes between 5-9 h to complete (and around 1.4 TB of RAM). The original one (the log file I attached above) died after 5-6 h, probably because it ran short of memory, as you said. But when I changed

Here is the log for that run (in case you are interested): shasta_Herro.o8384138.txt
There is definitely interaction between strand separation and Mode 3 assembly. @kokyriakidis has made huge improvements on this recently, and the upcoming release includes a new read graph creation method written entirely by him that also does strand separation, in addition to other nice things such as break detection and adaptivity. For now this only works with Herro reads, but that is what you are interested in. Regarding a haploid version of the assembly: for one of your assemblies that completed, look at the
@sivico26 Yes, I received your data. I will try to work on them this week and see how we can improve your results.
Shasta 0.14.0 is out. You could try
Thanks for the heads up @paoloshasta, and congrats to you and @kokyriakidis! I will give it a try and report back what I find.
Hello again @paoloshasta and @kokyriakidis, I have tried the latest Shasta. Since the config is designed for ultra-long reads, I thought the main adjustment to make, at least to start, was precisely to relax the length requirements. To me that translated into reducing the number of markers required

In both instances, I got barely anything assembled:
Here are the logs for both runs. If I look into them, the critical step seems to be here (taken from marker150.log[L3209-L3215]):
Somehow all those alignments result in very few entries in the alignment table. I imagine they do not meet the minimum criteria and are filtered out. Is this the case? What can I keep tweaking to prevent this? Going further down with the number of markers? It feels low already. Any thoughts are welcome.

P.S. While writing this I had the epiphany that I can also increase
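For context, the kind of override being discussed might look like the fragment below on the Shasta command line. The option name is from the Shasta documentation; the value 150 mirrors the `marker150` run label above and is an assumption, not a recommendation, and the input and config are placeholders.

```shell
# Hypothetical override relaxing the minimum marker count an alignment
# must have to be kept (lower = more permissive for shorter reads).
shasta \
    --input reads.fasta \
    --config <your-Herro-based-config> \
    --Align.minAlignedMarkerCount 150
```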
As you guessed, this means that the alignment criteria are too strict, and it could be a question of length or accuracy or both. If length is responsible for that, decreasing

Since that by itself did not improve things, it could mean that there is also an accuracy issue. Were you able to assess the accuracy of your reads by independent means, for example by mapping? If the reads have lower accuracy than the ONT dataset, you should experiment with reducing

As you experiment, keep in mind that a healthy assembly will have 6 or more alignments per read. Below that, the amount of sequence assembled and its N50 will start to decrease rapidly. You can also look at the fraction of isolated reads in the read graph, as reported in

Experimenting with

Please continue to report your findings. We will also start a similar optimization process soon for non-UL reads, and hopefully by combining your work with ours we can generate a new assembly configuration for non-UL reads to be included in the next release.
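The "6 or more alignments per read" rule of thumb above can be checked directly from a run's summary numbers. A minimal sketch, assuming only that each alignment involves two reads (the counts below are invented for illustration):

```python
# Rough health check for an assembly run: each alignment involves two
# reads, so the average alignments-per-read is 2 * alignments / reads.

def alignments_per_read(n_alignments: int, n_reads: int) -> float:
    """Average number of alignments each read participates in."""
    return 2 * n_alignments / n_reads

avg = alignments_per_read(n_alignments=9_000_000, n_reads=3_000_000)
print(f"{avg:.1f} alignments per read")  # prints "6.0 alignments per read"
if avg < 6:
    print("Alignment criteria may be too strict; consider relaxing them.")
```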
@sivico26 Try to increase
Actually what I said above is wrong. Because you have plenty of alignments (at least in the assemblies with reduced
So the problem is in the new read graph creation method we just introduced in Shasta 0.14.0 ( |
The fact that
Thank you both for your feedback. I am already trying some runs adjusting

As a disclaimer, my data is not comparable to the ONT May 2024 dataset. First of all, that dataset used an experimental basecaller model that has not been released and is only available upon request. Secondly, my data was generated slightly before the transition from 4 kHz to 5 kHz, which indeed means lower accuracy than the latest R10.4.1 data. Still, I thought I had more than enough depth for Herro to compensate for it. That being said, I will try to assess the data quality as you suggest and report back what I find.
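One common way to assess read accuracy independently of quality scores, as suggested above, is to map the reads to a reference (for example with `minimap2 -x map-ont`, which emits PAF) and compute per-alignment identity from PAF columns 10 and 11 (residue matches and alignment block length). A sketch, with a synthetic PAF line for illustration:

```python
# Estimate read identity from a minimap2 PAF record.
# PAF columns (1-based): 10 = number of residue matches,
#                        11 = alignment block length.

def paf_identity(paf_line: str) -> float:
    """Fraction of matching bases over the alignment block."""
    fields = paf_line.rstrip("\n").split("\t")
    matches = int(fields[9])      # column 10
    block_len = int(fields[10])   # column 11
    return matches / block_len

# Synthetic record: 19500 matches over a 19800 bp alignment block.
line = "read1\t20000\t100\t19900\t+\tchr1\t50000000\t1000\t20800\t19500\t19800\t60"
print(f"identity = {paf_identity(line):.3f}")  # prints "identity = 0.985"
```

Averaging this over many primary alignments gives an empirical accuracy estimate that can be compared against what the quality scores claim.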
Hey @sivico26, Now that I have thought about it more, I think you should keep the default WThreshold and change

For example, if you have Q25 reads you should set it to

You can have a look at this image as a reference.
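The mapping from a Phred quality score to an expected error rate, which the Q25 suggestion above relies on, is the standard Phred relation (error rate = 10^(-Q/10)); only the example Q values below are mine:

```python
# Phred scale: a quality score Q corresponds to an expected
# per-base error rate of 10 ** (-Q / 10).

def phred_to_error_rate(q: float) -> float:
    return 10 ** (-q / 10)

for q in (20, 25, 30):
    print(f"Q{q}: error rate {phred_to_error_rate(q):.4f}")
# Q25 corresponds to an error rate of about 0.0032 (~0.32%).
```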
I generally don't trust quality scores much. I think it is better to do this based on the actual estimated quality of the reads, from mapping or in other ways that don't use quality scores.
Interesting, @kokyriakidis, I will try that as well. You might still be interested in the results of varying
At least they produce reasonable assemblies now, though diploid ones. I am unsure if they align with your expectations.
2 Mb N50, phased, is not bad for non-UL reads. Can you post a view of the assembly graph in Bandage?
@paoloshasta, the last one looks like this. Based on the statistics, the other should be similar.
If you do the change I proposed, it will be a lot better :-)
@kokyriakidis, already on their way!
Hello there,
I am giving Shasta a try for my repetitive plant genome. I have good coverage (~70x) of Herro-corrected reads, but my reads are far from ultra-long.
Anyway, I have made a couple of successful but disappointing runs, starting from the Herro configuration file and tweaking some parameters. My latest did not finish and I do not know why. Here is the log file for that one:
stdout.log.txt
Do you know what could be happening?
I would take the opportunity to ask you about a couple more things: