Estimated genome size is half #132

gunjanpandey · 2024-06-19T06:36:28Z

I have assembled a genome that I suspect is highly heterozygous, using HiFi reads.
The assembled genome size is 8.2G, which gives the following BUSCO results.

Could you please help me understand how to analyse these results, and how to perform this analysis properly? I believe I am somehow getting a genome size estimate that is half of the real value.

image.png: https://github.com/schatzlab/genomescope/assets/50389451/a03d1e86-16b3-4df2-b86e-5dca2c0caf1d

I ran the following for the genome size estimation in GenomeScope2, using paired-end Illumina files.

meryl count k=19 output k19.meryl ${R1} ${R2}
meryl histogram k19.meryl/ > k19_meryl.hist
Rscript genomescope2.0/genomescope.R -i k19_meryl.hist -k 19 -o k19_genomescpe

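and I get the following results summary
image.png: https://github.com/schatzlab/genomescope/assets/50389451/491cc940-8a9d-4228-aab2-bc447d82a257

And the graph
image.png: https://github.com/schatzlab/genomescope/assets/50389451/22767229-74e3-48ea-91f5-02f7096a838c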

@rahulvrane, thoughts?


mschatz · 2024-06-20T03:35:58Z

Can you send the link to the genomescope webpage with your results? Sometimes the automatic modeling process gets confused and needs a hint on how to fit the model.

You report that the assembled genome size was 8.2G - is this the total amount of sequence that was assembled? If so, the difference is explained by genomescope reporting the haploid genome size, while the assembly size will be about twice this amount for highly heterozygous samples. This is because the two haplotypes assemble separately, which also produces the duplicated genes that you see in the BUSCO report. For example, for humans it reports the (haploid) genome size as 3Gbp, while a phased assembly will be about 6Gbp.

Good luck! Mike
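As a rough illustration of what "haploid genome size" means for a k-mer histogram, here is a minimal sketch of the naive estimate: total k-mer occurrences divided by the coverage of the homozygous peak. This is not GenomeScope's model fit, only a back-of-the-envelope check. The function name `naive_haploid_size` is hypothetical, the input is assumed to be a two-column histogram (frequency, number of distinct k-mers), and both the peak position and the error cutoff are values you would have to pick by eye.

```shell
# Naive haploid genome size from a two-column k-mer histogram
# (column 1 = k-mer frequency, column 2 = number of distinct k-mers).
# Assumption: hom_peak and min_freq are chosen by the user from the plot.
naive_haploid_size() {
    hist="$1"      # histogram file, e.g. the output of `meryl histogram`
    hom_peak="$2"  # assumed coverage of the homozygous peak
    min_freq="$3"  # frequencies below this are treated as sequencing errors
    awk -v peak="$hom_peak" -v min="$min_freq" '
        $1 >= min { total += $1 * $2 }   # total k-mer occurrences kept
        END       { printf "%.0f\n", total / peak }
    ' "$hist"
}
```

For example, `naive_haploid_size k19_meryl.hist 50 5` would treat a peak at 50x as homozygous and ignore k-mers seen fewer than 5 times (both numbers are placeholders). Doubling the assumed peak position halves the estimate, which is why a misplaced peak shows up as a factor-of-two difference in genome size.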


gunjanpandey · 2024-06-20T04:05:45Z

Thanks for a quick reply @mschatz

The website is giving me an error, so I am uploading the file here.
k19_meryl.zip
It is for a k-mer length of 19, from a 150 bp paired-end Illumina library, the same as for the screenshots above.

gunjanpandey · 2024-10-24T20:02:37Z

@mschatz - thoughts, please?

mschatz · 2024-10-31T03:49:40Z

Hi, I tried running your file through the website (the raw file is too big, so I selected the first 100k rows). By default, it estimates the haploid genome size to be 3.7Gb with a very high heterozygosity rate (7.53%). This is an extreme level of heterozygosity: human is about 0.1%, and an F1 of two wild strains of Arabidopsis is about 2%.
http://genomescope.org/genomescope2/analysis.php?code=hTdEo4uGrO6TsCTSfKeX

I also did a second run where I gave it a hint that the peak at 50x coverage is really the homozygous peak. I did this by setting "Average k-mer coverage for polyploid genome" to 25. This gives a nice fit for a haploid genome size of 7.3Gb with a much more reasonable heterozygosity rate of 0.1%.
http://genomescope.org/genomescope2/analysis.php?code=0TnxODdt3XSZQI9lZ4AO

From just the k-mer profile it is ambiguous which is the correct model fit (although I have seen that the lower rate of heterozygosity is more often correct). Your BUSCO results show a very high rate of duplicated BUSCO genes. This can occur when you have a heterozygous assembly, so that you get separate representations of the maternal and paternal chromosomes, but it can also suggest a whole genome duplication event.

How did you assemble the genome? Was BUSCO run on all contigs or just the primary assembly? If you extract the duplicate BUSCO genes and align them, do you see a ~7% difference rate or a ~0.1% difference rate? This can be a good clue.

Good luck! Mike
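To make the last check concrete, here is a minimal sketch of computing the difference rate between two copies of a duplicated BUSCO gene. It assumes the two copies have already been aligned with a tool of your choice, so that both strings have the same length and use "-" for gaps. The function name `aligned_pct_diff` is hypothetical, and skipping gap and N columns is one possible convention, not the only one.

```shell
# Percent difference between two pre-aligned sequences of equal length.
# Columns where either sequence has a gap ("-") or an N are skipped.
aligned_pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        a = toupper(a); b = toupper(b)
        if (length(a) != length(b)) {
            print "error: aligned sequences differ in length" > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= length(a); i++) {
            x = substr(a, i, 1); y = substr(b, i, 1)
            if (x == "-" || y == "-" || x == "N" || y == "N") continue
            compared++
            if (x != y) mismatches++
        }
        if (compared == 0) {
            print "error: no comparable columns" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", 100 * mismatches / compared
    }'
}
```

Called as `aligned_pct_diff "$copy1_aligned" "$copy2_aligned"`, it prints a single percentage. Values near 7% across many gene pairs would point toward the high-heterozygosity fit, while values near 0.1% would point toward the larger haploid genome size.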

