understanding cause of poor model fit of heterozygote peak #144
Thanks for your interest. I agree the model fit diverges from the observed
data, but this is not uncommon as the modeling expects an idealized
coverage distribution. We have seen that some fish genomes have certain
repeats that can be a little tricky for HiFi, but given the level of
coverage you have here I would nevertheless expect a good assembly. For
HiFi data I would recommend using the hifiasm genome assembler. You may
also want to check out the pipeline we developed for VGP and the associated
workflow we have in Galaxy that uses hifiasm plus a few pre- and
post-assembly tools for QC and packaging:
https://www.nature.com/articles/s41587-023-02100-3
Good luck!
Mike
On Mon, Sep 30, 2024 at 4:35 PM andrbern8000 wrote:
Good afternoon,
I am assembling fish genomes de novo using HiFi data and have run into
issues with a few of my target species (all diploid).
First, to better understand the size and heterozygosity of the genome and
to confirm our estimates of sequence coverage, I ran meryl (default
settings for 'count' and 'histogram', k = 21) and GenomeScope2 (default
settings, k = 21).
The summary output of the GenomeScope2 model fit was not too bad (~73-89%,
see below), but when the results were visualized, the observed k-mer
frequencies (blue line) for the 'heterozygote' peak did not match the
distribution estimated using the full model (black line). Basically, the
observed peak spans a much wider coverage range than the full model peak.
I am wondering what may be driving this difference between the observed
data and the full model (e.g., sequencing errors?) and whether it is a
cause for concern (i.e., a data issue that needs to be addressed prior to
assembly). Should I adjust some of the GenomeScope2 parameters?
I am very new to genome assembly and would appreciate any advice you (or
anyone else) might have.
Thanks,
Andrea
GenomeScope version 2
p = 2
k = 21
property; min; max
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
Model Fit; 73.1021%; 88.551%
Read Error Rate; 0.460545%; 0.460545%
cc_meryl_genomescope2_k21.png (view on web)
<https://github.com/user-attachments/assets/77c6a160-317b-4c39-af2c-7792cc3b993e>
This comes up in tricky cases where it is ambiguous whether the genome has a
smaller haploid genome size with a high rate of heterozygosity or a larger
genome size with a lower rate of heterozygosity. In your data the options
are 1 Gbp / 0.66% with an average coverage of 17, or 532 Mbp / 2.42% with an
average coverage of 31. GenomeScope uses a heuristic to decide, and it is
sensitive to the shape of the peaks. Changing the k-mer size distorts the
shape of the peaks, so it flips between these two estimates. You can also
force it to pick one of these versions by setting the parameter
"Average k-mer coverage for polyploid genome" (here you could set this to
either 17 or 31 to force it into one of these modes).
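To illustrate why the two estimates are hard to tell apart, here is a minimal Python sketch using only the numbers quoted above. It assumes the simplified relationship that estimated genome size times average k-mer coverage approximates the total number of genomic k-mers in the histogram; the function name is hypothetical and this is not GenomeScope's actual fitting code.

```python
def implied_total_kmers(haploid_size_bp: float, avg_kmer_coverage: float) -> float:
    """Total genomic k-mers implied by a (genome size, coverage) pair.

    Simplifying assumption: genome size ~= total k-mers / coverage.
    """
    return haploid_size_bp * avg_kmer_coverage


# The two candidate interpretations quoted in the reply above.
large_low_het = implied_total_kmers(1.0e9, 17)   # 1 Gbp / 0.66% heterozygosity
small_high_het = implied_total_kmers(532e6, 31)  # 532 Mbp / 2.42% heterozygosity

print(f"1 Gbp at 17x    -> {large_low_het:.3e} k-mers")
print(f"532 Mbp at 31x  -> {small_high_het:.3e} k-mers")
print(f"ratio           -> {large_low_het / small_high_het:.3f}")
```

Both interpretations account for nearly the same number of k-mers, so the choice comes down to which peak is treated as the heterozygous one. Roughly halving the assumed coverage roughly doubles the size estimate.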
Fortunately, you know the haploid genome size is about 500 Mbp, so we can
assume the 532 Mbp / 2.42% version is correct. With this rate of
heterozygosity the two haplotypes will largely be separated by an
assembler like hifiasm, and you should see widespread gene duplicates in
BUSCO. You can further confirm this estimate by aligning the duplicate
genes to each other and confirming the divergence rate is about 2.42%.
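As a rough sketch of the divergence check described above, the following Python function computes percent divergence between two sequences that have already been aligned (for example, two copies of a BUSCO gene reported as duplicated). The function name and the choice to skip gap and N columns are assumptions made for illustration; any aligner can produce the input.

```python
def percent_divergence(aligned_a: str, aligned_b: str) -> float:
    """Percent of mismatching columns between two pre-aligned sequences.

    Columns containing a gap ('-') or an ambiguous base ('N') in either
    sequence are skipped, so only substitutions are counted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-N" or b in "-N":
            continue  # ignore indels and ambiguous bases
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * mismatches / compared


# Toy example: one substitution in ten comparable columns.
print(percent_divergence("ACGTACGTAC", "ACGTTCGTAC"))
```

Averaged over many duplicated genes, a value near 2.42% would support the 532 Mbp interpretation. Coding sequence is usually more conserved than the genome-wide average, so a somewhat lower value would not be surprising.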
Hope this helps!
Mike
On Fri, Oct 4, 2024 at 2:52 PM andrbern8000 wrote:
Hi Mike,
Thanks for your response. I have reviewed/gone through the VGP pipeline on
Galaxy using the sample data. It is a wonderful training manual/tutorial.
Thank you! I’ve also been using hifiasm to perform some preliminary
assemblies on the HiFi datasets that show clean k-mer profiles in
GenomeScope2.
We will likely be obtaining Hi-C data to assist in the assemblies, and I’m
hoping this will help improve the quality.
I’m sorry to trouble you, but I have another fish that has generated
problematic k-mer profiles with HiFi data in GenomeScope2. I’d
appreciate any feedback on this issue as well.
First: Genome size estimates of congeners (c-values) suggest these fishes
should have a haploid genome size of ~500-600Mbp.
Regarding the kmer profile:
The heterozygous peak is a poor fit for the data (which I now know is
okay), but the real issue is that as I adjust the k-mer size from 21 to 31,
the estimated haploid genome size almost doubles. I am assuming the cause
of this change is the high estimated heterozygosity of the genome.
For instance, could it be that as the k-mer size increases (k = 21 to 31),
more k-mers are being identified as ‘unique’ rather than as the
heterozygous counterpart to an existing (and previously identified) k-mer,
so that the haploid genome size increases? Is this potentially the issue
I’m experiencing? Or is there another (more obvious) issue that I’ve
missed? Will this lead to assembly issues that I should look out for?
Again, help/advice would be appreciated.
Andrea
See linear kmer plots and summary stats below.
elongatus_K21_linear_plot.png (view on web)
<https://github.com/user-attachments/assets/e3caffae-161d-493b-857f-fc384ab856b2>
image.png (view on web)
<https://github.com/user-attachments/assets/992c62d5-c117-41db-bbca-34e546ad8b77>
elongatus_K31_linear_plot.png (view on web)
<https://github.com/user-attachments/assets/8bfd5de0-a563-41ed-9f19-688ae3fa154d>
image.png (view on web)
<https://github.com/user-attachments/assets/490cf9d7-203b-4afc-a1d8-454bd991351a>