No variants called in any haplotype when SNV is not in linkage with other SNVs? #18

Open
AdmiralenOla opened this issue Mar 25, 2022 · 3 comments

@AdmiralenOla

Dear CliqueSNV team,

I've been experimenting with your tool and think perhaps I have found a bug. If there is a single, isolated SNV with no other SNVs in linkage within the mapping reads, i.e. distance to other SNVs is greater than read length, that SNV is never called in any of the haplotypes.

I'm trying to understand the algorithm described in your paper, and I guess this makes sense, because these SNVs are not in cliques with any other SNVs(?) But for some types of data it will mean that common haplotypes will not be present in the results. In one of my examples, there is a clear 45/55% distribution between C and T at a particular site, and the total read depth is around 30,000.
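To make the notion of an "isolated" SNV concrete, here is a minimal sketch that flags SNV positions no single read can cover together with another SNV. It is only an illustration and not part of CliqueSNV: the function name `isolated_snvs` is hypothetical, and it assumes single-end reads of one fixed length (paired-end reads would extend the reachable distance).

```python
def isolated_snvs(positions, read_length):
    """Return SNV positions that cannot share a read with any other SNV.

    Assumption: single-end reads of fixed length `read_length`. A read
    covers `read_length` consecutive bases, so two SNVs can sit on the
    same read only if their distance is below `read_length`.
    """
    pos = sorted(positions)
    isolated = []
    for i, p in enumerate(pos):
        # Distances to the nearest SNV on each side (if there is one).
        gaps = []
        if i > 0:
            gaps.append(p - pos[i - 1])
        if i < len(pos) - 1:
            gaps.append(pos[i + 1] - p)
        # Isolated when every neighbour is too far away to share a read.
        if all(gap >= read_length for gap in gaps):
            isolated.append(p)
    return isolated


# Hypothetical SNV coordinates and read length, for illustration only.
print(isolated_snvs([100, 150, 1000, 5000, 5100], read_length=250))
```

With these made-up coordinates, only the SNV at position 1000 is flagged, which is the kind of site that ends up missing from the reported haplotypes.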

Graphic presentation of my problem:
[Image: unlinked_SNV_cliquesnv_problem]

I can provide bam files for testing if you'd like.

@AdmiralenOla
Author

Upon closer reading of your paper I see that you are aware of this already:

Another limitation is for variants that differ only by isolated SNVs separated by long conserved genomic regions longer than the read length which may not be accurately inferred by CliqueSNV. While such situations usually do not occur for viruses, where mutations are typically densely concentrated in different genomic regions, we plan to address this limitation in the next version of CliqueSNV.

Is this still planned for an upcoming version?

@vtsyvina
Owner

vtsyvina commented Apr 7, 2022

Hello, @AdmiralenOla

Sorry for the late reply. Yes, we are aware of this behavior. It's not really clear what to do in such cases. Some samples may have plenty of isolated SNVs. For example, if you look at coronavirus data, the genome is pretty long and mutations are spread over longer ranges than one read can cover. And we may have sites with 5-50% variant frequency.

Should we try to attach such "orphan" mutations? To which haplotype, then? Unfortunately, it is not clear where to get the information needed to make this decision.

Even if we have just two pairs of linked SNVs far from each other, it is not clear whether they come from the same haplotype or from different ones. So we report two haplotypes.
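The ambiguity can be shown with a small sketch. Suppose two distant regions each carry a variant with a known frequency, and no read spans both. The fraction of genomes carrying both variants is then only bounded, not determined. The function name `joint_frequency_bounds` is made up for this illustration; it just applies the standard bounds on a joint frequency given its two marginals.

```python
def joint_frequency_bounds(freq_a, freq_b):
    """Range of possible frequencies for genomes carrying BOTH variants.

    freq_a, freq_b: variant frequencies observed in two regions that
    no read covers simultaneously. Without linking reads, any joint
    frequency between the two bounds fits the data equally well.
    """
    lower = max(0.0, freq_a + freq_b - 1.0)  # overlap forced by the marginals
    upper = min(freq_a, freq_b)              # both variants always co-occur
    return lower, upper


lower, upper = joint_frequency_bounds(0.4, 0.4)
print(f"joint frequency can be anywhere in [{lower}, {upper}]")
```

For two variants at 40% each, the upper bound corresponds to a single haplotype carrying both variants, and the lower bound to two separate haplotypes carrying one variant each. The reads alone cannot tell these apart.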

Those are shortcomings of the technology.

@AdmiralenOla
Author

Thanks for your reply, @vtsyvina. I agree that this is a limitation in the technology.

However, you noted in your paper that you had a plan to address this limitation, and that got me curious. For example, some type of probabilistic framework that assumes the haplotypes have similar SNV frequency at all sites may in some cases be used to assign full-length haplotypes.
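As a rough illustration of the kind of heuristic meant here (not a full probabilistic model, and not something CliqueSNV implements), the sketch below attaches an orphan SNV to the haplotype whose estimated frequency is closest to the SNV's allele frequency, and refuses to decide when the match is poor or ambiguous. The function name `assign_orphan`, the haplotype labels, and the tolerance `tol` are all assumptions made for the example.

```python
def assign_orphan(hap_freqs, snv_freq, tol=0.05):
    """Pick the haplotype whose frequency best matches an orphan SNV.

    hap_freqs: dict mapping haplotype label -> estimated frequency.
    snv_freq:  allele frequency of the isolated SNV.
    tol:       maximum accepted mismatch, and minimum margin over the
               runner-up (an arbitrary value chosen for illustration).
    Returns a label, or None when no confident assignment is possible.
    """
    if not hap_freqs:
        return None
    # Rank haplotypes by how far their frequency is from the SNV's.
    ranked = sorted((abs(freq - snv_freq), name) for name, freq in hap_freqs.items())
    best_diff, best_name = ranked[0]
    if best_diff > tol:
        return None  # no haplotype has a similar frequency
    if len(ranked) > 1 and ranked[1][0] - best_diff < tol:
        return None  # two haplotypes fit about equally well
    return best_name


# Hypothetical two-haplotype sample resembling the 45/55% case.
print(assign_orphan({"hap_major": 0.55, "hap_minor": 0.45}, snv_freq=0.45))
```

This simple version assumes the SNV belongs to exactly one haplotype. If the variant is shared by several haplotypes, its frequency would match the sum of theirs, which this sketch does not attempt to handle.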
