Multiple alts per variant? #169
Comments
@richarddurbin also advocates the mono-allelic model. I still have a few questions. Say we have a VCF line |
Good point...I'm not sure how, if at all, the single-alt model would work with VCF. If a goal is to make the API as close as possible to VCF then a single-alt strategy might not work. |
A colleague forwarded this thread to me… I advocated for the mono-allelic model in the context of the ADAM project for many of the reasons listed above. Not necessarily for use during variant calling, but instead for downstream analysis and particularly for joining with other datasets. In the example above, I think one could use local phasing and possibly the <NON_REF> symbolic allele to maintain the same information after "splitting" multi-allelic entries. For example
would become something like
The result captures that the two different alleles are in trans and that the other allele, although not explicitly specified, is not the reference. While adding |
I do advocate a mono-allelic representation, but this is not the same as having a VCF file with one ALT per record. That would be a bi-allelic representation. Say we have a VCF line REF=A;ALT=C,T;GT=1/2. Is there a mono-allelic VCF? If so, what is it? REF=A;ALT=C;GT=1/1 and REF=A;ALT=T;GT=1/1? would have three alleles, something like
Note that this is not VCF, that [START, END) is a semi-open interval, that there is a record for each allele, and that the genotype is given by the allele count of each allele in the sample. As @bioinformed pointed out, the VCF spec does not require that overlapping records are merged. There are multiple reasons for this. First, the rules for merging can be complex, Richard On 28 Oct 2014, at 22:04, Michael Linderman notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
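A minimal sketch of the per-allele layout described above, assuming a hypothetical `AlleleRecord` type and hypothetical field names (`chrom`, `start`, `end`, `seq`, `count`), none of which come from the thread. It turns the example REF=A;ALT=C,T;GT=1/2 into one record per allele, with a semi-open [START, END) interval and the allele count in the sample. The chromosome name and position are made up for illustration.

```python
from collections import Counter
from typing import NamedTuple


class AlleleRecord(NamedTuple):
    # Hypothetical record layout; the field names are assumptions.
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive, so [start, end) is semi-open
    seq: str
    count: int  # copies of this allele in the sample


def to_allele_records(chrom, pos, ref, alts, gt):
    """Build one record per allele from a VCF-style site.

    pos is a 1-based VCF position; gt holds allele indices (0 = REF).
    """
    start = pos - 1
    end = start + len(ref)
    counts = Counter(gt)
    alleles = [ref] + list(alts)
    return [
        AlleleRecord(chrom, start, end, seq, counts.get(i, 0))
        for i, seq in enumerate(alleles)
    ]


# REF=A;ALT=C,T;GT=1/2 at an assumed position 1:100
records = to_allele_records("1", 100, "A", ["C", "T"], (1, 2))
for record in records:
    print(record)
```

Here the reference allele gets a count of 0 and each ALT a count of 1. Whether the reference allele should be stored at all is debated later in the thread.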
Firstly, how would we encode a VCF line
Or perhaps there would be only one reference line? Either way, when we want to get the genotypes, how could we combine the alleles? By [START,END) pair or by having a new object relating alleles? Wouldn’t this bring us back to the mess with VCF and make it worse? With the mono-allelic model, we delay part of the merging problem to the genotyping phase. We have not solved it. Secondly, for a complex variant that can be represented in different ways in VCF, the mono-allelic model is not easier for merging. We still need to go through a realignment procedure to compare alleles. Even if we ignore genotyping, the mono-allelic model is not a full solution to merging. Thirdly, how could we keep genotype likelihoods? It would be cumbersome to store them in the mono-allelic model. Or perhaps we want to get rid of them in GA4GH as they are mostly useful for calling/genotyping only? Generally, I think mono- and multi-allelic models are equivalent at the core. They share the same fundamental problems with edit-based representations. They have the same difficulty in merging when we want to get genotypes. At a higher level, the mono-allelic model is more consistent with annotation as we annotate the function of an allele, usually not a genotype. However, the multi-allelic model of VCF is more convenient when we deal with genotypes. It is also more consistent provided that there are no overlapping VCF records. IMO, the mono-allelic model would be more useful for a context- or sequence-based representation. |
I think I would just have
You are right that we delay the merging until making genotypes, but at that point we only need consider allele fields which have count > 0. This eliminates Your second point about identifiability is valid. This is still an edit representation and we still need a canonical way to define the edits (alleles), unless we The third point about genotype likelihoods was raised by Gabor and Gil. We can keep allele count likelihoods, i.e. the probability of seeing the data given Richard On 29 Oct 2014, at 12:26, Heng Li notifications@github.com wrote:
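The point about only considering alleles with count > 0 at genotyping time could be sketched roughly as follows. This is an illustration under assumptions, not an implementation from the thread: records are plain dicts with hypothetical keys, and the functions `overlaps` and `candidate_pairs` are made-up names. It keeps the alleles present in a sample and lists the pairs whose semi-open intervals overlap.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two semi-open [start, end) intervals on one chromosome overlap.

    Zero-length intervals (pure insertions) never overlap under this test
    and would need extra handling.
    """
    return (
        a["chrom"] == b["chrom"]
        and a["start"] < b["end"]
        and b["start"] < a["end"]
    )


def candidate_pairs(records):
    """Return overlapping pairs among alleles with count > 0."""
    present = [r for r in records if r["count"] > 0]
    return [(a, b) for a, b in combinations(present, 2) if overlaps(a, b)]


# Hypothetical per-sample allele records
sample_records = [
    {"chrom": "1", "start": 99, "end": 100, "seq": "A", "count": 0},
    {"chrom": "1", "start": 99, "end": 100, "seq": "C", "count": 1},
    {"chrom": "1", "start": 99, "end": 100, "seq": "T", "count": 1},
    {"chrom": "1", "start": 104, "end": 105, "seq": "G", "count": 1},
]
pairs = candidate_pairs(sample_records)
for a, b in pairs:
    print(a["seq"], b["seq"])
```

Only the C and T records are paired here: the A record has count 0 and the G record does not overlap the others.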
I am +1 here. Also, if you have the allele count likelihoods, can you eliminate the need for a symbolic alt allele, as is used in gVCF? I believe that you can, but I haven't gone through the math and convinced myself yet. |
I agree that there's a tradeoff between ease-of-genotyping and ease-of-annotation for the multi- vs. mono-allelic models. But I think that the genotyping burden is relatively minor in mono-allelic models, especially so if allele counts and/or likelihoods are stored with each variant. On the other hand, I think the additional burden on downstream analysis is pretty significant for multi-allelic models - it's not just annotations that are more difficult, but set operations like intersections & unions, counting and frequency calculations (how many samples have variant X?), even iteration over the set of variants is not trivial. In house, we always convert VCFs to a mono-allelic representation, and it's been successful for us so far. Of course, we do mostly downstream analyses, and we don't write variant callers, so my opinion is probably biased by the type of work I do. |
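To illustrate why set operations and counting are simple in a single-alt representation, here is a minimal sketch. It assumes each variant is a hashable `(chrom, pos, ref, alt)` tuple and that calls are held in a hypothetical `calls` mapping from sample name to a set of variants. The data and the `carriers_by_variant` helper are made up for illustration.

```python
from collections import defaultdict


def carriers_by_variant(calls):
    """Map each (chrom, pos, ref, alt) variant to the set of samples carrying it."""
    carriers = defaultdict(set)
    for sample, variants in calls.items():
        for variant in variants:
            carriers[variant].add(sample)
    return carriers


# Hypothetical calls: sample name -> set of single-alt variants
calls = {
    "s1": {("1", 100, "A", "C"), ("1", 100, "A", "T")},
    "s2": {("1", 100, "A", "C")},
    "s3": set(),
}

carriers = carriers_by_variant(calls)
variant_x = ("1", 100, "A", "C")

# "How many samples have variant X?"
n_carriers = len(carriers[variant_x])
print(n_carriers)

# Intersections and unions are plain set operations.
shared = calls["s1"] & calls["s2"]
either = calls["s1"] | calls["s2"]
print(shared, either)
```

With multi-alt records the same questions would first require matching up individual ALT entries across records.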
On the third point, how do we calculate "0 copies of the allele"? What is the formula? Anyway, I haven't worked out how to compute the likelihood of a het P(D|[A1,A2]) from some forms of P(D|A1) and P(D|A2) in the general case. I see your point on bounded merge. It is a valid and good point. On second thought, I think we should drop all reference alleles. For the example above, we can keep it as:
In @brendanofallon The genotyping problem is not that minor. In the mono-allelic model, we have to impose dependencies across alleles. Extra dependencies make a representation more fragile to errors. In addition, retrieving genotypes across multiple lines in the mono-allelic model and traversing alleles in the multi-allelic model are arguably of the same bioinformatic complexity. Finally, most variant callers effectively use the multi-allelic model as they have to evaluate het likelihoods for all allele combinations. |
@lh3, maybe I'm looking at it naively, but wouldn't it sort of come to some form like this?
And for "0 copies of the allele" wouldn't it be something of the form |
Erik Garrison is going to spend some time looking at implementing this with me. I don't really like adding the reference back to the count information for each allele - that gives the reference a privileged position that I think is a problem. Also, I think that it is wrong to single-linkage cluster alleles into sites and calculate hets etc. for all. What matters is the pairwise relation between alleles Anyway, we will aim to come up with a document and implementation that people can look at. Richard On 30 Oct 2014, at 04:07, Heng Li notifications@github.com wrote:
When you separate the reference allele out, you will sometimes need multiple overlapping reference alleles of different lengths, although there is really only one reference allele. This seems inconsistent. The reference allele is special anyway, in that the coordinate is determined by the reference. I agree that the pairwise relationship is not transitive. This is similar to the argument that merging in the mono-allelic model is and should be bounded, which I like, too. Nonetheless, variant calling is different. Modern callers do not look for alleles and then cluster them. They start by enumerating haplotypes and use them to evaluate the data. These haplotypes may harbor several atomic SNPs/INDELs if they are placed on the same reads. |
I hear what you say about the reference being special, but I think we should be wary of that. There was an interesting talk at ASHG by Ryan With respect to variant calling, we certainly plan to look at that. For single samples at least I am confident that the mono-allelic route will Richard On 31 Oct 2014, at 01:32, Heng Li notifications@github.com wrote:
The discussion over representation of reference alleles reminds me of the original development of gVCF, in large part to distinguish between reference-match and non-calls (insufficient read coverage etc.) without going back to the read alignments. Representing non-calls could further complicate the |
In gVCF, we have long stretches of REF alleles not overlapping variants. We just keep it that way in the |
Richard/Erik - Would there be any exciting updates regarding the document? I am very curious to read it. Thanks, |
This has been dormant since October. I am closing this in 2 days. |
Closing. This can be reopened if necessary. |
I'm curious to hear people's opinions regarding whether or not a variant can or should encompass multiple alt alleles. It seems that version 0.5 of the variants.avdl supports multiple alts per variant, as alt is an array of strings and not a single string.
FWIW, I think a variant having multiple alts is a bit confusing and leads to some unnecessary complexity. For instance, each alt may overlap different features of interest, will have different annotations, may be present in a distinct set of samples, etc. This leads to some confusing requirements: for the set of annotations associated with a multiple-alt variant, only some annotations will apply to each alt allele, so the annotation set would need some sort of mapping back to the alt(s) each annotation is tied to.
An alternative would be to represent each alt as a separate variant having the same start position, and restrict the idea of variant to being a single chr-pos-ref-alt combination.
I realize that it may be a bit late in the game to question this, and that VCFs of course permit multiple alts per variant so it's something people are familiar with... but I do think that it makes life more confusing in several ways, and I'm not sure what the upside is.
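A minimal sketch of the alternative described above, where each alt becomes its own variant with the same start position and carries only its own annotations. The `split_multi_alt` function, the dict keys, and the annotation strings are all hypothetical, chosen only to illustrate the idea. This is not the variants.avdl schema.

```python
def split_multi_alt(chrom, pos, ref, alts, annotations_by_alt):
    """Split one multi-alt variant into one chr-pos-ref-alt variant per alt.

    annotations_by_alt maps an alt string to the annotations that apply to it,
    so no mapping back from a shared annotation set is needed afterwards.
    """
    return [
        {
            "chrom": chrom,
            "pos": pos,
            "ref": ref,
            "alt": alt,
            "annotations": list(annotations_by_alt.get(alt, [])),
        }
        for alt in alts
    ]


# Hypothetical multi-alt variant with placeholder annotations
variants = split_multi_alt(
    "1", 100, "A", ["C", "T"],
    {"C": ["annotation_for_C"], "T": ["annotation_for_T"]},
)
for variant in variants:
    print(variant)
```

Each resulting variant is a single chr-pos-ref-alt combination. Both share the same start position, and each annotation list applies to exactly one alt.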