basic standardisation - deletion alleles and start/stop coordinates #168
Comments
I believe strongly in showing coordinates to end users in a system that they understand, which often means supporting multiple coordinate schemes for different communities and standards (rather than trying to get everyone to agree on a single representation). However, I believe even more strongly that all "back end" coordinate APIs, internal storage formats and arithmetic should use 0-based half-open coordinates, also called interbase coordinates or UCSC coordinates. We can review HGVS and other end user syntaxes, but for the remainder of this post I'm going to address only the "back end" representation. For completeness, here are all non-degenerate cases by variant type (assuming 0 <= a <= b <= chromosome length):
No special notation is needed for null alleles for insertions or deletions -- they are merely empty strings for reference or alternative, respectively. This convention avoids picking special characters, adding padding bases and a variety of other unnecessary complexities. |
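As a rough illustration of the convention described above, here is a minimal Python sketch of a 0-based half-open (interbase) variant record in which null alleles are plain empty strings. The class name `Variant`, its field names, and the labels returned by `kind()` are assumptions made for this example; they are not taken from the GA4GH schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical interbase record: start is inclusive, end is exclusive."""
    start: int
    end: int
    ref: str  # reference bases in [start, end); "" for a pure insertion
    alt: str  # replacement bases; "" for a pure deletion

    def kind(self) -> str:
        # In half-open coordinates the interval length equals len(ref).
        if self.end - self.start != len(self.ref):
            raise ValueError("end - start must equal len(ref)")
        if not self.ref and not self.alt:
            raise ValueError("degenerate variant: both alleles are empty")
        if not self.ref:
            return "insertion"   # zero-length interval, start == end
        if not self.alt:
            return "deletion"
        if len(self.ref) == 1 and len(self.alt) == 1:
            return "snv"
        return "complex"         # any other multi-base replacement
```

For example, `Variant(5, 5, "", "G")` is an insertion between bases 4 and 5, and `Variant(5, 7, "AC", "")` deletes two bases, with no padding base or placeholder character in either case.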
Coordinates: I agree the interbase system is attractive, and indeed was seeking views/discussion about alternative "back end" preferences and practices |
I'm absolutely in favour of standardisation, and I like the ideas expressed above. However, I'd be strongly against using a "-" or anything else in the |
I don't understand why we are discussing this. The current API is clear and derived from VCF, which is also a GA4GH specification. Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
Hi Richard - I triggered this thread after I saw the topic flagged up in the MME API. Others I spoke with were unsure of the situation, felt there might not be a consensus yet, and suggested I pose the question more widely to the group. I am glad it was so straightforward to get the answers (which it seems everyone agrees with). |
One issue that we've encountered is that some tools (if I remember correctly, Jannovar and/or Exomiser) output insertions and deletions without a common prefix. Although this is not valid according to the VCF specification, we'd still like to be able to process such variants and expose them through GA4GH APIs. Restoring the prefix is a task that would affect performance. |
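To make the cost of restoring the prefix concrete, here is a minimal Python sketch. It assumes the reference sequence for the contig is available as an in-memory string and that the input uses 0-based interbase coordinates with a possibly empty allele; the function name `add_vcf_anchor` is hypothetical. The point is that every unpadded indel needs one reference lookup.

```python
def add_vcf_anchor(reference: str, start: int, ref: str, alt: str):
    """Return a VCF-style (pos, ref, alt) with 1-based pos.

    `start` is the 0-based interbase start of the unpadded variant.
    """
    if ref and alt:
        # Both alleles are non-empty, so no padding base is needed.
        return start + 1, ref, alt
    if start == 0:
        # Left-padding is impossible at the first base of the contig;
        # that case is not handled in this sketch.
        raise ValueError("cannot left-pad a variant at position 0")
    anchor = reference[start - 1]  # the reference lookup in question
    # The anchor sits at 0-based start - 1, which is 1-based `start`.
    return start, anchor + ref, anchor + alt
```

With `reference = "ACGTACGT"`, an insertion of `G` at interbase position 4 becomes `(4, "T", "TG")`.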
In addition to VCF, the other widely used mutation annotation system is HGVS which was discussed in #159. Tools should stop inventing new in-house representations of INDELs. I wouldn't mind if GA4GH ignores a few tools that do not conform to standards. The mainstream annotators all support VCFs. |
I'm generally not in favor of supporting broken implementations of On Tue, Oct 28, 2014 at 8:57 AM, Sergiu Dumitriu notifications@github.com
|
IMHO, the HGVS standard is a bit strange in the way it handles indels. And FYI, HGVS and ISCN are now attempting to align their respective nomenclature systems. HGVS nomenclature uses 'start' and 'end/stop' as follows: |
Re HGVS: don't forget (tandem) duplications provide the coordinates of the On Tue, Oct 28, 2014 at 10:32 AM, antbro notifications@github.com wrote:
|
And HGVS nomenclature defines duplications as an entity that's separate from the notion of insertion |
Do you have examples (in addition to insertions at the beginning of chromosomes)?
We should move HGVS discussions to #159. |
In programming languages, the fewer the reserved words, the more systematically the language can be used to build more complex structures. Maybe we can start with a small set of atomic operations and build up all the variations we require from those. Otherwise we are either too general or cannot encompass the possibilities that others might deem important. |
Here is one example:
There are two problems:
|
I think we have to distinguish between what we have settled on as a We further agreed at the GA4GH meeting that people would send these "hard PS: One concrete test we discussed is
Simple, but important to check that this works in all cases. Similar tests On Tue, Oct 28, 2014 at 7:42 AM, antbro notifications@github.com wrote:
|
Hi, David. I'm here and participating! My email address is jacobs@bioinformed.com. Justin, Benedict, and Steve have come up with a Google form to collect VCF comparison oddballs and hairballs that populates this spreadsheet. The plan is to start advertising this resource on the next WG call. |
Can someone make a pull request that enshrines these conventions for start/end/ref/alt in the comments of our avro file? Then we'll have a good place to point anyone who has questions. |
@bioinformed Your example shows two overlapping variants. The preferred VCF should have one line with REF=A, ALT=T,TG and GT=1/2 instead. Inconsistencies between overlapping variants are inherent to all edit-based approaches. Stripping the prefix happens to remove the overlap in your example, but it does not solve the general problem. IMO, a well-formed VCF should contain no overlapping records. This is true for HaplotypeCaller and freebayes, I believe. In all, I don't think prefix is to blame here. @cassiedoll Personally, I'd prefer to just state clearly what we have in GA4GH/VCF. Explaining alternatives would be very lengthy and might confuse less experienced users. Except HGVS, these alternatives are largely in-house and much less used. |
excellent. Thanks Kevin! Can I make a request that we add to the Google PS: You could start with just uploading examples consisting of one So great to get rolling on this! -D On Tue, Oct 28, 2014 at 8:53 AM, bioinformed notifications@github.com
|
PPS: To input an example consisting of a heterozygous diploid variant On Tue, Oct 28, 2014 at 9:52 AM, David Haussler haussler@soe.ucsc.edu
|
"One concrete test we discussed is..." |
@lh3 - sorry for being unclear, that's what I meant. If you just look at the avro/docs, we don't say anything about how you represent an indel/deletion/etc - which just seems like an oversight :) |
Yes. we discussed a multi-step test like this
-D On Tue, Oct 28, 2014 at 9:58 AM, antbro notifications@github.com wrote:
|
@cassiedoll - this is great! After the collection of examples, I think we can build a fundamental set of atomic language constructs with simple rules to build up all these cases. As noted by @haussler, these can very easily form tests. |
I support @lh3 on this. I don't think that the so-called prefix in VCF creates problems - I actually see it as removing problems by providing very clean
VCF as a format does not specify that the replaced base is at the start of the replacement string - that is just a convention to make the representation canonical. It is also true that the VCF model requires that overlapping variants are merged. This is what makes it messy to merge VCF files. Richard
|
@cassiedoll I see what you mean now. The current description in avro is reasonably accurate and expressive about our intended representation. At least I do not have anything to add for now. |
@richarddurbin and @lh3 - so should we have a VCF checker that validates it before reading it into the schema? |
There is a VCF validator. I am copying Petr Danecek who should be able to point you to it. Richard
|
Good to have examples that cannot be represented in VCF or HGVS too. Thanks As we discussed at the GA4GH meeting, it goes the other way too. Since they Kevin, it is your call how soon you want to get into this sampling part on On Tue, Oct 28, 2014 at 9:58 AM, antbro notifications@github.com wrote:
|
The sooner we start using and testing the monoallelic representation On Tue, Oct 28, 2014 at 10:23 AM, Richard Durbin notifications@github.com
|
@richarddurbin, @lh3: As I read it, the VCF spec does not require that overlapping records are merged. Do you mean something different by the "VCF model"? Here is the only text I can find in the VCF 4.2 spec on the topic:
One nit in the wording: If multiple records can have the same POS then they're technically required to be in non-decreasing order. This also opens a loophole where there is no canonical total ordering for VCF records (modulo contig order). That said, I agree that padding doesn't create a fundamental problem with the VCF model, per se. I still assert that any program attempting to interpret complex variation should strip extraneous reference bases when attempting to model complex variation to avoid making potentially invalid reference assertions. This is why I believe that APIs presenting models of variants based on a reference interval and substituted alleles should support both null reference ranges (insertions) and null alternative alleles (deletions). @richarddurbin: I'm joining the conversation late and haven't been able to find the monoallelic representation proposal. Can you please point me in the right direction? |
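The stripping of extraneous reference bases mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the function name `vcf_to_interbase` is hypothetical, the input is a single-ALT VCF record with 1-based POS, and the shared prefix is trimmed before the shared suffix (a choice made here so that the VCF anchor base is the one removed; other orders give different but equivalent placements in repeats).

```python
def vcf_to_interbase(pos: int, ref: str, alt: str):
    """Return (start, end, ref, alt) in 0-based half-open coordinates
    with bases shared by REF and ALT trimmed from both ends."""
    start = pos - 1  # 1-based POS to 0-based start

    # Trim the shared prefix (this removes the VCF padding base).
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    ref, alt, start = ref[n:], alt[n:], start + n

    # Trim any shared suffix that remains.
    m = 0
    while m < len(ref) and m < len(alt) and ref[-1 - m] == alt[-1 - m]:
        m += 1
    if m:
        ref, alt = ref[:-m], alt[:-m]

    # An insertion ends up with ref == "" and start == end;
    # a deletion ends up with alt == "".
    return start, start + len(ref), ref, alt
```

For instance, the padded insertion `POS=10, REF=A, ALT=AG` becomes `(10, 10, "", "G")`, which no longer asserts anything about the reference base at position 10.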
@haussler: Another of my projects is to validate that all HGVS representations in HGMD Pro and Clinvar match the VCF representation of the same variants and vice versa. I'm using my colleague Reece Hart's UTA and HGVS infrastructure. This round of testing doesn't include complex phased haplotypes, but that is the next logical step. |
Peter can you share your tools for validating HGVS? I could not find them On Tue, Oct 28, 2014 at 10:58 AM, Kevin Jacobs notifications@github.com
|
@bioinformed VCF does not specify whether it allows overlapping variants. I was saying that overlapping variants may lead to inconsistencies no matter whether you have prefix or not. The problem is not prefix, so it is not necessary to strip it. Allowing null alleles will make it difficult to export the GA4GH representation to VCF and add unnecessary complexity to make analysis code work with both cases. I think for edit-based representation, GA4GH should just stick with VCF. |
@richarddurbin - Thank you, yes, vcf-tools are very helpful - I almost forgot about the validator :) What I intended to say is that before a "variant call" file gets loaded into our schema, a validation step would perform checks and report back recommendations. For instance - as @lh3 mentioned - if it finds overlapping variants, it would ask the user to reformat the file or provide recommendations. This step can have multiple validation stages before the data goes into a GA4GH repository. If so, based on the examples we collect here, we can form all the checks as part of GA4GH. I feel integrating already publicly available tools will speed up our process here. |
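One such check, flagging overlapping records before loading, might look like the minimal Python sketch below. It assumes records are `(contig, start, end)` tuples in 0-based half-open coordinates; the function name `find_overlaps` is hypothetical and this is not part of vcf-tools or any GA4GH code. Note that under the strict half-open test used here, two zero-length insertions at the same point are not reported as overlapping.

```python
def find_overlaps(records):
    """Return (earlier, later) pairs where `later` starts inside the
    furthest-reaching record seen so far on the same contig."""
    flagged = []
    furthest = None  # record with the largest end on the current contig
    for rec in sorted(records):
        contig, start, end = rec
        same_contig = furthest is not None and furthest[0] == contig
        if same_contig and start < furthest[2]:
            flagged.append((furthest, rec))
        if not same_contig or end > furthest[2]:
            furthest = rec
    return flagged
```

A loader could run this once per file and either reject the input or report the flagged pairs back to the user as recommendations.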
@pgrosu @richarddurbin The validator from vcf-tools may not be the best choice: it is written in Perl and may be too slow for this purpose. I understand the motivation here is to check for overlapping variants? |
Dear David, You can find the Mutalyzer suite at https://mutalyzer.nl. We published an extended Backus-Naur form (EBNF) of the HGVS nomenclature syntax in 2011 in BMC Bioinformatics (doi: 10.1186/1471-2105-12-S4-S5). The EBNF is used to generate Mutalyzer's HGVS syntax parser. Best regards, Peter From: David Haussler (haussler@soe.ucsc.edu): Peter can you share your tools for validating HGVS? I could not find them at http://www.hgvs.org/mutnomen/ On Tue, Oct 28, 2014 at 10:58 AM, Kevin Jacobs (notifications@github.com) wrote: @haussler: Another of my projects is to validate that all HGVS representations in HGMD Pro and Clinvar match the VCF representation of the same variants and vice versa. I'm using my colleague Reece Hart's UTA (https://pypi.python.org/pypi/uta/0.1.8) and HGVS (https://pypi.python.org/pypi/hgvs) infrastructure. This round of testing doesn't include complex phased haplotypes, but that is the next logical step. |
@pd3, overlapping variants would be one case, which is also being continued in another discussion in #169. But we are trying to brainstorm different "hard to represent" genetic changes that we might encounter and how best to handle them. I initially brought up validation as a checkpoint to determine whether a file contains genetic changes that might be better represented another way, or whether we want to perform specific checks before the data goes into a GA4GH repository, based on the best practices we intend for data representation in our schema. |
I'm coming to the conversation late. A few comments on the thread:
|
Just in case it may come in handy for anyone, here's a schematic by @asimenos we use to illustrate interbase coordinates (at least I hope we're referring to the same thing :) |
In the same vein, here's a Google spreadsheet with a bunch of coordinate system and mapping examples that I put together when we were working on HGVS variant mapping. http://goo.gl/b1nUxl |
This issue has had a lot of discussion. It would be great to hear final comment on which standard GA4GH has settled on for deletion alleles and start/stop coordinates. |
Check created time of analysis field
It seems there may be different views and practices regarding how we should specify deletion alleles ("^", "*", "_", "-",...) and start/stop coordinates (first base, base before, last base, base after). I suggest it may be good to review who uses which alternatives and why, and ideally settle on a GA4GH standard for these very basic items.