This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

basic standardisation - deletion alleles and start/stop coordinates #168

Open
antbro opened this issue Oct 27, 2014 · 42 comments

@antbro

antbro commented Oct 27, 2014

It seems there may be different views and practices regarding how we should specify deletion alleles ("^", "*", "_", "-",...) and start/stop coordinates (first base, base before, last base, base after). I suggest it may be good to review who uses which alternatives and why, and ideally settle on a GA4GH standard for these very basic items.

@bioinformed

I believe strongly in showing coordinates to end users in a system that they understand, which often means supporting multiple coordinate schemes for different communities and standards (rather than trying to get everyone to agree on a single representation). However, I believe even more strongly that all "back end" coordinate APIs, internal storage formats and arithmetic should use 0-based half-open coordinates, also called interbase coordinates or UCSC coordinates. We can review HGVS and other end user syntaxes, but for the remainder of this post I'm going to address only the "back end" representation.

For completeness, here are all non-degenerate cases by variant type (assuming 0 <= a <= b <= chromosome length):

| class | start | stop | ref length (stop - start) | alt length |
|---|---|---|---|---|
| SNV | a | a + 1 | 1 | 1 |
| Insertion | a | a | 0 | > 0 |
| Deletion | a | b | > 0 | 0 |
| MNV1 | a | b | > 1 | > 0 |
| MNV2 | a | b | 1 | > 1 |

No special notation is needed for null alleles for insertions or deletions -- they are merely empty strings for reference or alternative, respectively. This convention avoids picking special characters, adding padding bases and a variety of other unnecessary complexities.
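
As an illustration only, the cases listed above can be sketched in Python. The `Variant` class and `classify` function are hypothetical names, not part of any GA4GH schema; the sketch assumes a variant is stored as an interbase interval plus an alternate string, with an empty string standing for a null allele.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Replace reference[start:stop] by `alt` (interbase coordinates)."""
    start: int  # 0-based, inclusive
    stop: int   # 0-based, exclusive
    alt: str    # alternate bases; "" for a pure deletion

    @property
    def ref_len(self) -> int:
        return self.stop - self.start


def classify(v: Variant) -> str:
    """Map a variant onto the classes listed above."""
    if v.start < 0 or v.stop < v.start:
        raise ValueError("need 0 <= start <= stop")
    ref_len, alt_len = v.ref_len, len(v.alt)
    if ref_len == 0 and alt_len == 0:
        raise ValueError("degenerate: nothing replaced, nothing inserted")
    if ref_len == 0:
        return "insertion"   # empty reference interval
    if alt_len == 0:
        return "deletion"    # empty alternate allele
    if ref_len == 1 and alt_len == 1:
        return "SNV"
    return "MNV"             # covers both MNV1 and MNV2
```

Note that neither the insertion nor the deletion branch needs a gap character or a padding base; the empty interval and the empty string carry that information.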

@antbro
Author

antbro commented Oct 28, 2014

Coordinates: I agree the interbase system is attractive, and indeed was seeking views/discussion about alternative "back end" preferences and practices
Null allele notation: empty strings are easily missed/lost in processing, so a notation character would be safer. The "-" character seems to be the most widely used
Perhaps it's all simple then - these could be the convention for GA4GH APIs? [or perhaps de facto already are??]

@jeromekelleher
Contributor

I'm absolutely in favour of standardisation, and I like the ideas expressed above. However, I'd be strongly against using a "-" or anything else in the referenceBases and alternateBases fields to denote a null allele; these fields should only include bases and not any encoded auxiliary information. If it's necessary, we should have an explicit field like isNullAllele or have an enum for the different types of alleles/variants.

@richarddurbin
Contributor

I don't understand why we are discussing this. The current API is clear and derived from VCF, which is also a GA4GH specification.
As Jerome says, it has no gap character, so no use of '-', '_', or '*'. Instead it uses pure replacement semantics, specifying a string
in the reference which is to be replaced by a different string in the alternate. Coordinates are 0 based. This is robust, easy to parse,
and has worked for tens of millions of variants in large scale sequencing projects. None of the proposed future graph representations
include gap characters. All use 0-based coordinates.

Richard

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@antbro
Author

antbro commented Oct 28, 2014

Hi Richard - I triggered this thread after I saw the topic flagged up in the MME API. Others I spoke with were unsure of the situation and felt there may not be a consensus yet, and suggested I pose the question more widely to the group. I am glad it was so straightforward to get the answers (which it seems everyone agrees with).
Beyond the backend processing we've been discussing, several different systems are in use for human readable data. I wonder if it would make sense to issue a GA4GH convention on these matters, to guide newcomers who may be creating GUIs for such purposes?

@sdumitriu
Member

One issue that we've encountered is that some tools (if I remember correctly, JAnnovar and/or Exomiser) do output variants without a common prefix for insertions-deletions, which, although not valid according to the VCF specification, we'd still like to be able to process and expose through GA4GH APIs. Getting back the prefix is a task that would affect the performance.

@lh3
Member

lh3 commented Oct 28, 2014

In addition to VCF, the other widely used mutation annotation system is HGVS which was discussed in #159. Tools should stop inventing new in-house representations of INDELs. I wouldn't mind if GA4GH ignores a few tools that do not conform to standards. The mainstream annotators all support VCFs.

@bioinformed

I'm generally not in favor of supporting broken implementations of
standards. However, I believe that any sane VCF implementation should
strip leading and trailing reference bases added for padding as a
recommended normalization step. The avoidance of empty alleles in the VCF
spec, in my view, is a concession to end user formatting that only creates
problems when processing any sufficiently complex VCF data. I'm happy to
provide examples, but this issue may not be the right place to dive down
this particular rabbit hole. Standards compliant VCF writers will have to
re-insert the padding, which can generally be cached to avoid most
reference lookups.

@antbro
Author

antbro commented Oct 28, 2014

IMHO, the HGVS standard is a bit strange in the way it handles indels. And FYI, HGVS and ISCN are now attempting to align their respective nomenclature systems.

HGVS nomenclature uses 'start' and 'end/stop' as follows:
They number bases not junctions between bases
For insertions 'start' and 'end/stop' bases are those BETWEEN which the insertion takes place.
For deletions 'start' and 'end/stop' bases are INCLUDED in the deletion.
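
A rough sketch of how the two HGVS conventions described above would map onto interbase (0-based, half-open) intervals. The function names are made up for illustration and are not taken from any HGVS library; only the positional arithmetic is shown, not HGVS parsing.

```python
from typing import Tuple


def hgvs_deletion_to_interbase(first: int, last: int) -> Tuple[int, int]:
    """Bases first..last (1-based, both deleted) -> interbase (start, stop)."""
    if not 1 <= first <= last:
        raise ValueError("need 1 <= first <= last")
    return first - 1, last


def hgvs_insertion_to_interbase(left: int, right: int) -> Tuple[int, int]:
    """Insertion between adjacent bases left and right (1-based).

    The result is a zero-length interval at the junction between them.
    """
    if left < 1 or right != left + 1:
        raise ValueError("flanking bases must be adjacent")
    return left, left
```

For example, a deletion of bases 5 to 7 becomes the interval (4, 7), while an insertion between bases 5 and 6 becomes the empty interval (5, 5).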

@bioinformed

Re HGVS: don't forget (tandem) duplications provide the coordinates of the
duplicated sequence.

@antbro
Author

antbro commented Oct 28, 2014

And HGVS nomenclature defines duplications as an entity that's separate from the notion of insertion

@lh3
Member

lh3 commented Oct 28, 2014

> The avoidance of empty alleles in the VCF spec, in my view, is a concession to end user formatting that only creates problems when processing any sufficiently complex VCF data.

Do you have examples (in addition to insertions at the beginning of chromosomes)?

> IMHO, the HGVS standard ...

We should move HGVS discussions to #159.

@pgrosu
Contributor

pgrosu commented Oct 28, 2014

In programming languages, the fewer reserved words there are, the more systematically the language can be used to create more complex structures. Maybe we can start with a small set of atomic operations and build up all the variations we require from those. Otherwise we're either too general or cannot encompass the possibilities that others might deem important.

@bioinformed

> Do you have examples (in addition to insertions at the beginning of chromosomes)?

Here is one example:

CHROM POS REF ALT […] S1
Z 1 A T ... 1/1
Z 1 A AG ... 0/1

There are two problems:

  1. The left padding creates overlapping (yet valid) records. Many VCF processing tools will handle these records incorrectly.
  2. The second record could be interpreted as making a reference assertion at position 1, which would be incorrect for sample S1 who is homozygous for an alternative allele at position 1.
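
To make the point concrete, here is a minimal sketch of stripping padding bases when converting a VCF record to an interbase replacement. `trim_padding` is a hypothetical helper, and trimming the shared suffix before the shared prefix is an assumption of this sketch; it does not left-align variants in repeats the way a full normalizer would.

```python
def trim_padding(pos, ref, alt):
    """VCF (1-based POS, REF, ALT) -> interbase (start, stop, ref, alt).

    Bases shared by REF and ALT at either end are removed, so an
    insertion ends up with an empty ref and a deletion with an empty alt.
    """
    start = pos - 1
    # Drop shared trailing bases first.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, moving the start to the right.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    return start, start + len(ref), ref, alt


print(trim_padding(1, "A", "T"))   # the SNV keeps its footprint (0, 1)
print(trim_padding(1, "A", "AG"))  # the insertion becomes the empty interval (1, 1)
```

After trimming, the two records in the example no longer share a reference base: the SNV covers the interval (0, 1) and the insertion sits at the junction (1, 1).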

@haussler

I think we have to distinguish between what we have settled on as a
standard abstract machine-readable representation scheme, which is as
Richard describes, and a widely used human-readable text notation, the one
getting the most attention from GA4GH being the HGVS nomenclature. A few of
us attended the HGVS nomenclature meeting at ASHG last week, including the
speaker Johan den Dunnen and lead developer Peter Taschner, cced. The
perspective we came to is simply that computers and people respond best to
different data representations. The best way forward is to define a
standard abstract machine-readable representation scheme for computers and
build tools to translate back and forth between that and a widely used
human-readable text notation like HGVS. In the process of creating these
tools and establishing that they are semantically consistent (in main part
by actually encoding "hard to represent" genetic changes given by actual
genetic examples), we will learn a lot.

We further agreed at the GA4GH meeting that people would send these "hard
to represent" genetic changes to Kevin Jacobs, whose email I don't have
handy (ccing Justin Zook for this). Let's contact Kevin and see if he has
received anything, and if not, ping folks. -D

PS: One concrete test we discussed is

  1. Taking a reference DNA sequence R and a DNA sequence A that is an
    alternative version of R, and representing the variants in A relative to R as
    a set V of changes.
  2. Then taking V and using it to convert R into an alternate DNA sequence,
    which should be A

Simple, but important to check that this works in all cases. Similar tests
can include translation from one format to the other, etc.
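
The round-trip test above might be sketched as follows. This is only a toy: Python's `difflib` stands in for a real variant caller, edits are plain (start, stop, replacement) tuples in 0-based half-open coordinates, and the function names are invented for this example.

```python
from difflib import SequenceMatcher


def diff_to_variants(ref, alt):
    """Step 1: represent alt relative to ref as a list V of edits."""
    matcher = SequenceMatcher(None, ref, alt, autojunk=False)
    return [
        (i1, i2, alt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_variants(ref, variants):
    """Step 2: apply V to ref.

    Edits are applied right to left so that the coordinates of the
    edits still to be applied are not shifted.
    """
    seq = ref
    for start, stop, replacement in sorted(variants, reverse=True):
        seq = seq[:start] + replacement + seq[stop:]
    return seq


R = "ACGTACGTAC"
A = "ACTTACGGGTA"
V = diff_to_variants(R, A)
assert apply_variants(R, V) == A
```

Insertions come out as empty intervals and deletions as empty replacement strings, so the same two functions cover every edit type.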

@bioinformed

> Let's contact Kevin and see if he has received anything, and if not, ping folks. -D

Hi, David. I'm here and participating! My email address is jacobs@bioinformed.com.

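Justin, Benedict, and Steve have come up with a Google form (https://docs.google.com/forms/d/1ou6Ozdc6M28gHSo-nn_XpHwbRspVRfhii-0MdJHt57w/viewform) to collect VCF comparison oddballs and hairballs that populates this spreadsheet (https://docs.google.com/spreadsheets/d/1FQfq6EGnNohjSa44Rgs2lmV9W8hij8X_Dj2l6mMFqPU/edit#gid=347931999). The plan is to start advertising this resource on the next WG call.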

@cassiedoll
Member

Can someone make a pull request that enshrines these conventions for start/end/ref/alt in the comments of our avro file?

Then we'll have a good place to point anyone who has questions.

@lh3
Member

lh3 commented Oct 28, 2014

@bioinformed Your example shows two overlapping variants. The preferred VCF should have one line with REF=A, ALT=T,TG and GT=1/2 instead. Inconsistencies between overlapping variants are inherent to all edit-based approaches. Stripping the prefix happens to remove the overlap in your example, but it does not solve the general problem. IMO, a well-formed VCF should contain no overlapping records. This is true for HaplotypeCaller and freebayes, I believe. In all, I don't think prefix is to blame here.

@cassiedoll Personally, I'd prefer to just state clearly what we have in GA4GH/VCF. Explaining alternatives would be very lengthy and might confuse less experienced users. Except HGVS, these alternatives are largely in-house and much less used.

@haussler

excellent. Thanks Kevin! Can I make a request that we add to the Google
form (or create a second Google form with) an option so that one can upload
a set of DNA sequences that embodies a variant they think is hard to
represent relative to a specified set of segments in a reference genome?
This will help bring in more "hard to represent" examples. -D

PS: You could start with just uploading examples consisting of one
alternate DNA segment relative to a given reference segment, but in cases
like reciprocal translocations, you can't represent the changes to the two
breakpoint regions using just one DNA segment. That said, if you want to
start with just one reference DNA segment and one alternate DNA segment,
that would be fine.

So great to get rolling on this! -D

@haussler

PPS: To input an example consisting of a heterozygous diploid variant
relative to a single reference segment, two variant DNA sequences would be
provided: one identical to the reference and one changed relative to the
reference.

@antbro
Author

antbro commented Oct 28, 2014

"One concrete test we discussed is..."
Please note - HGVS nomenclature includes synonyms, so perhaps include examples of such in this 'concrete test'. There are also many variants that cannot be represented in HGVS format. I'll chase down examples of these

@cassiedoll
Member

@lh3 - sorry for being unclear, that's what I meant. If you just look at the avro/docs, we don't say anything about how you represent an indel/deletion/etc - which just seems like an oversight :)

@haussler

Yes, we discussed a multi-step test like this:

  1. input HGVS variant set V expressed relative to a given reference DNA
    sequence R
  2. compute the corresponding alternate DNA sequence A
  3. Given A and R, compute the corresponding HGVS canonical representation
    V' of the difference between A and R. Note that it is expected that V' may
    not equal V. Synonyms are allowed in HGVS. It is claimed that V' and V
    should be two equivalent representations. To check this further:
  4. Given reference R and set of HGVS changes V', compute the alternate DNA
    sequence A'. Verify that A' = A.

-D

@pgrosu
Contributor

pgrosu commented Oct 28, 2014

@cassiedoll - this is great! After the collection of examples, I think we can build a fundamental set of atomic language constructs with simple rules to build up all these cases. As noted by @haussler, these can very easily form tests.

@richarddurbin
Contributor

I support @lh3 on this. I don't think that the so-called prefix in VCF creates problems - I actually see it as removing problems by providing very clean
replacement semantics. Note that there is no problem with VCF representing an insertion at the start of the chromosome. You simply replace the first
base by a string with the new sequence followed by the first base. e.g. an insertion of TT before start of chromosome base A is given by

CHR1    1   A   TTA

VCF as a format does not specify that the replaced base is at the start of the replacement string - that is just a convention to make the representation canonical.
This is why I wrote "so-called prefix" above. But it is clear that there is only one way to represent a start-of-chromosome insertion with a minimal replacement,
so that must be canonical.
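
As a sketch of the replacement semantics described here, the following hypothetical helper turns an interbase replacement with an empty allele back into padded VCF fields: the base before the event is used as padding, except at the start of the sequence, where the base after the event is used, as in the CHR1 example above. Deletions running to the very end of the sequence are not handled.

```python
def to_vcf_fields(reference, start, stop, alt):
    """Interbase replacement -> (POS, REF, ALT) with no empty alleles.

    `reference` is the full reference sequence as a string; start and
    stop are 0-based, half-open coordinates into it.
    """
    ref = reference[start:stop]
    if not ref and not alt:
        raise ValueError("degenerate variant")
    if ref and alt:
        return start + 1, ref, alt        # nothing to pad
    if start > 0:
        pad = reference[start - 1]        # base before the event
        return start, pad + ref, pad + alt
    pad = reference[stop]                 # start of sequence: base after the event
    return 1, ref + pad, alt + pad


# Insertion of TT before the first base A of the sequence.
print(to_vcf_fields("ACGT", 0, 0, "TT"))  # (1, 'A', 'TTA')
```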

It is also true that the VCF model requires that overlapping variants are merged. This is what makes it messy to merge VCF files.
By the way, in the monoallelic representation we would have alleles "REF", "T then second base" and "TG then second base" and the individual would have
allele count 0 for the REF allele and 1 for the other two alleles. There are no merging problems in the monoallelic representation.

Richard

@lh3
Member

lh3 commented Oct 28, 2014

@cassiedoll I see what you mean now. The current description in avro is reasonably accurate and expressive about our intended representation. At least I do not have anything to add for now.

@pgrosu
Contributor

pgrosu commented Oct 28, 2014

@richarddurbin and @lh3 - so should we have a VCF checker that validates it before reading it into the schema?

@richarddurbin
Contributor

There is a VCF validator. I am copying Petr Danecek who should be able to point you to it.
This should be a GA4GH file formats tool. Maybe you can find it linked from the VCF specification page.

Richard

@haussler

Good to have examples that cannot be represented in VCF or HGVS too. Thanks
for collecting these!

As we discussed at the GA4GH meeting, it goes the other way too. Since they
do not require full representation of phasing, in the diploid case VCF and
HGVS unphased representations don't usually semantically correspond to a
single pair of alternate DNA sequences relative to the reference.
Semantically, they correspond to a set of possible alternate diploid
configurations obtained by listing all possible phasings. Other issues can
complicate this further. Since the set of possible "allowable" DNA
interpretations of a VCF or HGVS file/string can be very large, Kevin has
written code that samples it to test equivalence.

Kevin, it is your call how soon you want to get into this sampling part on
the testing software end, but I would say that even if at the start we just
explore a large set of simplified examples where there is just one DNA
representation to consider when you are given a reference and a VCF or HGVS
file, we would still learn a lot. -D

@haussler

The sooner we start using and testing the monoallelic representation
alongside VCF and HGVS the better. -D

@bioinformed

@richarddurbin, @lh3: As I read it, the VCF spec does not require that overlapping records are merged. Do you mean something different by the "VCF model"?

Here is the only text I can find in the VCF 4.2 spec on the topic:

POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS.

One nit in the wording: If multiple records can have the same POS then they're technically required to be in non-decreasing order. This also opens a loophole where there is no canonical total ordering for VCF records (modulo contig order).
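
To illustrate the loophole: one way to obtain a total order is to break POS ties on the alleles. The tie-break on REF and then ALT below is purely an assumption for this sketch; the spec text quoted above does not define it, and `vcf_sort_key` is a made-up name.

```python
def vcf_sort_key(record, contig_rank):
    """Sort key for (chrom, pos, ref, alt) records.

    contig_rank maps each contig name to its position in the header
    order; ties on POS are broken by REF, then by ALT.
    """
    chrom, pos, ref, alt = record
    return (contig_rank[chrom], pos, ref, alt)


records = [("Z", 1, "A", "T"), ("Z", 1, "A", "AG")]
rank = {"Z": 0}
ordered = sorted(records, key=lambda r: vcf_sort_key(r, rank))
print(ordered)  # the "AG" record sorts before the "T" record
```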

That said, I agree that padding doesn't create a fundamental problem with the VCF model, per se. I still assert that any program attempting to interpret complex variation should strip extraneous reference bases when attempting to model complex variation to avoid making potentially invalid reference assertions. This is why I believe that APIs presenting models of variants based on a reference interval and substituted alleles should support both null reference ranges (insertions) and null alternative alleles (deletions).

@richarddurbin: I'm joining the conversation late and haven't been able to find the monoallelic representation proposal. Can you please point me in the right direction?

@bioinformed

@haussler: Another of my projects is to validate that all HGVS representations in HGMD Pro and Clinvar match the VCF representation of the same variants and vice versa. I'm using my colleague Reece Hart's UTA (https://pypi.python.org/pypi/uta/0.1.8) and HGVS (https://pypi.python.org/pypi/hgvs) infrastructure. This round of testing doesn't include complex phased haplotypes, but that is the next logical step.

@haussler

Peter can you share your tools for validating HGVS? I could not find them
at http://www.hgvs.org/mutnomen/

@lh3
Member

lh3 commented Oct 28, 2014

@bioinformed VCF does not specify whether it allows overlapping variants. I was saying that overlapping variants may lead to inconsistencies no matter whether you have prefix or not. The problem is not prefix, so it is not necessary to strip it. Allowing null alleles will make it difficult to export the GA4GH representation to VCF and add unnecessary complexity to make analysis code work with both cases. I think for edit-based representation, GA4GH should just stick with VCF.

@pgrosu
Contributor

pgrosu commented Oct 28, 2014

@richarddurbin - Thank you, yes, vcf-tools are very helpful - almost forgot about the validator :) I guess I intended to say that before the "variant call" file gets loaded into our schema, a validation step performs checks and reports back recommendations. For instance - like @lh3 mentioned - if it finds overlapping variants, it will ask the user to reformat the file or it will provide recommendations. This step can have multiple validation stages, before the data would go into a GA4GH repository. If so, based on the examples we collect here, we can form all the checks as part of GA4GH. I feel integrating already publicly available tools will speed up our process here.
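
One such check could look roughly like the sketch below, which flags records whose REF footprints intersect. It is not taken from vcf-tools or any existing validator; the record layout (chrom, pos, ref, alt) and the function names are assumptions made for illustration.

```python
def footprint(record):
    """Reference interval [start, stop) covered by a record's REF."""
    _chrom, pos, ref, _alt = record
    start = pos - 1  # VCF POS is 1-based
    return start, start + len(ref)


def find_overlaps(records):
    """Return (earlier, later) pairs whose REF footprints intersect.

    Each record is compared only with the earlier record on the same
    contig that reaches furthest right, so not every pair is listed.
    """
    overlaps = []
    furthest = None
    for rec in sorted(records, key=lambda r: (r[0], r[1])):
        same_contig = furthest is not None and furthest[0] == rec[0]
        if same_contig and footprint(rec)[0] < footprint(furthest)[1]:
            overlaps.append((furthest, rec))
        if not same_contig or footprint(rec)[1] > footprint(furthest)[1]:
            furthest = rec
    return overlaps


snv = ("Z", 1, "A", "T")
ins = ("Z", 1, "A", "AG")
print(find_overlaps([snv, ins]))  # the padded insertion overlaps the SNV
```

Because the footprint is computed from the padded REF, the insertion in this example is reported as overlapping the SNV even though it does not change base 1.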

@pd3

pd3 commented Oct 29, 2014

@pgrosu @richarddurbin The validator from vcf-tools may not be the best choice, it's written in perl and may be too slow for this purpose. I understand the motivation is to check overlapping variants here?

@haussler

Dear David,

You can find the Mutalyzer suite at https://mutalyzer.nl. We have published an extended Backus-Naur form of the HGVS nomenclature syntax in 2011 in BMC Bioinformatics (doi: 10.1186/1471-2105-12-S4-S5). The EBNF is used to generate Mutalyzer’s HGVS syntax parser.

Best regards,

Peter

@pgrosu
Contributor

pgrosu commented Oct 29, 2014

@pd3, overlapping variants would be one case, which is also being continued in another discussion in #169. But we are trying to brainstorm different "hard to represent" genetic changes that we might encounter and how best to handle them. I initially brought up validation as a checkpoint, either to determine whether a file contains genetic changes that might be better represented another way, or to perform specific checks before the data goes into a GA4GH repository, based on the best practices we intend for data representation in our schema.

@reece
Member

reece commented Oct 29, 2014

I'm coming to the conversation late. A few comments on the thread:

  • Interbase is the only coordinate system being discussed that can represent all of the major edit types without corner cases. Although equal numerically to 0-based, right open, base coordinates, interbase is conceptually much cleaner. The biggest issue is that it forces implementations to think in terms of intervals throughout the code.
  • HGVS, a human readable syntax for variants, should be kept far away from the backend representation. I would not use it internally or for database representation. I personally think about this just like people think about utf-8 -- encode/decode at the IO boundaries and use unicode internally, everywhere.
  • I like @haussler's invertible operation demonstration of correctness. As David and I discussed at the HGVS meeting, the invertibility needs to be tested under an equivalence function that accounts for canonicalization. Also, I would think about this on the underlying representation (a graph, preferably) rather than in HGVS because some HGVS operations are lossy and therefore not invertible.
  • FWIW, the HGVS code (http://bitbucket.org/hgvs/hgvs) has an experimental script that adds an HGVS info field to a VCF.
  • I'll continue HGVS-specific comments in Best practices for joining NGS-derived & clinical variation databases #159.

@mlin
Member

mlin commented Oct 29, 2014

Just in case it may come in handy for anyone, here's a schematic by @asimenos we use to illustrate interbase coordinates (at least I hope we're referring to the same thing :)
Taken from https://wiki.dnanexus.com/Types/gri

[Image: interbase coordinate schematic]

@reece
Member

reece commented Oct 30, 2014

In the same vein, here's a Google spreadsheet that has a bunch of coordinate system and mapping examples that I put together when we were working on HGVS variant mapping. http://goo.gl/b1nUxl

@skeenan
Member

skeenan commented Apr 8, 2015

This issue has had a lot of discussion. It would be great to hear final comment on which standard GA4GH has settled on for deletion alleles and start/stop coordinates.

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this issue Jul 20, 2016
Check created time of analysis field