Variant annotation support #302

sarahhunt · 2015-05-07T10:59:54Z

These three new protocols form an initial proposal for variant annotation support.

VariantAnnotation and AlleleAnnotation records hold different types of annotation derived by comparing a Variant or Allele to a set of reference data. AnnotationSets group VariantAnnotation/AlleleAnnotation records and hold full details of all software and reference data sets used.

The effect of alternate alleles on transcript sets is the first type of annotation to be considered in detail.

Two methods protocols are proposed. alleleAnnotationmethods.avdl supports the mining of pre-calculated annotation data and annotateAllelemethods.avdl supports annotation as a service.
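
To make the grouping concrete, here is a minimal Python sketch (not the Avro schema itself) of annotation records pointing at their annotation set by ID. Only `annotationSetId` and the record names come from this PR; the `software`, `referenceData`, and `variantId` fields and all sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationSet:
    """Groups annotation records; fields other than id are hypothetical."""
    id: str
    software: List[str] = field(default_factory=list)       # assumed field
    referenceData: List[str] = field(default_factory=list)  # assumed field


@dataclass
class VariantAnnotation:
    id: str
    variantId: str        # assumed field linking back to the Variant
    annotationSetId: str  # the ID of the annotation set this record belongs to


def provenance(record: VariantAnnotation, sets: Dict[str, AnnotationSet]) -> AnnotationSet:
    """Look up the set that describes how a record was produced."""
    return sets[record.annotationSetId]


# Made-up sample data.
sets = {
    "as1": AnnotationSet(
        "as1",
        software=["example-annotator 1.0"],
        referenceData=["example-transcripts 2015-05"],
    )
}
record = VariantAnnotation(id="va1", variantId="v1", annotationSetId="as1")
print(provenance(record, sets).software)
```

Because each record carries only the set ID, a subset of records can be handed out without copying the software and reference data details into every one of them.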


reece · 2015-05-09T15:56:03Z

A verbatim copy of GA4GH_prototype_annotations2.docx, sent by Sarah Hunt to ga4gh-dwg-annotation@googlegroups.com on May 7, is available as a gist: https://gist.github.com/reece/0604e25e78f5c63e1a5f

pgrosu · 2015-05-09T16:12:57Z

Thank you for sharing, @reece - it definitely helps to have a concrete example. I'm not on that email list, and it would be helpful to see how we got to this design. Is it free to join?

Thanks,
Paul

reece · 2015-05-11T14:20:35Z

@pgrosu I think you need to contact @skeenan to be added to the annotation list.

pgrosu · 2015-05-11T15:58:40Z

Ah, thank you @reece. @skeenan any possibility? It would just help with being in sync.

Thank you,
Paul

reece · 2015-05-12T07:32:16Z

@sarahhunt - Thanks for the proposal. That helped tremendously!

Below is a long list of comments specific to your proposal and, more importantly, some design goals that I believe we should consider.

Putting money where my mouth is, I created an example (https://gist.github.com/reece/880aa4283d404a2563cd) that illustrates some of these goals. (N.B. My example has its own flaws, like not being able to store annotations from multiple versions of the same tool. This group should consider it a strawman alternative, not a proposed solution.)

Specific suggestions about the proposal:

Use consistent syntax. I'm especially thinking about camelCase v. underscore v. all-caps. Assuming we're standardizing on camelCase, "IMPACT" should be "impact", "variantannotations" should be "variantAnnotations", etc. I'm fine with notable exceptions like "HGVSg". (I don't care at all about which convention, just that there is exactly one.)
Drop the concept of annotationset in the annotation itself. This is better modeled as an annotation set containing annotations (rather than annotations referring to their container).

Goals:

File structures containing variants and variant annotations should be easily mergeable. The primary use case is to merge precomputed variant annotations with a patient's set of annotations.
Predictions should be typed. That is, annotations from a single source or of a particular type should follow a similar schema.
Use stable external refs wherever possible. For example, "effects" should use SO ids. (The human readable names are not guaranteed stable and have changed at least once to my knowledge.)
All locations must have a reference sequence specified by accession (not name, like "chr22"). Doing otherwise is as good as an address number without a street name. I think it's essential that this annotation spec make the sequence reference unambiguous.
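
A minimal Python sketch of the merge use case, assuming precomputed annotations and patient variants can both be keyed on (reference sequence accession, start, reference bases, alternate bases). The key layout, the dictionary shapes, and the positions are assumptions for illustration, not part of the proposal; the point is that keying on an accession rather than a name like "chr22" keeps the join unambiguous.

```python
from typing import Dict, List, Tuple

# (reference sequence accession, 0-based start, reference bases, alternate bases)
AlleleKey = Tuple[str, int, str, str]


def merge_annotations(
    patient_variants: List[AlleleKey],
    precomputed: Dict[AlleleKey, List[str]],
) -> Dict[AlleleKey, List[str]]:
    """Attach precomputed effect IDs to each patient variant.

    Variants without a precomputed entry get an empty list rather than
    being dropped, so the caller can see what still needs annotating.
    """
    return {key: list(precomputed.get(key, [])) for key in patient_variants}


# Made-up positions; effects are stored as SO IDs rather than names.
precomputed = {
    ("NC_000022.10", 1000, "A", "G"): ["SO:0001583"],  # missense_variant
}
patient = [
    ("NC_000022.10", 1000, "A", "G"),
    ("NC_000022.10", 2000, "C", "T"),
]
merged = merge_annotations(patient, precomputed)
print(merged)
```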

pcingola · 2015-05-12T12:05:35Z

Hi @reece, thank you for your comments, here are some answers (at least my opinion):

i) Consistent syntax: The examples you mentioned seem to be in proper camelCase in Sarah's PR ('variantannotations' is lowercase only in POST URLs, and I did not find "IMPACT" in all caps). Maybe I'm looking in the wrong place...?

ii) The concept of annotationSet in the annotation itself is analogous to the concept of 'readGroupId' in 'ReadAlignment' itself. This was added as a suggestion from the group (the initial version did not have it) and we all thought it was quite useful.

Goals:
i) Your comment that "variants and variant annotations should be easily mergeable" has, in my opinion, been taken care of. "Merging" variants with their respective annotations is a function of the annotation algorithm, which is included in Sarah's proposal (see AnnotateVariantsRequest). The API user should NEVER have to do this manually, because doing so is error prone and annotation algorithms take care of many corner cases that users are usually unaware of.

ii) I'm not sure what you meant with "Predictions should be typed", can you clarify?

iii) SO IDs: We could include them, but in my opinion this makes the record less useful ('missense_variant' is much more readable than 'SO:0001583'). In any case, I think what you mention is a problem of versioning ontologies, which we don't intend to solve here. Furthermore, the structure of the ontology can also change, so using IDs doesn't solve it. We should leave this as it is and modify it later, when GA4GH solves ontology versioning issues, and then use an 'OntologyTerm'.

iv) Annotations are anchored to variants, which already have a reference sequence ID, so asking for "chr22" makes a lot of sense because this is a context-specific query (maybe I misunderstood your point here). In any case, there should perhaps be an option for using sequence IDs instead of chromosome names, but I would suggest adding it after the PR has been accepted.

helenp · 2015-05-12T12:15:22Z

Hi @pcingola, regarding iii): you could use the SO term, ID, and version. If you don't do this, then IDs need to be looked up, and there is an annotation-ontology mapping problem and an ontology versioning problem to deal with. We will work on the ontology component to make some recommendations for discussion in Leiden so that this is consistent across different components.
All ontologies can change structure, and there are diff tools that allow tracking of changes. Moving between ontology versions is a complex problem, and I don't think GA4GH can solve it, though there are some tools available.

sarahhunt · 2015-05-12T12:25:39Z

Hi @reece and @pcingola

A few additional comments -

The Avro schemas follow the standard syntax: https://github.com/ga4gh/schemas/blob/master/CONTRIBUTING.md#syntax_style, apart from the HGVS attributes, which maybe should really all be lower case too. @reece - I suspect you are looking at the output from the prototype implementation that I mailed round the group rather than the schemas. As mentioned, it is a work in progress and you have spotted a couple of compliance errors! Maybe it was too early to share this output.

The annotationSet ID is in the annotation record to support extracting a subset of annotations (for example, over a specific gene) together with the annotationSet record, which contains information on the software and reference data versions used. An annotationSet could be a database of 'all current variants vs all current transcripts', so if the set referenced the annotations it could get quite unwieldy.

Thanks for the recommendation @helenp. The effects are currently represented by an array of OntologyTerms. There is no version in the OntologyTerm at the moment (just source, id, name), but it would make sense to add it.
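
As a rough sketch of that suggestion, an OntologyTerm carrying an optional version might look like the following in Python. The `source`, `id`, and `name` fields are the ones listed above; the `version` field, its string format, the placeholder old name, and the comparison rule are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyTerm:
    source: str
    id: str
    name: str
    version: Optional[str] = None  # assumed addition; not in the schema yet


def same_term(a: OntologyTerm, b: OntologyTerm) -> bool:
    """Compare on the stable identifier, ignoring name and version."""
    return (a.source, a.id) == (b.source, b.id)


# The older name and both version strings are placeholders, not real SO data.
old = OntologyTerm("SO", "SO:0001583", "hypothetical_old_name", version="hypothetical-v1")
new = OntologyTerm("SO", "SO:0001583", "missense_variant", version="hypothetical-v2")
print(same_term(old, new))
```

Keeping the version on the term lets a client notice that two records were annotated against different ontology releases while still matching them on the ID.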

pcingola · 2015-05-12T19:45:38Z

+1

calbach · 2015-05-29T05:11:38Z

src/main/resources/avro/alleleAnnotationmethods.avdl

+  Only return variant annotations for any of these features
+  If null, return all variant annotations in specified window.
+  */
+  union { null, array<org.ga4gh.models.Feature> } features;


How does the matching work? Is this matching feature IDs of TranscriptEffects or is this doing a live JOIN against the regions annotated by those features?

If it's the former, this should be named more specifically and should refer to the feature_id. If it's the latter, I'm not sure what this buys you beyond the above start/end parameters.

sarahhunt · 2015-05-29

It's the former. Do the new comments at the top of the file make this clear?
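
To illustrate "the former" (matching on the feature IDs recorded in each transcript effect, with no join against feature coordinates), a minimal Python sketch follows. The record layout, the `featureId` and `start` field names, and the half-open window are assumptions for illustration, not the schema in this PR.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class TranscriptEffect:
    featureId: str  # assumed name for the ID of the affected transcript


@dataclass
class VariantAnnotation:
    id: str
    start: int  # assumed: position of the annotated variant
    transcriptEffects: List[TranscriptEffect] = field(default_factory=list)


def search(
    annotations: Iterable[VariantAnnotation],
    start: int,
    end: int,
    feature_ids: Optional[Set[str]] = None,
) -> List[VariantAnnotation]:
    """Return annotations in [start, end); None means no feature filter."""
    hits = []
    for ann in annotations:
        if not (start <= ann.start < end):
            continue
        if feature_ids is not None and not any(
            eff.featureId in feature_ids for eff in ann.transcriptEffects
        ):
            continue
        hits.append(ann)
    return hits


# Made-up sample data.
data = [
    VariantAnnotation("va1", 100, [TranscriptEffect("tx1")]),
    VariantAnnotation("va2", 150, [TranscriptEffect("tx2")]),
    VariantAnnotation("va3", 900, [TranscriptEffect("tx1")]),
]
print([a.id for a in search(data, 0, 500, {"tx1"})])
print([a.id for a in search(data, 0, 500)])
```

Here the window and the feature filter are independent: the window restricts by variant position, and the feature IDs restrict by which transcripts the recorded effects refer to.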

sarahhunt · 2015-05-29T11:21:07Z

Thanks for the input @calbach!

@diekhans - I've added to the documentation, but if you, or anyone else, find anything else too vague do let me know.

calbach · 2015-06-26T23:21:40Z

src/main/resources/avro/alleleAnnotations.avdl

+
+  /** The ID of the annotation set this record belongs to. */
+  string annotationSetId;
+


nit: Please remove these double newlines throughout, unless this style convention is used elsewhere(?).

sarahhunt · 2015-06-29

Thanks - that's tidied up now.

calbach · 2015-06-26T23:29:21Z

I've added some more comments inline. In general things look sane, but I haven't dug too much into the details of the contents of the actual annotations. I would like it if someone from a more bio-heavy background could do a thorough review of those fields.

jacmarjorie · 2015-06-30T17:39:49Z

+1

AngieHinrichs · 2015-06-30T18:10:58Z

+1 Thanks for laying the foundation!

rajgopals · 2015-07-01T05:46:35Z

+1

skeenan · 2015-07-01T08:51:42Z

Merging to master.

Variant annotation support

stevenbrenner · 2015-07-01T23:19:20Z

+1

sarahhunt added 19 commits March 18, 2015 14:07

proposed variant annotation additions

104a7cf

Merge branch 'master' into variation_annotation

9c4dbf4

pull in sequence annotation updates

add variants/annotate

bcf7fa1

add id for annotation record & use Feature instead of id for feature

aafa632

use org.ga4gh.models.OntologyTerm rather than string for effect

c5bee31

group location/sequence information in different coordinate systems

2b858d2

Add draft annotation set

18e9323

add Impact as defined in PR126

3c608de

Add search annotation sets option

5831db1

add support for non-standard annotation information

3111794

remove extraneous alt allele - defined in parent

298e429

reinstate misssing alt allele

0c90353

move variant annotate to separate protocol

eb14934

shrink/wrap comment lines

4952e46

move annotation out of Variant

8e19d02

remove unnecessary includes

4dafb03

pull in updates from master

ff9dd19

clarify/neaten comments

84e188d

Add get for annotation set; remove gets for annotation records for now

de05933

pgrosu mentioned this pull request May 7, 2015

Variant Annotation branch - initial API draft #290

Closed

delagoya added the Variant Annotation label May 7, 2015

calbach reviewed May 29, 2015

sarahhunt added 6 commits May 29, 2015 09:36

search single dataset/annotationSets only as perl PR253

119cbca

improve comments re search by feature

0d710ef

add pagination to annotationsets/search

c78ed14

improve documentation

9e4b476

switch col-located variants to use ids for simple look up

a4df121

add ids for VariantAnnotation and AlleleAnnotation records

1142f02

sarahhunt closed this May 29, 2015

sarahhunt reopened this May 29, 2015

calbach reviewed Jun 26, 2015

sarahhunt added 2 commits June 29, 2015 09:26

switch query by feature to query by feature id

aceb224

reduce whitespace

e2c98a2

skeenan added a commit that referenced this pull request Jul 1, 2015

Merge pull request #302 from ga4gh/variation_annotation

2d1afa8

Variant annotation support

skeenan merged commit 2d1afa8 into master Jul 1, 2015

sarahhunt mentioned this pull request Dec 19, 2015

Variant Annotation re-merge #519

Merged
