-
Notifications
You must be signed in to change notification settings - Fork 112
Conversation
pull in sequence annotation updates
Verbatim copy of GA4GH_prototype_annotations2.docx, sent by Sarah Hunt to ga4gh-dwg-annotation@googlegroups.com on May 7 is available as a gist at https://gist.github.com/reece/0604e25e78f5c63e1a5f . |
Thank you for sharing @reece - it definitely helps having a concrete example. I'm not on that email list, since it would be helpful to see how we got to this design. Is it free to join? Thanks, |
@sarahhunt - Thanks for the proposal. That helped tremendously! Below is a long list of comments specific to your proposal and, more importantly, some design goals that I believe we should consider. Putting money where my mouth is, I created an example (https://gist.github.com/reece/880aa4283d404a2563cd) that illustrates some of these goals. (N.B. My example has it's own flaws, like not being able to store annotations from multiple versions of the same tool. This group should consider it a strawman alternative, not a proposed solution.) Specific suggestions about the proposal:
Goals:
|
Hi @reece, thank you for your comments, here are some answers (at least my opinion): i) Consistent syntax: The examples you mentioned seem to in proper cameCase in Sarah's PR ('variantannotations' is lowercase only in POST URLs, and I did not find the "IMPACT" in all caps). May be I'm looking at the wrong place...? ii) The concept of annotationSet in the annotation itself is analogous to the concept of 'readGroupId' in 'ReadAlignment' itself. This was added as a suggestion form the group (the initial version did not have it) and we all thought it was quite useful. Goals: ii) I'm not sure what you meant with "Predictions should be typed", can you clarify? iii) SO IDs: We could include them, but in my opinion this makes the record less useful ('missense_variant' is much more readable than 'SO:0001583'). In any case, I think that what you mention is a problem about versioning ontologies. A problem that we don't intend to solve here, furthermore, the structure of the ontology can also change, so using IDs doesn't solve it. We should leave this as it is and modify it later, when GA4GH solves ontologies versioning issues and then use an 'OntologyTerm'. iv) Annotations are anchored to variants, which already have reference sequence ID, so asking for "chr22" makes a lot of sense because this is a context specific query (may be I misunderstood your point here). In any case, there should perhaps be an option for using sequence IDs instead of chromosome names, but I would suggest to add it after the PR has been accepted. |
Hi @pcingola iii) you could use the SO term, id and version. If you don't do this then ids need to be looked up and there is an annotation-ontology mapping problem and an ontology versioning problem to deal with. We will work on the ontology component to make some recommendations for discussion in Leiden so that this is consistent across different components. |
A few additional comments - The avro schemas follow the standard syntax: https://github.com/ga4gh/schemas/blob/master/CONTRIBUTING.md#syntax_style, apart from the HGVS attributes, which maybe should really be all be lower case too. @reece - I suspect you are looking at the output from the prototype implementation that I mailed round the group rather than the schemas. As mentioned, it is a work in progress and you have spotted a couple of compliance errors! Maybe it was too early to share this output. The annotationSet id is in the annotation record to support the extraction of a subset of annotation - maybe over a specific gene - and the annotationSet record which contains information on software and reference data versions used. An annotationSet could be a database of 'all current variants vs all current transcripts', so if the set referenced the annotations it could get quite unwieldy. Thanks for the recommendation @helenp. The effects are currently represented by an array of OntologyTerm's. There is no version in the OntologyTerm at the moment, ( just source, id, name) but it would make sense to add it. |
+1 |
Only return variant annotations for any of these features | ||
If null, return all variant annotations in specified window. | ||
*/ | ||
union { null, array<org.ga4gh.models.Feature> } features; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the matching work? Is this matching feature IDs of TranscriptEffects or is this doing a live JOIN against the regions annotated by those features?
If it's the former, this should be named more specifically and should refer to the feature_id. If it's the latter, I'm not sure what this buys you beyond the above start/end parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the former. Do the new comments at the top of the file make this clear?
|
||
/** The ID of the annotation set this record belongs to. */ | ||
string annotationSetId; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please remove these double newlines throughout, unless this style convention is used elsewhere(?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - that's tidied up now.
I've added some more comments inline. In general things look sane, but I haven't dug too much into the details of the contents of the actual annotations. I would like it if someone from a more bio-heavy background could do a thorough review of those fields. |
+1 |
+1 Thanks for laying the foundation! |
+1 |
Merging to master. |
+1 |
These three new protocols form an initial proposal for variant annotation support.
VariantAnnotation and AlleleAnnotation records hold different types of annotation derived by comparing a Variant or Allele to a set of reference data. AnnotationSets group VariantAnnotation/ AlleleAnnotation records and hold full details of all software and reference data sets used.
The effect of alternate alleles on transcript sets is the first type of annotation to be considered in detail.
Two methods protocols are proposed. alleleAnnotationmethods.avdl supports the mining of pre-calculated annotation data and annotateAllelemethods.avdl supports annotation as a service.