
BioMetadata Protocol #636

Merged: 7 commits into master from biodataproto, Jun 28, 2016

Conversation

@david4096 (Member) commented Jun 16, 2016

The GA4GH API currently lacks a unified way for describing samples across data types. This means that it is difficult to reconcile whether a variant call or read group is from the same sample. This PR adds the ability to filter by samples to these types.

The BioMetadata protocol allows read groups and call sets to be related to BioSamples. This is done by adding bioSampleId fields to both of those records. bioSampleId is then an optional field in both SearchReadGroupSetsRequest and SearchCallSetsRequest. In addition to being able to filter call sets and read groups, search and get endpoints for biosamples and individuals have been added.

Individuals can be searched by name, and biosamples can be filtered by individualId and name.
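For illustration, a minimal sketch of the new filters over JSON/HTTP. The field names come from this PR; the endpoint paths, request shapes, example ids, and response handling are assumptions made for the example:

import requests

BASE = "http://ga4gh"  # placeholder server root, as in the URLs below

# Filter call sets by the new optional bioSampleId field
# (SearchCallSetsRequest); "vs-1" and "bs-1" are made-up example ids.
call_sets = requests.post(
    BASE + "/callsets/search",
    json={"variantSetId": "vs-1", "bioSampleId": "bs-1"},
).json()

# Filter biosamples by individualId and name (SearchBioSamplesRequest);
# both filters are optional.
biosamples = requests.post(
    BASE + "/biosamples/search",
    json={"individualId": "ind-1", "name": "sample-1"},
).json()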

Given a read, find the individual it derives from (a sketch follows these steps):

  1. Read the bioSampleId from a ReadGroup record.
  2. Get the BioSample record by performing a GET at http://ga4gh/biosamples/.
  3. Read the individualId from the BioSample record.
  4. Get the Individual record by performing a GET at http://ga4gh/individuals/.
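A minimal sketch of these four steps, assuming the GET endpoints above take an id path segment and return JSON records carrying the field names this PR introduces:

import requests

BASE = "http://ga4gh"  # placeholder root from the URLs above

def individual_for_read_group(read_group: dict) -> dict:
    # Step 1: read the bioSampleId from the ReadGroup record.
    bio_sample_id = read_group["bioSampleId"]
    # Step 2: GET the BioSample record.
    bio_sample = requests.get(BASE + "/biosamples/" + bio_sample_id).json()
    # Step 3: read the individualId from the BioSample record.
    individual_id = bio_sample["individualId"]
    # Step 4: GET the Individual record.
    return requests.get(BASE + "/individuals/" + individual_id).json()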

Given an individual, find its reads (a sketch follows these steps):

  1. From the Individual record get the individualId.
  2. Create a SearchBioSamplesRequest with the provided individualId.
  3. Perform a SearchReadGroupSetsRequest with each returned bioSampleId.
  4. Only read groups matching the bioSampleId will be returned. Perform a SearchReadsRequest for each read group in the SearchReadGroupSetsResponse.
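And the reverse direction as a hedged sketch: the search paths, JSON casing, and response keys (bioSamples, readGroupSets, readGroups, readGroupIds) are assumptions; the filter fields are the ones this PR introduces:

import requests

BASE = "http://ga4gh"  # placeholder root from the URLs above

def reads_for_individual(individual_id: str):
    # Step 2: find the individual's biosamples.
    biosamples = requests.post(
        BASE + "/biosamples/search", json={"individualId": individual_id}
    ).json()["bioSamples"]
    for bio_sample in biosamples:
        # Step 3: only read group sets matching the bioSampleId come back.
        rg_sets = requests.post(
            BASE + "/readgroupsets/search", json={"bioSampleId": bio_sample["id"]}
        ).json()["readGroupSets"]
        # Step 4: a SearchReadsRequest for each read group in the response.
        for rg_set in rg_sets:
            for read_group in rg_set["readGroups"]:
                yield requests.post(
                    BASE + "/reads/search", json={"readGroupIds": [read_group["id"]]}
                ).json()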

To clarify the metadata roles, some reusable elements (Experiment and Analysis) have been moved to AssayMetadata. Additional documentation is included in this PR to support proper usage of Ontology Terms. Thanks to everyone for all the effort put into this @mbaudis @diekhans @jeromekelleher @bwalsh @saupchurch @sarahhunt @dzerbino


In src/main/proto/ga4gh/assay_metadata.proto:

// The time at which this message was created.
// Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
string message_create_time = 4;
@jeromekelleher (Contributor):

Why message_create_time here? The word "message" seems ambiguous, i.e., does this refer to the individual message objects as they are being sent to the client, or to the 'record' that is stored on the server? I'm guessing that this is because of a clash with run_time? I would have thought it would be better to just use created and updated for the sake of consistency with all the other objects (and change them all in one go to something more specific like creation_timestamp, update_timestamp, if/when this is done).

I realise this happened before this specific set of changes, so please feel free to point me in the direction of the relevant discussion. This isn't a criticism of the present PR, I just thought I'd bring it up while we're reorganising this metadata stuff.


@diekhans (Contributor):

The time fields in the API are weakly defined. The other objects use Unix time and appear to record when data is loaded into the database. This is relevant to environments like the Google cloud, but less important when serving up an existing dataset.

Jerome Kelleher notifications@github.com writes:

In src/main/proto/ga4gh/assay_metadata.proto:

+// An experimental preparation of a sample.
+message Experiment {
+  // The experiment UUID. This is globally unique.
+  string id = 1;
+  // The name of the experiment.
+  string name = 2;
+  // A description of the experiment.
+  string description = 3;
+  // The time at which this message was created.
+  // Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
+  string message_create_time = 4;


Member:

I thought the consensus was to do ISO, and settle on a standard naming for the object housekeeping time attributes? I'm trying to find my lengthy exchange with @diekhans ...

@jeromekelleher (Contributor):

OK, I'm sorry I brought this up now! It's what's in master currently, so not strictly related to this PR. I've created an issue to track it: #637.

@diekhans (Contributor):

The time format has not been changed yet in the other records.

For now, let's make the names consistent with the other records (created and updated), but leave the time as ISO. We can then take on the global conversion to ISO time.

message_create_time is a confusing name anyway. It should be abstracted away from protobuf.
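For reference, the ISO 8601 form being kept, as a minimal Python sketch (not part of the PR):

from datetime import datetime, timezone

def iso8601_millis(dt: datetime) -> str:
    # YYYY-MM-DDTHH:MM:SS.SSS with a UTC "Z" suffix,
    # e.g. 2015-02-10T00:03:42.123Z as in the schema comment.
    utc = dt.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % millis

print(iso8601_millis(datetime.now(timezone.utc)))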

Member:

Full agreement with @diekhans here.

Contributor:

id should be a server-defined id, not a UUID.

@david4096 (Member, author):

Re: UUID, fixed here and in master.

@jeromekelleher (Contributor):

This looks hugely useful, and fixes a lot of the shortcomings we're experiencing in the protocol when working with the WGS500 dataset. I have a couple of minor points above, but overall I think this is a massive improvement and we should try to merge it as soon as possible.

@diekhans (Contributor):

@david4096 pointed out that many of the inconsistencies being noted are already in master. Given this, we should go ahead with this PR and fix the inconsistencies in separate PRs.

That is a +1.

@jeromekelleher (Contributor):

I'm happy to +1 this if we remove the dataset_id argument from the searches. I don't think we've clearly defined the model here. This is definitely different to other objects which unequivocally have a dataset_id attribute, and can therefore be said to belong to a dataset. Rather than get bogged down in this, I suggest we just remove dataset_id from the search functions for now, as it's optional anyway.

@diekhans (Contributor):

Agreed to move forward and fix the dataset inconsistency in another PR.


Mention that dataset_id is optional
@david4096 (Member, author) commented Jun 20, 2016

@jeromekelleher I added a line stating that dataset_id is optional. Looking forward to moving ahead with this! Please see here and here for previous discussion.

@diekhans (Contributor):

@jeromekelleher the problem here is that dataset was removed from the metadata objects. All objects should be in datasets (IMHO, it's wrong that references are special-cased, and there is a discussion about fixing it).

A dataset is an ownership bag. This ownership appears to have the side effect that it also scopes name, making it the owner's responsibility to ensure name uniqueness.

It is specifically not a barrier to doing cross-dataset analysis. That is really expected to be the norm.

@calbach, @dglazer Dataset still causes lots of confusion, with people wanting it to be something that it's not. Could you please work on more detailed documentation?

I am +1 for putting all metadata in a dataset and leaving it in the query.

@pgrosu (Contributor) commented Jun 21, 2016

You might remember my diagram from a couple of years ago - you don't really need a dataset id, as all the generated components are tied to one another, and any (manually/auto-generated) additional meta-attributes would be automatically bound and propagated from the previous data component that generated them, all stemming from an initial collection of samples associated with a project/study:

#74 (comment)

Each specific Project/Study - which encompasses the Experiment and Analysis performed - would thus differentiate similar collections of Samples that were processed in different ways.

@diekhans (Contributor):

The current approach of the GA4GH API uses datasets as the container. This could be changed, but we would do that as part of a full API design, not piecemeal.


@pgrosu (Contributor) commented Jun 22, 2016

So part of the reason why I wrote up the comprehensive analysis of the GA4GH ecosystem (which was the final push to switch us to Protocol Buffers) was to take a step back and consider all the moving parts and types of data models for GA4GH, including their associated methods and queries. When I created the above-referenced diagram, it was to get us to consider another approach to how data moves along a wire from multiple global repositories, in order to help us put together the schemas and methods that would allow for the complex queries @lh3 and @jeromekelleher are looking for. Yes, a full API would have additional <Key,Value> model-based components with several dynamically updated inverted indices to allow examples like the below-referenced comments, but these are based on having the minimally required definitions of a GA4GH API wire protocol that would naturally lend itself to auto-updating all the dependent components, based on the volume of data flowing over the wire in the coming years:

#248 (comment)

#142 (comment)

#154 (comment)

I also suspect that digests of such data structures would surely be distributed, where load-balanced analyses would be performed across multiple ever-increasing repositories in order to also pick specific versions of the sets of interest before having the processed results reported back to the user - which is another layer of complexity. In any case, I'm a fairly patient person :)

Hope it helps,
Paul

@kozbo (Contributor) commented Jun 22, 2016

Poke. Can we wrap up this discussion with another +1 from someone? @jeromekelleher, have your concerns been met with both the spun-off issues and some minor adjustments?

@jeromekelleher (Contributor):

I'm still not sure we've properly defined the semantics of the dataset_id arguments here @kozbo, which is why I suggested deleting them for now. However, I guess we can think of these objects as just inheriting their dataset from the VariantSet or ReadGroupSet or whatever, so the meaning is clear enough.

+1

@macieksmuga (Contributor):

+1 exactly as is now (i.e., with dataset_id optionally in requests, but not as part of Individual or BioSample objects). We will shortly be proposing a separate PR that aims to clarify the role of Dataset across the entire API.

@diekhans (Contributor):

I don't see how we can have it in the request if it's not part of the object. There is nothing to query against.


@macieksmuga (Contributor):

@diekhans Having the dataset_id in the search but not in the object models an optional many-to-many relationship between Individual and Dataset. It is exactly the same formalism as currently found in the API between Reference and ReferenceSet. I agree that adding documentation to explicitly state this is desirable, but we were hoping to introduce said documentation as part of a larger Dataset overhaul.
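For concreteness, the shape under debate, as a hedged sketch (endpoint path and JSON casing assumed as in the earlier sketches): the search request carries an optional datasetId even though the returned BioSample record itself would not.

import requests

BASE = "http://ga4gh"  # placeholder root as in the earlier sketches

# Optional datasetId narrows the search; omitting it searches across datasets.
biosamples = requests.post(
    BASE + "/biosamples/search",
    json={"datasetId": "ds-1", "name": "sample-1"},
).json()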

@diekhans (Contributor):

I don't understand. The Reference and ReferenceSet searches don't take dataset_id. There is no concept of an optional dataset in the schemas.

Dataset was removed from the object based on confusion. Can we just put it back in?


@pgrosu (Contributor) commented Jun 23, 2016

Mark, it's encapsulated by the intersection of shared information, which connects the two together via source_accessions, leading to natural many-to-many translation tables. It would basically allow for the tables I wrote up here:

#125 (comment)

Then the more frequent many-to-many connections can be cached for faster access.

~p

Add dataset_id into messages
@macieksmuga merged commit 7e58683 into master on Jun 28, 2016
@macieksmuga deleted the biodataproto branch on June 28, 2016 18:52