
BioMetadata Protocol #636

Merged: 7 commits into master from biodataproto, Jun 28, 2016

Conversation

@david4096 (Member) commented Jun 16, 2016

The GA4GH API currently lacks a unified way for describing samples across data types. This means that it is difficult to reconcile whether a variant call or read group is from the same sample. This PR adds the ability to filter by samples to these types.

The BioMetadata protocol allows read groups and call sets to be related to BioSamples. This is done by adding bioSampleId fields to both of those records. bioSampleId is then an optional field in both SearchReadGroupSetsRequest and SearchCallSetsRequest. In addition to being able to filter call sets and read groups, search and get endpoints for biosamples and individuals have been added.

Individuals can be searched by name, and biosamples can be filtered by individualId and name.
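For illustration, a minimal sketch of the new filters over JSON/HTTP. The field names come from this PR; the endpoint paths, request shapes, example ids, and response handling are assumptions made for the example:

import requests

BASE = "http://ga4gh"  # placeholder server root, as in the URLs below

# Filter call sets by the new optional bioSampleId field
# (SearchCallSetsRequest); "vs-1" and "bs-1" are made-up example ids.
call_sets = requests.post(
    BASE + "/callsets/search",
    json={"variantSetId": "vs-1", "bioSampleId": "bs-1"},
).json()

# Filter biosamples by individualId and name (SearchBioSamplesRequest);
# both filters are optional.
biosamples = requests.post(
    BASE + "/biosamples/search",
    json={"individualId": "ind-1", "name": "sample-1"},
).json()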

Given a read, find the individual it derives from (a sketch follows these steps):

  1. Read the bioSampleId from a ReadGroup record.
  2. Get the BioSample record by performing a GET at http://ga4gh/biosamples/.
  3. Read the individualId from the BioSample record.
  4. Get the Individual record by performing a GET at http://ga4gh/individuals/.
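A minimal sketch of these four steps, assuming the GET endpoints above take an id path segment and return JSON records carrying the field names this PR introduces:

import requests

BASE = "http://ga4gh"  # placeholder root from the URLs above

def individual_for_read_group(read_group: dict) -> dict:
    # Step 1: read the bioSampleId from the ReadGroup record.
    bio_sample_id = read_group["bioSampleId"]
    # Step 2: GET the BioSample record.
    bio_sample = requests.get(BASE + "/biosamples/" + bio_sample_id).json()
    # Step 3: read the individualId from the BioSample record.
    individual_id = bio_sample["individualId"]
    # Step 4: GET the Individual record.
    return requests.get(BASE + "/individuals/" + individual_id).json()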

Given an individual, find its reads (a sketch follows these steps):

  1. From the Individual record get the individualId.
  2. Create a SearchBioSamplesRequest with the provided individualId.
  3. Perform a SearchReadGroupSetsRequest with each returned bioSampleId.
  4. Only read groups matching the bioSampleId will be returned. Perform a SearchReadsRequest for each read group in the SearchReadGroupSetsResponse.
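And the reverse direction as a hedged sketch: the search paths, JSON casing, and response keys (bioSamples, readGroupSets, readGroups, readGroupIds) are assumptions; the filter fields are the ones this PR introduces:

import requests

BASE = "http://ga4gh"  # placeholder root from the URLs above

def reads_for_individual(individual_id: str):
    # Step 2: find the individual's biosamples.
    biosamples = requests.post(
        BASE + "/biosamples/search", json={"individualId": individual_id}
    ).json()["bioSamples"]
    for bio_sample in biosamples:
        # Step 3: only read group sets matching the bioSampleId come back.
        rg_sets = requests.post(
            BASE + "/readgroupsets/search", json={"bioSampleId": bio_sample["id"]}
        ).json()["readGroupSets"]
        # Step 4: a SearchReadsRequest for each read group in the response.
        for rg_set in rg_sets:
            for read_group in rg_set["readGroups"]:
                yield requests.post(
                    BASE + "/reads/search", json={"readGroupIds": [read_group["id"]]}
                ).json()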

To clarify the metadata roles, some reusable elements (Experiment and Analysis) have been moved to AssayMetadata. Additional documentation is included in this PR to support proper usage of Ontology Terms. Thanks to everyone for all the effort put into this @mbaudis @diekhans @jeromekelleher @bwalsh @saupchurch @sarahhunt @dzerbino


In src/main/proto/ga4gh/assay_metadata.proto:

// The time at which this message was created.
// Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
string message_create_time = 4;
@jeromekelleher (Contributor):

Why message_create_time here? The word "message" seems ambiguous, i.e., does this refer to the individual message objects as they are being sent to the client, or to the 'record' that is stored on the server? I'm guessing that this is because of a clash with run_time? I would have thought it would be better to just use created and updated for the sake of consistency with all the other objects (and change them all in one go to something more specific like creation_timestamp, update_timestamp, if/when this is done).

I realise this happened before this specific set of changes, so please feel free to point me in the direction of the relevant discussion. This isn't a criticism of the present PR, I just thought I'd bring it up while we're reorganising this metadata stuff.


@diekhans (Contributor):

The time fields in the API are weakly defined. The other objects use Unix time and appear to record when data is loaded into the database. This is relevant to environments like the Google cloud, but less important when serving up an existing dataset.

Jerome Kelleher notifications@github.com writes:

In src/main/proto/ga4gh/assay_metadata.proto:

+// An experimental preparation of a sample.
+message Experiment {
+  // The experiment UUID. This is globally unique.
+  string id = 1;
+  // The name of the experiment.
+  string name = 2;
+  // A description of the experiment.
+  string description = 3;
+  // The time at which this message was created.
+  // Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
+  string message_create_time = 4;


Member:

I thought the consensus was to do ISO, and settle on a standard naming for the object housekeeping time attributes? I'm trying to find my lengthy exchange with @diekhans ...

@jeromekelleher (Contributor):

OK, I'm sorry I brought this up now! It's what's in master currently, so not strictly related to this PR. I've created an issue to track it: #637.

@diekhans (Contributor):

The time format has not been changed yet in the other records.

For now, let's make the names consistent with the other records (created and updated), but leave the time as ISO. We can then take on the global conversion to ISO time.

message_create_time is a confusing name anyway. It should be abstracted away from protobuf.
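For reference, the ISO 8601 form being kept, as a minimal Python sketch (not part of the PR):

from datetime import datetime, timezone

def iso8601_millis(dt: datetime) -> str:
    # YYYY-MM-DDTHH:MM:SS.SSS with a UTC "Z" suffix,
    # e.g. 2015-02-10T00:03:42.123Z as in the schema comment.
    utc = dt.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % millis

print(iso8601_millis(datetime.now(timezone.utc)))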

Member:

Full agreement with @diekhans here.

Contributor:

id should be a server-defined id, not a UUID.

@david4096 (Member, author):

Re: UUID, fixed here and in master.

@jeromekelleher (Contributor):

This looks hugely useful, and fixes a lot of the shortcomings we're experiencing in the protocol when working with the WGS500 dataset. I have a couple of minor points above, but overall I think this is a massive improvement and we should try to merge it as soon as possible.

@diekhans (Contributor):

@david4096 pointed out that many of the inconsistencies being noted are already in master. Given this, we should go ahead with this PR and fix the inconsistencies in separate PRs.

That is a +1.

@jeromekelleher (Contributor):

I'm happy to +1 this if we remove the dataset_id argument from the searches. I don't think we've clearly defined the model here. This is definitely different to other objects which unequivocally have a dataset_id attribute, and can therefore be said to belong to a dataset. Rather than get bogged down in this, I suggest we just remove dataset_id from the search functions for now, as it's optional anyway.

@diekhans (Contributor):

Agreed to move forward and fix the dataset inconsistency in another PR.


Mention that dataset_id is optional
@david4096 (Member, author) commented Jun 20, 2016

@jeromekelleher I added a line stating that dataset_id is optional. Looking forward to moving ahead with this! Please see here and here for previous discussion.

@diekhans (Contributor):

@jeromekelleher the problem here is that dataset was removed from the metadata objects. All objects should be in datasets (IMHO, it's wrong that references are special-cased, and there is a discussion about fixing it).

A dataset is an ownership bag. This ownership appears to have the side effect that it also scopes name, making it the owner's responsibility to ensure name uniqueness.

It is specifically not a barrier to doing cross-dataset analysis. That is really expected to be the norm.

@calbach, @dglazer Dataset still causes lots of confusion, with people wanting it to be something that it's not. Could you please work on more detailed documentation?

I am +1 for putting all metadata in a dataset and leaving it in the query.

@pgrosu (Contributor) commented Jun 21, 2016

You might remember my diagram from a couple of years ago - you don't really need a dataset id, as all the generated components are tied to one another, and any (manually/auto-generated) additional meta-attributes would be automatically bound and propagated from the previous data component that generated them, all stemming from an initial collection of samples associated with a project/study:

#74 (comment)

Each specific Project/Study - which encompasses the Experiment and Analysis performed - would thus differentiate similar collections of Samples that were processed in different ways.

@diekhans (Contributor):

The current approach of the GA4GH API uses datasets as the container. This could be changed, but we would do that as part of a full API design, not piecemeal.


@pgrosu (Contributor) commented Jun 22, 2016

So part of the reason why I wrote up the comprehensive analysis of the GA4GH ecosystem (which was the final push to switch us to Protocol Buffers) was to take a step back and consider all the moving parts and types of data models for GA4GH, including their associated methods and queries. When I created the above-referenced diagram, it was to get us to consider another approach to how data moves along a wire from multiple global repositories, in order to help us put together the schemas and methods that would allow for the complex queries @lh3 and @jeromekelleher are looking for. Yes, a full API would have additional <Key,Value> model-based components with several dynamically updated inverted indices to allow examples like the below-referenced comments, but these are based on having the minimally required definitions of a GA4GH API wire protocol that would naturally lend itself to auto-updating all the dependent components, based on the volume of data flowing over the wire in the coming years:

#248 (comment)

#142 (comment)

#154 (comment)

I also suspect that digests of such data structures would surely be distributed, where load-balanced analyses would be performed across multiple ever-increasing repositories in order to also pick specific versions of the sets of interest before having the processed results reported back to the user - which is another layer of complexity. In any case, I'm a fairly patient person :)

Hope it helps,
Paul

@kozbo (Contributor) commented Jun 22, 2016

Poke. Can we wrap up this discussion with another +1 from someone? @jeromekelleher, have your concerns been met with both the spun-off issues and some minor adjustments?

@jeromekelleher (Contributor):

I'm still not sure we've properly defined the semantics of the dataset_id arguments here @kozbo, which is why I suggested deleting them for now. However, I guess we can think of these objects as just inheriting their dataset from the VariantSet or ReadGroupSet or whatever, so the meaning is clear enough.

+1

@macieksmuga (Contributor):

+1 exactly as is now (i.e., with dataset_id optionally in requests, but not as part of Individual or BioSample objects). We will shortly be proposing a separate PR that aims to clarify the role of Dataset across the entire API.

@diekhans (Contributor):

I don't see how we can have it in the request if it's not part of the object. There is nothing to query against.


@macieksmuga (Contributor):

@diekhans Having the dataset_id in the search but not in the object models an optional many-to-many relationship between Individual and Dataset. It is exactly the same formalism as currently found in the API between Reference and ReferenceSet. I agree that adding documentation to explicitly state this is desirable, but we were hoping to introduce said documentation as part of a larger Dataset overhaul.
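For concreteness, the shape under debate, as a hedged sketch (endpoint path and JSON casing assumed as in the earlier sketches): the search request carries an optional datasetId even though the returned BioSample record itself would not.

import requests

BASE = "http://ga4gh"  # placeholder root as in the earlier sketches

# Optional datasetId narrows the search; omitting it searches across datasets.
biosamples = requests.post(
    BASE + "/biosamples/search",
    json={"datasetId": "ds-1", "name": "sample-1"},
).json()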

@diekhans (Contributor):

I don't understand. The Reference and ReferenceSet searches don't take dataset_id. There is no concept of an optional dataset in the schemas.

Dataset was removed from the object based on confusion. Can we just put it back in?


@pgrosu (Contributor) commented Jun 23, 2016

Mark, it's encapsulated by the intersection of shared information, which connects the two together via source_accessions, leading to natural many-to-many translation tables. It would basically allow for the tables I wrote up here:

#125 (comment)

Then the more frequent many-to-many connections can be cached for faster access.

~p

Add dataset_id into messages
@macieksmuga merged commit 7e58683 into master on Jun 28, 2016
@macieksmuga deleted the biodataproto branch on June 28, 2016 18:52