```proto
// The time at which this message was created.
// Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
string message_create_time = 4;
```
Why `message_create_time` here? The word "message" seems ambiguous, i.e., does this refer to the individual message objects as they are being sent to the client, or to the 'record' that is stored on the server? I'm guessing that this is because of a clash with `run_time`? I would have thought it would be better to just use `created` and `updated`, for the sake of being consistent with all the other objects (and change them all in one go to something more specific like `creation_timestamp`, `update_timestamp`, if/when this is done).
I realise this happened before this specific set of changes, so please feel free to point me in the direction of the relevant discussion. This isn't a criticism of the present PR, I just thought I'd bring it up while we're reorganising this metadata stuff.
It's in the master metadata.proto
https://github.com/ga4gh/schemas/blob/5ccf897fc1ee9c0401db0b2d117e80d87388bec9/src/main/proto/ga4gh/metadata.proto#L32 . Stubbed out this issue here: #611
The time fields in the API are weakly defined. The other objects use Unix time and appear to be about when data is loaded into the database. This is relevant to environments like the Google cloud, but less important when serving up an existing dataset.
Jerome Kelleher notifications@github.com writes:

In src/main/proto/ga4gh/assay_metadata.proto:

```diff
+// An experimental preparation of a sample.
+message Experiment {
+  // The experiment UUID. This is globally unique.
+  string id = 1;
+  // The name of the experiment.
+  string name = 2;
+  // A description of the experiment.
+  string description = 3;
+  // The time at which this message was created.
+  // Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z)
+  string message_create_time = 4;
```
I thought the consensus was to do ISO, and to settle on a standard naming for the object housekeeping time attributes? I'm trying to find my lengthy exchange with @diekhans ...
OK, I'm sorry I brought this up now! It's what's in master currently, so not strictly related to this PR. I've created an issue to track it: #637.
The time format has not been changed yet in the other records.
For now, let's make the names consistent with the other records (`created` and `updated`), but leave the time as ISO. We can then take on a global conversion to ISO time separately.
`message_create_time` is a confusing name anyway. It should be abstracted away from protobuf.
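As a sketch of that suggestion (hypothetical, not what the PR currently defines), the `Experiment` message might carry consistently named ISO 8601 timestamps like this; the field names `created` and `updated` are assumptions:

```proto
// Hypothetical sketch only: renames message_create_time per the
// suggestion above; not part of this PR as written.
message Experiment {
  // The server-assigned id of the experiment.
  string id = 1;
  // The name of the experiment.
  string name = 2;
  // A description of the experiment.
  string description = 3;
  // The time this record was created.
  // Format: ISO 8601 (e.g. 2015-02-10T00:03:42.123Z).
  string created = 4;
  // The time this record was last updated.
  // Format: ISO 8601 (e.g. 2015-02-10T00:03:42.123Z).
  string updated = 5;
}
```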
Full agreement with @diekhans here.
id should be a server-defined id, not a UUID.
Re: UUID, fixed here and in master.
This looks hugely useful, and fixes a lot of the shortcomings we're experiencing in the protocol when working with the WGS500 dataset. I have a couple of minor points above, but overall I think this is a massive improvement and we should try to merge it as soon as possible.
Remove UUID comment
@david4096 pointed out that many of the inconsistencies being noted are already in master. Given this, we should go ahead with this PR and fix the inconsistencies in separate PRs. +1
I'm happy to +1 this if we remove the
Agreed to move forward and fix the dataset inconsistency in another PR.
Mention that dataset_id is optional
@jeromekelleher I added a line stating the dataset_id is optional. Looking forward to moving ahead with this! Please see here and here for previous discussion.
@jeromekelleher the problem here is that dataset was removed from the metadata objects. All objects should be in datasets (IMHO, it's wrong that references are special-cased, and there is a discussion about fixing it). A dataset is an ownership bag. This ownership appears to have the side effect that it also scopes name, making it the owner's responsibility to ensure name uniqueness. It is specifically not a barrier to doing cross-dataset analysis; that is really expected to be the norm. @calbach, @dglazer: Dataset still causes lots of confusion, with people wanting it to be something that it's not. Could you please work on more detailed documentation? I am +1 for putting all metadata in a dataset and leaving it in the query.
You might remember my diagram from a couple of years ago - you don't really need a dataset id, as all the generated components are tied to one another, and any (manually/auto-generated) additional meta-attributes would be automatically bound and propagated from the previous data component that generated them, all stemming from an initial collection of samples associated with a project/study. Each specific Project/Study - which encompasses the Experiment and Analysis performed - would thus differentiate similar collections of Samples that were processed in different ways.
The current approach of the GA4GH API uses datasets as the container. This could be changed; we would do this as part of a full API design.
So part of the reason why I wrote up the comprehensive analysis of the GA4GH ecosystem - which was the final push to switch us to Protocol Buffers - was to take a step back and consider all the moving parts and types of data models for GA4GH, including their associated methods and queries.

When I created the above-referenced diagram, it was to get us to consider another approach to how data moves along a wire from multiple global repositories, in order to help us put together the schemas and methods that would allow for the complex queries @lh3 and @jeromekelleher are looking for. Yes, a full API would have additional <Key,Value> model-based components with several dynamically updated inverted indices to allow examples like the below-referenced comments, but they are based on having the minimally required definitions of a GA4GH API wire protocol that would naturally lend itself to auto-updating all the dependent components, based on the volume of data flowing over the wire in the coming years.

I also suspect that digests of such data structures would surely be distributed, where load-balanced analyses would be performed across multiple ever-increasing repositories in order to pick specific versions of the sets of interest before the processed results are reported back to the user - which is another layer of complexity. In any case, I'm a fairly patient person :)

Hope it helps,
poke. Can we wrap up this discussion with another +1 from someone? @jeromekelleher have your concerns been met, between the spawned-off issues and some minor adjustments?
I'm still not sure we've properly defined the semantics of the +1
+1 exactly as is now (i.e. with
I don't see how we can have it in the request if it's not part of the object.
@diekhans Having the
I don't understand the Reference and ReferenceSet searches. There is no concept of an optional dataset in the schemas. Dataset was removed from the object based on confusion.
Mark, it's encapsulated by the intersection of shared information, which connects the two together. Then the more frequent many-to-many connections can be cached for faster access. ~p
Add dataset_id into messages
The GA4GH API currently lacks a unified way for describing samples across data types. This means that it is difficult to reconcile whether a variant call or read group is from the same sample. This PR adds the ability to filter by samples to these types.
The BioMetadata protocol allows read groups and call sets to be related to `BioSample`s. This is done by adding `bioSampleId` fields to both of those records. A `bioSampleId` field is then an optional field in both `SearchReadGroupSetsRequest` and `SearchCallSetsRequest`. In addition to being able to filter callsets and readgroups, search and get endpoints for biosamples and individuals have been added. Individuals can be searched by name, and biosamples can be filtered by `individualId` and name.
Given some read find which individual it derives from:
Given some individual find reads:
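The two traversals above can be sketched as chained lookups. This is a hypothetical illustration, not code from the PR: the in-memory records below stand in for responses from the biosamples, individuals, and read-group search endpoints, and only the field names `bioSampleId` and `individualId` come from the schema change.

```python
# Stand-ins for server responses; ids and names are made up.
read_groups = [{"id": "rg1", "bioSampleId": "bs1"}]
bio_samples = [{"id": "bs1", "individualId": "ind1"}]
individuals = [{"id": "ind1", "name": "NA12878"}]

def individual_for_read_group(read_group):
    # Read group -> BioSample (via bioSampleId) -> Individual (via individualId).
    sample = next(b for b in bio_samples if b["id"] == read_group["bioSampleId"])
    return next(i for i in individuals if i["id"] == sample["individualId"])

def read_groups_for_individual(individual_id):
    # Individual -> its BioSamples -> read groups filtered by bioSampleId,
    # i.e. the kind of filter the optional bioSampleId request field enables.
    sample_ids = {b["id"] for b in bio_samples
                  if b["individualId"] == individual_id}
    return [rg for rg in read_groups if rg["bioSampleId"] in sample_ids]

print(individual_for_read_group(read_groups[0])["name"])  # NA12878
```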
To clarify the metadata roles, some reusable elements (`Experiment` and `Analysis`) have been moved to `AssayMetadata`. Additional documentation is included in this PR to support proper usage of Ontology Terms. Thanks to everyone for all the effort put into this @mbaudis @diekhans @jeromekelleher @bwalsh @saupchurch @sarahhunt @dzerbino