Dataset metadata spec #164
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# Dataset Spec for STAC | ||
|
||
## Introduction | ||
|
||
One topic of interest has been searching across datasets*, rather than within a dataset, i.e. in (sub-)catalogs, items and assets. [STAC](https://github.com/radiantearth/stac-spec) is focused on search within a dataset, but it includes some simple constructs to catalog datasets. This could be an independent spec that STAC uses, and that others can also use independently, to describe datasets in a lightweight way. | ||
|
||
*\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).* | ||
|
||
## Core | ||
|
||
| Element | Type | Name | Description | | ||
| --------------- | ------------------------------------- | ------------------------------- | ------------------------------------------------------------ | | ||
| id | string | Dataset ID (required) | Identifier for the dataset that is unique across the provider. MUST follow the pattern ` ^[A-Za-z0-9_\-\/]+$ `. TODO: Allow slash? | | ||
| title | string | Title | A short descriptive one-line title for the dataset. | | ||
| description | string | Description (required) | Detailed multi-line description to fully explain the entity. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. | | ||
| keywords | [string] | Keywords | List of keywords describing the dataset. | | ||
| version | string | Dataset Version | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. | | ||
| license | string | Dataset License Name (required) | Dataset's license(s) as an [SPDX License identifier or expression](https://spdx.org/licenses/) or `proprietary` if the license is not on the SPDX license list. See `license_url` for more information. | ||
| license_url | string | Dataset License URL | Dataset's license URL SHOULD be specified if `license` is set to `proprietary`. | | ||
| provider | [Provider Object] | Data Provider | The organizations that created the content of the dataset. | | ||
I suggest a single provider here. I added support for multiple providers in the EE catalog, and never found any use for it other than expressing the processing chain, which we want to do more systematically elsewhere.

I am hesitant to rely only on the process chain extension (or whatever we call it). It is still a long way to having a standard that properly defines this processing information, apart from maybe a provider and a dataset URL. Still, we should have something small in the core, I think. It should be easy for users to get at least some information about the history. I think the processing chain information would be much, much harder to express, and then it will just be left out. We also discussed the field

And for me it's also about giving proper credit, and having them all makes clear what to put here. A single provider again leaves it open whether it's the RAW data provider or the last one processing it. Maybe we could also just have something like "history" in the core, which has a list of provider name + provider homepage + dataset url (derived_from). As we don't need the dataset url for the last provider (it's that catalog), the last provider would be the provider in the dataset. A process chain extension just extends the History Object and the Dataset Object, so that a process_chain can simply be added to each history element, too. Example follows...

We could also put the last provider directly into the history; dataset_url would point to self or be omitted. Then we would have no direct provider at the top level, but that would be okay for me.
|
||
| host | Host Object | Storage Provider | The organization that hosts the dataset. | | ||
| spatial_extent | [GeoJSON Object](http://geojson.org/) | Spatial extent (required) | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. | | ||
@m-mohr Is this to be interpreted as 'possible extent' or as 'current extent'? My concern here is that some missions have the capacity to image most of the Earth but do not make systematic acquisitions - CBERS for instance.

I would prefer possible extent to keep search results (over datasets, not over items in a dataset) more consistent.

For openEO we had it defined as current extent and so I had that in mind, but I am open to both. Whoever has the best arguments wins. ;-)

I think I prefer possible extent. That's a lot easier to implement for static providers. And someone who wants it to be 'current extent' can do that if they want - the current extent is certainly within the possible extent. I don't want catalogs to feel like they have to always be updating this field.

I lean towards 'possible extent'. An implementor could choose to make theirs 'current extent', since the current should be a subset of the possible. But I think it's better to not require static catalogs to keep updating their extent every single time there's new data outside their current.

"possible" vs "current" also applies to the temporal_extent, as open date ranges would always be "possible". I would like to have them, and for consistency and the reasons mentioned above, I now slightly prefer "possible", too. But with open date ranges we are not compatible with WFS, see also opengeospatial/ogcapi-features#155.

Added "potential" to both extents.
||
| temporal_extent | string | Temporal extent (required) | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. ToDo: Support open date ranges | | ||
It could be really good to try to get compatibility with WFS on the extent fields. They use:
If we want we can try to influence them to adopt our convention, but we should have good reasoning.

I just used 'example 4' from the WFS spec. It notes: 'Coordinate reference system information is not provided as the service provides geometries only in the default system (WGS84 longitude/latitude)'. So it seems like we could just say that for datasets we require the default.

Okay, then that's fine. Curious what the WFS crew is coming up with for the other issues mentioned. Will change that in the dataset spec. Temporal extent is still a string for now, and we need to find out how they define 3D bboxes (= incl. the z-axis).

Should be mostly WFS compatible now.
||
| links | [Link Object] | Links (required) | A list of references to other documents, see Link Object for further documentation. TODO: Remove if catalog is revised and links are specified on the catalog level. | | ||
|
||
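The ID pattern and the license rule from the Core table can be checked with a few lines of Python. This is only a sketch: the helper name `check_dataset_core` and the sample values are assumptions made for illustration and are not part of the spec.

```python
import re

# Pattern copied from the `id` row of the Core table.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-\/]+$")


def check_dataset_core(dataset):
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []
    # fullmatch rejects a trailing newline that `$` alone would tolerate.
    if not ID_PATTERN.fullmatch(dataset.get("id", "")):
        problems.append("id does not match the required pattern")
    # license_url is only a SHOULD, so it is reported as a warning.
    if dataset.get("license") == "proprietary" and "license_url" not in dataset:
        problems.append("warning: license is proprietary but license_url is missing")
    return problems


# Made-up sample values, for illustration only.
sample = {"id": "example/dataset_1", "license": "proprietary"}
print(check_dataset_core(sample))
```

Reporting the missing `license_url` as a warning rather than an error mirrors the SHOULD wording in the table.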
### Provider Object | ||
|
||
| Element | Type | Name | Description | | ||
| ------- | ------ | --------------------- | ----------------------------------------------- | | ||
| name | string | Organization name | The name of the organization or the individual. | | ||
| url | string | Organization homepage | Homepage of the provider. | | ||
|
||
### Host Object | ||
|
||
| Element | Type | Name | Description | | ||
| -------------- | ------- | --------------------- | ------------------------------------------------------------ | | ||
| description | string | Description | Detailed description to explain the hosting details. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. | | ||
| scheme | string | Scheme (required) | Values: S3, GCS, URL, OTHER | | ||
| id | string | Identifier (required) | Host-specific identifier such as a URL or asset id. | ||
| region | string | Region | Provider specific region where the data is stored. | | ||
Is region primarily an AWS thing or is it general to all cloud providers?

The schema was written before we knew about the idea of storage profiles in #148. I would really like to have profiles instead of having the storage details directly baked into the dataset spec. `region` would probably be an AWS-specific thing, which should be in a separate profile as proposed in #148. If others have regions, too, then they should have it separately in their profiles as well.

GCS has regions too

Ok, that makes sense though - each cloud provider should define its own storage profile (extension?)

I'd go that route, yes. Is that dataset specific or could that be also catalog or item specific?
||
| requester_pays | boolean | Requester pays | `true` if requester pays, `false` if host pays. Defaults to `false`. | | ||
|
||
**Note:** The idea of storage profiles is currently being [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, `scheme`, `id` and `region` may be removed from the final spec. | ||
|
||
### Link Object | ||
|
||
| Element | Type | Name | Description | | ||
| ------- | ------ | ------------------- | ------------------------------------------------------------ | | ||
| href | string | Link (required) | The actual link in the format of a URL. Relative and absolute links are both allowed. | ||
| rel | string | Relation (required) | Relationship between the current document and the linked document. | | ||
| type | string | MIME-type | MIME-type of the referenced entity. | | ||
| title | string | Title | Human-readable title for the link. | | ||
|
||
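Because `href` may be relative or absolute, a client typically resolves each link against the URL of the document that contains it. A minimal sketch using Python's standard library follows; the function name `resolve_links`, the example URLs and the `rel` values are assumptions for illustration only.

```python
from urllib.parse import urljoin


def resolve_links(document_url, links):
    """Return copies of the link objects with absolute `href` values."""
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return [dict(link, href=urljoin(document_url, link["href"])) for link in links]


# Made-up Link Objects: one relative href, one absolute href.
links = [
    {"href": "items/item-1.json", "rel": "item"},
    {"href": "https://example.com/license.html", "rel": "license"},
]
for link in resolve_links("https://example.com/catalog/dataset.json", links):
    print(link["rel"], link["href"])
```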
## Extensions | ||
|
||
Related extensions to be used with the dataset spec: | ||
|
||
* [EO extension](../extensions/stac-eo-spec.md) | ||
  Please note that some fields such as `eo:sun_elevation` or `eo:sun_azimuth` are only meaningful on the item level and MUST NOT be used in datasets. | ||
* [Dimensions extension](../extensions/dimension) (currently in review, see [PR #164](https://github.com/radiantearth/stac-spec/pull/164)) | ||
* [Scientific extension](../extensions/scientific) (currently in review, see [PR #186](https://github.com/radiantearth/stac-spec/pull/186)) | ||
* Provenance extension (planned, see [issue #179](https://github.com/radiantearth/stac-spec/issues/179)) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
{ | ||
"$schema": "http://json-schema.org/draft-06/schema#", | ||
"$id": "dataset.json#", | ||
"title": "Dataset Item", | ||
"description": "This object represents the dataset in a SpatioTemporal Asset Catalog.", | ||
"type": "object", | ||
"required": [ | ||
"id", | ||
"description", | ||
"license", | ||
"spatial_extent", | ||
"temporal_extent", | ||
"links" | ||
], | ||
"properties": { | ||
"id": { | ||
"title": "Dataset ID", | ||
"type": "string", | ||
"pattern": "^[A-Za-z0-9_\\-\/]+$" | ||
}, | ||
"title": { | ||
"title": "Title", | ||
"type": "string" | ||
}, | ||
"description": { | ||
"title": "Description", | ||
"type": "string" | ||
}, | ||
"keywords": { | ||
"title": "Keywords", | ||
"type": "array", | ||
"items": { | ||
"type": "string" | ||
} | ||
}, | ||
"license": { | ||
"title": "License Name", | ||
"type": "string" | ||
}, | ||
"license_url": { | ||
"title": "License URL", | ||
"type": "string", | ||
"format": "uri" | ||
}, | ||
"provider": { | ||
"type": "array", | ||
"items": { | ||
"properties": { | ||
"name": { | ||
"title": "Organization Name", | ||
"type": "string" | ||
}, | ||
"url": { | ||
"title": "Organization homepage", | ||
"type": "string", | ||
"format": "uri" | ||
} | ||
} | ||
} | ||
}, | ||
"host": { | ||
"required": [ | ||
"id", | ||
"scheme" | ||
], | ||
"properties": { | ||
"id": { | ||
"title": "Identifier", | ||
"type": "string" | ||
}, | ||
"scheme": { | ||
"title": "Scheme", | ||
"type": "string", | ||
"enum": [ | ||
"S3", | ||
"GCS", | ||
"URL", | ||
"OTHER" | ||
] | ||
}, | ||
"description": { | ||
"title": "Description", | ||
"type": "string" | ||
}, | ||
"region": { | ||
"title": "Region", | ||
"type": "string" | ||
}, | ||
"requester_pays": { | ||
"title": "Requester Pays", | ||
"type": "boolean", | ||
"default": false | ||
} | ||
} | ||
}, | ||
"version": { | ||
"title": "Version", | ||
"type": "string" | ||
}, | ||
"temporal_extent": { | ||
"title": "Temporal extent", | ||
"type": "string" | ||
}, | ||
"spatial_extent": { | ||
"type": "object" | ||
}, | ||
"links": { | ||
"type": "array", | ||
"items": { | ||
"type": "object", | ||
"required": [ | ||
"href", | ||
"rel" | ||
], | ||
"properties": { | ||
"href": { | ||
"title": "Link", | ||
"type": "string" | ||
}, | ||
"rel": { | ||
"title": "Relation", | ||
"type": "string" | ||
}, | ||
"type": { | ||
"title": "type", | ||
"type": "string" | ||
}, | ||
"title": { | ||
"title": "Title", | ||
"type": "string" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} |
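To illustrate what the `required` and `enum` keywords in this schema demand, the following standard-library sketch checks the same constraints by hand. It is not a replacement for a real JSON Schema validator; the function names and the sample dataset are hypothetical.

```python
# Required keys and allowed values, copied from the schema above.
REQUIRED_DATASET = ["id", "description", "license",
                    "spatial_extent", "temporal_extent", "links"]
REQUIRED_HOST = ["id", "scheme"]
REQUIRED_LINK = ["href", "rel"]
HOST_SCHEMES = ["S3", "GCS", "URL", "OTHER"]


def missing(obj, required, prefix=""):
    """List the required keys that are absent from obj."""
    return [prefix + key for key in required if key not in obj]


def check_required(dataset):
    """Collect missing keys and invalid host schemes (hypothetical helper)."""
    errors = missing(dataset, REQUIRED_DATASET)
    host = dataset.get("host")
    if host is not None:
        errors += missing(host, REQUIRED_HOST, "host.")
        if "scheme" in host and host["scheme"] not in HOST_SCHEMES:
            errors.append("host.scheme is not one of " + ", ".join(HOST_SCHEMES))
    for i, link in enumerate(dataset.get("links", [])):
        errors += missing(link, REQUIRED_LINK, f"links[{i}].")
    return errors


# Made-up dataset with three deliberate problems.
dataset = {
    "id": "example",
    "description": "Made-up example",
    "license": "MIT",
    "spatial_extent": {},
    "temporal_extent": "2018-01-01T00:00:00Z/2018-01-31T23:59:59Z",
    "links": [{"href": "dataset.json"}],
    "host": {"scheme": "FTP"},
}
print(check_required(dataset))
```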
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# STAC Dimensions Extension Spec | ||
|
||
This document explains the fields of the STAC Dimensions Extension (dim) to a STAC `Dataset`. Data can have different dimensions (= axes), e.g. in meteorology. The properties of these dimensions can be defined with this extension. | ||
|
||
## Dimensions Extension Description | ||
|
||
This is the field that extends the `Dataset` object: | ||
|
||
| Element | Type | Name | Description | | ||
| ---------------- | -------------------- | ------------------------- | ------------------------------------------------------------ | | ||
| dim:dimensions | [Dimension Object] | Dimensions | Dimensions of the data. If the dimensions have an order, the order SHOULD be reflected in the order of the array. | | ||
|
||
### Dimension Object | ||
|
||
| Element | Type | Name | Description | | ||
| ------- | ---------------- | ------------------- | ------------------------------------------------------------ | | ||
| label | string | Label (required) | Human-readable label for the dimension. | | ||
| unit | string | Unit of Measurement | Unit of measurement, preferably SI. ToDo: Any standard to express this, e.g. [UDUNITS](https://www.unidata.ucar.edu/software/udunits/) or this [dict](https://www.unc.edu/~rowlett/units/)? | | ||
| extent | [number\|string] | Data Extent | Specifies the extent of the data, i.e. the lower bound as the first element and the upper bound as the second element of the array. | |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
{ | ||
"dim:dimensions": [ | ||
{ | ||
"label": "Longitude", | ||
"unit": "°", | ||
"extent": [-180, 180] | ||
}, | ||
{ | ||
"label": "Latitude", | ||
"unit": "°", | ||
"extent": [-90, 90] | ||
}, | ||
{ | ||
"label": "Temperature", | ||
"unit": "°C", | ||
"extent": [-20, 60] | ||
}, | ||
{ | ||
"label": "Date", | ||
"extent": ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"] | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
{ | ||
"$schema": "http://json-schema.org/draft-07/schema#", | ||
"type": "object", | ||
"title": "STAC Dimensions Extension Spec", | ||
"properties": { | ||
"dim:dimensions": { | ||
"type": "array", | ||
"title": "Dimensions", | ||
"items": { | ||
"type": "object", | ||
"required": [ | ||
"label" | ||
], | ||
"properties": { | ||
"label": { | ||
"type": "string", | ||
"title": "Label" | ||
}, | ||
"unit": { | ||
"type": "string", | ||
"title": "Unit of Measurement" | ||
}, | ||
"extent": { | ||
"type": "array", | ||
"title": "Data Extent", | ||
"minItems": 2, | ||
"maxItems": 2, | ||
"items": { | ||
"type": ["number", "string"] | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} |
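The constraints this schema places on a Dimension Object can be sketched in plain Python as below. The function names are hypothetical, and the ordering check on numeric bounds is an assumption drawn from the "lower bound first" wording of the spec table; the schema itself does not enforce it.

```python
def is_json_number(value):
    # JSON Schema "number" excludes booleans, although bool is an int in Python.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_dimension(dim):
    """Return a list of problems found in one Dimension Object."""
    errors = []
    if not isinstance(dim.get("label"), str):
        errors.append("label is required and must be a string")
    if "unit" in dim and not isinstance(dim["unit"], str):
        errors.append("unit must be a string")
    extent = dim.get("extent")
    if extent is not None:
        if not isinstance(extent, list) or len(extent) != 2:
            errors.append("extent must be an array of exactly two elements")
        elif not all(is_json_number(v) or isinstance(v, str) for v in extent):
            errors.append("extent elements must be numbers or strings")
        elif all(is_json_number(v) for v in extent) and extent[0] > extent[1]:
            # Assumption: numeric bounds are ordered; not enforced by the schema.
            errors.append("extent lower bound must not exceed upper bound")
    return errors


# Values taken from the example file in this PR.
print(check_dimension({"label": "Temperature", "unit": "°C", "extent": [-20, 60]}))
```

String extents such as timestamps are accepted without an ordering check, since comparing them reliably would require knowing their format.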
In the EO spec we have gsd (Ground Sample Distance) at the top level, and it's also provided per band because resolution may vary by band. At the top level it represents the best resolution to enable searching. I think it makes sense at the Dataset level rather than the Item level.

@matthewhanson What do you think? Would it make sense to extend and share the EO extension across datasets and items or to have it separated? Or should there be one extension, which has sections on items and datasets? I think I'd prefer to share the same extension... some definitions probably make sense for items and datasets equally.

I'm not sure I entirely get what you are saying.
I think datasets are a concept which will be used in general with STAC - although I'm still not sure if the intention is that datasets are always present and if they are themselves part of core. I personally think they should be part of core, and include core fields: temporal and spatial extent are the unions of core fields, license, provider...
So the EO extension, or any extension, should define additions to both the Dataset and the Item.

I hope datasets are part of the core and always present, but not sure what others think.
I agree with what you are saying about extensions and that basically answers my question. So I would expect that the additional EO fields we are proposing for datasets will be incorporated into the EO extension. If you are okay with that, I'd already move them to the EO extension in the branch we are currently working on. They are currently a bit badly located in the dataset-spec.

Yes, I think those additional fields belong in the EO extension...although I think that I'd still like to see some of the non-varying asset info in Datasets, but I'll make the case and provide some examples in the EO extension.

Not sure whether we are speaking about the same thing. I am speaking about an EO extension that - whenever meaningful - is shared between Dataset and Item and can be used in both locations! Are you just talking about an EO extension limited to items? Otherwise I don't get the point you make with "I'd still like to see some of the non-varying asset info in Datasets".

I think we are talking about the same thing, EO extension is shared between Dataset and Item.
What I'm saying is that some of the Asset information, such as the list of possible assets and what their types are, can be added at the Dataset level. I think I explained it a bit better elsewhere. Maybe need to make a new issue for it, this PR is getting a bit hard to follow.

Great, then I'd like to have your support here (regarding global extensions): #186 ;-)
Sure, I think more issues would make sense! I think we are also discussing that in #174. I'd like a proposal on this. I don't think it's simply copy and paste (minus url) from the assets spec in the items?