Dataset metadata spec #164

Merged 19 commits on Aug 30, 2018
Changes from 8 commits
62 changes: 62 additions & 0 deletions dataset-spec/README.md
@@ -0,0 +1,62 @@
# Dataset Spec for STAC

## Introduction

One topic of interest has been searching across datasets*, rather than within a dataset, i.e. in (sub-)catalogs, items and assets. [STAC](https://github.com/radiantearth/stac-spec) is focused on search within a dataset, but it includes some simple constructs to catalog datasets. This could be an independent spec that STAC uses, and that others can also use independently, to describe datasets in a lightweight way.

*\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).*

## Core

Collaborator

In the EO spec we have gsd (Ground Sample Distance) at the top level, and it's also provided per band because resolution may vary by band. At the top level it represents the best resolution to enable searching. I think it makes sense at the Dataset level rather than the Item level.

Collaborator

@m-mohr m-mohr Aug 21, 2018

@matthewhanson What do you think? Would it make sense to extend and share the EO extension across datasets and items or to have it separated? Or should there be one extension, which has sections on items and datasets? I think I'd prefer to share the same extension... some definitions probably make sense for items and datasets equally.

Collaborator

I'm not sure I entirely get what you are saying.
I think datasets are a concept which will be used in general with STAC, although I'm still not sure if the intention is that datasets are always present and if they are themselves part of core. I personally think they should be part of core, and that they include core fields: temporal and spatial extent (the unions of the core fields), license, provider...

So the EO extension, or any extension, should define additions to both the Dataset and the Item.

Collaborator

@m-mohr m-mohr Aug 21, 2018

I hope datasets are part of the core and always present, but not sure what others think.

I agree with what you are saying about extensions, and that basically answers my question. So I would expect that the additional EO fields we are proposing for datasets will be incorporated into the EO extension. If you are okay with that, I'd already move them to the EO extension in the branch we are currently working on. They are currently somewhat misplaced in the dataset-spec.

Collaborator

Yes, I think those additional fields belong in the EO extension...although I think that I'd still like to see some of the non-varying asset info in Datasets, but I'll make the case and provide some examples in the EO extension.

Collaborator

@m-mohr m-mohr Aug 22, 2018

Not sure whether we are speaking about the same thing. I am speaking about an EO extension that, whenever meaningful, is shared between Dataset and Item and can be used in both locations! Are you just talking about an EO extension limited to items? Otherwise I don't get the point you make with "I'd still like to see some of the non-varying asset info in Datasets".

Collaborator

I think we are talking about the same thing: the EO extension is shared between Dataset and Item.

What I'm saying is that some of the Asset information, such as the list of possible assets and what their types are, can be added at the Dataset level. I think I explained it a bit better elsewhere. Maybe we need to make a new issue for it; this PR is getting a bit hard to follow.

Collaborator

@m-mohr m-mohr Aug 22, 2018

Great, then I'd like to have your support here (regarding global extensions): #186 ;-)

Sure, I think more issues would make sense! I think we are also discussing that in #174. I'd like a proposal on this. I don't think it's simply copy and paste (minus url) from the assets spec in the items?

| Element | Type | Name | Description |
| --------------- | ------------------------------------- | ------------------------------- | ------------------------------------------------------------ |
| id | string | Dataset ID (required) | Identifier for the dataset that is unique across the provider. MUST follow the pattern ` ^[A-Za-z0-9_\-\/]+$ `. TODO: Allow slash? |
| title | string | Title | A short descriptive one-line title for the dataset. |
| description | string | Description (required) | Detailed multi-line description to fully explain the entity. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| keywords | [string] | Keywords | List of keywords describing the dataset. |
| version | string | Dataset Version | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
| license | string | Dataset License Name (required) | Dataset's license(s) as a [SPDX License identifier or expression](https://spdx.org/licenses/) or `proprietary` if the license is not on the SPDX license list. See `license_url` for more information. |
| license_url | string | Dataset License URL | Dataset's license URL SHOULD be specified if `license` is set to `proprietary`. |
| provider | [Provider Object] | Data Provider | The organizations that created the content of the dataset. |

I suggest a single provider here. I added support for multiple providers in the EE catalog, and never found any use for it other than expressing the processing chain, which we want to do more systematically elsewhere.

Collaborator

@m-mohr m-mohr Aug 22, 2018

I am hesitant to rely only on the process chain extension (or whatever we call it). It is still a long way until we have a standard that properly defines this processing information, apart from maybe a provider and a dataset url. Still, we should have something small in the core, I think. It should be easy for users to get at least some information about the history. I think the processing chain information would be much, much harder to express, and then it will just be left out. We also discussed the field derived_from. Can we combine that somehow?

For me it's also about giving proper credit, and listing them all makes clear what to put here. A single provider again leaves it open whether it's the RAW data provider or the last one processing it.

Maybe we could also just have something like "history" in the core, which has a list of provider name + provider homepage + dataset url (derived_from). As we don't need the dataset url for the last provider (it's that catalog) the last provider would be the provider in the dataset. A process chain extension just extends the History Object and the Dataset Object, so that a process_chain can simply be added to each history element, too.

Example follows...

Collaborator

@m-mohr m-mohr Aug 22, 2018

{
  "id":"sentinel2-processed",
  "description":"Sentinel 2 NDVI max composite processed by Final Processor and Another Processing Comp, data originates from ESA.",
  "spatial_extent":{

  },
  "temporal_extent":"2015/2018",
  "license":"Apache-2.0",
  "provider":{
    "name":"Final processor, Inc.",
    "url":"http://www.final-corporation.com"
  },
  "pc:process_chain":{
    "process":"max_time_composite"
  },
  "history":[
    {
      "provider":{
        "organization":"Another Processing Comp, Inc.",
        "url":"http://processing.inc"
      },
      "dataset_url":"http://processing.inc/datasets/sentinel2-processed/catalog.json",
      "pc:process_chain":{
        "process":"ndvi"
      }
    },
    {
      "provider":{
        "organization":"ESA",
        "url":"http://esa.eu"
      },
      "dataset_url":"http://esa.eu/data/sentinel-2"
    }
  ]
}

We could also put the last provider directly into the history, dataset_url would be to self or omitted. Then we would have no direct provider in the top-level, but that would be okay for me.

{
  "id":"sentinel2-processed",
  "description":"Sentinel 2 NDVI max composite processed by Final Processor and Another Processing Comp, data originates from ESA.",
  "spatial_extent":{

  },
  "temporal_extent":"2015/2018",
  "license":"Apache-2.0",
  "history":[
    {
      "provider":{
        "name":"Final processor, Inc.",
        "url":"http://www.final-corporation.com"
      },
      "pc:process_chain":{
        "process":"max_time_composite"
      }
    },
    {
      "provider":{
        "organization":"Another Processing Comp, Inc.",
        "url":"http://processing.inc"
      },
      "dataset_url":"http://processing.inc/datasets/sentinel2-processed/catalog.json",
      "pc:process_chain":{
        "process":"ndvi"
      }
    },
    {
      "provider":{
        "organization":"ESA",
        "url":"http://esa.eu"
      },
      "dataset_url":"http://esa.eu/data/sentinel-2"
    }
  ]
}
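
As a rough illustration of the second variant above, the sketch below shows how a client might read the publishing provider and the upstream dataset URLs from `history`. It assumes `history` is ordered newest first, as in the example; the function names `current_provider` and `upstream_urls` are hypothetical and not part of any spec, and the sample data is shortened from the example above.

```python
from typing import Any, Dict, List, Optional


def current_provider(dataset: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the provider that publishes this dataset.

    Prefers a top-level "provider" (first variant); otherwise falls back to
    the first "history" entry, assuming history is ordered newest first.
    """
    if "provider" in dataset:
        return dataset["provider"]
    history: List[Dict[str, Any]] = dataset.get("history", [])
    return history[0]["provider"] if history else None


def upstream_urls(dataset: Dict[str, Any]) -> List[str]:
    """Collect the dataset_url of every history entry that has one."""
    return [h["dataset_url"] for h in dataset.get("history", []) if "dataset_url" in h]


# Shortened copy of the second example; keys are kept as written there.
example = {
    "id": "sentinel2-processed",
    "history": [
        {"provider": {"name": "Final processor, Inc.", "url": "http://www.final-corporation.com"}},
        {
            "provider": {"organization": "Another Processing Comp, Inc.", "url": "http://processing.inc"},
            "dataset_url": "http://processing.inc/datasets/sentinel2-processed/catalog.json",
        },
        {
            "provider": {"organization": "ESA", "url": "http://esa.eu"},
            "dataset_url": "http://esa.eu/data/sentinel-2",
        },
    ],
}

publisher = current_provider(example)
sources = upstream_urls(example)
```

With this ordering the last provider needs no `dataset_url`, so it is simply skipped when the upstream URLs are collected.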

| host | Host Object | Storage Provider | The organization that hosts the dataset. |
| spatial_extent | [GeoJSON Object](http://geojson.org/) | Spatial extent (required) | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
Contributor

@m-mohr Is this to be interpreted as 'possible extent' or as 'current extent'? My concern here is that some missions have the capacity to image most of the Earth but do not make systematic acquisitions, CBERS for instance.
The footprint for CBERS-4 MUX scenes may be obtained here, just select CBERS on the upper right panel. You'll see that most of northern Canada is not covered yet by the dataset, but a new scene may be acquired anytime in the future. In that case, should the spatial_extent be changed? If it is changed, could the dataset version stay the same?


I would prefer possible extent to keep search results (over datasets, not over items in a dataset) more consistent.

Collaborator

@m-mohr m-mohr Aug 22, 2018

For openEO we had it defined as current extent and so I had that in mind, but I am open to both. Whoever has the best arguments wins. ;-)

Contributor

I think I prefer possible extent. That's a lot easier to implement for static providers. And someone who wants it to be 'current extent' can do that if they want - the current extent is certainly within the possible extent.

I don't want catalogs to feel like they have to always be updating this field.

Contributor

I lean towards 'possible extent'. An implementor could choose to make theirs 'current extent', since the current should be a subset of the possible. But I think it's better to not require static catalogs to keep updating their extent every single time there's new data outside their current.

Collaborator

"possible" vs "current" also applies for the temporal_extent as open date ranges would always be "possible". I would like to have them and for consistency and the reasons mentioned above, I now slightly prefer "possible", too. But with open date ranges we are not compatible with WFS, see also opengeospatial/ogcapi-features#155.

Collaborator

Added "potential" to both extents.
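
The point made in this thread, that the current extent should always lie within the potential extent, can be sketched as a simple bounding-box check. This assumes extents given as `[west, south, east, north]` in WGS84 longitude/latitude and ignores boxes that cross the antimeridian; `bbox_contains` and the sample coordinates are hypothetical.

```python
from typing import Sequence


def bbox_contains(outer: Sequence[float], inner: Sequence[float]) -> bool:
    """True if `inner` lies fully within `outer`.

    Both boxes are [west, south, east, north]; antimeridian-crossing
    boxes are not handled in this sketch.
    """
    o_west, o_south, o_east, o_north = outer
    i_west, i_south, i_east, i_north = inner
    return (
        o_west <= i_west
        and o_south <= i_south
        and i_east <= o_east
        and i_north <= o_north
    )


# Hypothetical values: a mission that can image the whole globe,
# but has so far only acquired scenes over part of it.
potential_extent = [-180.0, -90.0, 180.0, 90.0]
current_extent = [-74.0, -34.0, -34.0, 5.0]

is_consistent = bbox_contains(potential_extent, current_extent)
```

A catalog that publishes the potential extent never has to update it when new scenes arrive, as long as this check keeps holding.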

| temporal_extent | string | Temporal extent (required) | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. ToDo: Support open date ranges |
Contributor

It could be really good to try to get compatibility with WFS on the extent fields. They use:

      "extent": {
        "spatial": [ 7.01, 50.63, 7.22, 50.78 ],
        "temporal": [ "2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z" ]
      },

If we want we can try to influence them to adopt our convention, but we should have good reasoning.
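
For comparison, here is a minimal sketch of how the fields currently in this PR (a GeoJSON `spatial_extent` and an ISO 8601 interval string as `temporal_extent`) might be mapped to the WFS-style `extent` shown above. The helper names are hypothetical; the sketch assumes a geometry with a `coordinates` member (so no GeometryCollection), a closed `start/end` interval, and 2D bounding boxes only.

```python
from typing import Any, Dict, Iterator, List, Tuple


def _positions(coords: Any) -> Iterator[Tuple[float, float]]:
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from _positions(part)


def to_wfs_extent(spatial_extent: Dict[str, Any], temporal_extent: str) -> Dict[str, List[Any]]:
    """Build a WFS-style extent from a GeoJSON geometry and a 'start/end' interval."""
    lons, lats = zip(*_positions(spatial_extent["coordinates"]))
    start, end = temporal_extent.split("/")  # raises ValueError unless exactly two parts
    return {
        "spatial": [min(lons), min(lats), max(lons), max(lats)],
        "temporal": [start, end],
    }


polygon = {
    "type": "Polygon",
    "coordinates": [[
        [7.01, 50.63], [7.22, 50.63], [7.22, 50.78], [7.01, 50.78], [7.01, 50.63],
    ]],
}
extent = to_wfs_extent(polygon, "2010-02-15T12:34:56Z/2018-03-18T12:11:00Z")
```

The conversion is lossy in one direction: a polygon footprint collapses to its bounding box, which is one thing to weigh when choosing between the two conventions.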

Collaborator

@m-mohr m-mohr Aug 24, 2018

Contributor

I just used 'example 4' from the WFS spec. It notes: 'Coordinate reference system information is not provided as the service provides geometries only in the default system (WGS84 longitude/latitude)'. So it seems we could just say that for datasets we require the default.

Collaborator

Okay, then that's fine. Curious what the WFS crew is coming up with for the other issues mentioned. Will change that in the dataset spec. Temporal extent is still a string for now, and I need to find out how they define 3D bboxes (= incl. the z-axis).

Collaborator

Should be mostly WFS compatible now.

| links | [Link Object] | Links (required) | A list of references to other documents, see Link Object for further documentation. TODO: Remove if catalog is revised and links are specified on the catalog level. |
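
A minimal sketch of how a client might check a dataset document against the table above. It covers only the fields marked "(required)", the `id` pattern, and the `license_url` recommendation; `check_dataset` and the sample documents are hypothetical, and a real implementation would more likely validate against the JSON Schema shipped with this spec.

```python
import re
from typing import Any, Dict, List

# Pattern copied from the `id` row of the table above.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-\/]+$")
# Fields marked "(required)" in the table above.
REQUIRED_FIELDS = ("id", "description", "license", "spatial_extent", "temporal_extent", "links")


def check_dataset(dataset: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in dataset
    ]
    dataset_id = dataset.get("id")
    if isinstance(dataset_id, str) and not ID_PATTERN.fullmatch(dataset_id):
        problems.append(f"id does not match the required pattern: {dataset_id!r}")
    # SHOULD-level rule: a proprietary license ought to come with a URL.
    if dataset.get("license") == "proprietary" and "license_url" not in dataset:
        problems.append("license is 'proprietary' but license_url is missing")
    return problems


good = {
    "id": "sentinel2/ndvi-max",
    "description": "Hypothetical example dataset.",
    "license": "Apache-2.0",
    "spatial_extent": {"type": "Point", "coordinates": [0.0, 0.0]},
    "temporal_extent": "2015/2018",
    "links": [],
}
bad = {"id": "sentinel 2", "description": "Has a space in the id.", "license": "proprietary"}

good_problems = check_dataset(good)
bad_problems = check_dataset(bad)
```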

### Provider Object

| Element | Type | Name | Description |
| ------- | ------ | --------------------- | ----------------------------------------------- |
| name | string | Organization name | The name of the organization or the individual. |
| url | string | Organization homepage | Homepage of the provider. |

### Host Object

| Element | Type | Name | Description |
| -------------- | ------- | --------------------- | ------------------------------------------------------------ |
| description | string | Description | Detailed description to explain the hosting details. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| scheme | string | Scheme (required) | Values: S3, GCS, URL, OTHER |
| id | string | Identifier (required) | Host-specific identifier such as a URL or asset id. |
| region | string | Region | Provider specific region where the data is stored. |
Collaborator

Is region primarily an AWS thing or is it general to all cloud providers?

Collaborator

The schema was written before we knew about the idea of storage profiles in #148. I would really like to have profiles instead of having the storage details directly baked into the dataset spec.

region would probably be an AWS specific thing, which should be in a separate profile as proposed in #148. If others have regions, too, then they should have it separately in their profiles as well.


GCS has regions too

Collaborator

Ok, that makes sense though - each cloud provider should define its own storage profile (extension?)

Collaborator

I'd go that route, yes. Is that dataset specific or could that be also catalog or item specific?

| requester_pays | boolean | Requester pays | `true` if requester pays, `false` if host pays. Defaults to `false`. |

**Note:** The idea of storage profiles is currently [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, scheme, id and region may be removed from the final spec.

### Link Object

| Element | Type | Name | Description |
| ------- | ------ | ------------------- | ------------------------------------------------------------ |
| href | string | Link (required) | The actual link in the format of a URL. Relative and absolute links are both allowed. |
| rel | string | Relation (required) | Relationship between the current document and the linked document. |
| type | string | MIME-type | MIME-type of the referenced entity. |
| title | string | Title | Human-readable title for the link. |

## Extensions

Related extensions to be used with the dataset spec:

* [EO extension](../extensions/stac-eo-spec.md)
Please note that some fields such as `eo:sun_elevation` or `eo:sun_azimuth` are only meaningful on the item level and MUST NOT be used in datasets.
* [Dimensions extension](../extensions/dimension) (currently in review, see [PR #164](https://github.com/radiantearth/stac-spec/pull/164))
* [Scientific extension](../extensions/scientific) (currently in review, see [PR #186](https://github.com/radiantearth/stac-spec/pull/186))
* Provenance extension (planned, see [issue #179](https://github.com/radiantearth/stac-spec/issues/179))
136 changes: 136 additions & 0 deletions dataset-spec/json-schema/dataset.json
@@ -0,0 +1,136 @@
{
"$schema": "http://json-schema.org/draft-06/schema#",
"id": "dataset.json#",
"title": "Dataset Item",
"description": "This object represents the dataset in a SpatioTemporal Asset Catalog.",
"type": "object",
"required": [
"id",
"description",
"license",
"spatial_extent",
"temporal_extent",
"links"
],
"properties": {
"id": {
"title": "Provider ID",
"type": "string",
"pattern": "^[A-Za-z0-9_\\-\/]+$"
},
"title": {
"title": "Title",
"type": "string"
},
"description": {
"title": "Description",
"type": "string"
},
"keywords": {
"title": "Keywords",
"type": "array",
"items": {
"type": "string"
}
},
"license": {
"title": "License Name",
"type": "string"
},
"license_url": {
"title": "License URL",
"type": "string",
"format": "url"
},
"provider": {
"type": "array",
"items": {
"properties": {
"name": {
"title": "Organization Name",
"type": "string"
},
"url": {
"title": "Organization homepage",
"type": "string",
"format": "url"
}
}
}
},
"host": {
"required": [
"id",
"scheme"
],
"properties": {
"id": {
"title": "Identifier",
"type": "string"
},
"scheme": {
"title": "Scheme",
"type": "string",
"enum": [
"S3",
"GCS",
"URL",
"OTHER"
]
},
"description": {
"title": "Description",
"type": "string"
},
"region": {
"title": "Region",
"type": "string"
},
"requester_pays": {
"title": "Requester Pays",
"type": "boolean",
"default": false
}
}
},
"version": {
"title": "Version",
"type": "string"
},
"temporal_extent": {
"title": "Temporal extent",
"type": "string"
},
"spatial_extent": {
"type": "object"
},
"links": {
"type": "array",
"items": {
"type": "object",
"required": [
"href",
"rel"
],
"properties": {
"href": {
"title": "Link",
"type": "string"
},
"rel": {
"title": "Relation",
"type": "string"
},
"type": {
"title": "type",
"type": "string"
},
"title": {
"title": "Title",
"type": "string"
}
}
}
}
}
}
19 changes: 19 additions & 0 deletions extensions/dimension/README.md
@@ -0,0 +1,19 @@
# STAC Dimensions Extension Spec

This document explains the fields of the STAC Dimensions Extension (dim) to a STAC `Dataset`. Data can have different dimensions (= axes), e.g. in meteorology. The properties of these dimensions can be defined with this extension.

## Dimensions Extension Description

This is the field that extends the `Dataset` object:

| Element | Type | Name | Description |
| ---------------- | -------------------- | ------------------------- | ------------------------------------------------------------ |
| dim:dimensions | [Dimension Object] | Dimensions | Dimensions of the data. If the dimensions have an order, the order SHOULD be reflected in the order of the array. |

### Dimension Object

| Element | Type | Name | Description |
| ------- | ---------------- | ------------------- | ------------------------------------------------------------ |
| label | string | Label (required) | Human-readable label for the dimension. |
| unit | string | Unit of Measurement | Unit of measurement, preferably SI. ToDo: Any standard to express this, e.g. [UDUNITS](https://www.unidata.ucar.edu/software/udunits/) or this [dict](https://www.unc.edu/~rowlett/units/)? |
| extent | [number\|string] | Data Extent | Specifies the extent of the data, i.e. the lower bound as the first element and the upper bound as the second element of the array. |
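
A minimal sketch of checking a single Dimension Object against the table above. The rule that both bounds must be of the same kind (both numbers or both strings) is an assumption of this sketch and is not stated in the spec; `check_dimension` is a hypothetical name, and string bounds are not compared for ordering.

```python
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    """True for int or float, but not for bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_dimension(dim: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    if not isinstance(dim.get("label"), str):
        problems.append("label is required and must be a string")
    extent = dim.get("extent")
    if extent is None:
        return problems  # extent is optional
    if not isinstance(extent, list) or len(extent) != 2:
        problems.append("extent must be an array of exactly two elements")
    elif all(_is_number(v) for v in extent):
        if extent[0] > extent[1]:
            problems.append("extent lower bound exceeds upper bound")
    elif not all(isinstance(v, str) for v in extent):
        # Assumption of this sketch: mixed number/string bounds are rejected.
        problems.append("extent elements must both be numbers or both be strings")
    return problems


temperature = {"label": "Temperature", "unit": "°C", "extent": [-20, 60]}
date = {"label": "Date", "extent": ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"]}
reversed_bounds = {"label": "Latitude", "unit": "°", "extent": [90, -90]}

results = [check_dimension(d) for d in (temperature, date, reversed_bounds)]
```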
23 changes: 23 additions & 0 deletions extensions/dimension/example.json
@@ -0,0 +1,23 @@
{
"dim:dimensions": [
{
"label": "Longitude",
"unit": "°",
"extent": [-180, 180]
},
{
"label": "Latitude",
"unit": "°",
"extent": [-90, 90]
},
{
"label": "Temperature",
"unit": "°C",
"extent": [-20, 60]
},
{
"label": "Date",
"extent": ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"]
}
]
}
36 changes: 36 additions & 0 deletions extensions/dimension/schema.json
@@ -0,0 +1,36 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"title": "STAC Dimensions Extension Spec",
"properties": {
"dim:dimensions": {
"type": "array",
"title": "Dimensions",
"items": {
"type": "object",
"required": [
"label"
],
"properties": {
"label": {
"type": "string",
"title": "Label"
},
"unit": {
"type": "string",
"title": "Unit of Measurement"
},
"extent": {
"type": "array",
"title": "Data Extent",
"minItems": 2,
"maxItems": 2,
"items": {
"type": ["number", "string"]
}
}
}
}
}
}
}
10 changes: 5 additions & 5 deletions extensions/stac-collection-spec.md
@@ -4,11 +4,11 @@ A group of STAC `Item` objects from a single source can share a lot of common me

## Collection Extension Description

| element | type info | name | description |
|----------------------|---------------------------|-------------------------|---------------------------------------------------------------------------------------------|
| c:id | string | Collection ID | Machine readable ID for the collection
| c:name | string (optional) | Collection Name | A name given to the Collection, used for display
| c:description | string (optional) | Collection Description | A human readable description of the collection. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation.
| element | type info | name | description |
| ------------- | ----------------- | ---------------------- | ------------------------------------------------ |
| c:id | string | Collection ID | Machine readable ID for the collection |
| c:name | string (optional) | Collection Name | A name given to the Collection, used for display |
| c:description | string (optional) | Collection Description | A human readable description of the collection. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |

A `Collection` does not have many specific fields, as it may contain any fields that are in the core spec as well as any other extension. This provides maximum flexibility to data providers, as the set of common metadata fields can vary between different types of data. For instance, Landsat and Sentinel data always have an eo:off_nadir value of 0, because those satellites are always pointed downward (i.e., nadir), while satellites that can be pointed will have varying eo:off_nadir values.
