CRS as dimension in data cubes #9

Closed
kempenep opened this issue Dec 6, 2019 · 34 comments

@kempenep commented Dec 6, 2019

@m-mohr I think we have already had several discussions on the coordinate reference system (CRS).
In my opinion, there should be a crs parameter in load_collection. Not only for selecting the region of interest, as we have now, but for defining the target reference system of the resulting cube. Once input images with different CRS are selected and need to be combined, we need to be able to re-project them into a single cube. I also think it is best to define the CRS as early as possible in the workflow, which I guess is in the load_collection process.

@m-mohr (Member) commented Dec 9, 2019

This issue has a long history, see issues Open-EO/openeo-api#4, Open-EO/openeo-api#28, Open-EO/openeo-api#29 and Open-EO/openeo-api#89. (edit: fixed links to issues)

Currently, load_collection already covers filter_bbox, filter_temporal, filter_bands and filter. In addition, it would then also cover resample_spatial with 4 additional parameters.
The reason is that it's not just the CRS that needs to match, but also the resolution (and maybe more?). We would likely overload load_collection and it would get too complicated.

So potential solutions are:

  1. Add the 4 additional parameters from resample_spatial to load_collection
  2. Expose collections with just a single CRS and/or expose collections matching all potential CRS for the collection.
  3. Somehow get this information from the process graph (i.e. resample or save_result calls, but this might be ambiguous).
  4. ...?

I must admit I don't like any of the options yet. Ideas? Thoughts?
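For illustration, option 1 would roughly mean mirroring the resample_spatial parameters in load_collection. A minimal sketch with the openEO Python client, where the back-end URL and collection id are placeholders and the extra parameters (shown as comments) are purely hypothetical and not part of any spec or client:

```python
import openeo

connection = openeo.connect("https://openeo.example.backend")  # placeholder URL

# A regular load_collection call; the commented-out arguments sketch how the
# four resample_spatial-style parameters from option 1 could be folded in.
cube = connection.load_collection(
    "SENTINEL2_L1C",  # example collection id
    spatial_extent={"west": 16.1, "south": 47.9, "east": 16.6, "north": 48.3},
    temporal_extent=["2019-06-01", "2019-06-30"],
    bands=["B04", "B08"],
    # target_projection=32633,  # hypothetical: target CRS (EPSG code)
    # resolution=10,            # hypothetical: target resolution
    # method="bilinear",        # hypothetical: resampling method
    # align="upper-left",       # hypothetical: grid alignment
)
```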

@kempenep (Author) commented Dec 9, 2019

I believe the spatial (and temporal) resolution needs to be defined before actually loading a collection. Even a single collection such as Sentinel-2 can be stored in multiple projection (UTM) zones. Selecting a bounding box on the border of different UTM zones is ambiguous without defining a target projection. I would be in favor of solution 1 (but I think we should also add parameters for the temporal resolution).

@m-mohr (Member) commented Dec 9, 2019

We need to decide on the call on Thursday whether a process with 10+ parameters is desirable and what alternatives we have. I just covered spatial above, but temporal is indeed also an issue. Also, does this also apply to a process such as load_results or load_user_data (see Open-EO/openeo-processes#83)?

@lforesta commented Dec 9, 2019

I am not in favor of a process with 10+ parameters in general.

Resampling/reprojecting "early" may not always be the best option, since one may end up doing that operation for several hundreds or maybe thousands of timestamps (rasters). Instead, the reprojection/downsampling can potentially be done at the end; it should be the user who decides when/if to do that. And I think the user already has all the capabilities to work with the projection and spatial/temporal sampling that they want. If the user needs to have data in projection EPSG:xxxx, they simply use resample_spatial and/or resample_temporal just after load_collection and before applying any other process.
Are there cases which are not covered by this approach?
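For reference, a minimal sketch of that pattern with the openEO Python client (back-end URL and collection id are placeholders): resample_spatial directly after load_collection pins the cube to a single target CRS and resolution before any further processing.

```python
import openeo

connection = openeo.connect("https://openeo.example.backend")  # placeholder URL

cube = connection.load_collection(
    "SENTINEL2_L1C",  # example collection id; may span several UTM zones
    spatial_extent={"west": 11.2, "south": 46.4, "east": 11.9, "north": 46.8},
    temporal_extent=["2019-01-01", "2019-06-30"],
    bands=["B04", "B08"],
)

# Enforce a single target projection and resolution as the first operation,
# so the user (not the back-end) decides when the reprojection happens.
cube = cube.resample_spatial(resolution=10, projection=32632, method="bilinear")

# ... further processes, then e.g. cube.download("result.tiff", format="GTiff")
```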

@kempenep (Author) commented Dec 9, 2019

If load_collection results in a (non-virtual) data cube, some interpolation/reprojection has already been involved. A subsequent resample_spatial or resample_temporal implies another resampling step.
To make things worse, an additional parameter would be the data type. Some back-ends might use Float32/64 as default, others adopt the native data type as stored on file.

@aljacob (Member) commented Dec 10, 2019

In my opinion, collections should just have a single projection.
I am still thinking of the more scientific users when designing this, and they should be able to decide themselves whether they want to do re-projection early or late.
I would keep that separate from load_collection.

I am not sure if we can define a construct where somebody can, prior to load_collection, define something like a target grid and hand that over to load_collection.

@kempenep (Author)

A construct prior to load_collection that defines the target grid would be an option indeed. The same could apply for the target resolution in both time and space.

@m-mohr (Member) commented Dec 10, 2019

What is the difference between such a construct and additional parameters in load_collection? I think it would make things even more complex than the additional parameters...

I'm thinking it may even be required to separate scientific users from "I don't care about projections" non-remote-sensing users and cater to them with different collections. There could be the "EURAC" approach with several collections split by projection, resolution, etc., and then a separate "ready-to-use" collection as you usually get them in GEE, for example. That way it is up to the back-end to provide the best solution for its target audience, and it is no longer really an API/processes issue.

@jdries (Contributor) commented Dec 10, 2019

I think I agree with @lforesta that we can already do a lot with the current resample process. What we do additionally in our backend is trying to 'push down' certain parameters from higher-level processes, or deriving them from the requested type of output.
For instance, when creating a viewing service in Web Mercator, we know we can start by loading all data in Web Mercator. Similarly, the bounding box could be limited to a given UTM zone, allowing you to decide to work in that zone.
Or more explicitly, a resample operation done at some point in the graph could also be applied at the beginning, especially when it comes right after the load_collection step. So in my opinion, it is up to the backend how to optimally execute the graph, as long as the result is correct.

@lforesta

In my opinion, collections should just have a single projection. @aljacob

We can't enforce this in any way; a back-end will store data with the projection(s) they prefer for their purpose. In our case, for S2 L1C data, we don't reproject TBs of data to a common projection, but we do for some higher-level data.
This should be clear in the collection's description though, so the user will have this piece of information and can decide if and when to reproject/resample in the process graph.

@m-mohr (Member) commented Dec 11, 2019

Well, you could enforce it by splitting into as many collections as there are projections. That's what EURAC is currently doing. That's probably not very user-friendly for people who are coming from GEE and are used to fully reprojected datasets with a single projection, but scientific users are probably happy about having full control.

The underlying issue is: if you have, for example, the full Sentinel-2 archive with its dozens of UTM projections and you load it into a data cube that is expected to have a single projection, how do you load that data cube? Currently, for openEO it's either the "EURAC" or the "GEE" way. The question is whether there's a better alternative, e.g. the 10+ parameter load_collection process or ...?

@jdries (Contributor) commented Dec 11, 2019

Dealing with the UTM zones is implementation-specific imo; for now a lot of backends indeed rely on reprojecting to an intermediate projection, but nothing is stopping them from doing something more advanced.
If we force the user to specify a projection, then this will also block backends from being more innovative/creative when it comes to projection handling, as the API will enforce reprojecting.

@edzer (Member) commented Dec 11, 2019

From the documentation: Loads a collection from the current back-end by its id and returns it as processable data cube. I understand this as follows:

  • image collections may consist of images registered at different CRS (like Sentinel 2 over different UTM zones)
  • load_collection returns a data cube, which is still only a view on the data (it doesn't need to write it, so to speak, as a new GeoTIFF or so at load time)
  • Since (spatial) dimensions of a cube require a unique mapping from pixel space to coordinates, this cube needs to have a single CRS
  • at time of load_collection, a resolution is not specified since we may want to browse results using a WMTS, and compute pixels at increasing/decreasing resolution and changing extent on-the-fly
  • at time of downloading result, a resolution is required
  • when CRS of cube and image collection are identical (like Jeroen's example of working in a single UTM zone) the native resolution of the image collection should be used when no resolution is specified (no resampling).

@m-mohr (Member) commented Dec 11, 2019

@edzer All true, but you did not cover the case we are discussing here: what happens if the image collection consists of images registered at different CRS and you load it into a single-CRS data cube? That is somewhat undefined at the moment.

We already had this on our agenda several times, but at some point it got lost, see Open-EO/openeo-api#4, Open-EO/openeo-api#28, Open-EO/openeo-api#29 and Open-EO/openeo-api#89

@kempenep (Author) commented Dec 11, 2019

Since a cube needs to have a single CRS, I am in favor of letting the user decide the CRS when creating a cube (or before, by setting some variable).
I agree there can be some "smart" defaults that are well documented, e.g. EPSG:3857 for WMTS, or keeping the original projection of a tile as stored in the respective back-end when a requested region of interest does not extend over more than one projection zone... We have to be aware, though, that this can lead to non-transparent differences between the results of the various back-ends.

@aljacob (Member) commented Dec 11, 2019

When we met the people from OGC in London for the API Hackathon, this issue was also discussed. You could also define a collection of collections, covering the case where different sub-collections have different projections, and tackle this using the virtual cube design discussed by @edzer and @kempenep, with a default mechanism whereby the backend describes it with a common global CRS like EPSG:3857 or EPSG:4326. How this virtual cube is created is of course up to the backend, and should allow for innovative solutions as proposed by @jdries.

@m-mohr (Member) commented Dec 11, 2019

@aljacob How is that different from what we have now? It seems to be the same, except that we hide sub-collections and don't have a way to specify the "global" CRS (which is what we are discussing here).
"How this virtual cube is created is of course up to the backend" => That is the current implicit behaviour of load_collection.

@edzer (Member) commented Dec 11, 2019

Which CRS: what is wrong with taking the CRS of the spatial_extent as the CRS for the cube returned?

"How this virtual cube is created is of course up to the backend" : only to the extent that when you download the result, you'd like to have an unambiguous (reproducible) result.

@kempenep (Author)

@edzer I support this idea to take the CRS of the spatial_extent as the CRS for the cube returned. This allows the user to define the CRS and avoids adding a new parameter.

@aljacob (Member) commented Dec 11, 2019

The spatial extent is always given in EPSG:4326 aka WGS84 according to STAC. The collection-id endpoint, however, also has a field for the native projection of the data cube (eo:epsg). So @edzer, using the spatial_extent as it is implemented now would limit us to working in that one projection, which might not always be desirable.
@m-mohr it is not different; it was just to illustrate how it could be set up in a specific back-end. I also agree that we can do all of this already with the existing processes. Using load_collection, resample_cube_spatial and merge_cubes should give a user every possibility to fine-control this.

@m-mohr (Member) commented Dec 11, 2019

The spatial extent is always given in EPSG:4326 aka wgs84 according to stac.

This is a misunderstanding. As far as I understood @edzer, he meant the spatial extent specified in load_collection.

I guess going the route via the spatial_extent CRS could be an option, although it has two drawbacks:

  1. It may need some dedicated work and good explanations in clients. For example, the Web Editor always specifies the CRS as 3857, as that is the CRS of the mapping library you choose the bbox with.
  2. Also, a GeoJSON geometry is always WGS84, and as such you wouldn't have a way to specify the CRS.

Therefore it would be better to specify the CRS separate from the spatial_extent, I think.

In general, there are three other parameters in resample_spatial (resolution, method, align). Do we need any of them in load_collection? I guess we don't need to think about resolution, and then method (and align?) are not needed, right?

The collection-id endpoint however has as one field also the native projection of the data cube (eo:epsg).

Side note: This changes in STAC 0.9 and you can specify multiple projections per collection.

So @edzer using the spatial_extent as it is implemented now would limit us to work in that one projection, which might not always be desirable.

The data cube always has one projection. If you need multiple projections in a workflow, create different data cubes.

@kempenep

If load_collection results in a (non-virtual) data cube, some interpolation/reprojection has already been involved. A subsequent resample_spatial or resample_temporal implies another resampling step.

I would think we don't need any dedicated temporal resampling in load_collection?!

To make things worse, an additional parameter would be the data type. Some back-ends might use Float32/64 as default, others adopt the native data type as stored on file.

I don't think we should expose this to the user. If there's really a conflict then it's up to the back-end to decide. At the moment we don't even expose the internal data type to a user anyway.

m-mohr referenced this issue in Open-EO/openeo-processes Dec 11, 2019
@m-mohr (Member) commented Dec 11, 2019

Okay, I created PR Open-EO/openeo-processes#102 so that we have an actual proposal to discuss and improve.

m-mohr referenced this issue in Open-EO/openeo-processes Dec 11, 2019
@lforesta

Telco discussion -> no conclusion reached.

The main question to answer is: do we allow an openEO datacube to have multiple CRS associated with it? Currently, this is implicitly the case.

Currently, the user can force a specific projection by using resample_spatial (e.g. just after load_collection). But this is not good enough for some back-ends, which may load data into memory directly with the load_collection process (hence the need for CRS to be a parameter of load_collection).

@m-mohr (Member) commented Dec 13, 2019

Currently it is not expected to have multiple CRS per data cube (although I'm not sure this is documented anywhere), which leads to undefined behavior if we don't implement Open-EO/openeo-processes#102 or an alternative approach.

@jdries (Contributor) commented Dec 13, 2019

The point made during the telco is that in a lot of cases the user doesn't really care about the CRS, even if it is multi-CRS.
If the user wants a specific single CRS, then they should enforce it using resample_spatial. If the user doesn't enforce it, the backend will choose the best option when needed.
The most important consequence of this is that a generated raster (e.g. GeoTIFF) can have a different CRS depending on the backend. This was discussed, but it wasn't clear why this would be a problem (given that the option to enforce an output CRS is there).

@m-mohr (Member) commented Dec 16, 2019

The crs field is not required for single-CRS collections. If the user doesn't care, they don't specify it. For multi-CRS collections, however, it should throw an error, as (at the moment) the data cube model in openEO is based on a single CRS to make things comparable and reproducible. Just letting the back-end choose the best option is by no means comparable or reproducible.

@jdries (Contributor) commented Dec 16, 2019

Users that:

  1. want to compare results between backends
  2. AND don't know how to compare results with a different CRS

will have to specify a resample operation; that's not too hard.
There are a lot of use cases that don't fall into this category; for them, things will be even easier and more effective in terms of cost and performance, because the backend is free to optimize.
Unfortunately, Sentinel-2 is a multi-CRS collection for a lot of backends. I don't want to throw an error if a user happens to forget the CRS, or doesn't care about it.

@jdries (Contributor) commented Dec 17, 2019

To come to a conclusion on this one, I would say that at least more backend implementation work/experiments are needed, especially in the area of multi-CRS collections. On the VITO side, this is planned for the coming months. I don't think this issue is blocking anything, so I propose to put it on hold until there are new arguments pro or con.

@m-mohr (Member) commented Dec 17, 2019

We discussed this a little more here and in principle we may also need a target resolution parameter in addition to the target CRS.

@m-mohr (Member) commented Jan 13, 2020

Telco: PR Open-EO/openeo-processes#102 has been closed in favor of a new approach.
We are trying to make CRS a dimension as described here: openEO_dimensions.pdf

ToDos:

  1. Edzer is going to update the glossary accordingly.
  2. I'm going to check the API for potential changes.
  3. Processes will be updated only when we find issues.
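To give a rough idea of the direction (this is a hypothetical sketch, not the content of openEO_dimensions.pdf), a multi-CRS collection could advertise an additional dimension whose labels are the EPSG codes of its sub-cubes, alongside the usual spatial, temporal and band dimensions:

```python
# Hypothetical metadata fragment for a multi-CRS collection: an extra "crs"
# dimension enumerates the projections (here two Sentinel-2 UTM zones) that
# coexist in the collection; names and structure are illustrative only.
cube_dimensions = {
    "x": {"type": "spatial", "axis": "x"},
    "y": {"type": "spatial", "axis": "y"},
    "t": {"type": "temporal", "extent": ["2015-06-23T00:00:00Z", None]},
    "bands": {"type": "bands", "values": ["B04", "B08"]},
    "crs": {"type": "other", "values": [32632, 32633]},
}
```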

@m-mohr m-mohr self-assigned this Jan 13, 2020
@m-mohr m-mohr changed the title coordinate reference system for load_collection CRS as dimension in data cubes Jan 22, 2020
@m-mohr m-mohr transferred this issue from Open-EO/openeo-processes Jan 22, 2020
edzer added a commit that referenced this issue Feb 21, 2020
@edzer (Member) commented Feb 21, 2020

I added a section here; for @jdries to review - is this comprehensible?

@jdries (Contributor) commented Feb 21, 2020 via email

@m-mohr (Member) commented Mar 4, 2020

Regarding the last part:

The "reduction" (removal) of the crs dimension that is meaningful involves the resampling/warping of all sub-cubes for the crs dimension to a single, common target coordinate reference system.

I'm wondering 1) how this translates into a process graph and 2) whether this implies any changes to the processes. It probably needs clarifications in the spatial resample processes; they don't talk about (removal of) dimensions at all.
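One possible reading, purely as an assumption for discussion, is that resample_spatial with an explicit target projection would act as that reduction: it warps all per-CRS sub-cubes to one common CRS and thereby drops the crs dimension. A minimal sketch with the openEO Python client (URL and collection id are placeholders):

```python
import openeo

connection = openeo.connect("https://openeo.example.backend")  # placeholder URL

# Assume "S2_MULTI_CRS" is a collection exposed with a crs dimension (hypothetical id).
cube = connection.load_collection(
    "S2_MULTI_CRS",
    spatial_extent={"west": 10.5, "south": 46.0, "east": 12.5, "north": 47.0},
    temporal_extent=["2019-06-01", "2019-06-30"],
)

# Warping everything to EPSG:3035 at 10 m would, under this reading, be the
# step that removes the crs dimension, if the resample processes said so.
single_crs_cube = cube.resample_spatial(resolution=10, projection=3035)
```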

@m-mohr (Member) commented May 18, 2021

Moving to Open-EO/openeo-processes#251

@m-mohr m-mohr closed this as completed May 18, 2021