CRS as dimension in data cubes #9

Closed
kempenep opened this issue Dec 6, 2019 · 34 comments

@kempenep commented Dec 6, 2019

@m-mohr I think we have already had several discussions on the coordinate reference system (CRS).
In my opinion, there should be a crs parameter in load_collection. Not only for selecting the region of interest, as we have now, but for defining the target reference system of the resulting cube. Once input images with different CRS are selected and need to be combined, we need to be able to re-project them into a single cube. I also think it is best to define the CRS as early as possible in the workflow, which I guess is in the load_collection process.

@m-mohr (Member) commented Dec 9, 2019

This issue has a long history, see issues Open-EO/openeo-api#4, Open-EO/openeo-api#28, Open-EO/openeo-api#29 and Open-EO/openeo-api#89. (edit: fixed links to issues)

Currently, load_collection already covers filter_bbox, filter_temporal, filter_bands and filter. In addition, it would then also cover resample_spatial with 4 additional parameters.
The reason is that it's not just the CRS that needs to match, but also the resolution (and maybe more?). We would likely overload load_collection and it would get too complicated.

So potential solutions are:

  1. Add the 4 additional parameters from resample_spatial to load_collection
  2. Expose collections with just a single CRS and/or expose collections matching all potential CRS for the collection.
  3. Somehow get this information from the process graph (i.e. resample or save_result calls, but this might be ambiguous).
  4. ...?

I must admit I don't like any of the options yet. Ideas? Thoughts?
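For illustration, option 1 would roughly mean mirroring the resample_spatial parameters in load_collection. A minimal sketch with the openEO Python client, where the back-end URL and collection id are placeholders and the extra parameters (shown as comments) are purely hypothetical and not part of any spec or client:

```python
import openeo

connection = openeo.connect("https://openeo.example.backend")  # placeholder URL

# A regular load_collection call; the commented-out arguments sketch how the
# four resample_spatial-style parameters from option 1 could be folded in.
cube = connection.load_collection(
    "SENTINEL2_L1C",  # example collection id
    spatial_extent={"west": 16.1, "south": 47.9, "east": 16.6, "north": 48.3},
    temporal_extent=["2019-06-01", "2019-06-30"],
    bands=["B04", "B08"],
    # target_projection=32633,  # hypothetical: target CRS (EPSG code)
    # resolution=10,            # hypothetical: target resolution
    # method="bilinear",        # hypothetical: resampling method
    # align="upper-left",       # hypothetical: grid alignment
)
```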

@kempenep (Author) commented Dec 9, 2019

I believe the spatial (and temporal) resolution needs to be defined before actually loading a collection. Even a single collection such as Sentinel-2 can be stored in multiple projection (UTM) zones. Selecting a bounding box on the border of different UTM zones is ambiguous without defining a target projection. I would be in favor of solution 1 (but I think we should also add parameters for the temporal resolution).

@m-mohr (Member) commented Dec 9, 2019

We need to decide on the call on Thursday whether a process with 10+ parameters is desirable and what alternatives we have. I just covered spatial above, but temporal is indeed also an issue. Also, does this also apply to a process such as load_results or load_user_data (see Open-EO/openeo-processes#83)?

@lforesta commented Dec 9, 2019

I am not in favor of a process with 10+ parameters in general.

Resampling/reprojecting "early" may not always be the best option, since one may end up doing that operation for several hundreds or maybe thousands of timestamps (rasters). Instead, the reprojection/downsampling can potentially be done at the end; it should be the user who decides when/if to do that. And I think the user already has all the capabilities to work with the projection and spatial/temporal sampling that they want. If the user needs to have data in projection EPSG:xxxx, they simply use resample_spatial and/or resample_temporal just after load_collection and before applying any other process.
Are there cases which are not covered by this approach?
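For reference, a minimal sketch of that pattern with the openEO Python client (back-end URL and collection id are placeholders): resample_spatial directly after load_collection pins the cube to a single target CRS and resolution before any further processing.

```python
import openeo

connection = openeo.connect("https://openeo.example.backend")  # placeholder URL

cube = connection.load_collection(
    "SENTINEL2_L1C",  # example collection id; may span several UTM zones
    spatial_extent={"west": 11.2, "south": 46.4, "east": 11.9, "north": 46.8},
    temporal_extent=["2019-01-01", "2019-06-30"],
    bands=["B04", "B08"],
)

# Enforce a single target projection and resolution as the first operation,
# so the user (not the back-end) decides when the reprojection happens.
cube = cube.resample_spatial(resolution=10, projection=32632, method="bilinear")

# ... further processes, then e.g. cube.download("result.tiff", format="GTiff")
```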

@kempenep (Author) commented Dec 9, 2019

If load_collection results in a (non-virtual) data cube, some interpolation/reprojection has already been involved. A subsequent resample_spatial or resample_temporal implies another resampling step.
To make things worse, an additional parameter would be the data type. Some back-ends might use Float32/64 as default, others adopt the native data type as stored on file.

@aljacob (Member) commented Dec 10, 2019

In my opinion, collections should just have a single projection.
I am still thinking of the more scientific users when designing this, and they should be able to decide themselves whether they want to do re-projection early or late.
I would keep that separate from load_collection.

I am not sure if we can define a construct where somebody can, prior to load_collection, define something like a target grid and hand that over to load_collection.

@kempenep (Author)

A construct prior to load_collection that defines the target grid would be an option indeed. The same could apply for the target resolution in both time and space.

@m-mohr (Member) commented Dec 10, 2019

What is the difference between such a construct and additional parameters in load_collection? I think it would make things even more complex than the additional parameters...

I'm thinking it may even be required to separate scientific users from "I don't care about projections" non-remote-sensing users and cater to them with different collections. There could be the "EURAC" approach with several collections split by projection, resolution, etc., and then a separate "ready-to-use" collection as you usually get them in GEE, for example. That way it is up to the back-end to provide the best solution for its target audience, and it is no longer really an API/processes issue.

@jdries (Contributor) commented Dec 10, 2019

I think I agree with @lforesta that we can already do a lot with the current resample process. What we do additionally in our backend is trying to 'push down' certain parameters from higher-level processes, or deriving them from the requested type of output.
For instance, when creating a viewing service in Web Mercator, we know we can start by loading all data in Web Mercator. Similarly, the bounding box could be limited to a given UTM zone, allowing you to decide to work in that zone.
Or more explicitly, a resample operation done at some point in the graph could also be applied at the beginning, especially when it comes right after the load_collection step. So in my opinion, it is up to the backend how to optimally execute the graph, as long as the result is correct.

@lforesta

In my opinion, collections should just have a single projection. @aljacob

We can't enforce this in any way; a back-end will store data with the projection(s) they prefer for their purpose. In our case, for S2 L1C data, we don't reproject TBs of data to a common projection, but we do for some higher-level data.
This should be clear in the collection's description though, so the user will have this piece of information and can decide if and when to reproject/resample in the process graph.

@m-mohr (Member) commented Dec 11, 2019

Well, you could enforce it by splitting into as many collections as there are projections. That's what EURAC is currently doing. That's probably not very user-friendly for people who are coming from GEE and are used to fully reprojected datasets with a single projection, but scientific users are probably happy about having full control.

The underlying issue is: if you have, for example, the full Sentinel-2 archive with its dozens of UTM projections and you load it into a data cube that is expected to have a single projection, how do you load that data cube? Currently, for openEO it's either the "EURAC" or the "GEE" way. The question is whether there's a better alternative, e.g. the 10+ parameter load_collection process or ...?

@jdries (Contributor) commented Dec 11, 2019

Dealing with the UTM zones is implementation-specific imo; for now a lot of backends indeed rely on reprojecting to an intermediate projection, but nothing is stopping them from doing something more advanced.
If we force the user to specify a projection, then this will also block backends from being more innovative/creative when it comes to projection handling, as the API will enforce reprojecting.

@edzer (Member) commented Dec 11, 2019

From the documentation: Loads a collection from the current back-end by its id and returns it as processable data cube. I understand this as follows:

  • image collections may consist of images registered at different CRS (like Sentinel 2 over different UTM zones)
  • load_collection returns a data cube, which is still only a view on the data (it doesn't need to write it, so to speak, as a new GeoTIFF or so at load time)
  • Since (spatial) dimensions of a cube require a unique mapping from pixel space to coordinates, this cube needs to have a single CRS
  • at time of load_collection, a resolution is not specified since we may want to browse results using a WMTS, and compute pixels at increasing/decreasing resolution and changing extent on-the-fly
  • at time of downloading result, a resolution is required
  • when CRS of cube and image collection are identical (like Jeroen's example of working in a single UTM zone) the native resolution of the image collection should be used when no resolution is specified (no resampling).

@m-mohr (Member) commented Dec 11, 2019

@edzer All true, but you did not cover the case we are discussing here: what happens if the image collection consists of images registered at different CRS and you load it into a single-CRS data cube? That is somewhat undefined at the moment.

We already had this on our agenda several times, but at some point it got lost, see Open-EO/openeo-api#4, Open-EO/openeo-api#28, Open-EO/openeo-api#29 and Open-EO/openeo-api#89

@kempenep (Author) commented Dec 11, 2019

Since a cube needs to have a single CRS, I am in favor of letting the user decide the CRS when creating a cube (or before, by setting some variable).
I agree there can be some "smart" defaults that are well documented, e.g. EPSG:3857 for WMTS, or keeping the original projection of a tile as stored in the respective back-end when a requested region of interest does not extend over more than one projection zone... We have to be aware, though, that this can lead to non-transparent differences between the results of the various back-ends.

@aljacob (Member) commented Dec 11, 2019

When we met the people from OGC in London for the API Hackathon, this issue was also discussed. You could also define a collection of collections, covering the case where different sub-collections have different projections, and tackle this using the virtual cube design discussed by @edzer and @kempenep, with a default mechanism whereby the backend describes it with a common global CRS like EPSG:3857 or EPSG:4326. How this virtual cube is created is of course up to the backend, and should allow for innovative solutions as proposed by @jdries.

@m-mohr (Member) commented Dec 11, 2019

@aljacob How is that different from what we have now? It seems to be the same, except that we hide sub-collections and don't have a way to specify the "global" CRS (which is what we are discussing here).
"How this virtual cube is created is of course up to the backend" => That is the current implicit behaviour of load_collection.

@edzer (Member) commented Dec 11, 2019

Which CRS: what is wrong with taking the CRS of the spatial_extent as the CRS for the cube returned?

"How this virtual cube is created is of course up to the backend" : only to the extent that when you download the result, you'd like to have an unambiguous (reproducible) result.

@kempenep (Author)

@edzer I support this idea to take the CRS of the spatial_extent as the CRS for the cube returned. This allows the user to define the CRS and avoids adding a new parameter.

@aljacob (Member) commented Dec 11, 2019

The spatial extent is always given in EPSG:4326 aka WGS84 according to STAC. The collection-id endpoint, however, also has a field for the native projection of the data cube (eo:epsg). So @edzer, using the spatial_extent as it is implemented now would limit us to working in that one projection, which might not always be desirable.
@m-mohr it is not different; it was just to illustrate how it could be set up in a specific back-end. I also agree that we can do all of this already with the existing processes. Using load_collection, resample_cube_spatial and merge_cubes should give a user every possibility to fine-control this.

@m-mohr (Member) commented Dec 11, 2019

The spatial extent is always given in EPSG:4326 aka wgs84 according to stac.

This is a misunderstanding. As far as I understood @edzer, he meant the spatial extent specified in load_collection.

I guess going the route via the spatial_extent CRS could be an option, although it has two drawbacks:

  1. It may need some dedicated work and good explanations in clients. For example, the Web Editor always specifies the CRS as 3857, as that is the CRS of the mapping library you choose the bbox with.
  2. Also, a GeoJSON geometry is always WGS84, and as such you wouldn't have a way to specify the CRS.

Therefore it would be better to specify the CRS separate from the spatial_extent, I think.

In general, there are three other parameters in resample_spatial (resolution, method, align). Do we need any of them in load_collection? I guess we don't need to think about resolution, and then method (and align?) are not needed, right?

The collection-id endpoint however has as one field also the native projection of the data cube (eo:epsg).

Side note: This changes in STAC 0.9 and you can specify multiple projections per collection.

So @edzer using the spatial_extent as it is implemented now would limit us to work in that one projection, which might not always be desirable.

The data cube always has one projection. If you need multiple projections in a workflow, create different data cubes.

@kempenep

If load_collection results in a (non-virtual) data cube, some interpolation/reprojection has already been involved. A subsequent resample_spatial or resample_temporal implies another resampling step.

I would think we don't need any dedicated temporal resampling in load_collection?!

To make things worse, an additional parameter would be the data type. Some back-ends might use Float32/64 as default, others adopt the native data type as stored on file.

I don't think we should expose this to the user. If there's really a conflict then it's up to the back-end to decide. At the moment we don't even expose the internal data type to a user anyway.

m-mohr referenced this issue in Open-EO/openeo-processes Dec 11, 2019
@m-mohr (Member) commented Dec 11, 2019

Okay, I created PR Open-EO/openeo-processes#102 so that we have an actual proposal to discuss and improve.

m-mohr referenced this issue in Open-EO/openeo-processes Dec 11, 2019
@lforesta

Telco discussion -> no conclusion reached.

The main question to answer is: do we allow an openEO datacube to have multiple CRS associated with it? Currently, this is implicitly the case.

Currently, the user can force a specific projection by using resample_spatial (e.g. just after load_collection). But this is not good enough for some back-ends, which may load data into memory directly with the load_collection process (hence the need for CRS to be a parameter of load_collection).

@m-mohr (Member) commented Dec 13, 2019

Currently it is not expected to have multiple CRS per data cube (although I'm not sure this is documented anywhere), which leads to undefined behavior if we don't implement Open-EO/openeo-processes#102 or an alternative approach.

@jdries (Contributor) commented Dec 13, 2019

The point made during the telco is that in a lot of cases the user doesn't really care about the CRS, even if it is multi-CRS.
If the user wants a specific single CRS, then they should enforce it using resample_spatial. If the user doesn't enforce it, the backend will choose the best option when needed.
The most important consequence of this is that a generated raster (e.g. GeoTIFF) can have a different CRS depending on the backend. This was discussed, but it wasn't clear why this would be a problem (given that the option to enforce an output CRS is there).

@m-mohr (Member) commented Dec 16, 2019

The crs field is not required for single-CRS collections. If the user doesn't care, they don't specify it. For multi-CRS collections, however, it should throw an error, as (at the moment) the data cube model in openEO is based on a single CRS to make things comparable and reproducible. Just letting the back-end choose the best option is by no means comparable or reproducible.

@jdries (Contributor) commented Dec 16, 2019

Users that:

  1. want to compare results between backends
  2. AND don't know how to compare results with a different CRS

will have to specify a resample operation; that's not too hard.
There are a lot of use cases that don't fall into this category; for them, things will be even easier and more effective in terms of cost and performance, because the backend is free to optimize.
Unfortunately, Sentinel-2 is a multi-CRS collection for a lot of backends. I don't want to throw an error if a user happens to forget the CRS, or doesn't care about it.

@jdries (Contributor) commented Dec 17, 2019

To come to a conclusion on this one, I would say that at least more backend implementation work/experiments are needed, especially in the area of multi-CRS collections. On the VITO side, this is planned for the coming months. I don't think this issue is blocking anything, so I propose to put it on hold until there are new arguments pro or con.

@m-mohr (Member) commented Dec 17, 2019

We discussed this a little more here and in principle we may also need a target resolution parameter in addition to the target CRS.

@m-mohr (Member) commented Jan 13, 2020

Telco: PR Open-EO/openeo-processes#102 has been closed in favor of a new approach.
We are trying to make CRS a dimension as described here: openEO_dimensions.pdf

ToDos:

  1. Edzer is going to update the glossary accordingly.
  2. I'm going to check the API for potential changes.
  3. Processes will be updated only when we find issues.
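To give a rough idea of the direction (this is a hypothetical sketch, not the content of openEO_dimensions.pdf), a multi-CRS collection could advertise an additional dimension whose labels are the EPSG codes of its sub-cubes, alongside the usual spatial, temporal and band dimensions:

```python
# Hypothetical metadata fragment for a multi-CRS collection: an extra "crs"
# dimension enumerates the projections (here two Sentinel-2 UTM zones) that
# coexist in the collection; names and structure are illustrative only.
cube_dimensions = {
    "x": {"type": "spatial", "axis": "x"},
    "y": {"type": "spatial", "axis": "y"},
    "t": {"type": "temporal", "extent": ["2015-06-23T00:00:00Z", None]},
    "bands": {"type": "bands", "values": ["B04", "B08"]},
    "crs": {"type": "other", "values": [32632, 32633]},
}
```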

@m-mohr m-mohr self-assigned this Jan 13, 2020
@m-mohr m-mohr changed the title coordinate reference system for load_collection CRS as dimension in data cubes Jan 22, 2020
@m-mohr m-mohr transferred this issue from Open-EO/openeo-processes Jan 22, 2020
edzer added a commit that referenced this issue Feb 21, 2020
@edzer (Member) commented Feb 21, 2020

I added a section here; for @jdries to review - is this comprehensible?

@jdries (Contributor) commented Feb 21, 2020 via email

@m-mohr (Member) commented Mar 4, 2020

Regarding the last part:

The "reduction" (removal) of the crs dimension that is meaningful involves the resampling/warping of all sub-cubes for the crs dimension to a single, common target coordinate reference system.

I'm wondering 1) how this translates into a process graph and 2) whether this implies any changes to the processes. It probably needs clarifications in the spatial resample processes; they don't talk about (removal of) dimensions at all.
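One possible reading, purely as an assumption for discussion, is that resample_spatial with an explicit target projection would act as that reduction: it warps all per-CRS sub-cubes to one common CRS and thereby drops the crs dimension. A minimal sketch with the openEO Python client (URL and collection id are placeholders):

```python
import openeo

connection = openeo.connect("https://openeo.example.backend")  # placeholder URL

# Assume "S2_MULTI_CRS" is a collection exposed with a crs dimension (hypothetical id).
cube = connection.load_collection(
    "S2_MULTI_CRS",
    spatial_extent={"west": 10.5, "south": 46.0, "east": 12.5, "north": 47.0},
    temporal_extent=["2019-06-01", "2019-06-30"],
)

# Warping everything to EPSG:3035 at 10 m would, under this reading, be the
# step that removes the crs dimension, if the resample processes said so.
single_crs_cube = cube.resample_spatial(resolution=10, projection=3035)
```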

@m-mohr (Member) commented May 18, 2021

Moving to Open-EO/openeo-processes#251

@m-mohr m-mohr closed this as completed May 18, 2021