CRS as dimension in data cubes #9
This issue has a long history, see issues Open-EO/openeo-api#4, Open-EO/openeo-api#28, Open-EO/openeo-api#29 and Open-EO/openeo-api#89. (edit: fixed links to issues) Currently, load_collection already covers filter_bbox, filter_temporal, filter_bands and filter. With a CRS option, it would also cover resample_spatial, adding 4 more parameters. So potential solutions are:
I must admit I don't like any of the options yet. Ideas? Thoughts? |
I believe the spatial (and temporal) resolution needs to be defined before actually loading a collection. Even a single collection such as Sentinel-2 can be stored in multiple projection (UTM) zones. Selecting a bounding box on the border between different UTM zones is ambiguous unless a target projection is defined. I would be in favor of solution 1 (but I think we should also add the parameters for temporal resolution). |
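To make the UTM ambiguity concrete, here is a minimal sketch that lists the UTM zones a bounding box touches. The function names are hypothetical, and it assumes the standard 6-degree zone layout (ignoring the Norway/Svalbard exceptions), a box that does not cross the antimeridian, and a box that stays in one hemisphere.

```python
import math


def utm_zone(lon_deg: float) -> int:
    """Standard 6-degree UTM zone number (1..60) for a longitude in degrees."""
    return int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1


def native_crs_candidates(west: float, south: float, east: float, north: float) -> list:
    """EPSG codes of all WGS 84 / UTM zones touched by a lon/lat bounding box.

    Assumption: west <= east (no antimeridian crossing) and the box lies in
    one hemisphere, decided here by its centre latitude.
    """
    base = 32600 if (south + north) / 2.0 >= 0 else 32700  # 326xx north, 327xx south
    return [base + zone for zone in range(utm_zone(west), utm_zone(east) + 1)]


# A box straddling the 12°E zone boundary touches two native projections.
print(native_crs_candidates(11.5, 46.0, 12.5, 47.0))
```

With more than one code in the result, the data cube's CRS cannot be inferred from the data alone, which is the ambiguity described above.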
We need to decide on the call on Thursday whether a process with 10+ parameters is desirable and what alternatives we have. I just covered spatial above, but temporal is indeed also an issue. Also, does this also apply to a process such as load_results or load_user_data (see Open-EO/openeo-processes#83)? |
I am not in favor of a 10+ parameters process in general. Resampling/reprojecting "early" may not always be the best option, since one may end up doing that operation for several hundreds or maybe thousands of timestamps (rasters). Instead, the reprojection/downsampling can potentially be done at the end; it should be the user deciding when/if to do that. And I think the user now has all the capabilities to work with the projection and spatial/temporal sampling that they want. If the user needs to have data in projection EPSG:xxxx, they simply use resample_spatial and/or resample_temporal just after load_collection and before applying any other process. |
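As a rough sketch of that workflow, the process graph below chains resample_spatial directly after load_collection, written as a plain Python dict and serialised to JSON. The collection id, extents, target EPSG code, resolution and method value are made-up illustration values, and the node names "load" and "resample" are arbitrary.

```python
import json

# Hypothetical values throughout; only the process and parameter names
# (load_collection, resample_spatial, projection, resolution, method)
# come from the discussion.
process_graph = {
    "load": {
        "process_id": "load_collection",
        "arguments": {
            "id": "SENTINEL2_L1C",  # assumed collection id
            "spatial_extent": {"west": 11.5, "south": 46.0, "east": 12.5, "north": 47.0},
            "temporal_extent": ["2019-06-01", "2019-07-01"],
        },
    },
    "resample": {
        "process_id": "resample_spatial",
        "arguments": {
            "data": {"from_node": "load"},  # consume the cube produced by "load"
            "projection": 32632,            # assumed target CRS (EPSG code)
            "resolution": 10,               # in units of the target CRS
            "method": "bilinear",           # assumed resampling method
        },
        "result": True,
    },
}

print(json.dumps(process_graph, indent=2))
```

Any further processes would take `{"from_node": "resample"}` as their input, so everything downstream works in the chosen projection.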
If load_collection results in a (non-virtual) data cube, some interpolation/projection has been involved already. A subsequent resample_spatial or temporal implies another resampling step. |
In my opinion, collections should just have a single projection. I'm not sure whether we can define a construct where somebody can, prior to load_collection, define something like a target grid and hand that over to load_collection. |
A construct prior to load_collection that defines the target grid would be an option indeed. The same could apply for the target resolution in both time and space. |
What's the difference between such a construct and additional parameters in load_collection? I think it would make things even more complex than the additional parameters... I'm thinking it may even be required to separate scientific users from "I don't care about projection" non-remote-sensing users and cater to them with different collections. There could be the "EURAC" approach with several collections based on projection, resolution, etc. and then a separate "ready-to-use" collection as you usually get them in GEE, for example. In this way it's up to the back-end to provide the best solution for their target audience and not really an API/processes issue any longer. |
I think I agree with @lforesta that we can already do a lot with the current resample process. What we do additionally in our backend, is trying to 'push-down' certain parameters from higher-level processes, or deriving them from the requested type of output. |
We can't enforce this in any way, a back-end will store data with the projection(s) they prefer for their purpose. In our case, for S2 L1C data, we don't reproject TBs of data to a common projection, but we do for some higher level data. |
Well, you could enforce it by splitting into as many collections as there are projections. That's what EURAC is doing currently. That's probably not very user-friendly for people who are coming from GEE and are used to fully reprojected datasets with a single projection, but scientific users are probably happy about having full control. The underlying issue is: if you have, for example, the full Sentinel-2 archive with its dozens of UTM projections and you load that into a data cube that is expected to have a single projection, how do you load that data cube? Currently, for openEO it's either the "EURAC" or the "GEE" way. The question is whether there's a better alternative, e.g. the 10+ parameter load_collection process or ...? |
Dealing with the UTM zones is implementation specific imo, for now a lot of backends indeed rely on reprojecting to an intermediate projection, but nothing is stopping them from doing something more advanced. |
From the documentation: Loads a collection from the current back-end by its id and returns it as processable data cube. I understand this like this:
|
@edzer All true, but you did not cover the case we are discussing here: What happens if the image collection consists of images registered in different CRSs and you load it into a single-CRS data cube? That is somewhat undefined at the moment. We already had this on our agenda several times, but at some point it got lost, see Open-EO/openeo-api#4, Open-EO/openeo-api#28, Open-EO/openeo-api#29 and Open-EO/openeo-api#89 |
Since a cube needs to have a single CRS, I am in favor of letting the user decide the CRS when (or before by setting some variable) creating a cube. |
When we had the meeting with guys from OGC in London, for the API Hackathon, this issue was also discussed. You could also define a collection of collections, covering the case where different sub-collections have different projections and tackle this using the virtual cube design discussed by @edzer and @kempenep, having a default mechanism of the backend describing this with a common global CRS like EPSG:3857 or EPSG:4326. How this virtual cube is created is of course up to the backend, and should allow for innovative solutions as proposed by @jdries . |
@aljacob How is that different from what we have now? It seems to be the same, except that we hide sub-collections and don't have a way to specify the "global" CRS (which is what we are discussing here). |
Which CRS: what is wrong with taking the CRS of the spatial_extent? "How this virtual cube is created is of course up to the backend": only to the extent that when you download the result, you'd like to have an unambiguous (reproducible) result. |
@edzer I support this idea to take the CRS of the spatial_extent as the CRS for the cube returned. This allows the user to define the CRS and avoids adding a new parameter. |
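A minimal sketch of that idea, assuming the spatial_extent passed to load_collection is a dict with an optional `crs` key holding an EPSG code, and that a missing key means EPSG:4326. The function name is hypothetical and this is not an existing openEO API.

```python
def cube_crs_from_extent(spatial_extent: dict, default_epsg: int = 4326) -> int:
    """Pick the data cube CRS from the CRS of the requested spatial extent.

    Assumption: `crs` is an optional EPSG code on the extent, and an extent
    without it is interpreted as EPSG:4326.
    """
    return spatial_extent.get("crs", default_epsg)


# Extent given in UTM zone 32N: the cube would be returned in EPSG:32632.
utm_extent = {"west": 650000, "south": 5100000, "east": 700000, "north": 5150000, "crs": 32632}
print(cube_crs_from_extent(utm_extent))

# Extent given without a CRS: the cube would fall back to EPSG:4326.
lonlat_extent = {"west": 11.5, "south": 46.0, "east": 12.5, "north": 47.0}
print(cube_crs_from_extent(lonlat_extent))
```

Under this rule the extent and the cube CRS are coupled: a user who wants to give the extent in lon/lat would also get a lon/lat cube.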
The spatial extent is always given in EPSG:4326 aka wgs84 according to stac. The collection-id endpoint however has as one field also the native projection of the data cube (eo:epsg). So @edzer using the spatial_extent as it is implemented now would limit us to work in that one projection, which might not always be desirable. |
This is a misunderstanding. As far as I understood @edzer, he meant the spatial extent specified in load_collection. I guess going the route via the spatial_extent CRS could be an option, although it has two drawbacks:
Therefore it would be better to specify the CRS separate from the spatial_extent, I think. In general, there are three other options in resample_spatial (resolution, method, align). Do we need any of them in load_collection? I guess we don't need to think about resolution and then method (and align?) are not needed, right?
Side note: This changes in STAC 0.9 and you can specify multiple projections per collection.
The data cube always has one projection. If you need multiple projections in a workflow, create different data cubes.
I would think we don't need any dedicated temporal resampling in load_collection?!
I don't think we should expose this to the user. If there's really a conflict then it's up to the back-end to decide. At the moment we don't even expose the internal data type to a user anyway. |
Okay, I created PR Open-EO/openeo-processes#102 so that we have an actual proposal to discuss and improve. |
Telco discussion -> no conclusion reached. Main question to answer is, do we allow an openEO datacube to have multiple CRS associated with it? Currently, this is implicitly the case. Currently, the user can force a specific projection by using resample_spatial (e.g. just after load_collection). But this is not good enough for some back-ends which may load data to memory directly with the load_collection process (hence the need for CRS to be a parameter of load_collection). |
Currently it is not expected to have multiple CRS per data cube (although I'm not sure this is documented somewhere), which leads to undefined behavior if we don't implement Open-EO/openeo-processes#102 or an alternative approach. |
The point made during the telco is that in a lot of cases the user doesn't really care about the CRS, even if it is multi-crs. |
The crs field is not required for single-CRS collections. If the user doesn't care, they don't specify it. For multi-CRS collections, however, omitting it should throw an error, as (at the moment) the data cube model in openEO is based on a single CRS to make things comparable and reproducible. Just letting the back-end choose the best option is by no means comparable or reproducible. |
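The rule described here could be sketched roughly as follows. The function and exception names are hypothetical and only illustrate the proposed behaviour; they are not part of any openEO back-end.

```python
from typing import Iterable, Optional


class AmbiguousCrsError(ValueError):
    """Raised when a multi-CRS collection is loaded without an explicit CRS."""


def resolve_cube_crs(native_epsg_codes: Iterable[int],
                     requested_epsg: Optional[int] = None) -> int:
    """Decide the single CRS of the data cube returned by load_collection.

    - explicit request: always honoured
    - single-CRS collection, no request: use the native CRS
    - multi-CRS collection, no request: error, so results stay reproducible
    """
    codes = set(native_epsg_codes)
    if not codes:
        raise ValueError("collection declares no native CRS")
    if requested_epsg is not None:
        return requested_epsg
    if len(codes) == 1:
        return next(iter(codes))
    raise AmbiguousCrsError(
        f"collection spans {sorted(codes)}; a target crs must be specified"
    )


print(resolve_cube_crs([32632]))               # single CRS, nothing requested
print(resolve_cube_crs([32632, 32633], 3035))  # multi-CRS, explicit target
```

Calling `resolve_cube_crs([32632, 32633])` without a target raises `AmbiguousCrsError` instead of letting the back-end pick a projection silently.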
Users that:
|
To come to a conclusion on this one, I would say that at least more back-end implementation work and experiments are needed, especially in the area of multi-CRS collections. On the VITO side, this is planned for the next months. I don't think this issue is blocking anything, so I propose to put it on hold until there are new arguments pro or con. |
We discussed this a little more here and in principle we may also need a target resolution parameter in addition to the target CRS. |
Telco: PR Open-EO/openeo-processes#102 has been closed in favor of a new approach. ToDos:
|
Yes, I think the problem and solution are clearly explained. More details on the actual impact on the implementation might be added when we get to that point.
|
Regarding the last part:
I'm thinking 1) how this translates into a process graph and 2) whether this implies any changes to the processes then? It probably needs clarifications in the spatial resample processes. There it doesn't talk about (removal of) dimensions at all. |
Moving to Open-EO/openeo-processes#251 |
@m-mohr I think we already had several discussions on coordinate reference system (crs).
In my opinion, there should be a crs parameter in load_collection, not only for selecting the region of interest as we have now, but for defining the target reference system of the resulting cube. Once input images with different CRSs are selected and need to be combined, we need to be able to reproject them into a single cube. I think it is also best to define the CRS as early as possible in the workflow, which is in the load_collection process, I guess.