
CRS for load_collection #102

Closed
wants to merge 2 commits into from

Conversation

m-mohr
Member

@m-mohr m-mohr commented Dec 11, 2019

Here's a proposal for #98 to base discussions upon. Please comment and/or up/down-vote.

@m-mohr m-mohr added the help wanted Extra attention is needed label Dec 11, 2019
@m-mohr m-mohr added this to the v1.0 milestone Dec 11, 2019
@jdries
Contributor

jdries commented Dec 11, 2019

Not a big fan of this proposal; I prefer that users and process graphs do not enforce a CRS unless the use case strictly demands it.
There is also not really a concept of a 'multi-crs' collection, in the sense that it is not exposed in the metadata. So users might be tempted to start defining CRSs everywhere 'just in case', at the expense of performance and cost.

@m-mohr
Member Author

m-mohr commented Dec 11, 2019

There is also not really a concept of a 'multi-crs' collection, in the sense that it is not exposed in the metadata.

That's not correct, the API supports it since version 0.4.

@lforesta
Contributor

I agree with @jdries that performance may be badly affected. Many users will be tempted to specify the CRS without a proper understanding of the implications (whereas if they explicitly use the resample_spatial process at any point in the process graph, they are conscious of their choice).

If we go for it, the CRS should be null by default (as in the current PR). We should also consider that EO includes lots of data still in sensor geometry (e.g. S1 and S3 L1 data), and simply specifying a CRS at load_collection will not suffice in that case (relatively few users may need to work with this level of data, but many scientists do research on the L1-to-L2 processing step).

@m-mohr
Member Author

m-mohr commented Dec 16, 2019

I agree with @jdries that performance may be badly affected. Many users will be tempted to specify the CRS without a proper understanding of the implications (whereas if they explicitly use the resample_spatial process at any point in the process graph, they are conscious of their choice).

We can simply add a note to the description of the parameter so a user is aware of the issue.

@m-mohr m-mohr mentioned this pull request Dec 17, 2019
@edzer
Member

edzer commented Dec 20, 2019

@jdries and I had a long discussion on this issue in the train on Wednesday, and may have come to a possible resolution. The problem is that load_collection() shall return a data cube, that a data cube with x and y dimensions shall have an unambiguous mapping from index to coordinates, and that this is not the case when tiles in a collection span several UTM zones. Pragmatic solutions to this (picking the CRS at the center of the extent, or picking the single CRS when the extent only spans one) may sometimes work, but not always. Of course, for downloading a GeoTIFF or viewing a WMTS a CRS needs to be set, but the use case where someone wants to compute (and save) NDVI (or something much more expensive) in the native pixel/tile space spanning several CRSs is valid, and should not be cut off by requiring a single CRS to be set (and worked on) here.

The way out I propose here is to make CRS a dimension.

This way, x and y map into the same coordinates for different CRSs, and the CRS slice you're in determines how they map to world coordinates. You can then do everything on spectral and time dimensions, and simply loop over x, y and CRS with no need for resampling.

Certain spatial things that only work on the local (relative) context can also be done, think of a low-pass filter. Other spatial things, like aggregate_spatial with a set of polygons, need an unfolding of the CRS dimension into a mosaic with a single CRS before they make sense, or be done in an apply over the CRS dimension (in which case polygons crossing UTM zones will not result in a single result).

A similar idea of having CRS as a dimension would be having tile ID as a dimension: x and y indexes are the row/cols and then map into coordinates relative to the tile origin, with similar consequences.

@mkadunc what do you think?
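The crs-as-a-dimension idea can be illustrated outside openEO with a small sketch. This is not openEO client code; the array layout, CRS labels, and affine parameters below are hypothetical, assuming a cube whose first dimension enumerates two UTM zones:

```python
import numpy as np

# Hypothetical cube with dimensions (crs, time, y, x): both UTM zones share
# the same (row, col) index space; each CRS slice carries its own mapping
# from index to world coordinates.
crs_labels = ["EPSG:32631", "EPSG:32632"]
cube = np.empty((2, 3, 4, 4))  # (crs, t, y, x)
cube[0] = 0.5                  # values observed in UTM zone 31N
cube[1] = 0.7                  # values observed in UTM zone 32N

# Per-CRS affine parameters: (x origin, y origin, pixel size in metres)
transforms = {
    "EPSG:32631": (600000.0, 5700000.0, 10.0),
    "EPSG:32632": (300000.0, 5700000.0, 10.0),
}

def index_to_world(crs, row, col):
    """Map a (row, col) index to world coordinates within one CRS slice."""
    x0, y0, res = transforms[crs]
    return (x0 + col * res, y0 - row * res)

# Operations on non-spatial dimensions need no resampling at all; they just
# loop over the crs slices, e.g. a temporal mean:
temporal_mean = cube.mean(axis=1)  # shape (crs, y, x) -> (2, 4, 4)
```

Only operations that must combine cells across CRS slices (such as aggregate_spatial with polygons crossing a zone boundary) would require unfolding the crs dimension into a single-CRS mosaic first.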

@m-mohr
Member Author

m-mohr commented Dec 20, 2019

Hm, interesting idea. We need to think about how that would work with our processes. What would we need to change in resample_spatial / resample_spatial_cube, and how would, for example, a reduce over the crs dimension work?

@edzer
Member

edzer commented Dec 20, 2019

I'm planning to write a bit of a longer piece about this, looking into the topics you mention.

@mkadunc
Member

mkadunc commented Dec 20, 2019

Having CRS as a dimension makes a lot of sense, especially if it is treated as one of the spatial dimensions.

One concern I have is that adding a crs dimension might unnecessarily burden the user with this extra "singleton" dimension - just the fact that it's there is not a problem, but it can become one if you need to take it into account at almost every node of the process graph, even when the operations are not spatial. Maybe we should do some experiments with extra singleton dimensions and see how much they impact working with data cubes - if there's no impact, we're OK; if it's significant, we could find ways to implicitly take such extra dimensions into account without burdening the user.


Side note: There is another (a bit more involved) option, where CRS would not necessarily be a dimension - if we extended the concept of axis labels into a more general concept of "cell-attributes", then we could just require that every spatial cell has a non-null crs attribute: the value of this could come from:

  • the root of the cube (when the whole cube is a single CRS)
  • the separate CRS axis (when there are different CRSs and not aligned with other axes)
  • some other axis (when CRS can be derived from some other axis, as in the tileId-axis example)

In NetCDF, one would solve this with an extra variable (whose value is defined just once as a scalar or as N-D array along a subset of axes), but since we decided to only hold a single value in a cube, we can't use the same mechanism.

The cell-attributes concept has additional uses, e.g.:

  • absoluteOrbitId can be used as a dimension, and its labels can provide the well-known time attribute so it can function as a temporal axis
  • productId as a dimension could, in addition to time and crs, provide per-scene info about the platform (S2A / S2B), processing version and other scene-related metadata (sun angles, viewing angles etc.)

These are maybe all things that could be handled with extra dimensions, but I think it is more intuitive to think about dimensions, especially in the data-cube world, as "the minimal independent set of cell attributes that unambiguously address a single value" (in the relational world this would be called a "candidate key"). Adding extra dimensions just to provide some metadata to the cells feels wrong.
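The cell-attribute idea can be sketched in a few lines. This is purely an illustrative Python sketch, not an openEO API; the productId labels and attribute names are invented:

```python
# Hypothetical productId axis whose labels carry derived cell attributes
# (time, crs, platform), so crs need not be an independent dimension.
product_axis = {
    "S2A_31TCJ_20191211": {"time": "2019-12-11", "crs": "EPSG:32631", "platform": "S2A"},
    "S2B_32TLQ_20191212": {"time": "2019-12-12", "crs": "EPSG:32632", "platform": "S2B"},
}

def cell_attribute(label, name):
    """Resolve a cell attribute from the axis label it is attached to."""
    return product_axis[label][name]

# A single productId dimension then simultaneously provides a temporal axis
# and an unambiguous CRS per cell:
crs_of_first = cell_attribute("S2A_31TCJ_20191211", "crs")
```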

@m-mohr
Member Author

m-mohr commented Dec 20, 2019

We (@flahn and I) went a bit crazy discussing this proposal and have some more thoughts:

  1. The proposal with the crs parameter in load_collection is the simplest from the user perspective. This would be very similar to what GEE offers. There are obviously quite a few users who want this simplicity.
  2. The proposal with crs as a dimension is not easy to understand, I think. A GEE user will probably not want to think about how to reduce a crs dimension and will just stick with the simpler GEE approach. Scientists who strictly want to work on the original data/CRS will probably like this approach.

So could we combine these two approaches? Could we say that a user only gets a crs dimension if the collection has multiple CRSs and the crs parameter in load_collection is not set?
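The decision rule behind such a combination could look roughly like the following. This is a hypothetical sketch; the function and parameter names are invented and not part of the openEO specification:

```python
# Hypothetical rule: a crs dimension appears only when the requested data
# spans multiple CRSs AND the user did not pass a target crs to
# load_collection.
def needs_crs_dimension(collection_crss, user_crs=None):
    """Return True if the resulting data cube should get a crs dimension."""
    return user_crs is None and len(set(collection_crss)) > 1

needs_crs_dimension(["EPSG:32631", "EPSG:32632"])        # multi-CRS, no target: True
needs_crs_dimension(["EPSG:32631", "EPSG:32632"], 3857)  # target CRS given: False
needs_crs_dimension(["EPSG:32631"])                      # single CRS anyway: False
```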

The question really is: Who is the target audience? Whom do we cater to? Which audience would lead to the most uptake?


Another thought was to allow something like this for advanced users who want to work in the original CRS (in JS-like pseudo-code):

var epsgCodesArray = get_metadata(field = 'eo:epsg') // Get all EPSG codes from the metadata
var dataCubesArray = epsgCodesArray.map(epsgCode => {
  var dataCube = load_collection( // Load all data that is stored with this EPSG code
    id = "Sentinel-2",
    temporal_extent = null,
    spatial_extent = null,
    properties = eq('eo:epsg', epsgCode)
  )
  // further process the data cube here
  return dataCube
})
var mergedDataCube = dataCubesArray.reduce(process = resample_spatial_cube, binary = true)
save_result(mergedDataCube)

That's mostly possible already. It basically returns one data cube per CRS defined for the collection, each containing only the data available for that specific CRS. Afterwards, the cubes can be processed individually and merged using a binary reducer if desired.

@@ -268,13 +303,19 @@
}
}
}
}
},
"crs": 3857
Member

I wouldn't put the crs parameter in the example, as it would suggest that this is something that should always be used - users should rather be encouraged to leave the CRS choice to the backend.

"exceptions": {
"TargetCrsMissing": {
"message": "The data requested from the collection has multiple CRS assigned and thus a valid target CRS must be specified."
}
Member

Even when the data has multiple CRSs assigned, the backend can probably do a better job of selecting the most appropriate option than the user. Forcing the backend to fail when a CRS is not provided is IMO counter-productive.

@m-mohr m-mohr added the waiting label Jan 6, 2020
@m-mohr
Member Author

m-mohr commented Jan 13, 2020

Closing, we are going the route described in #98.

@m-mohr m-mohr closed this Jan 13, 2020
@m-mohr m-mohr deleted the issue-98 branch January 20, 2020 15:09