
How to compute the average over spatial dimensions? #226

Closed
clausmichele opened this issue Mar 31, 2021 · 29 comments · Fixed by #261 or #259

Labels: help wanted, must-have, platform

Comments

@clausmichele (Member) commented Mar 31, 2021

I would like to compute the average over the data I'm loading from a small bbox, to reduce noise and get a smoother signal. This is what I did in Python in User Story 2, using the xarray method .mean(['x','y'], skipna=True).
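
For reference, a minimal runnable xarray sketch of that reduction (the dataset is a random stand-in with the same dimension names as the cubes shown further below):

    import numpy as np
    import xarray as xr

    # stand-in for the loaded cube: dims (t, y, x)
    ds = xr.Dataset({"VV_gamma0": (("t", "y", "x"), np.random.rand(524, 5, 4))})
    mean_ts = ds.mean(["x", "y"], skipna=True)  # spatial mean per timestep, dims (t,)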

How can we perform this task with openEO processes? I considered various options:

  1. reduce_dimension over x and then over y? Is this supported by the back-ends?
  2. aggregate_spatial without a polygon, but over the whole area?
  3. resample_spatial where the target is a single pixel covering the input area?
@jdries (Contributor) commented Apr 1, 2021

Did you already try something like the example below? In principle, if you use aggregate_spatial, you can leave out other spatial filtering, so you only need to specify your small bbox once; that is most easily done with shapely's box function.

from shapely.geometry import box

# `c` is an authenticated openeo.Connection to the back-end
cube = c.load_collection("TERRASCOPE_S2_NDVI_V2", bands=["NDVI_10M"]) \
    .filter_temporal(["2019-05-24T00:00:00Z", "2019-05-30T00:00:00Z"])
cube.polygonal_mean_timeseries(box(minx=5.027, maxx=5.0438, miny=51.1974, maxy=51.2213)).execute()

@clausmichele (Member, Author)

Thanks @jdries. I've just tried it; the result as netCDF follows. I'm showing the output both without and with polygonal_mean_timeseries applied:
Original result:

<xarray.Dataset>
Dimensions:      (t: 524, x: 4, y: 5)
Coordinates:
    spatial_ref  int64 ...
  * t            (t) datetime64[ns] 2017-01-05 2017-01-06 ... 2019-12-28
  * x            (x) float64 6.855e+05 6.855e+05 6.855e+05 6.855e+05
  * y            (y) float64 5.135e+06 5.135e+06 5.135e+06 5.135e+06 5.135e+06
Data variables:
    VH_gamma0    (t, y, x) float32 ...
    VV_gamma0    (t, y, x) float32 ...
Attributes:
    nodata:        0.0
    crs:           +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs 
    grid_mapping:  spatial_ref

Using aggregate_spatial (polygonal_mean_timeseries):

<xarray.Dataset>
Dimensions:        (feature: 1, time: 524)
Coordinates:
  * time           (time) datetime64[ns] 2017-01-05 2017-01-06 ... 2019-12-28
    lat            (feature) float64 ...
    lon            (feature) float64 ...
    feature_names  (feature) object ...
Dimensions without coordinates: feature
Data variables:
    band_0         (feature, time) float64 ...
    band_1         (feature, time) float64 ...
Attributes:
    Conventions:  CF-1.8
    source:       Aggregated timeseries generated by openEO GeoPySpark backend.

The differences I see are:

  • Band names are lost
  • Different time coordinate name (time instead of t)
  • Different projection (lat/lon returned instead of the original projected UTM)
  • Different data type (float64 instead of float32)

@jdries (Contributor) commented Apr 1, 2021

I usually use the JSON output format after aggregating the spatial dimension, but we can indeed improve the netCDF. Especially the band naming and the time coordinate need to be fixed. Would you use this solution if we make these improvements?

@clausmichele (Member, Author)

Yes, it would fit what I am looking for without the need to create or modify other processes!

@m-mohr (Member) commented Apr 1, 2021

What is polygonal_mean_timeseries using internally? That is not a standard process...

@jdries (Contributor) commented Apr 1, 2021

This seems to be a recurring question: it is aggregate_spatial (https://processes.openeo.org/#aggregate_spatial). The Python client has a few convenience methods that simplify certain use cases which would otherwise require more complex code.

@m-mohr (Member) commented Apr 1, 2021

So the spatial extent for load_collection is taken from the geometry, either the geometry itself or computed from the box?

@jdries (Contributor) commented Apr 1, 2021

It's the behaviour we discussed in #101: a geometry is passed to aggregate_spatial, and the back-end determines which data to load based on that geometry, so there is no need to specify it in load_collection or some other filter.
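
A minimal sketch of the same pattern with the standard aggregate_spatial process (the back-end URL and authentication call are illustrative; polygonal_mean_timeseries is a convenience wrapper around this):

    import openeo
    from shapely.geometry import box

    c = openeo.connect("https://openeo.vito.be").authenticate_oidc()
    # no spatial_extent in load_collection: the geometry below constrains loading
    cube = c.load_collection("TERRASCOPE_S2_NDVI_V2", bands=["NDVI_10M"])
    ts = cube.aggregate_spatial(
        geometries=box(minx=5.027, miny=51.1974, maxx=5.0438, maxy=51.2213),
        reducer="mean",
    )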

@m-mohr (Member) commented Apr 1, 2021

Oh, yes, I did not see the parameter being set in the function. Of course, in back-ends without these optimizations, it might lead to loading data for the whole world. So basically, this is not really a general solution for the issue, but it works at VITO. In other cases you may need to specify the geometries twice (which also shouldn't be a major issue?)

@m-mohr (Member) commented Apr 1, 2021

A question that is still open is what could be done to make spatial dimensions better usable in functions like reduce. Setting the dimension to either x or y alone is usually not very meaningful, and two reduces after each other don't work for all processes (median, for example; see the sketch below).
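
A small numpy illustration (made-up values) of why reducing one spatial axis and then the other is not equivalent to reducing both at once for the median:

    import numpy as np

    a = np.array([[1, 2, 3],
                  [1, 2, 3],
                  [3, 3, 3]])

    print(np.median(a))                     # 3.0: median over all pixels at once
    print(np.median(np.median(a, axis=1)))  # 2.0: median of per-row medians differs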

@clausmichele (Member, Author) commented Apr 1, 2021

> Oh, yes, I did not see the parameter being set in the function. Of course, in back-ends without these optimizations, it might lead to loading data for the whole world. So basically, this is not really a general solution for the issue, but it works at VITO. In other cases you may need to specify the geometries twice (which also shouldn't be a major issue?)

I also had the same feeling about loading too much data if you don't specify the bbox in load_collection; this is back-end specific.
Anyway, specifying the bbox twice is not an issue if you use the Python client, but with the Web Editor it could be, since you would need to manually select the same area or copy the coordinates into the JSON process graph, which is not user friendly.

EDIT: the API states that aggregate_spatial should return 3 dimensions, but in the proposed case it's returning 4: two spatial, time and bands. I'm actually fine with having the bands as well, but maybe it could be confusing?

https://processes.openeo.org/#aggregate_spatial

> The data cube must have been reduced to only contain two spatial dimensions and a third dimension the values are aggregated for, for example the temporal dimension to get a time series. Otherwise this process fails with the TooManyDimensions error.
>
> The number of total and valid pixels is returned together with the calculated values.

@jdries (Contributor) commented Apr 1, 2021

But didn't we specify in #101 that back-ends have to support this?
We even have the code for it in open source if needed. If we really want to claim that openEO makes things simpler, then it doesn't help that users have to specify the same argument in multiple locations.

@m-mohr (Member) commented Apr 1, 2021

No, we did not. We just specified that aggregate_spatial applies filter_spatial just before the execution of the process (highlights by me):

> The data cube implicitly gets restricted to the bounds of the geometries as if filter_spatial would have been used with the same values for the corresponding parameters immediately before this process.

I agree that things should be as simple as possible, but things can get tricky. If you run load_collection once, then filter twice for, let's say, Belgium and Germany, and then run aggregate_spatial for Germany only, you could run into issues with no data being loaded for Belgium.

I feel like, in theory, reduce_dimension should have something like a special value "spatial" for "dimension", which would solve the issue. There's also the issue about multiple dimensions (#126) with aggregate_spatial. aggregate_spatial was basically meant to run on multiple geometries; for a reduction over all data there's reduce_dimension.

@m-mohr (Member) commented Apr 1, 2021

> I also had the same feeling about loading too much data if you don't specify the bbox in load_collection; this is back-end specific.
> Anyway, specifying the bbox twice is not an issue if you use the Python client, but with the Web Editor it could be, since you would need to manually select the same area or copy the coordinates into the JSON process graph, which is not user friendly.

I guess I could implement re-use of geometries/bboxes in the Web Editor. For programmatic use (Python, JS, R) you can always just use variables. So it could also just be a client issue.

@m-mohr (Member) commented May 18, 2021

Getting back to this issue: would it make sense to allow setting the geometries to null in aggregate_spatial, and thus allow the process to aggregate over x and y at the same time without a spatial constraint such as a geometry? If that's too much of a stretch for the very geometry-focused process, we could also consider defining a reduce_spatial process that allows reducing over the two spatial dimensions x and y at the same time (or is a generic reduce_dimensions better?)

m-mohr self-assigned this May 18, 2021
m-mohr added the must-have and help wanted labels May 18, 2021
@soxofaan (Member)

> would it make sense to allow setting the geometries to null in aggregate_spatial

While that sounds feasible to do in the VITO back-end, I wonder whether it would create problems on the output side. Vector cubes are not fully defined yet, but I guess the null would become the identifier/key/field in the output object in some way, which will probably cause difficulties elsewhere.

@m-mohr (Member) commented May 19, 2021

Yes, indeed. I had not really taken the return value into account. The question is what you want to get back: if it's still a raster-cube, it has to be reduce_spatial (or a variant; the specifics are unclear); if it's a vector-cube, it's aggregate_spatial (or a variant). So we may even need both?

@soxofaan (Member)

I think both use cases (raster-cube or vector-cube output) are indeed valid.

@mkadunc (Member) commented May 19, 2021

My take on the subject:

  • I strongly object to using null to mean special things in aggregate_spatial - let's stick with geometries and treat null as an illegal argument; too many things tend to go wrong with nulls.

  • I'd be OK with a solution that uses a special "infinite" geometry or whole-Earth polygon for that use case.

  • We could use another optional (boolean) parameter in addition to the geometry, but then it may be better to define another process just for this separate use case (which wouldn't need to return a vector cube, and which could handle special things such as a CRS dimension).

  • If we use aggregate_spatial for this purpose, it should return a vector cube with a single label on the vector/spatial dimension (the infinite polygon, or similar) to preserve the general properties of the process, at least in terms of the input and output data types.

  • If we create a separate process for reducing all spatial dimensions, or if we use reduce_dimensions for this purpose, the return type should be a raster data cube (same as for other reduce_dimension calls).

  • I would like to have a reduce process that can handle more than one dimension at a time, e.g. reduce_dimensions, with the same behaviour as the current reduce_dimension but with more than one dimension to reduce. This would not only solve the problem of aggregating spatially, but also help in other situations, such as those that need to reduce time+bands to produce a single-pixel result for each location (e.g. cloudless mosaic, land-cover classification from time series, etc.).

  • In case we have a special process for "reduce all spatial dimensions" and the data cube has a CRS dimension, we could consider including that as one of the spatial dimensions (although we could also reduce just on x and y first and leave the user to do an additional CRS reduction or not); if the input is a vector cube, the reduction should probably happen over the vector spatial dimension.

@m-mohr (Member) commented May 19, 2021

FWIW, Google Earth Engine uses the null value for "unbounded" aggregation (or let's say bounded to the footprint of the actual data), see their reduceRegion function. So I don't see a big difference between null and for example false.

Bringing in the CRS is actually a good point and leads us back to thinking about how to handle that dimension anyway, see #251. For this process, you may want to actually require reducing the CRS dimension first, but there's no way to do that right now.

I think with all other points I basically agree, but we indeed need to think about what we want to support. I don't fully understand the use case for a reduce with multiple dimensions on time + bands, e.g. why a cloudless mosaic would require a reduce over two dimensions at the same time?

@mkadunc (Member) commented May 19, 2021

> FWIW, Google Earth Engine uses the null value for "unbounded" aggregation (or let's say bounded to the footprint of the actual data), see their reduceRegion function. So I don't see a big difference between null and for example false.

If null signals 'unbounded' aggregation, then the following failure mode is possible (sketched in Python; fetch_geom is a hypothetical database lookup):

   for parcel_id in parcel_ids:
       parcel_geom = fetch_geom(parcel_id)  # None if the ID is not found
       stats = aggregate_spatial(..., parcel_geom)

Now imagine that fetch_geom returns null when it doesn't find the ID in the database: you'll do a whole lot of processing only to get useless results, which might take a while to notice. With an additional parameter, this kind of unintentional runtime behaviour switching cannot happen.

> Bringing in the CRS is actually a good point and leads us back to thinking about how to handle that dimension anyway, see #251. For this process, you may want to actually require reducing the CRS dimension first, but there's no way to do that right now.

I think it might be better for performance reasons to reduce the CRS last (there's a chance you could avoid reprojecting rasters in that case), but as you say, it seems a bit early to consider all aspects of CRS-as-a-spatial-dimension right now.

> I think with all other points I basically agree, but we indeed need to think about what we want to support. I don't fully understand the use case for a reduce with multiple dimensions on time + bands, e.g. why a cloudless mosaic would require a reduce over two dimensions at the same time?

For the typical cloudless mosaic, if there isn't a separate cloudiness band, the reducer needs all bands to determine the cloudiness for a particular timestamp, and then uses that to weight the timestamps when producing the output (either selecting one, picking the median, or similar). See e.g. https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/cloudless_mosaic/

Of course the same could be represented with separate steps and some extra memory to hold the temporary data (reduce on bands first to produce per-timestamp cloudiness, then combine by matching indexes with a time reducer to compute the output data), but to me it feels more natural to think of it as a multi-dimensional reduction: 'take all bands for all timestamps at that location, do some crunching and some statistics on the values, and produce an output'. I may be biased, because reducing by time+bands to produce an output pixel is the most common thing that people do in Sentinel Hub.
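
A minimal numpy sketch of such a joint (time, bands) reduction; the cloudiness score is a made-up stand-in for a real cloud detector, and the data is random:

    import numpy as np

    rng = np.random.default_rng(0)
    cube = rng.random((5, 4, 3, 3))        # dims (t, bands, y, x)

    cloudiness = cube.mean(axis=1)         # (t, y, x): score uses all bands
    best_t = cloudiness.argmin(axis=0)     # (y, x): least cloudy timestamp per pixel
    # pick all bands from that timestamp for each pixel -> (bands, y, x)
    mosaic = np.take_along_axis(cube, best_t[None, None, :, :], axis=0)[0]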

@m-mohr (Member) commented May 20, 2021

> Now imagine that fetch_geom returns null when it doesn't find the ID in the database: you'll do a whole lot of processing only to get useless results, which might take a while to notice. With an additional parameter, this kind of unintentional runtime behaviour switching cannot happen.

I see, although openEO processes would throw an error; null is not used to indicate errors. So this would be purely a client issue, which could also be the case for false (depending on how functions "return" error states). Also, an additional parameter would not work here, as the geometries parameter is required: you'd still need to pass a geometry, just to disable it afterwards with the new parameter, which is also weird. So short-term (i.e. non-breaking), the only options are false/null (or some weird string) or defining a new process.

> Of course the same could be represented with separate steps and some extra memory to hold the temporary data (reduce on bands first to produce per-timestamp cloudiness, then combine by matching indexes with a time reducer to compute the output data), but to me it feels more natural to think of it as a multi-dimensional reduction: 'take all bands for all timestamps at that location, do some crunching and some statistics on the values, and produce an output'. I may be biased, because reducing by time+bands to produce an output pixel is the most common thing that people do in Sentinel Hub.

That's valuable input, but we need to think about how that could work with the openEO processes. For example, what would be passed to a "callback" in reduce_multiple_dimensions, and how could processes make use of it (keeping in mind that array handling in openEO is not as full-fledged as in normal programming languages)? I'll have to dig into this later...

@soxofaan (Member) commented May 20, 2021

I agree with the objection against a null geometry to trigger special-case handling. In the VITO back-end it would require quite a different code path compared to the normal geometry use case, and as such it would feel more natural to introduce a new process.
So reduce_spatial looks like a nice solution that fits neatly between reduce_dimension and aggregate_spatial.

Conceptually it's probably cleaner to define a general reduce_dimensions (plural) to handle multiple dimensions, but I think that is a lot harder for back-ends to implement, because in practice not all dimensions can be manipulated in the same way (depending on the underlying technology), and on top of that you have to handle it generically (e.g. one user wants to reduce [t, bands], another [bands, y]). It also adds a level of complexity for the end user and for the API specification.

So in that sense reduce_spatial is probably the easiest solution to cover the initial feature request. The spec would be a simplified version of reduce_dimension, and most back-ends will probably be able to implement it with low effort.

If there is a need for reduce_dimensions (plural), it can still be added later. But the question is whether there is enough need for it (given that reduce_spatial exists) and how eager back-ends are to implement it. The cloudless mosaic is an interesting use case, but it can already be done with existing processes, if I understood correctly.

To come back to the question about the "output" of the two approaches, reduce_spatial vs aggregate_spatial("infinite geometry"): would there actually be an essential difference in practice? The former would return a "raster cube" with dims t and bands; the latter would return a "vector cube" with the same dims plus maybe a dimension just to hold the "infinite geometry" label, which is not really that useful.

@m-mohr (Member) commented May 31, 2021

We seem to have concluded that a process reduce_spatial(raster-cube, process, context) -> raster-cube is the way to go. I'll make a proposal for that.
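
For illustration, a sketch of what this could look like from the Python client once the process lands; DataCube.reduce_spatial and the back-end URL are assumptions here, not existing client API at the time of writing:

    import openeo

    c = openeo.connect("https://openeo.vito.be").authenticate_oidc()
    cube = (
        c.load_collection("TERRASCOPE_S2_NDVI_V2", bands=["NDVI_10M"])
        .filter_bbox(west=5.027, east=5.0438, south=51.1974, north=51.2213)
        .filter_temporal(["2019-05-24", "2019-05-30"])
    )
    # reduce both spatial dimensions at once; the result keeps dims (t, bands)
    mean_ts = cube.reduce_spatial(reducer="mean")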

One thing that came to mind: for aggregate_spatial and reduce_dimension we have binary variants (e.g. reduce_dimension_binary). We'd need a binary variant for reduce_spatial as well, but on the other hand I'm wondering whether we should remove these binary variants for now. There's no implementation yet, and I've also not heard that anyone is actually planning to implement one of them. See #258 and #259.

@soxofaan (Member) commented Jun 1, 2021

> ... for aggregate_spatial and reduce_dimension we have binary variants ...

Indeed, in the VITO back-end we don't support the binary variants of these processes, and no user has ever asked for them as far as I know. I guess most use cases are covered by the classics (mean, median, min, max, ...), and for a bit more flexibility users can switch to UDFs.
I also think the _binary variants are quite hard to understand for regular users who are not familiar with functional programming concepts (higher-order functions and such).

I agree that we should consider moving aggregate_spatial_binary and reduce_dimension_binary out of the "official" listing and demoting them to experimental/proposal/other.

@m-mohr (Member) commented Jun 1, 2021

The binary variants are already in the experimental proposals, but I'm thinking about removing them completely.
The main use case is UDFs, but on the other hand, it could be that this is too fine-grained and not very performant for UDFs anyway.

@soxofaan (Member) commented Jun 1, 2021

> it could be that this is too fine-grained and not very performant for UDFs anyway.

Yes, I'm afraid that UDFs in a "binary" approach would have horrible performance on most back-end implementations.

m-mohr linked a pull request Jun 1, 2021 that will close this issue
m-mohr added a commit that referenced this issue Jun 4, 2021: reduce_spatial #226 and clarifications #260
@m-mohr (Member) commented Jun 4, 2021

@clausmichele A new process, reduce_spatial, has just been merged. Let's hope some back-ends will implement it soon.

m-mohr closed this as completed Jun 4, 2021
@clausmichele (Member, Author)

> @clausmichele A new process, reduce_spatial, has just been merged. Let's hope some back-ends will implement it soon.

I'll implement it in our back-end soon and then support EODC in having it there as well!
