
How to compute the average over spatial dimensions? #226

Closed
clausmichele opened this issue Mar 31, 2021 · 29 comments · Fixed by #261 or #259

Labels: help wanted, must-have, platform

Comments

@clausmichele (Member) commented Mar 31, 2021

I would like to compute the average over the data I'm loading from a small bbox, to reduce noise and get a smoother signal. This is what I did in Python in User Story 2, using the xarray method .mean(['x','y'], skipna=True).
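
For reference, a minimal runnable xarray sketch of that reduction (the dataset is a random stand-in with the same dimension names as the cubes shown further below):

    import numpy as np
    import xarray as xr

    # stand-in for the loaded cube: dims (t, y, x)
    ds = xr.Dataset({"VV_gamma0": (("t", "y", "x"), np.random.rand(524, 5, 4))})
    mean_ts = ds.mean(["x", "y"], skipna=True)  # spatial mean per timestep, dims (t,)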

How can we perform this task with openEO processes? I considered various options:

  1. reduce_dimension over x and then over y? Is this supported by the back-ends?
  2. aggregate_spatial without a polygon, but over the whole area?
  3. resample_spatial where the target is a single pixel covering the input area?
@jdries (Contributor) commented Apr 1, 2021

Did you already try something like the example below? In principle, if you use aggregate_spatial, you can leave out other spatial filtering, so you only need to specify your small bbox once; that is most easily done with shapely's box function.

from shapely.geometry import box

# `c` is an authenticated openeo.Connection to the back-end
cube = c.load_collection("TERRASCOPE_S2_NDVI_V2", bands=["NDVI_10M"]) \
    .filter_temporal(["2019-05-24T00:00:00Z", "2019-05-30T00:00:00Z"])
cube.polygonal_mean_timeseries(box(minx=5.027, maxx=5.0438, miny=51.1974, maxy=51.2213)).execute()

@clausmichele (Member, Author)

Thanks @jdries. I've just tried it; the result as netCDF follows. I'm showing the output both without and with polygonal_mean_timeseries applied:
Original result:

<xarray.Dataset>
Dimensions:      (t: 524, x: 4, y: 5)
Coordinates:
    spatial_ref  int64 ...
  * t            (t) datetime64[ns] 2017-01-05 2017-01-06 ... 2019-12-28
  * x            (x) float64 6.855e+05 6.855e+05 6.855e+05 6.855e+05
  * y            (y) float64 5.135e+06 5.135e+06 5.135e+06 5.135e+06 5.135e+06
Data variables:
    VH_gamma0    (t, y, x) float32 ...
    VV_gamma0    (t, y, x) float32 ...
Attributes:
    nodata:        0.0
    crs:           +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs 
    grid_mapping:  spatial_ref

Using aggregate_spatial (polygonal_mean_timeseries):

<xarray.Dataset>
Dimensions:        (feature: 1, time: 524)
Coordinates:
  * time           (time) datetime64[ns] 2017-01-05 2017-01-06 ... 2019-12-28
    lat            (feature) float64 ...
    lon            (feature) float64 ...
    feature_names  (feature) object ...
Dimensions without coordinates: feature
Data variables:
    band_0         (feature, time) float64 ...
    band_1         (feature, time) float64 ...
Attributes:
    Conventions:  CF-1.8
    source:       Aggregated timeseries generated by openEO GeoPySpark backend.

The differences I see are:

  • Band names are lost
  • Different time coordinate name (time instead of t)
  • Different projection (lat/lon returned instead of the original projected UTM)
  • Different data type (float64 instead of float32)

@jdries (Contributor) commented Apr 1, 2021

I usually use the JSON output format after aggregating the spatial dimension, but we can indeed improve the netCDF. Especially the band naming and the time coordinate need to be fixed. Would you use this solution if we make these improvements?

@clausmichele (Member, Author)

Yes, it would fit what I am looking for without the need to create or modify other processes!

@m-mohr (Member) commented Apr 1, 2021

What is polygonal_mean_timeseries using internally? That is not a standard process...

@jdries (Contributor) commented Apr 1, 2021

This seems to be a recurring question: it is aggregate_spatial (https://processes.openeo.org/#aggregate_spatial). The Python client has a few convenience methods that simplify certain use cases which would otherwise require more complex code.

@m-mohr (Member) commented Apr 1, 2021

So the spatial extent for load_collection is taken from the geometry, either the geometry itself or computed from the box?

@jdries (Contributor) commented Apr 1, 2021

It's the behaviour we discussed in #101: a geometry is passed to aggregate_spatial, and the back-end determines which data to load based on that geometry, so there is no need to specify it in load_collection or some other filter.
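
A minimal sketch of the same pattern with the standard aggregate_spatial process (the back-end URL and authentication call are illustrative; polygonal_mean_timeseries is a convenience wrapper around this):

    import openeo
    from shapely.geometry import box

    c = openeo.connect("https://openeo.vito.be").authenticate_oidc()
    # no spatial_extent in load_collection: the geometry below constrains loading
    cube = c.load_collection("TERRASCOPE_S2_NDVI_V2", bands=["NDVI_10M"])
    ts = cube.aggregate_spatial(
        geometries=box(minx=5.027, miny=51.1974, maxx=5.0438, maxy=51.2213),
        reducer="mean",
    )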

@m-mohr (Member) commented Apr 1, 2021

Oh, yes, I did not see the parameter being set in the function. Of course, in back-ends without these optimizations, it might lead to loading data for the whole world. So basically, this is not really a general solution for the issue, but it works at VITO. In other cases you may need to specify the geometries twice (which also shouldn't be a major issue?)

@m-mohr (Member) commented Apr 1, 2021

A question that is still open is what could be done to make spatial dimensions better usable in functions like reduce. Setting the dimension to either x or y alone is usually not very meaningful, and two reduces after each other don't work for all processes (median, for example; see the sketch below).
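
A small numpy illustration (made-up values) of why reducing one spatial axis and then the other is not equivalent to reducing both at once for the median:

    import numpy as np

    a = np.array([[1, 2, 3],
                  [1, 2, 3],
                  [3, 3, 3]])

    print(np.median(a))                     # 3.0: median over all pixels at once
    print(np.median(np.median(a, axis=1)))  # 2.0: median of per-row medians differs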

@clausmichele (Member, Author) commented Apr 1, 2021

> Oh, yes, I did not see the parameter being set in the function. Of course, in back-ends without these optimizations, it might lead to loading data for the whole world. So basically, this is not really a general solution for the issue, but it works at VITO. In other cases you may need to specify the geometries twice (which also shouldn't be a major issue?)

I also had the same feeling about loading too much data if you don't specify the bbox in load_collection; this is back-end specific.
Anyway, specifying the bbox twice is not an issue if you use the Python client, but with the Web Editor it could be, since you would need to manually select the same area or copy the coordinates into the JSON process graph, which is not user friendly.

EDIT: the API states that aggregate_spatial should return 3 dimensions, but in the proposed case it's returning 4: two spatial, time and bands. I'm actually fine with having the bands as well, but maybe it could be confusing?

https://processes.openeo.org/#aggregate_spatial

> The data cube must have been reduced to only contain two spatial dimensions and a third dimension the values are aggregated for, for example the temporal dimension to get a time series. Otherwise this process fails with the TooManyDimensions error.
>
> The number of total and valid pixels is returned together with the calculated values.

@jdries (Contributor) commented Apr 1, 2021

But didn't we specify in #101 that back-ends have to support this?
We even have the code for it in open source if needed. If we really want to claim that openEO makes things simpler, then it doesn't help that users have to specify the same argument in multiple locations.

@m-mohr (Member) commented Apr 1, 2021

No, we did not. We just specified that aggregate_spatial applies filter_spatial just before the execution of the process (highlights by me):

> The data cube implicitly gets restricted to the bounds of the geometries as if filter_spatial would have been used with the same values for the corresponding parameters immediately before this process.

I agree that things should be as simple as possible, but things can get tricky. If you run load_collection once, then filter twice for, let's say, Belgium and Germany, and then run aggregate_spatial for Germany only, you could run into issues with no data being loaded for Belgium.

I feel like, in theory, reduce_dimension should have something like a special value "spatial" for "dimension", which would solve the issue. There's also the issue about multiple dimensions (#126) with aggregate_spatial. aggregate_spatial was basically meant to run on multiple geometries; for a reduction over all data there's reduce_dimension.

@m-mohr (Member) commented Apr 1, 2021

> I also had the same feeling about loading too much data if you don't specify the bbox in load_collection; this is back-end specific.
> Anyway, specifying the bbox twice is not an issue if you use the Python client, but with the Web Editor it could be, since you would need to manually select the same area or copy the coordinates into the JSON process graph, which is not user friendly.

I guess I could implement re-use of geometries/bboxes in the Web Editor. For programmatic use (Python, JS, R) you can always just use variables. So it could also just be a client issue.

@m-mohr (Member) commented May 18, 2021

Getting back to this issue: would it make sense to allow setting the geometries to null in aggregate_spatial, and thus allow the process to aggregate over x and y at the same time without a spatial constraint such as a geometry? If that's too much of a stretch for the very geometry-focused process, we could also consider defining a reduce_spatial process that allows reducing over the two spatial dimensions x and y at the same time (or is a generic reduce_dimensions better?)

m-mohr self-assigned this May 18, 2021
m-mohr added the must-have and help wanted labels May 18, 2021
@soxofaan (Member)

> would it make sense to allow setting the geometries to null in aggregate_spatial

While that sounds feasible to do in the VITO back-end, I wonder whether it would create problems on the output side. Vector cubes are not fully defined yet, but I guess the null would become the identifier/key/field in the output object in some way, which will probably cause difficulties elsewhere.

@m-mohr (Member) commented May 19, 2021

Yes, indeed. I had not really taken the return value into account. The question is what you want to get back: if it's still a raster-cube, it has to be reduce_spatial (or a variant; the specifics are unclear); if it's a vector-cube, it's aggregate_spatial (or a variant). So we may even need both?

@soxofaan (Member)

I think both use cases (raster-cube or vector-cube output) are indeed valid.

@mkadunc (Member) commented May 19, 2021

My take on the subject:

  • I strongly object to using null to mean special things in aggregate_spatial - let's stick with geometries and treat null as an illegal argument; too many things tend to go wrong with nulls.

  • I'd be OK with a solution that uses a special "infinite" geometry or whole-Earth polygon for that use case.

  • We could use another optional (boolean) parameter in addition to the geometry, but then it may be better to define another process just for this separate use case (which wouldn't need to return a vector cube, and which could handle special things such as a CRS dimension).

  • If we use aggregate_spatial for this purpose, it should return a vector cube with a single label on the vector/spatial dimension (the infinite polygon, or similar) to preserve the general properties of the process, at least in terms of the input and output data types.

  • If we create a separate process for reducing all spatial dimensions, or if we use reduce_dimensions for this purpose, the return type should be a raster data cube (same as for other reduce_dimension calls).

  • I would like to have a reduce process that can handle more than one dimension at a time, e.g. reduce_dimensions, with the same behaviour as the current reduce_dimension but with more than one dimension to reduce. This would not only solve the problem of aggregating spatially, but also help in other situations, such as those that need to reduce time+bands to produce a single-pixel result for each location (e.g. cloudless mosaic, land-cover classification from time series, etc.).

  • In case we have a special process for "reduce all spatial dimensions" and the data cube has a CRS dimension, we could consider including that as one of the spatial dimensions (although we could also reduce just on x and y first and leave the user to do an additional CRS reduction or not); if the input is a vector cube, the reduction should probably happen over the vector spatial dimension.

@m-mohr (Member) commented May 19, 2021

FWIW, Google Earth Engine uses the null value for "unbounded" aggregation (or let's say bounded to the footprint of the actual data), see their reduceRegion function. So I don't see a big difference between null and for example false.

Bringing in the CRS is actually a good point and leads us back to thinking about how to handle that dimension anyway, see #251. For this process, you may want to actually require reducing the CRS dimension first, but there's no way to do that right now.

I think with all other points I basically agree, but we indeed need to think about what we want to support. I don't fully understand the use case for a reduce with multiple dimensions on time + bands, e.g. why a cloudless mosaic would require a reduce over two dimensions at the same time?

@mkadunc (Member) commented May 19, 2021

> FWIW, Google Earth Engine uses the null value for "unbounded" aggregation (or let's say bounded to the footprint of the actual data), see their reduceRegion function. So I don't see a big difference between null and for example false.

If null signals 'unbounded' aggregation, then the following failure mode is possible (sketched in Python; fetch_geom is a hypothetical database lookup):

   for parcel_id in parcel_ids:
       parcel_geom = fetch_geom(parcel_id)  # None if the ID is not found
       stats = aggregate_spatial(..., parcel_geom)

Now imagine that fetch_geom returns null when it doesn't find the ID in the database: you'll do a whole lot of processing only to get useless results, which might take a while to notice. With an additional parameter, this kind of unintentional runtime behaviour switching cannot happen.

> Bringing in the CRS is actually a good point and leads us back to thinking about how to handle that dimension anyway, see #251. For this process, you may want to actually require reducing the CRS dimension first, but there's no way to do that right now.

I think it might be better for performance reasons to reduce the CRS last (there's a chance you could avoid reprojecting rasters in that case), but as you say, it seems a bit early to consider all aspects of CRS-as-a-spatial-dimension right now.

> I think with all other points I basically agree, but we indeed need to think about what we want to support. I don't fully understand the use case for a reduce with multiple dimensions on time + bands, e.g. why a cloudless mosaic would require a reduce over two dimensions at the same time?

For the typical cloudless mosaic, if there isn't a separate cloudiness band, the reducer needs all bands to determine the cloudiness for a particular timestamp, and then uses that to weight the timestamps when producing the output (either selecting one, picking the median, or similar). See e.g. https://custom-scripts.sentinel-hub.com/custom-scripts/sentinel-2/cloudless_mosaic/

Of course the same could be represented with separate steps and some extra memory to hold the temporary data (reduce on bands first to produce per-timestamp cloudiness, then combine by matching indexes with a time reducer to compute the output data), but to me it feels more natural to think of it as a multi-dimensional reduction: 'take all bands for all timestamps at that location, do some crunching and some statistics on the values, and produce an output'. I may be biased, because reducing by time+bands to produce an output pixel is the most common thing that people do in Sentinel Hub.
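
A minimal numpy sketch of such a joint (time, bands) reduction; the cloudiness score is a made-up stand-in for a real cloud detector, and the data is random:

    import numpy as np

    rng = np.random.default_rng(0)
    cube = rng.random((5, 4, 3, 3))        # dims (t, bands, y, x)

    cloudiness = cube.mean(axis=1)         # (t, y, x): score uses all bands
    best_t = cloudiness.argmin(axis=0)     # (y, x): least cloudy timestamp per pixel
    # pick all bands from that timestamp for each pixel -> (bands, y, x)
    mosaic = np.take_along_axis(cube, best_t[None, None, :, :], axis=0)[0]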

@m-mohr (Member) commented May 20, 2021

> Now imagine that fetch_geom returns null when it doesn't find the ID in the database: you'll do a whole lot of processing only to get useless results, which might take a while to notice. With an additional parameter, this kind of unintentional runtime behaviour switching cannot happen.

I see, although openEO processes would throw an error; null is not used to indicate errors. So this would be purely a client issue, which could also be the case for false (depending on how functions "return" error states). Also, an additional parameter would not work here, as the geometries parameter is required: you'd still need to pass a geometry, just to disable it afterwards with the new parameter, which is also weird. So short-term (i.e. non-breaking), the only options are false/null (or some weird string) or defining a new process.

> Of course the same could be represented with separate steps and some extra memory to hold the temporary data (reduce on bands first to produce per-timestamp cloudiness, then combine by matching indexes with a time reducer to compute the output data), but to me it feels more natural to think of it as a multi-dimensional reduction: 'take all bands for all timestamps at that location, do some crunching and some statistics on the values, and produce an output'. I may be biased, because reducing by time+bands to produce an output pixel is the most common thing that people do in Sentinel Hub.

That's valuable input, but we need to think about how that could work with the openEO processes. For example, what would be passed to a "callback" in reduce_multiple_dimensions, and how could processes make use of it (keeping in mind that array handling in openEO is not as full-fledged as in normal programming languages)? I'll have to dig into this later...

@soxofaan (Member) commented May 20, 2021

I agree with the objection against a null geometry to trigger special-case handling. In the VITO back-end it would require quite a different code path compared to the normal geometry use case, and as such it would feel more natural to introduce a new process.
So reduce_spatial looks like a nice solution that fits neatly between reduce_dimension and aggregate_spatial.

Conceptually it's probably cleaner to define a general reduce_dimensions (plural) to handle multiple dimensions, but I think that is a lot harder for back-ends to implement, because in practice not all dimensions can be manipulated in the same way (depending on the underlying technology), and on top of that you have to handle it generically (e.g. one user wants to reduce [t, bands], another [bands, y]). It also adds a level of complexity for the end user and for the API specification.

So in that sense reduce_spatial is probably the easiest solution to cover the initial feature request. The spec would be a simplified version of reduce_dimension, and most back-ends will probably be able to implement it with low effort.

If there is a need for reduce_dimensions (plural), it can still be added later. But the question is whether there is enough need for it (given that reduce_spatial exists) and how eager back-ends are to implement it. The cloudless mosaic is an interesting use case, but it can already be done with existing processes, if I understood correctly.

To come back to the question about the "output" of the two approaches, reduce_spatial vs aggregate_spatial("infinite geometry"): would there actually be an essential difference in practice? The former would return a "raster cube" with dims t and bands; the latter would return a "vector cube" with the same dims plus maybe a dimension just to hold the "infinite geometry" label, which is not really that useful.

@m-mohr (Member) commented May 31, 2021

We seem to have concluded that a process reduce_spatial(raster-cube, process, context) -> raster-cube is the way to go. I'll make a proposal for that.
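
For illustration, a sketch of what this could look like from the Python client once the process lands; DataCube.reduce_spatial and the back-end URL are assumptions here, not existing client API at the time of writing:

    import openeo

    c = openeo.connect("https://openeo.vito.be").authenticate_oidc()
    cube = (
        c.load_collection("TERRASCOPE_S2_NDVI_V2", bands=["NDVI_10M"])
        .filter_bbox(west=5.027, east=5.0438, south=51.1974, north=51.2213)
        .filter_temporal(["2019-05-24", "2019-05-30"])
    )
    # reduce both spatial dimensions at once; the result keeps dims (t, bands)
    mean_ts = cube.reduce_spatial(reducer="mean")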

One thing that came to mind: for aggregate_spatial and reduce_dimension we have binary variants (e.g. reduce_dimension_binary). We'd need a binary variant for reduce_spatial as well, but on the other hand I'm wondering whether we should remove these binary variants for now. There's no implementation yet, and I've also not heard that anyone is actually planning to implement one of them. See #258 and #259.

@soxofaan (Member) commented Jun 1, 2021

> ... for aggregate_spatial and reduce_dimension we have binary variants ...

Indeed, in the VITO back-end we don't support the binary variants of these processes, and no user has ever asked for them as far as I know. I guess most use cases are covered by the classics (mean, median, min, max, ...), and for a bit more flexibility users can switch to UDFs.
I also think the _binary variants are quite hard to understand for regular users who are not familiar with functional programming concepts (higher-order functions and such).

I agree that we should consider moving aggregate_spatial_binary and reduce_dimension_binary out of the "official" listing and demoting them to experimental/proposal/other.

@m-mohr (Member) commented Jun 1, 2021

The binary variants are already in the experimental proposals, but I'm thinking about removing them completely.
The main use case is UDFs, but on the other hand, it could be that this is too fine-grained and not very performant for UDFs anyway.

@soxofaan (Member) commented Jun 1, 2021

> it could be that this is too fine-grained and not very performant for UDFs anyway.

Yes, I'm afraid that UDFs in a "binary" approach would have horrible performance on most back-end implementations.

m-mohr linked a pull request Jun 1, 2021 that will close this issue
m-mohr added a commit that referenced this issue Jun 4, 2021: reduce_spatial #226 and clarifications #260
@m-mohr (Member) commented Jun 4, 2021

@clausmichele A new process, reduce_spatial, has just been merged. Let's hope some back-ends will implement it soon.

m-mohr closed this as completed Jun 4, 2021
@clausmichele (Member, Author)

> @clausmichele A new process, reduce_spatial, has just been merged. Let's hope some back-ends will implement it soon.

I'll implement it in our back-end soon and then support EODC in having it there as well!
