Add array_apply #72

Closed · Tracked by #69
LukeWeidenwalker opened this issue Feb 27, 2023 · 2 comments
Comments

@LukeWeidenwalker (Contributor)

My previous comments on where I'd start with this:

Here we can't rely on `@process` to handle this correctly: similarly to the other parent processes, we'll need to make sure the right parameters are being passed down via `positional_parameters` and `named_parameters`. Without spending more time on it, I can't say in advance what the solution will look like, but I'd have a look into `np.vectorize` and `dask.array.gufunc.apply_gufunc`!
For the `index` parameter: I don't think we'd want an explicit for-loop over the elements, for obvious performance reasons, so we won't be able to pass `index` to subprocesses. To satisfy the parameter resolution in `@process`, I think we can pass it through as well, but just not use it further down.
Or, if it's absolutely necessary to pass the index down, we'd have to look into something like `numba`, but that seems like an unnecessary complication for what is essentially a map operation on a numpy array.
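For reference, a minimal sketch of the parameter-forwarding idea. The overall shape mirrors how other parent processes forward parameters, but the exact signature here is an assumption, not a final implementation:

```python
import dask.array as da


def array_apply(data: da.Array, process, context=None) -> da.Array:
    # Sketch only: forward the element values and context the same way
    # other parent processes do, so that parameter resolution in
    # @process is satisfied. The spec's `index` parameter is
    # deliberately not materialised here, since producing it would
    # force an explicit per-element Python loop.
    return process(
        x=data,
        positional_parameters={"x": 0},
        named_parameters={"context": context},
    )
```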

@danielFlemstrom

As a computer scientist and not an EO expert, I understand that the common approach in OpenEO involves defining an area of interest (AOI, described by a few polygons) and performing time series analysis, where the primary focus is tracking changes over time.

A common use case for our users involves monitoring forested areas or agricultural fields to detect changes, such as tree cover loss or crop health variations. Each day, we aim to calculate a value representing, for example, how green an area is, creating a time series to track changes. The result may then be communicated to the owner of that area. For example, we might detect when trees are cut down or when crops show signs of stress. This is a simplified example, but it illustrates the typical scenario.

This corresponds to processing 100,000 to 200,000 polygons daily, with each polygon representing a specific area of interest. The results would be stored and analyzed on a per-polygon basis. In contrast to OpenEO's main scenario, which often focuses on time series analysis for tracking changes across a few polygons, our use case processes a large number of polygons at a single time step, with the option to consider time as a secondary dimension later. Given the sheer number of AOIs requiring individual calculations, we assume that an intermediate step, focusing on polygons first before considering temporal changes, is necessary.

Creating individual jobs for every polygon seems inefficient, and looping through 100,000 polygons may not be optimal. Could one solution be to sort the polygons by locality, returning chunks of arrays to achieve parallelism, and then to apply `array_apply` to these chunks? This way, we could process multiple polygons simultaneously, reducing overhead. As Luke noted, ensuring that the right parameters (such as positional and named parameters) are passed down correctly is key.
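To make the chunking idea concrete, here is a rough sketch. All names are hypothetical placeholders, and it assumes the per-polygon values have already been flattened into a single array sorted by locality:

```python
import numpy as np
import dask.array as da

# Hypothetical: one value per polygon, pre-sorted by spatial locality
# so that neighbouring polygons land in the same chunk.
polygon_values = da.from_array(np.random.rand(200_000), chunks=10_000)

def greenness_index(block: np.ndarray) -> np.ndarray:
    # Placeholder per-element computation; it runs once per chunk, and
    # the chunks are processed in parallel, instead of one Python-level
    # loop iteration per polygon.
    return block * 2.0

result = da.map_blocks(greenness_index, polygon_values).compute()
```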

TL;DR: While an explicit for-loop over elements might not be ideal for performance, would this approach be acceptable if the array is sufficiently chunked beforehand, or if the process can be applied in parallel to the chunks? At least as an initial solution, this would be better than nothing for us anyway :).

With this in mind, should we explore the `array_apply` approach under these conditions, or move towards a more generic process, as in `aggregate_spatial`, which might require implementing user-defined functions instead of relying on built-in reducers like `mean`?

@ValentinaHutter (Collaborator)

I think what you describe is not directly related to the `array_apply` process, but makes more sense in combination with `aggregate_spatial`. In our case, we came to the conclusion that we can run some of the computation in parallel after sorting the polygons, but we still needed a for loop... I tried to explain what I did here: https://github.com/Open-EO/openeo-processes-dask/blob/main/docs/scalability/aggregate-large-spatial-extents.md Not sure if this applies to your use case :)
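As a bare-bones illustration of that pattern (not taken from the linked doc; all names are made up): the work within each group of polygons runs in parallel, while the remaining for-loop iterates over the groups themselves.

```python
import dask
import numpy as np

# Hypothetical stand-in: per-polygon pixel values, grouped by locality.
polygon_groups = [np.random.rand(100, 50) for _ in range(10)]

@dask.delayed
def aggregate_group(group: np.ndarray) -> np.ndarray:
    # One mean per polygon in the group; each group is a delayed task.
    return group.mean(axis=1)

# The explicit loop over groups remains, but the groups themselves are
# computed in parallel by dask.
tasks = [aggregate_group(g) for g in polygon_groups]
results = dask.compute(*tasks)
```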
