Add array_apply #72

Closed · Tracked by #69
LukeWeidenwalker opened this issue Feb 27, 2023 · 2 comments
Comments

@LukeWeidenwalker (Contributor)

My previous comments on where I'd start with this:

Here we can't rely on `@process` to handle this correctly: similarly to the other parent processes, we'll need to make sure the right parameters are being passed down via `positional_parameters` and `named_parameters`. Without spending more time on it, I can't say in advance what the solution will look like, but I'd have a look into `np.vectorize` and `dask.array.gufunc.apply_gufunc`!
For the `index` parameter: I don't think we'd want an explicit for-loop over the elements, for obvious performance reasons, so we won't be able to pass `index` to subprocesses. To satisfy the parameter resolution in `@process`, I think we can pass it through as well, but just not use it further down.
Or, if it's absolutely necessary to pass the index down, we'd have to look into something like `numba`, but that seems like an unnecessary complication for what is essentially a map operation on a numpy array.
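For reference, a minimal sketch of the parameter-forwarding idea. The overall shape mirrors how other parent processes forward parameters, but the exact signature here is an assumption, not a final implementation:

```python
import dask.array as da


def array_apply(data: da.Array, process, context=None) -> da.Array:
    # Sketch only: forward the element values and context the same way
    # other parent processes do, so that parameter resolution in
    # @process is satisfied. The spec's `index` parameter is
    # deliberately not materialised here, since producing it would
    # force an explicit per-element Python loop.
    return process(
        x=data,
        positional_parameters={"x": 0},
        named_parameters={"context": context},
    )
```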

@danielFlemstrom

As a computer scientist and not an EO expert, I understand that the common approach in OpenEO involves defining an area of interest (AOI, described by a few polygons) and performing time series analysis, where the primary focus is tracking changes over time.

A common use case for our users involves monitoring forested areas or agricultural fields to detect changes, such as tree cover loss or crop health variations. Each day, we aim to calculate a value representing, for example, how green an area is, creating a time series to track changes. The result may then be communicated to the owner of that area. For example, we might detect when trees are cut down or when crops show signs of stress. This is a simplified example, but it illustrates the typical scenario.

This corresponds to processing 100,000 to 200,000 polygons daily, with each polygon representing a specific area of interest. The results would be stored and analyzed on a per-polygon basis. In contrast to OpenEO's main scenario, which often focuses on time series analysis for tracking changes across a few polygons, our use case processes a large number of polygons at a single time step, with the option to consider time as a secondary dimension later. Given the sheer number of AOIs requiring individual calculations, we assume that an intermediate step, focusing on polygons first before considering temporal changes, is necessary.

Creating individual jobs for every polygon seems inefficient, and looping through 100,000 polygons may not be optimal. Could one solution be to sort the polygons by locality, returning chunks of arrays to achieve parallelism, and then to apply `array_apply` to these chunks? This way, we could process multiple polygons simultaneously, reducing overhead. As Luke noted, ensuring that the right parameters (such as positional and named parameters) are passed down correctly is key.
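To make the chunking idea concrete, here is a rough sketch. All names are hypothetical placeholders, and it assumes the per-polygon values have already been flattened into a single array sorted by locality:

```python
import numpy as np
import dask.array as da

# Hypothetical: one value per polygon, pre-sorted by spatial locality
# so that neighbouring polygons land in the same chunk.
polygon_values = da.from_array(np.random.rand(200_000), chunks=10_000)

def greenness_index(block: np.ndarray) -> np.ndarray:
    # Placeholder per-element computation; it runs once per chunk, and
    # the chunks are processed in parallel, instead of one Python-level
    # loop iteration per polygon.
    return block * 2.0

result = da.map_blocks(greenness_index, polygon_values).compute()
```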

TL;DR: While an explicit for-loop over elements might not be ideal for performance, would this approach be acceptable if the array is sufficiently chunked beforehand, or if the process can be applied in parallel to the chunks? At least as an initial solution, this would be better than nothing for us anyway :).

With this in mind, should we explore the `array_apply` approach under these conditions, or move towards a more generic process, as in `aggregate_spatial`, which might require implementing user-defined functions instead of relying on built-in reducers like `mean`?

@ValentinaHutter (Collaborator)

I think what you describe is not directly related to the `array_apply` process, but makes more sense in combination with `aggregate_spatial`. In our case, we came to the conclusion that we can run some of the computation in parallel after sorting the polygons, but we still needed a for loop... I tried to explain what I did here: https://github.com/Open-EO/openeo-processes-dask/blob/main/docs/scalability/aggregate-large-spatial-extents.md Not sure if this applies to your use case :)
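As a bare-bones illustration of that pattern (not taken from the linked doc; all names are made up): the work within each group of polygons runs in parallel, while the remaining for-loop iterates over the groups themselves.

```python
import dask
import numpy as np

# Hypothetical stand-in: per-polygon pixel values, grouped by locality.
polygon_groups = [np.random.rand(100, 50) for _ in range(10)]

@dask.delayed
def aggregate_group(group: np.ndarray) -> np.ndarray:
    # One mean per polygon in the group; each group is a delayed task.
    return group.mean(axis=1)

# The explicit loop over groups remains, but the groups themselves are
# computed in parallel by dask.
tasks = [aggregate_group(g) for g in polygon_groups]
results = dask.compute(*tasks)
```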
