Process graph simulation #243
It's something I have been thinking about as well, but it's not always very easy. We could try it as an experimental feature if other back-ends are interested.
Well, users are interested, as it seems to be something so important that it may actually break the whole "cloud" workflow for a user. So that is something we need to address somehow; we still need to figure out what that is. It could also be the debug process, which would be similar to GEE's print, but that would need very good integration in the clients so users don't need to scroll through logs. I'm not sure whether your alternative approach would be something users would really do?!
Just to add from the user perspective that the timestamps are important metadata. As image observation dates are often irregular, a user would like to know the timestamps before specifying a time interval, e.g. for the temporal aggregation process, or for analysis of time series, etc.
Just getting the timestamps for collections in general is already possible, but it's not widely implemented as (1) there will be many timestamps and (2) they depend on the location etc. That's why we probably need a somewhat more advanced way to "simulate" process graphs so that intermediate states of the metadata can be queried. Or the
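For context, what is already available today is the per-collection metadata from GET /collections/{id}, which includes the STAC cube:dimensions object. A minimal sketch with the Python client; the back-end URL, the collection id and the dimension name "t" are assumptions, and whether individual timestamps are listed is up to the back-end:

```python
import openeo

# Connect to some openEO back-end (URL and collection id are placeholders).
connection = openeo.connect("https://openeo.example.org")

# GET /collections/{id} returns STAC-style metadata including "cube:dimensions".
metadata = connection.describe_collection("SENTINEL2_L2A")
temporal = metadata["cube:dimensions"]["t"]

print(temporal.get("extent"))  # overall temporal extent, e.g. ["2015-06-23T00:00:00Z", None]
print(temporal.get("values"))  # individual timestamps, only if the back-end chooses to list them
```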
Yes, it involves costs. A "simulation" mode or any alternative should be free or very, very cheap, I think.
What we discussed wasn't a process graph simulation, just the option of getting metadata about the selected cube view that the user wants to work with before actually running any processing. And I am definitely interested in supporting this functionality on our back-end, because I think it is essential for users. I am not sure how we can simulate the process graph and return metadata at a point x in the middle of the process graph? For example, it could be possible to know the cube dimensions and size, but how to know if there is a valid pixel value at timestamp T5 in that cube after having applied n processes? My suggestion is to keep exploration of 'initial' (meta)data separate from exploration of 'intermediate' (meta)data.
We discussed both, and both are useful. Users may need the information somewhere in-between. Nevertheless, we can start with the simplest approach (just after load_collection + debug) first and then expand later. The question is how to best place it in the API. Options could be:
Please note that the endpoints may need to return multiple sets of metadata as multiple load_collection calls can be contained in the process graph.
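To illustrate the "load_collection + debug" idea mentioned above: a rough sketch with the Python client, assuming its generic DataCube.process helper and THIS placeholder for the current cube, and the parameter names of the draft debug process (all of which may differ per client or back-end version); collection id and extents are placeholders:

```python
import openeo
from openeo.rest.datacube import THIS  # placeholder for "the current cube" in generic process calls

connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 11.2, "south": 46.4, "east": 11.4, "north": 46.6},
    temporal_extent=["2021-05-01", "2021-08-31"],
    bands=["B04", "B08"],
)

# Attach debug directly after load_collection so the back-end reports the
# cube's metadata (dimensions, labels, ...) before any further processing.
cube = cube.process("debug", data=THIS, message="cube right after load_collection", level="info")

# The debug output ends up in the logs of whatever request runs this graph.
cube.download("subset.nc")
```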
I have been pondering this kind of problem too, but then in the context of band (metadata) handling in the (Python) client: should the client "simulate" what a back-end is supposed to do in order to resolve band names/indexes properly, or should there be a way for the client to query metadata about an intermediate cube? Anyway, back to the original user question "for what dates do I have observations?": given this perspective, let me spitball another possible solution: define a new process, e.g. "observation_dates" (feel free to replace with a better name), that returns a list of dates with "enough" (TBD what this should mean) non-empty data coverage in the spatial bbox. Advantages:
> define a new process, e.g. "observation_dates"

> there are some general comments when time stamps could be useful for a user:

> a general note
Most of the use cases mentioned are supported already:
Maybe it is a matter of properly documenting this, or even providing some convenience functions to the user, like an 'observation dates' function. |
Indeed, counting is supported already. What is missing is getting the dimension labels (e.g. observation dates). What would solve the issue is implementing #91. Then it would be as easy as using
In this case we would not need another process and it would mostly be in line with how counting works. If we add another process, I would probably define it more generally as
The difference between simulation in the API and a normal process is the cost! Simulation would probably be somewhat cheaper or free compared to normal processing, as it belongs to the normal data discovery workflow, I guess?
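For what the dimension-label route could look like: a sketch assuming a process along the lines of dimension_labels(data, dimension) (as in the current openEO processes list) and a back-end that accepts such a small synchronous request; collection id, extents and the dimension name "t" are placeholders:

```python
import openeo
from openeo.rest.datacube import THIS

connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 11.2, "south": 46.4, "east": 11.4, "north": 46.6},
    temporal_extent=["2021-05-01", "2021-08-31"],
)

# dimension_labels returns the labels of one dimension, i.e. the actual
# observation timestamps of the (filtered) cube.
labels = cube.process("dimension_labels", data=THIS, dimension="t")

timestamps = labels.execute()  # small synchronous request returning a JSON array
print(timestamps)
```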
"valid" -> not sure what everyone means with this expression here. We can use the process count to count pixels/dates, but judging their validity is a different topic. Unless the user with a-priori knowledge expects to have N dates and get n<N and hence knows some data are not there (maybe the collection has already a cloud mask applied or so). initial metadata in between (meta)data The main thing is that I would not mix the two topics into one. |
Valid according to the spec: see https://processes.openeo.org/#count and https://processes.openeo.org/#is_valid
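To make the spec reference concrete: counting valid (non-nodata) observations is already expressible with count used as a reducer, since count without a condition only counts valid values. A sketch; collection id, extents and the temporal dimension name are placeholders:

```python
import openeo

connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 11.2, "south": 46.4, "east": 11.4, "north": 46.6},
    temporal_extent=["2021-05-01", "2021-08-31"],
    bands=["B04"],
)

# Reduce the temporal dimension with count: without a condition, count only
# counts valid values, so this yields a per-pixel number of usable observations.
valid_count = cube.reduce_dimension(dimension="t", reducer="count")

valid_count.download("valid_observation_count.tiff")
```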
Not sure whether for such a limited scope we really need a new endpoint. If it's really just about passing some parameters, I'd be in favor of adding parameters to the /collections/{id} endpoint as proposed above.
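Purely as an illustration of that option: a hypothetical extension of GET /collections/{id} with filter parameters. Neither the parameter names nor the narrowed response exist in the openEO API today; this is just the shape such a request could take:

```python
import requests

API = "https://openeo.example.org/v1"  # placeholder back-end

# Hypothetical: narrow the reported collection metadata to a spatio-temporal subset.
resp = requests.get(
    f"{API}/collections/SENTINEL2_L2A",
    params={
        "bbox": "11.2,46.4,11.4,46.6",        # hypothetical parameter
        "datetime": "2021-05-01/2021-08-31",  # hypothetical parameter
    },
)
resp.raise_for_status()

# If the back-end listed the subset's timestamps, they could show up here.
print(resp.json()["cube:dimensions"]["t"].get("values"))
```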
If one wanted to implement it, a process graph could technically be simulated to some extent. I guess a user would find it useful, but I can see it's very hard to implement.
I'd expect that passing a data cube to debug would result in output similar to /collections/{id}. Telling users that they need to run and pay for a process graph twice is probably not going to sell very well.
Can we maybe define specifically what this simulation should return? For example the metadata of the datacube (dimensions and their cardinality) after each process? Or something more?
Probably we are thinking of different things? I'm thinking of debug as a sort of (asynchronous) breakpoint that a user can place in a process graph, get the output of, and then resume processing after inspecting the intermediate result (difficult and expensive to implement/run, but not completely impossible).
Whatever the user needs (not sure yet - we need to ask the users, e.g. @przell, @MilutinMM, ...), but at least the dimension labels were requested above, and that's nothing a user can know from the processes themselves. The dimensions, indeed, a user should be able to know by inspecting the processes.
Seems so. Have you read https://processes.openeo.org/draft/#debug ? The API doesn't support halting/pausing processing at any point; debug just sends information to a log file.
Yes, I read it and I know there's no pause in the API; I thought we were also discussing if/how to change the debug process, never mind. Back to the "simulation": I think it's hard (impossible?) to get dimension labels and other kinds of information about the datacube at a point x in the middle of the process graph without doing any processing. But I'll think more about our own implementation and see if there's a solution.
That could be a conclusion, and then the user has to bite the bullet. It's the same in GEE, with the "minor" difference that users just do it and don't complain because it's free and usually fast.
This has been mentioned before by others: I think it makes sense to distinguish the two use cases we are talking about (and to place them in different issues).
Regarding 1: As a user, I would expect to get the time steps of a complete data cube in a similar way as
Okay, let's conclude for now:
There's a lot of focus on cost in this issue, but please know that even if you call it 'simulation' or 'debug' in the API, the cost will be entirely the same. There's no 'cheap' option for me to look up nodata pixels inside an area other than reading and counting them.
@jdries: I think what you are describing is even another use case:
Moving this from 1.0-rc1 to final, as I'd first like to see some implementations of debug, how well it works, and what is missing before adding more things to the API. Afterwards we can evaluate and choose the best option to proceed with.
Any implementation available? Otherwise we probably have to push this from 1.0-final to future...
FYI: I just got this request from an aspiring openeo user (new VITO colleague):
(translated from Dutch)
@soxofaan That's what the |
I don't completely agree:
I'm not debating against |
Just to make sure: what the user refers to above is the getInfo() call in GEE. Have you worked with that before, and are you aware of how it works?
Indeed, because it's up to the implementation to decide what it can reasonably support. For Platform, we may need to agree on a common behavior, but that's nothing that we need to define in openEO in general. It also depends on what you insert there. The back-ends need to define behavior for different types, e.g. a summary for raster-cubes (e.g. STAC's cube:dimensions structure), vector-cubes, arrays (e.g. print some high-level statistics, see R), etc.
The user requested basically getInfo (because it's the only way you can actually print in GEE) and debug is the equivalent with very similar behavior. getInfo also executes the whole flow and reports back in-between, but still runs the whole processing chain. Also, you don't need to download "a whole cube". How did you get to that assumption?
No, you can do small-scale synchronous processing and still use logs.
Very easy in the Web Editor. For sync jobs, the logs open up automatically after completion; for batch jobs and services you can simply click the "bug" button and a continuously updating log UI is shown. Other clients may need to catch up in supporting logs better, indeed.
This is IMHO not an issue. You can set a custom "debug" identifier in the
Well, that was not the question you asked above. It was "I [...] was wondering if it is possible in openeo to print information". There might be better alternatives for the examples given, but for the general question the answer is: "debug". It was specified to capture exactly this use case. The question about timestamps already came up in H2020: #346
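For completeness, one way a client could get at the debug output today: run the graph as a batch job and read its logs (GET /jobs/{id}/logs). A sketch with the Python client; the debug parameter names follow the draft spec, the generic DataCube.process/THIS helpers are assumed, and the collection id is a placeholder:

```python
import openeo
from openeo.rest.datacube import THIS

connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

cube = connection.load_collection("SENTINEL2_L2A", temporal_extent=["2021-05-01", "2021-08-31"])
cube = cube.process("debug", data=THIS, message="which observation dates are left?", code="my-debug")

job = cube.save_result(format="GTiff").create_job(title="debug example")
job.start_and_wait()

# Whatever debug emitted shows up in the job logs, retrievable via the client.
for entry in job.logs():
    print(entry.get("level"), entry.get("code"), entry.get("message"))
```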
During the processes telco, the issue came up that we don't have intermediate information about dimension labels available, especially the individual timestamps after filtering. We need to find a way to allow this, as it's important information from a user perspective.
We could add a (free-to-use) /explore endpoint to which you can send part of a process graph and which returns cube metadata, e.g. the timestamps.
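A rough sketch of what such a hypothetical /explore endpoint could look like from a client's point of view: POST a (partial) process graph and get cube metadata back. The endpoint, the payload wrapping and the response shape are all assumptions; only the process graph format itself is standard openEO:

```python
import requests

API = "https://openeo.example.org/v1"  # placeholder back-end

partial_process_graph = {
    "load": {
        "process_id": "load_collection",
        "arguments": {
            "id": "SENTINEL2_L2A",
            "spatial_extent": {"west": 11.2, "south": 46.4, "east": 11.4, "north": 46.6},
            "temporal_extent": ["2021-05-01", "2021-08-31"],
        },
        "result": True,
    }
}

# Hypothetical endpoint: returns the metadata of the cube this (partial) graph would produce.
resp = requests.post(f"{API}/explore", json={"process": {"process_graph": partial_process_graph}})
resp.raise_for_status()

print(resp.json()["cube:dimensions"]["t"]["values"])  # e.g. the timestamps after filtering
```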