Rework cohorts algorithm and heuristics #396
Now the overhead is in creating 1000s of dask arrays: here's the profile.
cc @phofl The current approach in flox is pretty flawed: it will create 10,000-ish layers, hehe, but perhaps there are some quick wins.
Alternatively, we need a better approach.
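To make the layer blow-up concrete, here is a minimal toy sketch (array sizes, cohort construction, and the loop are illustrative, not flox's actual code) of the one-array-per-cohort pattern:

```python
import dask.array as da

# Toy stand-in for the per-cohort pattern: one new Array (and several new
# graph layers) per cohort, then a separate reduction for each one.
arr = da.random.random((100_000, 100), chunks=(100, 100))  # 1000 blocks on axis 0
cohorts = [list(range(i, i + 10)) for i in range(0, 1000, 10)]  # 100 toy cohorts

results = []
for blocks in cohorts:
    # Each .blocks[i] and each concatenate adds layers to the graph
    subset = da.concatenate([arr.blocks[i] for i in blocks], axis=0)
    results.append(subset.sum(axis=0))
out = da.stack(results)

# The blow-up is visible in the layer count of the final graph:
print(len(out.__dask_graph__().layers))
```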
Just linking my other comment here: dask/dask#11026 (comment)
Yes, some clever use of shuffle might be the way here.
Can you help me understand a bit better what high-level API would be useful here? As I understand it, the pattern of creating a new array per group comes with significant overhead and also makes life harder for the scheduler. A high-level API that would be useful here (based on my understanding) would make each array a branch going forward, kind of like a shuffle but with no guarantee that each group ends up in a single array. Is that understanding roughly correct?
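A rough sketch of what such a shuffle-like primitive could look like (`shuffle_to_branches` is hypothetical; nothing with this name or signature exists in dask, and the naive body just re-slices blocks where real support would presumably build a single shuffle-like layer):

```python
from typing import Sequence
import dask.array as da

def shuffle_to_branches(
    arr: da.Array, cohorts: Sequence[Sequence[int]]
) -> list[da.Array]:
    """Hypothetical: regroup blocks along axis 0 into one Array per cohort,
    with no guarantee that a group lives entirely in one branch."""
    # Naive stand-in body; .blocks supports list indexing in dask's
    # public API, but this still costs one layer per cohort.
    return [arr.blocks[list(blocks)] for blocks in cohorts]
```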
Yes, some kind of shuffle will help, but the harder bit is the "tree reduction on each branch". That API basically takes Arrays, and uses Arrays internally. PS: I'm low on bandwidth at the moment, and can only really engage here in two weeks. Too many balls in the air!
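For the "tree reduction on each branch" step, dask's existing `da.reduction` already expresses a tree reduce over a single Array; a hedged sketch of applying it to one branch (the helper name and the choice of sum are mine, for illustration):

```python
import numpy as np
import dask.array as da

def tree_reduce_branch(branch: da.Array) -> da.Array:
    # da.reduction builds a tree: `chunk` runs per block, then `aggregate`
    # combines partials in rounds of `split_every` until one result remains.
    return da.reduction(
        branch,
        chunk=np.sum,
        aggregate=np.sum,
        axis=0,
        dtype=branch.dtype,
        split_every=4,  # fan-in of the tree
    )
```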
No worries, let's just tackle this a bit later then.
While working on my talk, I just remembered that you all had fixed a major scheduling problem that made "map-reduce" the better choice.
Can you elaborate a bit more? Did we break something in Dask since we fixed the ordering? Edit: I think I misread your comment. You mean that the improved ordering behavior made map-reduce the better choice, not that we broke something since we fixed the problem, correct?
No, your fix makes my heuristics useless :P We need to update them and choose "map-reduce" more often.
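A sketch of the direction an updated heuristic could take (the function name, inputs, and thresholds are invented for illustration; this is not flox's actual logic):

```python
def choose_method(n_cohorts: int, n_blocks: int, containment: float) -> str:
    # With improved graph ordering, "map-reduce" is the safe default;
    # "cohorts" only pays off when labels map cleanly onto few block
    # subsets, so subsetting won't explode the layer count.
    if n_cohorts <= 1 or containment < 0.75 or n_cohorts > n_blocks // 4:
        return "map-reduce"
    return "cohorts"
```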
Puuuh, you had me worried there for a bit 😂
Example: ERA5
All the overhead is in `subset_to_blocks`.
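One way to reproduce that observation, assuming an ERA5-like xarray Dataset `ds` chunked along time (the dataset and grouping below are illustrative): time the graph construction alone, without computing anything.

```python
import time
import flox.xarray

t0 = time.perf_counter()
# Graph construction only, no .compute(). With method="cohorts" this calls
# subset_to_blocks once per cohort, which is where the profiled time goes.
result = flox.xarray.xarray_reduce(
    ds, ds.time.dt.month, func="mean", method="cohorts"
)
print(f"graph built in {time.perf_counter() - t0:.1f}s")
```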