[FEA] Update GpuWindowExec to use OOM retry framework #7254
A first crack at this adds the retry logic within … A follow-on issue needs to be defined for split and retry, and running window seems like the first candidate. Running window already has logic to handle fixup around batch boundaries to carry the window aggregation through, and the split/merge logic would plug into this. But again, this likely should be split out from this issue when we start looking at it.
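As a sketch of that split/merge idea for running window, here is the mechanism reduced to a running sum over plain arrays: split the input into chunks, process each chunk, and carry the aggregation state across chunk boundaries the way the existing batch-boundary fixup does. Everything below is illustrative; none of these names exist in the plugin.

```scala
// Hypothetical illustration: a running sum computed chunk by chunk, with the
// carried value playing the role of the batch-boundary fixup state.
def runningSumInChunks(values: Array[Long], chunkRows: Int): Array[Long] = {
  var carry = 0L
  values.grouped(chunkRows).flatMap { piece =>
    val out = piece.scanLeft(carry)(_ + _).drop(1) // running sum within this chunk
    carry = out.last                               // fixup state carried to the next chunk
    out
  }.toArray
}

// runningSumInChunks(Array(1L, 2L, 3L, 4L, 5L), 2) == Array(1L, 3L, 6L, 10L, 15L)
```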
I have something working here, with some tests that touch row, range, and row-optimized windows (just going off what I see in …).
Is your feature request related to a problem? Please describe.
Once #7253 is complete, we should update GpuWindowExec to use it.
Describe the solution you'd like
There are a few parts to this that we need to get right, and they may need to be done in multiple separate steps or PRs.
We need to combine the preceding GpuCoalesceBatches with the GpuWindowExec. This is needed because the current design of the GpuMemoryLeaseManager does not allow batches larger than the default lease size to be passed between nodes in a SparkPlan. That is partly because of limitations in Spark itself, but combining the nodes should also help with debugging and with avoiding leaked leases.
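As a rough sketch of what the combination could look like at the plan level, the rewrite below folds a coalesce into the window node that follows it. The plan types here are simplified stand-ins, not the real SparkPlan nodes:

```scala
// Hypothetical, simplified plan nodes; the real rule would pattern-match on
// GpuWindowExec over GpuCoalesceBatches in the plugin's plan rewrites.
sealed trait Plan
case class Scan() extends Plan
case class CoalesceBatches(child: Plan) extends Plan
case class WindowExec(child: Plan) extends Plan
case class WindowWithCoalesce(child: Plan) extends Plan // the combined operator

def combine(p: Plan): Plan = p match {
  case WindowExec(CoalesceBatches(c)) => WindowWithCoalesce(combine(c)) // fold the pair
  case WindowExec(c)                  => WindowExec(combine(c))
  case CoalesceBatches(c)             => CoalesceBatches(combine(c))
  case WindowWithCoalesce(c)          => WindowWithCoalesce(combine(c))
  case s: Scan                        => s
}
```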
Once the two are combined, and the task knows how much memory it will need to hold the input and the number of input rows, it should do a high-water-mark estimate of how much GPU memory it will need to complete the operation. This is likely to be tricky, and we should use some of the memory-usage tools being worked on to help us know what the actual usage is. We should also look at adding some scale tests using the same technology. If the technology is not available when we first start to work on this, those tests should be filed as follow-on issues so we don't lose track of them.
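As a very rough sketch, and assuming we can measure a per-operation scratch factor with the tools above, the estimate might look like this (all names and factors are placeholders):

```scala
// High-water-mark estimate: input, output, and kernel scratch space are assumed
// to all be live at the same time at the peak.
def estimateHighWaterMark(
    inputBytes: Long,
    inputRows: Long,
    bytesPerOutputRow: Long,
    scratchFactor: Double): Long = {
  val outputBytes  = inputRows * bytesPerOutputRow  // window output has one row per input row
  val scratchBytes = (inputBytes * scratchFactor).toLong
  inputBytes + outputBytes + scratchBytes
}
```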
If the node needs a larger lease to complete the processing, it should ask for that lease at this point, before going on, since all of the data is already spillable.
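A sketch of that handshake, assuming a GpuMemoryLeaseManager API along the lines of what #7253 proposes (the method names here are invented for illustration):

```scala
// Hypothetical stand-in for the lease manager from #7253.
object GpuMemoryLeaseManager {
  final class Lease(val bytes: Long) extends AutoCloseable {
    override def close(): Unit = () // the real implementation would return memory to the pool
  }
  // The real call would likely block until enough memory can be leased.
  def requestLease(bytes: Long): Lease = new Lease(bytes)
}

def withLease[T](neededBytes: Long, defaultLeaseBytes: Long)(body: => T): T = {
  if (neededBytes <= defaultLeaseBytes) {
    body // the default lease already covers the estimate
  } else {
    // Safe to block here: all of the task's input is already spillable.
    val lease = GpuMemoryLeaseManager.requestLease(neededBytes)
    try body finally lease.close()
  }
}
```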
Once it has the lease it should perform the operation and, ideally, split the output batch into smaller chunks that are about the target batch size.
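A sketch of the chunking, assuming bytes scale roughly linearly with rows; the real code would slice the output ColumnarBatch rather than return row ranges:

```scala
// Split totalRows into [start, end) ranges that each hold about targetBytes.
def splitRowRanges(totalRows: Long, totalBytes: Long, targetBytes: Long): Seq[(Long, Long)] = {
  val bytesPerRow  = math.max(1L, totalBytes / math.max(1L, totalRows))
  val rowsPerChunk = math.max(1L, targetBytes / bytesPerRow)
  (0L until totalRows by rowsPerChunk).map { start =>
    (start, math.min(totalRows, start + rowsPerChunk))
  }
}
```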
For window operations this can get rather complicated, because there are multiple backends that could be used to perform a given window aggregation/function. We should strive to size the data as accurately as possible, but in cases where we just don't know we should err on the side of the worst case. Only if we see serious performance degradation should we rethink this plan.
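To make "err on the side of the worst case" concrete, a per-backend factor with a pessimistic fallback might be enough (the backends and numbers below are purely illustrative):

```scala
// Hypothetical backend sizing: use a measured factor when we have one,
// otherwise fall back to the worst case we are willing to assume.
sealed trait WindowBackend
case object RowsFrame  extends WindowBackend
case object RangeFrame extends WindowBackend
case object Unknown    extends WindowBackend

def scratchFactor(backend: WindowBackend): Double = backend match {
  case RowsFrame  => 1.5 // placeholder for a measured value
  case RangeFrame => 2.0 // placeholder for a measured value
  case Unknown    => 4.0 // pessimistic worst case
}
```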
This is intended to be the first operator that we tackle, because the input to a window operation can be unbounded, and there are a number of cases where we do not currently support out-of-core algorithms that would let us process arbitrary amounts of data.