Add cpu oom retry split handling to InternalRowToColumnarBatchIterator #10011
Conversation
…rator Signed-off-by: Jim Brennan <jimb@nvidia.com>
build
@@ -144,7 +144,7 @@ private class HostAlloc(nonPinnedLimit: Long) extends HostMemoryAllocator with L
 logDebug(s"Targeting host store size of $targetSize bytes")
 // We could not make it work so try and spill enough to make it work
 val maybeAmountSpilled =
-  RapidsBufferCatalog.synchronousSpill(RapidsBufferCatalog.getHostStorage, allocSize)
+  RapidsBufferCatalog.synchronousSpill(RapidsBufferCatalog.getHostStorage, targetSize)
nice catch
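For context, a hedged sketch of why the argument matters, with made-up sizes and a hypothetical helper rather than the plugin's actual HostAlloc code, and assuming (as the "Targeting host store size" log line suggests) that synchronousSpill receives the size the host store should shrink to:

```java
// Illustrative arithmetic only (hypothetical names and sizes, not the plugin's code):
// if synchronousSpill is given the size the store should shrink *to*, passing the
// small allocSize instead of the computed targetSize spills far more than needed.
public final class SpillTargetSketch {
    // Bytes spilled if we ask the store to shrink down to shrinkTo.
    static long bytesSpilled(long currentStoreBytes, long shrinkTo) {
        return Math.max(0L, currentStoreBytes - shrinkTo);
    }

    public static void main(String[] args) {
        long storeLimit = 4L << 30;               // 4 GiB host store limit
        long currentStoreBytes = storeLimit;      // store is currently full
        long allocSize = 64L << 20;               // the allocation only needs 64 MiB
        long targetSize = storeLimit - allocSize; // shrink just enough to fit it

        // Correct argument spills ~64 MiB; the wrong one spills nearly the whole store.
        System.out.println("spill targeting targetSize: " + bytesSpilled(currentStoreBytes, targetSize));
        System.out.println("spill targeting allocSize:  " + bytesSpilled(currentStoreBytes, allocSize));
    }
}
```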
SpillPriorities$.MODULE$.ACTIVE_ON_DECK_PRIORITY(),
RapidsBufferCatalog$.MODULE$.singleton());
hBufs[0] = null; // Was closed by spillable
hBufs[1] = HostAlloc$.MODULE$.alloc(offsetBytes, true);
do we need to wrap this in a try/catch so we can close sBufs[0] if alloc throws?
oh nm, I think allocBuffersWithRestore takes care of it.
I think there may be a problem here actually. If we throw with something other than a retry exception, the withRestoreOnRetry won't handle it.
yeah that would be an issue
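A minimal sketch of that failure mode, with a hypothetical helper and flow rather than the PR's actual code: if the second host allocation throws something other than a retry exception, the first buffer still has to be closed before the exception propagates.

```java
import ai.rapids.cudf.HostMemoryBuffer;

// Minimal sketch of the concern above (hypothetical flow, not the PR's code):
// close the already-allocated buffer when a later allocation fails with an
// exception the retry/restore machinery will not handle.
public final class CloseOnFailureSketch {
    static HostMemoryBuffer[] allocPair(long dataBytes, long offsetBytes) {
        HostMemoryBuffer dataBuf = HostMemoryBuffer.allocate(dataBytes);
        try {
            HostMemoryBuffer offsetsBuf = HostMemoryBuffer.allocate(offsetBytes);
            return new HostMemoryBuffer[] { dataBuf, offsetsBuf };
        } catch (Throwable t) {
            // Not a retry exception we can recover from here, so release what we hold.
            dataBuf.close();
            throw t;
        }
    }
}
```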
LGTM
return (int) Math.max(1,
    Math.min(Integer.MAX_VALUE - 1, targetBytes / sizePerRowEstimate));
nit: casting is forced by targetBytes / sizePerRowEstimate? I'd prefer pushing it down to the origin:
-    return (int) Math.max(1,
-        Math.min(Integer.MAX_VALUE - 1, targetBytes / sizePerRowEstimate));
+    return Math.max(1,
+        Math.min(Integer.MAX_VALUE - 1, (int) (targetBytes / sizePerRowEstimate)));
sql-plugin/src/main/java/com/nvidia/spark/rapids/InternalRowToColumnarBatchIterator.java
Thanks for the review comments @abellina and @gerashegalov! I believe I have addressed them in this latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build
This is a follow-up to #9929 and is part of addressing issue #8887.

This adds handling for CpuSplitAndRetryOOM in InternalRowToColumnarBatchIterator. For splits, we divide the target number of rows in half via splitTargetSizeInHalfCpu and recalculate the buffer sizes from that. It will keep trying that until we hit 1 row.

I also included a workaround for #10004 by combining the withHostBufferWriteLock and withHostBufferReadOnly blocks. In this case we hold the write lock for the full operation, but this should let us proceed without introducing data corruption until we have a fix for #10004.

One other fix is in HostAlloc, which was passing the wrong argument to RapidsBufferCatalog.synchronousSpill: it was passing allocSize instead of targetSize, causing us to spill a lot more than needed in some cases.
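As a rough illustration of the split handling described above, here is a hedged sketch using generic stand-in types and helpers, not the plugin's actual retry framework or the real CpuSplitAndRetryOOM class: on a split-and-retry OOM the target row count is halved and the batch is rebuilt, bottoming out at a single row.

```java
// Hedged sketch with stand-in types, not the plugin's retry framework:
// halve the target row count on a split-and-retry OOM and rebuild,
// giving up only once the target is down to a single row.
public final class SplitRetrySketch {
    /** Stand-in for an exception like CpuSplitAndRetryOOM. */
    static final class SplitAndRetryOOM extends RuntimeException {}

    interface BatchBuilder<T> {
        T build(int targetRows);
    }

    static <T> T buildWithSplitRetry(int initialTargetRows, BatchBuilder<T> builder) {
        int targetRows = Math.max(1, initialTargetRows);
        while (true) {
            try {
                // Buffer sizes would be recomputed from targetRows inside build().
                return builder.build(targetRows);
            } catch (SplitAndRetryOOM oom) {
                if (targetRows <= 1) {
                    throw oom; // cannot split below one row
                }
                targetRows = Math.max(1, targetRows / 2);
            }
        }
    }
}
```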