[BUG] If a host memory buffer is spilled, it cannot be unspilled #10004
Labels
bug
Something isn't working
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
Comments
Here is the debug output that shows the problem (I added some logs):
A related issue is that the |
Describe the bug
While testing the host memory retry code for InternalRowToColumnarBatchIterator, I found that some nds queries were producing incorrect results. After some debugging and discussion with @abellina, I found the source of the problem.
The retry code in question allocates two buffers, makes them spillable, fills the buffers inside a withHostBufferWriteLock block, and then uses the buffers inside a withHostBufferReadOnly block. The last part looks like this:
The problem occurs when one or both of the buffers are spilled before we enter the withHostBufferWriteLock block.
When the spill happens, the rapids buffer associated with the RapidsBufferHandle (in the SpillableHostBuffers) switches from a RapidsHostMemoryBuffer to a RapidsDiskBuffer. So when we enter the write block, the RapidsDiskBuffer allocates a new HostMemoryBuffer and copies into it from disk (this is all zeros, because we spilled before filling it). This hostbuffer is part of the RapidsDiskBuffer. We then fill up this buffer with useful data.
When we exit the write block, this hostbuffer is closed, so when we enter the read block, we get the same RapidsDiskBuffer, and because its hostbuffer is closed, we reload from disk, which is all zeros.
Steps/Code to reproduce bug
I was able to reproduce this by running nds query10 at scale 100 on my desktop with the following configs:
I was running with 16 executor cores.
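Independently of the nds run, the sequence described under "Describe the bug" can be modeled in a few lines. This is a toy Python sketch, not plugin code: ToyHandle, spill, with_write, with_read, fill and run are all hypothetical names. It assumes only what the description states, namely that a spilled buffer hands out a fresh temporary host copy on every access and discards it afterwards.

```python
class ToyHandle:
    """Hypothetical stand-in for a spillable host buffer handle (not plugin code)."""

    def __init__(self, size: int):
        self.host = bytearray(size)  # host-tier storage, zero-filled
        self.disk = None             # set once the buffer has been spilled

    def spill(self) -> None:
        # Move the current contents to "disk" and drop the host-tier copy.
        self.disk = bytes(self.host)
        self.host = None

    def _materialize(self) -> bytearray:
        if self.host is not None:
            return self.host         # host tier: every access shares one buffer
        return bytearray(self.disk)  # disk tier: fresh temporary copy per access

    def with_write(self, fn):
        # If the buffer is on disk, fn mutates a temporary that is dropped afterwards.
        return fn(self._materialize())

    def with_read(self, fn):
        return fn(bytes(self._materialize()))


def fill(buf: bytearray) -> None:
    buf[:] = b"\x2a" * len(buf)  # write "useful data" in place


def run(spill_before_write: bool) -> bytes:
    handle = ToyHandle(4)
    if spill_before_write:
        handle.spill()
    handle.with_write(fill)
    return handle.with_read(lambda data: data)


print(run(spill_before_write=False))  # b'****'
print(run(spill_before_write=True))   # b'\x00\x00\x00\x00'
```

In the model, the second call loses the written data because the write landed in a temporary copy, which matches the all-zeros symptom described above.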
Expected behavior
We need to figure out where to actually unspill HostMemoryBuffers that have been spilled.
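One way to picture "actually unspilling" is a handle that, on first access after a spill, allocates host-tier storage, copies the spilled bytes into it, and re-points itself at that storage so later accesses share it. The sketch below is a toy Python model under that assumption. PromotingHandle and its methods are hypothetical names, and dropping the disk copy after promotion is a simplification, not a statement about how the plugin should manage tiers.

```python
class PromotingHandle:
    """Hypothetical handle that promotes a spilled buffer back to host memory."""

    def __init__(self, size: int):
        self.host = bytearray(size)  # host-tier storage, zero-filled
        self.disk = None

    def spill(self) -> None:
        self.disk = bytes(self.host)
        self.host = None

    def _materialize(self) -> bytearray:
        if self.host is None:
            # Unspill: allocate host-tier storage, copy from disk, and re-point
            # the handle so that later accesses see this same buffer.
            self.host = bytearray(self.disk)
            self.disk = None  # simplification: the toy model keeps one copy only
        return self.host

    def with_write(self, fn):
        return fn(self._materialize())

    def with_read(self, fn):
        return fn(bytes(self._materialize()))


def fill(buf: bytearray) -> None:
    buf[:] = b"\x2a" * len(buf)


def run_promoting(spill_before_write: bool, spill_after_write: bool = False) -> bytes:
    handle = PromotingHandle(4)
    if spill_before_write:
        handle.spill()
    handle.with_write(fill)
    if spill_after_write:
        handle.spill()  # spilling filled data must also round-trip
    return handle.with_read(lambda data: data)


print(run_promoting(spill_before_write=True))  # b'****'
```

With promotion, a write that follows a spill lands in storage the handle still owns, so the later read sees it.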
Instead of using a local hostbuffer inside of the RapidsDiskBuffer, we might need to allocate a new RapidsHostMemoryBuffer and unspill into it. Maybe something like what RapidsBufferStore.getDeviceMemoryBuffer does.
Environment details (please complete the following information)
I ran this on my local RTX4000 desktop.
Additional context
We can work around this issue in InternalRowToColumnarBatchIterator by just using a single write block. I will probably just do this as part of adding split handling.
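To illustrate why a single write block sidesteps the problem, here is a toy Python sketch (hypothetical names throughout, not plugin code). It assumes, as described in this issue, that a spilled buffer hands out a temporary host copy that lives only for the duration of one block. Filling and consuming inside the same block means the data is used before that temporary goes away.

```python
class ToyHandle:
    """Hypothetical handle with the buggy behavior: spilled data is copied per access."""

    def __init__(self, size: int):
        self.host = bytearray(size)
        self.disk = None

    def spill(self) -> None:
        self.disk = bytes(self.host)
        self.host = None

    def _materialize(self) -> bytearray:
        if self.host is not None:
            return self.host
        return bytearray(self.disk)  # temporary copy, valid for one block only

    def with_write(self, fn):
        return fn(self._materialize())

    def with_read(self, fn):
        return fn(bytes(self._materialize()))


def fill(buf: bytearray) -> None:
    buf[:] = b"\x2a" * len(buf)


def two_blocks(spilled: bool) -> int:
    handle = ToyHandle(4)
    if spilled:
        handle.spill()
    handle.with_write(fill)       # write block
    return handle.with_read(sum)  # separate read block sees reloaded zeros if spilled


def single_block(spilled: bool) -> int:
    handle = ToyHandle(4)
    if spilled:
        handle.spill()

    def fill_and_consume(buf: bytearray) -> int:
        fill(buf)
        return sum(buf)           # consume while the same buffer is still live

    return handle.with_write(fill_and_consume)


print(two_blocks(spilled=True))    # 0
print(single_block(spilled=True))  # 168
```

This only avoids the symptom for this one caller. The underlying unspill behavior is unchanged in the model.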