
Making a batch spillable is expensive #3749

Closed · jlowe opened this issue on Oct 5, 2021 · 9 comments

Labels: performance (A performance related task/issue), task (Work required that improves the product but is not user facing)


jlowe commented Oct 5, 2021

Making a batch spillable involves calling contiguousSplit to form a contiguous buffer that can then be spilled as a single unit. The contiguous split can be relatively expensive (for some reason, especially so when the schema is just a single column of long values), and it is wasted work if the batch is never spilled.

Ideally contiguousSplit should be as cheap as possible, but there may be ways to avoid calling it when it is unnecessary. For example, we could keep a "bounce buffer" for spilling, similar to the bounce buffers used for UCX and GDS: batch buffers would be copied into a contiguous form in device memory before the contiguous buffer is copied to host memory, or potentially copied directly to host memory with a multi-buffer copy kernel if the host memory is pinned. Essentially the idea is to perform an "on-the-fly" contiguous split only when it is needed. That makes a batch spillable very cheaply and wastes no GPU operations, since the transformation of a batch into a contiguous buffer for spilling is performed lazily and only when a spill actually happens.
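A minimal sketch of the lazy approach, with hypothetical names (LazyPackedTable, packForSpill) that are not actual spark-rapids classes; the only real cudf call here is Table.contiguousSplit, which with no split indices packs the whole table into a single contiguous buffer plus packed metadata:

```scala
import ai.rapids.cudf.{ContiguousTable, Table}

/**
 * Illustrative only: hold the original table and defer the contiguous
 * packing until a spill is actually requested.
 */
class LazyPackedTable(table: Table) extends AutoCloseable {
  private var packed: Option[ContiguousTable] = None

  /** Called only on the spill path, so the split cost is paid lazily. */
  def packForSpill(): ContiguousTable = synchronized {
    packed.getOrElse {
      // contiguousSplit with no split indices returns a single ContiguousTable
      // backed by one contiguous device buffer.
      val ct = table.contiguousSplit()(0)
      packed = Some(ct)
      ct
    }
  }

  override def close(): Unit = {
    packed.foreach(_.close())
    table.close()
  }
}
```

Registering a batch would then cost nothing on the GPU; packForSpill is only invoked when the spill framework decides this batch must move off the device.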

jlowe added the `? - Needs Triage`, `task`, and `performance` labels on Oct 5, 2021

abellina commented Oct 5, 2021

20% of the time in q72 is spent in copy_partitions, which is part of contiguousSplit. Other than finding ways to call it less often, or working on an alternate implementation, it seems to me that a next step here is to capture the batches that cause it to be "slow" (and by that I mean around 80 ms in some tests I am running).
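Something like this could flag the slow cases. A hedged sketch, where timedContiguousSplit, the 80 ms threshold, and the logging are illustrative rather than spark-rapids code; the defensive stream sync just makes sure the timing covers the kernel work rather than only the launch:

```scala
import java.util.concurrent.TimeUnit
import ai.rapids.cudf.{ContiguousTable, Cuda, Table}

val slowThresholdMs = 80L  // mirrors the "80ms slow" observation above

def timedContiguousSplit(table: Table): Array[ContiguousTable] = {
  val start = System.nanoTime()
  val result = table.contiguousSplit()
  // Sync so the elapsed time includes copy_partitions, not just enqueueing it.
  Cuda.DEFAULT_STREAM.sync()
  val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
  if (elapsedMs >= slowThresholdMs) {
    println(s"slow contiguousSplit: $elapsedMs ms, rows=${table.getRowCount}, " +
      s"cols=${table.getNumberOfColumns}")
  }
  result
}
```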

sameerz removed the `? - Needs Triage` label on Oct 5, 2021
sperlingxx self-assigned this on Dec 16, 2021
sperlingxx commented Dec 16, 2021

I would like to take on this task if no one already has work in progress or plans to work on it.


jlowe commented Dec 16, 2021

Nobody has work in progress on this, but note that I suspect this is a pretty significant task. Besides the inherent complexity of packing small tables and chunking large ones through the bounce buffer, along with constructing the corresponding packed metadata, it will trigger refactoring around SpillableBatch / LazySpillableBatch, since there should be no reason for the latter if making a batch spillable is "free."
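For reference, a rough sketch of the chunked copy through a device bounce buffer, under assumed names (packToHost); it ignores the padding/alignment and packed metadata that contiguousSplit produces, and a real version would use async copies on a dedicated stream rather than the synchronous Cuda.memcpy used here:

```scala
import ai.rapids.cudf.{Cuda, CudaMemcpyKind, DeviceMemoryBuffer, HostMemoryBuffer}

/**
 * Illustrative only: stage scattered device buffers through a fixed-size
 * device bounce buffer, writing them contiguously into host memory.
 */
def packToHost(
    columns: Seq[DeviceMemoryBuffer], // scattered buffers backing the batch
    host: HostMemoryBuffer,           // preallocated contiguous host target
    bounce: DeviceMemoryBuffer): Unit = {
  var hostOffset = 0L
  var staged = 0L // bytes currently sitting in the bounce buffer

  def flush(): Unit = if (staged > 0) {
    // Drain the bounce buffer to the next free region of the host buffer.
    Cuda.memcpy(host.getAddress + hostOffset, bounce.getAddress, staged,
      CudaMemcpyKind.DEVICE_TO_HOST)
    hostOffset += staged
    staged = 0
  }

  columns.foreach { col =>
    var srcOffset = 0L
    while (srcOffset < col.getLength) {
      val chunk = math.min(bounce.getLength - staged, col.getLength - srcOffset)
      if (chunk == 0) {
        flush() // bounce buffer is full; drain it before staging more
      } else {
        // Stage the next slice of this column contiguously after what is
        // already in the bounce buffer.
        Cuda.memcpy(bounce.getAddress + staged, col.getAddress + srcOffset,
          chunk, CudaMemcpyKind.DEVICE_TO_DEVICE)
        staged += chunk
        srcOffset += chunk
      }
    }
  }
  flush() // drain whatever is left
}
```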


abellina commented Jan 7, 2022

@sperlingxx have you been able to look into this issue? If not, I should have time to look into it. If you have something, I am also happy to review it or help test it out.

sperlingxx commented:

Hi @abellina, I haven't started it; I am still just learning the background. I am happy for you to take over this task, since you are more familiar with the bounce buffer and spilling in spark-rapids.

sperlingxx removed their assignment on Jan 10, 2022
abellina commented:

I am on the hook to provide data on this. I'll update the issue with a speed-of-light (SOL) set of numbers, then we can prioritize it accordingly with the data on hand.

abellina self-assigned this on Jan 28, 2022

abellina commented Feb 11, 2022

Sorry it has taken me a while to get back with some numbers. Here are a couple of queries that are very sensitive to contiguous_split: q72 and q95.

In the past, calling contiguous_split could account for 20% or more of the kernel time in these queries. But since rapidsai/cudf#9755, the percentage of time we spend here has gone down drastically. When @nvdbaranec's change was merged, q72 became nearly 1 minute cheaper (1.3x faster) and q95 became 22 seconds cheaper (2.3x faster).

As of today, if I run with a PoC that doesn't call contiguous_split when we hand the batch to the catalog, I am seeing a ~9% reduction in time for q72. q72 nowadays is closer to 120 seconds, and with the change I see it taking 110 seconds. Note this is a speed-of-light number: the PoC doesn't account for the cost of actually laying out the columns in a contiguous buffer when we do need to spill. For q95 I am not seeing a noticeable difference. I'll re-run the SOL patch against all queries and report back with the overall effect, but in prior runs it was near noise.

abellina commented Feb 11, 2022

Running all queries with the SOL code, the total time is about the same for both, with the proof of concept being 0.05x faster, so very much in the noise. Given the original request and the complexity of adding the bounce buffer to actually handle this well, I don't believe this is the highest-priority task. @jlowe, would you want to keep this open so someone can look into it in the future?


jlowe commented Feb 11, 2022

I think for now we can close this. We can reopen or file a new issue if we notice it becoming a significant drag on performance in the future.

jlowe closed this as completed on Feb 11, 2022