
Making a batch spillable is expensive #3749

Closed · jlowe opened this issue on Oct 5, 2021 · 9 comments

Labels: performance (A performance related task/issue), task (Work required that improves the product but is not user facing)


jlowe commented Oct 5, 2021

Making a batch spillable involves calling contiguousSplit to form a contiguous buffer that can then be spilled as a single unit. The contiguous split can be relatively expensive (for some reason, especially so when the schema is just a single column of long values), and it is wasted work if the batch is never spilled.

Ideally contiguousSplit should be as cheap as possible, but there may be ways to avoid calling it when it is unnecessary. For example, we could keep a "bounce buffer" for spilling, similar to the bounce buffers used for UCX and GDS: batch buffers would be copied into a contiguous form in device memory before the contiguous buffer is copied to host memory, or potentially copied directly to host memory with a multi-buffer copy kernel if the host memory is pinned. Essentially the idea is to perform an "on-the-fly" contiguous split only when it is needed. That makes a batch spillable very cheaply and wastes no GPU operations, since the transformation of a batch into a contiguous buffer for spilling is performed lazily and only when a spill actually happens.
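A minimal sketch of the lazy approach, with hypothetical names (LazyPackedTable, packForSpill) that are not actual spark-rapids classes; the only real cudf call here is Table.contiguousSplit, which with no split indices packs the whole table into a single contiguous buffer plus packed metadata:

```scala
import ai.rapids.cudf.{ContiguousTable, Table}

/**
 * Illustrative only: hold the original table and defer the contiguous
 * packing until a spill is actually requested.
 */
class LazyPackedTable(table: Table) extends AutoCloseable {
  private var packed: Option[ContiguousTable] = None

  /** Called only on the spill path, so the split cost is paid lazily. */
  def packForSpill(): ContiguousTable = synchronized {
    packed.getOrElse {
      // contiguousSplit with no split indices returns a single ContiguousTable
      // backed by one contiguous device buffer.
      val ct = table.contiguousSplit()(0)
      packed = Some(ct)
      ct
    }
  }

  override def close(): Unit = {
    packed.foreach(_.close())
    table.close()
  }
}
```

Registering a batch would then cost nothing on the GPU; packForSpill is only invoked when the spill framework decides this batch must move off the device.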

jlowe added the `? - Needs Triage`, `task`, and `performance` labels on Oct 5, 2021

abellina commented Oct 5, 2021

20% of the time in q72 is spent in copy_partitions, which is part of contiguousSplit. Other than finding ways to call it less often, or working on an alternate implementation, it seems to me that a next step here is to capture the batches that cause it to be "slow" (and by that I mean around 80 ms in some tests I am running).
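Something like this could flag the slow cases. A hedged sketch, where timedContiguousSplit, the 80 ms threshold, and the logging are illustrative rather than spark-rapids code; the defensive stream sync just makes sure the timing covers the kernel work rather than only the launch:

```scala
import java.util.concurrent.TimeUnit
import ai.rapids.cudf.{ContiguousTable, Cuda, Table}

val slowThresholdMs = 80L  // mirrors the "80ms slow" observation above

def timedContiguousSplit(table: Table): Array[ContiguousTable] = {
  val start = System.nanoTime()
  val result = table.contiguousSplit()
  // Sync so the elapsed time includes copy_partitions, not just enqueueing it.
  Cuda.DEFAULT_STREAM.sync()
  val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
  if (elapsedMs >= slowThresholdMs) {
    println(s"slow contiguousSplit: $elapsedMs ms, rows=${table.getRowCount}, " +
      s"cols=${table.getNumberOfColumns}")
  }
  result
}
```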

sameerz removed the `? - Needs Triage` label on Oct 5, 2021
sperlingxx self-assigned this on Dec 16, 2021
sperlingxx commented Dec 16, 2021

I would like to take on this task if no one already has work in progress or plans to work on it.


jlowe commented Dec 16, 2021

Nobody has work in progress on this, but note that I suspect this is a pretty significant task. Besides the inherent complexity of packing small tables and chunking large ones through the bounce buffer, along with constructing the corresponding packed metadata, it will trigger refactoring around SpillableBatch / LazySpillableBatch, since there should be no reason for the latter if making a batch spillable is "free."
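For reference, a rough sketch of the chunked copy through a device bounce buffer, under assumed names (packToHost); it ignores the padding/alignment and packed metadata that contiguousSplit produces, and a real version would use async copies on a dedicated stream rather than the synchronous Cuda.memcpy used here:

```scala
import ai.rapids.cudf.{Cuda, CudaMemcpyKind, DeviceMemoryBuffer, HostMemoryBuffer}

/**
 * Illustrative only: stage scattered device buffers through a fixed-size
 * device bounce buffer, writing them contiguously into host memory.
 */
def packToHost(
    columns: Seq[DeviceMemoryBuffer], // scattered buffers backing the batch
    host: HostMemoryBuffer,           // preallocated contiguous host target
    bounce: DeviceMemoryBuffer): Unit = {
  var hostOffset = 0L
  var staged = 0L // bytes currently sitting in the bounce buffer

  def flush(): Unit = if (staged > 0) {
    // Drain the bounce buffer to the next free region of the host buffer.
    Cuda.memcpy(host.getAddress + hostOffset, bounce.getAddress, staged,
      CudaMemcpyKind.DEVICE_TO_HOST)
    hostOffset += staged
    staged = 0
  }

  columns.foreach { col =>
    var srcOffset = 0L
    while (srcOffset < col.getLength) {
      val chunk = math.min(bounce.getLength - staged, col.getLength - srcOffset)
      if (chunk == 0) {
        flush() // bounce buffer is full; drain it before staging more
      } else {
        // Stage the next slice of this column contiguously after what is
        // already in the bounce buffer.
        Cuda.memcpy(bounce.getAddress + staged, col.getAddress + srcOffset,
          chunk, CudaMemcpyKind.DEVICE_TO_DEVICE)
        staged += chunk
        srcOffset += chunk
      }
    }
  }
  flush() // drain whatever is left
}
```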


abellina commented Jan 7, 2022

@sperlingxx have you been able to look into this issue? If not, I should have time to look into it. If you have something, I am also happy to review it or help test it out.

sperlingxx commented:

Hi @abellina, I haven't started it; I am still just learning the background. I am happy for you to take over this task, since you are more familiar with the bounce buffer and spilling in spark-rapids.

sperlingxx removed their assignment on Jan 10, 2022
abellina commented:

I am on the hook to provide data on this. I'll update the issue with a speed-of-light (SOL) set of numbers, then we can prioritize it accordingly with the data on hand.

abellina self-assigned this on Jan 28, 2022

abellina commented Feb 11, 2022

Sorry it has taken me a while to get back with some numbers. Here are a couple of queries that are very sensitive to contiguous_split: q72 and q95.

In the past, calling contiguous_split could account for 20% or more of the kernel time in these queries. But since rapidsai/cudf#9755, the percentage of time we spend here has gone down drastically. When @nvdbaranec's change was merged, q72 became nearly 1 minute cheaper (1.3x faster) and q95 became 22 seconds cheaper (2.3x faster).

As of today, if I run with a PoC that doesn't call contiguous_split when we hand the batch to the catalog, I am seeing a ~9% reduction in time for q72. q72 nowadays is closer to 120 seconds, and with the change I see it taking 110 seconds. Note this is a speed-of-light number: the PoC doesn't account for the cost of actually laying out the columns in a contiguous buffer when we do need to spill. For q95 I am not seeing a noticeable difference. I'll re-run the SOL patch against all queries and report back with the overall effect, but in prior runs it was near noise.

abellina commented Feb 11, 2022

Running all queries with the SOL code, the total time is about the same for both, with the proof of concept being 0.05x faster, so very much in the noise. Given the original request and the complexity of adding the bounce buffer to actually handle this well, I don't believe this is the highest-priority task. @jlowe, would you want to keep this open so someone can look into it in the future?


jlowe commented Feb 11, 2022

I think for now we can close this. We can reopen or file a new issue if we notice it becoming a significant drag on performance in the future.

jlowe closed this as completed on Feb 11, 2022