[QST] Performance question #4894

Closed
eyalhir74 opened this issue Mar 2, 2022 · 4 comments
Assignees
Labels
question (Further information is requested)

Comments

@eyalhir74

eyalhir74 commented Mar 2, 2022

I'm at the beginning of trying to convert our huge data/Spark code from the CPU to the GPU. While I have vast experience with GPUs/CUDA, I find it a bit hard to pinpoint exactly what the performance limitations I'm seeing are. Furthermore, I'm looking at many queries, each with different issues (such as unsupported features, strings, Parquet issues, etc.)

I've followed all the performance tuning guides and the suggested actions here, but would still be happy to get more insights and assistance if possible :)

I am currently looking at the following query. CPU time is roughly the same as GPU time.

```sql
select first(x1, true), first(x2, true), ..., first(x26, true),
       sum(case when y1 = 'aa' and y2 then 1 else 0 end),
       sum(case when y1 = 'bb' and y2 then 1 else 0 end)
where (z1 is null or z1 = false)
  and (z2 is null or z2 = false)
  and (z3 is null or z3 = false)
  and (z4 is null or z4 = false)
group by some_string_field_up_to_50_chars
```

Fields x1 to x26 are either strings or booleans.

Attached is the nvprof output. If I understand it correctly, 50% of the compute time is string related, and 20% is decoding the Parquet page information (i.e. not even decompressing the data itself)?

I guess my questions are:

  • Is there some way of improving the query's GPU performance? (I've replaced the FIRST operators with MIN, which didn't help.)
  • Can I somehow evaluate whether the Parquet file/configuration is harming the GPU's performance? (I'm running it on a small subset of the data: 100-300MB files, for a total of 35GB. The full data is over a terabyte.)
  • Previous answers here suggested running the qualification tool and explain tool. Can I somehow retrieve information on whether the data being fed to the GPU is too small (because of a misconfiguration, small Parquet files, etc.)?

Any further assistance is more than welcome :)

Screenshot from 2022-03-02 17-21-50

@eyalhir74 added the "? - Needs Triage" and "question" labels Mar 2, 2022
@jlowe
Contributor

jlowe commented Mar 2, 2022

> Is there some way of improving the query's GPU performance? (I've replaced the FIRST operators with MIN, which didn't help.)

What version of the RAPIDS Accelerator are you using? There have been recent performance optimizations in contiguous_split that could improve the GPU efficiency, see rapidsai/cudf#9755. Normally contiguous_split should not be the most expensive kernel being run on the GPU.

The profile traces also show a large Parquet decode time which we've seen in some input files, cc: @nvdbaranec who is currently working on optimizing for those cases.

GPU traces are nice for figuring out what's going on with the GPU, but they don't do the best job of conveying what's happening at a high level with the query. I would suggest looking at the Spark SQL and job web UIs for the CPU and GPU queries to see whether there are indications of where the bottlenecks are. For example, which stage(s) are consuming the most time? What operations within those stages appear to be the most expensive according to the SQL metrics? If it's an initial stage loading Parquet, does the buffer time greatly exceed the GPU decode time, indicating the query has a notable I/O bottleneck?

> Can I somehow evaluate whether the Parquet file/configuration is harming the GPU's performance?

This is related to the "where are we spending all the time" question. I cannot tell just from the trace above whether reading Parquet is the real bottleneck for this query or not, since I cannot see the query plans and how much time is spent in each stage (or nodes within stages). The Parquet decode kernel time is pretty big, so I suspect it is a significant contributor.

You could try varying the number of input tasks (e.g.: via changing the max input partition size config) to see if that noticeably changes the performance. That would indicate whether scaling the input data per task is helping increase the GPU efficiency.
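For example, here's a minimal sketch (in PySpark, with placeholder values rather than tuned recommendations) of sweeping the max input partition size and timing the query:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
query = "..."  # placeholder for the SQL above

# Larger values feed more data per task (and thus per GPU batch);
# smaller values create more, smaller tasks.
for max_bytes in ["128MB", "512MB", "2GB"]:
    spark.conf.set("spark.sql.files.maxPartitionBytes", max_bytes)
    start = time.time()
    spark.sql(query).collect()
    print(max_bytes, time.time() - start)
```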

> Previous answers here suggested running the qualification tool and explain tool.

This is because the first step is figuring out where the bottleneck is at a high level. The qualification/explain tool can show whether parts of the query are not running on the GPU and therefore incur CPU<->GPU transition overhead. The profiling tool can examine the eventlog and extract metrics, which is similar to examining the Spark SQL web UI for your query. Once we have a high-level idea of where the bottleneck is in the query, then we can focus more on what's going on in that area of the query.

> Can I somehow retrieve information on whether the data being fed to the GPU is too small (because of a misconfiguration, small Parquet files, etc.)?

This is difficult to automatically determine. The GPU parallelism of a task read is dependent on how many columns you're loading and how many data buffers are being loaded per column. The worst-case scenario is loading only one column that has only a single, large buffer to decompress and decode, as there aren't very many opportunities for parallelism there (i.e.: compression decode, Parquet decode). Having multiple buffers that need compression and Parquet decoding is where a lot of the GPU parallelism (and therefore performance) is derived during Parquet loads.

@eyalhir74
Author

eyalhir74 commented Mar 3, 2022

@jlowe I'm using RAPIDS Accelerator 21.12.0 with cudf 21.12.0.
I would be very happy to assist/check our queries if there's a performance-enhancement Parquet branch.

Changing spark.sql.files.maxPartitionBytes and spark.rapids.sql.concurrentGpuTasks didn't provide a significant performance change.

Going through the explain plan, I see a few non-GPU-compliant ops (#4900); maybe this can explain the low performance?

As for the last item, regarding the data size fed into the GPU, I wasn't clear about what I meant. I understand that, generally speaking, determining "this work is too small/big for the GPU" is not trivial/doable. I was thinking more along the lines that RAPIDS would be able to specify how many columns/rows/Parquet pages (and any other performance-relevant properties) were used, and maybe that would give hints as to whether the Spark configuration/Parquet files are a reason for underutilizing the GPU. I hope that makes sense.

Attached is the plan file for this query. I've replaced all field names as they can't be shared.

Any further ideas would be greatly appreciated.
Query.txt

@eyalhir74
Author

Some Spark UI screenshots.

```
== Physical Plan ==
CollectLimit (13)
+- * Project (12)
   +- GpuColumnarToRow (11)
      +- GpuHashAggregate (10)
         +- GpuSort (9)
            +- GpuShuffleCoalesce (8)
               +- GpuColumnarExchange (7)
                  +- GpuHashAggregate (6)
                     +- GpuSort (5)
                        +- GpuProject (4)
                           +- GpuCoalesceBatches (3)
                              +- GpuFilter (2)
                                 +- GpuScan parquet (1)
```

Screenshot from 2022-03-03 11-45-22
Screenshot from 2022-03-03 11-44-37

Screenshot from 2022-03-03 11-46-51
Screenshot from 2022-03-03 11-46-43
Screenshot from 2022-03-03 11-46-31
Screenshot from 2022-03-03 11-45-48

@jlowe
Contributor

jlowe commented Mar 8, 2022

> I'm using RAPIDS Accelerator 21.12.0 with cudf 21.12.0.

Given the relatively high cost of the contiguous_split kernel in your traces, I suggest updating to the 22.02 release of the RAPIDS Accelerator and cudf. That release includes some performance fixes for contiguous_split that may help your use case.
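If it helps, a minimal PySpark sketch of pointing a session at the 22.02 artifacts could look like the following; the jar paths are placeholders for wherever the updated plugin and cudf jars live on your cluster:

```python
from pyspark.sql import SparkSession

# Placeholder locations for the 22.02 jars; adjust to your deployment.
jars = ",".join([
    "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-22.02.0.jar",
    "/opt/sparkRapidsPlugin/cudf-22.02.0-cuda11.jar",
])

spark = (
    SparkSession.builder
    .config("spark.jars", jars)
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # RAPIDS Accelerator entry point
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)
```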

> Going through the explain plan, I see a few non-GPU-compliant ops.

According to the physical plan and stage runtime statistics, I don't think the impact of this is significant. The only operations that aren't running on the GPU are the Project and CollectLimit occurring right at the end of the query. This is the last stage of the job, which appears to have taken only 0.4 seconds, while the main stage took 1.1 minutes. Also note that the SQL metrics show that only 26,080 rows went through that part, which is a small fraction of the 33+ million rows we started with.

This job appears to be mostly about the initial stage, and specifically the Parquet load within that stage. That explains why changing other parts of the query doesn't seem to help much. Looking at the metrics associated with the Parquet load, buffer time is significant, with the average task spending 3.6 seconds just fetching the data from the distributed filesystem, while the GPU takes only 305 milliseconds decoding it afterwards. So I/O overhead seems to be a significant factor for this query, which helps explain the low GPU utilization.

> I was thinking more along the lines that RAPIDS would be able to specify how many columns/rows/Parquet pages (and any other performance-relevant properties) were used, and maybe that would give hints as to whether the Spark configuration/Parquet files are a reason for underutilizing the GPU.

This isn't something RAPIDS cudf supports reporting back as a side product of the load, although one could identify those metrics separately via Parquet tools that examine the footers of the Parquet files. That said, given the metrics above, this query seems to be more about waiting for I/O than waiting for the GPU: the metrics show tasks spending over 10x as long waiting for raw Parquet data from the distributed filesystem as waiting for the GPU to decompress and decode it.
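As a minimal sketch (this is not something the plugin reports for you), pyarrow can dump the footer metadata that determines how much decode parallelism is available per file; the path below is a placeholder for one of your ~300MB input files:

```python
import pyarrow.parquet as pq

# Placeholder path: point this at one of the input Parquet files.
meta = pq.ParquetFile("part-00000.snappy.parquet").metadata

print(f"columns={meta.num_columns} rows={meta.num_rows} row_groups={meta.num_row_groups}")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # More row groups / column chunks per file generally means more
    # independent buffers the GPU can decompress and decode in parallel.
    print(f"row group {i}: {rg.num_rows} rows, "
          f"{rg.total_byte_size} bytes across {rg.num_columns} column chunks")
```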

@sameerz removed the "? - Needs Triage" label Mar 11, 2022
@NVIDIA locked and limited conversation to collaborators Apr 27, 2022
@sameerz converted this issue into discussion #5333 Apr 27, 2022
