-
I'm running a TPC-H SQL query (see at the bottom) against a Parquet file (a 100 GB database). What could be the reason for such low GPU utilization?
-
The BlazingSQL comment has me a bit confused. Is this using the RAPIDS Accelerator or BlazingSQL? If the latter, then you'll need to raise a question with the BlazingSQL project at https://github.com/BlazingDB/blazingsql. I'm assuming this is the RAPIDS Accelerator case.

If cudf has been built with NVTX support (which has been true for the published cudf jars for the last few versions), you can enable NVTX ranges by setting the corresponding Java property. Time corresponding to the Spark shuffle is not explicitly identified as an NVTX range (doing so would require modifying Apache Spark itself to add the corresponding NVTX ranges), but other ranges within the RAPIDS Accelerator and cudf can help identify shuffle activity (such as deserializing batches from the Spark shuffle, the host-side coalesce of Spark shuffle batches, etc.). Another potential reason for a gap could be time spent reading Parquet data from the filesystem into memory, and there are NVTX ranges in the RAPIDS Accelerator that cover that.

I also suggest checking out the Tuning Guide, as it provides tips for increasing performance. For example, if you have sufficient GPU memory available for a particular query, you can increase the task parallelism on the GPU, which often increases GPU utilization and performance. This setting defaults to a low value, since there is a chance of running out of GPU memory when too many tasks run concurrently on the GPU.
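As a rough illustration, the two suggestions above (enabling cudf's NVTX ranges and raising GPU task parallelism) might be wired up on the command line roughly like this. The property and config names here are assumptions based on my reading of the public spark-rapids/cudf documentation, not something stated in this thread, so verify them against the release you are running:

```shell
# Hypothetical sketch: enable cudf NVTX ranges and raise GPU task
# parallelism. The property and config names are assumptions to be
# checked against your spark-rapids/cudf version's documentation.
spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true \
  --conf spark.rapids.sql.concurrentGpuTasks=2
```

Raising `spark.rapids.sql.concurrentGpuTasks` trades GPU memory headroom for overlap, which is why the default is conservative.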
-
BlazingSQL is probably a mistake on my part. Yes, I am using the RAPIDS Accelerator. OK, I managed to run an Nsight profile and get the markers.
I am using 6 concurrent tasks, with NVTX ranges enabled. I'll have another in-depth look at the Tuning Guide as you've suggested.
-
I suspect at least some of these gaps are caused by Spark's handling of shuffle, where the CPU compresses the data, writes it to disk, re-writes the files on disk to coalesce them, then reads and decompresses them on shuffle fetch. With UCX shuffle (i.e., the RAPIDS Shuffle Manager), much of that CPU- and disk-centric processing can be avoided.

All of these screenshots focus on the GPU section of the profile, which is great for verifying that there are gaps in the GPU utilization, but it won't directly help much in understanding why the gaps exist. NVTX ranges in the CPU section, along with CPU scheduling information, will provide more help there. The most interesting CPU threads to examine will have a name that starts with "Executor task", and there should be as many of those as the number of CPU cores assigned to the Spark executor. Enabling CPU scheduling info when capturing the profile can be helpful when some range appears larger than normal, as often this is because the thread stops running on the CPU for a bit (e.g., during a JVM GC cycle, while blocking on some lock, etc.). Would it be possible to attach the Nsight qdrep file? Then I would be able to help dig into sections that aren't available in screenshots and zoom into the sections that are.

Given there are multiple streams in the profile, it's clear cudf is using per-thread default streams, which should allow overlapping of GPU work between Spark tasks on the GPU. From the second screenshot it appears streams 14 and 16 are overlapping somewhat, but I cannot zoom in to verify.
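For reference, a capture along the lines suggested above (NVTX/CUDA tracing plus CPU sampling and context-switch, i.e. scheduling, data) might look roughly like this. The exact flag spellings are assumptions to confirm against your Nsight Systems version with `nsys profile --help`:

```shell
# Hypothetical sketch: capture a profile with NVTX/CUDA tracing plus
# CPU sampling and context-switch (scheduling) information.
# Flag names are assumptions; confirm with `nsys profile --help`.
nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=cpu \
  --cpuctxsw=process-tree \
  -o spark-gpu-profile \
  ./run-my-spark-query.sh
```

With the scheduling data captured, an "Executor task" thread that goes off-CPU mid-range (GC pause, lock contention) shows up directly in the timeline rather than just as an unexplained long range.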
-
@jlowe Thanks for the assistance.
Yes, it seems so. The per-thread default stream feature is one of NVIDIA's best out-of-the-box solutions/features.
-
Thanks a ton for the qdrep file, that helps a lot! So yes, shuffle and buffer reading are a significant chunk of this.

Looking at the first stage of the query, where it first starts reading Parquet: all the yellow ranges after the purple "Hash partition" and green "Parquet readBatch" ranges on that thread are serializing out the task's output for the Spark shuffle. The gaps between those are the CPU thread writing out shuffle data to disk and scheduling the next task. These are all outside the scope of the RAPIDS Accelerator and are part of standard Apache Spark. Once we get to the Parquet read, we can see that the first 137 ms of that range is spent buffering the input data from the filesystem before the data is sent to the GPU, where it is decompressed and decoded. Similarly, for the second stage of this query, we can see time spent off the GPU deserializing batches (i.e., the shuffle read).

This shuffle behavior is inherent in the way the built-in Spark shuffle works. It is CPU-centric, with all of the data being handled by the CPU, compressed and decompressed by the CPU, and written to and read from disk by the CPU. One way to mitigate this somewhat is to reduce the number of shuffle partitions involved, as discussed in the Tuning Guide. Another option is to switch to the RAPIDS Shuffle Manager, which is a GPU-centric shuffle that caches task outputs on the GPU as much as possible. This could work particularly well in a local-mode setup if the GPU can hold most of the shuffle data in memory.
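The two mitigations described above could be sketched as command-line settings along these lines. `spark.sql.shuffle.partitions` is standard Spark; the RAPIDS Shuffle Manager class name is version-specific (the Spark-version suffix below is a placeholder assumption), so take its exact spelling from the spark-rapids documentation for your release:

```shell
# Hypothetical sketch of the two shuffle mitigations:
#  1. fewer shuffle partitions (a standard Spark config)
#  2. the RAPIDS Shuffle Manager (class name is version-specific;
#     the "spark311" suffix here is an assumed placeholder)
spark-submit \
  --conf spark.sql.shuffle.partitions=48 \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager \
  my-tpch-query.py
```

Fewer, larger shuffle partitions reduce the per-partition serialization and disk overhead visible in the profile, while the GPU-centric shuffle avoids round-tripping data through CPU compression and disk entirely when it fits in GPU memory.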
-
@jlowe Thanks a lot for the detailed explanation, Jason! One last general question/suggestion, if I may: how does the RAPIDS library compare to other solutions, performance-wise?
-
I don't know of any benchmarks vs. non-Spark solutions such as Oracle, MySQL, etc. The primary goal of the RAPIDS Accelerator is to help users that have existing ETL pipelines on Spark accelerate those workloads using GPUs. If users are open to any solution and willing to rewrite their pipeline logic accordingly, then there are many, many options out there to consider, including BlazingSQL and Dask cuDF, which also leverage the RAPIDS stack.

As for benchmarking against Apache Spark on the CPU, I recommend checking out the recent GTC '21 session, Running Large-Scale ETL Benchmarks with GPU-Accelerated Apache Spark, which compares the performance and cost of running the RAPIDS Accelerator against standard Spark on various platforms. Since the GPU gaps in your query primarily involve shuffle, I also recommend checking out the GTC '21 session that discusses the RAPIDS Shuffle Manager in more detail, Accelerating Apache Spark Shuffle with UCX.
-
Thank you very much, @jlowe, for all the assistance and information.
-
Thanks, @eyalhir74! I am closing this question as answered. Feel free to reopen or file a new issue if you have more questions about these profile traces.