-
I'm running a TPC-H SQL query (see at the bottom) against a Parquet file (a 100 GB database). What could be the reason for such low GPU utilization?
-
The BlazingSQL comment has me a bit confused. Is this using the RAPIDS Accelerator or BlazingSQL? If the latter, then you'll need to raise a question with the BlazingSQL project at https://github.com/BlazingDB/blazingsql. I'm assuming this is the RAPIDS Accelerator case.

If cudf has been built with NVTX support (which has been true for the published cudf jars for the last few versions), you can enable NVTX ranges by setting the corresponding Java property. Time corresponding to the Spark shuffle is not explicitly identified as an NVTX range (doing so would require modifying Apache Spark itself to add the corresponding NVTX ranges), but other ranges within the RAPIDS Accelerator and cudf can help identify shuffle activity (such as deserializing batches from the Spark shuffle, the host-side coalesce of Spark shuffle batches, etc.). Another potential reason for a gap could be time spent reading Parquet data from the filesystem into memory, and there are NVTX ranges in the RAPIDS Accelerator that cover that.

I also suggest checking out the Tuning Guide, as it provides tips for increasing performance. For example, if you have sufficient GPU memory available for a particular query, you can increase the task parallelism on the GPU, which often increases GPU utilization and performance. This setting defaults to a low value, since there is a chance of running out of GPU memory when too many tasks run concurrently on the GPU.
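As a rough illustration, the two suggestions above (enabling cudf's NVTX ranges and raising GPU task parallelism) might be wired up on the command line roughly like this. The property and config names here are assumptions based on my reading of the public spark-rapids/cudf documentation, not something stated in this thread, so verify them against the release you are running:

```shell
# Hypothetical sketch: enable cudf NVTX ranges and raise GPU task
# parallelism. The property and config names are assumptions to be
# checked against your spark-rapids/cudf version's documentation.
spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true \
  --conf spark.rapids.sql.concurrentGpuTasks=2
```

Raising `spark.rapids.sql.concurrentGpuTasks` trades GPU memory headroom for overlap, which is why the default is conservative.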
-
BlazingSQL is probably a mistake on my part. Yes, I am using the RAPIDS Accelerator. OK, I managed to run an Nsight profile and get the markers.
I am using 6 concurrent tasks, with NVTX ranges enabled. I'll have another in-depth look at the Tuning Guide as you've suggested.
-
I suspect at least some of these gaps are caused by Spark's handling of shuffle, where the CPU compresses the data, writes it to disk, re-writes the files on disk to coalesce them, then reads and decompresses them on shuffle fetch. With UCX shuffle (i.e., the RAPIDS Shuffle Manager), much of that CPU- and disk-centric processing can be avoided.

All of these screenshots focus on the GPU section of the profile, which is great for verifying that there are gaps in the GPU utilization, but it won't directly help much in understanding why the gaps exist. NVTX ranges in the CPU section, along with CPU scheduling information, will provide more help there. The most interesting CPU threads to examine will have a name that starts with "Executor task", and there should be as many of those as the number of CPU cores assigned to the Spark executor. Enabling CPU scheduling info when capturing the profile can be helpful when some range appears larger than normal, as often this is because the thread stops running on the CPU for a bit (e.g., during a JVM GC cycle, while blocking on some lock, etc.). Would it be possible to attach the Nsight qdrep file? Then I would be able to help dig into sections that aren't available in screenshots and zoom into the sections that are.

Given there are multiple streams in the profile, it's clear cudf is using per-thread default streams, which should allow overlapping of GPU work between Spark tasks on the GPU. From the second screenshot it appears streams 14 and 16 are overlapping somewhat, but I cannot zoom in to verify.
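For reference, a capture along the lines suggested above (NVTX/CUDA tracing plus CPU sampling and context-switch, i.e. scheduling, data) might look roughly like this. The exact flag spellings are assumptions to confirm against your Nsight Systems version with `nsys profile --help`:

```shell
# Hypothetical sketch: capture a profile with NVTX/CUDA tracing plus
# CPU sampling and context-switch (scheduling) information.
# Flag names are assumptions; confirm with `nsys profile --help`.
nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=cpu \
  --cpuctxsw=process-tree \
  -o spark-gpu-profile \
  ./run-my-spark-query.sh
```

With the scheduling data captured, an "Executor task" thread that goes off-CPU mid-range (GC pause, lock contention) shows up directly in the timeline rather than just as an unexplained long range.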
-
@jlowe Thanks for the assistance.
Yes, it seems so. The per-thread default stream feature is one of NVIDIA's best out-of-the-box solutions/features.
-
Thanks a ton for the qdrep file, that helps a lot! So yes, shuffle and buffer reading are a significant chunk of this.

Looking at the first stage of the query, where it first starts reading Parquet: all the yellow ranges after the purple "Hash partition" and green "Parquet readBatch" ranges on that thread are serializing out the task's output for the Spark shuffle. The gaps between those are the CPU thread writing out shuffle data to disk and scheduling the next task. These are all outside the scope of the RAPIDS Accelerator and are part of standard Apache Spark. Once we get to the Parquet read, we can see that the first 137 ms of that range is spent buffering the input data from the filesystem before the data is sent to the GPU, where it is decompressed and decoded. Similarly, for the second stage of this query, we can see time spent off the GPU deserializing batches (i.e., the shuffle read).

This shuffle behavior is inherent in the way the built-in Spark shuffle works. It is CPU-centric, with all of the data being handled by the CPU, compressed and decompressed by the CPU, and written to and read from disk by the CPU. One way to mitigate this somewhat is to reduce the number of shuffle partitions involved, as discussed in the Tuning Guide. Another option is to switch to the RAPIDS Shuffle Manager, which is a GPU-centric shuffle that caches task outputs on the GPU as much as possible. This could work particularly well in a local-mode setup if the GPU can hold most of the shuffle data in memory.
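The two mitigations described above could be sketched as command-line settings along these lines. `spark.sql.shuffle.partitions` is standard Spark; the RAPIDS Shuffle Manager class name is version-specific (the Spark-version suffix below is a placeholder assumption), so take its exact spelling from the spark-rapids documentation for your release:

```shell
# Hypothetical sketch of the two shuffle mitigations:
#  1. fewer shuffle partitions (a standard Spark config)
#  2. the RAPIDS Shuffle Manager (class name is version-specific;
#     the "spark311" suffix here is an assumed placeholder)
spark-submit \
  --conf spark.sql.shuffle.partitions=48 \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager \
  my-tpch-query.py
```

Fewer, larger shuffle partitions reduce the per-partition serialization and disk overhead visible in the profile, while the GPU-centric shuffle avoids round-tripping data through CPU compression and disk entirely when it fits in GPU memory.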
-
@jlowe Thanks a lot for the detailed explanation, Jason! One last general question/suggestion, if I may: how does the RAPIDS library compare to other solutions, performance-wise?
-
I don't know of any benchmarks vs. non-Spark solutions such as Oracle, MySQL, etc. The primary goal of the RAPIDS Accelerator is to help users that have existing ETL pipelines on Spark accelerate those workloads using GPUs. If users are open to any solution and willing to rewrite their pipeline logic accordingly, then there are many, many options out there to consider, including BlazingSQL and Dask cuDF, which also leverage the RAPIDS stack.

As for benchmarking against Apache Spark on the CPU, I recommend checking out the recent GTC '21 session, Running Large-Scale ETL Benchmarks with GPU-Accelerated Apache Spark, which compares the performance and cost of running the RAPIDS Accelerator against standard Spark on various platforms. Since the GPU gaps in your query primarily involve shuffle, I also recommend checking out the GTC '21 session that discusses the RAPIDS Shuffle Manager in more detail, Accelerating Apache Spark Shuffle with UCX.
-
Thank you very much, @jlowe, for all the assistance and information.
-
Thanks, @eyalhir74! I am closing this question as answered. Feel free to reopen or file a new issue if you have more questions about these profile traces.