-
When the data size of a partition exceeds the size configured by spark.task.resource.gpu.amount, will the framework automatically process it in batches?
-
Each task will process its data in different ways, not necessarily collecting it all at once, so it's not guaranteed that a task will run out of memory when its input partition exceeds its memory budget. For example, if all the task is doing is filtering rows, then it could stream the data from the input partition in small batches and process a huge input partition with relatively low memory.

However, GPUs process data in a columnar fashion and perform better when there is a significant amount of input to process per batch, so the batches used will often be significantly larger than the row-based approach Spark uses on the CPU. The GPU batch size can be somewhat controlled by the `spark.rapids.sql.batchSizeBytes` configuration setting. Note that using an extreme setting is not a good idea: very small batches throw away the efficiency of columnar processing on the GPU, while very large batches increase the risk of GPU out-of-memory errors.

If you have not done so already, I suggest checking out our tuning guide and specifically the section on concurrent tasks per GPU. We typically have seen the best performance when the number of tasks per GPU is between 2 and 4, but it does depend on the query.
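If it helps, here is a minimal sketch of where those knobs would be set when building a session. The values are illustrative placeholders, not tuned recommendations, and it assumes the RAPIDS Accelerator jars are already on the application's classpath (e.g. supplied via spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the values below are placeholders to show where the settings go.
val spark = SparkSession.builder()
  .appName("gpu-batch-size-sketch")
  // Target size (in bytes) of the columnar batches processed on the GPU.
  .config("spark.rapids.sql.batchSizeBytes", (512L * 1024 * 1024).toString)
  // Number of tasks allowed to process data on the GPU concurrently.
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  .getOrCreate()
```

The same settings can also be passed as `--conf` options on the spark-submit command line.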
-
@jlowe
-
Ultimately the number of tasks that can be allowed to run concurrently on the GPU is influenced by both configuration settings, and how many tasks will run concurrently at any point in time depends on what the tasks are trying to do at that moment. I'll try to explain in detail, so here's how those two configuration settings work in practice:
`spark.task.resource.gpu.amount` will direct Spark's scheduler to limit how many tasks are allowed to run concurrently on an executor, whether those tasks are actively using the GPU or not. For example, if `spark.task.resource.gpu.amount=0.25` then the executor can run at most 4 tasks at a time, just like when `spark.executor.cores=4`. This is the maximum number of tasks the executor will run at once, no matter what those tasks are doing.

`spark.rapids.sql.concurrentGpuTasks` further limits how many of those tasks can actively process data on the GPU at the same time. A task must acquire the GPU semaphore before doing GPU processing, and only the configured number of tasks can hold it at once. Work that does not need the GPU, such as reading the input from the distributed filesystem, is not limited by this setting, so all of the executor's tasks can do it concurrently.

Similarly, when tasks read and write shuffle, that is done without the GPU, so tasks can run up to the configured executor parallelism (i.e.: as determined by `spark.executor.cores` or `spark.task.resource.gpu.amount`). This interaction is easily seen when looking at an Nsight Systems trace of an executor running when more tasks are configured via the executor parallelism settings than are allowed on the GPU via `spark.rapids.sql.concurrentGpuTasks`.

So why have the two different configs? This is to allow the executor to run with a wider parallelism than the GPU memory could allow when tasks are not actively needing the GPU to perform processing (e.g.: during distributed filesystem reads, shuffle read/write, and other CPU-only portions of the query). If we only used the standard Spark executor core or task resource GPU settings, then we'd be stuck either running "too wide" and hitting OOM errors because there isn't sufficient GPU memory to accommodate that many concurrent tasks, or running "too narrow" and not going as fast as a CPU-only cluster could go for the portions of the query that are running only on the CPU. Having both of these configs allows us to run wider during CPU-only sections and narrower during GPU-only sections.
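To make the interplay concrete, here is a minimal sketch of one way the settings might be combined. The values are illustrative assumptions (one GPU per executor, 8 executor cores), not recommendations for any particular cluster:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: up to 8 tasks may run per executor, but at most 2 of them
// can be processing data on the GPU at any given moment.
val spark = SparkSession.builder()
  .appName("wide-cpu-narrow-gpu-sketch")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Executor-level parallelism: Spark schedules up to 8 concurrent tasks.
  .config("spark.executor.cores", "8")
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.task.resource.gpu.amount", "0.125") // 1/8 of the GPU per task
  // GPU-level parallelism: only 2 of those tasks hold the GPU at a time.
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  .getOrCreate()
```

In practice these would usually be supplied at submit time as `--conf` options. With a combination like this the executor stays "wide" (8 tasks) during scans and shuffle, while the GPU stays "narrow" (2 tasks) during the GPU-heavy sections.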
-
@jlowe if we need to write out this long explanation instead of pointing to https://github.com/NVIDIA/spark-rapids/blob/branch-22.02/docs/tuning-guide.md#number-of-tasks-per-executor and/or https://github.com/NVIDIA/spark-rapids/blob/branch-22.02/docs/tuning-guide.md#number-of-tasks-per-executor then we probably need to re-write them to make it more clear what is happening.
-
@revans2 Totally agree. I already pointed to the tuning guide up above, and since questions remained, I'm using this issue to work out what additional clarifications are needed. Once the questions are all resolved, we can use this issue's discussion as input for what needs to be added to the tuning guide and FAQ.
-
@jlowe @revans2
-
Closing this as answered. I posted a FAQ change at #4692 to clarify this a bit further and point to the detailed sections of the tuning guide.