[FEA] clean up host memory limit configs #8878
Comments
Hi @revans2, I recently found that unless spark.rapids.memory.hostOffHeapLimit.enabled is set, host memory consumption is unbounded, so the whole Spark process is at risk of being killed by the OOM killer or by YARN. Is there any reason why spark.rapids.memory.hostOffHeapLimit.enabled is still not enabled by default?
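For reference, a minimal sketch of opting in to the limit explicitly today; the size value is an assumed placeholder, not a recommendation:

```scala
import org.apache.spark.SparkConf

// Hypothetical example of enabling the host off-heap limit, since it is
// not on by default; the 16g size here is an assumed placeholder.
val conf = new SparkConf()
  .set("spark.rapids.memory.hostOffHeapLimit.enabled", "true")
  .set("spark.rapids.memory.hostOffHeapLimit.size", "16g")
```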
Really there are two reasons.
When we finally finish the first issue, we will discuss the second one and whether there are things we can or should do to help mitigate it.
Hi @revans2, can you give some examples of what the CPU allocation code referenced in sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala (line 461 in 91bdb92) actually means?
The original problem we're facing is this: when running a customer SQL query with buffer spilling, we want to maximize memory-store spilling and minimize disk-store spilling (to get better performance). Meanwhile, we have to limit the total off-heap memory in use, to prevent the executor process from eating up all of the OS memory and then being killed by the OOM killer. Our current solution is to set:
With these configs we hope the total off-heap memory is bounded. Any comments on your side?
It should be close, but I would have to go back and look at the EPIC to see exactly what is left. You definitely could try that. I think we are 99% of the way to truly limiting host memory, but it has been a while. Be aware that the pool Spark uses for off heap does not overlap with the pool that we use for off heap. It would be nice to eventually combine them, but as it is now your config could use up to 80 GiB of off heap memory.
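To make the non-overlapping pools concrete, a minimal sketch; the 40g sizes are assumed values chosen only to add up to the 80 GiB worst case mentioned above:

```scala
import org.apache.spark.SparkConf

// Spark's own off-heap pool and the plugin's host off-heap pool are sized
// independently, so the worst-case footprint is their sum: 40g + 40g = 80 GiB
// here. The sizes are illustrative assumptions, not recommendations.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "40g")                 // Spark's pool
  .set("spark.rapids.memory.hostOffHeapLimit.enabled", "true")
  .set("spark.rapids.memory.hostOffHeapLimit.size", "40g") // plugin's pool
```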
Is your feature request related to a problem? Please describe.
Once the changes to the plugin are done enough that we feel it should be turned on by default, we should:

- Set spark.rapids.memory.hostOffHeapLimit.enabled to true by default, and deprecate it. We will have a follow-on issue to remove it once we have confidence that customers can move to this without a lot of problems.
- Deprecate spark.rapids.memory.host.spillStorageSize and point people to spark.rapids.memory.hostOffHeapLimit.size instead.
- Move spark.rapids.memory.hostPageable.taskOverhead.size from being an internal config to being an advanced config.
- Output a warning if spark.rapids.memory.hostOffHeapLimit.size < 2 * spark.rapids.memory.hostPageable.taskOverhead.size * numberOfTasks, OR if spark.rapids.memory.hostOffHeapLimit.size < spark.rapids.sql.batchSizeBytes + spark.sql.files.maxPartitionBytes * some factor (but we can adjust this based off of the testing we do). The warning should let users know that we are adjusting the limit to the new minimum value to avoid problems; see the sketch after this list.
- Also warn if the pinned pool is larger than the offHeapLimit. For now we will adjust the pinned pool down to fit in the off heap limit.
- All of these should be documented.
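A minimal sketch of what the warning and clamping logic described above could look like; the function name, parameter names, and the `maxPartitionFactor` placeholder are hypothetical, not the plugin's actual API:

```scala
// Hypothetical sketch of the proposed minimum-size check. All names and the
// maxPartitionFactor multiplier ("some factor") are illustrative assumptions.
def adjustedHostOffHeapLimit(
    hostOffHeapLimit: Long,   // spark.rapids.memory.hostOffHeapLimit.size
    taskOverhead: Long,       // spark.rapids.memory.hostPageable.taskOverhead.size
    numberOfTasks: Int,
    batchSizeBytes: Long,     // spark.rapids.sql.batchSizeBytes
    maxPartitionBytes: Long,  // spark.sql.files.maxPartitionBytes
    maxPartitionFactor: Double): Long = {
  // The two floors from the issue description.
  val overheadMin = 2L * taskOverhead * numberOfTasks
  val readMin = batchSizeBytes + (maxPartitionBytes * maxPartitionFactor).toLong
  val minimum = math.max(overheadMin, readMin)
  if (hostOffHeapLimit < minimum) {
    // Warn so users know the configured limit is being raised.
    println(s"WARNING: hostOffHeapLimit.size ($hostOffHeapLimit) is below the " +
      s"computed minimum ($minimum); adjusting it up to avoid problems.")
    minimum
  } else {
    hostOffHeapLimit
  }
}
```

Raising the limit to the larger of the two floors satisfies both conditions at once, which matches the "adjust to the new minimum value" behavior described in the list.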