[SPARK-20955][Core]Intern "executorId" to reduce the memory usage #18177
Conversation
```diff
-      executorId = taskInfo.executorId,
-      host = taskInfo.host,
+      executorId = weakIntern(taskInfo.executorId),
+      host = weakIntern(taskInfo.host),
```
Just intern it as well for safety, although `host` doesn't come from executors.

cc @JoshRosen
It seems reasonable, but there are likely many more similar opportunities. Is this a significantly large one, or are there other easy wins? Interning strings isn't as problematic as it used to be. Ideally the JVM string deduplication would take care of this even better and automatically, though it is still an experimental flag.
@srowen Considering Spark keeps 1000 stages in the UI, if each stage has 10000 tasks, there will be a lot of duplicated strings. I observed this in a heap dump: about 150MB. Of course, it's not significantly large, but still worth eliminating. I agree that JVM string deduplication is better, but it's not enabled by default.
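For context on the deduplication flag discussed above: on HotSpot, JVM-level string deduplication is part of the G1 collector and is off by default, so it has to be enabled explicitly. A minimal invocation (flag names are standard HotSpot options, shown here for illustration):

```shell
# Enable G1 string deduplication (available since JDK 8u20); both flags are
# required because deduplication only runs under the G1 collector.
java -XX:+UseG1GC -XX:+UseStringDeduplication -version
```

With these flags, G1 deduplicates the backing char arrays of identical strings during GC, which addresses the same class of waste as this PR but only for applications where operators control the JVM flags.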
Test build #77654 has finished for PR 18177 at commit
LGTM. This seems like an easy win based on our empirical observations of high memory usage. I'm sure that we'll continue to find and fix others as we profile and heap dump more. |
Thanks! Merging to master and 2.2. |
## What changes were proposed in this pull request?

In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), the `executorId` string received from executors eventually ends up in `TaskUIData`. Since deserializing the `executorId` string always creates a new instance, we end up with many duplicated string instances. This PR interns the strings stored in `TaskUIData` to reduce memory usage.

## How was this patch tested?

Manually tested using `bin/spark-shell --master local-cluster[6,1,1024]`. Test code:

```scala
for (_ <- 1 to 10) {
  sc.makeRDD(1 to 1000, 1000).count()
}
Thread.sleep(2000)
val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener]
org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData)
```

This PR reduces the estimated size of `stageIdToData` from 3487280 to 3009744 bytes (86.3% of the original) in the above case.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18177 from zsxwing/SPARK-20955.

(cherry picked from commit 16186cd)

Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
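To illustrate the `weakIntern` idea from the diff above: a weak interner returns one canonical instance per distinct string, but, unlike `String.intern()`, lets the canonical instance be garbage-collected once nothing else references it. Spark's actual implementation lives in Scala and uses a library interner; the following is only a minimal stdlib sketch of the same technique (the class name `WeakInterner` is hypothetical, not Spark's):

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Minimal sketch of a weak string interner. WeakHashMap keys and the
// WeakReference values both hold the string weakly, so entries disappear
// once the canonical string is no longer strongly referenced elsewhere.
public class WeakInterner {
    private final WeakHashMap<String, WeakReference<String>> pool =
        new WeakHashMap<>();

    public synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical != null) {
            return canonical;                // reuse the existing instance
        }
        pool.put(s, new WeakReference<>(s)); // s becomes the canonical instance
        return s;
    }

    public static void main(String[] args) {
        WeakInterner interner = new WeakInterner();
        // new String(...) forces distinct instances, as deserialization does.
        String a = interner.intern(new String("executor-1"));
        String b = interner.intern(new String("executor-1"));
        System.out.println(a == b);          // same canonical instance
    }
}
```

The PR applies this at the point where `TaskUIData` stores `executorId` and `host`, so the 1000 retained stages share one string instance per executor instead of one per task.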