[SPARK-20955][Core]Intern "executorId" to reduce the memory usage #18177
Conversation
```diff
-      executorId = taskInfo.executorId,
-      host = taskInfo.host,
+      executorId = weakIntern(taskInfo.executorId),
+      host = weakIntern(taskInfo.host),
```
Just intern it as well for safety, although `host` doesn't come from executors.

cc @JoshRosen
It seems reasonable, but there are likely many more similar opportunities. Is this a significantly large one, or are there other easy wins? Interning strings isn't as problematic as it used to be. Ideally the JVM string deduplication would take care of this even better and automatically, though it is still an experimental flag.
@srowen Considering Spark keeps 1000 stages in the UI, if each stage has 10000 tasks, there will be a lot of duplicated strings. I observed this in a heap dump: about 150MB. Of course, it's not significantly large, but still worth eliminating. I agree that JVM string deduplication is better, but it's not enabled by default.
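For context on the deduplication flag discussed above: on HotSpot, JVM-level string deduplication is part of the G1 collector and is off by default, so it has to be enabled explicitly. A minimal invocation (flag names are standard HotSpot options, shown here for illustration):

```shell
# Enable G1 string deduplication (available since JDK 8u20); both flags are
# required because deduplication only runs under the G1 collector.
java -XX:+UseG1GC -XX:+UseStringDeduplication -version
```

With these flags, G1 deduplicates the backing char arrays of identical strings during GC, which addresses the same class of waste as this PR but only for applications where operators control the JVM flags.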
Test build #77654 has finished for PR 18177 at commit
LGTM. This seems like an easy win based on our empirical observations of high memory usage. I'm sure that we'll continue to find and fix others as we profile and heap dump more. |
Thanks! Merging to master and 2.2. |
## What changes were proposed in this pull request?

In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), the `executorId` string received from executors eventually ends up in `TaskUIData`. Since deserializing the `executorId` string always creates a new instance, we end up with many duplicated string instances. This PR interns the strings stored in `TaskUIData` to reduce memory usage.

## How was this patch tested?

Manually tested using `bin/spark-shell --master local-cluster[6,1,1024]`. Test code:

```scala
for (_ <- 1 to 10) {
  sc.makeRDD(1 to 1000, 1000).count()
}
Thread.sleep(2000)
val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener]
org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData)
```

This PR reduces the estimated size of `stageIdToData` from 3487280 to 3009744 bytes (86.3% of the original) in the above case.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18177 from zsxwing/SPARK-20955.

(cherry picked from commit 16186cd)

Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
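To illustrate the `weakIntern` idea from the diff above: a weak interner returns one canonical instance per distinct string, but, unlike `String.intern()`, lets the canonical instance be garbage-collected once nothing else references it. Spark's actual implementation lives in Scala and uses a library interner; the following is only a minimal stdlib sketch of the same technique (the class name `WeakInterner` is hypothetical, not Spark's):

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Minimal sketch of a weak string interner. WeakHashMap keys and the
// WeakReference values both hold the string weakly, so entries disappear
// once the canonical string is no longer strongly referenced elsewhere.
public class WeakInterner {
    private final WeakHashMap<String, WeakReference<String>> pool =
        new WeakHashMap<>();

    public synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical != null) {
            return canonical;                // reuse the existing instance
        }
        pool.put(s, new WeakReference<>(s)); // s becomes the canonical instance
        return s;
    }

    public static void main(String[] args) {
        WeakInterner interner = new WeakInterner();
        // new String(...) forces distinct instances, as deserialization does.
        String a = interner.intern(new String("executor-1"));
        String b = interner.intern(new String("executor-1"));
        System.out.println(a == b);          // same canonical instance
    }
}
```

The PR applies this at the point where `TaskUIData` stores `executorId` and `host`, so the 1000 retained stages share one string instance per executor instead of one per task.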