
[SPARK-20955][Core]Intern "executorId" to reduce the memory usage #18177

Closed
wants to merge 1 commit

Conversation

zsxwing
Member

@zsxwing zsxwing commented Jun 1, 2017

What changes were proposed in this pull request?

In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), the `executorId` string received from executors is used and eventually ends up in `TaskUIData`. Since deserializing the `executorId` string always creates a new instance, we end up with many duplicated string instances.

This PR interns the strings stored in `TaskUIData` to reduce the memory usage.
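For illustration, a weak interner can be sketched as a `WeakHashMap` of canonical instances. This is a hypothetical stand-in for whatever `weakIntern` does internally (Spark's actual helper may be implemented differently, e.g. on top of a library interner); unlike `String.intern()`, entries here can be garbage-collected once no one else references them:

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Minimal sketch of a weak string interner: equal strings map to one
// canonical instance, but unused entries remain collectible because both
// the key and the value hold the string only weakly.
public final class WeakInterner {
    private final WeakHashMap<String, WeakReference<String>> pool = new WeakHashMap<>();

    public synchronized String weakIntern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            // First time we see this value: make it the canonical instance.
            canonical = s;
            pool.put(s, new WeakReference<>(s));
        }
        return canonical;
    }

    public static void main(String[] args) {
        WeakInterner interner = new WeakInterner();
        String a = interner.weakIntern(new String("executor-1"));
        String b = interner.weakIntern(new String("executor-1"));
        System.out.println(a == b); // prints "true": one canonical instance
    }
}
```

With this shape, thousands of `TaskUIData` rows that mention the same executor share a single `String` object instead of one copy per deserialized message.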

How was this patch tested?

Manually tested using `bin/spark-shell --master local-cluster[6,1,1024]`. Test code:

```scala
for (_ <- 1 to 10) { sc.makeRDD(1 to 1000, 1000).count() }
Thread.sleep(2000)
val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener]
org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData)
```

In the above case, this PR reduces the size of `stageIdToData` from 3487280 to 3009744 bytes (86.3% of the original size).

```diff
-      executorId = taskInfo.executorId,
-      host = taskInfo.host,
+      executorId = weakIntern(taskInfo.executorId),
+      host = weakIntern(taskInfo.host),
```
Member Author

Just interning it as well to be safe, although `host` doesn't come from executors.

@zsxwing
Member Author

zsxwing commented Jun 1, 2017

cc @JoshRosen

@srowen
Member

srowen commented Jun 1, 2017

It seems reasonable but there are likely many more similar opportunities - is this a significantly large one? Or are there other easy wins?

Interning strings isn't as problematic as it used to be. Ideally, the JVM's string deduplication would handle this even better, and automatically, though it is still behind an experimental flag.
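For reference, the JVM string deduplication mentioned above is enabled roughly like this (JDK 8u20+; it only works with the G1 collector, and `app.jar` here is a placeholder for the actual application):

```shell
# G1 string deduplication: the GC scans for char[]/byte[] arrays with
# identical contents and makes String objects share one backing array.
java -XX:+UseG1GC -XX:+UseStringDeduplication -jar app.jar
```

Note the difference from interning: deduplication shares the backing array between distinct `String` objects, while interning collapses them to one object, so `==` comparisons only benefit from the latter.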

@zsxwing
Member Author

zsxwing commented Jun 1, 2017

@srowen Considering Spark keeps 1000 stages in the UI, if each stage has 10000 tasks, there will be a lot of duplicated strings. I observed about 150MB of them in a heap dump. Of course, it's not a huge amount, but it's still worth eliminating them. I agree that JVM string deduplication is better, but it's not enabled by default.

@SparkQA

SparkQA commented Jun 1, 2017

Test build #77654 has finished for PR 18177 at commit 38a4e31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

LGTM. This seems like an easy win based on our empirical observations of high memory usage. I'm sure that we'll continue to find and fix others as we profile and heap dump more.

@zsxwing
Member Author

zsxwing commented Jun 2, 2017

Thanks! Merging to master and 2.2.

asfgit pushed a commit that referenced this pull request Jun 2, 2017
## What changes were proposed in this pull request?

In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), the `executorId` string received from executors is used and eventually ends up in `TaskUIData`. Since deserializing the `executorId` string always creates a new instance, we end up with many duplicated string instances.

This PR interns the strings stored in `TaskUIData` to reduce the memory usage.

## How was this patch tested?

Manually tested using `bin/spark-shell --master local-cluster[6,1,1024]`. Test code:
```scala
for (_ <- 1 to 10) { sc.makeRDD(1 to 1000, 1000).count() }
Thread.sleep(2000)
val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener]
org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData)
```
In the above case, this PR reduces the size of `stageIdToData` from 3487280 to 3009744 bytes (86.3% of the original size).

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18177 from zsxwing/SPARK-20955.

(cherry picked from commit 16186cd)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in 16186cd Jun 2, 2017
@zsxwing zsxwing deleted the SPARK-20955 branch June 2, 2017 17:38