
[SPARK-9103][WIP] Add Memory Tracking UI and track Netty memory usage #17625

Closed
wants to merge 9 commits

Conversation

jsoltren

This patch resurrects #7753 by liyezhang556520.

What changes were proposed in this pull request?

Memory reporting is a much-desired feature in Apache Spark. This change adds reporting for some of the memory used by Netty and adds a new UI tab for memory reporting. We introduce a new executorMetrics object to track this.
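For concreteness, here is a minimal sketch of what the new metrics payload could look like; the field names are inferred from the review comments further down, and the exact types are assumptions rather than the patch's actual definitions:

// Hypothetical sketch only; field names inferred from this PR's review comments.
case class TransportMetrics(
    timeStamp: Long,    // when the sample was taken (assumption)
    onHeapSize: Long,   // bytes of Netty on-heap memory in use
    offHeapSize: Long)  // bytes of Netty off-heap (direct) memory in use

case class ExecutorMetrics(
    hostname: String,
    port: Option[Int],  // assumption
    transportMetrics: TransportMetrics)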

How was this patch tested?

Ran https://github.com/jsoltren/spark-demos/blob/master/src/main/scala/org/soltren/spark/examples/ShuffleGroupCount.scala and verified the appearance of the new tab in the UI.

[Screenshot: 9103-memory-ui, showing the new Memory tab in the Spark UI]

@@ -22,8 +22,12 @@ import java.nio.ByteBuffer
import scala.collection.JavaConverters._
import scala.concurrent.{Future, Promise}
import scala.reflect.ClassTag
import scala.tools.nsc.interpreter.JList

this should be

import java.util.{List => JList}

(it happens to work b/c of this: https://github.com/scala/scala/blob/v2.12.1/src/repl/scala/tools/nsc/interpreter/package.scala#L36)
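For context, the corrected import block would presumably read as follows, with the java import grouped before the scala ones; this is a sketch rather than the committed change:

import java.nio.ByteBuffer
import java.util.{List => JList}

import scala.collection.JavaConverters._
import scala.concurrent.{Future, Promise}
import scala.reflect.ClassTag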

@@ -87,6 +88,10 @@ private[spark] class EventLoggingListener(
// Visible for tests only.
private[scheduler] val logPath = getLogPath(logBaseDir, appId, appAttemptId, compressionCodecName)

private val executorIdToLatestMetrics = new HashMap[String, SparkListenerExecutorMetricsUpdate]
private val executorIdToModifiedMaxMetrics = new
HashMap[String, SparkListenerExecutorMetricsUpdate]

nit: put new on the same line as HashMap
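Applied, the declaration might be wrapped before new so that new HashMap stays together, for example:

private val executorIdToModifiedMaxMetrics =
  new HashMap[String, SparkListenerExecutorMetricsUpdate]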

case None =>
executorIdToModifiedMaxMetrics(executorId) = latestEvent
case Some(toBeModifiedEvent) =>
val toBeModifiedTransportMetrics = toBeModifiedEvent.executorMetrics.transportMetrics

how about prevTransportMetrics

} else {
toBeModifiedTransportMetrics.offHeapSize
}
val modifiedExecMetrics = ExecutorMetrics(toBeModifiedEvent.executorMetrics.hostname,

how about updatedExecMetrics

@squito
Contributor

squito commented Apr 12, 2017

Jenkins, ok to test

@SparkQA

SparkQA commented Apr 12, 2017

Test build #75750 has finished for PR 17625 at commit f3e5704.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

@jsoltren thanks for bringing up this very old PR.

Looking at the UI you pasted here, I'm wondering what the purpose of Completed Stages is here. What's the difference compared to the Stages tab?

Also, I think it would be better to add a new metrics Source for Netty memory usage; that would be useful.
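As a rough illustration of the metrics Source idea, and only a sketch: it assumes Netty 4.1's PooledByteBufAllocatorMetric API, and the class name and gauge keys below are made up. Since Spark's Source trait is private[spark], such a class would have to live inside the org.apache.spark package tree.

import com.codahale.metrics.{Gauge, MetricRegistry}
import io.netty.buffer.PooledByteBufAllocator
import org.apache.spark.metrics.source.Source

// Illustrative only: expose the shared pooled allocator's usage as gauges.
private[spark] class NettyMemoryMetricsSource extends Source {
  override val sourceName: String = "NettyMemory"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  private val allocatorMetric = PooledByteBufAllocator.DEFAULT.metric()

  metricRegistry.register(MetricRegistry.name("usedHeapMemory"), new Gauge[Long] {
    override def getValue: Long = allocatorMetric.usedHeapMemory()
  })
  metricRegistry.register(MetricRegistry.name("usedDirectMemory"), new Gauge[Long] {
    override def getValue: Long = allocatorMetric.usedDirectMemory()
  })
}

Registering it would then be something like SparkEnv.get.metricsSystem.registerSource(new NettyMemoryMetricsSource).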

@tgravescs
Contributor

we also just exposed more memory information for storage memory in the executors page in SPARK-17019.

If we now have a Memory tab, it could be confusing to users where to go to see what. If this is again a per-executor thing, do we just add new columns with checkboxes there, or consider consolidating that information under this Memory tab?

val newJson = JsonProtocol.executorMetricsUpdateToJson(executorMetricsUpdate)
val oldJson = newJson.removeField { case (field, _) => field == "Executor Metrics Updated"}
val newMetrics = JsonProtocol.executorMetricsUpdateFromJson(oldJson)
assert(newMetrics.executorMetrics.hostname === "")

it seems like executorMetrics should really be an Option (with a comment noting it was not present until version x)
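A minimal sketch of that suggestion, using json4s as the rest of JsonProtocol does; the field name follows the test above, while the parser helper is hypothetical:

import org.json4s._

// Sketch: treat the field as optional so event logs written before this
// change, which lack it, still deserialize.
def executorMetricsOptionFromJson(json: JValue): Option[ExecutorMetrics] = {
  json \ "Executor Metrics Updated" match {
    case JNothing => None                                           // absent in older logs
    case metricsJson => Some(executorMetricsFromJson(metricsJson))  // hypothetical parser
  }
}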

@squito
Contributor

squito commented Apr 13, 2017

yeah, taking another look at the UI, I agree with @jerryshao and @tgravescs, the memory tab is pretty weird. I think putting this info on the "stages" tab makes sense -- that is really the info this is currently exposing, the memory used by Netty during a stage. I think at some point it might make sense to have a Memory tab which more completely summarizes the memory used by Spark -- netty, cache, shuffle, execution, perhaps even parquet & serialization buffers, etc. But with just this, it's more confusing than it is helpful.

It may be a little strange having just this one bit of memory info on the stage page until we get a more complete picture of memory, but hopefully still useful.

Also having a metric source for netty seems like a good idea.

@squito
Contributor

squito commented Apr 13, 2017

@tgravescs what do you think about breaking this into two parts -- the internal plumbing, and the UI stuff? by itself the plumbing part wouldn't do anything, but I think it would be easier to review, as long as there is general agreement this is a good direction.

@tgravescs
Contributor

I haven't looked through the code at all, but I definitely like the idea of tracking the netty memory usage. Breaking into 2 pieces makes sense.

If we end up creating any new UI pages, please use datatables like the executors page rather than the old format.

@SparkQA

SparkQA commented Apr 17, 2017

Test build #75857 has finished for PR 17625 at commit 577d442.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

jerryshao commented Apr 21, 2017

@jsoltren, from a quick look at your current implementation, it looks like you only track Netty memory usage in NettyBlockTransferService, but there are some other places in Spark that create a Netty clientFactory or server. I think it would be better to also track memory usage in those places:

  1. Netty RPC client factory and RPC Server.
  2. Netty file download client factory.
  3. Netty external shuffle client factory.
  4. Netty block transfer client factory and server (you already did this; for the server, I think it is not used for external shuffle).

Also, Spark has different Netty contexts (rpc, shuffle); do we also need separate Netty metrics for shuffle, rpc, etc.?

I would suggest only exposing Netty metrics internally in this PR and leaving the UI work to a follow-up PR.
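To illustrate the per-context idea, one possible shape (an assumption, not what this patch does) is to give each Netty context its own pooled allocator and read its usage when publishing metrics. The helper below is hypothetical and assumes Netty 4.1's allocator metric API.

import scala.collection.mutable

import io.netty.buffer.PooledByteBufAllocator

// Hypothetical helper: one pooled allocator per Spark Netty context
// (e.g. "rpc", "shuffle", "blockTransfer", "fileDownload").
class NettyMemoryTracker {
  private val allocators = mutable.HashMap.empty[String, PooledByteBufAllocator]

  def allocatorFor(context: String): PooledByteBufAllocator = synchronized {
    allocators.getOrElseUpdate(context, new PooledByteBufAllocator(true /* preferDirect */))
  }

  // Snapshot of (onHeapBytes, offHeapBytes) for one context.
  def usage(context: String): Option[(Long, Long)] = synchronized {
    allocators.get(context).map { alloc =>
      val m = alloc.metric()
      (m.usedHeapMemory(), m.usedDirectMemory())
    }
  }
}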

@maropu mentioned this pull request Apr 23, 2017
maropu added a commit to maropu/spark that referenced this pull request Apr 23, 2017
@asfgit closed this in e9f9715 Apr 24, 2017
@jsoltren
Author

I'm still making updates here. It would be ideal if one of the admins could please re-open this PR. Otherwise I will open a new one. Thanks!

@jsoltren
Author

This PR was closed, so I'll create a new one focusing on just the back-end pieces. I'll create a fresh JIRA for more general memory-tracking improvements to the UI, where we can hash out more of the details. The UI has changed quite a lot since the original PR!

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and some of them are stale: their JIRA tickets have already been closed, or their JIRA tickets do not exist (and they do not appear to be minor issues).

// Open PRs whose JIRA tickets have been already closed
Closes apache#11785
Closes apache#13027
Closes apache#13614
Closes apache#13761
Closes apache#15197
Closes apache#14006
Closes apache#12576
Closes apache#15447
Closes apache#13259
Closes apache#15616
Closes apache#14473
Closes apache#16638
Closes apache#16146
Closes apache#17269
Closes apache#17313
Closes apache#17418
Closes apache#17485
Closes apache#17551
Closes apache#17463
Closes apache#17625

// Open PRs whose JIRA tickets do not exist and which are not minor issues
Closes apache#10739
Closes apache#15193
Closes apache#15344
Closes apache#14804
Closes apache#16993
Closes apache#17040
Closes apache#15180
Closes apache#17238

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5