[SPARK-30313][CORE] Ensure EndpointRef is available MasterWebUI/WorkerPage #27010

HeartSaVioR · 2019-12-26T05:18:40Z

What changes were proposed in this pull request?

This patch fixes flaky tests "master/worker web ui available" & "master/worker web ui available with reverseProxy" in MasterSuite.

Tracing back from the stack trace below,

19/12/19 13:48:39.160 dispatcher-event-loop-4 INFO Worker: WorkerWebUI is available at http://localhost:8080/proxy/worker-20191219134839-localhost-36054
19/12/19 13:48:39.296 WorkerUI-52072 WARN JettyUtils: GET /json/ failed: java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.spark.deploy.worker.ui.WorkerPage.renderJson(WorkerPage.scala:39)
        at org.apache.spark.ui.WebUI.$anonfun$attachPage$2(WebUI.scala:91)
        at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
        at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)

there's a possible race condition in Dispatcher.registerRpcEndpoint():

spark/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

Lines 64 to 77 in 481fb63

    
  def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
    val addr = RpcEndpointAddress(nettyEnv.address, name)
    val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
    synchronized {
      if (stopped) {
        throw new IllegalStateException("RpcEnv has been stopped")
      }
      if (endpoints.putIfAbsent(name, getMessageLoop(name, endpoint)) != null) {
        throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
      }
    }
    endpointRefs.put(endpoint, endpointRef)
    endpointRef
  }

getMessageLoop() initializes a new Inbox for this endpoint for both DedicatedMessageLoop and SharedMessageLoop, and the Inbox calls onStart() "asynchronously" and "eventually" by posting an OnStart message. onStart() initializes the UI page instance(s), so endpointRefs.put() and the initialization of the UI page instance(s) run "concurrently".
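The interleaving described above can be sketched in isolation. This is a minimal model, not Spark code: `RegistrationRaceSketch`, `Ref`, and `refSeenByOnStart` are hypothetical names, a single-thread executor stands in for the message loop, and a latch forces the unlucky ordering so that the outcome is deterministic rather than flaky.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

// Hypothetical, simplified model of the pre-patch ordering; not Spark's real classes.
object RegistrationRaceSketch {
  final case class Ref(name: String)

  /** Starts the loop first and publishes the ref afterwards; returns what onStart() saw. */
  def refSeenByOnStart(name: String): Option[Ref] = {
    val endpointRefs = new ConcurrentHashMap[AnyRef, Ref]()
    val endpoint = new Object
    val loop = Executors.newSingleThreadExecutor()
    val onStartDone = new CountDownLatch(1)
    val seen = new AtomicReference[Option[Ref]](None)
    try {
      // Step 1 (stand-in for getMessageLoop): onStart() is scheduled on another thread.
      loop.execute(new Runnable {
        def run(): Unit = {
          // Stand-in for onStart() looking up `self`; a missing entry yields null.
          seen.set(Option(endpointRefs.get(endpoint)))
          onStartDone.countDown()
        }
      })
      // Force the bad interleaving: let onStart() finish before the ref is published.
      onStartDone.await(5, TimeUnit.SECONDS)
      // Step 2: the ref is only published now, as in the pre-patch registerRpcEndpoint.
      endpointRefs.put(endpoint, Ref(name))
      seen.get()
    } finally {
      loop.shutdown()
    }
  }
}
```

In this model `RegistrationRaceSketch.refSeenByOnStart("Worker")` returns `None`, which corresponds to `self` being null inside onStart().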

MasterPage and WorkerPage retrieve the endpoint ref and store it in a "val", assuming the endpoint ref is valid when they're initialized. In the bad case they store "null" as the endpoint ref and never update it.

spark/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala

Lines 33 to 38 in 481fb63

    
  private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {
    private val master = parent.masterEndpointRef

    def getMasterState: MasterStateResponse = {
      master.askSync[MasterStateResponse](RequestMasterState)
    }

spark/core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerPage.scala

Lines 35 to 41 in 481fb63

    
  private[ui] class WorkerPage(parent: WorkerWebUI) extends WebUIPage("") {
    private val workerEndpoint = parent.worker.self

    override def renderJson(request: HttpServletRequest): JValue = {
      val workerState = workerEndpoint.askSync[WorkerStateResponse](RequestWorkerState)
      JsonProtocol.writeWorkerState(workerState)
    }

This patch splits finding the right message loop from registering the endpoint to that message loop, and ensures the endpoint ref is set "before" the endpoint is registered to the message loop.

Why are the changes needed?

We observed the test failures from Jenkins; below are the links:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115583/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115700/testReport/

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

You can also reproduce the bug consistently by adding Thread.sleep(1000) just before endpointRefs.put(endpoint, endpointRef) in Dispatcher.registerRpcEndpoint(...).

HeartSaVioR · 2019-12-26T05:20:21Z

core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala

@@ -34,7 +34,12 @@ class MasterWebUI(
  extends WebUI(master.securityMgr, master.securityMgr.getSSLOptions("standalone"),
    requestedPort, master.conf, name = "MasterUI") with Logging {

-  val masterEndpointRef = master.self
+  val masterEndpointRef = {


If we don't feel comfortable adding an infinite loop, we can just change it from val to def, with a comment saying it shouldn't be cached.

HeartSaVioR · 2019-12-26T05:20:34Z

core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerPage.scala

@@ -33,7 +33,12 @@ import org.apache.spark.ui.{UIUtils, WebUIPage}
 import org.apache.spark.util.Utils

 private[ui] class WorkerPage(parent: WorkerWebUI) extends WebUIPage("") {
-  private val workerEndpoint = parent.worker.self
+  private val workerEndpoint = {


Same here: if we don't feel comfortable adding an infinite loop, we can just change it from val to def, with a comment saying it shouldn't be cached.
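The difference between the two options can be sketched without any Spark classes. Everything below is a hypothetical stand-in (`Endpoint`, `CachingPage`, `LazyLookupPage`); it only illustrates that a `val` captures whatever `self` returned at construction time, while a `def` looks it up again on every access.

```scala
// Hypothetical stand-ins; not Spark's RpcEndpoint or WebUIPage.
object ValVersusDefSketch {
  final case class Ref(name: String)

  // An endpoint whose `self` stays null until registration completes.
  final class Endpoint {
    @volatile private var registered: Ref = null
    def register(ref: Ref): Unit = { registered = ref }
    def self: Ref = registered
  }

  // Mirrors the original pages: the ref is read once, when the page is constructed.
  final class CachingPage(endpoint: Endpoint) {
    private val ref = endpoint.self
    def refAvailable: Boolean = ref != null
  }

  // The val-to-def alternative: the ref is re-read on every access and never cached.
  final class LazyLookupPage(endpoint: Endpoint) {
    private def ref = endpoint.self
    def refAvailable: Boolean = ref != null
  }

  /** Builds both pages before registration, registers, then reports (caching, lookup). */
  def availabilityAfterLateRegistration(): (Boolean, Boolean) = {
    val endpoint = new Endpoint
    val caching = new CachingPage(endpoint)
    val lookup = new LazyLookupPage(endpoint)
    endpoint.register(Ref("Worker"))
    (caching.refAvailable, lookup.refAvailable)
  }
}
```

Here the caching page keeps its null forever, while the lookup page recovers once the ref is registered. The `def` variant would still fail for a request that arrives before registration, so it narrows the window rather than fixing the ordering itself.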

HeartSaVioR · 2019-12-26T05:27:02Z

Hmm... Does this mean the comment below is broken? The code comment says self will become valid when onStart is called, but that doesn't seem to be true: self becomes valid "around" the time onStart is called, and there's no guarantee that self is valid in onStart.

spark/core/src/main/scala/org/apache/spark/rpc/RpcEndpoint.scala

Lines 53 to 63 in 481fb63

    
  /**
   * The [[RpcEndpointRef]] of this [[RpcEndpoint]]. `self` will become valid when `onStart` is
   * called. And `self` will become `null` when `onStop` is called.
   *
   * Note: Because before `onStart`, [[RpcEndpoint]] has not yet been registered and there is not
   * valid [[RpcEndpointRef]] for it. So don't call `self` before `onStart` is called.
   */
  final def self: RpcEndpointRef = {
    require(rpcEnv != null, "rpcEnv has not been initialized")
    rpcEnv.endpointRef(this)
  }

SparkQA · 2019-12-26T07:53:38Z

Test build #115790 has finished for PR 27010 at commit 223c466.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51

How about adding Dispatcher.this.synchronized protection for endpointRefs in both registerRpcEndpoint() and getRpcEndpointRef()?

Actually, endpointRefs.put(endpoint, endpointRef) used to be under the Dispatcher.this.synchronized protection before #26059.

HeartSaVioR · 2019-12-26T23:28:50Z

Thanks for referring to #26059. I took a look and found the actual change relevant to this.

Below is the implementation of registerRpcEndpoint before #26059:

  def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
    val addr = RpcEndpointAddress(nettyEnv.address, name)
    val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
    synchronized {
      if (stopped) {
        throw new IllegalStateException("RpcEnv has been stopped")
      }
      if (endpoints.putIfAbsent(name, new EndpointData(name, endpoint, endpointRef)) != null) {
        throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
      }
      val data = endpoints.get(name)
      endpointRefs.put(data.endpoint, data.ref)
      receivers.offer(data)  // for the OnStart message
    }
    endpointRef
  }

According to the code comment, the code ensures onStart will be called "after" endpointRef is put into endpointRefs. Currently all the operations are done in getMessageLoop, so we can't ensure it.

Maybe SharedMessageLoop and DedicatedMessageLoop shouldn't set the Inbox active by themselves, and should let Dispatcher initiate it.

HeartSaVioR · 2019-12-26T23:35:03Z

Btw, I found the bug can be reproduced consistently by adding Thread.sleep(1000) just before endpointRefs.put(endpoint, endpointRef) in Dispatcher.registerRpcEndpoint(...). I also updated the PR description with this.

SparkQA · 2019-12-27T00:03:29Z

Test build #115823 has finished for PR 27010 at commit 7bcdf29.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-12-27T00:04:34Z

I just changed the approach; please take a look. The idea is that the endpoint should only be accessed through the endpoint ref after the call to registerRpcEndpoint; the only exception is referring to the ref in onStart. So it's safe to put the endpoint ref earlier than assigning to the message loop, and to remove it when assigning to the message loop fails.
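The ordering half of this idea can be sketched as follows. This is a minimal model with hypothetical names (`RefBeforeLoopSketch`, `Ref`, `refSeenByOnStart`), in which a single-thread executor stands in for the message loop; it only shows that publishing the ref before handing the endpoint to the loop lets onStart() observe it.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

// Hypothetical, simplified model of the changed ordering; not Spark's real classes.
object RefBeforeLoopSketch {
  final case class Ref(name: String)

  /** Publishes the ref first, then starts the loop; returns what onStart() saw. */
  def refSeenByOnStart(name: String): Option[Ref] = {
    val endpointRefs = new ConcurrentHashMap[AnyRef, Ref]()
    val endpoint = new Object
    val loop = Executors.newSingleThreadExecutor()
    val onStartDone = new CountDownLatch(1)
    val seen = new AtomicReference[Option[Ref]](None)
    try {
      // Step 1: publish the ref before anything is able to run onStart().
      endpointRefs.put(endpoint, Ref(name))
      // Step 2: hand the endpoint to the loop, which runs onStart() asynchronously.
      loop.execute(new Runnable {
        def run(): Unit = {
          seen.set(Option(endpointRefs.get(endpoint)))
          onStartDone.countDown()
        }
      })
      // Wait for onStart() so the result is read only after it has run.
      onStartDone.await(5, TimeUnit.SECONDS)
      seen.get()
    } finally {
      loop.shutdown()
    }
  }
}
```

With this ordering the lookup inside the simulated onStart() finds the ref regardless of how the two threads are scheduled, because the `put` happens before the task is even submitted.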

SparkQA · 2019-12-27T02:34:56Z

Test build #115824 has finished for PR 27010 at commit fff5050.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51 · 2019-12-27T06:44:00Z

core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

+    // This must be done before assigning RpcEndpoint to MessageLoop, as MessageLoop sets Inbox be
+    // active when registering, and endpointRef must be put into endpointRefs before onStart is
+    // called. Refer the doc of `RpcEndpoint.self`, as well as `NettyRpcEnv.endpointRef`.
+    endpointRefs.put(endpoint, endpointRef)


If we can update endpointRefs here, could we also update endpointRefs above getMessageLoop/assignToMessageLoop?

Sorry, I don't get it. Could you elaborate? If you meant creating endpointRef here, that would be simple to do, but either we need a tuple as the return type or registerRpcEndpoint has to get endpointRef from endpointRefs; there looks to be no big advantage.

Ok, never mind. I got why you do this (endpointRefs.put(endpoint, endpointRef)) in assignToMessageLoop().

Yes, that's for doing it only when putIfAbsent runs the code for "absent".

You read my mind.

Ngone51 · 2019-12-27T06:44:34Z

core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

+          sharedLoop
+      }
+    } catch {
+      case NonFatal(e) =>


When will we fail?

It could be for various reasons, as we do non-trivial operations here; but yes, I haven't met or imagined any real case. It's defensive code, but it ensures the behavior on failure is the same as before (the ref will not be registered in refs).

HeartSaVioR · 2019-12-29T11:57:24Z

cc. @vanzin @zsxwing

vanzin · 2019-12-30T19:11:59Z

core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

@@ -68,11 +82,10 @@ private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) exte
      if (stopped) {
        throw new IllegalStateException("RpcEnv has been stopped")
      }
-      if (endpoints.putIfAbsent(name, getMessageLoop(name, endpoint)) != null) {
+      if (endpoints.putIfAbsent(name, assignToMessageLoop(name, endpoint, endpointRef)) != null) {


While this solves the issue, I don't think it's quite right. The error path here is wrong, because you'll modify endpointRefs and, more importantly, the message loop. (assignToMessageLoop mutates those, and is called here regardless of whether the endpoint should be registered.)

To be fair, the previous code also has that problem w.r.t. the message loop being modified.

I think it would be safe here to have something like:

    def findMessageLoop(endpoint) = {
      // return the right message loop without modification
    }
    val messageLoop = findMessageLoop(endpoint)
    if (endpoints.putIfAbsent(name, messageLoop) != null) {
      throw
    }
    endpointRefs.put(...)
    messageLoop.register(...)

If done inside the synchronized block, that seems to be safe and to solve the problem. DedicatedMessageLoop should also implement register and call setActive there, instead of as part of the constructor. To add another small thing, DedicatedMessageLoop will leak a thread pool here in the error case, so maybe the thread pool should also be created in the register implementation.

In fact... since this is inside a synchronized block anyway, you can simplify some of the above by not using putIfAbsent. Just check with containsKey, throw if it already exists, then find the right loop, put it in endpoints and update endpointRefs, then call register(). You'll still need DedicatedMessageLoop.register() to call setActive() at the right time.
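The concern about the error path comes down to argument evaluation: the second argument of `putIfAbsent` is computed before the map is consulted, so any side effect in it happens even when the name is already taken. A small sketch, where `makeLoop` is a hypothetical stand-in for a side-effecting getMessageLoop/assignToMessageLoop:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical illustration; `makeLoop` only counts how often it is invoked.
object EagerArgumentSketch {
  /** Registers `name` twice via putIfAbsent; returns how often the loop factory ran. */
  def factoryCallsWithPutIfAbsent(name: String): Int = {
    val endpoints = new ConcurrentHashMap[String, String]()
    var calls = 0
    def makeLoop(): String = { calls += 1; s"loop-$calls" }
    endpoints.putIfAbsent(name, makeLoop())
    // The name is already present, yet makeLoop() is still evaluated for this call.
    endpoints.putIfAbsent(name, makeLoop())
    calls
  }

  /** Same registrations, but containsKey is checked first inside a synchronized block. */
  def factoryCallsWithContainsKey(name: String): Int = {
    val endpoints = new ConcurrentHashMap[String, String]()
    var calls = 0
    def makeLoop(): String = { calls += 1; s"loop-$calls" }
    def register(): Boolean = endpoints.synchronized {
      if (endpoints.containsKey(name)) {
        false // duplicate: nothing is created or mutated
      } else {
        endpoints.put(name, makeLoop())
        true
      }
    }
    register()
    register()
    calls
  }
}
```

The first variant runs the factory twice for one successful registration; the second runs it once. The check-then-put form is only safe because it sits inside a lock, which matches the condition stated in the comment above.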

To be fair, the previous code also has that problem w.r.t. the message loop being modified.

Yes, that's the reason I just put a band-aid there and renamed the method as well. I tried to provide the smallest change, as the goal of the patch is just to fix the thread-safety issue.

But basically I totally agree with your suggestion, especially adding an explicit register to DedicatedMessageLoop and not registering the endpoint in findMessageLoop. Will reflect. Thanks for the suggestion.

SparkQA · 2019-12-31T02:17:09Z

Test build #115976 has finished for PR 27010 at commit ac10f87.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2020-01-02T22:17:44Z

core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

+      val msgLoop = findMessageLoop(name, endpoint)
+      endpoints.put(name, msgLoop)
+      try {
+        endpointRefs.put(endpoint, endpointRef)


Hmm. This is all correct but feels a bit overkill. Seems like a simpler version would be:

    endpointRefs.put(endpoint, endpointRef)
    try {
      endpoints.put(name, getMessageLoop(name, endpoint))
    } catch {
      // cleanup endpointRefs
    }

Yes, that uses the old getMessageLoop() (which could be inlined here for clarity), but that's ok as long as it's done after the containsKey check. Then you don't even need the changes to the other file.

Just to confirm, that's pretty close to the patch before ac10f87 (that commit was to reflect the review comment), with additional changes: use containsKey/put instead of putIfAbsent, and inline getMessageLoop (assignToMessageLoop before ac10f87, but the name doesn't matter as we will inline it). Could you confirm?

Hmm, I can't find a way to see the complete patch at a specific commit in the UI, so I'll say "maybe".

The goal is:

not modify "endpoints" when checking if the endpoint exists

update "endpointRefs" before registering the endpoint's message loop (calling register in the case of the shared loop, or creating the dedicated message loop)
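Both goals can be put together in a small self-contained model. This is a sketch under assumptions, not the code that was merged: `MiniDispatcher`, `createLoop`, `refFor`, and `isRegistered` are hypothetical names, and plain strings stand in for endpoint refs.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.util.control.NonFatal

// Hypothetical miniature of the registration logic; not Spark's Dispatcher.
final class MiniDispatcher(createLoop: String => AnyRef) {
  private val endpoints = new ConcurrentHashMap[String, AnyRef]()
  private val endpointRefs = new ConcurrentHashMap[AnyRef, String]()

  def register(name: String, endpoint: AnyRef): String = synchronized {
    // Goal 1: detect a duplicate name without modifying `endpoints`.
    if (endpoints.containsKey(name)) {
      throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
    }
    val ref = s"ref://$name"
    // Goal 2: publish the ref before the message loop (and thus onStart) exists.
    endpointRefs.put(endpoint, ref)
    try {
      endpoints.put(name, createLoop(name))
    } catch {
      case NonFatal(e) =>
        // Roll back so a failed registration leaves no ref behind, then rethrow.
        endpointRefs.remove(endpoint)
        throw e
    }
    ref
  }

  def refFor(endpoint: AnyRef): Option[String] = Option(endpointRefs.get(endpoint))
  def isRegistered(name: String): Boolean = endpoints.containsKey(name)
}
```

A possible usage: `new MiniDispatcher(_ => new Object).register("Master", ep)` returns the ref and leaves it visible through `refFor(ep)`, while a dispatcher whose `createLoop` throws leaves both maps untouched for that endpoint.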

Thanks for confirming! Will make a change.

This reverts commit ac10f87.

SparkQA · 2020-01-03T01:11:24Z

Test build #116059 has finished for PR 27010 at commit 8144b7d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-01-03T09:28:57Z

retest this, please

SparkQA · 2020-01-03T12:18:16Z

Test build #116090 has finished for PR 27010 at commit 8144b7d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2020-01-03T16:42:43Z

core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala

+      } catch {
+        case NonFatal(e) =>
+          endpointRefs.remove(endpoint)
+          if (messageLoop != null && messageLoop.isInstanceOf[DedicatedMessageLoop]) {


This will never happen, because if an exception is thrown, it will be in the DedicatedMessageLoop constructor, so messageLoop will still be null.

You're right. I'll remove it.

Btw, did we decide to just ignore the leaking thread pool? The previous change I reverted was required to deal with it, as the thread pool shouldn't be initialized in the constructor. I guess the answer might be yes, as you've mentioned it as a "small thing", but it's only a matter of git rebase, so please let me know if we want to address it.

If anything fails in that constructor, it will be the creation of the thread pool itself, so I don't think anything would leak. Also if that fails, we have bigger problems anyway.

SparkQA · 2020-01-04T02:47:52Z

Test build #116106 has finished for PR 27010 at commit 68fcd51.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-01-06T14:21:55Z

Looks like the related tests fail more frequently now; I've seen it twice in a PR (not sure why). A kind reminder to handle this sooner.

vanzin · 2020-01-06T16:41:31Z

Merging to master.

HeartSaVioR · 2020-01-06T23:30:27Z

Thanks for reviewing and merging!

[SPARK-30313][CORE] Ensure EndpointRef is available MasterWebUI/Worke…

223c466

…rPage

wangshuo128 mentioned this pull request Dec 26, 2019

[SPARK-30285][CORE] Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError #26924

Closed

Ngone51 reviewed Dec 26, 2019

HeartSaVioR added 3 commits December 27, 2019 08:55

Change approach a bit

d1123fe

Safe-guard

bb43152

remove unnecessary line

7bcdf29

Scalastyle fix

fff5050

Ngone51 reviewed Dec 27, 2019

vanzin reviewed Dec 30, 2019

reflect review comments

ac10f87

vanzin reviewed Jan 2, 2020

HeartSaVioR added 2 commits January 3, 2020 07:57

Revert "reflect review comments"

99ce924

This reverts commit ac10f87.

reflect review comment

8144b7d

vanzin reviewed Jan 3, 2020

Remove unnecessary code

68fcd51

HeartSaVioR mentioned this pull request Jan 6, 2020

[SPARK-29779][CORE] Compact old event log files and cleanup #27085

Closed

vanzin closed this in 895e572 Jan 6, 2020

HeartSaVioR deleted the SPARK-30313 branch January 6, 2020 23:30
