[SPARK-20529][Core]Allow worker and master work with a proxy server #17821

zsxwing · 2017-05-01T18:01:36Z

What changes were proposed in this pull request?

In the current code, when a worker connects to the master, the master sends its address to the worker. The worker saves this address and uses it to reconnect after a failure. However, this address is sometimes incorrect: if there is a proxy between the master and the worker, the address the master sends is not the address of the proxy.

In this PR, the worker sends the master address it used to the master, and the master simply replies with that same address. The worker then uses this address to reconnect after a failure. In other words, where possible the worker uses the master address configured on the worker side rather than the one configured on the master side.

There is still one potential issue. When a master is restarted or takes over leadership, the worker will use the address sent from the master to connect. If there is still a proxy between the master and the worker, that address may be wrong, and the worker alone has no way to detect this.
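The registration round trip described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `RpcAddress`, `RegisterWorker`, and `RegisteredWorker` here are simplified stand-ins for the real Spark types (the real `RegisteredWorker` also carries an `RpcEndpointRef`), and `SketchMaster`, `SketchWorker`, and the host names are hypothetical.

```scala
// Simplified stand-ins for Spark's RPC and deploy-message types (assumptions).
final case class RpcAddress(host: String, port: Int)

// Worker -> master: carries the master address the worker actually dialed.
final case class RegisterWorker(workerId: String, masterAddress: RpcAddress)

// Master -> worker: echoes that address back unchanged.
final case class RegisteredWorker(masterWebUiUrl: String, masterAddress: RpcAddress)

// Hypothetical master that only knows its own bind address.
final class SketchMaster(selfAddress: RpcAddress) {
  def register(msg: RegisterWorker): RegisteredWorker =
    // Reply with the address the worker used, not with selfAddress.
    RegisteredWorker(s"http://${selfAddress.host}:8080", msg.masterAddress)
}

// Hypothetical worker configured with the proxy's address.
final class SketchWorker(configuredMaster: RpcAddress) {
  private var masterAddressToConnect: Option[RpcAddress] = None

  def registerWith(master: SketchMaster): Unit = {
    val reply = master.register(RegisterWorker("worker-1", configuredMaster))
    masterAddressToConnect = Some(reply.masterAddress)
  }

  // Address the worker would use to reconnect after a failure.
  def reconnectTarget: Option[RpcAddress] = masterAddressToConnect
}

object ProxyEchoSketch {
  def main(args: Array[String]): Unit = {
    val proxy = RpcAddress("proxy.example.com", 7077)
    val master = new SketchMaster(RpcAddress("10.0.0.5", 7077))
    val worker = new SketchWorker(proxy)
    worker.registerWith(master)
    println(worker.reconnectTarget)
  }
}
```

In this sketch the worker ends up holding the proxy address it was configured with, even though the master itself is bound to a different address.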

How was this patch tested?

A newly added unit test.

zsxwing · 2017-05-01T18:01:59Z

cc @sameeragarwal

SparkQA · 2017-05-01T20:54:46Z

Test build #76351 has finished for PR 17821 at commit 8ded9b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class RegisteredWorker(

sameeragarwal

Looks solid, just some minor comments. Thanks!

sameeragarwal · 2017-05-01T21:25:54Z

core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala

+  case class RegisteredWorker(
+      master: RpcEndpointRef,
+      masterWebUiUrl: String,
+      masterAddress: RpcAddress) extends DeployMessage with RegisterWorkerResponse


Can we avoid adding an extra field here? Perhaps just put the masterAddress in the master field.

I checked the current code. Unfortunately, we cannot remove this extra field: master.address and masterAddress are different.

Alright, that sounds good.

sameeragarwal · 2017-05-01T21:28:16Z

core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala

@@ -266,7 +282,7 @@ private[deploy] class Worker(
            if (registerMasterFutures != null) {
              registerMasterFutures.foreach(_.cancel(true))
            }
-            val masterAddress = masterRef.address
+            val masterAddress = masterAddressToConnect.get


How about we protect this change with a conf (with a default that still uses masterRef)? If we can merge master and masterAddress as I suggested above, we can just add a conf on the master and leave the worker code largely unaffected.

Added a new conf
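A guard of this kind might look like the sketch below. It uses a plain `Map` instead of Spark's configuration class so that it stays self-contained, and the key name is an assumption: the thread only shows the field name `preferConfiguredMasterAddress`, not the configuration key.

```scala
object ConfGateSketch {
  // Assumed key name; the thread only shows the field preferConfiguredMasterAddress.
  val PreferConfiguredKey = "spark.worker.preferConfiguredMasterAddress"

  // Reads the flag from a plain settings map. A missing or non-"true" value
  // yields false, so the address from masterRef remains the default behavior.
  def preferConfiguredMasterAddress(settings: Map[String, String]): Boolean =
    settings.get(PreferConfiguredKey).exists(_.trim.toLowerCase == "true")

  def main(args: Array[String]): Unit = {
    println(preferConfiguredMasterAddress(Map.empty))
    println(preferConfiguredMasterAddress(Map(PreferConfiguredKey -> "true")))
  }
}
```

Defaulting to `false` keeps the previous behavior unless an operator opts in, which matches the reviewer's request for a default that still uses masterRef.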

sameeragarwal

LGTM, just a small question. Thanks!

sameeragarwal · 2017-05-02T22:59:04Z

core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala

+  case class RegisteredWorker(
+      master: RpcEndpointRef,
+      masterWebUiUrl: String,
+      masterAddress: RpcAddress) extends DeployMessage with RegisterWorkerResponse


Alright, that sounds good.

sameeragarwal · 2017-05-02T23:02:01Z

core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala

@@ -266,7 +289,8 @@ private[deploy] class Worker(
            if (registerMasterFutures != null) {
              registerMasterFutures.foreach(_.cancel(true))
            }
-            val masterAddress = masterRef.address
+            val masterAddress =
+              if (preferConfiguredMasterAddress) masterAddressToConnect.get else masterRef.address


Perhaps it isn't an issue, but do you think we should fall back to masterRef.address in case masterAddressToConnect isn't set (instead of throwing a generic Scala exception)? Something along the lines of:

val masterAddress = masterAddressToConnect match {
  case Some(master) if preferConfiguredMasterAddress => master
  case _ => masterRef.address
}

Right now masterRef and masterAddressToConnect are set at the same time, so an unset masterAddressToConnect is impossible unless we break something in the future. It's better to fail than to hide such a breaking change.
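The trade-off in this exchange can be illustrated with a small standalone sketch. `RpcAddress` is a simplified stand-in, and `failFast` and `withFallback` are hypothetical helper names; they only mirror the shape of the two expressions discussed above.

```scala
import scala.util.Try

// Simplified stand-in for Spark's RpcAddress (assumption).
final case class RpcAddress(host: String, port: Int)

object ReconnectAddressSketch {
  // Fail-fast variant: Option.get throws NoSuchElementException when the
  // preference is on but no address was recorded.
  def failFast(prefer: Boolean, configured: Option[RpcAddress], fromRef: RpcAddress): RpcAddress =
    if (prefer) configured.get else fromRef

  // Fallback variant: silently uses the master-provided address instead.
  def withFallback(prefer: Boolean, configured: Option[RpcAddress], fromRef: RpcAddress): RpcAddress =
    configured match {
      case Some(addr) if prefer => addr
      case _ => fromRef
    }

  def main(args: Array[String]): Unit = {
    val fromRef = RpcAddress("10.0.0.5", 7077)
    // Simulate the broken invariant: preference on, nothing recorded.
    println(Try(failFast(prefer = true, None, fromRef)).isFailure)
    println(withFallback(prefer = true, None, fromRef) == fromRef)
  }
}
```

With the fallback, a worker behind a proxy would quietly reconnect to an address that may be unreachable, whereas the fail-fast form surfaces the broken invariant at once.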

SparkQA · 2017-05-03T00:52:38Z

Test build #76392 has finished for PR 17821 at commit f4699ad.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-03T11:10:29Z

Test build #3684 has finished for PR 17821 at commit f4699ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2017-05-16T17:35:17Z

Thanks! Merging to master and 2.2.

## What changes were proposed in this pull request?

In the current code, when a worker connects to the master, the master sends its address to the worker. The worker saves this address and uses it to reconnect after a failure. However, this address is sometimes incorrect: if there is a proxy between the master and the worker, the address the master sends is not the address of the proxy.

In this PR, the worker sends the master address it used to the master, and the master simply replies with that same address. The worker then uses this address to reconnect after a failure. In other words, where possible the worker uses the master address configured on the worker side rather than the one configured on the master side.

There is still one potential issue. When a master is restarted or takes over leadership, the worker will use the address sent from the master to connect. If there is still a proxy between the master and the worker, that address may be wrong, and the worker alone has no way to detect this.

## How was this patch tested?

A newly added unit test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #17821 from zsxwing/SPARK-20529.

(cherry picked from commit 9150bca)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

Fix SPARK-20529

8ded9b1

sameeragarwal reviewed May 1, 2017

Add a conf

f4699ad

sameeragarwal reviewed May 2, 2017

asfgit closed this in 9150bca May 16, 2017

zsxwing deleted the SPARK-20529 branch May 16, 2017 17:42

bleggett mentioned this pull request Oct 4, 2017

If master pod is destroyed and recreated, it takes ages for worker pods to timeout radanalyticsio/oshinko-cli#84

Open
