Access the Driver Launcher Server over NodePort for app launch + submit jars #30
Conversation
containerPorts += new ContainerPortBuilder()
  .withContainerPort(portValue)
  .build()

private def getSubmitErrorMessage(
Moved this message construction into its own method as part of this PR since the indentation levels were growing confusing in the top-level method.
👍 that run() method was getting ginormous -- I'd be happy breaking it up even more
@@ -110,37 +108,41 @@ private[spark] class Client(
    .done()
  try {
    val selectors = Map(DRIVER_LAUNCHER_SELECTOR_LABEL -> driverLauncherSelectorValue).asJava
    val (servicePorts, containerPorts) = configurePorts()
    val containerPorts = configureContainerPorts()
    val driverLauncherServicePort = new ServicePortBuilder()
I think we should delay the creation of the service as much as possible, meaning we should wait until after the pod is submitted and running. Also, we should decide whether there are local jars being submitted or not; if there aren't, we don't need to expose the NodePort at all. WDYT?
Delaying the service creation makes sense, but I don't think we should branch on whether or not local jars are uploaded. The latter is mostly due to the complexity we would add. Right now our launch logic is pretty tightly coupled to KubernetesSparkRestServer, and even without the consideration of local jars we do a lot there to build the classpath and form the specific Java command. Presumably when we start to handle Python and other languages that class will need more functionality. The driver's Dockerfile assumes the usage of KubernetesSparkRestServer as well. To branch we would need to create a separate Dockerfile and launch setup and that doesn't seem worth the added complexity.
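For orientation, a rough sketch of the ordering discussed in this thread; waitUntilRunning, uploadJarsAndSubmit, and shrinkServiceToUiPort are hypothetical placeholders rather than methods from this PR, and the real logic in Client.run() is more involved.

// Illustrative sketch only: pod first, NodePort service only for the upload window,
// then the service is narrowed down to the UI port.
def submitFlowSketch(): Unit = {
  val pod = kubernetesClient.pods().create(driverPodSpec)            // 1. create the driver pod
  waitUntilRunning(pod)                                              // 2. hypothetical: block until the pod is Running
  val service = kubernetesClient.services().create(nodePortService)  // 3. expose the driver launcher over NodePort
  uploadJarsAndSubmit(service)                                       // 4. hypothetical: upload local jars, send the submit request
  shrinkServiceToUiPort()                                            // 5. hypothetical: drop the NodePort, keep only the Spark UI port
}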
  .filter(_.getName == DRIVER_LAUNCHER_SERVICE_PORT_NAME)
  .head
  .getNodePort
// NodePort is exposed on every node, so just pick one of them.
@foxish can we use the API server here? Is the API server also going to be guaranteed to be listening on the NodePort?
We don't need to use the APIServer proxy if we're using NodePort; it should be exposed directly on the NodeIP:port.
Sorry, I meant can we use the API server's host with the NodePort. Basically instead of selecting a node at random we can always pick to use the API server host.
The port should be exposed on all nodes and we can pick any at random. Using the Master's port seems less than ideal because we're trying to shift traffic away from it by not using the apiserver proxy.
Makes sense. I was mostly thinking in the context of when we introduce SSL. Mainly because the certificate issued by the server will now need to have its hostnames match all of the kubelets so the client can domain verify against any of them. I suppose if we document this clearly then we should be ok though.
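As a concrete illustration of "pick any node", a sketch using the fabric8 client; driverLauncherUrl is a hypothetical helper and the random shuffle is illustrative (the code in this PR simply takes the first node returned by the API).

import scala.collection.JavaConverters._
import scala.util.Random

import io.fabric8.kubernetes.client.KubernetesClient

def driverLauncherUrl(kubernetesClient: KubernetesClient, nodePort: Int): String = {
  // The NodePort is open on every node, so any node address reachable from the
  // submitting machine will do.
  val nodes = kubernetesClient.nodes().list().getItems.asScala
  val address = Random.shuffle(nodes).head.getStatus.getAddresses.asScala.head.getAddress
  s"http://$address:$nodePort"
}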
  .withName(UI_PORT_NAME)
  .withPort(uiPort)
  .withNewTargetPort(uiPort)
  .build
It's typical Spark style to always include parentheses when calling a function: .build -> .build(), same elsewhere.
- Expose the driver-submission service on NodePort and contact that as opposed to going through the API server proxy
- Restrict the ports that are exposed on the service to only the driver submission service when uploading content and then only the Spark UI after the job has started
@foxish @ash211 @erikerlandson As mentioned in my latest comment on #24, I think this implementation here is what we want to go with for now.
sparkConf.setIfMissing("spark.driver.port", DEFAULT_DRIVER_PORT.toString)
sparkConf.setIfMissing("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT.toString)
val submitRequest = buildSubmissionRequest()
val selectors = Map(DRIVER_LAUNCHER_SELECTOR_LABEL -> driverLauncherSelectorValue).asJava
Bad merge here, will fix
Took a pretty close look -- some small coding/style nits, but the thing I'm actually concerned about is the lack of SSL on the driver upload, and especially sending a secret in plaintext across the network.
What's the right way to bring that secret transfer back to being over SSL?
EDIT: now I see your other PR adding SSL..
val node = kubernetesClient.nodes.list.getItems.asScala.head
val nodeAddress = node.getStatus.getAddresses.asScala.head.getAddress
val url = s"http://$nodeAddress:$servicePort"
HttpClientUtil.createClient[KubernetesSparkRestApi](uri = url)
looks like there's no ssl here anymore -- I think we still want the spark-submit upload of jars to be encrypted though? especially because there's a secret being passed as part of the KubernetesCreateSubmissionRequest, and now that's in plaintext
Ah yes, saw that later... let's get that PR merged into this one and review the whole shebang together
kubernetesClient.services().withName(kubernetesAppId).edit()
  .editSpec()
  .withType("ClusterIP")
  .withPorts(uiServicePort)
ideally we could just remove the now-unnecessary port rather than resetting back to this state. An admin may have added additional ports in the meantime that this would now override?
Alternatively we can save the "additional app ports" feature for a separate PR since this is already a larger one
You can't say "remove this port" with the services API; the ports need to be entirely re-defined on every edit of the service.
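For context, a sketch of what that re-definition looks like with the fabric8 client used here, filling in the closing endSpec()/done() calls the quoted hunk cuts off; whatever is passed to withPorts becomes the entire port list.

kubernetesClient.services().withName(kubernetesAppId).edit()
  .editSpec()
    .withType("ClusterIP")
    .withPorts(uiServicePort)   // the complete desired set of ports; anything not restated here is dropped
  .endSpec()
  .done()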
  .map(_.toInt)
  .getOrElse(DEFAULT_UI_PORT))
(servicePorts, containerPorts)

private def configureContainerPorts(): Seq[ContainerPort] = {
name more like createContainerPortSpecs
val finalErrorMessage = s"$topLevelMessage\n" +
  s"$podStatusPhase\n" +
  s"$podStatusMessage\n\n$failedDriverContainerStatusString"
val finalErrorMessage: String = getSubmitErrorMessage(kubernetesClient, e)
maybe name this more like constructDriverSubmitErrorMessage -- getX makes me think there are no external calls being made, whereas this hits the apiserver again looking for more detailed diagnostics
uri = url,
sslSocketFactory = sslContext.getSocketFactory,
trustContext = trustManager)

private def getDriverLauncherClient(
constructDriverLauncherClient
adjustServicePort onFailure {
  case throwable: Throwable =>
    submitCompletedFuture.setException(throwable)
    kubernetesClient.services().delete(service)
do you need a try/catch around this also? We do this often enough and log a message that I wonder if we need a tryOrLogError wrapper we can put around this thing
There is a Utils.tryAndLogNonFatal, but I've actually had some problems with that method a while ago, particularly with nested closures. However, it was a while ago and I can't remember what the exact problem was. Might want to look into that a little more.
I don't think we need a try-catch here since the error will appear on the separate thread.
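A minimal sketch of the wrapper idea floated above, assuming a class that mixes in Spark's Logging trait; CleanupHelpers and tryOrLogError are hypothetical names, and Spark's own Utils.tryLogNonFatalError (used later in this PR) plays the same role.

import scala.util.control.NonFatal

import org.apache.spark.internal.Logging

trait CleanupHelpers extends Logging {
  // Run a best-effort cleanup step, logging non-fatal errors instead of propagating them.
  def tryOrLogError(description: String)(block: => Unit): Unit = {
    try {
      block
    } catch {
      case NonFatal(e) => logError(s"Ignoring non-fatal error while $description", e)
    }
  }
}

Usage would then read like tryOrLogError("deleting the submission service") { kubernetesClient.services().delete(service) }.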
val submitCompletedFuture = SettableFuture.create[Boolean]
val secretDirectory = s"$SPARK_SUBMISSION_SECRET_BASE_DIR/$kubernetesAppId"

val submitPending = new AtomicBoolean(false)
is this addition because previously a modification of any sort to the pod would trigger the watcher and start a second jar upload?
More that I'm not sure if the implementation of watches can cause this method to be running twice simultaneously. If this method could be run again while a first invocation of the method has not completed and hence not set the future's status, then yes it would be possible that the upload logic is triggered twice.
Actually this doesn't even matter since some of the logic is asynchronous to this method's call (see our usage of futures) - so we certainly need this and we can't use synchronized to work around that.
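A small sketch of the guard being described; onDriverPodReady and submitApplicationToDriverServer are hypothetical names standing in for the watch callback and the upload logic.

import java.util.concurrent.atomic.AtomicBoolean

private val submitPending = new AtomicBoolean(false)

// compareAndSet makes the submission a one-shot action even if the watch fires
// multiple times or concurrently: only the first caller flips false -> true.
private def onDriverPodReady(): Unit = {
  if (submitPending.compareAndSet(false, true)) {
    submitApplicationToDriverServer()   // hypothetical: upload local jars, send the create-submission request
  }
}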
Also I think you could edit the PR to be named something more like "Upload local jars to driver via NodePort instead of ServerApi" so the content of the change is clearer from reading the PR description, not just that there was a change (pretty much all you get from generic revamp/refactor/edit titles).
Method nesting was getting confusing so pulled out the inner class and removed the extra method indirection from createDriverPod()
@aash addressed the comments. Used
In a subsequent PR, we should do #57.
Had a few more comments on here -- next step is to merge #49 into here
  .build(),
new ContainerPortBuilder()
  .withContainerPort(uiPort)
  .build())
I think this can be rewritten to:

private def buildContainerPorts(): Seq[ContainerPort] = {
  Seq(sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
      sparkConf.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT),
      DRIVER_LAUNCHER_SERVICE_INTERNAL_PORT,
      uiPort)
    .map(p => new ContainerPortBuilder().withContainerPort(p).build())
}
@@ -92,7 +92,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
  protected var totalExpectedExecutors = new AtomicInteger(0)

  private val driverUrl = RpcEndpointAddress(
    System.getenv(s"${convertToEnvMode(kubernetesDriverServiceName)}_SERVICE_HOST"),
does removing this also allow us to remove the convertToEnvMode method? I don't see it used elsewhere
Yeah we can remove that
@@ -92,7 +92,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
  protected var totalExpectedExecutors = new AtomicInteger(0)

  private val driverUrl = RpcEndpointAddress(
    System.getenv(s"${convertToEnvMode(kubernetesDriverServiceName)}_SERVICE_HOST"),
    sc.getConf.get("spark.driver.host"),
why set this using spark config instead of the k8s-provided envvar? does this get printed in logs / Spark UI and therefore we should set it?
We no longer create the service before creating the pod, which means this environment variable won't be set. Also, SERVICE_HOST refers to the service's IP, but we're no longer exposing the driver port through the service (this is different from the driver launcher server port).
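Roughly, the driver URL is then assembled purely from Spark configuration; a sketch, not the exact line in KubernetesClusterSchedulerBackend, with DEFAULT_DRIVER_PORT assumed to be the constant already used elsewhere in this PR.

import org.apache.spark.rpc.RpcEndpointAddress
import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

// Host comes from spark.driver.host, set by the submission client, instead of the
// <SERVICE_NAME>_SERVICE_HOST env var that no longer exists at pod-creation time.
private val driverUrl = RpcEndpointAddress(
  sc.getConf.get("spark.driver.host"),
  sc.getConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
  CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString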
…te-incremental' into nodeport-upload
* Support SSL when setting up the driver. The user can provide a keyStore to load onto the driver pod and the driver pod will use that keyStore to set up SSL on its server.
* Clean up SSL secrets after finishing submission. We don't need to persist these after the pod has them mounted and is running already.
* Fix compilation error
* Revert image change
* Address comments
* Programmatically generate certificates for integration tests.
* Address comments
* Resolve merge conflicts
@@ -132,6 +132,24 @@ To specify a main application resource that is in the Docker image, and if it ha
    --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
    container:///home/applications/examples/example.jar

### Setting Up SSL For Submitting the Driver
There might be something we want to note here about certificates and hostname matching against the Kubelets.
--hostname $HOSTNAME \
--port $SPARK_DRIVER_LAUNCHER_SERVER_PORT \
--secret-file $SPARK_SUBMISSION_SECRET_LOCATION \
${SSL_ARGS}
What about creating a shell script for this? That would make it more readable and editor-friendly.
Considered this - we can follow up in a separate PR, but I appreciate the visibility of embedding the command directly in the Dockerfile, as opposed to the indirection where a user would look at the Dockerfile and then have to look at a separate script afterwards.
  }
} finally {
  if (!submitSucceeded) {
    Utils.tryLogNonFatalError({
Please remove the parens around the anonymous block
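In other words, for a single by-name block argument the braces alone suffice; illustrated here with the secret cleanup from the next hunk.

// Before: redundant parentheses around the block
Utils.tryLogNonFatalError({
  kubernetesClient.secrets().delete(submitServerSecret)
})

// After: braces alone
Utils.tryLogNonFatalError {
  kubernetesClient.secrets().delete(submitServerSecret)
}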
} finally {
  kubernetesClient.secrets().delete(secret)
  Utils.tryLogNonFatalError({
    kubernetesClient.secrets().delete(submitServerSecret)
ditto
    kubernetesClient.secrets().delete(submitServerSecret)
  })
  Utils.tryLogNonFatalError({
    kubernetesClient.secrets().delete(sslSecrets: _*)
ditto
s"file://${trustStoreFile.getAbsolutePath}",
"--conf", s"spark.ssl.kubernetes.driverlaunch.trustStorePassword=changeit",
EXAMPLES_JAR)
SparkSubmit.main(args)
We can call expectationsForStaticAllocation here to make some assertions.
Also, I think the naming of spark.ssl.kubernetes.driverlaunch.enabled is not as good as spark.kubernetes.driverlaunch.ssl.enabled. We should put it in the scope of spark.kubernetes.*.
We use spark.ssl to be consistent with the other SSL-related things in the Spark project.
We shouldn't call expectationsForStaticAllocation because this test solely checks that the app can be submitted over SSL at all. If there were test failures because we allocated the wrong number of executors, for example, the test would fail with a false negative.
I think this is in a good spot now -- @foxish any last thoughts before merging?
@ash211 Nothing from my end, merging.
…it jars (#30)

* Revamp ports and service setup for the driver.
  - Expose the driver-submission service on NodePort and contact that as opposed to going through the API server proxy
  - Restrict the ports that are exposed on the service to only the driver submission service when uploading content and then only the Spark UI after the job has started
* Move service creation down and more thorough error handling
* Fix missed merge conflict
* Add braces
* Fix bad merge
* Address comments and refactor run() more. Method nesting was getting confusing so pulled out the inner class and removed the extra method indirection from createDriverPod()
* Remove unused method
* Support SSL configuration for the driver application submission (#49)
  * Support SSL when setting up the driver. The user can provide a keyStore to load onto the driver pod and the driver pod will use that keyStore to set up SSL on its server.
  * Clean up SSL secrets after finishing submission. We don't need to persist these after the pod has them mounted and is running already.
  * Fix compilation error
  * Revert image change
  * Address comments
  * Programmatically generate certificates for integration tests.
  * Address comments
  * Resolve merge conflicts
* Fix bad merge
* Remove unnecessary braces
* Fix compiler error
- Expose the driver-submission service on NodePort and contact that as opposed to going through the API server proxy
- Restrict the ports that are exposed on the service to only the driver submission service when uploading content and then only the Spark UI after the job has started

Closes #24 and is a prerequisite for #28.