Conversation
Force-pushed 4f6f75a to 988db3b
shufflePodCache.get(executorNode) match {
  case Some(pod) => pod
  case _ =>
    throw new SparkException(s"Unable to find shuffle pod on node $executorNode")
A corner case comes to mind: if a shuffle pod for a node died and is being restarted, and at that moment a new executor on that node registers with the driver, it would crash the driver.
Can we improve this, e.g. let the executor die when the shuffle pod is not ready, instead of throwing a SparkException that aborts the driver?
Nice catch! I think you're right. Done. Passing back an empty string which should make the executor crash.
Throwing the exception here should be fine, right? This is being processed on a separate thread in the RPC environment. Thus the exception here should only propagate to the executor that asked for this configuration.
I could be wrong, but IIUC SparkException is thrown whenever unrecoverable errors happen.
In this case it's less important what the type of exception is. What's important is where the exception is thrown from and where it propagates to.
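(For reference, a minimal sketch of the fallback settled on above: returning an empty address so that only the requesting executor fails. Names like shufflePodCache and getShufflePodAddress are illustrative, not the exact code in this PR.)

def getShufflePodAddress(executorNode: String): String = {
  shufflePodCache.get(executorNode) match {
    case Some(pod) =>
      pod.getStatus.getPodIP
    case None =>
      // Returning an empty address lets the requesting executor fail on its own
      // instead of propagating an exception that could take down the driver.
      logWarning(s"No shuffle pod found on node $executorNode")
      ""
  }
}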
      addShufflePodToCache(p)
    }
  }
  override def onClose(e: KubernetesClientException): Unit = {}
Is a daemonset watchable? If so, can we watch it directly instead of using labels?
It is, but we don't want to assume that a daemonset is in use. With the current approach it stays generic: any pod co-located on that node with the matching labels will do.
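(A rough sketch of the label-based watch described above, using the fabric8 client; shuffleLabels and the cache helpers are assumed names.)

import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

// Watch any pod carrying the configured shuffle labels, regardless of whether it
// was created by a daemonset, a deployment, or something else on that node.
client.pods().withLabels(shuffleLabels.asJava).watch(new Watcher[Pod] {
  override def eventReceived(action: Watcher.Action, p: Pod): Unit = action match {
    case Watcher.Action.ADDED | Watcher.Action.MODIFIED => addShufflePodToCache(p)
    case Watcher.Action.DELETED | Watcher.Action.ERROR => removeShufflePodFromCache(p)
  }
  override def onClose(e: KubernetesClientException): Unit = {}
})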
"shuffle-labels") | ||
val shuffleDirs = conf.getOption(KUBERNETES_SHUFFLE_DIR.key).map { | ||
_.split(",") | ||
}.getOrElse(Utils.getConfiguredLocalDirs(conf)) |
Should we throw an exception here if dynamic allocation is enabled but shuffle.labels is empty?
Done
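(A sketch of the guard being asked for, assuming the config key names visible elsewhere in this PR and Spark's Utils.isDynamicAllocationEnabled helper.)

// Fail fast if the shuffle service is expected but no labels identify the shuffle pods.
if (Utils.isDynamicAllocationEnabled(conf) &&
    conf.get(KUBERNETES_SHUFFLE_LABELS.key, "").isEmpty) {
  throw new SparkException("Dynamic allocation enabled " +
    s"but no ${KUBERNETES_SHUFFLE_LABELS.key} specified")
}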
import org.apache.spark.internal.Logging
import org.apache.spark.util.ThreadUtils

private[spark] class NodeCacheManager (
nit: the name NodeCacheManager is not that intuitive. Maybe something like ShufflePodsCatalog? I don't know...
Changed to ShufflePodCache
Force-pushed fc70821 to bccf43b
Reviewed the top level PR, will follow up on changes in the subsequent individual commits.
private val allocatorRunnable: Runnable = new Runnable {
  override def run(): Unit = {
    if (runningExecutorPods.size - totalRegisteredExecutors.get() > 0) {
Could probably just use < for clarity.
override def run(): Unit = {
  if (runningExecutorPods.size - totalRegisteredExecutors.get() > 0) {
    logDebug("Waiting for pending executors before scaling")
    return
Try to structure the logic so that we don't use return.
if (totalExpectedExecutors.get() <= runningExecutorPods.size) {
  logDebug(
    "Maximum allowed executor limit reached. Not scaling up further.")
  return
Similarly here - avoid return.
 * KubernetesAllocator class watches executor registrations, limits
 * and creates new executors when it is appropriate.
 */
private[spark] class KubernetesAllocator(client: KubernetesClient)
Does this specifically need to be a separate class? The code could just be inlined in the scheduler backend class.
I like the separate class for the separation it offers for this allocator mechanism. Do you strongly prefer not having it?
This ought to be in a separate file then, I think. But if this also requires a backwards reference to the scheduler backend (e.g. to access fields) then this should just be inlined.
It requires access to totalRegisteredExecutors, which is a protected field in CoarseGrainedSchedulerBackend, and a couple of other accounting fields from the KubernetesClusterSchedulerBackend.
Oh, did you mean that it shouldn't be a separate class?
Done
def start(): Unit = {
  // seed the initial cache.
  val pods = client.pods().withLabels(dsLabels.asJava).list()
  for (pod <- pods.getItems.asScala) {
I think we prefer foreach over for in general.
Done
@@ -81,6 +82,68 @@ private[spark] class KubernetesV1Suite(testBackend: IntegrationTestBackend)
  })
}

private def expectationsForDynamicAllocation(sparkMetricsService: SparkRestApiV1): Unit = {
Let's avoid adding tests to V1 and solely focus on V2.
Done
    .endContainer()
    .endSpec()

var resolvedPodBuilder = shuffleServiceConfig
val, not var.
Done
import org.apache.spark.sql.SparkSession

object GroupByTest {
  def main(args: Array[String]) {
Would be good to confirm that this test is creating multiple executors and is writing files to the shuffle service. I'm not sure if we can do this in an automated way.
We can achieve that by setting --conf spark.dynamicAllocation.minExecutors=2 and waiting for these two executors to be ready via the k8s API (or the Spark REST API?).
The executors can spin up but not write any shuffle data to disk. We should check that shuffle data is being written to the disks.
I chose this test so that it would have shuffle data being written to disk. I've manually verified that it does write to disk.
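(A minimal sketch of a groupBy workload along these lines; the sizes below are illustrative, chosen only so that map output is shuffled across executors.)

import java.util.Random

import org.apache.spark.sql.SparkSession

object GroupByTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GroupBy Test").getOrCreate()
    val (numMappers, numKVPairs, valSize, numReducers) = (5, 1000, 1000, 2)

    // Generate random key-value pairs on each mapper partition.
    val pairs = spark.sparkContext.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val ranGen = new Random
      (0 until numKVPairs).map { _ =>
        val arr = new Array[Byte](valSize)
        ranGen.nextBytes(arr)
        (ranGen.nextInt(Int.MaxValue), arr)
      }
    }.cache()
    pairs.count()  // force evaluation so the data is cached before the shuffle below
    println(pairs.groupByKey(numReducers).count())  // groupByKey forces a shuffle to disk
    spark.stop()
  }
}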
}

def start(): Unit = {
  if (interval > 0) {
Just make the interval an Optional configuration and use interval.foreach, as opposed to checking whether an integer is greater than zero. We should then validate that any given value is positive and throw an exception if it isn't.
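(A sketch of the Option-based scheduling suggested here; interval, scheduler, and refreshTask are assumed names for an Option[Long], a ScheduledExecutorService, and a Runnable respectively.)

import java.util.concurrent.TimeUnit

// Only schedule when an interval was configured, and reject non-positive values.
interval.foreach { delaySeconds =>
  require(delaySeconds > 0, s"Interval should be a positive number of seconds, got $delaySeconds")
  scheduler.scheduleWithFixedDelay(refreshTask, 0, delaySeconds, TimeUnit.SECONDS)
}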
s"but no ${KUBERNETES_SHUFFLE_LABELS.key} specified") | ||
} | ||
|
||
val shuffleDirs = conf.getOption(KUBERNETES_SHUFFLE_DIR.key).map { |
Can we use .get instead of .getOption here?
I was trying to get Option[String] because if the shuffle directory is left empty, we use the default from Utils.getConfiguredLocalDirs(conf). I'm not sure how we can get this behavior using get.
Using .get here without using .key on the configuration key should give back an Option.
Ah.. good point. Done
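(A sketch of the difference being pointed out, assuming KUBERNETES_SHUFFLE_DIR is declared with createOptional so the typed getter already returns an Option[String].)

val shuffleDirs = conf.get(KUBERNETES_SHUFFLE_DIR)  // typed entry: yields Option[String], no .key needed
  .map(_.split(","))
  .getOrElse(Utils.getConfiguredLocalDirs(conf))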
Force-pushed 3ac475f to 5264fad
@mccheah Addressed all comments. PTAL
Force-pushed 5264fad to 4400a8c
Has the unit testing changed? I'm seeing failures in files I did not touch at all.
rerun unit tests please
@foxish I've been working on an SBT-based unit test build in jenkins and it looks like it was racing with the current maven-based unit tests. I've disabled the new test build and expect just the old one to be running now. Sorry about that!
@ash211, we can fix the new one. The errors appeared to be:
Sending a PR to fix these.
Ah! Okay, SG. Thanks!
Force-pushed 2f05ac0 to a861849
Force-pushed a861849 to c87008d
Would it be possible to add unit-level tests around these? It would be great if we can start hardening the features we are implementing here. Unit testing these kinds of things can be difficult; we would probably have to refactor much of the scheduler backend and the shuffle pod cache to be able to verify the things that are important.
  Some(conf.get(KUBERNETES_ALLOCATION_BATCH_SIZE))
} else {
  throw new SparkException(s"Allocation batch size ${KUBERNETES_ALLOCATION_BATCH_SIZE} " +
    s"should be a positive integer")
Add what value the user specified.
@@ -130,12 +197,27 @@ private[spark] class KubernetesClusterSchedulerBackend(
  super.start()
  executorWatchResource.set(kubernetesClient.pods().withLabel(SPARK_APP_ID_LABEL, applicationId())
    .watch(new ExecutorPodsWatcher()))

  podAllocationInterval.foreach(allocator.scheduleWithFixedDelay(allocatorRunnable,
    0,
Put allocatorRunnable, 0, TimeUnit.SECONDS all on this line.
I think since podAllocationInterval is now always going to be provided (we always set it to Some(...) or throw an exception), this thread will always be running. Is this the intended behavior? If so, there's no need to use foreach and options here.
Force-pushed c87008d to 2b5bba0
private val allocatorRunnable: Runnable = new Runnable {
  override def run(): Unit = {
    if (totalRegisteredExecutors.get() >= runningExecutorPods.size) {
      if (totalExpectedExecutors.get() > runningExecutorPods.size) {
Would be cleaner I think to use if...else if...else here:
if (...) {
logDebug("Maximum allowed executor limit...")
} else if (...) {
logDebug("Waiting for pending...")
} else {
// Actual logic
}
Done. Thanks! Trying to add a couple of unit tests to ShufflePodCache now and a mechanism that might help us add tests easily in the future.
Force-pushed 2b5bba0 to 26805ed
@@ -105,6 +131,44 @@ private[spark] class KubernetesClusterSchedulerBackend(

private val initialExecutors = getInitialTargetExecutorNumber(1)

private val podAllocationInterval =
  if (conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY) > 0) {
There's probably no need to make this an option - just assign podAllocationInterval directly, then check the variable and throw the SparkException immediately afterwards.
Done
import org.apache.spark.SparkException

object CommandLineUtils {
This should be KeyValueUtils? This doesn't seem related to the command line.
I envisioned it as a place for utility functions related to command-line options, of which we could have more in the future. The key-value parsing is needed because these values are supplied as command-line strings.
Is it confusing?
This seems to mainly be used to parse out labels and annotations from SparkConf values - the command line doesn't seem to be related to that.
I see, I was assuming the primary way of supplying those args was via the command line. Okay, how about ConfigurationUtils? KeyValueUtils.parseKeyValuePairs() just seems a bit redundant.
ConfigurationUtils is fine.
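(A rough shape for the renamed helper; the signature below is a guess at what parseKeyValuePairs could look like, not the code from this PR.)

import org.apache.spark.SparkException

object ConfigurationUtils {
  // Parses a comma-separated "key=value,key2=value2" string from a SparkConf entry into a Map,
  // failing fast with the offending config key in the message when an entry is malformed.
  def parseKeyValuePairs(
      maybeKeyValues: Option[String],
      configKey: String,
      keyValueType: String): Map[String, String] = {
    maybeKeyValues.map { pairs =>
      pairs.split(",").map(_.trim).filter(_.nonEmpty).map { pair =>
        pair.split("=", 2) match {
          case Array(k, v) => (k, v)
          case _ => throw new SparkException(
            s"Malformed $keyValueType entry '$pair' in configuration $configKey")
        }
      }.toMap
    }.getOrElse(Map.empty)
  }
}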
Force-pushed 26805ed to b377fa6
Unit tests seem more complex than expected, because of watchers and such. https://mvnrepository.com/artifact/io.fabric8/kubernetes-server-mock provided an easy beginning, but I think I'll take it up separately instead of blocking experiments using dynamic allocation.
We can probably test the watches separately and just ensure that if the watch receives an event then the scheduler responds accordingly.
The mock server can be taught to expect the watch calls and respond appropriately. I used a similar thing in the unit tests here.
Created #275, will follow up there
Force-pushed b377fa6 to 6ec3d59
Updated docs, any other comments?
There are a few minor style things, but they can be addressed either here or at some other point. Someone else can make a final pass before merging, but if there are no objections before the end of the day then feel free to proceed with the merge.
@@ -105,6 +131,40 @@ private[spark] class KubernetesClusterSchedulerBackend(

private val initialExecutors = getInitialTargetExecutorNumber(1)

private val podAllocationInterval = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
if (podAllocationInterval <= 0) {
We can use require here and in other similar places.
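(A sketch of the require-based check, also folding in the earlier ask to include the user-supplied value in the message.)

private val podAllocationInterval = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
require(podAllocationInterval > 0,
  s"${KUBERNETES_ALLOCATION_BATCH_DELAY.key} should be a positive integer, " +
  s"got $podAllocationInterval")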
val runningExecutorPod = kubernetesClient
  .pods()
  .withName(
    runningExecutorPods(executorId).getMetadata.getName)
Move up to previous line.
import org.apache.spark.internal.Logging

private[spark] class ShufflePodCache (
    val client: KubernetesClient,
These don't have to be vals, and it's preferred that they aren't, since vals are accessible from outside the scope of this class.
Force-pushed 6ec3d59 to 4dd4715
Addressed comments. Will merge after tests pass.
I think this dynamic allocation PR should go in first, then the init containers one afterwards. That way the executor recovery PR can start making progress, given that it's also blocked on this PR merging.
Okay, SG. Merging this now, as tests passed.
* dynamic allocation: shuffle service docker, yaml and test fixture
* dynamic allocation: changes to spark-core
* dynamic allocation: tests
* dynamic allocation: docs
* dynamic allocation: kubernetes allocator and executor accounting
* dynamic allocation: shuffle service, node caching
…ogging Force commons-logging version to avoid conflicts
Dynamic allocation updated
Please see commits individually during review for clarity
cc @mccheah @ash211 @varunkatta @apache-spark-on-k8s/contributors