This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Dynamic allocation #272

Merged 6 commits into branch-2.1-kubernetes from dynamic-allocation on May 17, 2017

Conversation

@foxish (Member) commented May 14, 2017

Dynamic allocation updated
Please review the commits individually for clarity.

cc @mccheah @ash211 @varunkatta @apache-spark-on-k8s/contributors

@foxish foxish force-pushed the dynamic-allocation branch 2 times, most recently from 4f6f75a to 988db3b Compare May 14, 2017 23:04
shufflePodCache.get(executorNode) match {
  case Some(pod) => pod
  case _ =>
    throw new SparkException(s"Unable to find shuffle pod on node $executorNode")

A corner case comes to mind: if a shuffle pod for a node has died and is being restarted, and at that moment a new executor on that node registers with the driver, it would crash the driver.

Can we improve this, e.g., by letting the executor die when the shuffle pod is not ready, instead of throwing a SparkException that aborts the driver?

foxish (Member Author):

Nice catch! I think you're right. Done. Passing back an empty string, which should make the executor crash.


Throwing the exception here should be fine, right? This is being processed on a separate thread in the RPC environment. Thus the exception here should only propagate to the executor that asked for this configuration.


I could be wrong, but IIUC SparkException is thrown whenever unrecoverable errors happen.


In this case it's less important what the type of exception is. What's important is where the exception is thrown from and where it propagates to.
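
(For reference, a minimal Scala sketch of the empty-string fallback discussed above; shufflePodCache, the method name, and the Logging mixin are assumptions based on the surrounding diff rather than the code as merged.)

```scala
// Sketch only: on a cache miss, hand back an empty address so that the
// requesting executor fails on its own instead of an exception surfacing
// in the driver. Assumes shufflePodCache maps node names to pod addresses
// and the enclosing class mixes in org.apache.spark.internal.Logging.
private def shufflePodForExecutor(executorNode: String): String = {
  shufflePodCache.get(executorNode) match {
    case Some(podAddress) => podAddress
    case None =>
      logWarning(s"Unable to find shuffle pod on node $executorNode")
      ""
  }
}
```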

      addShufflePodToCache(p)
    }
  }
  override def onClose(e: KubernetesClientException): Unit = {}

Is a DaemonSet watchable? If so, can we watch it directly instead of using labels?

foxish (Member Author):

It is, but we don't want to assume that a DaemonSet is in use. With the current approach it stays general: any pod co-located on that node with the matching labels is picked up.
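
(To illustrate the label-based approach, here is a rough, self-contained sketch using the fabric8 client; the class and field names are hypothetical, and only the watch-by-labels pattern reflects what the PR does.)

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException, Watcher}

// Hypothetical sketch of label-based shuffle-pod tracking; the merged code may differ.
class LabelBasedShufflePodWatcher(client: KubernetesClient, shuffleLabels: Map[String, String]) {
  // node name -> shuffle pod IP
  private val shufflePodCache = mutable.Map.empty[String, String]

  def start(): Unit = {
    // Seed the cache from pods that already exist with the configured labels.
    client.pods().withLabels(shuffleLabels.asJava).list().getItems.asScala
      .foreach(p => shufflePodCache(p.getSpec.getNodeName) = p.getStatus.getPodIP)

    // Watch by labels rather than by DaemonSet, so any co-located pod carrying
    // the matching labels is picked up.
    client.pods().withLabels(shuffleLabels.asJava).watch(new Watcher[Pod] {
      override def eventReceived(action: Watcher.Action, p: Pod): Unit = action match {
        case Watcher.Action.ADDED | Watcher.Action.MODIFIED =>
          shufflePodCache(p.getSpec.getNodeName) = p.getStatus.getPodIP
        case Watcher.Action.DELETED =>
          shufflePodCache.remove(p.getSpec.getNodeName)
        case _ => // ignore ERROR and other events
      }
      override def onClose(e: KubernetesClientException): Unit = {}
    })
  }
}
```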

"shuffle-labels")
val shuffleDirs = conf.getOption(KUBERNETES_SHUFFLE_DIR.key).map {
_.split(",")
}.getOrElse(Utils.getConfiguredLocalDirs(conf))

Should we throw an exception here if dynamic allocation is enabled but shuffle.labels is empty?

foxish (Member Author):

Done

import org.apache.spark.internal.Logging
import org.apache.spark.util.ThreadUtils

private[spark] class NodeCacheManager (

nit: the name NodeCacheManager is not very intuitive. Maybe something like ShufflePodsCatalog? I don't know.

foxish (Member Author):

Changed to ShufflePodCache

@foxish foxish force-pushed the dynamic-allocation branch 2 times, most recently from fc70821 to bccf43b Compare May 15, 2017 17:13
@mccheah left a comment

Reviewed the top level PR, will follow up on changes in the subsequent individual commits.


private val allocatorRunnable: Runnable = new Runnable {
  override def run(): Unit = {
    if (runningExecutorPods.size - totalRegisteredExecutors.get() > 0) {

Could probably just use < for clarity.

override def run(): Unit = {
  if (runningExecutorPods.size - totalRegisteredExecutors.get() > 0) {
    logDebug("Waiting for pending executors before scaling")
    return

Try to structure the logic so that we don't use return.

if (totalExpectedExecutors.get() <= runningExecutorPods.size) {
  logDebug(
    "Maximum allowed executor limit reached. Not scaling up further.")
  return

Similarly here - avoid return.

* KubernetesAllocator class watches executor registrations, limits
* and creates new executors when it is appropriate.
*/
private[spark] class KubernetesAllocator(client: KubernetesClient)

Does this specifically need to be a separate class? The code could just be inlined in the scheduler backend class.

foxish (Member Author):

I like having a separate class for the isolation it gives this allocator mechanism. Do you strongly prefer not having it?


This ought to be in a separate file then, I think. But if this also requires a backwards reference to the scheduler backend (e.g. to access fields) then this should just be inlined.

foxish (Member Author):

It requires access to totalRegisteredExecutors, which is a protected field in CoarseGrainedSchedulerBackend, and a couple of other accounting fields from KubernetesClusterSchedulerBackend.

foxish (Member Author):

Oh, did you mean that it shouldn't be a separate class?

foxish (Member Author):

Done

def start(): Unit = {
  // seed the initial cache.
  val pods = client.pods().withLabels(dsLabels.asJava).list()
  for (pod <- pods.getItems.asScala) {

I think we prefer foreach over for in general.

foxish (Member Author):

Done

@@ -81,6 +82,68 @@ private[spark] class KubernetesV1Suite(testBackend: IntegrationTestBackend)
})
}

private def expectationsForDynamicAllocation(sparkMetricsService: SparkRestApiV1): Unit = {

Let's avoid adding tests to V1 and solely focus on V2.

foxish (Member Author):

Done

.endContainer()
.endSpec()

var resolvedPodBuilder = shuffleServiceConfig

val not var.

foxish (Member Author):

Done

import org.apache.spark.sql.SparkSession

object GroupByTest {
  def main(args: Array[String]) {

Would be good to confirm that this test is creating multiple executors and is writing files to the shuffle service. I'm not sure if we can do this in an automated way.


We can achieve that by setting --conf spark.dynamicAllocation.minExecutors=2 and waiting for those two executors to become ready via the k8s API (or the Spark REST API?).


The executors can spin up but not write any shuffle data to disk. We should check that shuffle data is being written to the disks.

foxish (Member Author):

I chose this test so that it would have shuffle data being written to disk. I've manually verified that it does write to disk.
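
(Roughly, the test setup being discussed would look like the sketch below. The spark.dynamicAllocation.* and spark.shuffle.service.* keys are standard Spark settings; the spark.kubernetes.shuffle.* key names and values are assumptions based on the config entries visible in this diff.)

```scala
import org.apache.spark.SparkConf

// Sketch of the configuration the integration test would run GroupByTest with.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2") // ensure at least two executors come up
  .set("spark.shuffle.service.enabled", "true")
  // Assumed key names for the PR's KUBERNETES_SHUFFLE_LABELS / KUBERNETES_SHUFFLE_DIR entries:
  .set("spark.kubernetes.shuffle.labels", "app=spark-shuffle-service")
  .set("spark.kubernetes.shuffle.dir", "/tmp/spark-shuffle")
```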

}

def start(): Unit = {
  if (interval > 0) {

Just make the interval an Option in the configuration and use interval.foreach, as opposed to checking whether an integer is greater than zero. We should then validate that any given value is positive and throw an exception if it isn't.

s"but no ${KUBERNETES_SHUFFLE_LABELS.key} specified")
}

val shuffleDirs = conf.getOption(KUBERNETES_SHUFFLE_DIR.key).map {

Can we use .get instead of .getOption here?

foxish (Member Author), May 16, 2017:

I was trying to get an Option[String] because, if the shuffle directory is left empty, we use the default from Utils.getConfiguredLocalDirs(conf). I'm not sure how we can get that behavior using get.


Using .get here without using .key on the configuration key should give back an Option.

foxish (Member Author):

Ah, good point. Done.
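
(In other words, assuming KUBERNETES_SHUFFLE_DIR is declared as an OptionalConfigEntry[String], something like this sketch works without going through getOption and .key.)

```scala
// conf.get on an OptionalConfigEntry already yields an Option[String],
// so the default can be supplied with getOrElse.
val shuffleDirs = conf.get(KUBERNETES_SHUFFLE_DIR)
  .map(_.split(","))
  .getOrElse(Utils.getConfiguredLocalDirs(conf))
```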

@foxish foxish force-pushed the dynamic-allocation branch 8 times, most recently from 3ac475f to 5264fad Compare May 16, 2017 06:54
@foxish (Member Author) commented May 16, 2017

@mccheah Addressed all comments. PTAL

@foxish foxish force-pushed the dynamic-allocation branch from 5264fad to 4400a8c Compare May 16, 2017 06:58
@foxish (Member Author) commented May 16, 2017

Has the unit testing changed? I'm seeing failures in files I did not touch at all.

@ash211 commented May 16, 2017

rerun unit tests please

@ash211 commented May 16, 2017

@foxish I've been working on an SBT-based unit test build in Jenkins, and it looks like it was racing with the current Maven-based unit tests. I've disabled the new test build and expect just the old one to be running now.

Sorry about that!

@foxish (Member Author) commented May 16, 2017

@ash211, we can fix the new one. The errors appeared to be:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/PR-spark-k8s-unit-tests-SBT-TESTING/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/kubernetes/integrationtest/KubernetesTestComponents.scala: File must end with newline character
[error] /home/jenkins/workspace/PR-spark-k8s-unit-tests-SBT-TESTING/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/kubernetes/integrationtest/ProcessUtils.scala:28:0: Use Javadoc style indentation for multiline comments
[error] /home/jenkins/workspace/PR-spark-k8s-unit-tests-SBT-TESTING/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/kubernetes/integrationtest/constants.scala: File must end with newline character
[error] (kubernetes-integration-tests/test:scalastyle) errors exist
[error] Total time: 13 s, completed May 15, 2017 9:02:13 PM
[error] running /home/jenkins/workspace/PR-spark-k8s-unit-tests-SBT-TESTING/dev/lint-scala ; received return code 1

Sending a PR to fix these.

@ash211 commented May 16, 2017

@foxish I've got a commit that does that already in another PR: 6c84023

@foxish (Member Author) commented May 16, 2017

Ah! Okay, SG. Thanks!

@foxish (Member Author) commented May 16, 2017

Tests passed. Are there any more comments @mccheah, @ash211, @lins05? I'd like to get this merged before Wednesday because it blocks Varun's recovery behavior PR.

@foxish foxish force-pushed the dynamic-allocation branch 2 times, most recently from 2f05ac0 to a861849 Compare May 16, 2017 16:09
@foxish foxish force-pushed the dynamic-allocation branch from a861849 to c87008d Compare May 16, 2017 16:17
@mccheah left a comment

Would it be possible to add unit-level tests around these? It would be great if we could start hardening the features we are implementing here. Unit testing these kinds of things can be difficult; we would probably have to refactor much of the scheduler backend and the shuffle pod cache to be able to verify the things that are important.

  Some(conf.get(KUBERNETES_ALLOCATION_BATCH_SIZE))
} else {
  throw new SparkException(s"Allocation batch size ${KUBERNETES_ALLOCATION_BATCH_SIZE} " +
    s"should be a positive integer")

Include the value the user actually specified in the message.

@@ -130,12 +197,27 @@ private[spark] class KubernetesClusterSchedulerBackend(
super.start()
executorWatchResource.set(kubernetesClient.pods().withLabel(SPARK_APP_ID_LABEL, applicationId())
  .watch(new ExecutorPodsWatcher()))

podAllocationInterval.foreach(allocator.scheduleWithFixedDelay(allocatorRunnable,
  0,

Put allocatorRunnable, 0, TimeUnit.SECONDS all on this line.


I think that since podAllocationInterval is now always going to be provided (we always set it to Some(...) or throw an exception), this thread will always be running. Is that the intended behavior? If so, there's no need to use foreach and Options here.

@foxish foxish force-pushed the dynamic-allocation branch from c87008d to 2b5bba0 Compare May 16, 2017 22:02
private val allocatorRunnable: Runnable = new Runnable {
  override def run(): Unit = {
    if (totalRegisteredExecutors.get() >= runningExecutorPods.size) {
      if (totalExpectedExecutors.get() > runningExecutorPods.size) {

It would be cleaner, I think, to use if ... else if ... else here:

if (...) {
  logDebug("Maximum allowed executor limit...")
} else if (...) {
  logDebug("Waiting for pending...")
} else {
  // Actual logic
}

foxish (Member Author):

Done. Thanks! Trying to add a couple of unit tests to ShufflePodCache now and a mechanism that might help us add tests easily in the future.
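
(Sketch of the restructured loop, with field and method names taken from the surrounding diff; the batch-allocation body is elided and the merged code may differ in detail.)

```scala
// Assumes the enclosing scheduler backend mixes in Logging and defines the
// accounting fields referenced below.
private val allocatorRunnable: Runnable = new Runnable {
  override def run(): Unit = {
    if (totalExpectedExecutors.get() <= runningExecutorPods.size) {
      logDebug("Maximum allowed executor limit reached. Not scaling up further.")
    } else if (totalRegisteredExecutors.get() < runningExecutorPods.size) {
      logDebug("Waiting for pending executors before scaling")
    } else {
      // Request up to podAllocationSize additional executor pods here.
    }
  }
}
```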

@foxish foxish force-pushed the dynamic-allocation branch from 2b5bba0 to 26805ed Compare May 16, 2017 22:29
@@ -105,6 +131,44 @@ private[spark] class KubernetesClusterSchedulerBackend(

private val initialExecutors = getInitialTargetExecutorNumber(1)

private val podAllocationInterval =
  if (conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY) > 0) {

There's probably no need to make this an Option. Just assign podAllocationInterval directly, then check the variable and throw the SparkException immediately afterwards.

foxish (Member Author):

Done


import org.apache.spark.SparkException

object CommandLineUtils {

This should be KeyValueUtils? This doesn't seem related to the command line.

foxish (Member Author):

I envisioned it as a place for utility functions related to command-line options, of which we could have more in the future. The key-value parsing is necessitated by command-line strings being supplied.

foxish (Member Author):

Is it confusing?


This seems to mainly be used to parse out labels and annotations from SparkConf values - the command line doesn't seem to be related to that.

foxish (Member Author):

I see, I was assuming the primary way of supplying those args was via the command line. Okay, how about ConfigurationUtils? KeyValueUtils.parseKeyValuePairs() just seems a bit redundant.


ConfigurationUtils is fine.
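
(A hypothetical sketch of what the renamed helper might look like; the method name parseKeyValuePairs comes from the discussion above, while the exact signature and error message are assumptions.)

```scala
import org.apache.spark.SparkException

// Sketch only; the helper as merged may differ.
object ConfigurationUtils {
  // Parses a comma-separated "key=value" string (typically read from a SparkConf entry)
  // into a Map, e.g. "app=shuffle,env=test" -> Map("app" -> "shuffle", "env" -> "test").
  def parseKeyValuePairs(
      maybeKeyValues: Option[String],
      configKey: String,
      keyValueType: String): Map[String, String] = {
    maybeKeyValues.map { keyValues =>
      keyValues.split(",").filter(_.nonEmpty).map { pair =>
        pair.split("=", 2) match {
          case Array(k, v) => (k.trim, v.trim)
          case _ =>
            throw new SparkException(s"Illegal $keyValueType specified in $configKey: $pair")
        }
      }.toMap
    }.getOrElse(Map.empty[String, String])
  }
}
```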

@foxish foxish force-pushed the dynamic-allocation branch from 26805ed to b377fa6 Compare May 16, 2017 23:13
@foxish (Member Author) commented May 16, 2017

Unit tests seem more complex than expected because of the watchers and such. https://mvnrepository.com/artifact/io.fabric8/kubernetes-server-mock provided an easy beginning, but I think I'll take that up separately instead of blocking experiments that use dynamic allocation.

@mccheah commented May 16, 2017

We can probably test the watches separately and just ensure that if the watch receives an event then the scheduler responds accordingly.

@foxish (Member Author) commented May 16, 2017

The mock server can be taught to expect the watch calls and respond appropriately. I used a similar thing in the unit tests here.

@foxish (Member Author) commented May 16, 2017

Created #275, will follow up there

@foxish foxish force-pushed the dynamic-allocation branch from b377fa6 to 6ec3d59 Compare May 16, 2017 23:59
@foxish (Member Author) commented May 17, 2017

Updated docs, any other comments?

@mccheah left a comment

There are a few minor style things, but they can be addressed either here or at some other point. Someone else can make a final pass before merging, but if there are no objections before the end of the day, feel free to proceed with the merge.

@@ -105,6 +131,40 @@ private[spark] class KubernetesClusterSchedulerBackend(

private val initialExecutors = getInitialTargetExecutorNumber(1)

private val podAllocationInterval = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
if (podAllocationInterval <= 0) {

We can use require here and in other similar places.
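
(That is, something along these lines; the config entry name is taken from the diff above, and the message wording is illustrative.)

```scala
private val podAllocationInterval = conf.get(KUBERNETES_ALLOCATION_BATCH_DELAY)
require(podAllocationInterval > 0,
  s"Allocation batch delay ${KUBERNETES_ALLOCATION_BATCH_DELAY} " +
    s"should be a positive integer, got $podAllocationInterval")
```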

val runningExecutorPod = kubernetesClient
  .pods()
  .withName(
    runningExecutorPods(executorId).getMetadata.getName)

Move up to previous line.

import org.apache.spark.internal.Logging

private[spark] class ShufflePodCache (
    val client: KubernetesClient,

These don't have to be vals, and it's preferred that they aren't, since declaring them as vals makes them accessible from outside the class.
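
(Concretely: dropping val keeps the constructor parameters private to the class instead of exposing them as public accessors. The parameter list below is illustrative; only client and dsLabels appear in the diff.)

```scala
// Plain (non-val) constructor parameters are not visible outside the class.
private[spark] class ShufflePodCache(
    client: KubernetesClient,
    dsLabels: Map[String, String]) {
  // ...
}
```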

@foxish foxish force-pushed the dynamic-allocation branch from 6ec3d59 to 4dd4715 Compare May 17, 2017 16:07
@foxish (Member Author) commented May 17, 2017

Addressed comments. Will merge after tests pass.

@foxish (Member Author) commented May 17, 2017

@ash211, do you want to merge #251 before this?

@ash211 commented May 17, 2017

I think this dynamic allocation PR should go first, then the init containers one afterwards. That way the executor recovery PR can start making progress, given that it's also blocked on this PR merging.

@foxish (Member Author) commented May 17, 2017

Okay, SG. Merging this now, as tests passed.

@foxish foxish merged commit e9da549 into branch-2.1-kubernetes May 17, 2017
@foxish foxish deleted the dynamic-allocation branch May 17, 2017 16:44
foxish added a commit that referenced this pull request Jul 24, 2017
* dynamic allocation: shuffle service docker, yaml and test fixture

* dynamic allocation: changes to spark-core

* dynamic allocation: tests

* dynamic allocation: docs

* dynamic allocation: kubernetes allocator and executor accounting

* dynamic allocation: shuffle service, node caching
ozzieba pushed a commit to ozzieba/spark that referenced this pull request Jun 27, 2018
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…ogging

Force commons-logging version to avoid conflicts
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
* dynamic allocation: shuffle service docker, yaml and test fixture

* dynamic allocation: changes to spark-core

* dynamic allocation: tests

* dynamic allocation: docs

* dynamic allocation: kubernetes allocator and executor accounting

* dynamic allocation: shuffle service, node caching