Consider switching driver to be StatefulSet of size 1 instead of pod #288
Comments
+1. I can't think of any downside. The executor command lines will have to use the DNS name of the driver coming from the headless service.
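For concreteness, the stable DNS name would come from a headless service governing the driver. A minimal sketch, assuming hypothetical names (`spark-driver-svc`, the `app: spark-driver` label, and port 7077 for driver RPC):

```yaml
# Hypothetical headless service for the driver.
# clusterIP: None makes the service "headless": DNS resolves
# straight to the pod, giving it a stable network identity.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc
spec:
  clusterIP: None
  selector:
    app: spark-driver
  ports:
    - name: driver-rpc
      port: 7077
```

A driver pod named `spark-driver-0` governed by this service would be reachable at `spark-driver-0.spark-driver-svc.<namespace>.svc.cluster.local`, which is what the executor command lines would use instead of a pod IP.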
+1 for a more resilient driver, and for taking support from the framework for relaunches during node failures.
Can we use a ReplicationController or ReplicaSet? If all we want is pod restarts, we should use the primitive that provides only that, as opposed to StatefulSets, which seem to have more things attached.
If it's purely restarts, the pod can have a restart policy. Can Spark leverage StatefulSet hooks to restore more state?
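For comparison, a restart policy is just one field on the pod spec; a minimal sketch with hypothetical names:

```yaml
# Hypothetical bare driver pod relying only on kubelet-level restarts.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
  labels:
    app: spark-driver
spec:
  restartPolicy: OnFailure   # the kubelet restarts the container in place
  containers:
    - name: driver
      image: example.com/spark-driver:latest   # placeholder image
```

As noted below, this only covers restarts on the same node; nothing reschedules the pod if the node itself is lost.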
We could probably store state about which tasks have finished and resume a job mid-run rather than restarting from the beginning, but that's a much bigger change. I'd lean towards making the smallest change that supports the driver pod restart, and maybe that's to use a …
If the driver wants to store its state in a persistent volume so it can be restored upon restart, then a StatefulSet would be the natural fit.
Thinking about this from an API point of view, though, it would be difficult for spark-submit to support persistent volumes for the driver pod. I think stateful driver support is something that requires a decent amount of specification and design discussion in and of itself. We should focus on the minimum set of requirements, which should be to allow the driver pod to restart upon failure, and tackle the more sophisticated use cases as they come up and/or after we've given them a careful specification and API.
I agree that this warrants a design doc. Switching from a pod to a StatefulSet or ReplicaSet might be straightforward in the code, but making use of it is less obvious.
StatefulSet has one important property: it guarantees at most N replicas, where N is the specified size. The other abstractions don't have this particular property, and having more than one driver pod at a time would be undesirable.
The pod restart policy is something that the node (the kubelet, specifically) honors, but if a node is lost, there is no entity that will restart that pod. To restart it on a different node, a higher-level controller is necessary.
@foxish what are the cases where a ReplicaSet or ReplicationController would have more than N replicas? I know rolling update is one of them, but in that case it's probably expected to have multiple drivers.
The case is detailed in this doc. One example would be a network partition: an RC/RS would favor availability and create copies on a new node without verifying whether the previously running instance is dead, or simply partitioned and still alive.
While StatefulSet's at-most-N guarantee seems desirable, I would be hesitant to allow any Spark job the ability to claim a persistent volume, as it would introduce a number of additional failure modes when launching new jobs (the availability of PVs, the ability to allocate a new one). Additionally, it would prevent a Kubernetes administrator from restricting the use of PVs by ephemeral, short-lived jobs such as Spark.
I like the idea of improving robustness. I do want to be careful about setting user expectations around Spark driver state reconstruction.
PVs and PVCs aren't mandatory parts of a StatefulSet, nor is the headless service (stable DNS identity). Creating a StatefulSet without a volume claim template is perfectly reasonable if all we want is the restart semantics.
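To make that concrete, here is a minimal sketch of a size-1 StatefulSet with no `volumeClaimTemplates`, so no PV/PVC machinery is involved; all names are hypothetical:

```yaml
# Hypothetical size-1 StatefulSet for the driver: at-most-one-replica
# semantics and rescheduling on node loss, with no persistent volumes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spark-driver
spec:
  serviceName: spark-driver-svc   # the governing headless service
  replicas: 1
  selector:
    matchLabels:
      app: spark-driver
  template:
    metadata:
      labels:
        app: spark-driver
    spec:
      containers:
        - name: driver
          image: example.com/spark-driver:latest   # placeholder image
```

Omitting `volumeClaimTemplates` sidesteps the PV-related failure modes raised above while keeping the at-most-one guarantee.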
If we want to deliberate more and talk through the design, we can revisit this after Spark Summit, when we will potentially (hopefully) have a streaming use case with more of a need for these recovery semantics. It is an implementation detail, so it shouldn't be particularly hard to change later. I just brought the issue up for discussion at this time.
IIUC, the current Spark driver doesn't support state recovery when the driver process crashes. See my comments in apache#17750 (comment).
@lins05 I see. It seems, however, that the Spark Streaming case does have a way to do driver checkpointing using a WAL. https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html talks about this. Would you happen to know if this is currently supported on other cluster managers (YARN/Mesos)?
Note that for both Spark Streaming and Structured Streaming, the checkpointing directory should be on storage that is shared among all executors and the driver. Simply starting the driver as a StatefulSet is not going to cut it. On AWS, the most common options for checkpointing are S3 and HDFS.
IMO, you shouldn't do anything special to support streaming. Streaming applications should deploy an HDFS cluster on Kubernetes and use it for checkpointing.
This would give us resilience against node failure, and the automatic driver-restart behavior that long-running streaming jobs often want.
It should be a simple switch-over if we choose to do this, but it does need some testing.
/cc @apache-spark-on-k8s/contributors