This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Consider switching driver to be StatefulSet of size 1 instead of pod #288

Open
foxish opened this issue May 22, 2017 · 19 comments

@foxish
Member

foxish commented May 22, 2017

This would give us resilience against node failure, and the kind of automatic driver-restart behavior that long-running streaming jobs often desire.
It should be a simple switch-over if we choose to do this, but it does need some testing.

/cc @apache-spark-on-k8s/contributors
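For concreteness, a minimal sketch of what the driver object might look like as a StatefulSet of size 1. The names, labels, image, and apiVersion here are illustrative placeholders, not what spark-submit actually generates:

```yaml
# Sketch only: apps/v1beta1 was the StatefulSet API version current at the time;
# newer clusters use apps/v1.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: spark-driver
spec:
  serviceName: spark-driver-svc   # headless service giving the pod a stable DNS identity
  replicas: 1                     # at most one driver pod at any time
  selector:
    matchLabels:
      spark-role: driver
  template:
    metadata:
      labels:
        spark-role: driver
    spec:
      containers:
      - name: spark-driver
        image: spark-driver:latest   # placeholder driver image
        # driver command line / env would be set here, as spark-submit does today
```

The single replica would be named spark-driver-0 and would be recreated by the controller after its node is confirmed gone. Nothing in this sketch uses persistent volumes; volumeClaimTemplates comes up further down the thread.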

@kimoonkim
Member

+1. I can't think of any downside. The executor command lines will have to use the DNS name of the driver coming from the StatefulSet. That should be doable.
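As an illustration of that, a sketch of the headless Service the StatefulSet would reference (names and port are placeholders); the driver's hostname then follows the standard StatefulSet DNS pattern:

```yaml
# Sketch only: a headless service (clusterIP: None) creates per-pod DNS records
# instead of a virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc
spec:
  clusterIP: None
  selector:
    spark-role: driver
  ports:
  - name: driver-rpc
    port: 7077   # placeholder; whatever port the driver is configured to listen on
# Executors would then reach the driver at a stable name such as:
#   spark-driver-0.spark-driver-svc.<namespace>.svc.cluster.local
```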

@varunkatta
Member

+1 for a more resilient driver and for leaning on the framework for relaunches during node failures.

@mccheah
Copy link

mccheah commented May 22, 2017

Can we use a ReplicationController or ReplicaSet? If all we want is pod restarts we should use the primitive that has only that, as opposed to StatefulSets which seem to have more things attached.
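For comparison, a ReplicaSet of size 1 would look roughly like the following sketch (placeholder names; the apiVersion varies by cluster version). It reschedules the driver pod onto another node, but provides no stable pod name or per-pod DNS record:

```yaml
# Sketch only: the simpler primitive suggested above.
apiVersion: extensions/v1beta1   # apps/v1 on newer clusters
kind: ReplicaSet
metadata:
  name: spark-driver
spec:
  replicas: 1
  selector:
    matchLabels:
      spark-role: driver
  template:
    metadata:
      labels:
        spark-role: driver
    spec:
      containers:
      - name: spark-driver
        image: spark-driver:latest   # placeholder
```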

@erikerlandson
Member

If it's purely restarts, the pod can have a restart policy. Can Spark leverage StatefulSet hooks to restore more state?
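That option would be just a field on the existing driver pod spec, for example (sketch, placeholder names):

```yaml
# Sketch only: restartPolicy is enforced by the kubelet on the node the pod is
# already bound to, so it covers container crashes but not node loss.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
spec:
  restartPolicy: OnFailure   # or Always
  containers:
  - name: spark-driver
    image: spark-driver:latest   # placeholder
```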

@ash211

ash211 commented May 22, 2017

We probably could store state about what tasks have finished vs not and resume a job mid-run, rather than restarting from the beginning, but that's a much bigger change.

I'd lean towards making the smallest change that supports driver pod restart, and maybe that's to use a ReplicaSet.

@kimoonkim
Member

If the driver wants to store its state in a persistent volume so it can be restored upon restart, then StatefulSet is probably the right mechanism.
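If that route were taken, the StatefulSet sketch above could be extended with a volume claim template along these lines (size, storage class, and mount path are placeholders):

```yaml
# Sketch only: added under the StatefulSet spec; each replica gets a PVC/PV that
# survives pod restarts and re-attaches wherever the pod is rescheduled.
spec:
  volumeClaimTemplates:
  - metadata:
      name: driver-state
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi   # placeholder size
  # ...and the driver container would add a matching volumeMount, e.g.
  #   volumeMounts:
  #   - name: driver-state
  #     mountPath: /var/spark/driver-state   # placeholder path
```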

@mccheah

mccheah commented May 22, 2017

Thinking about this from an API point of view, though, it would be difficult for spark-submit to support attaching persistent volumes to the driver pod. I think stateful driver support is something that requires a decent amount of specification and design discussion in and of itself. We should focus on the minimum set of requirements, which is to allow the driver pod to restart upon failure, and tackle the more sophisticated use cases as they come up and/or after we've given them a careful specification and API.

@erikerlandson
Member

I agree that this warrants a design doc. Switching from pod to StatefulSet or ReplicaSet might be straightforward in the code, but making use of it is less obvious.

@foxish
Member Author

foxish commented May 22, 2017

Can we use a ReplicationController or ReplicaSet? If all we want is pod restarts we should use the primitive that has only that, as opposed to StatefulSets which seem to have more things attached.

StatefulSet has one important property: it guarantees at most N replicas, where N is the specified size. The other abstractions don't have this particular property, and having more than one driver pod at a time would be undesirable.

If it's purely restarts, the pod can have a restart policy. Can spark leverage stateful-set hooks to restore more state?

The pod restart policy is something that the node honors (the kubelet, specifically), but if a node is lost, there is no entity that will restart that pod. In order to restart it on a different node, a higher-level controller is necessary.

@mccheah

mccheah commented May 22, 2017

@foxish what are the cases where ReplicaSet and ReplicationController would have more than N replicas? I know rolling update is one of them but in that case it's probably expected to have multiple drivers.

@foxish
Member Author

foxish commented May 22, 2017

The cases are detailed in this doc. One example is a network partition: an RC/RS favors availability and creates replacement copies on a new node without really verifying whether the previously running instance is dead or simply partitioned and still alive.

@gdearment

While StatefulSet's guarantee of at most N replicas seems desirable, I would be hesitant to allow any Spark job to claim a persistent volume, as that would introduce a number of additional failure modes when launching new jobs (the availability of PVs, the ability to allocate a new one). Additionally, it would make it harder for a Kubernetes administrator to restrict the use of PVs for ephemeral, short-lived jobs such as Spark.

@erikerlandson
Member

I like the idea of improving robustness. I do want to be careful about setting user expectations around Spark driver state reconstruction.

@foxish
Member Author

foxish commented May 22, 2017

PVs and PVCs aren't mandatory parts of a StatefulSet, nor is the headless service (stable DNS identity). Creating a StatefulSet without a spec.volumeClaimTemplates field should not create any PVs/PVCs on the user's behalf.

@foxish
Member Author

foxish commented May 22, 2017

If we want to deliberate more and talk through the design, we can revisit this after Spark Summit, when we potentially (hopefully) have a streaming use-case with more of a need for these recovery semantics. It is an implementation detail, so it shouldn't be particularly hard to change later. I just brought the issue up for discussion at this time.

@lins05

lins05 commented May 22, 2017

IIUC the current Spark driver doesn't support state recovery when the driver process crashes. See my comments in apache#17750 (comment).

For convenience I'll quote it here:

The Spark driver contains lots of stateful information: the job/stage/task info, executor info, and the catalog that holds temporary views, to name a few. All of those are kept in the driver's memory and would be lost whenever the driver crashes.

@foxish
Member Author

foxish commented May 23, 2017

@lins05 I see. It seems, however, that the Spark Streaming case does have a way to do driver checkpointing using a WAL. https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html talks about this. Would you happen to know whether this is currently supported on other cluster managers (YARN/Mesos)?

@lins05

lins05 commented May 23, 2017

@foxish Ah, I was not aware of that. So Spark Streaming does support recovering most of the information from checkpointed data. I guess I need to read the code and the docs to learn more about the design.

@GaalDornick

Note that for both Spark Streaming and Structured Streaming, the checkpoint directory should be on storage that is shared among all executors and the driver. Simply starting the driver as a StatefulSet is not going to cut it. On AWS, the most common options for checkpointing are:

  1. HDFS
  2. EFS
  3. S3

IMO, you shouldn't do anything special to support streaming. Streaming applications should deploy an HDFS cluster on Kubernetes and use it to do checkpointing.
