Consider switching driver to be StatefulSet of size 1 instead of pod #288
Comments
+1. I can't think of any downside. The executor command lines will have to use the DNS name of the driver coming from the headless service.
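For concreteness, the stable DNS name would come from a headless service governing the driver. A minimal sketch, assuming hypothetical names (`spark-driver-svc`, the `app: spark-driver` label, and port 7077 for driver RPC):

```yaml
# Hypothetical headless service for the driver.
# clusterIP: None makes the service "headless": DNS resolves
# straight to the pod, giving it a stable network identity.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc
spec:
  clusterIP: None
  selector:
    app: spark-driver
  ports:
    - name: driver-rpc
      port: 7077
```

A driver pod named `spark-driver-0` governed by this service would be reachable at `spark-driver-0.spark-driver-svc.<namespace>.svc.cluster.local`, which is what the executor command lines would use instead of a pod IP.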
+1 for a more resilient driver, and for taking support from the framework for relaunches during node failures.
Can we use a ReplicationController or ReplicaSet? If all we want is pod restarts, we should use the primitive that provides only that, as opposed to StatefulSets, which seem to have more things attached.
If it's purely restarts, the pod can have a restart policy. Can Spark leverage StatefulSet hooks to restore more state?
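For comparison, a restart policy is just one field on the pod spec; a minimal sketch with hypothetical names:

```yaml
# Hypothetical bare driver pod relying only on kubelet-level restarts.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
  labels:
    app: spark-driver
spec:
  restartPolicy: OnFailure   # the kubelet restarts the container in place
  containers:
    - name: driver
      image: example.com/spark-driver:latest   # placeholder image
```

As noted below, this only covers restarts on the same node; nothing reschedules the pod if the node itself is lost.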
We could probably store state about which tasks have finished and resume a job mid-run rather than restarting from the beginning, but that's a much bigger change. I'd lean towards making the smallest change that supports the driver pod restart, and maybe that's to use a …
If the driver wants to store its state in a persistent volume so it can be restored upon restart, then a StatefulSet would be the natural fit.
Thinking about this from an API point of view, though, it would be difficult for spark-submit to support persistent volumes for the driver pod. I think stateful driver support is something that requires a decent amount of specification and design discussion in and of itself. We should focus on the minimum set of requirements, which should be to allow the driver pod to restart upon failure, and tackle the more sophisticated use cases as they come up and/or after we've given them a careful specification and API.
I agree that this warrants a design doc. Switching from a pod to a StatefulSet or ReplicaSet might be straightforward in the code, but making use of it is less obvious.
StatefulSet has one important property: it guarantees at most N replicas, where N is the specified size. The other abstractions don't have this particular property, and having more than one driver pod at a time would be undesirable.
The pod restart policy is something that the node (the kubelet, specifically) honors, but if a node is lost, there is no entity that will restart that pod. To restart it on a different node, a higher-level controller is necessary.
@foxish what are the cases where a ReplicaSet or ReplicationController would have more than N replicas? I know rolling update is one of them, but in that case it's probably expected to have multiple drivers.
The case is detailed in this doc. One example would be a network partition: an RC/RS would favor availability and create copies on a new node without verifying whether the previously running instance is dead, or simply partitioned and still alive.
While StatefulSet's at-most-N guarantee seems desirable, I would be hesitant to allow any Spark job the ability to claim a persistent volume, as it would introduce a number of additional failure modes when launching new jobs (the availability of PVs, the ability to allocate a new one). Additionally, it would prevent a Kubernetes administrator from restricting the use of PVs by ephemeral, short-lived jobs such as Spark.
I like the idea of improving robustness. I do want to be careful about setting user expectations around Spark driver state reconstruction.
PVs and PVCs aren't mandatory parts of a StatefulSet, nor is the headless service (stable DNS identity). Creating a StatefulSet without a volume claim template is perfectly reasonable if all we want is the restart semantics.
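To make that concrete, here is a minimal sketch of a size-1 StatefulSet with no `volumeClaimTemplates`, so no PV/PVC machinery is involved; all names are hypothetical:

```yaml
# Hypothetical size-1 StatefulSet for the driver: at-most-one-replica
# semantics and rescheduling on node loss, with no persistent volumes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spark-driver
spec:
  serviceName: spark-driver-svc   # the governing headless service
  replicas: 1
  selector:
    matchLabels:
      app: spark-driver
  template:
    metadata:
      labels:
        app: spark-driver
    spec:
      containers:
        - name: driver
          image: example.com/spark-driver:latest   # placeholder image
```

Omitting `volumeClaimTemplates` sidesteps the PV-related failure modes raised above while keeping the at-most-one guarantee.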
If we want to deliberate more and talk through the design, we can revisit this after Spark Summit, when we will potentially (hopefully) have a streaming use case with more of a need for these recovery semantics. It is an implementation detail, so it shouldn't be particularly hard to change later. I just brought the issue up for discussion at this time.
IIUC, the current Spark driver doesn't support state recovery when the driver process crashes. See my comments in apache#17750 (comment).
@lins05 I see. It seems, however, that the Spark Streaming case does have a way to do driver checkpointing using a WAL. https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html talks about this. Would you happen to know if this is currently supported on other cluster managers (YARN/Mesos)?
Note that for both Spark Streaming and Structured Streaming, the checkpointing directory should be on storage that is shared among all executors and the driver. Simply starting the driver as a StatefulSet is not going to cut it. On AWS, the most common options for checkpointing are S3 and HDFS.
IMO, you shouldn't do anything special to support streaming. Streaming applications should deploy an HDFS cluster on Kubernetes and use it for checkpointing.
This would give us resilience against node failure, and the automatic driver-restart behavior that long-running streaming jobs often want.
It should be a simple switch-over if we choose to do this, but it does need some testing.
/cc @apache-spark-on-k8s/contributors