📖 Update spot instances proposal with termination handler design #3528
@@ -10,7 +10,7 @@ reviewers:
  - "@CecileRobertMichon"
  - "@randomvariable"
creation-date: 2020-03-30
-last-updated: 2020-03-30
+last-updated: 2020-06-24
status: provisional
see-also:
replaces:
@@ -43,8 +43,11 @@ superseded-by:
  * [Azure](#azure)
    * [Launching Instances](#launching-instances-2)
    * [Deallocation](#deallocation)
+* [Termination handler](#termination-handler)
+  * [Termination handler design](#termination-handler-design)
+  * [Running the termination handler](#running-the-termination-handler)
+  * [Termination handler security](#termination-handler-security)
* [Future Work](#future-work)
-  * [Termination handler](#termination-handler)
+  * [Support for MachinePools](#support-for-machinepools)
* [Risks and Mitigations](#risks-and-mitigations)
  * [Control-Plane instances](#control-plane-instances)
@@ -250,9 +253,6 @@ The Node will transition to an unready state which would be detected by a MachineHealthCheck,
though there may be some delay depending on the configuration of the MachineHealthCheck.
In the future, a termination handler could trigger the Machine to be deleted sooner.

### 'Interruptible' label

In order to deploy the termination handler, we'll need to create a DaemonSet that runs it on each spot instance node.
@@ -285,8 +285,7 @@ if !interruptible {
```

-### Future Work
-
-#### Termination handler
+### Termination handler

To enable graceful termination of workloads running on non-guaranteed instances,
a DaemonSet will need to be deployed to watch for termination notices and gracefully move workloads.
@@ -297,6 +296,70 @@ This would be preferable as a DaemonSet would not be required on workload clusters.
Since this is not essential for running on non-guaranteed instances and existing solutions exist for each provider,
users can deploy these existing solutions until CAPI has capacity to implement a solution.
#### Termination handler design

To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.

The termination pod will be developed to poll the metadata service for the Node
that it is running on.
We will implement request/response handlers for each infrastructure provider that supports non-guaranteed instances,
enabling the handler to determine whether the cloud provider instance is due
for termination.

> **Review comment:** I'm not 100% sure this would work for Azure. At the bottom of this proposal we wrote:
> cc @awesomenix

> **Reply:** We are planning to integrate https://github.com/awesomenix/drainsafe to automatically cordon and drain any terminating instances for MachinePools. We can probably extend it to execute additional termination handlers. Just a reminder that you get at most 30 seconds for a Spot VM on Azure; I am not sure how much you can do in that time.

> **Reply:** For termination handlers, it should be sufficient to use the node draining logic that we already have implemented in CAPI: https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller.go#L276

> **Reply:** We have a POC for AWS which marks the node with a condition as described in this proposal; the Azure one can be pretty much identical to this, just substituting the logic for how to read the response and adding the required header to the request. While I agree 30 seconds isn't very long to actually do much in the drain, I think the benefit would come from replacing nodes quicker. Without the contents of this proposal, the instance gets terminated; if an MHC is present, it will remove the Machine after some time of being unhealthy, and then the MachineSet replaces the node.
|
||
The code for all providers will be mostly similar, the only difference is logic for polling termination endpoint and processing the response. | ||
This means that cloud provider type can be passed as an argument and termination handler can be common for all providers and placed in a separate repository. | ||
Should the instance be marked for termination, the handler will add a condition
to the Node object that it is running on to signal that the instance will be
terminating imminently.
||
```yaml
- type: Terminating
  status: "True"
  reason: TerminationRequested
  message: The cloud provider has marked this instance for termination
```
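A minimal sketch of applying that condition idempotently. The struct below mirrors the relevant fields of `corev1.NodeCondition` so the example is self-contained; a real handler would use `k8s.io/api/core/v1` and patch the Node via client-go, authenticated as the kubelet:

```go
package main

import (
	"fmt"
	"time"
)

// NodeCondition mirrors the fields of corev1.NodeCondition that the
// handler needs (illustrative stand-in, not the real API type).
type NodeCondition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setTerminatingCondition updates an existing Terminating condition in
// place, or appends one, so repeated polls after the notice stay idempotent.
func setTerminatingCondition(conds []NodeCondition) []NodeCondition {
	want := NodeCondition{
		Type:               "Terminating",
		Status:             "True",
		Reason:             "TerminationRequested",
		Message:            "The cloud provider has marked this instance for termination",
		LastTransitionTime: time.Now(),
	}
	for i, c := range conds {
		if c.Type == want.Type {
			if c.Status == want.Status {
				return conds // already set; keep the original transition time
			}
			conds[i] = want
			return conds
		}
	}
	return append(conds, want)
}

func main() {
	conds := setTerminatingCondition(nil)
	fmt.Println(conds[0].Type, conds[0].Reason) // prints "Terminating TerminationRequested"
}
```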
Once the Node has been marked with the `Terminating` condition, it will be
the MachineHealthCheck controller's responsibility to ensure that the Machine
is deleted, triggering it to be drained and removed from the cluster.

> **Review comment:** Wouldn't the MHC cause a new Machine to be created, since the idea is to replace an unhealthy machine? Is that what we want when a spot VM is terminated for lack of capacity?

> **Reply:** I think yes, we need to delete the machine and leverage node draining here.

> **Reply:** Given the Machine is going to be… I guess when you're running a set up like this you kind of need the cluster autoscaler to notice that there's no capacity and scale up elsewhere, or maybe have the MachineSet bring up a node in an alternate failure domain (if we go down the route of supporting that — we would probably need to recognise somehow that the spot capacity is out and retry a different failure domain).

> **Reply:** I think that for a first implementation, letting the machine be recreated is reasonable. We can always do smarter things based on usage feedback once we have something in place.

To enable this, the following MachineHealthCheck should be deployed:
```yaml
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: cluster-api-termination-handler
  labels:
    api: clusterapi
    k8s-app: termination-handler
spec:
  selector:
    matchLabels:
      cluster.x-k8s.io/interruptible-instance: "" # This label should be automatically applied to all spot instances
  maxUnhealthy: 100% # No reason to block the interruption, it's inevitable
  unhealthyConditions:
  - type: Terminating
    status: "True"
    timeout: 0s # Immediately terminate the instance
```
#### Running the termination handler

The Termination Pod will be part of a DaemonSet that can be deployed using [ClusterResourceSet](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20200220-cluster-resource-set.md). The DaemonSet will select Nodes which are labelled as spot instances, ensuring the Termination Pod only runs on instances that require termination handlers.

> **Review comment:** Why? I think this should be enabled for all nodes by default.

> **Reply:** The termination notices that these termination handlers rely upon are only implemented on spot/preemptible instances. Eg on AWS, if the spot instance is going away, we get a 2 minute warning via the metadata endpoint. There is no equivalent for an on-demand instance. I appreciate that there are times that cloud providers remove instances for maintenance reasons, and there may be some way to detect this programmatically, but as far as I know they use a different mechanism and are also far less frequent. For now at least, I think they are out of scope for these termination handlers.

> **Reply:** The proposal referenced here #3528 (comment) describes the situation where the node can be unexpectedly shut down.

The spot label will be added to the Node by the machine controller as described [here](#interruptible-label), provided the provider supports spot instances and the instance is a spot instance.
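Putting these pieces together, a DaemonSet along the following lines could deploy the handler. The image name, flags, and kubeconfig host path are illustrative assumptions, not part of this proposal:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cluster-api-termination-handler # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: termination-handler
  template:
    metadata:
      labels:
        k8s-app: termination-handler
    spec:
      # Only schedule onto Nodes the machine controller labelled as interruptible.
      nodeSelector:
        cluster.x-k8s.io/interruptible-instance: ""
      # The metadata service is only reachable from the host network.
      hostNetwork: true
      tolerations:
      - operator: Exists # keep running on tainted/terminating nodes
      containers:
      - name: termination-handler
        image: example.org/cluster-api/termination-handler:latest # hypothetical image
        args: ["--provider=aws", "--poll-interval=5s"] # hypothetical flags
        securityContext:
          runAsUser: 0 # kubelet credentials are only readable by root
        volumeMounts:
        - name: kubeconfig
          mountPath: /etc/kubernetes/kubeconfig
          readOnly: true
      volumes:
      - name: kubeconfig
        hostPath:
          path: /etc/kubernetes/kubelet.conf # path varies by distro; assumption
          type: File
```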
#### Termination handler security

The metadata services hosted by the cloud providers are only accessible from the hosts themselves, so the pod will need to run within the host network.

To restrict the possible effects of the termination handler, it should re-use the Kubelet credentials, which pass through the [NodeRestriction](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction) admission controller. This limits the termination handler to modifying only the Node on which it is running; eg, it would not be able to set conditions on a different Node.

To be able to mount and read the Kubelet credentials, the pod must be able to mount host paths and, since the credentials are readable only by the root user, the termination handler must be able to run as root.
### Future Work

#### Support for MachinePools

While MachinePools are being implemented across the three cloud providers that this project covers,
> **Review comment:** Where would users get this termination handler?

> **Reply:** The plan is to make something similar to what calico has.

> **Review comment:** I'm a bit curious as to why a new termination handler rather than trying to leverage/extend existing projects, such as https://github.com/aws/aws-node-termination-handler, for this purpose? One of the things that stands out to me is that the aws-node-termination-handler supports an SQS-based system that would avoid the security concerns mentioned below.

> **Reply:** Up for discussion as to whether this benefit is worth it or not, but the benefit of having our own termination handler is that it reduces the time to get a new Node. Using a normal handler that needs to be able to drain the node, we have to wait for the Node to go away and the MHC to remediate. Using our own, we can trigger machine deletion as soon as the termination notice is served, causing the MachineSet to create a new Machine. In some cases this won't really be all that useful, but there are a couple I can think of where it is. In GCP, preemptible instances are only allowed to run for 24 hours and then get served a termination notice; this would save time in replacing the lost compute capacity. Secondly, while CAPI doesn't support it yet, if a MachineSet supported multi-AZ (which I believe there's discussion of somewhere), the MachineSet may be able to create a new instance in a different AZ if a single AZ is terminating its instances. As a secondary benefit, by leveraging the machine controller shutdown logic, if we ever encode any extra logic for an instance (pre-stop hooks), this would also be leveraged before the instance is actually gone (more useful on AWS with the 2 minute warning than others, I admit).

> **Reply:** @JoelSpeed is that something that we could address by working with related upstream projects to add support for Cluster API, rather than having to replicate support for the various different ways to detect spot instance termination across various providers?

> **Reply:** @detiber One more benefit of having our own termination handler is that we will have only one project for all infra providers that support spot instances. This allows us to deliver only one binary to users instead of three different ones, one per cloud. It's not a full replication of other projects; it's more like a stripped-down version of them that should reduce the complexity of integrating termination handler functionality into CAPI.