📖 Update spot instances proposal with termination handler design #3528
Conversation
Hi @alexander-demichev. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@alexander-demichev could you please link to the PR/issue/comment where this amendment was requested?
/kind documentation
/hold
for termination.

The code for all providers will be mostly similar; the only difference is the logic for polling the termination endpoint and processing the response.
This means that the cloud provider type can be passed as an argument and the termination handler can be common for all providers and placed in a separate repository, for example: `kubernetes-sigs/cluster-api-termination-handler`.
Given that this doesn't actually exist today, should we remove it?
Do you have a POC of what the code would look like?
Yes, I'll put it all together and demonstrate it, probably next week.
@CecileRobertMichon Hi, I made a prototype that should work https://github.com/alexander-demichev/termination-handler.
To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.
Where would users get this termination handler?
The plan is to make something similar to what Calico has.
I'm a bit curious as to why a new termination handler rather than trying to leverage/extend existing projects, such as https://github.com/aws/aws-node-termination-handler for this purpose?
One of the things that stands out to me is that the aws-node-termination-handler supports an SQS-based system that would avoid the security concerns mentioned below.
Up for discussion as to whether it is worth it, but the benefit of having our own termination handler is that it reduces the time to get a new Node.
With a normal handler, which needs to be able to drain the node, we then have to wait for the Node to go away and for the MHC to remediate.
With our own, we can trigger machine deletion as soon as the termination notice is served, causing the MachineSet to create a new Machine.
In some cases this won't be all that useful, but there are a couple I can think of where it is.
In GCP, preemptible instances are only allowed to run for 24 hours before being served a termination notice, so this would save time in replacing the lost compute capacity. Secondly, while CAPI doesn't support it yet, if a MachineSet supported multiple AZs (which I believe there's discussion of somewhere), the MachineSet may be able to create a new instance in a different AZ if a single AZ is terminating its instances.
As a secondary benefit, by leveraging the machine controller shutdown logic, if we ever encode any extra logic for an instance (pre-stop hooks), this would also be leveraged before the instance is actually gone (more useful on AWS with the 2-minute warning than on others, I admit).
@JoelSpeed is that something that we could address by working with related upstream projects to add support for Cluster API rather than having to replicate support for the various different ways to detect spot instance termination across various providers?
@detiber One more benefit of having our own termination handler is that we will have only one project for all infra providers that support spot instances. This allows us to deliver only one binary to users instead of three different ones, one per cloud.
It's not a full replication of other projects; it's more like a stripped-down version of them, which should reduce the complexity of integrating termination handler functionality into CAPI.
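As a rough sketch (not part of the proposal; the image name and flags are hypothetical), the single binary could select the provider-specific polling logic via a container argument in the DaemonSet pod spec:

```yaml
# Fragment of a DaemonSet pod spec - names and flags are hypothetical.
containers:
- name: termination-handler
  image: example.org/cluster-api-termination-handler:latest  # hypothetical image
  args:
  - --provider=aws           # or gcp / azure; selects the metadata-polling logic
  - --poll-interval=5s       # hypothetical: how often to check the termination endpoint
  - --node-name=$(NODE_NAME)
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```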
```yaml
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: cluster-api-termination-handler
  labels:
    api: clusterapi
    k8s-app: termination-handler
spec:
  selector:
    matchLabels:
      cluster.x-k8s.io/interruptible-instance: "" # This label is automatically applied to all spot instances
  maxUnhealthy: 100% # No reason to block the interruption, it's inevitable
  unhealthyConditions:
  - type: Terminating
    status: "True"
    timeout: 0s # Immediately terminate the instance
```
Shouldn't this be something that the MachinePool controller should handle rather than MachineHealthCheck?
Spot instances are implemented for InfraMachines, not just MachinePools (in fact they aren't implemented for MachinePools yet, see Future work below). This solution only works for Machines. For ASGs/VMSS this would need to be part of the MachinePool controller, but we would need both.
In that case, should the Machine controller be responsible for noticing this condition? That way it would be the same in both Machine and MachinePool?
#### Running the termination handler
The Termination Pod will be part of a DaemonSet, which should be deployed manually by users. It will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.
"which should be deployed manually by users" -> can ClusterResourceSets be leveraged instead?
Are ClusterResourceSets usable at this point/will they be by the time we want to implement this proposal? I believe they would solve the issue well
They are usable (we use them in e2e) but are still experimental and turned on by a feature flag.
We can definitely use ClusterResourceSets for this
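For reference, a ClusterResourceSet wrapping the termination handler DaemonSet might look roughly like the sketch below; the resource names and the cluster-selecting label are assumptions for illustration, not part of the proposal:

```yaml
apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: termination-handler               # hypothetical name
  namespace: default
spec:
  clusterSelector:
    matchLabels:
      spot-instances: "enabled"            # hypothetical label on Clusters that use spot instances
  resources:
  - kind: ConfigMap
    name: termination-handler-daemonset    # hypothetical ConfigMap holding the DaemonSet manifest
```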
Once the Node has been marked with the `Terminating` condition, it will be
the MachineHealthCheck controller's responsibility to ensure that the Machine
is deleted, triggering it to be drained and removed from the cluster.
Wouldn't MHC cause a new Machine to be created? Since the idea is to replace an unhealthy machine? Is that what we want when a spot VM is terminated for lack of capacity?
I think yes, we need to delete the machine and leverage node draining here.
Given the Machine is going to be `Failed`, is there any reason to keep it around apart from the fact we know a replacement would fail?
I guess when you're running a setup like this you kind of need the cluster autoscaler to notice that there's no capacity and scale up elsewhere, or maybe have the MachineSet bring up a node in an alternate failure domain (if we go down the route of supporting that? Would probably need to recognise somehow that the spot capacity is out and retry a different failure domain).
I think that for a first implementation letting the machine be recreated is reasonable. We can always do smarter things based on usage feedback once we have something in place.
The termination pod will be developed to poll the metadata service for the Node
that it is running on.
We will implement request/response handlers for each of the three cloud providers
"each of the three cloud providers" seems a bit restrictive, maybe this can be
We will implement request/response handlers for each of the three cloud providers | |
We will implement request/response handlers for each infrastructure provider that supports non-guaranteed instances |
#### Running the termination handler
The Termination Pod will be part of a DaemonSet, which should be deployed manually by users. It will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.

The spot label will be added by the cloud providers as they create instances, provided they support spot instances and the instance is a spot instance. This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster.
"This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster." -> this seems like an important limitation, by this do you mean the Cluster API infra provider has to be able to sync labels between CAPI Machines and nodes, Infra Machines and nodes, or something else?
I think somehow we need to mark nodes as spot instances (using a well known label) so that the DaemonSet only deploys to spot instances.
Is there anywhere in CAPI already where we can put a label on a Machine and have that copied to the Node once it joins the cluster? If there is, could the Infra provider add a label to that set when it creates a spot instance?
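For illustration, assuming such a label does get synced to the Node, the DaemonSet could target spot instances with a nodeSelector; everything except the `cluster.x-k8s.io/interruptible-instance` label (taken from the proposal's example) is a hypothetical sketch:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cluster-api-termination-handler        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: termination-handler
  template:
    metadata:
      labels:
        k8s-app: termination-handler
    spec:
      nodeSelector:
        cluster.x-k8s.io/interruptible-instance: ""   # only schedule onto spot/preemptible Nodes
      tolerations:
      - operator: Exists                               # tolerate any taints on those Nodes
      containers:
      - name: termination-handler
        image: example.org/termination-handler:latest  # hypothetical image
```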
Otherwise the core machine controller could infer from the `InfrastructureRef` whether this is a non-guaranteed instance and add the label to the node as it adds the nodeRef.
I created an issue for tracking this, #3504, and a POC PR for AWS: kubernetes-sigs/cluster-api-provider-aws#1876.
The spot label will be added by the cloud providers as they create instances, provided they support spot instances and the instance is a spot instance. This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster.

#### Termination handler security
To be able to perform the actions required by the termination handler, the pod will need to be relatively privileged.
"the pod will need to be relatively privileged." seems a bit vague, what access does it need exactly? What are the actions required by the termination handler?
Agreed this is a bit vague; I believe it is an introductory sentence for the rest of this section, which explains the privilege level required to run this termination handler. Could probably just drop this whole sentence.
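To make the privilege level concrete: assuming the handler only needs to set a condition on its own Node (as described in this proposal), the required RBAC would be roughly the sketch below; the role name is hypothetical and the exact rules would depend on the final design:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: termination-handler    # hypothetical name
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]               # read its own Node
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch"]             # set the Terminating condition
```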
To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.

The termination pod will be developed to poll the metadata service for the Node
I'm not 100% sure this would work for Azure. At the bottom of this proposal we wrote:
Azure uses their Scheduled Events API to notify Spot VMs that they are due to be preempted.
This is a similar service to the AWS metadata service that each machine can poll to see events for itself.
Azure only gives 30 seconds warning for nodes being preempted though.
A Daemonset solution similar to the AWS termination handlers could be implemented to provide graceful shutdown with Azure Spot VMs.
For example see this [existing solution](https://github.com/awesomenix/drainsafe).
cc @awesomenix
We are planning to integrate https://github.com/awesomenix/drainsafe to automatically cordon and drain any terminating instances for machinepools. We can probably extend it to execute additional termination handlers. Just a reminder that you get at most 30 seconds for a Spot VM on Azure; I am not sure how much you can do in that time.
For termination handlers, it should be sufficient to use node draining logic that we already have implemented in CAPI https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller.go#L276
We have a POC for AWS which marks the node with a condition as described in this proposal; the Azure one can be pretty much identical to this, just substituting the logic for how to read the response and adding the required header to the request.
While I agree 30 seconds isn't very long to actually do much in the drain, I think the benefit would come from replacing nodes quicker. Without the contents of this proposal, the instance gets terminated; if an MHC is present, it will remove the Machine after some time of being unhealthy, and then the MachineSet replaces the node.
With this proposal, we get an up-to-30-second warning, which then allows us to mark the Node, have an MHC notice this mark, delete the Machine, and then have the MachineSet controller create a new Machine. This should pretty much all happen "instantly", so we should start the creation of the new instance before the old instance goes away rather than a few minutes after it has already shut down.
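For clarity, the condition the handler would set on the Node might look roughly like the sketch below (the `reason` and `message` values are illustrative, not fixed by the proposal); the MachineHealthCheck shown earlier matches on it:

```yaml
# Fragment of the Node .status set by the termination handler (sketch).
status:
  conditions:
  - type: Terminating
    status: "True"
    reason: TerminationRequested       # hypothetical reason
    message: "Termination notice received from the cloud metadata service"
    lastTransitionTime: "2020-08-25T12:00:00Z"
```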
Still need to agree on #3528 (comment) but overall this looks good to me.
There's a KEP that was discussed on sig-node to hook into systemd's ability to delay shutdown so the kubelet can handle shutdown gracefully: kubernetes/enhancements#2001. I do like the idea of hooking this in as a separate component as done here.
@cpuguy83 Thanks for sharing the link. The logic described in this proposal might not work with spot instances; I am not sure that we can delay the shutdown of a spot instance.
@CecileRobertMichon @vincepri @JoelSpeed This proposal amendment is ready for another round of reviews.
/ok-to-test
/milestone v0.4.0
/milestone Next
Folks, what's the status of this proposal?
I think all concerns and comments were addressed; I'm happy to continue with this proposal.
@alexander-demichev are there any breaking changes in this proposal?
@alexander-demichev please update the `last-updated` date on the proposal
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@CecileRobertMichon sure, it's done
@CecileRobertMichon There are no breaking changes.
@alexander-demichev: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Couple of things:
- I would be tempted to reduce the proposal here to its essentials - defining the contract - and defer the per-provider implementation to CAEPs (or preferred processes) for those providers.
- We should also at least mention the existence of https://github.com/aws/aws-node-termination-handler in the alternatives section.
#### Running the termination handler
The Termination Pod will be part of a DaemonSet that can be deployed using [ClusterResourceSet](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20200220-cluster-resource-set.md). The DaemonSet will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.
Why? I think this should be enabled for all nodes by default.
Even if the problem is clearly more important for spot instances, every node that is about to be terminated, for whatever reason, should attempt to properly drain.
The termination notices that these termination handlers rely upon are only implemented on the spot/preemptible instances. E.g. on AWS, if the spot instance is going away, we get a 2-minute warning via the metadata endpoint. There is no equivalent for an on-demand instance.
I appreciate that there are times that cloud providers remove instances for maintenance reasons, and there may be some way programmatically to detect this, but as far as I know they use a different mechanism and are also far less frequent. For now at least, I think they are out of scope for these termination handlers
The proposal referenced here, #3528 (comment), describes the situation where the node can be unexpectedly shut down.
This got auto-closed because of the branch rename to main.
What this PR does / why we need it:
This PR updates the spot instances proposal with the termination handler design.
Thanks to @JoelSpeed, because it's mostly a copy-paste of his proposal done for OpenShift.