Tried to taint/delete more than required nodes while scale down #10

prashanth26 · 2018-10-25T09:06:12Z

Issue

During scale-down, the autoscaler tainted more nodes than it was supposed to:
- Desired scale down - 30
- Actual deleted - 30
- Actual taints - 42
Even so, the autoscaler failed to remove the taints from the nodes it had wrongly tainted. Why is that?

Logs

I1025 07:39:06.934286 1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1540453142 Effect:NoSchedule TimeAdded:} on node shoot--jaas--jaask8sos-worker-unfpk-z1-88cbf4966-qh29c

I1025 07:39:06.938911 1 delete.go:119] Successfully released toBeDeletedTaint on node shoot--jaas--jaask8sos-worker-unfpk-z1-88cbf4966-qh29c

E1025 07:39:07.131582 1 scale_down.go:641] Problem with empty node deletion: failed to delete shoot--jaas--jaask8sos-worker-unfpk-z1-88cbf4966-vg5fg: Unable to update MachineDeployment object shoot--jaas--jaask8sos-worker-unfpk-z1

E1025 07:39:07.131719 1 static_autoscaler.go:373] Failed to scale down:

prashanth26 · 2018-10-25T09:12:51Z

FYI - @hardikdr

prashanth26 · 2018-11-20T08:37:04Z

There are two parts to this issue.

1. MachineDeployment updates fail when multiple updates arrive within a short span

When the Cluster Autoscaler (CA) scales down multiple nodes, the MachineDeployment replica field is updated repeatedly in quick succession, and some of these updates fail.
As a result, a new set of candidate machines might be chosen on each try, so more machines than desired end up tainted with the deletion annotation.
We need retry logic for updating the MachineDeployment object.
This should keep the chosen set of candidate machines consistent.

2. Node updates don't have retry logic (more severe issue)

This has been fixed in the upstream autoscaler.
We need to rebase our cluster autoscaler onto upstream.

KesavanKing · 2018-12-28T06:38:12Z

During my performance tests with the Density framework, the nodes scaled down from 10 to 5 after the workloads were deleted. I faced this issue there as well: the taints were still visible on the remaining nodes.

prashanth26 added the area/dev-productivity, kind/bug, status/new, and topology/seed labels Oct 25, 2018

prashanth26 changed the title ~~Failed to remove unwanted ToBeDeletedByClusterAutoscaler taints~~ Tried to taint/delete more than required nodes while scale down Nov 13, 2018

prashanth26 self-assigned this Dec 5, 2018

prashanth26 mentioned this issue Dec 5, 2018

Update k8s to 1.12 #11

Merged

hardikdr closed this as completed in #11 Dec 24, 2018

gardener-robot-ci-1 removed the status/new label Dec 25, 2018
