Tried to taint/delete more than required nodes while scale down #10

Closed
prashanth26 opened this issue Oct 25, 2018 · 3 comments · Fixed by #11
Labels
area/dev-productivity Developer productivity related (how to improve development) kind/bug Bug topology/seed Affects Seed clusters

Comments


prashanth26 commented Oct 25, 2018

Issue

  • During scale-down, the autoscaler tainted more nodes than it was supposed to:
    • Desired scale-down: 30
    • Actual deletions: 30
    • Actual taints: 42
  • Even so, the autoscaler failed to remove the taints from the nodes it had wrongly tainted. Why is that?

Logs

I1025 07:39:06.934286 1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1540453142 Effect:NoSchedule TimeAdded:} on node shoot--jaas--jaask8sos-worker-unfpk-z1-88cbf4966-qh29c

I1025 07:39:06.938911 1 delete.go:119] Successfully released toBeDeletedTaint on node shoot--jaas--jaask8sos-worker-unfpk-z1-88cbf4966-qh29c

E1025 07:39:07.131582 1 scale_down.go:641] Problem with empty node deletion: failed to delete shoot--jaas--jaask8sos-worker-unfpk-z1-88cbf4966-vg5fg: Unable to update MachineDeployment object shoot--jaas--jaask8sos-worker-unfpk-z1

E1025 07:39:07.131719 1 static_autoscaler.go:373] Failed to scale down:

@prashanth26 prashanth26 added area/dev-productivity Developer productivity related (how to improve development) kind/bug Bug status/new Issue is new and unprocessed topology/seed Affects Seed clusters labels Oct 25, 2018

FYI - @hardikdr

@prashanth26 prashanth26 changed the title Failed to remove unwanted ToBeDeletedByClusterAutoscaler taints Tried to taint/delete more than required nodes while scale down Nov 13, 2018

prashanth26 commented Nov 20, 2018

There are two parts to this issue.

1. MachineDeployment updates fail when multiple updates arrive within a short span

  • When Cluster-Autoscaler (CA) scales down multiple nodes, the MachineDeployment replica field is updated repeatedly in quick succession, and some of those updates fail.
  • As a result, a new set of candidate machines might be chosen on each retry, so more machines than desired end up tainted for deletion.
  • We need retry logic for updating the MachineDeployment object.
  • This should keep the chosen set of candidate machines consistent across attempts.

2. Node updates have no retry logic (more severe issue)

  • This has been fixed in the upstream autoscaler.
  • We need to rebase our cluster autoscaler onto upstream.
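
To illustrate why a missing retry leaves stray taints behind, here is a small sketch of a taint release that retries a failed node update. It is not the upstream fix, which is not shown in this thread. `FakeCluster`, `UpdateTaints`, `ReleaseTaint` and `StillTainted` are hypothetical names, and the injected failures merely simulate transient update errors. Only the taint key `ToBeDeletedByClusterAutoscaler` comes from the logs above.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Taint key as it appears in the autoscaler logs.
const ToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

var errTransient = errors.New("transient update failure")

// FakeCluster is a hypothetical in-memory stand-in for the node API.
type FakeCluster struct {
	Taints       map[string][]string // node name -> taint keys
	FailuresLeft map[string]int      // injected transient failures per node
}

// UpdateTaints fails while injected failures remain for the node.
func (c *FakeCluster) UpdateTaints(node string, taints []string) error {
	if c.FailuresLeft[node] > 0 {
		c.FailuresLeft[node]--
		return errTransient
	}
	c.Taints[node] = taints
	return nil
}

// ReleaseTaint removes the deletion taint from one node, recomputing the
// desired taint list from current state on each attempt.
func ReleaseTaint(c *FakeCluster, node string, maxRetries int) error {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		kept := []string{}
		for _, t := range c.Taints[node] {
			if t != ToBeDeletedTaint {
				kept = append(kept, t)
			}
		}
		if lastErr = c.UpdateTaints(node, kept); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("releasing taint on %s: %w", node, lastErr)
}

// StillTainted lists, in sorted order, the nodes that still carry the taint.
func StillTainted(c *FakeCluster) []string {
	var names []string
	for node, taints := range c.Taints {
		for _, t := range taints {
			if t == ToBeDeletedTaint {
				names = append(names, node)
				break
			}
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	c := &FakeCluster{
		Taints: map[string][]string{
			"node-a": {ToBeDeletedTaint},
			"node-b": {ToBeDeletedTaint},
		},
		FailuresLeft: map[string]int{"node-b": 2},
	}
	for _, node := range []string{"node-a", "node-b"} {
		if err := ReleaseTaint(c, node, 3); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println("still tainted:", StillTainted(c))
}
```

With `maxRetries` set to 0, `node-b` would keep its taint after the first failed update, which resembles the leftover taints reported in this issue.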

@prashanth26 prashanth26 self-assigned this Dec 5, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 removed the status/new Issue is new and unprocessed label Dec 25, 2018
@KesavanKing

During my performance tests with the Density framework, the nodes scaled down from 10 to 5 after the workloads were deleted. I hit this issue there as well: the taints were still present on the remaining nodes.


3 participants