
Cluster Autoscaler erroneously triggered during PVC binding #923

Closed
gtie opened this issue Jun 6, 2018 · 11 comments
Labels
area/cluster-autoscaler · area/provider/aws · lifecycle/rotten

Comments

@gtie

gtie commented Jun 6, 2018

Scale-up is currently triggered by any unschedulable pod. However, a pod can be unschedulable for just a few seconds while it waits for a volume to be bound, and those few seconds are enough for Cluster Autoscaler to kick off the creation of a new instance.

Here is how the sequence looks in the event stream:

$ k get events --field-selector involvedObject.uid=45b82c29-6970-11e8-a9c2-02145a4f4030 -o=custom-columns=CREATED:.metadata.creationTimestamp,SOURCE:.source.component,REASON:.reason,MSG:.message
CREATED                SOURCE                    REASON                   MSG
2018-06-06T09:59:16Z   default-scheduler         FailedScheduling         pod has unbound PersistentVolumeClaims (repeated 92 times)
2018-06-06T09:59:22Z   cluster-autoscaler        TriggeredScaleUp         pod triggered scale-up: [{1.default.nodes.ASG-XXXXXX 37->38 (max: 39)}]
2018-06-06T09:59:23Z   default-scheduler         Scheduled                Successfully assigned os-strg-osq6574-2-68d64976f6-4s5j2 to ip-10-117-5-157.XXXXXX
2018-06-06T09:59:23Z   kubelet                   SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "logs-volume" 
2018-06-06T09:59:23Z   kubelet                   SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "default-token-tm7c7" 
2018-06-06T09:59:23Z   attachdetach-controller   FailedAttachVolume       AttachVolume.Attach failed for volume "pvc-45afef4a-6970-11e8-96b5-0a38a7e18608" : "Error attaching EBS volume \"vol-0a2eddf3918053d85\"" to instance "i-0c6399097e297ff80" since volume is in "creating" state
2018-06-06T09:59:27Z   attachdetach-controller   SuccessfulAttachVolume   AttachVolume.Attach succeeded for volume "pvc-45afef4a-6970-11e8-96b5-0a38a7e18608" 
2018-06-06T09:59:32Z   kubelet                   SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "pvc-45afef4a-6970-11e8-96b5-0a38a7e18608" 
2018-06-06T09:59:35Z   kubelet                   Pulling                  pulling image "XXXXXX"
2018-06-06T09:59:50Z   kubelet                   Pulled                   Successfully pulled image "XXXXXX"
2018-06-06T09:59:50Z   kubelet                   Created                  Created container
2018-06-06T09:59:50Z   kubelet                   Started                  Started container
2018-06-06T09:59:50Z   kubelet                   Pulling                  pulling image "alpine:3.6"
2018-06-06T09:59:53Z   kubelet                   Pulled                   Successfully pulled image "alpine:3.6"
2018-06-06T09:59:54Z   kubelet                   Created                  Created container
2018-06-06T09:59:54Z   kubelet                   Started                  Started container
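For reference, the window can be reproduced with nothing more exotic than a pod created together with a freshly provisioned, EBS-backed claim. The manifest below is an illustrative sketch (the names and the gp2 StorageClass are assumptions, not taken from the issue): until the claim binds, the scheduler reports "pod has unbound PersistentVolumeClaims" and the pod counts as unschedulable.

# Illustrative sketch only; resource names and the gp2 StorageClass are assumptions.
# While the dynamically provisioned EBS volume is still binding (a few seconds),
# the pod below is reported as unschedulable, which is enough to trigger a scale-up.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-binding-demo
spec:
  containers:
    - name: app
      image: alpine:3.6
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc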

In my particular case, where the above bug is combined with a high concentration of PodDisruptionBudget=0 pods and a significant amount of turnover, the extra node is often there to stay. The combination quickly leads to very low usage density and very high server costs.
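To make the scale-down half of the problem concrete: a budget that allows zero disruptions prevents Cluster Autoscaler from evicting the pod, so a node added because of a few seconds of unschedulability can never be drained again. Below is a minimal sketch of such a budget, with hypothetical names, using the policy/v1beta1 API available on the 1.9/1.10 clusters mentioned.

# Hypothetical example of a zero-disruption budget like those described above.
# Any node hosting a pod matched by this selector cannot be drained, so
# Cluster Autoscaler will not scale it down.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: no-disruptions-allowed
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: storage-service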

The above problem is observable in Kubernetes clusters running versions 1.9 and 1.10 on AWS. Cluster Autoscaler version in use: v1.2.2.

@aleksandra-malinowska added the area/cluster-autoscaler and area/provider/aws labels Jun 6, 2018
@MaciekPytel
Contributor

@gtie Can you specify which CA version you are using?

@gtie
Author

gtie commented Jun 7, 2018

@MaciekPytel, I've updated the original description with the version (v1.2.2).

@efilipov

No one else seeing this sequence of events?

@hercynium
Contributor

hercynium commented Sep 13, 2018

My team at work has implemented a workaround for this issue: we introduced a delay so that pods that are too "young" will not trigger a scale-up. The CA only considers scale-up for unschedulable pods past a certain age, which is configurable via the command line.

So far, in production, setting a delay of 2m seems to have eliminated the issue for us. I suspect the delay could be as little as 30s and this workaround would still be effective.
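In rough terms, the change amounts to something like the following sketch (illustrative only, not the actual patch; the package and function names are made up): unschedulable pods younger than the configured delay are filtered out before scale-up is considered, and a delay of 0 keeps the default behaviour. The filtered list would then be fed to the existing scale-up logic in place of the raw unschedulable pod list.

// Sketch of the workaround idea only, not the actual patch.
package podfilter

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// FilterOutYoungPods returns only the unschedulable pods that have existed
// for at least scaleUpDelay; with a zero delay every pod passes through,
// so the autoscaler's default behaviour is unchanged.
func FilterOutYoungPods(pods []*apiv1.Pod, now time.Time, scaleUpDelay time.Duration) []*apiv1.Pod {
	if scaleUpDelay <= 0 {
		return pods
	}
	oldEnough := make([]*apiv1.Pod, 0, len(pods))
	for _, pod := range pods {
		// CreationTimestamp is set by the API server when the pod is created,
		// so age is measured from creation, not from when the pod became unschedulable.
		if now.Sub(pod.CreationTimestamp.Time) >= scaleUpDelay {
			oldEnough = append(oldEnough, pod)
		}
	}
	return oldEnough
}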

Would this be something worth submitting as a pull request?

@WebSpider
Contributor

That sounds like a good addition to solve this issue :)

hercynium added a commit to hercynium/autoscaler that referenced this issue Sep 14, 2018
  - This is intended to address the issue described in kubernetes#923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
@hercynium
Contributor

OK, I've posted a diff and a PR.

@bskiba
Member

bskiba commented Sep 17, 2018

@aleksandra-malinowska Can you take a look?

hercynium added a commit to hercynium/autoscaler that referenced this issue Nov 19, 2018
  - This is intended to address the issue described in kubernetes#923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 16, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 15, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hardikdr pushed a commit to hardikdr/autoscaler that referenced this issue Apr 1, 2020
  - This is intended to address the issue described in kubernetes#923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024
[Makefile] Use $(PROJECT_DIR) instead of $(shell pwd)