
Cluster Autoscaler erroneously triggered during PVC binding #923

Closed
gtie opened this issue Jun 6, 2018 · 11 comments
Labels
area/cluster-autoscaler · area/provider/aws · lifecycle/rotten

Comments

@gtie

gtie commented Jun 6, 2018

Scale-up is currently triggered by any unschedulable pod. However, a pod can be unschedulable for just a few seconds while it waits for a volume to be bound, and those few seconds are enough for Cluster Autoscaler to kick off the creation of a new instance.

Here is how the sequence looks in the event stream:

$ k get events --field-selector involvedObject.uid=45b82c29-6970-11e8-a9c2-02145a4f4030 -o=custom-columns=CREATED:.metadata.creationTimestamp,SOURCE:.source.component,REASON:.reason,MSG:.message
CREATED                SOURCE                    REASON                   MSG
2018-06-06T09:59:16Z   default-scheduler         FailedScheduling         pod has unbound PersistentVolumeClaims (repeated 92 times)
2018-06-06T09:59:22Z   cluster-autoscaler        TriggeredScaleUp         pod triggered scale-up: [{1.default.nodes.ASG-XXXXXX 37->38 (max: 39)}]
2018-06-06T09:59:23Z   default-scheduler         Scheduled                Successfully assigned os-strg-osq6574-2-68d64976f6-4s5j2 to ip-10-117-5-157.XXXXXX
2018-06-06T09:59:23Z   kubelet                   SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "logs-volume" 
2018-06-06T09:59:23Z   kubelet                   SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "default-token-tm7c7" 
2018-06-06T09:59:23Z   attachdetach-controller   FailedAttachVolume       AttachVolume.Attach failed for volume "pvc-45afef4a-6970-11e8-96b5-0a38a7e18608" : "Error attaching EBS volume \"vol-0a2eddf3918053d85\"" to instance "i-0c6399097e297ff80" since volume is in "creating" state
2018-06-06T09:59:27Z   attachdetach-controller   SuccessfulAttachVolume   AttachVolume.Attach succeeded for volume "pvc-45afef4a-6970-11e8-96b5-0a38a7e18608" 
2018-06-06T09:59:32Z   kubelet                   SuccessfulMountVolume    MountVolume.SetUp succeeded for volume "pvc-45afef4a-6970-11e8-96b5-0a38a7e18608" 
2018-06-06T09:59:35Z   kubelet                   Pulling                  pulling image "XXXXXX"
2018-06-06T09:59:50Z   kubelet                   Pulled                   Successfully pulled image "XXXXXX"
2018-06-06T09:59:50Z   kubelet                   Created                  Created container
2018-06-06T09:59:50Z   kubelet                   Started                  Started container
2018-06-06T09:59:50Z   kubelet                   Pulling                  pulling image "alpine:3.6"
2018-06-06T09:59:53Z   kubelet                   Pulled                   Successfully pulled image "alpine:3.6"
2018-06-06T09:59:54Z   kubelet                   Created                  Created container
2018-06-06T09:59:54Z   kubelet                   Started                  Started container
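For reference, the window can be reproduced with nothing more exotic than a pod created together with a freshly provisioned, EBS-backed claim. The manifest below is an illustrative sketch (the names and the gp2 StorageClass are assumptions, not taken from the issue): until the claim binds, the scheduler reports "pod has unbound PersistentVolumeClaims" and the pod counts as unschedulable.

# Illustrative sketch only; resource names and the gp2 StorageClass are assumptions.
# While the dynamically provisioned EBS volume is still binding (a few seconds),
# the pod below is reported as unschedulable, which is enough to trigger a scale-up.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-binding-demo
spec:
  containers:
    - name: app
      image: alpine:3.6
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc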

In my particular case, where the above bug is combined with a high concentration of PodDisruptionBudget=0 pods and a significant amount of turnover, the extra node is often there to stay. The combination quickly leads to very low usage density and very high server costs.
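To make the scale-down half of the problem concrete: a budget that allows zero disruptions prevents Cluster Autoscaler from evicting the pod, so a node added because of a few seconds of unschedulability can never be drained again. Below is a minimal sketch of such a budget, with hypothetical names, using the policy/v1beta1 API available on the 1.9/1.10 clusters mentioned.

# Hypothetical example of a zero-disruption budget like those described above.
# Any node hosting a pod matched by this selector cannot be drained, so
# Cluster Autoscaler will not scale it down.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: no-disruptions-allowed
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: storage-service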

The above problem is observable in Kubernetes clusters running versions 1.9 and 1.10 on AWS. Cluster Autoscaler version in use: v1.2.2.

@aleksandra-malinowska added the area/cluster-autoscaler and area/provider/aws labels Jun 6, 2018
@MaciekPytel
Contributor

@gtie Can you specify which CA version you are using?

@gtie
Author

gtie commented Jun 7, 2018

@MaciekPytel, I've updated the original description with the version (v1.2.2).

@efilipov

No one else seeing this sequence of events?

@hercynium
Contributor

hercynium commented Sep 13, 2018

My team at work has implemented a workaround for this issue: we introduced a delay so that pods that are too "young" will not trigger a scale-up. The CA only considers scale-up for unschedulable pods past a certain age, which is configurable via the command line.

So far, in production, setting a delay of 2m seems to have eliminated the issue for us. I suspect the delay could be as little as 30s and this workaround would still be effective.
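In rough terms, the change amounts to something like the following sketch (illustrative only, not the actual patch; the package and function names are made up): unschedulable pods younger than the configured delay are filtered out before scale-up is considered, and a delay of 0 keeps the default behaviour. The filtered list would then be fed to the existing scale-up logic in place of the raw unschedulable pod list.

// Sketch of the workaround idea only, not the actual patch.
package podfilter

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// FilterOutYoungPods returns only the unschedulable pods that have existed
// for at least scaleUpDelay; with a zero delay every pod passes through,
// so the autoscaler's default behaviour is unchanged.
func FilterOutYoungPods(pods []*apiv1.Pod, now time.Time, scaleUpDelay time.Duration) []*apiv1.Pod {
	if scaleUpDelay <= 0 {
		return pods
	}
	oldEnough := make([]*apiv1.Pod, 0, len(pods))
	for _, pod := range pods {
		// CreationTimestamp is set by the API server when the pod is created,
		// so age is measured from creation, not from when the pod became unschedulable.
		if now.Sub(pod.CreationTimestamp.Time) >= scaleUpDelay {
			oldEnough = append(oldEnough, pod)
		}
	}
	return oldEnough
}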

Would this be something worth submitting as a pull request?

@WebSpider
Contributor

That sounds like a good addition to solve this issue :)

hercynium added a commit to hercynium/autoscaler that referenced this issue Sep 14, 2018
  - This is intended to address the issue described in kubernetes#923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
@hercynium
Contributor

OK, I've posted a diff and a PR.

@bskiba
Member

bskiba commented Sep 17, 2018

@aleksandra-malinowska Can you take a look?

hercynium added a commit to hercynium/autoscaler that referenced this issue Nov 19, 2018
  - This is intended to address the issue described in kubernetes#923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 16, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 15, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hardikdr pushed a commit to hardikdr/autoscaler that referenced this issue Apr 1, 2020
  - This is intended to address the issue described in kubernetes#923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024
[Makefile] Use $(PROJECT_DIR) instead of $(shell pwd)