
VPA: prune stale container aggregates, split recommendations over true number of containers #6745

Open
wants to merge 12 commits into base: master
Conversation


@jkyros jkyros commented Apr 22, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Previously we weren't cleaning up "stale" aggregates when container names changed (because of renames or removals), and that was resulting in:

  • VPAs showing recommendations for containers which no longer exist
  • Resources being split across containers which no longer exist (resulting in some containers ending up with resource limits too small for them to function effectively)
  • A corner case during a rollout after a container was renamed or removed from a deployment: we were counting the number of unique container names rather than the actual number of containers in each pod, so we were splitting resources that shouldn't have been split.

This PR is an attempt to clean up those stale aggregates without incurring too much overhead, and make sure that the resources get spread across the correct number of containers during a rollout.
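As a rough illustration of the container-count issue (not the actual recommender code; the types and names below are simplified assumptions), dividing a pod-level minimum by the number of aggregates that have ever existed gives smaller per-container values than dividing by the number of containers actually present in the pod:

package main

import "fmt"

// Simplified stand-ins for the recommender's state; names are illustrative only.
type AggregateState struct{ ContainerName string }

type VPA struct {
	// Aggregates keyed by container name, including entries for containers that
	// were renamed or removed and never cleaned up ("stale" aggregates).
	Aggregates map[string]*AggregateState
	// Count of containers actually observed per pod while loading pods
	// (high-water mark across the VPA's pods).
	ContainersPerPod int
}

// splitMinResources divides a pod-level minimum (e.g. --pod-recommendation-min-memory-mb)
// evenly across a number of containers.
func splitMinResources(podMinMemoryMB int64, divisor int) int64 {
	if divisor < 1 {
		divisor = 1
	}
	return podMinMemoryMB / int64(divisor)
}

func main() {
	vpa := &VPA{
		Aggregates: map[string]*AggregateState{
			"hamster":  {ContainerName: "hamster"},
			"hamster2": {ContainerName: "hamster2"}, // removed from the deployment, aggregate still around
		},
		ContainersPerPod: 1, // the pod now only runs "hamster"
	}
	// Splitting over stale aggregates undercounts; splitting over real containers does not.
	fmt.Println("split over aggregates:", splitMinResources(250, len(vpa.Aggregates)))  // 125
	fmt.Println("split over containers:", splitMinResources(250, vpa.ContainersPerPod)) // 250
}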

Which issue(s) this PR fixes:

Fixes #6744

Special notes for your reviewer:

There are probably a lot of different ways we can do the pruning of stale aggregates for missing containers:

  • I went with explicitly marking and sweeping them because it saved us an additional loop through all the pods and containers
  • We could also just as easily have a PruneAggregates() that runs after LoadPods() and goes through everything and removes them (or do this work as part of LoadPods(), but that seems... deceptive?)
  • We could probably also tweak the existing garbageCollectAggregateCollectionStates and run it immediately after LoadPods() every time, but that might be expensive.

I'm not super-attached to any particular approach, I'd just like to fix this, so I can retool it if necessary.

  • If I am being ignorant, and there are corner cases I'm missing, absolutely let me know
  • It probably needs some tests/cleanup, and I'll change the names of things to... whatever you want them to be. 😄

Does this PR introduce a user-facing change?

Added pruning of container aggregates and changed container math so resources will no longer be split across the wrong number of containers

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Apr 22, 2024
@k8s-ci-robot k8s-ci-robot added area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 22, 2024
@k8s-ci-robot
Contributor

Hi @jkyros. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 22, 2024
@jkyros jkyros marked this pull request as ready for review April 22, 2024 17:56
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2024
@k8s-ci-robot k8s-ci-robot requested a review from kgolab April 22, 2024 17:57
// TODO(jkyros): This only removes the container state from the VPA's aggregate states, there
// is still a reference to them in feeder.clusterState.aggregateStateMap, and those get
// garbage collected eventually by the rate limited aggregate garbage collector later.
// Maybe we should clean those up here too since we know which ones are stale?
Member

Is it a lot of extra work to do that? Do you see any risks doing it here?

Author
@jkyros jkyros May 2, 2024

No, I don't think it's a lot of extra work, it should be reasonably cheap to clean them up here since it's just deletions from the other maps if the keys exist, I just didn't know all the history.

It seemed possible at least that we were intentionally waiting to clean up the aggregates so if there was an unexpected hiccup we didn't just immediately blow away all that aggregate history we worked so hard to get? (Like maybe someone oopses, deletes their deployment, then puts it back? Right now we don't have to start over -- the pods come back in, find their container aggregates, and resume ? But if I clean them up here, we have to start over...)

Member

Cleaning the map is cheap, so I don't mind handling it here. However, in cases like the one mentioned above (where someone deletes a deployment and recreates it immediately), I think it's better to leave this logic to be handled by gc.
@adrianmoisey, any thoughts on this?

Member

Leaving it for the gc feels like the right option to me, for what it's worth

// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
// to erroneously shrink, either)
func (cluster *ClusterState) setVPAContainersPerPod(pod *PodState) {
for _, vpa := range cluster.Vpas {
Member

I'm wondering if there is already a place where this logic could go so we don't have to loop over all VPAs for every pod again here.
In large clusters with a VPA to Pod ratio that's closer to 1 this could be a little wasteful.

Author

Hmm, yeah, I struggled with finding a less expensive way without making too much of a mess. Unless I'm missing something (and I might be), we don't seem to have a VPA <--> Pod map -- probably because we didn't need one until now? At the very least I think I should gate this to only run if the number of containers in the pod is > 1.

Like, I think our options are:

  1. update the VPA as the pods roll through (which requires me to find the VPA for each pod like I did here) or
  2. count the containers as we load the VPAs (but we load the VPAs before we load the pods, so we'd have to go through the pods again, so that doesn't help us)
  3. have the VPA actually track the pods it's managing, something like this: jkyros@6ddc208 (could also just be an array of PodIDs and we could look up the state so we could save the memory cost of the PodState pointer, but you know what I mean)

I put it where I did (option 1) because at least LoadPods() was already looping through all the pods so we could freeload off the "outer" pod loop and I figured we didn't want to spend the memory on option 3. If we'd entertain option 3 and are okay with the memory usage, I can totally do that?
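For reference, here's a minimal sketch of what option 3 could look like, assuming hypothetical, simplified types (the real ClusterState/Vpa structures differ): each VPA keeps the pods it currently matches, so the per-pod container count can be derived without re-scanning all VPAs for every pod.

package main

import "fmt"

// Hypothetical, simplified types for illustration only; the real model differs.
type PodID struct{ Namespace, PodName string }

type PodState struct {
	Containers []string
}

type Vpa struct {
	// Option 3: the VPA tracks the pods it currently matches.
	Pods map[PodID]*PodState
}

// ContainersPerPod returns the high-water mark of container counts across the
// VPA's pods, so a rollout that adds a container is counted right away.
func (v *Vpa) ContainersPerPod() int {
	max := 1
	for _, p := range v.Pods {
		if len(p.Containers) > max {
			max = len(p.Containers)
		}
	}
	return max
}

func main() {
	v := &Vpa{Pods: map[PodID]*PodState{
		{Namespace: "default", PodName: "hamster-abc"}: {Containers: []string{"hamster"}},
		{Namespace: "default", PodName: "hamster-def"}: {Containers: []string{"hamster", "sidecar"}},
	}}
	fmt.Println(v.ContainersPerPod()) // 2
}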

Member

I believe option 3 is the best approach. We can implement this in a follow-up PR.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 30, 2024
@sreber84

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 20, 2024
@adrianmoisey
Member

/ok-to-test
/assign

I want to see if I can help get this merged

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 3, 2024
@maxcao13

maxcao13 commented Dec 3, 2024

Hi everyone, John has taken a hiatus and has left me with this PR. After catching up, I guess we are still waiting for those conversations to resolve which way we want to go with those design decisions. The two commits I just put up are an improvement on the existing implementation (assuming we go with that; we don't have to), plus some e2e tests to prove this works.

@maxcao13 maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch 2 times, most recently from c3a1f0e to c84075b Compare December 3, 2024 22:29
@adrianmoisey
Member

Generally speaking I'm fine with this change. I haven't given it a thorough review, but it works locally in a way that makes sense.
I see there's a TODO item to add more tests, and documentation still needs updating.

But I kinda want to get more thoughts on this (ping @omerap12 @raywainman @voelzmo) just to check if it all makes sense to someone else too

@omerap12
Member

Generally speaking I'm fine with this change. I haven't given it a thorough review, but it works locally in a way that makes sense. I see there's a TODO item to add more tests, and documentation still needs updating.

But I kinda want to get more thoughts on this (ping @omerap12 @raywainman @voelzmo) just to check if it all makes sense to someone else too

Thanks for pinging, I'll take a look tomorrow.

Member
@omerap12 omerap12 left a comment

Looks good, I have a couple of notes.

// TODO(jkyros): This only removes the container state from the VPA's aggregate states, there
// is still a reference to them in feeder.clusterState.aggregateStateMap, and those get
// garbage collected eventually by the rate limited aggregate garbage collector later.
// Maybe we should clean those up here too since we know which ones are stale?
Member

Cleaning the map is cheap, so I don't mind handling it here. However, in cases like the one mentioned above (where someone deletes a deployment and recreates it immediately), I think it's better to leave this logic to be handled by gc.
@adrianmoisey, any thoughts on this?

if podExists && len(pod.Containers) > 1 {
feeder.clusterState.SetVPAContainersPerPod(podState, true)
} else if !podExists {
panic("This shouldn't happen because AddOrUpdatePod should've placed this pod in the clusterState")
Member

Why do we want to exit here? and if so can we use klog.ErrorS + os.Exit(255)?


Oh, sorry I forgot about that. That was a temporary check for me. I'll remove that in the next set of patches, thanks.

func (cluster *ClusterState) addPodToItsVpa(pod *PodState) {
// SetVPAContainersPerPod sets the number of containers per pod seen for pods connected to this VPA
// so that later when we're splitting the minimum recommendations over containers, we're splitting them over
// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
Member

🚀

// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
// to erroneously shrink, either)
func (cluster *ClusterState) setVPAContainersPerPod(pod *PodState) {
for _, vpa := range cluster.Vpas {
Member

I believe option 3 is the best approach. We can implement this in a follow-up PR.

}
vpaExists = false
}
if !vpaExists {
vpa = NewVpa(vpaID, selector, apiObject.CreationTimestamp.Time)
Member

Can we log that we are creating a new VPA?

containerName := resources.ContainerName
containerAggregateState := vpaContainerNameToAggregateStateMap[containerName]
if containerAggregateState != nil && !containerAggregateState.IsUnderVPA && !containerAggregateState.IsAggregateStale(now) {
klog.V(5).InfoS("Container no longer exists, but is not stale. Keeping container recommendation at previous state.", "vpa", vpa.ID.VpaName, "container", containerName)
Member

I'm not sure I follow. If the container no longer exists and is not under the VPA (!containerAggregateState.IsUnderVPA), and it's not stale (!containerAggregateState.IsAggregateStale(now)), then why are we keeping the recommendation instead of pruning it?


If the container no longer exists, then either:

  • It's considered stale by grace period
  • Not considered stale by grace period

If it is considered stale, then we are not going to overwrite the new recommendation with the previous, and we just ignore this branch of code which should set a new recommendation that we calculated in https://github.com/kubernetes/autoscaler/pull/6745/files#diff-5c44f22eb0f35931ab799838d2584dd10037916c74f114124e6904af48fcd608R96

If it's not considered stale, then potentially the user still wants this container's recommendation to exist for later, so here I explicitly overwrite the container's calculated recommendation with its previous recommendation.

So the logic is sort of backwards, if that makes sense.
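A small sketch of that "backwards" decision, with simplified, assumed types (not the actual recommender code): a container that has disappeared but is still within the grace period keeps its previous recommendation; otherwise the freshly calculated (pruned) value wins.

package main

import (
	"fmt"
	"time"
)

// Simplified, assumed types for illustration.
type AggregateState struct {
	IsUnderVPA bool      // still backed by a running container?
	LastUpdate time.Time // last time usage samples were added
}

func (a *AggregateState) IsStale(now time.Time, grace time.Duration) bool {
	return now.Sub(a.LastUpdate) > grace
}

type Recommendation struct{ TargetCPUMilli int64 }

// keepOrReplace mirrors the "backwards" logic: a container that no longer exists
// but is still inside the grace period keeps its previous recommendation; once it
// is stale (or it still exists under the VPA), the freshly calculated value wins.
func keepOrReplace(agg *AggregateState, previous, fresh Recommendation, now time.Time, grace time.Duration) Recommendation {
	if agg != nil && !agg.IsUnderVPA && !agg.IsStale(now, grace) {
		return previous
	}
	return fresh
}

func main() {
	now := time.Now()
	gone := &AggregateState{IsUnderVPA: false, LastUpdate: now.Add(-time.Hour)}
	previous := Recommendation{TargetCPUMilli: 627}
	fresh := Recommendation{TargetCPUMilli: 0} // container was pruned from the fresh calculation
	fmt.Println(keepOrReplace(gone, previous, fresh, now, 24*time.Hour)) // {627}: keep the old recommendation
}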

Member

Understood. Thanks for clarifying this :)

@maxcao13

maxcao13 commented Jan 13, 2025

I have another comment, and I'm unsure on this one.

The new behaviour results in this:
...minimized

What I had originally expected to happen was that the --pod-recommendation-min-memory-mb would be split between the two containers, until the vpaPruningGracePeriod period expires, then the stale container would be removed, and --pod-recommendation-min-memory-mb be recalculated.

But, thinking it over, I guess this is fine? The second container's recommendations are retained (until it comes back, or when vpaPruningGracePeriod elapses). The primary container gets recommendations that make sense.

I agree, I was unsure what the "correct" behaviour should be.

On one hand, I think we want to keep the stale recommendation in there if the pruning hasn't happened, but here's a scenario that might be a problem.

  1. We have
    containerRecommendations:
    - containerName: hamster
      lowerBound:
        cpu: 445m
        memory: 262144k
      target:
        cpu: 627m
        memory: 262144k
      uncappedTarget:
        cpu: 627m
        memory: 262144k
      upperBound:
        cpu: "1"
        memory: 500Mi
    - containerName: hamster2
      lowerBound:
        cpu: 214m
        memory: 131072k
      target:
        cpu: 627m
        memory: 131072k
      uncappedTarget:
        cpu: 627m
        memory: 131072k
      upperBound:
        cpu: "1"
        memory: 500Mi
  2. Then we restart the pod with the 2nd container added back.
  3. I think the admission controller will set the pod container resources exactly like the recommendation above? Which is incorrect until the recommender loops once and recommends the correct resources.

I'm not exactly sure how we want to handle that. If we remove the recommendation entirely, that means the second container doesn't get a recommended resource initially at all, even if the pruning grace period wasn't met.

@omerap12
Member

I have another comment, and I'm unsure on this one.

The new behaviour results in this:

status:
  conditions:
  - lastTransitionTime: "2025-01-11T12:07:41Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: hamster
      lowerBound:
        cpu: 445m
        memory: 262144k
      target:
        cpu: 627m
        memory: 262144k
      uncappedTarget:
        cpu: 627m
        memory: 262144k
      upperBound:
        cpu: "1"
        memory: 500Mi
    - containerName: hamster2
      lowerBound:
        cpu: 214m
        memory: 131072k
      target:
        cpu: 627m
        memory: 131072k
      uncappedTarget:
        cpu: 627m
        memory: 131072k
      upperBound:
        cpu: "1"
        memory: 500Mi

This happens after the second container is removed, but before the vpaPruningGracePeriod period happens. The remaining container's memory is set to the value of --pod-recommendation-min-memory-mb, and the deleted container still exists.

What I had originally expected to happen was that the --pod-recommendation-min-memory-mb would be split between the two containers, until the vpaPruningGracePeriod period expires, then the stale container would be removed, and --pod-recommendation-min-memory-mb be recalculated.

But, thinking it over, I guess this is fine? The second container's recommendations are retained (until it comes back, or when vpaPruningGracePeriod elapses). The primary container gets recommendations that make sense.

For me it makes sense

@adrianmoisey
Member

/milestone vertical-pod-autoscaler-1.4.0

@adrianmoisey
Member

I assume the comment here can now be removed?

// TODO: Garbage collect old AggregateContainerStates.
aggregateContainerStates aggregateContainerStatesMap

@adrianmoisey
Member

Generally this seems fine. There are a few "TODO" items that I think need doing (such as the tests).
Pinging @voelzmo and @raywainman to get their eyes on it too

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 21, 2025
jkyros and others added 4 commits January 21, 2025 12:15
Previously we were dividing the resources per pod by the number of
container aggregates, but in a situation where we're doing a rollout and
the container names are changing (either a rename, or a removal) we're
splitting resources across the wrong number of containers, resulting in
smaller values than we should actually have.

This collects a count of containers in the model when the pods are
loaded, and uses the "high water mark value", so in the event we are
doing something like adding a container during a rollout, we favor the
pod that has the additional container.

There are probably better ways to do this plumbing, but this was my
initial attempt, and it does fix the issue.
Previously we were only cleaning checkpoints after something happened to
the VPA or the targetRef, and so when a container got renamed the
checkpoint would stick around forever.

Since we're trying to clean up the aggregates immediately now, we need
to force the checkpoint garbage collection to clean up any checkpoints
that don't have matching aggregates.

If the checkpoints did get loaded back in after a restart,
PruneContainers() would take the aggregates back out, but we probably
shouldn't leave the checkpoints out there.

Signed-off-by: Max Cao <macao@redhat.com>
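A minimal sketch of the checkpoint cleanup described above, assuming simplified, hypothetical types rather than the real checkpoint writer (VPA checkpoints are actually VerticalPodAutoscalerCheckpoint objects): a checkpoint is dropped once its VPA no longer has an aggregate for that container name.

package main

import "fmt"

// Hypothetical checkpoint type for illustration only.
type Checkpoint struct {
	VPAName       string
	ContainerName string
}

// pruneOrphanCheckpoints drops checkpoints whose VPA no longer has an aggregate
// for that container name (e.g. the container was renamed or removed).
func pruneOrphanCheckpoints(checkpoints []Checkpoint, aggregatesByVPA map[string]map[string]bool) []Checkpoint {
	kept := checkpoints[:0]
	for _, cp := range checkpoints {
		if aggregatesByVPA[cp.VPAName][cp.ContainerName] {
			kept = append(kept, cp) // a matching aggregate still exists, keep the checkpoint
		}
	}
	return kept
}

func main() {
	cps := []Checkpoint{
		{VPAName: "hamster-vpa", ContainerName: "hamster"},
		{VPAName: "hamster-vpa", ContainerName: "hamster2"}, // container no longer exists
	}
	aggs := map[string]map[string]bool{"hamster-vpa": {"hamster": true}}
	fmt.Println(pruneOrphanCheckpoints(cps, aggs)) // [{hamster-vpa hamster}]
}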
Previously we were letting the rate limited garbage collector clean up
the aggregate states, and that works really well in most cases, but when
the list of containers in a pod changes, either due to the removal or
rename of a container, the aggregates for the old containers stick
around forever and cause problems.

To get around this, this marks all existing aggregates/initial
aggregates in the list for each VPA as "not under a VPA" every time
before we LoadPods(), and then LoadPods() will re-mark the aggregates as
"under a VPA" for all the ones that are still there, which lets us
easily prune the stale container aggregates that are still marked as
"not under a VPA" but are still wrongly in the VPA's list.

This does leave the ultimate garbage collection to the rate limited
garbage collector, which should be fine, we just needed the stale
entries to get removed from the per-VPA lists so they didn't affect VPA
behavior.
Signed-off-by: Max Cao <macao@redhat.com>
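A minimal sketch of the mark-and-sweep described in this commit message, using simplified, hypothetical types (the real ClusterState and aggregate structures differ):

package main

import "fmt"

// Simplified, hypothetical aggregate; the real type carries much more state.
type Aggregate struct {
	IsUnderVPA bool
}

type VPA struct{ Aggregates map[string]*Aggregate }

// markAllNotUnderVPA is the "mark" phase, run before LoadPods().
func (v *VPA) markAllNotUnderVPA() {
	for _, a := range v.Aggregates {
		a.IsUnderVPA = false
	}
}

// observeContainer is called while loading pods for every container that still
// exists; it re-marks the matching aggregate as live.
func (v *VPA) observeContainer(name string) {
	if a, ok := v.Aggregates[name]; ok {
		a.IsUnderVPA = true
	}
}

// sweep prunes aggregates that were never re-marked, i.e. containers that no
// longer exist in any pod matched by this VPA.
func (v *VPA) sweep() {
	for name, a := range v.Aggregates {
		if !a.IsUnderVPA {
			delete(v.Aggregates, name)
		}
	}
}

func main() {
	v := &VPA{Aggregates: map[string]*Aggregate{
		"hamster":  {},
		"hamster2": {}, // this container was removed from the deployment
	}}
	v.markAllNotUnderVPA()
	v.observeContainer("hamster")
	v.sweep()
	fmt.Println(len(v.Aggregates)) // 1: only the live container's aggregate remains
}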
@maxcao13 maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch from 811fa14 to 463d7cf Compare January 22, 2025 00:02
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 22, 2025
…the updater prunes aggregateContainerStates/recommendations due to non-existent containers

Signed-off-by: Max Cao <macao@redhat.com>
@maxcao13 maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch from 463d7cf to e242842 Compare January 22, 2025 22:15
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 22, 2025
…for non-breaking opt-in change

Signed-off-by: Max Cao <macao@redhat.com>
…d of containerPolicy

Signed-off-by: Max Cao <macao@redhat.com>
…to be linked to new containers

We need to set VPAContainersPerPod for a VPA correctly so we can split resources correctly on its first run through the recommendation loop. So I opted to explicitly set it after updating the pod's containers.

This also allows old aggregateContainerStates that were previously created from a removed Pod's container to be reused by a new Pod's container that shares the same VPA and targetRef. This allows recommendations to be updated correctly when aggregates are pruned or created.

Signed-off-by: Max Cao <macao@redhat.com>
Also sets the VPAContainersPerPod correctly when initializing VPA from history provider.

Signed-off-by: Max Cao <macao@redhat.com>
Changes so that the recommender starts to prune stale aggregates only when all aggregates for a containerName no longer belong to an active container. The aggregates of all containers of the same name are considered stale based on the "newest" aggregate state, i.e. the aggregate with the most recent update time.

Signed-off-by: Max Cao <macao@redhat.com>
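A sketch of that staleness rule under assumed, simplified types (not the actual implementation): aggregates sharing a container name are only treated as stale once none of them backs an active container and the newest of them has aged past the grace period.

package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified aggregate; the real type carries much more state.
type Aggregate struct {
	ContainerName string
	IsUnderVPA    bool
	LastUpdate    time.Time
}

// containerIsStale treats all aggregates sharing a container name as one unit:
// they are stale only if none of them is backed by a live container and the most
// recently updated one has aged past the grace period.
func containerIsStale(aggs []Aggregate, name string, now time.Time, grace time.Duration) bool {
	var newest time.Time
	anyActive := false
	for _, a := range aggs {
		if a.ContainerName != name {
			continue
		}
		if a.IsUnderVPA {
			anyActive = true
		}
		if a.LastUpdate.After(newest) {
			newest = a.LastUpdate
		}
	}
	return !anyActive && now.Sub(newest) > grace
}

func main() {
	now := time.Now()
	aggs := []Aggregate{
		{ContainerName: "hamster2", LastUpdate: now.Add(-2 * time.Hour)},
		{ContainerName: "hamster2", LastUpdate: now.Add(-10 * time.Minute)},
	}
	// Not stale yet: the newest aggregate for "hamster2" was updated 10 minutes ago.
	fmt.Println(containerIsStale(aggs, "hamster2", now, time.Hour)) // false
}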
@maxcao13 maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch from e242842 to 9a40688 Compare January 24, 2025 06:54
@maxcao13 maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch from 9a40688 to b8d78c6 Compare January 24, 2025 06:59
@maxcao13

In order, the most recent commits are:

  • 3d4f751 is a commit for e2e tests.
  • 24c4eba is a commit which reworks a bit of how the pruning works: if an aggregate still exists but is considered stale, it will get pruned unless a container with the same name still exists. The grace period of the combined aggregateByContainerName will follow the aggregate with the most recent updateTime. I think this makes more sense, as it follows how the VPA usually works with aggregates grouped by container name.
  • b8d78c6 is a commit with unit tests.
