Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic switch to emergency mode when metrics unavailable #424

Merged
merged 41 commits into from
Jan 17, 2025

Conversation

randytqwjp
Copy link
Collaborator

What this PR does / why we need it:

This PR adds a check to HPA status condition and switches tortoise to emergency mode whenever hpa is unable to get resource metric

Which issue(s) this PR fixes:

#422

Fixes #

Special notes for your reviewer:

return true
}

if conditions[1].Type == "ScalingActive" && conditions[1].Status == "False" && conditions[1].Reason == "FailedGetResourceMetric" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FailedGetResourceMetric

What about other failures? e.g., the container resource metrics

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanposhiho should we only switch to emergency mode when main container resource metrics are missing? e.g. if istio-proxy metrics are unavailable for some reason but the main container metrics are still available, do we stay in auto mode since scaling is still active?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fallback to emergency only if all metrics are dead..? Maybe it's too aggressive to do a fallback if some, but not all metrics are dead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g., Let's say a user deploys a new version to the app container which contains a bug unfortunately, and all main containers are crash due to the bug. In this case, switching to emerge doesn't help, rather just become unnecessary cost increase.

So, ideally we have to detect which issue is solvable with increasing the replicas, and which isn't.
I know it's super difficult, though. But, if two container's metrics are missing (a metric server is dead, the service is overwhelmed and whole dead, etc), we can say increasing the replica might help solve the issue, with more confidence.

Copy link
Collaborator

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must write a test.

@randytqwjp randytqwjp marked this pull request as draft November 28, 2024 07:44
@sanposhiho
Copy link
Collaborator

sanposhiho commented Dec 4, 2024

You meant, at /before? I don't exactly remember whether it's intentional or I just forgot, but I guess it's intentional because the reconciliation should create the monitor VPA at the initialization phase, no?

scalingActive := r.HpaService.CheckHpaMetricStatus(ctx, hpa)
if scalingActive == false && tortoise.Spec.UpdateMode == autoscalingv1beta3.UpdateModeAuto && tortoise.Status.TortoisePhase == autoscalingv1beta3.TortoisePhaseWorking {
//switch to emergency mode only when auto tortoise and already working
tortoise.Spec.UpdateMode = autoscalingv1beta3.UpdateModeEmergency
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanposhiho somehow this line doesn't change the actual spec that k8sclient.get retrieves so in the test while tortoisephase changes to emergency, spec.updatemode is still auto. If this is the case, i think we need to create a new tortoisephase for auto scale up. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somehow this line doesn't change the actual spec

because UpdateTortoiseStatus below only updates .status (literally), not .spec.
And, thinking about it, probably we shouldn't directly change the spec. It's breaking the boundary between the user's intent and the controller.

So,

If this is the case, i think we need to create a new tortoisephase for auto scale up.

yeah. I think, instead of changing the updatemode on spec, we should change something at .status. TortoisePhase makes sense to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just use the existing TortoisePhaseEmergency, not a new one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use existing emergency phase, when tortoise is in auto mode it will reconcile and switch it back to working right? we can only use emergency phase if we disallow emergency to be input by user. If we want users to be able to manually switch to emergency, we need a new phase

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually i am not sure what other scenarios will require emergency mode. If missing metrics is the only scenario then it would make sense to disallow manual switch. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use existing emergency phase, when tortoise is in auto mode it will reconcile and switch it back to working right?

No, it shouldn't. That depends on how you implement. The controller should keep putting TortoisePhaseEmergency as long as HPA has a missing metric, even if the mode is auto.

Actually i am not sure what other scenarios will require emergency mode. If missing metrics is the only scenario then it would make sense to disallow manual switch. What do you think?

There could be many scenarios right now, e.g.,

  • Mercari is going to have a Black Friday sale, and we want to pre-autoscale the number of replicas. (could be solved by init commit for scheduled scaling #370)
  • A big upstream microservice is down, and the downstream microservice is getting traffic much less than usual now. The dynamic minReplicas feature keeps the replica number to some extent though, we want to make it bigger enough to prepare the big upstream microservice to come back fully.
  • We're getting requests more than usual, unexpectedly (DDoS, etc), and the platform needs to scale up all the services.

I agree that we should enhance tortoise so that users don't have to use emergency mode at all eventually, but for now, we should have it, as a last resort.

@@ -256,7 +256,8 @@ const (
// TortoisePhaseBackToNormal means tortoise was in the emergency mode, and now it's coming back to the normal operation.
// During TortoisePhaseBackToNormal, the number of replicas of workloads are gradually reduced to the usual value.
// - Emergency → BackToNormal
TortoisePhaseBackToNormal TortoisePhase = "BackToNormal"
TortoisePhaseBackToNormal TortoisePhase = "BackToNormal"
TortoisePhaseAutoEmergency TortoisePhase = "AutoEmergency"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I do not agree adding a new phase. It'll increase our maintenance effort.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at it again later i didnt see your comment earlier

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanposhiho Changed to use only emergency mode. Could you review the new test cases as well? I am also using currentMetric.ContainerResource.Current.Value.IsZero() which is the default value for int32 though I am not sure what is supposed to be returned when metrics are down. I assume its the default value for the type but I'm not too familiar with golang

Copy link
Collaborator

@sanposhiho sanposhiho Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what is supposed to be returned when metrics are down.

I guess it's correct to assume HPA to return zero when something goes wrong happens with the metric. Or maybe they just keep the previous values (= they don't clean up the list with zero values when the HPA reconciliation fails).
You have to check HPA controller implementation to make 100% sure.

pkg/hpa/service_test.go Show resolved Hide resolved
@@ -765,3 +766,54 @@ func (c *Service) excludeExternalMetric(ctx context.Context, hpa *v2.HorizontalP

return newHPA
}

func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {
func (c *Service) IsHpaMetricAvailable(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can we do like this instead? It would eliminate the logic from the controller layer.

func (c *Service) UpdateTortoisePhaseIfHPAIsUnhealthy(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler, tortoise *v1beta1.Tortoise) error {
...

			if condition.Type == "ScalingActive" && condition.Status == "False" && condition.Reason == "FailedGetResourceMetric" {
				//switch to Emergency mode since no metrics
				logger.Info("HPA failed to get resource metrics, switch to emergency mode")
				tortoise.Status.TortoisePhase = v1beta3.TortoisePhaseEmergency
				return nil
			}

}

Then, probably it'd make more sense to move inside the tortoise service, instead of hpa service.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wont this add extra responsibilities to tortoise service if the service has to manage hpa conditions as well? Would it be better to add a UpdateTortoisePhaseIfHPAIsUnhealthy function in tortoise service but still perform the check in hpa service? so we only pass in the scalingActive bool into tortoise service instead of the entire hpa. Im not sure which will be cleaner

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wont this add extra responsibilities to tortoise service if the service has to manage hpa conditions as well

Hmm, that's understandable. So,

Would it be better to add a UpdateTortoisePhaseIfHPAIsUnhealthy function in tortoise service but still perform the check in hpa service? so we only pass in the scalingActive bool into tortoise service instead of the entire hpa.

this direction looks good.

Comment on lines 773 to 791
if currenthpa == nil {
logger.Info("empty HPA passed into status check, ignore")
return true
}

if reflect.DeepEqual(currenthpa.Status, v2.HorizontalPodAutoscalerStatus{}) {
logger.Info("HPA empty status, switch to emergency mode")
return false
}

if currenthpa.Status.Conditions == nil {
logger.Info("HPA empty conditions, switch to emergency mode")
return false
}

if currenthpa.Status.CurrentMetrics == nil {
logger.Info("HPA no metrics, switch to emergency mode")
return false
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add error in the return value and return err in those cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, when err, we should abort the reconciliation because HPA's object looks like not valid at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause an issue with the controller test that initializes hpa because in the test environment, creating a new hpa without manually updating the status will give an empty status. But we cannot force a status in the controller itself because that might affect the actual hpa in an actual environment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, looks like it's expected that if HPA is just created, Status etc could be empty (since HPA controller might not yet notice a new HPA)
Can we early-return (or not run this function) when tortoise's phase is Initializing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i handled it to only run when tortoise is working. But there is a test case where tortoise is working but no hpa and tortoise reconcile should create a hpa. This will always fail and im not sure of a workaround for this specific test case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but because of that it will cause error since it generates hpa with empty status and tortoisephase is still working. A rather hackish way around it is if a new hpa is created from that function it returns a bool so we skip hpa status check on that reconcile loop and check it on future reconcile loops

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, that's complicated.
Okay, I now agree that we don't return error from this function even when HPA has empty condition etc, and we just ignore, doing nothing at those cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding the missing metrics concern as well, would skipping the check if theres a change in policy be a good workaround? as long as the metrics are populated by the next reconcile loop it should work?

Copy link
Collaborator

@sanposhiho sanposhiho Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's two cases coming from UpdateHPASpecFromTortoiseAutoscalingPolicy, IIUC:

  1. A new HPA is created: In this case, empty status and hence we skip. So, no issue.
  2. A new metric is added: In this case, a status doesn't have anything about new metric added at HPA spec, but this function doesn't check missing metrics by comparing spec and status, so no issue.

So, I don't see any issues. WDYT

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont see issues as well just that it may not be the cleanest way to code this? But I can't think of anything better at the moment

@randytqwjp
Copy link
Collaborator Author

@sanposhiho Ready for review

@randytqwjp randytqwjp marked this pull request as ready for review December 17, 2024 13:50
@@ -497,32 +498,33 @@ func (c *Service) UpdateHPASpecFromTortoiseAutoscalingPolicy(
givenHPA *v2.HorizontalPodAutoscaler,
replicaNum int32,
now time.Time,
) (*autoscalingv1beta3.Tortoise, error) {
) (*autoscalingv1beta3.Tortoise, bool, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was suggesting at the end of the thread is a bit different.

Okay, I now agree that we don't return error from this function even when HPA has empty condition etc, and we just ignore, doing nothing at those cases.
#424 (comment)

So, we don't need hpaCreated, and we can just ignore (= return true) when HPA has empty condition at CheckHpaMetricStatus. (I meant CheckHpaMetricStatus doesn't have to have error in a return value anymore)

@@ -765,3 +768,51 @@ func (c *Service) excludeExternalMetric(ctx context.Context, hpa *v2.HorizontalP

return newHPA
}

func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) (bool, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I don't think we have to return error anymore.

Suggested change
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) (bool, error) {
// IsHPAScalingActive returns whether HPA is correctly working or not.
func (c *Service) IsHPAScalingActive(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {

Comment on lines 775 to 790
if currenthpa == nil {
logger.Info("empty HPA passed into status check, ignore")
return true, nil
}

if reflect.DeepEqual(currenthpa.Status, v2.HorizontalPodAutoscalerStatus{}) {
return false, fmt.Errorf("HPA empty status, switch to emergency mode")
}

if currenthpa.Status.Conditions == nil {
return false, fmt.Errorf("HPA empty conditions, switch to emergency mode")
}

if currenthpa.Status.CurrentMetrics == nil {
return false, fmt.Errorf("HPA no metrics, switch to emergency mode")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, let's return true for all those cases based on the assumption that HPA is just created and the HPA controller doesn't handle it yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this might be a general issue that maybe HPA hasn't be reconciled yet when UpdateHPASpecFromTortoiseAutoscalingPolicy just created the one.
In such cases, Tortoise controller cannot (or shouldn't) calculate the recommendation.

Can we varify those empty stuff at GetHPAOnTortoise, and if empty, we abort the reconciliation? And, at the next reconciliation, HPA status is supposed to be filled by HPA controller, and we can reconcile.

func (c *Service) GetHPAOnTortoise(ctx context.Context, tortoise *autoscalingv1beta3.Tortoise) (*v2.HorizontalPodAutoscaler, bool, error) {
...
	hpa := &v2.HorizontalPodAutoscaler{}
	if err := c.c.Get(ctx, types.NamespacedName{Namespace: tortoise.Namespace, Name: tortoise.Status.Targets.HorizontalPodAutoscaler}, hpa); err != nil {
		return nil, false, fmt.Errorf("failed to get hpa on tortoise: %w", err)
	}

+	if hpa.Status == nil || hpa.Status.Conditions == nil || .... {
+		// Most likely, HPA is just created and not yet handled by HPA controller.
+		return nil, false, nil
+	}

...
}
	hpa, isReady, err = r.HpaService.GetHPAOnTortoise(ctx, tortoise)
	if err != nil {
		logger.Error(err, "failed to get HPA", "tortoise", req.NamespacedName)
		return ctrl.Result{}, err
	}
+	
+	if !isReady {
+		// HPA is correctly fetched, but looks like not ready yet. We won't be able to calculate things correctly, and hence stop the reconciliation here.
+		return ctrl.Result{}, nil
+	}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick question @sanposhiho How should we handle the test case where HPA is created? In the current test case, it is updated to the correct recommendation after one reconcile loop in tortoise conditions. Now it should just be the same tortoise since reconcile was aborted and tests only cover one reconcile loop? Would this be lacking from a testing perspective

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we change the expectation of the reconciliation loop that creates HPA to stop the reconciliation at that point (if HPA hasn't yet updated immediately by HPA controller) and expect the next reconciliation would handle them appropriately.

So, we can just change the test case; change the test case's tortoise expectation not to get any update around the recommendation.

@randytqwjp
Copy link
Collaborator Author

@sanposhiho fixed the above comments. Could you look through the create hpa test case too? in case i missed something

internal/controller/tortoise_controller.go Outdated Show resolved Hide resolved
internal/controller/tortoise_controller.go Outdated Show resolved Hide resolved
pkg/hpa/service.go Outdated Show resolved Hide resolved
pkg/hpa/service.go Outdated Show resolved Hide resolved
pkg/hpa/service.go Outdated Show resolved Hide resolved
pkg/hpa/service.go Outdated Show resolved Hide resolved
pkg/hpa/service.go Outdated Show resolved Hide resolved
conditions := currenthpa.Status.Conditions
currentMetrics := currenthpa.Status.CurrentMetrics

if len(conditions) > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to L777 above

Suggested change
if len(conditions) > 0 {

}
}

if len(currentMetrics) > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to L777 above

Suggested change
if len(currentMetrics) > 0 {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You overlooked this. L776 already checked this.


func (c *Service) UpdateTortoisePhaseIfHPAIsUnhealthy(ctx context.Context, scalingActive bool, tortoise *v1beta3.Tortoise) (*v1beta3.Tortoise, error) {
if !scalingActive && tortoise.Spec.UpdateMode == v1beta3.UpdateModeAuto && tortoise.Status.TortoisePhase == v1beta3.TortoisePhaseWorking {
//switch to Emergency mode since no metrics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//switch to Emergency mode since no metrics
log.FromContext(ctx).Info("switching Tortoise to Emergency mode because looks like HPA isn't working properly")
s.recorder.Event(tortoise, corev1.EventTypeNormal, event.EmergencyModeEnabled, "Tortoise is in Emergency mode because it detected HPA isn't working appropriately. It will increase the number of replicas")

@@ -232,13 +233,28 @@ func (r *TortoiseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_
// VPA is ready, we mark all Vertical scaling resources as Running.
tortoise = vpa.SetAllVerticalContainerResourcePhaseWorking(tortoise, now)

isReady := false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need to define this here?

}
}

if len(currentMetrics) > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You overlooked this. L776 already checked this.

})
It("HPA scalingactive no metrics", func() {
runTest(filepath.Join("testdata", "reconcile-automatic-emergency-mode-hpa-no-metrics"))
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add one more test case which tests that tortoise gets back to tortoisePhase: Working when HPA is back to normal?

@randytqwjp
Copy link
Collaborator Author

@sanposhiho made the changes PTAL

Copy link
Collaborator

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit

internal/controller/tortoise_controller_test.go Outdated Show resolved Hide resolved
randytqwjp and others added 2 commits January 17, 2025 16:05
Co-authored-by: Kensei Nakada <handbomusic@gmail.com>
Copy link
Collaborator

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty. Thanks!

@sanposhiho sanposhiho merged commit 860456d into main Jan 17, 2025
3 checks passed
@sanposhiho sanposhiho deleted the kubebuilder-inplace branch January 17, 2025 07:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants