
feat(helm): add keda autoscaling and fix dashboards #4687

Closed
wants to merge 10 commits into main from helm-keda-autoscaling

Conversation

davidspek
Contributor

@davidspek davidspek commented Apr 7, 2023

What this PR does

This PR adds the Keda autoscaling from the libsonnet deployment to the Helm chart and updates the autoscaling sections in the dashboards so they work with Helm deployments.
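For readers unfamiliar with KEDA, the chart renders ScaledObject resources for the autoscaled components. The sketch below shows the general shape of such an object; all names, the query, and the threshold are illustrative placeholders, not the exact output of this PR:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mimir-distributor        # hypothetical release/component name
spec:
  scaleTargetRef:
    name: mimir-distributor      # the Deployment KEDA scales instead of a fixed replicas field
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 10            # seconds between metric evaluations
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://mimir-nginx.mimir.svc/prometheus   # hypothetical address
        query: sum(rate(container_cpu_usage_seconds_total{container="distributor"}[5m]))  # illustrative
        threshold: "4"
```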

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@davidspek davidspek requested a review from a team as a code owner April 7, 2023 17:12
@56quarters 56quarters added enhancement New feature or request helm labels Apr 10, 2023
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment


I'm not quite sure about relying on metrics from Mimir to autoscale Mimir. If the cluster is unavailable, this may exacerbate the problem.

What do you think about having the metamonitoring remote URL as the serverAddress for autoscaling?
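Concretely, the idea would be for the rendered trigger to query the metamonitoring endpoint rather than the cluster being scaled. A hypothetical trigger (the URL and query are placeholders for illustration):

```yaml
triggers:
  - type: prometheus
    metadata:
      # hypothetical: point at the metamonitoring remote URL so that scaling
      # decisions don't depend on the health of the Mimir cluster being scaled
      serverAddress: https://prometheus.example.com/prometheus
      query: sum(rate(cortex_request_duration_seconds_count{container="distributor"}[1m]))  # illustrative
      threshold: "100"
```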

@davidspek
Contributor Author

@dimitarvdimitrov Seems reasonable to me. I assume the metamonitoring can also write to the cluster itself by default. I'll have a look at how to do that in the templating.

@davidspek davidspek force-pushed the helm-keda-autoscaling branch from 701a410 to 3474702 Compare April 11, 2023 13:23
@lamida
Contributor

lamida commented Apr 17, 2023

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase on main and, if needed, move the CHANGELOG entry added/updated in this PR to the top of the CHANGELOG document. Thanks!

@davidspek
Contributor Author

@lamida Sorry for the slow turnaround here. I've just rebased on main and updated the changelog.

@davidspek davidspek force-pushed the helm-keda-autoscaling branch from 3e0cb94 to 426e27f Compare June 7, 2023 09:46
@davidspek
Contributor Author

@lamida @dimitarvdimitrov I've just resolved the merge conflicts so hopefully this PR is ready for merging.

@davidspek davidspek force-pushed the helm-keda-autoscaling branch from c3afdc1 to 4aefc10 Compare June 20, 2023 18:31
@davidspek
Contributor Author

Ping @lamida @dimitarvdimitrov

davidspek added 8 commits July 7, 2023 14:58
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
@davidspek davidspek force-pushed the helm-keda-autoscaling branch from 4aefc10 to 93b3bc2 Compare July 7, 2023 12:59
davidspek added 2 commits July 7, 2023 15:00
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
@dimitarvdimitrov
Contributor

Thanks for the contribution @davidspek! And apologies for the delay, the first notification had slipped through my inbox. I will find some time this week to start reviewing the PR

@davidspek
Contributor Author

No problem, thanks for having a look.

@dimitarvdimitrov
Contributor

Apologies for not managing to get to this this week. I will start reviewing it first thing next week

Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment


Amazing work! I've left some comments after an initial review, but the core work is in a very good state. Thank you for the effort.

To call this effort complete we'd need to add some documentation. IIRC at Grafana Labs we had some issues when removing the "replicas" field from objects, which caused k8s to scale them down to 1 replica until autoscaling kicked in.

In addition to that, it would be nice if we can make use of the existing helm-jsonnet comparison to make sure that autoscaling between the two doesn't drift apart.

I wouldn't want to put these on you or even include them in this PR because the work you've done is impressive already. Unless you are keen to work on them, I think we can get some resources at GL to work on these two.

Although it would be nice to have this tested out to some extent. Have you had a chance to test the Helm autoscaling in a Mimir cluster?

Comment on lines +21 to +23
ruler_query_frontend+: {
  enabled: true,
},

The ruler QF isn't added in this PR, right? Should we keep it disabled?

@@ -0,0 +1,44 @@
{{- if .Values.query_frontend.kedaAutoscaling.enabled }}

There is a convention to name the files as <name>-<resource_kind>.yaml. Can you rename these, e.g. query-frontend-scaledObject.yaml or query-frontend-so.yaml?

Comment on lines +11 to +13
{{- if not .Values.distributor.kedaAutoscaling.enabled }}
replicas: {{ .Values.distributor.replicas }}
{{- end }}

Should the same be present on the ruler deployment?

@@ -0,0 +1,12 @@
{{/*
Convert labels to string like: key1="value1", key2="value2", ...

Is it key1="value1" or key1=value1? Also I think there isn't any space between the keys. Can you add an example input and output in the comment?
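For illustration, a documented version of such a helper, including the example input and output this comment asks for, could look like the following. This is a sketch with a hypothetical name, not the PR's actual implementation:

```yaml
{{/*
Convert a labels map to a string like: key1="value1", key2="value2"
Example input:  (dict "cluster" "dev" "namespace" "mimir")
Example output: cluster="dev", namespace="mimir"
*/}}
{{- define "example.labelsToString" -}}
{{- $pairs := list -}}
{{- range $k, $v := . -}}
{{- $pairs = append $pairs (printf "%s=%q" $k $v) -}}
{{- end -}}
{{- join ", " $pairs -}}
{{- end -}}
```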

ctx = . context
component = name of the component
*/}}
{{- define "mimir.lib.memorySiToBytes" -}}

I think we already have mimir.siToBytes which does the same
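For context, such a conversion can be sketched in Helm templating as below. This is illustrative only; the chart's existing mimir.siToBytes helper should be reused rather than duplicated, and its real implementation may differ:

```yaml
{{/* Hypothetical sketch: convert an SI memory string like "4Gi" or "512Mi" to bytes */}}
{{- define "example.siToBytes" -}}
{{- if hasSuffix "Gi" . -}}
{{- trimSuffix "Gi" . | float64 | mulf 1073741824 | int64 -}}
{{- else if hasSuffix "Mi" . -}}
{{- trimSuffix "Mi" . | float64 | mulf 1048576 | int64 -}}
{{- else if hasSuffix "Ki" . -}}
{{- trimSuffix "Ki" . | float64 | mulf 1024 | int64 -}}
{{- else -}}
{{- . -}}
{{- end -}}
{{- end -}}
```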

@@ -30,6 +30,7 @@ Entries should include a reference to the Pull Request that introduced the chang

* [CHANGE] Changed max unavailable ingesters and store-gateways in a zone to 50. #5327
* [CHANGE] Don't render PodSecurityPolicy on Kubernetes >=1.24. (was >= 1.25). This helps with upgrades between 1.24 and 1.25. To use a PSP in 1.24, toggle `rbac.forcePSPOnKubernetes124: true`. #5357
[FEATURE] Allow for deploying keda autoscaling objects as part of the helm chart. #4687

can you prepend this with a bullet *?


Suggested change
[FEATURE] Allow for deploying keda autoscaling objects as part of the helm chart. #4687
* [FEATURE] Added experimental feature for deploying keda autoscaling objects as part of the helm chart for the components: distributor, querier, query-frontend and ruler. Requires metamonitoring, for more details on metamonitoring see the Helm chart documentation. #4687


We should also mention the limitation that currently only the URL-embedded basic auth method is supported, and that we might add further KEDA auth methods later, see https://keda.sh/docs/2.11/scalers/prometheus/
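In practice that limitation means credentials can only be passed embedded in the serverAddress, e.g. (host and credentials below are placeholders):

```yaml
triggers:
  - type: prometheus
    metadata:
      # basic auth embedded in the URL; TriggerAuthentication-based methods
      # (bearer tokens, TLS certificates, etc.) would be a follow-up
      serverAddress: https://user:password@prometheus.example.com/prometheus
```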

enabled: false
minReplicaCount: 1
maxReplicaCount: 10
pollingInterval: 10

Can the polling interval default to the scrape interval for metamonitoring?

behavior:
  scaleDown:
    policies:
      - periodSeconds: 60

The value in jsonnet is 600 here I think. The reason is that the ruler is a bit more sensitive to scaling up and down. Restarting a ruler can result in missed rule evaluations.

minReplicaCount: 2
maxReplicaCount: 10
pollingInterval: 10
querySchedulerInflightRequestsThreshold: 6

The jsonnet calculates this as 0.75*8, where 8 is the default value for max_concurrent requests. In Helm the default is 16, so let's use 12 (0.75*16) here.

Comment on lines +1116 to +1131
kedaAutoscaling:
  enabled: false
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  customHeaders: {}
    # X-Scope-OrgID: ""
  behavior:
    scaleDown:
      policies:
        - periodSeconds: 60
          type: Percent
          value: 10


Can you add some comments on the kedaAutoscaling section and its fields? Enough to be able to determine whether you need KEDA autoscaling and what each of these fields does.

(same comment for the rest of the sections)
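As an example of the kind of field-level documentation being requested, the section could be annotated roughly like this (comment wording is illustrative):

```yaml
# -- Experimental KEDA-based autoscaling. Requires KEDA installed in the cluster
# and a metrics source (e.g. metamonitoring) for the scaling queries.
kedaAutoscaling:
  enabled: false
  # -- Lower bound on replicas; KEDA never scales below this
  minReplicaCount: 1
  # -- Upper bound on replicas; KEDA never scales above this
  maxReplicaCount: 10
  # -- How often, in seconds, KEDA evaluates the scaling queries
  pollingInterval: 10
  # -- Target CPU usage as a percentage of the container CPU request
  targetCPUUtilizationPercentage: 80
  # -- Target memory usage as a percentage of the container memory request
  targetMemoryUtilizationPercentage: 80
  # -- Extra HTTP headers for scaling queries, e.g. X-Scope-OrgID in multi-tenant setups
  customHeaders: {}
```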

{{- include "mimir.remoteReadUrl.inCluster" $.ctx }}
{{- else -}}
{{- $parsed := urlParse (.remote).url -}}
{{ $parsed.scheme }}://{{ $parsed.host }}{{ include "mimir.prometheusHttpPrefix" $.ctx }}

For external URLs we need a sensible default like /prometheus and allow an override somewhere, otherwise we're tying the internal setting and an unrelated external setting together. For an MVP, I'd just hardcode to /prometheus while we figure out where to put the override.
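A minimal version of that suggestion, hardcoding /prometheus for external addresses while keeping the in-cluster branch untouched (sketch only, assuming the surrounding template context from the diff above):

```yaml
{{- if (.remote).url -}}
{{- $parsed := urlParse (.remote).url -}}
{{/* hardcoded for the MVP; an explicit override can be introduced later */}}
{{ $parsed.scheme }}://{{ $parsed.host }}/prometheus
{{- else -}}
{{- include "mimir.remoteReadUrl.inCluster" $.ctx }}
{{- end -}}
```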


Nice spot. A Prometheus instance doesn't have the /prometheus prefix either; /prometheus is only a Mimir prefix.

@dimitarvdimitrov dimitarvdimitrov linked an issue Jul 19, 2023 that may be closed by this pull request
@davidspek
Contributor Author

Sorry for the late response. I was on vacation the past few weeks. I’ll go through your comments hopefully by the end of this week.

@hobbsh

hobbsh commented Nov 20, 2023

@davidspek @dimitarvdimitrov I'm happy to push this through if possible as it's an important feature.

@dimitarvdimitrov
Contributor

Thanks for offering to help @hobbsh!

My review above still holds, I believe. It would be nice to test this out in a Mimir cluster before merging the PR; I would appreciate it if someone can help out with this. There are some review comments that should be addressed.

And finally write some docs. Docs aren't critical to merge this, but I think they should follow before we announce this as a stable feature of the helm chart. We can start working on them once the PR is in its final stages.

I also know that @pracucci has been working on some improvements to the querier HPA queries that have already been rolled out in cells at Grafana. It makes sense to wait a bit for those to be upstreamed so we can use the queries and KEDA config in Helm as well.

@dimitarvdimitrov
Contributor

The autoscaling improvements have been merged in #6971. It would be a good idea to include them in this PR once it's back on track

@dimitarvdimitrov
Contributor

Thank you for your work on this @davidspek. @beatkind took ownership of the autoscaling changes and we just merged #7282 as a first iteration which was based on the changes in your PR.


Successfully merging this pull request may close these issues.

Add support for HPA in mimir-distributed
6 participants