
feat(helm): add keda autoscaling and fix dashboards #4687

Closed
wants to merge 10 commits into main from helm-keda-autoscaling

Conversation

davidspek
Contributor

@davidspek davidspek commented Apr 7, 2023

What this PR does

This PR adds the Keda autoscaling from the libsonnet deployment to the Helm chart and updates the autoscaling sections in the dashboards so they work with Helm deployments.
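For readers unfamiliar with KEDA, the chart renders ScaledObject resources for the autoscaled components. The sketch below shows the general shape of such an object; all names, the query, and the threshold are illustrative placeholders, not the exact output of this PR:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mimir-distributor        # hypothetical release/component name
spec:
  scaleTargetRef:
    name: mimir-distributor      # the Deployment KEDA scales instead of a fixed replicas field
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 10            # seconds between metric evaluations
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://mimir-nginx.mimir.svc/prometheus   # hypothetical address
        query: sum(rate(container_cpu_usage_seconds_total{container="distributor"}[5m]))  # illustrative
        threshold: "4"
```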

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@davidspek davidspek requested a review from a team as a code owner April 7, 2023 17:12
@56quarters 56quarters added enhancement New feature or request helm labels Apr 10, 2023
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment


I'm not quite sure about relying on metrics from Mimir to autoscale Mimir. If the cluster is unavailable, this may exacerbate the problem.

What do you think about having the metamonitoring remote URL as the serverAddress for autoscaling?
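Concretely, the idea would be for the rendered trigger to query the metamonitoring endpoint rather than the cluster being scaled. A hypothetical trigger (the URL and query are placeholders for illustration):

```yaml
triggers:
  - type: prometheus
    metadata:
      # hypothetical: point at the metamonitoring remote URL so that scaling
      # decisions don't depend on the health of the Mimir cluster being scaled
      serverAddress: https://prometheus.example.com/prometheus
      query: sum(rate(cortex_request_duration_seconds_count{container="distributor"}[1m]))  # illustrative
      threshold: "100"
```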

@davidspek
Contributor Author

@dimitarvdimitrov Seems reasonable to me. I assume the metamonitoring can also write to the cluster itself by default. I'll have a look at how to do that in the templating.

@davidspek davidspek force-pushed the helm-keda-autoscaling branch from 701a410 to 3474702 Compare April 11, 2023 13:23
@lamida
Contributor

lamida commented Apr 17, 2023

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase on main and, if needed, move the CHANGELOG entry added/updated in this PR to the top of the CHANGELOG document. Thanks!

@davidspek
Contributor Author

@lamida Sorry for the slow turnaround here. I've just rebased on main and updated the changelog.

@davidspek davidspek force-pushed the helm-keda-autoscaling branch from 3e0cb94 to 426e27f Compare June 7, 2023 09:46
@davidspek
Contributor Author

@lamida @dimitarvdimitrov I've just resolved the merge conflicts so hopefully this PR is ready for merging.

@davidspek davidspek force-pushed the helm-keda-autoscaling branch from c3afdc1 to 4aefc10 Compare June 20, 2023 18:31
@davidspek
Contributor Author

Ping @lamida @dimitarvdimitrov

davidspek added 8 commits July 7, 2023 14:58
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
@davidspek davidspek force-pushed the helm-keda-autoscaling branch from 4aefc10 to 93b3bc2 Compare July 7, 2023 12:59
davidspek added 2 commits July 7, 2023 15:00
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
@dimitarvdimitrov
Contributor

Thanks for the contribution @davidspek! And apologies for the delay, the first notification had slipped through my inbox. I will find some time this week to start reviewing the PR

@davidspek
Contributor Author

No problem, thanks for having a look.

@dimitarvdimitrov
Contributor

Apologies for not managing to get to this this week. I will start reviewing it first thing next week

Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment


Amazing work! I've left some comments after an initial review, but the core work is in a very good state. Thank you for the effort.

To call this effort complete we'd need to add some documentation. IIRC at Grafana Labs we had some issues when removing the "replicas" field from objects, which caused k8s to scale them down to 1 replica until autoscaling kicked in.

In addition to that, it would be nice if we can make use of the existing helm-jsonnet comparison to make sure that autoscaling between the two doesn't drift apart.

I wouldn't want to put these on you or even include them in this PR because the work you've done is impressive already. Unless you are keen to work on them, I think we can get some resources at GL to work on these two.

Although it would be nice to have this tested out to some extent. Have you had a chance to test the Helm autoscaling in a Mimir cluster?

Comment on lines +21 to +23
ruler_query_frontend+: {
  enabled: true,
},

The ruler QF isn't added in this PR, right? Should we keep it disabled?

@@ -0,0 +1,44 @@
{{- if .Values.query_frontend.kedaAutoscaling.enabled }}

There is a convention to name the files as <name>-<resource_kind>.yaml. Can you rename these, e.g. query-frontend-scaledObject.yaml or query-frontend-so.yaml?

Comment on lines +11 to +13
{{- if not .Values.distributor.kedaAutoscaling.enabled }}
replicas: {{ .Values.distributor.replicas }}
{{- end }}

Should the same be present on the ruler deployment?

@@ -0,0 +1,12 @@
{{/*
Convert labels to string like: key1="value1", key2="value2", ...

Is it key1="value1" or key1=value1? Also I think there isn't any space between the keys. Can you add an example input and output in the comment?
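For illustration, a documented version of such a helper, including the example input and output this comment asks for, could look like the following. This is a sketch with a hypothetical name, not the PR's actual implementation:

```yaml
{{/*
Convert a labels map to a string like: key1="value1", key2="value2"
Example input:  (dict "cluster" "dev" "namespace" "mimir")
Example output: cluster="dev", namespace="mimir"
*/}}
{{- define "example.labelsToString" -}}
{{- $pairs := list -}}
{{- range $k, $v := . -}}
{{- $pairs = append $pairs (printf "%s=%q" $k $v) -}}
{{- end -}}
{{- join ", " $pairs -}}
{{- end -}}
```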

ctx = . context
component = name of the component
*/}}
{{- define "mimir.lib.memorySiToBytes" -}}

I think we already have mimir.siToBytes which does the same
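For context, such a conversion can be sketched in Helm templating as below. This is illustrative only; the chart's existing mimir.siToBytes helper should be reused rather than duplicated, and its real implementation may differ:

```yaml
{{/* Hypothetical sketch: convert an SI memory string like "4Gi" or "512Mi" to bytes */}}
{{- define "example.siToBytes" -}}
{{- if hasSuffix "Gi" . -}}
{{- trimSuffix "Gi" . | float64 | mulf 1073741824 | int64 -}}
{{- else if hasSuffix "Mi" . -}}
{{- trimSuffix "Mi" . | float64 | mulf 1048576 | int64 -}}
{{- else if hasSuffix "Ki" . -}}
{{- trimSuffix "Ki" . | float64 | mulf 1024 | int64 -}}
{{- else -}}
{{- . -}}
{{- end -}}
{{- end -}}
```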

@@ -30,6 +30,7 @@ Entries should include a reference to the Pull Request that introduced the chang

* [CHANGE] Changed max unavailable ingesters and store-gateways in a zone to 50. #5327
* [CHANGE] Don't render PodSecurityPolicy on Kubernetes >=1.24. (was >= 1.25). This helps with upgrades between 1.24 and 1.25. To use a PSP in 1.24, toggle `rbac.forcePSPOnKubernetes124: true`. #5357
[FEATURE] Allow for deploying keda autoscaling objects as part of the helm chart. #4687

can you prepend this with a bullet *?


Suggested change
[FEATURE] Allow for deploying keda autoscaling objects as part of the helm chart. #4687
* [FEATURE] Added experimental feature for deploying keda autoscaling objects as part of the helm chart for the components: distributor, querier, query-frontend and ruler. Requires metamonitoring, for more details on metamonitoring see the Helm chart documentation. #4687


We should also mention the limitation that currently only the URL-embedded basic auth method is supported, and that we might add further KEDA auth methods later, see https://keda.sh/docs/2.11/scalers/prometheus/
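In practice that limitation means credentials can only be passed embedded in the serverAddress, e.g. (host and credentials below are placeholders):

```yaml
triggers:
  - type: prometheus
    metadata:
      # basic auth embedded in the URL; TriggerAuthentication-based methods
      # (bearer tokens, TLS certificates, etc.) would be a follow-up
      serverAddress: https://user:password@prometheus.example.com/prometheus
```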

enabled: false
minReplicaCount: 1
maxReplicaCount: 10
pollingInterval: 10

Can the polling interval default to the scrape interval for metamonitoring?

behavior:
  scaleDown:
    policies:
      - periodSeconds: 60

The value in jsonnet is 600 here I think. The reason is that the ruler is a bit more sensitive to scaling up and down. Restarting a ruler can result in missed rule evaluations.

minReplicaCount: 2
maxReplicaCount: 10
pollingInterval: 10
querySchedulerInflightRequestsThreshold: 6

The jsonnet calculates this as 0.75*8, where 8 is the default value for max_concurrent requests. In Helm the default is 16, so let's use 12 (0.75*16) here.

Comment on lines +1116 to +1131
kedaAutoscaling:
  enabled: false
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  customHeaders: {}
    # X-Scope-OrgID: ""
  behavior:
    scaleDown:
      policies:
        - periodSeconds: 60
          type: Percent
          value: 10


Can you add some comments on the kedaAutoscaling section and its fields? Enough to be able to determine whether you need KEDA autoscaling and what each of these fields does.

(same comment for the rest of the sections)
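As an example of the kind of field-level documentation being requested, the section could be annotated roughly like this (comment wording is illustrative):

```yaml
# -- Experimental KEDA-based autoscaling. Requires KEDA installed in the cluster
# and a metrics source (e.g. metamonitoring) for the scaling queries.
kedaAutoscaling:
  enabled: false
  # -- Lower bound on replicas; KEDA never scales below this
  minReplicaCount: 1
  # -- Upper bound on replicas; KEDA never scales above this
  maxReplicaCount: 10
  # -- How often, in seconds, KEDA evaluates the scaling queries
  pollingInterval: 10
  # -- Target CPU usage as a percentage of the container CPU request
  targetCPUUtilizationPercentage: 80
  # -- Target memory usage as a percentage of the container memory request
  targetMemoryUtilizationPercentage: 80
  # -- Extra HTTP headers for scaling queries, e.g. X-Scope-OrgID in multi-tenant setups
  customHeaders: {}
```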

{{- include "mimir.remoteReadUrl.inCluster" $.ctx }}
{{- else -}}
{{- $parsed := urlParse (.remote).url -}}
{{ $parsed.scheme }}://{{ $parsed.host }}{{ include "mimir.prometheusHttpPrefix" $.ctx }}

For external URLs we need a sensible default like /prometheus and allow an override somewhere, otherwise we're tying the internal setting and an unrelated external setting together. For an MVP, I'd just hardcode to /prometheus while we figure out where to put the override.
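A minimal version of that suggestion, hardcoding /prometheus for external addresses while keeping the in-cluster branch untouched (sketch only, assuming the surrounding template context from the diff above):

```yaml
{{- if (.remote).url -}}
{{- $parsed := urlParse (.remote).url -}}
{{/* hardcoded for the MVP; an explicit override can be introduced later */}}
{{ $parsed.scheme }}://{{ $parsed.host }}/prometheus
{{- else -}}
{{- include "mimir.remoteReadUrl.inCluster" $.ctx }}
{{- end -}}
```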


Nice spot. A Prometheus instance doesn't have the /prometheus prefix either; /prometheus is only a Mimir prefix.

@dimitarvdimitrov dimitarvdimitrov linked an issue Jul 19, 2023 that may be closed by this pull request
@davidspek
Contributor Author

Sorry for the late response. I was on vacation the past few weeks. I’ll go through your comments hopefully by the end of this week.

@hobbsh

hobbsh commented Nov 20, 2023

@davidspek @dimitarvdimitrov I'm happy to push this through if possible as it's an important feature.

@dimitarvdimitrov
Contributor

Thanks for offering to help @hobbsh!

My review above still holds, I believe. It would be nice to test this out in a Mimir cluster before merging the PR; I would appreciate it if someone can help out with this. There are some review comments that should be addressed.

And finally write some docs. Docs aren't critical to merge this, but I think they should follow before we announce this as a stable feature of the helm chart. We can start working on them once the PR is in its final stages.

I also know that @pracucci has been working on some improvements to the querier HPA queries that have already been rolled out in cells at Grafana. It makes sense to wait a bit for those to be upstreamed so we can use the queries and KEDA config in Helm as well.

@dimitarvdimitrov
Contributor

The autoscaling improvements have been merged in #6971. It would be a good idea to include them in this PR once it's back on track

@dimitarvdimitrov
Contributor

Thank you for your work on this @davidspek. @beatkind took ownership of the autoscaling changes and we just merged #7282 as a first iteration which was based on the changes in your PR.


Successfully merging this pull request may close these issues.

Add support for HPA in mimir-distributed
6 participants