-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(helm): add keda autoscaling and fix dashboards #4687
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not quite sure about relying on metrics from mimir about autoscaling mimir. If the cluster is unavailable, then this may exacerbate the problem.
What do you think about having the metamonitoring remote URL as the serverAddress
for autoscaling?
@dimitarvdimitrov Seems reasonable to me. I assume that can also write to itself by default. I'll have a look at how to do that in the templating. |
701a410
to
3474702
Compare
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase |
3474702
to
5a3cc24
Compare
@lamida Sorry for the slow turn around here. I've just rebased on main and updated the changelog. |
3e0cb94
to
426e27f
Compare
@lamida @dimitarvdimitrov I've just resolved the merge conflicts so hopefully this PR is ready for merging. |
c3afdc1
to
4aefc10
Compare
Ping @lamida @dimitarvdimitrov |
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
4aefc10
to
93b3bc2
Compare
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Signed-off-by: David van der Spek <vanderspek.david@gmail.com>
Thanks for the contribution @davidspek! And apologies for the delay, the first notification had slipped through my inbox. I will find some time this week to start reviewing the PR |
No problem, thanks for having a look. |
Apologies for not managing to get to this this week. I will start reviewing it first thing next week |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work! I've left some comments ater an initial review, but the core work is in a very good state. Thank you for the effort.
To call this effort complete we'd need to add some documentation. IIRC at Grafana Labs we had some issues when removing the "replicas" field from objects, which caused k8s to scale them down to 1 replica until autoscaling kicks in.
In addition to that it would be nice if we can make use of the existing helm-jsonent comparison to make sure that autscaling between the two doesn't drift apart.
I wouldn't want to put them on your or even include them in this PR because the work you've done is impressive already. Unless you are keep to work on them, I think we can get some resources at GL to work on these two.
Althought it would be nice to have this tested out in some extent. Have you had a chance to test the Helm autoscaling in a Mimir cluster?
ruler_query_frontend+: { | ||
enabled: true, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ruler QF isn't added in this PR, right? Should we keep it disabled?
@@ -0,0 +1,44 @@ | |||
{{- if .Values.query_frontend.kedaAutoscaling.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some convention to name the files as <name>-<resource_kind>.yaml
. Can you name these e.g. query-frontend-scaledObject.yaml
or query-frontend-so.yaml
?
{{- if not .Values.distributor.kedaAutoscaling.enabled }} | ||
replicas: {{ .Values.distributor.replicas }} | ||
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the same be present on the ruler deployment?
@@ -0,0 +1,12 @@ | |||
{{/* | |||
Convert labels to string like: key1="value1", key2="value2", ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it key1="value1"
or key1=value1
? Also I think there isn't any space between the keys. Can you add an example input and output in the comment?
ctx = . context | ||
component = name of the component | ||
*/}} | ||
{{- define "mimir.lib.memorySiToBytes" -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already have mimir.siToBytes
which does the same
@@ -30,6 +30,7 @@ Entries should include a reference to the Pull Request that introduced the chang | |||
|
|||
* [CHANGE] Changed max unavailable ingesters and store-gateways in a zone to 50. #5327 | |||
* [CHANGE] Don't render PodSecurityPolicy on Kubernetes >=1.24. (was >= 1.25). This helps with upgrades between 1.24 and 1.25. To use a PSP in 1.24, toggle `rbac.forcePSPOnKubernetes124: true`. #5357 | |||
[FEATURE] Allow for deploying keda autoscaling objects as part of the helm chart. #4687 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you prepend this with a bullet *
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[FEATURE] Allow for deploying keda autoscaling objects as part of the helm chart. #4687 | |
* [FEATURE] Added experimental feature for deploying keda autoscaling objects as part of the helm chart for the components: distributor, querier, query-frontend and ruler. Requires metamonitoring, for more details on metamonitoring see the Helm chart documentation. #4687 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also mention the limitation that currently only URL embedded basic auth method is supported and we might add further Keda auth methods later, see https://keda.sh/docs/2.11/scalers/prometheus/
enabled: false | ||
minReplicaCount: 1 | ||
maxReplicaCount: 10 | ||
pollingInterval: 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the polling interval default to the scrape interval for metamonitoring?
behavior: | ||
scaleDown: | ||
policies: | ||
- periodSeconds: 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value in jsonnet is 600 here I think. The reason is that the ruler is a bit more sensitive to scaling up and down. Restarting a ruler can result in missed rule evaluations.
minReplicaCount: 2 | ||
maxReplicaCount: 10 | ||
pollingInterval: 10 | ||
querySchedulerInflightRequestsThreshold: 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The jsonnet calculates with as 0.75*8 where 8 is the default value for max_concurrent requests. In Helm that's 16. Let's use 12 here in this case
kedaAutoscaling: | ||
enabled: false | ||
minReplicaCount: 1 | ||
maxReplicaCount: 10 | ||
pollingInterval: 10 | ||
targetCPUUtilizationPercentage: 80 | ||
targetMemoryUtilizationPercentage: 80 | ||
customHeaders: {} | ||
# X-Scope-OrgID: "" | ||
behavior: | ||
scaleDown: | ||
policies: | ||
- periodSeconds: 60 | ||
type: Percent | ||
value: 10 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comments on the kedaAutoscaling section and its fields? I think enough to be able to determine whether you need keda autoscaling and to determine what each of these fields matter.
(same comment for th rest of the sections)
{{- include "mimir.remoteReadUrl.inCluster" $.ctx }} | ||
{{- else -}} | ||
{{- $parsed := urlParse (.remote).url -}} | ||
{{ $parsed.scheme }}://{{ $parsed.host }}{{ include "mimir.prometheusHttpPrefix" $.ctx }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For external URLs we need a sensible default like /prometheus
and allow an override somewhere, otherwise we're tying the internal setting and an unrelated external setting together. For an MVP, I'd just hardcode to /prometheus
while we figure out where to put the override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice spot. A Prometheus instance also doesn't have the /prometheus
prefix, /prometheus
is only a Mimir prefix.
Sorry for the late response. I was on vacation the past few weeks. I’ll go through your comments hopefully by the end of this week. |
@davidspek @dimitarvdimitrov I'm happy to push this through if possible as it's an important feature. |
Thanks for offering to help @hobbsh! My review above still holds I believe. It would be nice to test this out in a Mimir cluster before merging the PR, I would appreciate it someone can help out with this. There are some review comments that should be addressed. And finally write some docs. Docs aren't critical to merge this, but I think they should follow before we announce this as a stable feature of the helm chart. We can start working on them once the PR is in its final stages. I also know that @pracucci has been working on some improvements on the querier HPA queries that have been already rolled out in cells at Grafana. It makes sense to wait a bit for those to be upstreamed so we can use the queries and KEDA config in helm as well |
The autoscaling improvements have been merged in #6971. It would be a good idea to include them in this PR once it's back on track |
Thank you for your work on this @davidspek. @beatkind took ownership of the autoscaling changes and we just merged #7282 as a first iteration which was based on the changes in your PR. |
What this PR does
This PR adds the Keda autoscaling from the libsonnet deployment to the Helm chart and updates the autoscaling sections in the dashboards so they work with Helm deployments.
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]