Managed alertmanager no longer running in clusters #990

Closed
gravelg opened this issue May 29, 2024 · 5 comments

@gravelg

gravelg commented May 29, 2024

Not sure if this is the right place to report such a bug, but we've been using managed alertmanager for a year now, and it seems that in the last few days, the pod has disappeared from a few of our clusters, all GKE Autopilot clusters (if that matters).

Config Secret is still there and unchanged

❯ k -n gmp-public get secrets
NAME                                         TYPE                                  DATA   AGE
alertmanager                                 Opaque                                1      364d

alertmanager pod is gone

❯ k -n gke-gmp-system get pods
NAME                              READY   STATUS    RESTARTS   AGE
collector-s4wjs                   2/2     Running   0          123m
collector-tdnkl                   2/2     Running   0          9d
collector-vkvhj                   2/2     Running   0          9d
gmp-operator-68988c87ff-7m5q5     1/1     Running   0          9d
rule-evaluator-664c866849-64qks   2/2     Running   0          9d

On another non-autopilot cluster:

❯ k -n gmp-system get pods
NAME                              READY   STATUS    RESTARTS      AGE
alertmanager-0                    2/2     Running   0             9d
collector-bt69f                   2/2     Running   0             9d
collector-fdpdm                   2/2     Running   0             9d
collector-kfhdz                   2/2     Running   0             9d
gmp-operator-6b4cf8fcc4-b6n5t     1/1     Running   0             9d
rule-evaluator-659bf557cf-gmcvt   2/2     Running   2 (20h ago)   9d

I also see that the namespace for the gmp pods differs between the two: gke-gmp-system on the Autopilot cluster versus gmp-system on the regular one. Not sure if that has anything to do with it.

@bernot-dev
Collaborator

The fact that you're observing this only in GKE Autopilot clusters is likely related to the timing of our release rollouts. We introduced a change in #691 that scales alertmanager to zero when rules are not configured using our Rules, ClusterRules, or GlobalRules resources.

Can you provide a bit more information about how you are using Alertmanager? Do you have any of those Rules configured?
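
One way to check whether a cluster falls into this case is to count the Rules, ClusterRules, and GlobalRules objects it holds. The helper below is only a sketch: the function name count_rule_objects is made up for illustration, and it assumes the three resource types are registered under the monitoring.googleapis.com API group.

```shell
# Hypothetical helper: count Rules, ClusterRules and GlobalRules objects
# across the whole cluster and print the total as a single integer.
count_rule_objects() {
  total=0
  for kind in rules clusterrules globalrules; do
    # --no-headers yields one line per object. Errors (for example a CRD
    # that is not installed) are silenced and simply count as zero.
    n=$(kubectl get "${kind}.monitoring.googleapis.com" --all-namespaces --no-headers 2>/dev/null | wc -l)
    total=$((total + n))
  done
  echo "$total"
}
```

Running count_rule_objects against a cluster and getting 0 would be consistent with alertmanager having been scaled to zero, although the operator's exact decision logic is not shown in this thread.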

@gravelg
Author

gravelg commented May 29, 2024

We manage alert rules through Grafana, and so we don't have any of the Rules objects configured in those clusters. I can try to create a Rules object and see if the alertmanager pod comes back

@bernot-dev
Collaborator

It may also be worth checking if the StatefulSet that manages the alertmanager pods still exists:

kubectl get -n gke-gmp-system statefulset/alertmanager

@gravelg
Author

gravelg commented May 29, 2024

The StatefulSet is indeed still there

❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME           READY   AGE
alertmanager   0/0     370d

I just applied the example-rule from the repo and sure enough, alertmanager is back

❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME           READY   AGE
alertmanager   1/1     370d

I'll try to craft a rule that never actually alerts us and just leave it in place so that alertmanager doesn't scale to 0, unless you have another option I can try.

@bernot-dev
Collaborator

Something like our example rule should be a good starting point.

That will be the best workaround for now. I'll discuss with the team whether it makes sense for us to implement another solution for future releases.
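
For anyone landing here with the same setup, a placeholder along the following lines might serve as that workaround. It is a sketch rather than the repo's example rule: the file name, the object name alertmanager-keepalive, the default namespace, and the recording rule itself are all made up for illustration. It also assumes that a Rules object containing only a recording rule is enough to keep alertmanager scaled up, which this thread does not confirm.

```shell
# Write a hypothetical placeholder Rules manifest to a local file for review.
# It holds a single recording rule, so nothing in it can send a notification.
cat > keepalive-rules.yaml <<'EOF'
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: default              # assumption: pick a namespace you own
  name: alertmanager-keepalive    # hypothetical name
spec:
  groups:
  - name: keepalive
    interval: 60s
    rules:
    # Recording rule only; it exists so that a Rules object is present.
    - record: keepalive:up:count
      expr: count(up)
EOF
```

After reviewing the file, it could be applied with kubectl apply -f keepalive-rules.yaml, and the StatefulSet checked again with the kubectl get command used earlier in the thread.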
