Managed alertmanager no longer running in clusters #990

gravelg · 2024-05-29T15:12:00Z

Not sure if this is the right place to report this, but we've been using managed alertmanager for a year now, and in the last few days the pod has disappeared from a few of our clusters. All of the affected clusters are GKE Autopilot clusters (if that matters).

The config Secret is still there and unchanged:

❯ k -n gmp-public get secrets
NAME                                         TYPE                                  DATA   AGE
alertmanager                                 Opaque                                1      364d

The alertmanager pod is gone:

❯ k -n gke-gmp-system get pods
NAME                              READY   STATUS    RESTARTS   AGE
collector-s4wjs                   2/2     Running   0          123m
collector-tdnkl                   2/2     Running   0          9d
collector-vkvhj                   2/2     Running   0          9d
gmp-operator-68988c87ff-7m5q5     1/1     Running   0          9d
rule-evaluator-664c866849-64qks   2/2     Running   0          9d

On another non-autopilot cluster:

❯ k -n gmp-system get pods
NAME                              READY   STATUS    RESTARTS      AGE
alertmanager-0                    2/2     Running   0             9d
collector-bt69f                   2/2     Running   0             9d
collector-fdpdm                   2/2     Running   0             9d
collector-kfhdz                   2/2     Running   0             9d
gmp-operator-6b4cf8fcc4-b6n5t     1/1     Running   0             9d
rule-evaluator-659bf557cf-gmcvt   2/2     Running   2 (20h ago)   9d

I also see that the namespace for the gmp pods differs between an Autopilot cluster (gke-gmp-system) and a regular cluster (gmp-system). I'm not sure whether that has anything to do with it.

bernot-dev · 2024-05-29T15:28:10Z

The fact that you're observing this only in GKE Autopilot clusters is likely related to the timing of our release rollouts. We introduced a change in #691 that scales alertmanager to zero when rules are not configured using our Rules, ClusterRules, or GlobalRules resources.

Can you provide a bit more information about how you are using Alertmanager? Do you have any of those Rules configured?

gravelg · 2024-05-29T15:34:40Z

We manage alert rules through Grafana, so we don't have any of the Rules objects configured in those clusters. I can try creating a Rules object to see if the alertmanager pod comes back.

bernot-dev · 2024-05-29T15:38:05Z

It may also be worth checking if the StatefulSet that manages the alertmanager pods still exists: kubectl get -n gke-gmp-system statefulset/alertmanager

gravelg · 2024-05-29T15:41:21Z

The StatefulSet is indeed still there

❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME           READY   AGE
alertmanager   0/0     370d

I just applied the example-rule from the repo and sure enough, alertmanager is back

❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME           READY   AGE
alertmanager   1/1     370d

I'll try to craft a rule that never actually alerts us and leave it in place so that alertmanager doesn't scale to 0, unless you have another option I can try.

bernot-dev · 2024-05-29T16:31:18Z

Something like our example rule should be a good starting point.

That will be the best workaround for now. I'll discuss with the team whether it makes sense for us to implement another solution for future releases.
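
For anyone applying this workaround, here is a minimal sketch of a placeholder Rules object that should never page anyone. It assumes the `monitoring.googleapis.com/v1` Rules schema used by the repo's example rule. The namespace, object name, group name, and recorded series name are all hypothetical. It also assumes, based on the description of #691 above, that the presence of any Rules object is enough to keep alertmanager scaled up, even one containing only a recording rule.

```yaml
# Hypothetical placeholder Rules object; names and namespace are illustrative only.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: default            # assumption: any namespace you already use
  name: keep-alertmanager-up    # hypothetical name
spec:
  groups:
  - name: placeholder
    interval: 60s
    rules:
    # Recording rule only: it writes a series but can never fire an alert.
    - record: placeholder:up:count
      expr: count(up)
```

After applying it with `kubectl apply -f`, re-running `kubectl get -n gke-gmp-system statefulset/alertmanager` should show whether the StatefulSet has scaled back to 1/1. If a recording-only rule turns out not to be enough, an alerting rule whose expression never returns a result is another option to try.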

github-actions bot assigned maxamins May 29, 2024

bernot-dev assigned bernot-dev and unassigned maxamins May 29, 2024

bernot-dev closed this as completed May 29, 2024
