
Add cluster peers DNS refresh job #1428

Merged (6 commits) on Nov 23, 2018

Conversation

@povilasv (Contributor) commented Jun 21, 2018

Adds a job which runs periodically and refreshes the cluster.peer DNS records.

REF #1449

The problem is that when you restart all of the Alertmanager instances in an environment like Kubernetes, DNS may still contain the old instance IPs but, at startup (when Join() happens), none of the new instance IPs. Because DNS is not empty at that point, resolvePeers (with waitIfEmpty=true) returns immediately, and "islands" of single Alertmanager instances form.

All Alertmanager metrics endpoints show: alertmanager_cluster_members 1

Here are some logs:

Logs of alertmanager 1:

level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305

Logs of alertmanager 2:

level=info ts=2018-06-21T14:35:44.824589916Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305

My Kubernetes config:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
  namespace: sys-mon
spec:
  clusterIP: None
  ports:
  - name: cluster
    protocol: TCP
    port: 8001
    targetPort: cluster
  selector:
    app: alertmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: sys-mon
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.0-rc.1
        args:
          - --config.file=/etc/alertmanager/config.yml
          - --web.listen-address=0.0.0.0:9093
          - --storage.path=/alertmanager
          - --web.external-url=https://alertmanager.dev.uw.systems
          - --cluster.listen-address=0.0.0.0:8001
          - --cluster.peer=alertmanager-peers.sys-mon:8001
...

To reproduce, just run Alertmanager in Kubernetes with a headless service and then run kubectl delete po --force --grace-period=0 -l app=alertmanager

After applying the change and setting the refresh interval to ~5 minutes, I can see that the nodes joined back after 5 minutes via the refresh job, as the counter increased:

alertmanager_cluster_refresh_total 2
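
For illustration, here is a minimal, hypothetical sketch of such a refresh job, not the exact code in this PR: it periodically re-resolves the configured peer names and re-joins any resolved address that is not yet a cluster member. The refresher type, field names, and logging are assumptions for the example.

package cluster

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/memberlist"
)

// refresher periodically re-resolves peer DNS names and re-joins
// any address that is not currently part of the memberlist cluster.
type refresher struct {
	mlist *memberlist.Memberlist
	peers []string // DNS names with ports, e.g. "alertmanager-peers.sys-mon:8001"
}

func (r *refresher) run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			r.refresh()
		}
	}
}

func (r *refresher) refresh() {
	// Addresses currently known to memberlist.
	known := map[string]struct{}{}
	for _, n := range r.mlist.Members() {
		known[n.Addr.String()] = struct{}{}
	}
	for _, peer := range r.peers {
		host, port, err := net.SplitHostPort(peer)
		if err != nil {
			continue
		}
		ips, err := net.LookupIP(host) // fresh DNS query on every tick
		if err != nil {
			log.Printf("refresh: resolving %q failed: %v", host, err)
			continue
		}
		for _, ip := range ips {
			if _, ok := known[ip.String()]; ok {
				continue // already a cluster member
			}
			// Newly appeared address: try to (re-)join it.
			if _, err := r.mlist.Join([]string{net.JoinHostPort(ip.String(), port)}); err != nil {
				log.Printf("refresh: joining %s failed: %v", ip, err)
			}
		}
	}
}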

@povilasv changed the title from "Add cluster.refresh-interval refresh" to "Add cluster peers DNS refresh job" on Jun 21, 2018
@brian-brazil (Contributor):

Rather than slowly re-inventing service discovery, if this is determined to be needed we should use the existing SD from Prometheus.

@povilasv (Contributor, Author) commented Jun 21, 2018

@brian-brazil fair point, but using Prometheus' DNS service discovery could only replace the resolvePeers function (https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L611).

IMO a DNS refresh job would still be needed, unless we pass DNS names down to memberlist.Join(), but I'm pretty sure it doesn't refresh DNS after the initial Join() run, which would leave us in the same place.

But I'm all for better ways to do this, so if you have anything specific in mind I would be glad to help.

Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) left a comment:

Left some minor comments. Thanks for looking into this @povilasv!

Help: "A counter of the number of failed cluster peer refresh attempts.",
})
p.refreshCounter = prometheus.NewCounter(prometheus.CounterOpts{
Name: "alertmanager_cluster_refresh_total",

Member:

Metric name and metric description seem to have diverged. alertmanager_cluster_refresh_total sounds like the number of times a refresh happened, not the number of times a peer joined the cluster due to a refresh. How about alertmanager_cluster_refresh_join_total? (Not quite perfect either.)

Contributor (Author):

I agree. Renamed to alertmanager_cluster_refresh_join_total and alertmanager_cluster_refresh_join_failed_total to indicate that these counters track joins.
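
For reference, the renamed counters would look roughly like this; the Help strings and registration call are my assumptions, not the exact code in the PR:

import "github.com/prometheus/client_golang/prometheus"

var (
	refreshJoinCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_join_total",
		Help: "A counter of the number of cluster peers joined via DNS refresh.", // assumed wording
	})
	refreshJoinFailedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_join_failed_total",
		Help: "A counter of the number of failed cluster peer joins via DNS refresh.", // assumed wording
	})
)

func init() {
	// Registration sketch; the PR wires these into the cluster Peer instead.
	prometheus.MustRegister(refreshJoinCounter, refreshJoinFailedCounter)
}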

@@ -162,6 +162,7 @@ func main() {
settleTimeout = kingpin.Flag("cluster.settle-timeout", "Maximum time to wait for cluster connections to settle before evaluating notifications.").Default(cluster.DefaultPushPullInterval.String()).Duration()
reconnectInterval = kingpin.Flag("cluster.reconnect-interval", "Interval between attempting to reconnect to lost peers.").Default(cluster.DefaultReconnectInterval.String()).Duration()
peerReconnectTimeout = kingpin.Flag("cluster.reconnect-timeout", "Length of time to attempt to reconnect to a lost peer.").Default(cluster.DefaultReconnectTimeout.String()).Duration()
refreshInterval = kingpin.Flag("cluster.refresh-interval", "Interval between attempting to refresh cluster.peers DNS records.").Default(cluster.DefaultReconnectInterval.String()).Duration()

Member:

Would a good default value be enough for now, or is a custom configuration necessary for most environments?

@povilasv (Contributor, Author) commented Jun 27, 2018:

👍 I think 30s should be fine for most environments; Alertmanager is typically quick to start, so IMO anything longer than that would slow down startup/gossip settling.

Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) left a comment:

Just a small follow-up.

Any thoughts from the others?

@@ -112,6 +118,7 @@ func Join(
probeInterval time.Duration,
reconnectInterval time.Duration,
reconnectTimeout time.Duration,
refreshInterval time.Duration,

Member:

As this is not configurable via a command line flag anymore, there is no reason for the parameter, right?

@grobie (Member) commented Jul 6, 2018

Why are you restarting all alertmanagers at once? Kubernetes provides many options to control the deployment speed and to ensure that DNS is updated before continuing with the next instance restart. For example, the deployment speed can be controlled with minReadySeconds and maxSurge, or you can switch to a StatefulSet altogether.

I don't see why any additional alertmanager functionality is necessary to ensure it can be safely deployed in Kubernetes.

@povilasv (Contributor, Author) commented Aug 1, 2018

@grobie the problem is not about deployment in Kubernetes; that's just a way to reproduce it.
The problem is that we shouldn't expect DNS to always be up to date, nor depend on DNS update speed, in order to form a complete Alertmanager cluster. Wouldn't you agree? (Sorry for the delayed response, I was on vacation.)

@povilasv force-pushed the peers-refresh branch 2 times, most recently from e11d28a to 6e5e827 on August 21, 2018 14:58
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) commented Oct 25, 2018

Note to my future-me and maybe others here as well:

We (thanks @metalmatze for debugging) have just hit the same issue (depending on DNS being consistent instead of eventually consistent) while updating an Alertmanager cluster (3 instances) on a Kubernetes cluster deployed via the Prometheus Operator.

| Alertmanager 0 | Alertmanager 1 | Alertmanager 2 |
| --- | --- | --- |
| runs 0.15.1 | runs 0.15.1 | runs 0.15.1 |
| _ | _ | updates to 0.15.2, DNS query returns IP of old 0 and old 1 |
| _ | updates to 0.15.2, DNS query returns IP of old 0 and old 3, discovers new 3 via old 0 via gossip | _ |
| updates to 0.15.2, DNS query returns IP of old 1 and old 3, does not discover new 2 or new 3 | _ | _ |

This partition ({new-0}, {new-1, new-2}) would eventually resolve by refreshing DNS entries (this PR).

An alternative fix, as suggested by @grobie, is to leverage Kubernetes readiness checks. This would not actually depend on Alertmanager reporting that it discovered its peers, but rather add a simple delay, hoping for DNS to reach consistency in the meantime.

I agree with @povilasv that depending on an eventually consistent system in a consistent fashion is a design flaw.

Before reinventing the wheel here I think @brian-brazil also has a good point with:

Rather than slowly re-inventing service discovery, if this is determined to be needed we should use the existing SD from Prometheus.

I will take a look at how easily that can be achieved. In the meantime I am curious what your thoughts are.

@brancz (Member) commented Nov 5, 2018

I disagree with #1428 (comment): for one, there are environments other than Kubernetes out there, with varying capabilities, and the proposed solution is a hack at best and still racy. Currently discovery is so broken that people have to apply hacks and do racy things that are very fragile. I think we should attempt to fix what we have right now by actually re-querying DNS, making a current release of Alertmanager usable again.

Then we should immediately start working on re-using the Prometheus service discovery module. I do agree that this is probably the best way forward, but it is also a "more complicated than it sounds" type of issue/feature (I would probably want to introduce it in parallel, as we had a lot of problems with the SD module when we refactored it within Prometheus to be re-used for discovering Alertmanagers).

@grobie closed this Nov 5, 2018
@grobie reopened this Nov 5, 2018
@grobie (Member) commented Nov 5, 2018

The Alertmanager peers configuration does not support service discovery. It is currently intended that you list the address of every peer separately by repeating that flag. The way you seem to configure Alertmanager is the actual hack here: using the Kubernetes DNS service address as a single peer, expecting every peer to eventually connect to every other.

In environments outside of Kubernetes, people will just use their existing means of service discovery or configuration management to configure the list of Alertmanager peers explicitly. That's what we do at SoundCloud, for example, where we don't want to deploy Alertmanager inside of Kubernetes.

Most software doesn't have opinionated built-in service discovery, and I don't see a compelling reason to add this complexity to Alertmanager. The only people who have reported issues with the existing mechanism so far are deploying Alertmanager on Kubernetes, which provides all the features needed to avoid re-implementing service discovery functionality in every single service.

Using a Deployment instead of a StatefulSet is the wrong choice of Kubernetes controller. The StatefulSet provides the properties you want for an HA Alertmanager deployment, guaranteeing that at most one instance is taken down at a time. It also provides stable identifiers for every instance in the set, so that you can use --cluster.peer=alertmanager-1 --cluster.peer=alertmanager-2 --cluster.peer=alertmanager-3 ... in your config.

I still can't see the need to add the complexity of service discovery to Alertmanager just to discover its peers. There is an unlimited number of different service discovery mechanisms out there. The Prometheus Operator uses the worst possible features of Kubernetes to configure and deploy Alertmanager; I don't see why Alertmanager needs to be fixed here instead of the Operator itself @brancz @mxinden.

@brancz (Member) commented Nov 5, 2018

The Prometheus Operator uses the worst possible features of Kubernetes to configure and deploy Alertmanager; I don't see why Alertmanager needs to be fixed here instead of the Operator itself @brancz @mxinden.

The fact that you reference the Operator as using Deployments shows that you haven't actually looked at it, so I'm going to ask you to stay respectful in your wording.

We do exactly what you described as the right way to deploy Alertmanager on Kubernetes (with StatefulSets and consistent DNS pod identities). We could introduce additional rollout delays, but that assumes the DNS server always respects TTLs precisely, which from experience has not always been the case, so we are looking for additional hardening of existing functionality. Arguing that DNS records should be re-resolved once in a while, because DNS is not a consistent system, holds regardless of the Prometheus Operator or even Kubernetes.

Whether we want the full service discovery module from Prometheus is, as far as I can tell, up for discussion. I have just expressed interest, but I understand not wanting that heavy functionality as well. In my opinion that's a separate topic from fixing existing functionality.

@grobie (Member) commented Nov 5, 2018

The fact that you reference the Operator as using Deployments shows that you haven't actually looked at it, so I'm going to ask you to stay respectful in your wording.

Point taken. That was poor argumentation; I apologize.

Arguing that DNS records should be re-resolved once in a while, because DNS is not a consistent system, holds regardless of the Prometheus Operator or even Kubernetes.

Alright, let's do it.

@@ -391,6 +408,51 @@ func (p *Peer) reconnect() {
}
}

func (p *Peer) handleRefresh(d time.Duration) {

Member:

If we're talking about proper DNS support for alertmanager, it would be better to respect the TTL of the record as advertised by the authority.

Member:

Would you shorten the interval to 10s or 15s?
Even though I sometimes run into this problem of members not finding each other, I still think that 30s is enough for the cluster to heal.
Every time this happened I got some duplicate alerts, which is something I can live with if it's fixed within < 30s.

Contributor (Author):

@grobie It looks like the Go net package doesn't expose TTL values (https://golang.org/pkg/net/#IPAddr).
I guess the only way to get the TTL value is to use something like https://stackoverflow.com/a/48997354, which would look quite cumbersome.

I personally prefer to use this, as it's a simpler solution, but I can prepare a change if you want.
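
For context, here is a minimal sketch of what reading TTLs would require with a third-party DNS library such as github.com/miekg/dns; the resolver config path, the peer name, and the choice of an A-record query are assumptions for the example, not code from this PR:

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Use the first nameserver from the system resolver config (assumption).
	conf, err := dns.ClientConfigFromFile("/etc/resolv.conf")
	if err != nil {
		log.Fatal(err)
	}

	c := new(dns.Client)
	m := new(dns.Msg)
	// Query A records directly so the response TTLs are visible.
	m.SetQuestion(dns.Fqdn("alertmanager-peers.sys-mon"), dns.TypeA)

	resp, _, err := c.Exchange(m, conf.Servers[0]+":"+conf.Port)
	if err != nil {
		log.Fatal(err)
	}
	for _, rr := range resp.Answer {
		if a, ok := rr.(*dns.A); ok {
			fmt.Printf("ip=%s ttl=%ds\n", a.A, a.Hdr.Ttl)
		}
	}
}

Compared to a plain net.LookupIP call, this pulls in an extra dependency plus hand-rolled resolver configuration, which is the cumbersomeness mentioned above.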

Contributor:

This ends up relying on using the internal resolver, which should be properly caching/refreshing responses. I would say we can shorten the time.

Contributor (Author):

👍 Shortened to 15s.

@@ -97,6 +102,7 @@ const (
DefaultProbeInterval     = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout  = 6 * time.Hour
DefaultRefreshInterval   = 30 * time.Second

Member:

This seems to be quite long to prevent a partition during a deployment.

if !isPeerFound {
	if _, err := p.mlist.Join([]string{peer}); err != nil {
		p.failedRefreshCounter.Inc()
		level.Debug(logger).Log("result", "failure", "addr", peer)

Member:

I would argue that this could also be an Info or Warn.

@povilasv (Contributor, Author) commented Nov 6, 2018:

👍 Changed it to Warn.

Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) left a comment:

I have been testing this on a minikube cluster with increased CoreDNS caching time, and @metalmatze and I have tested this on a vanilla Kubernetes cluster. In addition, I have run this through the Prometheus Operator test suite and added a specific test case to cover this use case (prometheus-operator/prometheus-operator#2145).

This looks good to me. Any further comments by others?

@stuartnelson3 (Contributor):

This looks good to me. Any further comments by others?

You've been looking after this one; if you're happy with it, then 👍 from me.

@simonpasquier (Member) left a comment:

I'm fine with the change. I think that eventually we should leverage the DNS service discovery of Prometheus, but for now it can't be integrated without pulling in all the SD packages...

@mxinden merged commit 7f34cb4 into prometheus:master on Nov 23, 2018

@mxinden (Member) commented Nov 23, 2018

Thanks @povilasv for the patch. Thanks everyone for the collaboration and discussions.

@povilasv deleted the peers-refresh branch on November 23, 2018 09:36
dansimone pushed a commit to dansimone/alertmanager that referenced this pull request Jan 14, 2019

@XI1062-abhisheksinghal:

@povilasv can you help? I am using Prometheus, Alertmanager, Node exporter, and Grafana.
I configured Alertmanager to send alerts when one of the service instances goes down. Alerts are fired successfully, but although I have configured Gmail for receiving the alerts, I am not receiving any.
The Docker logs for Alertmanager show the error below:

caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="gmail-notifications/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
