
Initially unresolved peers are lost forever #1661

Closed
redbaron opened this issue Dec 12, 2018 · 5 comments

@redbaron

What did you do?

Run Alertmanager as a StatefulSet with a --cluster.peer command-line flag given once per member. If the initial DNS resolution of the peers fails, they are forgotten and never retried. They neither show up in clusterStatus.peers, nor do any established connections to them appear in netstat.

Each pod of the 3-member StatefulSet shows only its own IP address in clusterStatus and reports itself as ready. If I restart any one of them (kubectl exec pkill alert), it comes up, manages to resolve the peer IP addresses and joins the other 2, forming a cluster of 3 (the ones that weren't restarted presumably find out about each other via gossip).

What did you expect to see?

Early DNS failures shouldn't cause isolated "islands" to form.

I suspect the problem is in

ips, err := res.LookupIPAddr(ctx, host)
if err != nil {
    // Assume direct address.
    resolvedPeers = append(resolvedPeers, peer)
    continue
}
where a DNS failure is interpreted as if an exact IP address had been given.
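
For illustration only, a minimal sketch (not the upstream code, and not the eventual fix) of a resolvePeers-style loop that distinguishes a literal IP address from a genuine DNS failure, returning unresolved names so the caller can retry them instead of dropping them. The function name resolvePeersSketch and the two-slice return are assumptions made for this sketch:

package cluster

import (
    "context"
    "net"
)

// resolvePeersSketch resolves peer "host:port" strings. Literal IPs pass
// through untouched; names that fail to resolve are returned separately so
// the caller can retry them later rather than losing them forever.
func resolvePeersSketch(ctx context.Context, peers []string, res *net.Resolver) ([]string, []string, error) {
    var resolved, unresolved []string
    for _, peer := range peers {
        host, port, err := net.SplitHostPort(peer)
        if err != nil {
            return nil, nil, err
        }
        if net.ParseIP(host) != nil {
            // Literal IP address: nothing to resolve.
            resolved = append(resolved, peer)
            continue
        }
        ips, err := res.LookupIPAddr(ctx, host)
        if err != nil {
            // Genuine DNS failure: keep the name for a later retry instead
            // of assuming it was a direct address.
            unresolved = append(unresolved, peer)
            continue
        }
        for _, ip := range ips {
            resolved = append(resolved, net.JoinHostPort(ip.String(), port))
        }
    }
    return resolved, unresolved, nil
}

Whether the retry lives here or in the caller is a design choice; the point is only that a lookup error and a literal IP are different cases and should not share the same fallback.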

What did you see instead? Under which circumstances?

Each member of the StatefulSet forms a single-node cluster.

Environment

  • Alertmanager version:

0.15.3

  • Alertmanager configuration file:
global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'null'
  routes:
  - match:
      alertname: DeadMansSwitch
    receiver: 'null'
receivers:
- name: 'null'

  • Logs:
level=info ts=2018-12-12T15:59:11.569012901Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.3, branch=HEAD, revision=d4a7697cc90f8bce62efe7c44b63b542578ec0a1)"
level=info ts=2018-12-12T15:59:11.569095844Z caller=main.go:175 build_context="(go=go1.11.2, user=root@4ecc17c53d26, date=20181109-15:40:48)"
level=debug ts=2018-12-12T15:59:11.576487801Z caller=cluster.go:143 component=cluster msg="resolved peers to following addresses" peers=alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783,alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783,alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783
level=debug ts=2018-12-12T15:59:11.579127508Z caller=delegate.go:209 component=cluster received=NotifyJoin node=01CYHJ9DPRWZWC3P8VZW6J2E1H addr=192.168.0.129:6783
level=debug ts=2018-12-12T15:59:11.582852186Z caller=cluster.go:287 component=cluster memberlist="2018/12/12 15:59:11 [WARN] memberlist: Failed to resolve alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n"
level=debug ts=2018-12-12T15:59:11.586089894Z caller=cluster.go:287 component=cluster memberlist="2018/12/12 15:59:11 [WARN] memberlist: Failed to resolve alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n"
level=debug ts=2018-12-12T15:59:11.589100603Z caller=cluster.go:287 component=cluster memberlist="2018/12/12 15:59:11 [WARN] memberlist: Failed to resolve alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n"
level=warn ts=2018-12-12T15:59:11.589148819Z caller=cluster.go:219 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\n* Failed to resolve alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n* Failed to resolve alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n* Failed to resolve alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host"
level=info ts=2018-12-12T15:59:11.589212472Z caller=cluster.go:221 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2018-12-12T15:59:11.589237137Z caller=main.go:265 msg="unable to join gossip mesh" err="3 errors occurred:\n\n* Failed to resolve alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-0.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n* Failed to resolve alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-1.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host\n* Failed to resolve alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc:6783: lookup alertmanager-test-2.alertmanager-operated.allns-x-amclustergossipsilences-pjmsed-0.svc on 10.96.0.10:53: no such host"
level=info ts=2018-12-12T15:59:11.589316523Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2018-12-12T15:59:11.589762081Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-12-12T15:59:11.593358442Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2018-12-12T15:59:13.590121522Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000133907s
level=debug ts=2018-12-12T15:59:15.590328722Z caller=cluster.go:592 component=cluster msg="gossip looks settled" elapsed=4.000379327s
level=debug ts=2018-12-12T15:59:17.590574734Z caller=cluster.go:592 component=cluster msg="gossip looks settled" elapsed=6.000623571s
level=debug ts=2018-12-12T15:59:19.590842429Z caller=cluster.go:592 component=cluster msg="gossip looks settled" elapsed=8.000892937s
level=info ts=2018-12-12T15:59:21.59108732Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=10.0011107s

Status:

{"status":"success","data":{"configYAML":"global:\n  resolve_timeout: 5m\n  http_config: {}\n  smtp_hello: localhost\n  smtp_require_tls: true\n  pagerduty_url: https://events.pagerduty.com/v2/enqueue\n  hipchat_api_url: https://api.hipch
at.com/\n  opsgenie_api_url: https://api.opsgenie.com/\n  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/\n  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/\nroute:\n  receiver: \"null\"\n  group_b
y:\n  - job\n  routes:\n  - receiver: \"null\"\n    match:\n      alertname: DeadMansSwitch\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 12h\nreceivers:\n- name: \"null\"\ntemplates: []\n","configJSON":{"global":{"resolve_
timeout":300000000000,"http_config":{"BasicAuth":null,"BearerToken":"","BearerTokenFile":"","ProxyURL":{},"TLSConfig":{"CAFile":"","CertFile":"","KeyFile":"","ServerName":"","InsecureSkipVerify":false}},"smtp_hello":"localhost","smtp_requ
ire_tls":true,"pagerduty_url":"https://events.pagerduty.com/v2/enqueue","hipchat_api_url":"https://api.hipchat.com/","opsgenie_api_url":"https://api.opsgenie.com/","wechat_api_url":"https://qyapi.weixin.qq.com/cgi-bin/","victorops_api_url
":"https://alert.victorops.com/integrations/generic/20131114/alert/"},"route":{"receiver":"null","group_by":["job"],"routes":[{"receiver":"null","match":{"alertname":"DeadMansSwitch"}}],"group_wait":30000000000,"group_interval":3000000000
00,"repeat_interval":43200000000000},"receivers":[{"name":"null"}],"templates":null},"versionInfo":{"branch":"HEAD","buildDate":"20181109-15:40:48","buildUser":"root@4ecc17c53d26","goVersion":"go1.11.2","revision":"d4a7697cc90f8bce62efe7c
44b63b542578ec0a1","version":"0.15.3"},"uptime":"2018-12-12T15:59:11.589298893Z","clusterStatus":{"name":"01CYHJ9DPRWZWC3P8VZW6J2E1H","status":"ready","peers":[{"name":"01CYHJ9DPRWZWC3P8VZW6J2E1H","address":"192.168.0.129:6783"}]}}}
@redbaron
Author

The initial DNS resolution failure has something to do with CoreDNS. It seems to be slow to return the DNS entries the Alertmanager pod requests very early on.

It is some sort of race with pod creation: if I exec into the Alertmanager container, the names resolve just fine, and causing the container to restart (with pkill alertmanager) without deleting the pod also makes DNS resolution work.

Switching from CoreDNS to kube-dns hides the problem, because DNS resolution starts working sooner, so by the time Alertmanager requests the entries the correct records are returned.
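
For illustration only, a minimal sketch (not part of Alertmanager) of the kind of startup guard that would sidestep this race: block until every configured peer name resolves, or a deadline expires, before attempting the initial join. waitForPeerDNS is a hypothetical helper name:

package cluster

import (
    "context"
    "net"
    "time"
)

// waitForPeerDNS is a hypothetical startup guard: it blocks until every
// peer hostname resolves or the context expires, so that the initial
// cluster join does not race the pod's DNS records becoming visible.
func waitForPeerDNS(ctx context.Context, res *net.Resolver, peers []string) error {
    for _, peer := range peers {
        host, _, err := net.SplitHostPort(peer)
        if err != nil {
            return err
        }
        if net.ParseIP(host) != nil {
            continue // literal IP, nothing to wait for
        }
        for {
            if _, err := res.LookupIPAddr(ctx, host); err == nil {
                break
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(2 * time.Second):
            }
        }
    }
    return nil
}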

@mxinden
Member

mxinden commented Dec 13, 2018

We have recently merged #1428, which introduces a periodic DNS refresh. This should address your issue. It is included in the latest alpha release: https://github.com/prometheus/alertmanager/releases/tag/v0.16.0-alpha.0.

Let us know if this is of any help!
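
A rough sketch of the periodic-refresh idea (hypothetical names, not the actual #1428 code): keep the originally configured peer names around and re-resolve them on a timer, so a peer that failed to resolve at startup is picked up later. It reuses the resolvePeersSketch helper sketched above; the join callback stands in for whatever re-joins the gossip memberlist:

package cluster

import (
    "context"
    "net"
    "time"
)

// refreshLoop periodically re-resolves the originally configured peer names
// and hands any freshly resolved addresses to the join callback.
// Hypothetical illustration only; not Alertmanager's actual implementation.
func refreshLoop(ctx context.Context, configuredPeers []string, interval time.Duration, join func([]string) error) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            resolved, _, err := resolvePeersSketch(ctx, configuredPeers, net.DefaultResolver)
            if err != nil || len(resolved) == 0 {
                continue // nothing new to join yet; try again on the next tick
            }
            // join re-adds the freshly resolved addresses to the memberlist.
            _ = join(resolved)
        }
    }
}

The key point is that the loop works from the configured names, not from the already-resolved peer list, so nothing is lost if the very first lookup fails.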

@redbaron
Author

@mxinden, I saw that one, but it operates on the already-known list of members. My impression was that the problem is that peers get lost from the list of peers very early on, so there is nothing to "refresh". I didn't trace all the code paths though, so I'll give 0.16.0 a try; maybe it will help.

@dansimone

dansimone commented Jan 14, 2019

I've encountered this problem as well on 0.15.3, trying to do the same thing you're doing (deploying an AlertManager cluster as a Kubernetes statefulset). I've found that v0.16.0-alpha.0 does indeed fix this - I've been unable to reproduce the issue of cluster partitions forming after initial deployment of the cluster (and this was fairly easy to reproduce with 0.15.3).

I'd like to pick this up in production, except that the only released Alertmanager with this change is v0.16.0-alpha.0. Is there any ETA on the first stable 0.16.x release? Or would the team be open to cutting a 0.15.4 release with this change? This issue seems to be a sticking point for a lot of folks trying to use Alertmanager clustering with Kubernetes.

@simonpasquier
Member

Closing this, as v0.16.0 (v0.16.2 being the latest) has now been released and should fix the issue. Feel free to re-open if that isn't the case.
