Initially unresolved peers are lost forever #1661
Comments
The initial DNS resolution failure has something to do with CoreDNS. It seems to be slow to return the DNS entries the alertmanager pod requests very early on. It is some sort of race with pod creation, because if I exec into the Alertmanager container, names are resolved just fine; causing the container to restart also works around it. Switching from CoreDNS to kube-dns hides the problem, because DNS resolution starts working faster, and by the time alertmanager requests it the correct entries are returned.
We have recently merged #1428, which introduces a periodic DNS refresh. This should address your issue. It is included in the latest alpha release: https://github.com/prometheus/alertmanager/releases/tag/v0.16.0-alpha.0. Let us know if this is of any help!
@mxinden, I saw that one, but it operates on an already-known list of members. My impression was that the problem is that the peer gets lost from the list of peers very early on, so there is nothing to "refresh". I didn't trace all the codepaths though, so I'll give 0.16.0 a try; maybe it will help.
I've encountered this problem as well on 0.15.3, trying to do the same thing you're doing (deploying an AlertManager cluster as a Kubernetes statefulset). I've found that v0.16.0-alpha.0 does indeed fix this: I've been unable to reproduce the issue of cluster partitions forming after initial deployment of the cluster (and this was fairly easy to reproduce with 0.15.3). I'd like to pick this up in production, except that the only released AlertManager with this change is the 0.16.0 alpha. Is there any ETA on the first stable 0.16.x release? Or would the team be open to cutting a 0.15.4 release with this change? This issue seems to be a sticking point for a lot of folks trying to use AlertManager clustering with Kubernetes.
Closing, as v0.16.0 (v0.16.2 being the latest) has been released now and should fix the issue. Feel free to re-open if that isn't the case.
What did you do?
Run alertmanager as a statefulset with a
--cluster.peer
command-line flag given once per member. If initial DNS resolution of peers fails, they are forgotten and never retried. They neither show up in clusterStatus.peers, nor are established connections seen in netstat.

Each pod of a statefulset of 3 shows its own IP address in clusterStatus and reports itself as ready. Should I restart any one of them (kubectl exec pkill alert), it comes up, manages to resolve the peer IP addresses and joins the other 2, forming a cluster of 3 (those which weren't restarted probably find out about each other via gossip).

What did you expect to see?
Early DNS failures shouldn't cause isolated "islands" to form.
I suspect the problem is in alertmanager/cluster/cluster.go, lines 693 to 698 (commit 7f34cb4).
What did you see instead? Under which circumstances?
Each member of the statefulset forms a single-node cluster.
Environment
Alertmanager version: 0.15.3