Add cluster peers DNS refresh job #1428
Conversation
Rather than slowly re-inventing service discovery, if this is determined to be needed we should use the existing SD from Prometheus.
@brian-brazil fair point, but using Prom's DNS service discovery could only replace the initial peer resolution. IMO the DNS refresh job would still be needed, unless we pass the DNS names further down. But I'm all for better ways to do this, so if you have anything specific in mind I would be glad to help.
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
Left some minor comments. Thanks for looking into this @povilasv!
cluster/cluster.go
Outdated
		Help: "A counter of the number of failed cluster peer refresh attempts.",
	})
	p.refreshCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_total",
The metric name and description seem to have diverged. alertmanager_cluster_refresh_total sounds like the number of times a refresh happened, not the number of times a peer joined the cluster due to a refresh. How about alertmanager_cluster_refresh_join_total? (Not quite perfect either.)
I agree. Renamed to alertmanager_cluster_refresh_join_total and alertmanager_cluster_refresh_join_failed_total to indicate that these counters track joins.
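For concreteness, a minimal sketch of how the renamed counters might be declared with the Prometheus Go client. The variable names and help strings here are assumptions, not necessarily the PR's exact code:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Counts peers joined back into the cluster by the refresh job.
	refreshJoinCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_join_total",
		Help: "A counter of the number of cluster peers joined via refresh.",
	})
	// Counts refresh join attempts that failed.
	refreshJoinFailedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_join_failed_total",
		Help: "A counter of the number of failed cluster peer joins during refresh.",
	})
)
```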
cmd/alertmanager/main.go
Outdated
@@ -162,6 +162,7 @@ func main() {
		settleTimeout        = kingpin.Flag("cluster.settle-timeout", "Maximum time to wait for cluster connections to settle before evaluating notifications.").Default(cluster.DefaultPushPullInterval.String()).Duration()
		reconnectInterval    = kingpin.Flag("cluster.reconnect-interval", "Interval between attempting to reconnect to lost peers.").Default(cluster.DefaultReconnectInterval.String()).Duration()
		peerReconnectTimeout = kingpin.Flag("cluster.reconnect-timeout", "Length of time to attempt to reconnect to a lost peer.").Default(cluster.DefaultReconnectTimeout.String()).Duration()
		refreshInterval      = kingpin.Flag("cluster.refresh-interval", "Interval between attempting to refresh cluster.peers DNS records.").Default(cluster.DefaultReconnectInterval.String()).Duration()
Would a good default value be enough for now, or is a custom configuration necessary for most environments?
👍 I think 30s should be fine for most environments. Typically alertmanager is quick to start, so IMO anything longer than that would slow down startup/gossip settling.
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
Just a small follow-up.
Any thoughts from the others?
cluster/cluster.go
Outdated
@@ -112,6 +118,7 @@ func Join(
	probeInterval time.Duration,
	reconnectInterval time.Duration,
	reconnectTimeout time.Duration,
	refreshInterval time.Duration,
As this is not configurable via a command line flag anymore, there is no reason for the parameter, right?
Why are you restarting all alertmanagers at once? Kubernetes provides many options to control the deployment speed and ensure that DNS is updated before continuing with the next instance restart; for example, the rollout speed can be controlled through the deployment's update settings. I don't see why any additional alertmanager functionality is necessary to ensure it can be safely deployed in Kubernetes.
@grobie the problem is not about deployment in Kubernetes; that's just a way to reproduce it.
Force-pushed e11d28a to 6e5e827.
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
Force-pushed 6e5e827 to 9c69c38.
Note to my future self and maybe others here as well: we (thanks @metalmatze for debugging) have just hit the same issue (depending on DNS being consistent instead of eventually consistent) while updating an Alertmanager cluster (3 instances) on a Kubernetes cluster deployed via the Prometheus Operator.
This partition ({new-0}, {new-1,new-2}) would eventually resolve by refreshing DNS entries (this PR). The alternative fix suggested by @grobie is to leverage the Kubernetes readiness check. That would not actually depend on Alertmanager reporting that it discovered its peers, but rather add a simple delay, hoping for DNS to reach consistency in the meantime. I agree with @povilasv that depending on an eventually consistent system in a consistent fashion is a design flaw. Before reinventing the wheel here, I think @brian-brazil also has a good point with:
I will take a look at how easily that can be achieved. In the meantime I am curious what your thoughts are.
I disagree with #1428 (comment). For one, there are other environments than Kubernetes out there with varying capabilities, and the proposed solution is a hack at best and still racy. Currently discovery is so broken that people have to apply hacks and do racy things that are very fragile. I think we should attempt to fix what we have right now by actually re-querying DNS, making a current release of Alertmanager usable again, and then immediately start working on re-using the Prometheus service discovery module. I do agree that this is probably the best way forward, but it is also a "more complicated than it sounds" type of issue/feature (I would probably want to introduce this in parallel, as we've had a lot of problems with the SD module when we refactored it within Prometheus to be re-used for discovering Alertmanagers).
The alertmanager peers configuration does not support service discovery. It is currently intended to list the address of every peer separately by repeating that flag. The way you seem to configure alertmanager is the actual hack here: using the Kubernetes DNS service address as a single peer, expecting every peer to eventually connect to every other. In environments outside of Kubernetes, people will just use their existing means of service discovery or configuration management to configure the list of alertmanager peers explicitly. That's what we do at SoundCloud, for example, where we don't want to deploy alertmanager inside of Kubernetes. Most software doesn't have opinionated built-in service discovery, and I don't see a compelling reason to add this complexity to alertmanager. The only people who have reported issues so far with the existing means deploy alertmanager on Kubernetes, which provides all the features needed to avoid re-implementing service discovery in every single service.

Using a deployment instead of a statefulset is the wrong choice of Kubernetes controller. The statefulset provides the properties you want for an HA alertmanager deployment, guaranteeing that at most one instance is taken down at a time. It also provides stable identifiers for every instance in the set, so that you can use those as explicit peer addresses.

I still can't see the need to add the complexity of service discovery to alertmanager just to discover its peers; there is an unlimited number of different service discovery mechanisms out there. The Prometheus Operator uses the worst possible features of Kubernetes to configure and deploy alertmanager, and I don't see why alertmanager needs to be fixed here instead of the Operator itself @brancz @mxinden.
The fact that you reference the Operator as using deployments shows that you haven't actually looked at it, so I'm going to ask you to stay respectful in your wording. We do exactly what you described as the right way to deploy Alertmanager on Kubernetes (with statefulsets and consistent DNS pod identity). We can introduce additional rollout delays, but that assumes the DNS server always respects TTLs precisely, which from experience has not always been the case, so we are looking for additional hardening of existing functionality. Arguing that DNS records should be re-resolved once in a while, because DNS is not a consistent system, holds regardless of the Prometheus Operator or even Kubernetes. Whether we want the full service discovery module from Prometheus is, as far as I can tell, up for discussion; I have just expressed interest, but I understand not wanting that heavy functionality as well. In my opinion that's a separate topic from fixing existing functionality.
Point taken. That was poor argumentation; I apologize.
Alright, let's do it.
@@ -391,6 +408,51 @@ func (p *Peer) reconnect() {
	}
}

func (p *Peer) handleRefresh(d time.Duration) {
If we're talking about proper DNS support for alertmanager, it would be better to respect the TTL of the record as advertised by the authority.
Would you shorten the interval to 10s or 15s?
Even though I'm sometimes running into this problem of members not finding each other, I still think that 30s is enough for the cluster to heal.
Every time this happened I got some duplicate alerts, which is something I can live with if it's fixed within < 30s.
@grobie It looks like the Go net package doesn't expose TTL values: https://golang.org/pkg/net/#IPAddr.
I guess the only way to get the TTL value is to use something like https://stackoverflow.com/a/48997354, which would look quite cumbersome.
I personally prefer to keep the current approach, as it's the simpler solution, but I can prepare a change if you want.
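For context, a rough sketch of what reading TTLs would involve, along the lines of the Stack Overflow answer above: a raw DNS query through a third-party library such as github.com/miekg/dns, since the standard library resolver hides TTLs. The record name and nameserver address below are placeholders, and this is not code from the PR:

```go
package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Ask for A records directly so the answer's TTL is visible;
	// net.LookupIP would discard it.
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("alertmanager.example.svc.cluster.local"), dns.TypeA)

	c := new(dns.Client)
	// Placeholder nameserver address; in practice this would be the
	// resolver configured in /etc/resolv.conf.
	in, _, err := c.Exchange(m, "10.96.0.10:53")
	if err != nil {
		log.Fatal(err)
	}
	for _, rr := range in.Answer {
		if a, ok := rr.(*dns.A); ok {
			fmt.Printf("ip=%s ttl=%ds\n", a.A, a.Hdr.Ttl)
		}
	}
}
```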
This ends up relying on the internal resolver, which should be properly caching/refreshing responses. I would say we can shorten the time.
👍 Shortened to 15s.
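As a side note, the interval-driven part is just a standard ticker loop. A minimal, self-contained sketch of a handleRefresh-style loop; the names here are assumptions, not the PR's exact code:

```go
package cluster

import "time"

// runRefreshLoop calls refresh on every tick until stopc is closed.
// Illustrative sketch of a handleRefresh-style loop.
func runRefreshLoop(interval time.Duration, stopc <-chan struct{}, refresh func()) {
	t := time.NewTicker(interval)
	defer t.Stop()

	for {
		select {
		case <-stopc:
			return
		case <-t.C:
			refresh()
		}
	}
}
```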
cluster/cluster.go
Outdated
@@ -97,6 +102,7 @@ const (
	DefaultProbeInterval     = 1 * time.Second
	DefaultReconnectInterval = 10 * time.Second
	DefaultReconnectTimeout  = 6 * time.Hour
	DefaultRefreshInterval   = 30 * time.Second
This seems to be quite long to prevent a partition during deployment.
cluster/cluster.go
Outdated
		if !isPeerFound {
			if _, err := p.mlist.Join([]string{peer}); err != nil {
				p.failedRefreshCounter.Inc()
				level.Debug(logger).Log("result", "failure", "addr", peer)
I would argue that this could also be an Info or Warn.
👍 Changed to Warn.
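To make the behavior discussed above concrete, here is a simplified, self-contained sketch of a single refresh pass: re-resolve the originally configured peers and attempt to join any resolved address that is not already a member, logging failures at Warn level. Names and structure are assumptions and differ from the PR's actual Peer method; the real change also increments the join/failed-join counters where noted:

```go
package cluster

import (
	"fmt"
	"net"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
	"github.com/hashicorp/memberlist"
)

// refreshOnce re-resolves the configured peer addresses ("host:port", where
// host may be a DNS name) and tries to join any resolved address missing
// from the current member list. Illustrative sketch only.
func refreshOnce(logger log.Logger, ml *memberlist.Memberlist, knownPeers []string) {
	// Index current members by "ip:port".
	members := map[string]struct{}{}
	for _, m := range ml.Members() {
		members[net.JoinHostPort(m.Addr.String(), fmt.Sprintf("%d", m.Port))] = struct{}{}
	}

	for _, peer := range knownPeers {
		host, port, err := net.SplitHostPort(peer)
		if err != nil {
			continue
		}
		ips, err := net.LookupIP(host)
		if err != nil {
			level.Warn(logger).Log("msg", "refresh: resolve failed", "peer", peer, "err", err)
			continue
		}
		for _, ip := range ips {
			addr := net.JoinHostPort(ip.String(), port)
			if _, ok := members[addr]; ok {
				continue // already a cluster member
			}
			// DNS knows about this peer but memberlist does not: try to join it.
			// The real implementation increments the refresh join/failed-join
			// counters at these points.
			if _, err := ml.Join([]string{addr}); err != nil {
				level.Warn(logger).Log("result", "failure", "addr", addr, "err", err)
			} else {
				level.Debug(logger).Log("result", "success", "addr", addr)
			}
		}
	}
}
```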
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
Force-pushed 9452c67 to e955b4c.
I have been testing this on a minikube cluster with increased CoreDNS caching time, and @metalmatze and I have tested this on a vanilla Kubernetes cluster. In addition, I have run this through the Prometheus Operator test suite and added a specific test case to cover this use case (prometheus-operator/prometheus-operator#2145).
This looks good to me. Any further comments by others?
You've been looking after this one; if you're happy with it then it's a 👍 from me.
I'm fine with the change. I think that eventually we should leverage the DNS service discovery of Prometheus, but for now it can't be integrated without pulling in all the SD packages...
Thanks @povilasv for the patch. Thanks everyone for the collaboration and discussions.
Adds a job which runs periodically and refreshes cluster.peer DNS records. The problem is that when you restart all of the alertmanager instances in an environment like Kubernetes, DNS may contain the old alertmanager instance IPs but, at startup (when Join() happens), none of the new instance IPs. Since DNS is not empty at startup, resolvePeers (with waitIfEmpty=true) will return, and "islands" of single alertmanager instances will form. Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
@povilasv can you help? I'm seeing: caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="gmail-notifications/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
Adds a job which runs periodically and refreshes cluster.peer DNS records.
REF #1449
The problem is that when you restart all of the alertmanager instances in an environment like Kubernetes, DNS may contain the old alertmanager instance IPs but, at startup (when Join() happens), none of the new instance IPs. Since DNS is not empty at startup, resolvePeers (with waitIfEmpty=true) will return, and "islands" of single alertmanager instances will form.

All alertmanager metrics endpoints show:
alertmanager_cluster_members 1
Here are some logs:
logs of alertmanager1:
alertmanager2:
my k8s config:
To reproduce, just run alertmanager in Kubernetes with a headless service and then:
k delete po --force --grace-period=0 -l app=alertmanager
After applying the change and setting the refresh interval to ~5 minutes, I can see that the nodes joined back after 5 minutes via the refresh job, as the counter increased: