
Add cluster peers DNS refresh job #1428

Merged (6 commits) on Nov 23, 2018

Conversation

@povilasv (Contributor) commented Jun 21, 2018

Adds a job which runs periodically and refreshes the cluster.peer DNS records.

REF #1449

The problem is that when you restart all of the Alertmanager instances in an environment like Kubernetes, DNS may still contain the old instance IPs but, at startup (when Join() happens), none of the new instance IPs. Because DNS is not empty at that point, resolvePeers (with waitIfEmpty=true) returns immediately, and "islands" of single Alertmanager instances form.

All Alertmanager metrics endpoints show: alertmanager_cluster_members 1

Here are some logs:

Logs of alertmanager 1:

level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305

Logs of alertmanager 2:

level=info ts=2018-06-21T14:35:44.824589916Z caller=main.go:140 msg="Starting Alertmanager" version="(version=0.15.0-rc.1, branch=HEAD, revision=acb111e812530bec1ac6d908bc14725793e07cf3)"
level=info ts=2018-06-21T14:35:44.824688253Z caller=main.go:141 build_context="(go=go1.10, user=root@f278953f13ef, date=20180323-13:05:10)"
level=warn ts=2018-06-21T14:36:04.840918631Z caller=cluster.go:129 component=cluster msg="failed to join cluster" err="2 errors occurred:\n\n* Failed to join 10.2.19.164: dial tcp 10.2.19.164:8001: i/o timeout\n* Failed to join 10.2.43.52: dial tcp 10.2.43.52:8001: i/o timeout"
level=info ts=2018-06-21T14:36:04.841778884Z caller=cluster.go:249 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2018-06-21T14:36:04.84242773Z caller=main.go:270 msg="Loading configuration file" file=/etc/alertmanager/config.yml
level=info ts=2018-06-21T14:36:04.849504109Z caller=main.go:346 msg=Listening address=0.0.0.0:9093
level=info ts=2018-06-21T14:36:06.84218093Z caller=cluster.go:274 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000165572s
level=info ts=2018-06-21T14:36:14.843072999Z caller=cluster.go:266 component=cluster msg="gossip settled; proceeding" elapsed=10.001056858s
level=info ts=2018-06-21T14:51:04.842216499Z caller=nflog.go:313 component=nflog msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.842784642Z caller=silence.go:252 component=silences msg="Running maintenance"
level=info ts=2018-06-21T14:51:04.84482936Z caller=silence.go:254 component=silences msg="Maintenance done" duration=2.046436ms size=0
level=info ts=2018-06-21T14:51:04.844844053Z caller=nflog.go:315 component=nflog msg="Maintenance done" duration=2.631928ms size=6305

My Kubernetes config:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
  namespace: sys-mon
spec:
  clusterIP: None
  ports:
  - name: cluster
    protocol: TCP
    port: 8001
    targetPort: cluster
  selector:
    app: alertmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: sys-mon
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.0-rc.1
        args:
          - --config.file=/etc/alertmanager/config.yml
          - --web.listen-address=0.0.0.0:9093
          - --storage.path=/alertmanager
          - --web.external-url=https://alertmanager.dev.uw.systems
          - --cluster.listen-address=0.0.0.0:8001
          - --cluster.peer=alertmanager-peers.sys-mon:8001
...

To reproduce, just run Alertmanager in Kubernetes with a headless service and then run kubectl delete po --force --grace-period=0 -l app=alertmanager

After applying the change and setting the refresh interval to ~5 minutes, I can see that the nodes joined back after 5 minutes via the refresh job, as the counter increased:

alertmanager_cluster_refresh_total 2
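
For illustration, here is a minimal, hypothetical sketch of such a refresh job, not the exact code in this PR: it periodically re-resolves the configured peer names and re-joins any resolved address that is not yet a cluster member. The refresher type, field names, and logging are assumptions for the example.

package cluster

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/memberlist"
)

// refresher periodically re-resolves peer DNS names and re-joins
// any address that is not currently part of the memberlist cluster.
type refresher struct {
	mlist *memberlist.Memberlist
	peers []string // DNS names with ports, e.g. "alertmanager-peers.sys-mon:8001"
}

func (r *refresher) run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			r.refresh()
		}
	}
}

func (r *refresher) refresh() {
	// Addresses currently known to memberlist.
	known := map[string]struct{}{}
	for _, n := range r.mlist.Members() {
		known[n.Addr.String()] = struct{}{}
	}
	for _, peer := range r.peers {
		host, port, err := net.SplitHostPort(peer)
		if err != nil {
			continue
		}
		ips, err := net.LookupIP(host) // fresh DNS query on every tick
		if err != nil {
			log.Printf("refresh: resolving %q failed: %v", host, err)
			continue
		}
		for _, ip := range ips {
			if _, ok := known[ip.String()]; ok {
				continue // already a cluster member
			}
			// Newly appeared address: try to (re-)join it.
			if _, err := r.mlist.Join([]string{net.JoinHostPort(ip.String(), port)}); err != nil {
				log.Printf("refresh: joining %s failed: %v", ip, err)
			}
		}
	}
}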

@povilasv changed the title from "Add cluster.refresh-interval refresh" to "Add cluster peers DNS refresh job" on Jun 21, 2018
@brian-brazil (Contributor):

Rather than slowly re-inventing service discovery, if this is determined to be needed we should use the existing SD from Prometheus.

@povilasv (Contributor, Author) commented Jun 21, 2018

@brian-brazil fair point, but using Prometheus' DNS service discovery could only replace the resolvePeers function (https://github.com/prometheus/alertmanager/blob/master/cluster/cluster.go#L611).

IMO a DNS refresh job would still be needed, unless we pass DNS names down to memberlist.Join(), but I'm pretty sure it doesn't refresh DNS after the initial Join() run, which would leave us in the same place.

But I'm all for better ways to do this, so if you have anything specific in mind I would be glad to help.

Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) left a comment:

Left some minor comments. Thanks for looking into this @povilasv!

Help: "A counter of the number of failed cluster peer refresh attempts.",
})
p.refreshCounter = prometheus.NewCounter(prometheus.CounterOpts{
Name: "alertmanager_cluster_refresh_total",

Member:

Metric name and metric description seem to have diverged. alertmanager_cluster_refresh_total sounds like the number of times a refresh happened, not the number of times a peer joined the cluster due to a refresh. How about alertmanager_cluster_refresh_join_total? (Not quite perfect either.)

Contributor (Author):

I agree. Renamed to alertmanager_cluster_refresh_join_total and alertmanager_cluster_refresh_join_failed_total to indicate that these counters track joins.
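
For reference, the renamed counters would look roughly like this; the Help strings and registration call are my assumptions, not the exact code in the PR:

import "github.com/prometheus/client_golang/prometheus"

var (
	refreshJoinCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_join_total",
		Help: "A counter of the number of cluster peers joined via DNS refresh.", // assumed wording
	})
	refreshJoinFailedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "alertmanager_cluster_refresh_join_failed_total",
		Help: "A counter of the number of failed cluster peer joins via DNS refresh.", // assumed wording
	})
)

func init() {
	// Registration sketch; the PR wires these into the cluster Peer instead.
	prometheus.MustRegister(refreshJoinCounter, refreshJoinFailedCounter)
}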

@@ -162,6 +162,7 @@ func main() {
settleTimeout = kingpin.Flag("cluster.settle-timeout", "Maximum time to wait for cluster connections to settle before evaluating notifications.").Default(cluster.DefaultPushPullInterval.String()).Duration()
reconnectInterval = kingpin.Flag("cluster.reconnect-interval", "Interval between attempting to reconnect to lost peers.").Default(cluster.DefaultReconnectInterval.String()).Duration()
peerReconnectTimeout = kingpin.Flag("cluster.reconnect-timeout", "Length of time to attempt to reconnect to a lost peer.").Default(cluster.DefaultReconnectTimeout.String()).Duration()
refreshInterval = kingpin.Flag("cluster.refresh-interval", "Interval between attempting to refresh cluster.peers DNS records.").Default(cluster.DefaultReconnectInterval.String()).Duration()

Member:

Would a good default value be enough for now, or is a custom configuration necessary for most environments?

@povilasv (Contributor, Author) commented Jun 27, 2018:

👍 I think 30s should be fine for most environments; Alertmanager is typically quick to start, so IMO anything longer than that would slow down startup/gossip settling.

Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) left a comment:

Just a small follow-up.

Any thoughts from the others?

@@ -112,6 +118,7 @@ func Join(
probeInterval time.Duration,
reconnectInterval time.Duration,
reconnectTimeout time.Duration,
refreshInterval time.Duration,

Member:

As this is not configurable via a command line flag anymore, there is no reason for the parameter, right?

@grobie (Member) commented Jul 6, 2018

Why are you restarting all alertmanagers at once? Kubernetes provides many options to control the deployment speed and to ensure that DNS is updated before continuing with the next instance restart. For example, the deployment speed can be controlled with minReadySeconds and maxSurge, or you can switch to a StatefulSet altogether.

I don't see why any additional alertmanager functionality is necessary to ensure it can be safely deployed in Kubernetes.

@povilasv (Contributor, Author) commented Aug 1, 2018

@grobie the problem is not about deployment in Kubernetes; that's just a way to reproduce it.
The problem is that we shouldn't expect DNS to always be up to date, nor depend on DNS update speed, in order to form a complete Alertmanager cluster. Wouldn't you agree? (Sorry for the delayed response, I was on vacation.)

@povilasv force-pushed the peers-refresh branch 2 times, most recently from e11d28a to 6e5e827 on August 21, 2018 14:58
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) commented Oct 25, 2018

Note to my future-me and maybe others here as well:

We (thanks @metalmatze for debugging) have just hit the same issue (depending on DNS being consistent instead of eventually consistent) while updating an Alertmanager cluster (3 instances) on a Kubernetes cluster deployed via the Prometheus Operator.

| Alertmanager 0 | Alertmanager 1 | Alertmanager 2 |
| --- | --- | --- |
| runs 0.15.1 | runs 0.15.1 | runs 0.15.1 |
| _ | _ | updates to 0.15.2, DNS query returns IP of old 0 and old 1 |
| _ | updates to 0.15.2, DNS query returns IP of old 0 and old 3, discovers new 3 via old 0 via gossip | _ |
| updates to 0.15.2, DNS query returns IP of old 1 and old 3, does not discover new 2 or new 3 | _ | _ |

This partition ({new-0}, {new-1, new-2}) would eventually resolve by refreshing DNS entries (this PR).

An alternative fix, as suggested by @grobie, is to leverage Kubernetes readiness checks. This would not actually depend on Alertmanager reporting that it discovered its peers, but rather add a simple delay, hoping for DNS to reach consistency in the meantime.

I agree with @povilasv that depending on an eventually consistent system in a consistent fashion is a design flaw.

Before reinventing the wheel here I think @brian-brazil also has a good point with:

Rather than slowly re-inventing service discovery, if this is determined to be needed we should use the existing SD from Prometheus.

I will take a look at how easily that can be achieved. In the meantime I am curious what your thoughts are.

@brancz (Member) commented Nov 5, 2018

I disagree with #1428 (comment): for one, there are environments other than Kubernetes out there, with varying capabilities, and the proposed solution is a hack at best and still racy. Currently discovery is so broken that people have to apply hacks and do racy things that are very fragile. I think we should attempt to fix what we have right now by actually re-querying DNS, making a current release of Alertmanager usable again.

Then we should immediately start working on re-using the Prometheus service discovery module. I do agree that this is probably the best way forward, but it is also a "more complicated than it sounds" type of issue/feature (I would probably want to introduce it in parallel, as we had a lot of problems with the SD module when we refactored it within Prometheus to be re-used for discovering Alertmanagers).

@grobie closed this Nov 5, 2018
@grobie reopened this Nov 5, 2018
@grobie (Member) commented Nov 5, 2018

The Alertmanager peers configuration does not support service discovery. It is currently intended that you list the address of every peer separately by repeating that flag. The way you seem to configure Alertmanager is the actual hack here: using the Kubernetes DNS service address as a single peer, expecting every peer to eventually connect to every other.

In environments outside of Kubernetes, people will just use their existing means of service discovery or configuration management to configure the list of Alertmanager peers explicitly. That's what we do at SoundCloud, for example, where we don't want to deploy Alertmanager inside of Kubernetes.

Most software doesn't have opinionated built-in service discovery, and I don't see a compelling reason to add this complexity to Alertmanager. The only people who have reported issues with the existing mechanism so far are deploying Alertmanager on Kubernetes, which provides all the features needed to avoid re-implementing service discovery functionality in every single service.

Using a Deployment instead of a StatefulSet is the wrong choice of Kubernetes controller. The StatefulSet provides the properties you want for an HA Alertmanager deployment, guaranteeing that at most one instance is taken down at a time. It also provides stable identifiers for every instance in the set, so that you can use --cluster.peer=alertmanager-1 --cluster.peer=alertmanager-2 --cluster.peer=alertmanager-3 ... in your config.

I still can't see the need to add the complexity of service discovery to Alertmanager just to discover its peers. There is an unlimited number of different service discovery mechanisms out there. The Prometheus Operator uses the worst possible features of Kubernetes to configure and deploy Alertmanager; I don't see why Alertmanager needs to be fixed here instead of the Operator itself @brancz @mxinden.

@brancz (Member) commented Nov 5, 2018

The Prometheus Operator uses the worst possible features of Kubernetes to configure and deploy Alertmanager; I don't see why Alertmanager needs to be fixed here instead of the Operator itself @brancz @mxinden.

The fact that you reference the Operator as using Deployments shows that you haven't actually looked at it, so I'm going to ask you to stay respectful in your wording.

We do exactly what you described as the right way to deploy Alertmanager on Kubernetes (with StatefulSets and consistent DNS pod identities). We could introduce additional rollout delays, but that assumes the DNS server always respects TTLs precisely, which from experience has not always been the case, so we are looking for additional hardening of existing functionality. Arguing that DNS records should be re-resolved once in a while, because DNS is not a consistent system, holds regardless of the Prometheus Operator or even Kubernetes.

Whether we want the full service discovery module from Prometheus is, as far as I can tell, up for discussion. I have just expressed interest, but I understand not wanting that heavy functionality as well. In my opinion that's a separate topic from fixing existing functionality.

@grobie (Member) commented Nov 5, 2018

The fact that you reference the Operator as using Deployments shows that you haven't actually looked at it, so I'm going to ask you to stay respectful in your wording.

Point taken. That was poor argumentation; I apologize.

Arguing that DNS records should be re-resolved once in a while, because DNS is not a consistent system, holds regardless of the Prometheus Operator or even Kubernetes.

Alright, let's do it.

@@ -391,6 +408,51 @@ func (p *Peer) reconnect() {
}
}

func (p *Peer) handleRefresh(d time.Duration) {

Member:

If we're talking about proper DNS support for alertmanager, it would be better to respect the TTL of the record as advertised by the authority.

Member:

Would you shorten the interval to 10s or 15s?
Even though I sometimes run into this problem of members not finding each other, I still think that 30s is enough for the cluster to heal.
Every time this happened I got some duplicate alerts, which is something I can live with if it's fixed within < 30s.

Contributor (Author):

@grobie It looks like the Go net package doesn't expose TTL values (https://golang.org/pkg/net/#IPAddr).
I guess the only way to get the TTL value is to use something like https://stackoverflow.com/a/48997354, which would look quite cumbersome.

I personally prefer to use this, as it's a simpler solution, but I can prepare a change if you want.
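
For context, here is a minimal sketch of what reading TTLs would require with a third-party DNS library such as github.com/miekg/dns; the resolver config path, the peer name, and the choice of an A-record query are assumptions for the example, not code from this PR:

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Use the first nameserver from the system resolver config (assumption).
	conf, err := dns.ClientConfigFromFile("/etc/resolv.conf")
	if err != nil {
		log.Fatal(err)
	}

	c := new(dns.Client)
	m := new(dns.Msg)
	// Query A records directly so the response TTLs are visible.
	m.SetQuestion(dns.Fqdn("alertmanager-peers.sys-mon"), dns.TypeA)

	resp, _, err := c.Exchange(m, conf.Servers[0]+":"+conf.Port)
	if err != nil {
		log.Fatal(err)
	}
	for _, rr := range resp.Answer {
		if a, ok := rr.(*dns.A); ok {
			fmt.Printf("ip=%s ttl=%ds\n", a.A, a.Hdr.Ttl)
		}
	}
}

Compared to a plain net.LookupIP call, this pulls in an extra dependency plus hand-rolled resolver configuration, which is the cumbersomeness mentioned above.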

Contributor:

This ends up relying on using the internal resolver, which should be properly caching/refreshing responses. I would say we can shorten the time.

Contributor (Author):

👍 Shortened to 15s.

@@ -97,6 +102,7 @@ const (
DefaultProbeInterval     = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout  = 6 * time.Hour
DefaultRefreshInterval   = 30 * time.Second

Member:

This seems to be quite long to prevent a partition during a deployment.

if !isPeerFound {
	if _, err := p.mlist.Join([]string{peer}); err != nil {
		p.failedRefreshCounter.Inc()
		level.Debug(logger).Log("result", "failure", "addr", peer)

Member:

I would argue that this could also be an Info or Warn.

@povilasv (Contributor, Author) commented Nov 6, 2018:

👍 Changed it to Warn.

Signed-off-by: Povilas Versockas <p.versockas@gmail.com>

@mxinden (Member) left a comment:

I have been testing this on a minikube cluster with increased CoreDNS caching time, and @metalmatze and I have tested this on a vanilla Kubernetes cluster. In addition, I have run this through the Prometheus Operator test suite and added a specific test case to cover this use case (prometheus-operator/prometheus-operator#2145).

This looks good to me. Any further comments by others?

@stuartnelson3 (Contributor):

This looks good to me. Any further comments by others?

You've been looking after this one; if you're happy with it, then 👍 from me.

@simonpasquier (Member) left a comment:

I'm fine with the change. I think that eventually we should leverage the DNS service discovery of Prometheus, but for now it can't be integrated without pulling in all the SD packages...

@mxinden merged commit 7f34cb4 into prometheus:master on Nov 23, 2018

@mxinden (Member) commented Nov 23, 2018

Thanks @povilasv for the patch. Thanks everyone for the collaboration and discussions.

@povilasv deleted the peers-refresh branch on November 23, 2018 09:36
dansimone pushed a commit to dansimone/alertmanager that referenced this pull request Jan 14, 2019

@XI1062-abhisheksinghal:

@povilasv can you help? I am using Prometheus, Alertmanager, Node exporter, and Grafana.
I configured Alertmanager to send alerts when one of the service instances goes down. Alerts are fired successfully, but although I have configured Gmail for receiving the alerts, I am not receiving any.
The Docker logs for Alertmanager show the error below:

caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="gmail-notifications/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
