Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gossip] Don't merge expired gossip messages #1631

Merged
merged 3 commits into from
Nov 21, 2018

Conversation

stuartnelson3
Copy link
Contributor

If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.

fixes #1624

The issue was in the merge function: https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L752-L756

Expired silences would get deleted every 15 minutes (the hard-coded maintenance interval), but on each full-state sync an alertmanager would get a bunch of expired silences. Since they didn't exist in its state cache, they would be merged back in. Since the merge happens every minute, they were seemingly guaranteed to stick around. If all the alertmanagers were started at the same time they might GC all the silences before a full state sync, but if they're offset in their maintenance timing, they would keep sharing the same expired silences with each other. This explains why I saw the silences "GC'd", because I had looked within this magical window of "maintenance has run, but a full state sync hasn't yet", but then they came back.

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, that you caught the bug. Would you mind adding a unit test?

silence/silence.go Outdated Show resolved Hide resolved
@stuartnelson3 stuartnelson3 force-pushed the stn/not-expiring-silences branch from d1f7ce2 to 066e9ac Compare November 20, 2018 18:35
If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/not-expiring-silences branch from 066e9ac to 5f811ad Compare November 20, 2018 18:45
@stuartnelson3
Copy link
Contributor Author

Oh dear, my force-pushings are being reported!

@stuartnelson3
Copy link
Contributor Author

Also are notification logs affected too?

I expect so, we've just never noticed because they don't appear in the UI.

The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3
Copy link
Contributor Author

The code between these two is so similar, the difference is just a generic away ..

@stuartnelson3 stuartnelson3 changed the title [silences] Don't merge expired silences [gossip] Don't merge expired gossip messages Nov 20, 2018
With the default 0 retention, the alerts would not
be merged.

Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@stuartnelson3 stuartnelson3 merged commit 2026e4a into master Nov 21, 2018
@stuartnelson3 stuartnelson3 deleted the stn/not-expiring-silences branch November 21, 2018 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alertmanager not GC'ing silences
4 participants