-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[gossip] Don't merge expired gossip messages #1631
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, that you caught the bug. Would you mind adding a unit test?
d1f7ce2
to
066e9ac
Compare
If they're expired, they should be cleaned up on the next GC cycle, but merging them in means that they'll probably be gossip'd continually between the cluster members. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
066e9ac
to
5f811ad
Compare
Oh dear, my force-pushings are being reported! |
I expect so, we've just never noticed because they don't appear in the UI. |
The code for nflog was also constantly re-adding nflogs to the internal memory store, the same as the silence code was. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
The code between these two is so similar, the difference is just a generic away .. |
With the default 0 retention, the alerts would not be merged. Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.
fixes #1624
The issue was in the
merge
function: https://github.com/prometheus/alertmanager/blob/master/silence/silence.go#L752-L756Expired silences would get deleted every 15 minutes (the hard-coded maintenance interval), but on each full-state sync an alertmanager would get a bunch of expired silences. Since they didn't exist in its state cache, they would be merged back in. Since the merge happens every minute, they were seemingly guaranteed to stick around. If all the alertmanagers were started at the same time they might GC all the silences before a full state sync, but if they're offset in their maintenance timing, they would keep sharing the same expired silences with each other. This explains why I saw the silences "GC'd", because I had looked within this magical window of "maintenance has run, but a full state sync hasn't yet", but then they came back.