memberlist gossip #1389
Do we want to re-gossip new/updated silences too?
cluster/cluster.go
@@ -107,7 +107,7 @@ func Join(
 		readyc: make(chan struct{}),
 		logger: l,
 	}
-	p.delegate = newDelegate(l, reg, p)
+	p.delegate = newDelegate(l, reg, p, len(knownPeers)/2)
nit, but I would do the computation here rather than add a new argument to newDelegate().
You mean setting RetransmitMulti on p.delegate?
Sorry for being inaccurate. I meant:
retransmit := len(knownPeers)/2
if retransmit < 3 {
retransmit = 3
}
p.delegate = newDelegate(l, reg, p, retransmit)
Yeah, that would make sense.
Updated. When running this branch at SC with only nflog entries being further gossiped, we were seeing messages queued up. Newer messages are sent preferentially by memberlist, but it's still an issue that we have an unbounded queue. Ideally, the queue would be able to empty itself, but that's not what I was seeing. I'll investigate this.
I agree with keeping the number of configuration options small (for now).
What is your reasoning for retransmit := len(knownPeers) / 2?
Thanks a lot for working on this!
Note: The probability calculation I was doing was based on changing
Each gossip interval will send messages to all machines picked by
Probability of a machine not being gossiped to during a gossip interval
Caveat: This might be wrong :)
Play along at:
Tweaking the numbers available (probabilities are the chance of not receiving a message, given as a percentage out of 100):
The testing was far from thorough given that I doubt anyone runs more than 10 instances, but it seemed like
If a peer receives an nflog that it hasn't seen before, queue the message and propagate it further to other peers. This should ensure that all peers within a cluster receive all gossip messages. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
For alertmanagers that are brought up with a list of peers, set the number of message retransmits to be half of that number. If there are no peers on start, or there are few, continue to use the default value of 3. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
During a gossip, we send messages to at most GossipNodes nodes. If possible, we want a message to reach all nodes as soon as possible. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
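The first commit in this list re-queues an nflog entry only when the receiving peer has not seen it before. A minimal sketch of that idea follows, using hypothetical types and names; it is not alertmanager's nflog or cluster code and leaves out merging, expiry, and the actual broadcast.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for a cluster member. It tracks which
// entry IDs it has already seen and which ones are waiting to be gossiped on.
type peer struct {
	seen  map[string]struct{}
	queue []string
}

func newPeer() *peer {
	return &peer{seen: make(map[string]struct{})}
}

// receive records an incoming entry. It queues the entry for further gossip
// only the first time the ID is seen and reports whether it was queued.
func (p *peer) receive(id string) bool {
	if _, ok := p.seen[id]; ok {
		// Already known: do not re-queue, otherwise messages would
		// bounce around the cluster forever.
		return false
	}
	p.seen[id] = struct{}{}
	p.queue = append(p.queue, id)
	return true
}

func main() {
	p := newPeer()
	for _, id := range []string{"nflog-1", "nflog-2", "nflog-1"} {
		fmt.Printf("entry=%s queued=%t\n", id, p.receive(id))
	}
	fmt.Println("queue length:", len(p.queue))
}
```

The seen-check is what keeps re-gossip from looping; it does not by itself bound the queue, which is the separate concern raised in this PR.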
5f9378b to 4125ca8
These are useful as a direct indication of CPU contention and task scheduler latency.
Handy references:
- https://github.com/torvalds/linux/blob/master/Documentation/scheduler/sched-stats.txt
- https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.taskscheduler.html
procfs is updated to pull in the enabling change: prometheus/procfs#186 Signed-off-by: Phil Frost <phil@postmates.com>
Meant to address #1387.
One current issue is that peer0 seems to have a few notifications queued but hasn't cleared them in over an hour. Some way of clearing old messages is probably needed.
The method for choosing RetransmitMulti is also something to be discussed. It's an important value, but I'm wary of exposing too many configuration options in alertmanager to alter memberlist. I think we need to come up with a "smart" value that also doesn't overwhelm the transmit queues.
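To make the trade-off concrete, here is a small sketch of how the multiplier could translate into a per-message retransmit count. It assumes, to the best of my understanding of memberlist, that the limit scales as the multiplier times ceil(log10(n + 1)); the function below is a local re-implementation for illustration rather than a call into the library, and `retransmitMultFor` is a hypothetical helper that applies the half-of-known-peers rule with a floor of 3 discussed in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// retransmitMultFor follows the rule discussed in this PR: half the number
// of known peers, never less than 3. The helper name is hypothetical.
func retransmitMultFor(knownPeers int) int {
	mult := knownPeers / 2
	if mult < 3 {
		mult = 3
	}
	return mult
}

// retransmitLimit is a local sketch of how a gossip library might scale the
// number of retransmits with cluster size: mult * ceil(log10(n + 1)).
// This is an assumption about memberlist's behavior, not its actual code.
func retransmitLimit(mult, n int) int {
	nodeScale := math.Ceil(math.Log10(float64(n + 1)))
	return mult * int(nodeScale)
}

func main() {
	// For simplicity the known peer count doubles as the cluster size here.
	for _, n := range []int{3, 6, 10, 20} {
		mult := retransmitMultFor(n)
		fmt.Printf("nodes=%d mult=%d limit=%d\n", n, mult, retransmitLimit(mult, n))
	}
}
```

A larger multiplier means each message stays in the transmit queue for more gossip rounds, so a value that grows with the peer count improves coverage at the cost of longer queues.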