race condition with gossiped silences causes crash #982
Looks like our Alertmanager is crashing when handling thousands of silences. The crash reason is listed: it looks like it's something in our gossip implementation.

Comments
From the stack trace, it looks like uncoordinated access to a hash map.
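For readers who haven't seen this failure mode, here is a minimal, self-contained sketch of uncoordinated map access (the type and field names are made up for illustration; this is not Alertmanager's actual code). One goroutine keeps writing to a map while another iterates it, and the Go runtime aborts with a fatal "concurrent map iteration and map write" (or "concurrent map read and map write") error, which is the class of crash the stack trace points at:

```go
package main

import "strconv"

// gossipState is a made-up stand-in for the silence data gossiped between peers.
type gossipState struct {
	silences map[string]string
}

// encode walks the map, much like an Encode() implementation would when
// serializing state for the gossip layer.
func (s *gossipState) encode() int {
	n := 0
	for range s.silences {
		n++
	}
	return n
}

func main() {
	s := &gossipState{silences: map[string]string{}}

	// Writer: simulates new silences arriving via the API.
	go func() {
		for i := 0; ; i++ {
			s.silences[strconv.Itoa(i)] = "active"
		}
	}()

	// Reader: simulates the gossip layer encoding state at the same time.
	// Without any locking, the Go runtime aborts the whole process.
	for {
		s.encode()
	}
}
```

Unlike many data races that silently corrupt data, the runtime's map checks turn this one into an immediate crash, which is why heavy silence traffic surfaces it so visibly.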
After taking a quick look, I'm surprised this hasn't surfaced until now.
Yes, I wonder whether this is something in mesh or in the Alertmanager implementation (which is fairly simple).
So if I read it correctly, Encode is called on the same gossipData twice, concurrently. As far as I understood, access to it should be synchronized: https://github.com/weaveworks/mesh/blob/50329b4166c9c7d4024470996fa55956902d7815/gossip.go#L55 doesn't seem to say anything about this, but the examples clearly do that: https://github.com/weaveworks/mesh/blob/50329b4166c9c7d4024470996fa55956902d7815/examples/increment-only-counter/state.go#L58
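For illustration, here is a rough sketch of the locking pattern the linked increment-only-counter example follows, adapted to a hypothetical gossip state type (the names below are illustrative, not Alertmanager's actual types; the Encode() [][]byte shape mirrors mesh's GossipData interface). Every method that touches the shared map, including Encode, takes the same mutex, so concurrent Encode calls can no longer race with writers:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"sync"
)

// gossipData is a hypothetical stand-in for the state gossiped between peers.
type gossipData struct {
	mtx  sync.RWMutex
	data map[string]int64
}

// Encode serializes the state under the read lock, so it is safe to call
// from multiple gossip goroutines at once.
func (g *gossipData) Encode() [][]byte {
	g.mtx.RLock()
	defer g.mtx.RUnlock()

	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(g.data); err != nil {
		panic(err)
	}
	return [][]byte{buf.Bytes()}
}

// set mutates the state under the write lock.
func (g *gossipData) set(key string, val int64) {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	g.data[key] = val
}

func main() {
	g := &gossipData{data: map[string]int64{}}

	// Writer: simulates state being created or updated locally.
	go func() {
		for i := int64(0); ; i++ {
			g.set("peer-a", i)
		}
	}()

	// Reader: simulates the gossip layer encoding state concurrently.
	for i := 0; i < 1000; i++ {
		g.Encode()
	}
}
```

Using an RWMutex lets multiple Encode calls proceed in parallel while still excluding writers, which keeps the frequently exercised gossip path cheap.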
It's definitely in our implementation.
If nobody is currently working on a fix, I'll take this one.
Sounds good!
This should fix silence/silence.go prometheus#982
@fabxc @brancz Is Alertmanager receiving any attention from CoreOS in the near future? It would be great to go through and clean these up, but I'm not aware of the time necessary for the fixes. Perhaps @iksaif could expand his work in #984 :)
@stuartnelson3 we actually have some time planned regarding Alertmanager robustness, so we should get to this relatively soon. I'm hoping to get started next week.
Note: the patch may introduce a deadlock; I'll try to diagnose that tomorrow. It's not very common, though.
Yes, we have a deadlock.
* silences: avoid deadlock. Calling gossip.GossipBroadcast() will cause a deadlock if there is a currently executing OnBroadcast* function. See #982
* silence_test: better unit test to detect deadlocks
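As a rough illustration of one way to avoid this class of deadlock (the types below are hypothetical; this is not the actual silence.go change), the idea is to mutate local state and build the outgoing update while holding the lock, then release the lock before calling into the gossip layer, so that an OnGossipBroadcast callback contending for the same mutex cannot wedge the broadcast path:

```go
package main

import (
	"fmt"
	"sync"
)

// broadcaster is a hypothetical stand-in for mesh's gossip handle.
type broadcaster interface {
	GossipBroadcast(update []byte)
}

// silences is an illustrative state holder, not the real silence.go type.
type silences struct {
	mtx    sync.Mutex
	st     map[string]string
	gossip broadcaster
}

// setSilence updates local state under the lock, but deliberately performs
// the broadcast only after the lock is released, so an OnGossipBroadcast
// callback that needs the same mutex can never leave the two code paths
// waiting on each other.
func (s *silences) setSilence(id, state string) {
	s.mtx.Lock()
	s.st[id] = state
	update := []byte(fmt.Sprintf("%s=%s", id, state))
	s.mtx.Unlock() // release before touching the gossip layer

	s.gossip.GossipBroadcast(update)
}

// OnGossipBroadcast is the callback the gossip layer invokes with remote
// updates; it takes the same mutex as setSilence.
func (s *silences) OnGossipBroadcast(update []byte) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.st[string(update)] = "merged"
}

// noopBroadcaster just prints instead of talking to a real mesh router.
type noopBroadcaster struct{}

func (noopBroadcaster) GossipBroadcast(update []byte) {
	fmt.Printf("broadcast %q\n", update)
}

func main() {
	s := &silences{st: map[string]string{}, gossip: noopBroadcaster{}}
	s.setSilence("silence-1", "active")
}
```

Whether the broadcast happens after unlocking or is pushed onto a separate goroutine, the key point is the same: don't re-enter the gossip layer from code that a gossip callback may currently be blocking on.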