Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix concurrent read and wirte group error #1447

Merged
merged 2 commits into from
Jul 10, 2018

Conversation

giant-panda666
Copy link
Contributor

There is no lock when adding aggrGroup to group map. If another goroutine call Dispatcher.Groups or Dispatcher.run clear up aggrGroups, alertmanager will be panic.

Signed-off-by: denghuan <denghuan@actionsky.com>
@giant-panda666 giant-panda666 changed the title fix concurrent read and wirte group fix concurrent read and wirte group error Jul 3, 2018
@giant-panda666
Copy link
Contributor Author

@juliusv Could you please confirm if this problem is existed?

@stuartnelson3
Copy link
Contributor

Sorry for the delay, I'll look at this soon. Adding a testcase that fails because of the race detector would be helpful and prevent a regression.

@@ -271,7 +271,9 @@ func (d *Dispatcher) processAlert(alert *types.Alert, route *Route) {
ag, ok := group[fp]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part should probably be protected too. It might be simpler to just add a defer after the first lock (L262):

fp := groupLabels.Fingerprint()
 
d.mtx.Lock()
defer d.mtx.Unlock()

@giant-panda666
Copy link
Contributor Author

@stuartnelson3 ok, I'll add a testcase.
@simonpasquier this is the only place to write group, so i just add lock to there. But adding a defer after the first lock is a good choice, i will change to it.

Signed-off-by: denghuan <denghuan@actionsky.com>
@giant-panda666
Copy link
Contributor Author

@stuartnelson3 I'm sorry that I can't add a testcase because it's very hard to recurrent. It only occurred once in my production, and the panic reason is that concurrent read and write map.

@simonpasquier
Copy link
Member

👍 for me. Looking at the code it is indeed quite complicated to add a test that will catch the issue in a deterministic fashion.

@stuartnelson3
Copy link
Contributor

Looks good.

In general, it feels like we're throwing locks all over the place because we're directly accessing these maps. It might simplify things to have an abstraction around a map that manages this, and access that abstraction from our code.

@stuartnelson3 stuartnelson3 merged commit f3bc41d into prometheus:master Jul 10, 2018
mxinden pushed a commit to mxinden/alertmanager that referenced this pull request Jul 10, 2018
* fix concurrent read and wirte group

Signed-off-by: denghuan <denghuan@actionsky.com>

* make lock more elegant

Signed-off-by: denghuan <denghuan@actionsky.com>
@mxinden mxinden mentioned this pull request Jul 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants