fix concurrent read and wirte group error #1447

giant-panda666 · 2018-07-01T07:22:03Z

No lock is held when an aggrGroup is added to the group map. If another goroutine calls Dispatcher.Groups at the same time, or Dispatcher.run is cleaning up aggrGroups, Alertmanager will panic.

Signed-off-by: denghuan <denghuan@actionsky.com>

giant-panda666 · 2018-07-04T00:14:44Z

@juliusv Could you please confirm whether this problem exists?

stuartnelson3 · 2018-07-04T04:43:30Z

Sorry for the delay, I'll look at this soon. Adding a testcase that fails because of the race detector would be helpful and prevent a regression.

simonpasquier · 2018-07-04T08:51:31Z

dispatch/dispatch.go

@@ -271,7 +271,9 @@ func (d *Dispatcher) processAlert(alert *types.Alert, route *Route) {
 	ag, ok := group[fp]


This part should probably be protected too. It might be simpler to just add a defer after the first lock (L262):

    fp := groupLabels.Fingerprint()
    d.mtx.Lock()
    defer d.mtx.Unlock()
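The suggestion above can be sketched in isolation as follows. This is a minimal illustration, not the actual Dispatcher code: `groupStore`, `getOrCreate`, and the `int` values are hypothetical stand-ins for the dispatcher, `processAlert`, and the aggregation groups. The point is only that taking the lock once and deferring the unlock covers both the map lookup and the map write.

```go
package main

import (
	"fmt"
	"sync"
)

// groupStore is a hypothetical stand-in for the Dispatcher's group map
// and the mutex that guards it.
type groupStore struct {
	mtx    sync.Mutex
	groups map[string]int
}

// getOrCreate looks up fp and creates an entry if none exists.
// The lock is taken once and released by defer, so the read and the
// write happen under the same critical section.
func (s *groupStore) getOrCreate(fp string) int {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	v, ok := s.groups[fp]
	if !ok {
		v = len(s.groups) + 1
		s.groups[fp] = v
	}
	return v
}

func main() {
	s := &groupStore{groups: map[string]int{}}

	// Many goroutines hit the same ten keys concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.getOrCreate(fmt.Sprintf("fp-%d", i%10))
		}(i)
	}
	wg.Wait()

	// All goroutines have finished, so reading the map here is safe.
	fmt.Println(len(s.groups)) // prints 10
}
```

Running a sketch like this with `go run -race` is one way to check that no unsynchronized map access remains.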

giant-panda666 · 2018-07-04T11:55:53Z

@stuartnelson3 OK, I'll add a test case.
@simonpasquier This is the only place that writes to the group map, so I only added the lock there. But adding a defer after the first lock is a good option, so I will change to that.

Signed-off-by: denghuan <denghuan@actionsky.com>

giant-panda666 · 2018-07-06T15:44:32Z

@stuartnelson3 I'm sorry, but I can't add a test case because the issue is very hard to reproduce. It occurred only once in my production environment, and the panic was caused by a concurrent map read and write.

simonpasquier · 2018-07-09T08:36:24Z

👍 for me. Looking at the code it is indeed quite complicated to add a test that will catch the issue in a deterministic fashion.

stuartnelson3 · 2018-07-10T15:13:31Z

Looks good.

In general, it feels like we're throwing locks all over the place because we're directly accessing these maps. It might simplify things to have an abstraction around a map that manages this, and access that abstraction from our code.
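One possible shape for such an abstraction is sketched below. It is an assumption for illustration only, not code from this PR or from Alertmanager: `groupMap`, its methods, and the placeholder `aggrGroup` struct are all hypothetical names. The idea is that callers never touch the underlying map, so every access goes through the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// aggrGroup is a hypothetical placeholder for the real aggregation group type.
type aggrGroup struct{ name string }

// groupMap is a hypothetical wrapper that owns both the map and its lock.
type groupMap struct {
	mtx sync.RWMutex
	m   map[string]*aggrGroup
}

func newGroupMap() *groupMap {
	return &groupMap{m: make(map[string]*aggrGroup)}
}

// Get returns the group stored under fp, if any.
func (g *groupMap) Get(fp string) (*aggrGroup, bool) {
	g.mtx.RLock()
	defer g.mtx.RUnlock()
	ag, ok := g.m[fp]
	return ag, ok
}

// GetOrCreate returns the existing group for fp or stores the result of create.
func (g *groupMap) GetOrCreate(fp string, create func() *aggrGroup) *aggrGroup {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	if ag, ok := g.m[fp]; ok {
		return ag
	}
	ag := create()
	g.m[fp] = ag
	return ag
}

// Delete removes the group stored under fp.
func (g *groupMap) Delete(fp string) {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	delete(g.m, fp)
}

// Len reports the number of stored groups.
func (g *groupMap) Len() int {
	g.mtx.RLock()
	defer g.mtx.RUnlock()
	return len(g.m)
}

func main() {
	gm := newGroupMap()

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fp := fmt.Sprintf("fp-%d", i%5)
			gm.GetOrCreate(fp, func() *aggrGroup { return &aggrGroup{name: fp} })
		}(i)
	}
	wg.Wait()

	fmt.Println(gm.Len()) // prints 5
	gm.Delete("fp-0")
	fmt.Println(gm.Len()) // prints 4
}
```

With a wrapper like this, a forgotten lock becomes a compile-time impossibility rather than a review finding, at the cost of an extra layer for operations that need to hold the lock across several steps.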

* fix concurrent read and wirte group
  Signed-off-by: denghuan <denghuan@actionsky.com>
* make lock more elegant
  Signed-off-by: denghuan <denghuan@actionsky.com>

fix concurrent read and wirte group

573c464

Signed-off-by: denghuan <denghuan@actionsky.com>

giant-panda666 changed the title ~~fix concurrent read and wirte group~~ fix concurrent read and wirte group error Jul 3, 2018

simonpasquier reviewed Jul 4, 2018


make lock more elegant

b51fe2d

Signed-off-by: denghuan <denghuan@actionsky.com>

mxinden approved these changes Jul 10, 2018


simonpasquier mentioned this pull request Jul 10, 2018

config: fix regression with Pager Duty #1455

Merged

stuartnelson3 approved these changes Jul 10, 2018


stuartnelson3 merged commit f3bc41d into prometheus:master Jul 10, 2018

mxinden mentioned this pull request Jul 10, 2018

*: Cut 0.15.1 #1462

Merged
