
[SPARK-30285][CORE] Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError #26924

Closed

Conversation

wangshuo128
Contributor

@wangshuo128 wangshuo128 commented Dec 17, 2019

What changes were proposed in this pull request?

There is a deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError.

We can reproduce as follows:

  1. Post some events to LiveListenerBus
  2. Call LiveListenerBus#stop, which holds the synchronized lock of the bus, waiting until all the events are processed by listeners, then removes all the queues
  3. The event queue drains events by posting them to its listeners. If a listener is interrupted, it calls AsyncEventQueue#removeListenerOnError, which in turn calls bus.removeListener(), trying to acquire the same synchronized lock of the bus, resulting in deadlock

This PR removes the `synchronized` from `LiveListenerBus.stop` because the underlying data structures themselves are thread-safe.
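The lock-ordering problem above can be sketched outside Spark. In this minimal, illustrative model (none of these names are Spark's real classes), a "stopper" thread holds the bus monitor while waiting for the queue to drain, and the queue's dispatch thread blocks trying to enter the same monitor:

```java
import java.util.concurrent.CountDownLatch;

// Minimal sketch of the reported deadlock shape (names are illustrative,
// not Spark's real classes): the stopper holds the bus monitor inside
// stop() while "waiting for queues to drain", and the dispatch thread
// blocks trying to enter the same monitor via removeListener().
public class DeadlockShape {
    static final Object busLock = new Object();

    public static void main(String[] args) throws Exception {
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread dispatcher = new Thread(() -> {
            try {
                lockHeld.await();             // start once the stopper holds the lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (busLock) { }        // bus.removeListener() would block here
        });

        Thread stopper = new Thread(() -> {
            synchronized (busLock) {          // LiveListenerBus.stop() took this lock
                lockHeld.countDown();
                try { Thread.sleep(500); }    // stand-in for waiting on the queues
                catch (InterruptedException e) { }
            }                                 // releasing the lock avoids a real hang
        });

        stopper.start();
        dispatcher.start();
        Thread.sleep(200);                    // give the dispatcher time to block
        System.out.println("dispatcher is " + dispatcher.getState());
        stopper.join();
        dispatcher.join();
        System.out.println("done");
    }
}
```

In the real bug the stopper never releases the monitor until the queue drains, and the queue can't drain while its dispatch thread is blocked, so neither side makes progress.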

Why are the changes needed?

To fix deadlock.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT.

@wangshuo128
Contributor Author

Gentle ping @squito :)

Contributor

@squito squito left a comment


I agree with your analysis of the issue. I do think this would fix it, but I'm wondering if there is a cleaner way. @vanzin any ideas?

one general thing -- I'd replace every use of "race condition" with deadlock in the PR description.

}

@Override
public void dead_$eq(boolean dead) { }
Contributor

these methods should actually be implemented, so anybody extending this in Java gets the fix as well.

ideally we could do this so it doesn't get exposed as part of the API at all, but I can't think of a way to do that ...

if (bus.isInStop) {
// If bus is in the progress of stop, we just mark the listener as dead instead of removing
// via calling `bus.removeListener` to avoid race condition
// dead listeners will be removed eventually in `bus.stop`
Contributor

some grammar nits:

If we're in the middle of stopping the bus, we just mark the listener as dead,
instead of removing, to avoid a deadlock.
Dead listeners will be removed eventually in bus.stop

Contributor Author

done

@wangshuo128 wangshuo128 changed the title [SPARK-30285][CORE]Fix race condition between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError [SPARK-30285][CORE]Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError Dec 18, 2019
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30285][CORE]Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError [SPARK-30285][CORE] Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError Dec 19, 2019
@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115583 has finished for PR 26924 at commit c2afd63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 20, 2019

I don't like this approach because it exposes completely internal things in the public API. (Also, exposes Scala-isms in a Java class...)

I'm almost convinced that we should just remove the synchronized from LiveListenerBus.stop. The underlying data structures themselves are thread-safe... and in fact there are multiple races I spotted when stopping things (e.g. it would be possible to post an event to a queue after it has been stopped, so if it gets queued after the poison pill, nobody would see it).

But during shutdown the important event (application end) is posted on the same thread stopping the bus, so there's no race there, and that's the only event I'd be worried about.
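As a toy model of the proposed direction (this is a hedged sketch, not Spark's actual code), the change amounts to relying on an atomic stopped flag plus thread-safe collections instead of a monitor around stop():

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the fix (not Spark's real code): stop() holds no monitor,
// so a listener thread removing itself concurrently can never deadlock
// against it. The underlying structures are themselves thread-safe.
public class MiniBus {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final List<String> queues = new CopyOnWriteArrayList<>();

    boolean addQueue(String name) {
        if (stopped.get()) {
            return false;        // mirrors the stopped check in addToQueue
        }
        queues.add(name);
        return true;
    }

    void stop() {
        if (!stopped.compareAndSet(false, true)) {
            return;              // the stop body runs at most once
        }
        for (String q : queues) {
            // drain and stop each queue here
        }
        queues.clear();
    }

    boolean removeQueue(String name) {
        return queues.remove(name);   // safe to call while stop() iterates
    }

    public static void main(String[] args) {
        MiniBus bus = new MiniBus();
        System.out.println(bus.addQueue("appStatus"));   // true
        bus.stop();
        System.out.println(bus.addQueue("shared"));      // false: already stopped
    }
}
```

CopyOnWriteArrayList iteration works over a snapshot, so stop() never sees a ConcurrentModificationException even if listeners mutate the list mid-drain; the trade-off, as discussed below, is that a snapshot can miss concurrent additions.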

@wangshuo128
Contributor Author

wangshuo128 commented Dec 20, 2019

Thanks for your feedback! @squito @vanzin


I agree that this fix is sub-optimal and I'm looking forward to your expert advice.

I'm almost convinced that we should just remove the synchronized from LiveListenerBus.stop. The underlying data structures themselves are thread-safe...

Another concern is that queues is a CopyOnWriteArrayList, so a traversal could probably miss concurrent updates (e.g. miss a newly added queue while stopping). However, LiveListenerBus.addToQueue already checks the stopped status, so that is not a risk.
So it seems fine to me and I'll make the change.

@SparkQA

SparkQA commented Dec 20, 2019

Test build #115608 has finished for PR 26924 at commit 3260be1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Dec 20, 2019

Would you please update PR description according to your latest updates?

@Ngone51
Member

Ngone51 commented Dec 20, 2019

retest this please.

Contributor

@vanzin vanzin left a comment


I'm not really sold on the unit test; it seems to be racy regardless of how you write it. If you happen to hit the original bug you'd fail the test, but I'm not sure how effective the test actually is in hitting that situation. But maybe that's enough...

val suffix = if (throwInterruptedException) "throw interrupt" else "set Thread interrupted"
test(s"SPARK-30285: Fix deadlock in AsyncEventQueue.removeListenerOnError: $suffix") {
  val conf = new SparkConf(false)
    .set(LISTENER_BUS_EVENT_QUEUE_CAPACITY, 5)
Contributor

Do you really need this?

 * else count SparkListenerJobEnd numbers
 */
private class DelayInterruptingJobCounter(
    val throwInterruptedException: Boolean,
Contributor

nit: indent more

})
stoppingThread.start()
// Notify interrupting listener starts to work
interruptingListener.sleep = false
Contributor

Are you trying to make sure listeners throw the exception after "stop()" is called? That's going to be hard, and your code isn't really guaranteeing that.

You could use a CountDownLatch that you signal right before calling stop() (in the thread) to unblock the listener; that will at least narrow the race down a bit.
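A sketch of that latch idea (illustrative names, not the PR's actual test): the listener parks on a latch that the stopping thread releases immediately before the point where stop() would run, so the two events land as close together as scheduling allows:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;

// Illustrative sketch of the suggestion (not the PR's actual test): the
// listener waits on a latch that the stopping thread opens right before
// the point where bus.stop() would run, so the listener's error path
// fires as close to stop() as the scheduler allows.
public class LatchNarrowing {
    public static void main(String[] args) throws Exception {
        CountDownLatch aboutToStop = new CountDownLatch(1);
        ConcurrentLinkedQueue<String> events = new ConcurrentLinkedQueue<>();

        Thread listener = new Thread(() -> {
            try {
                aboutToStop.await();           // park until stop() is imminent
            } catch (InterruptedException e) {
                return;
            }
            events.add("listener-error-path"); // removeListenerOnError would run here
        });
        listener.start();

        Thread stopper = new Thread(() -> {
            aboutToStop.countDown();           // release the listener...
            events.add("bus-stop");            // ...just before bus.stop() would run
        });
        stopper.start();

        listener.join();
        stopper.join();
        System.out.println(events.size() + " events, race narrowed but not eliminated");
    }
}
```

The relative order of the two events still races (which is the point of the discussion that follows); the latch only shrinks the window, it cannot close it.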

Contributor Author

Maybe we could check the stopped status of the bus in the listener.
This would be better than using a CountDownLatch; however, it can't get rid of the racing completely. WDYT?

Member

A CountDownLatch always makes things more deterministic, so it sounds better to me.

What do you mean by "it can't get rid of racing completely"?

Contributor Author

As described in the PR description, to reproduce the original issue we have to make sure:

  1. The stopping thread is holding the synchronized lock of the bus
  2. The interrupting listener thread is trying to acquire the synchronized lock of the bus

But signaling the listener to start interrupting just before bus.stop via a CountDownLatch can't guarantee this 100%, right?

Member

Maybe you should insert the CountDownLatch after bus.stop?

Member

Unfortunately, checking the stopped status can't guarantee this. It's likely that the bus has already set the stopped status to true, but has not acquired the synchronized lock yet.

IIUC, you want to let interruptingListener start to work once the bus has moved to stop status and acquired the synchronized lock, right?

But how can the bus acquire the synchronized lock now? This fix has already removed the synchronized lock. The only thing you could do is check the bus status, and I think that's enough.

Contributor Author

@wangshuo128 wangshuo128 Dec 26, 2019

Got your point.

Now, there are two things.

  1. Without the fix, how the test would behave.
  2. With the fix, how to make sure that there is no deadlock when a listener is interrupted after bus.stop is called.

For (1), we can't avoid racing without changing the bus.stop code (e.g. adding a callback).
For (2), we at least have to expose the internal stopped status of the bus, which is maybe not recommended.

So WDYT?

Member

Focusing only on LiveListenerBus may make it impossible to work around the difficulties you mentioned above. Maybe we should move to AsyncEventQueue.

How about this way:

  1. Add a method status() in AsyncEventQueue for testing only;

  2. In interruptingListener, keep checking AsyncEventQueue.status() until it's stopped. So, when AsyncEventQueue is stopped, we're sure that LiveListenerBus has stopped too and acquired the lock (without the fix).

WDYT?
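The polling idea could be sketched like this. Note that status() here is the hypothetical test-only accessor being proposed, not an existing Spark API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the proposal: the listener spins on a hypothetical test-only
// status() accessor until the queue reports STOPPED, then runs its
// interrupt path. None of these names are real Spark APIs.
public class PollUntilStopped {
    enum Status { STARTED, STOPPED }

    public static void main(String[] args) throws Exception {
        AtomicReference<Status> status = new AtomicReference<>(Status.STARTED);

        Thread listener = new Thread(() -> {
            while (status.get() != Status.STOPPED) {
                Thread.onSpinWait();          // busy-wait; fine for a short test
            }
            System.out.println("listener saw STOPPED, would interrupt now");
        });
        listener.start();

        Thread.sleep(50);                     // simulate the queue doing some work
        status.set(Status.STOPPED);           // AsyncEventQueue.stop() stand-in
        listener.join();
    }
}
```

The spin-wait guarantees the listener only acts after the queue has published its stopped state, which is the ordering guarantee the latch alone couldn't provide.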

Contributor Author

@wangshuo128 wangshuo128 Dec 26, 2019

I believe this would work. In fact, AsyncEventQueue also has a stopped status that we could check.
But associating a listener with its AsyncEventQueue would be another problem we'd have to resolve. Currently it's encapsulated inside the bus code by bus.addToXXXQueue.

Contributor

You guys are trying to fabricate a test that will not be testing what the actual code is doing when a real app is running. That's the problem.

To do that you'd need the stop() code in the listener bus to wait holding a lock while the queues are being drained; and one of those queues needs to run into the error that causes it to remove a bad listener. That's hard to do without inserting callbacks that don't exist into the code; and adding those callbacks would only be enabling the test, which is why that's questionable.

So you basically need this in the new stop():

def stop() {
  // do some stop stuff here
  testStartCallback()
  // clear the queues here
  testEndCallback()
}

The two callbacks are needed because otherwise there is no guarantee that what the queues do will happen before stop() does its thing.

But really I don't see what really that test would be actually testing now that there is no synchronized block anymore.

Anything you do here without these callbacks will be racy, and thus may not hit the original issue. Also, without the synchronized block, there's nothing to cause a deadlock in the first place, so that's why I said the test isn't that great to begin with.

So I'd avoid trying to create a fancy test that isn't really testing the issue and just adding unneeded hooks into the main code. The current test is ok and as close as you'll get without the above callbacks; so either go with that, or just remove the test.

// Notify interrupting listener starts to work
interruptingListener.sleep = false
// Wait for bus to stop
stoppingThread.join()
Contributor

Since you're trying to detect a deadlock, shouldn't this have a timeout?
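A bounded wait could look like the following sketch (the thread body is an illustrative stand-in for the test's stoppingThread calling bus.stop()):

```java
// Sketch of a bounded wait for a possibly-deadlocked thread: join with a
// timeout and fail loudly if the thread never finished. The thread body
// here is a stand-in for the test's stoppingThread calling bus.stop().
public class BoundedJoin {
    public static void main(String[] args) throws Exception {
        Thread stoppingThread = new Thread(() -> {
            // bus.stop() would run here
        });
        stoppingThread.start();

        stoppingThread.join(10_000);          // wait at most 10 seconds
        if (stoppingThread.isAlive()) {
            throw new AssertionError("stop() did not finish: likely deadlocked");
        }
        System.out.println("stopped cleanly");
    }
}
```

With a plain join(), a regression back into the deadlock would hang the whole test suite instead of failing one test.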

@SparkQA

SparkQA commented Dec 20, 2019

Test build #115628 has finished for PR 26924 at commit 3260be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2019

Test build #115640 has finished for PR 26924 at commit 3d7f435.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangshuo128
Contributor Author

@vanzin @squito
Thanks a lot for helping with this.
I refined the unit test. Could you take another look and give some advice?

@vanzin
Contributor

vanzin commented Dec 23, 2019

I don't think there's a way to write a proper test here without changing a bunch of things in the bus and queue code to expose internal hooks... and I don't think that's desirable.

I guess the current test is good enough as an attempt to test this.

But Jenkins seems to be hosed, so running the tests here will probably have to wait until after the holidays...

@vanzin
Contributor

vanzin commented Dec 23, 2019

retest this please

@SparkQA

SparkQA commented Dec 23, 2019

Test build #115655 has finished for PR 26924 at commit 3d7f435.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 23, 2019

retest this please

@SparkQA

SparkQA commented Dec 24, 2019

Test build #115663 has finished for PR 26924 at commit 3d7f435.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangshuo128
Contributor Author

Ok, thanks for taking care of this. I saw the "Jenkins looks hosed" discussion on the Spark dev mailing list. Let's wait until then.

@Ngone51
Member

Ngone51 commented Dec 24, 2019

The PySpark failure was introduced by a mistaken merge, which has just been reverted.

@Ngone51
Member

Ngone51 commented Dec 24, 2019

retest this please.

@SparkQA

SparkQA commented Dec 24, 2019

Test build #115700 has finished for PR 26924 at commit 3d7f435.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangshuo128
Contributor Author

wangshuo128 commented Dec 24, 2019

A little weird. The failing test passes locally.

@wangshuo128
Contributor Author

wangshuo128 commented Dec 26, 2019

@Ngone51 Would you please trigger the test again? I think the test failed due to a flaky test, see #27010 for details.

@Ngone51
Member

Ngone51 commented Dec 26, 2019

retest this please.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115798 has finished for PR 26924 at commit 3d7f435.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

retest this, please

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115804 has finished for PR 26924 at commit 3d7f435.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jan 2, 2020

retest this please

@SparkQA

SparkQA commented Jan 3, 2020

Test build #116057 has finished for PR 26924 at commit 3d7f435.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Jan 3, 2020

Merging to master / 2.4.

@vanzin vanzin closed this in 10cae04 Jan 3, 2020
vanzin pushed a commit that referenced this pull request Jan 3, 2020
…ncEventQueue#removeListenerOnError

There is a deadlock between `LiveListenerBus#stop` and `AsyncEventQueue#removeListenerOnError`.

We can reproduce as follows:

1. Post some events to `LiveListenerBus`
2. Call `LiveListenerBus#stop`, which holds the synchronized lock of the `bus` (https://github.com/apache/spark/blob/5e92301723464d0876b5a7eec59c15fed0c5b98c/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L229), waiting until all the events are processed by listeners, then removes all the queues
3. The event queue drains events by posting them to its listeners. If a listener is interrupted, it calls `AsyncEventQueue#removeListenerOnError` (https://github.com/apache/spark/blob/7b1b60c7583faca70aeab2659f06d4e491efa5c0/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#L207), which in turn calls `bus.removeListener`, trying to acquire the synchronized lock of the bus, resulting in deadlock

This PR removes the `synchronized` from `LiveListenerBus.stop` because the underlying data structures themselves are thread-safe.

To fix deadlock.

No.

New UT.

Closes #26924 from wangshuo128/event-queue-race-condition.

Authored-by: Wang Shuo <wangshuo128@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 10cae04)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@vanzin
Contributor

vanzin commented Jan 3, 2020

FYI I had to resolve a trivial conflict and make a small scala 2.11-related change to the code in 2.4.

@wangshuo128
Contributor Author

Thanks a lot!
