
[SPARK-11655] [core] Fix deadlock in handling of launcher stop(). #9633

Closed
vanzin wants to merge 1 commit into apache:master from vanzin:SPARK-11655

Conversation

vanzin (Contributor) commented Nov 11, 2015

The stop() callback was trying to close the launcher connection in the
same thread that handles connection data, which ended up causing a
deadlock. So avoid that by dispatching the stop() request in its own
thread.

On top of that, add some exception safety to a few parts of the code,
and use "destroyForcibly" from Java 8 if it's available, to force
kill the child process. The flip side is that "kill()" may not actually
work if running Java 7.

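
As a rough illustration of the destroyForcibly part of the description (a hedged sketch, not the code in this PR; the class and method names below are made up), a Java-7-compatible build can look the Java 8 method up reflectively and fall back to plain destroy() when it is missing:

```java
import java.lang.reflect.Method;

// Hedged sketch, not the actual Spark patch: prefer Java 8's
// Process.destroyForcibly() when it exists, otherwise fall back to
// Process.destroy(), which on most *nix JVMs only sends SIGTERM.
final class ForceKill {
  static void kill(Process child) {
    try {
      Method destroyForcibly = Process.class.getMethod("destroyForcibly");
      destroyForcibly.invoke(child);   // Java 8+: forcible termination
    } catch (NoSuchMethodException e) {
      child.destroy();                 // Java 7: best effort only
    } catch (ReflectiveOperationException e) {
      child.destroy();                 // reflection failed for another reason
    }
  }
}
```

destroy() on most *nix JVMs only delivers SIGTERM, which is why the description warns that kill() may not actually work on Java 7.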
vanzin (Contributor, Author) commented Nov 11, 2015

@JoshRosen

SparkQA commented Nov 11, 2015

Test build #45656 has finished for PR 9633 at commit 9fcd201.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

JoshRosen (Contributor) commented:

I can confirm that this seems to fix the problem when running locally.

JoshRosen (Contributor) commented:

Based on http://bugs.java.com/view_bug.do?bug_id=4073195, it sounds like many *nix implementations of Process.destroy() work by sending SIGTERM to the child process. I suppose that anything that caused SIGTERM to be swallowed / ignored by one of the child processes could keep this from working on Java 7. PySpark used to be vulnerable to similar problems, so it includes a test case which specifically checks the SIGTERM-handling behavior:

"""Ensure that daemon and workers terminate on SIGTERM."""

I commented out the handle.stop() call and verified that the child process stops almost immediately under Java 7, so it appears that this has fixed the issue. I suppose that we could try adding regression tests, but I'd also be fine doing that as a followup; I'd like to try to get this fix in sooner rather than later given the impact that it will have on Jenkins performance.
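
For anyone reproducing this, here is a small standalone demo (hypothetical, not part of the PR; requires Java 8 and a *nix shell) of why a SIGTERM-trapping child survives destroy() but not destroyForcibly():

```java
import java.util.concurrent.TimeUnit;

// Hypothetical demo, not from this PR: a child that ignores SIGTERM survives
// destroy() (which sends SIGTERM on most *nix JVMs) but not destroyForcibly()
// (which sends SIGKILL).
public class SigtermDemo {
  public static void main(String[] args) throws Exception {
    Process child = new ProcessBuilder("bash", "-c", "trap '' TERM; sleep 600")
        .start();

    child.destroy();
    System.out.println("gone after destroy(): "
        + child.waitFor(2, TimeUnit.SECONDS));          // expected: false

    child.destroyForcibly();
    System.out.println("gone after destroyForcibly(): "
        + child.waitFor(2, TimeUnit.SECONDS));          // expected: true
  }
}
```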

@@ -102,8 +103,20 @@ public synchronized void kill() {
disconnect();
Contributor commented on this diff:

I was initially worried that this needs to be in a try block but it doesn't look like disconnect() is capable of throwing any exceptions.
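
To make that point concrete, here is an illustrative stand-in for the method under review (the real ChildProcAppHandle internals are simplified to stubs, and the names below are not exact):

```java
// Illustrative stand-in, not the actual patch: if disconnect() could throw,
// the process destroy would have to move into a finally block so the child
// is still killed; since it cannot throw, the straight-line order is safe.
class KillSketch {
  private Process childProc;                     // stub for the launched child

  private void disconnect() {
    // closes the launcher connection; assumed never to throw
  }

  public synchronized void kill() {
    disconnect();
    if (childProc != null) {
      childProc.destroyForcibly();               // Java 8; see the reflection sketch above
      childProc = null;
    }
  }
}
```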

vanzin (Contributor, Author) commented Nov 12, 2015

Note that the fix is NOT about whether destroy or destroyForcibly is used. The fix was for a real deadlock in the code; the deadlock was made worse by the destroy call not actually killing the child process, which is what caused the process leak.

With the deadlock out of the way, calling destroy shouldn't really be needed since the child process will exit properly.

vanzin (Contributor, Author) commented Nov 12, 2015

@JoshRosen do you have any extra feedback here? I'll push the change otherwise.

vanzin (Contributor, Author) commented Nov 12, 2015

Merging to master / 1.6; we can do post-review later if needed.

asfgit pushed a commit that referenced this pull request Nov 12, 2015

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9633 from vanzin/SPARK-11655.

(cherry picked from commit 767d288)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
asfgit closed this in 767d288 on Nov 12, 2015
JoshRosen (Contributor) commented:

Sorry for the late / flaky review replies; I've been home sick with strep throat and spent most of the day asleep. This seems fine to me.

JoshRosen (Contributor) commented:

Maybe I'm overlooking something really obvious, but I think it's pretty hard to spot the circular wait condition which led to the deadlock. For posterity, could you post a brief description of the participants in that cycle?

dskrvk pushed a commit to dskrvk/spark that referenced this pull request Nov 13, 2015

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#9633 from vanzin/SPARK-11655.
vanzin (Contributor, Author) commented Nov 13, 2015

LauncherBackend.close() waits for the communication thread to finish execution, so it can't be called from that thread or it will deadlock. (It's a little weird that you're even allowed to do that, but go figure.)
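
A minimal, self-contained illustration of that circular wait (the names below are made up; this is not the real LauncherBackend code): a thread that ends up joining itself blocks forever, which is what happened when the connection thread handled the stop() and then tried to close, and why the fix dispatches the stop() handling to a separate thread.

```java
// Minimal sketch of the self-join deadlock (illustrative names, not Spark code):
// if close() joins the connection thread, calling it from that same thread
// leaves the thread waiting on itself forever.
public class SelfJoinDeadlock {
  public static void main(String[] args) throws Exception {
    final Thread[] holder = new Thread[1];

    Thread connection = new Thread(() -> {
      try {
        // stands in for: handle a "stop" message, then call close(),
        // which joins this very thread
        holder[0].join();
      } catch (InterruptedException ignored) { }
    }, "launcher-connection");

    holder[0] = connection;
    connection.setDaemon(true);    // let this demo JVM exit anyway
    connection.start();

    Thread.sleep(500);
    // prints WAITING: the connection thread is stuck waiting for itself
    System.out.println(connection.getName() + " state: " + connection.getState());
  }
}
```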

vanzin deleted the SPARK-11655 branch November 19, 2015 23:42