
[SPARK-11655] [core] Fix deadlock in handling of launcher stop(). #9633

Closed
vanzin wants to merge 1 commit into apache:master from vanzin:SPARK-11655

Conversation

vanzin (Contributor) commented Nov 11, 2015

The stop() callback was trying to close the launcher connection in the
same thread that handles connection data, which ended up causing a
deadlock. So avoid that by dispatching the stop() request in its own
thread.

On top of that, add some exception safety to a few parts of the code,
and use "destroyForcibly" from Java 8 if it's available, to force
kill the child process. The flip side is that "kill()" may not actually
work if running Java 7.

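
As a rough illustration of the destroyForcibly part of the description (a hedged sketch, not the code in this PR; the class and method names below are made up), a Java-7-compatible build can look the Java 8 method up reflectively and fall back to plain destroy() when it is missing:

```java
import java.lang.reflect.Method;

// Hedged sketch, not the actual Spark patch: prefer Java 8's
// Process.destroyForcibly() when it exists, otherwise fall back to
// Process.destroy(), which on most *nix JVMs only sends SIGTERM.
final class ForceKill {
  static void kill(Process child) {
    try {
      Method destroyForcibly = Process.class.getMethod("destroyForcibly");
      destroyForcibly.invoke(child);   // Java 8+: forcible termination
    } catch (NoSuchMethodException e) {
      child.destroy();                 // Java 7: best effort only
    } catch (ReflectiveOperationException e) {
      child.destroy();                 // reflection failed for another reason
    }
  }
}
```

destroy() on most *nix JVMs only delivers SIGTERM, which is why the description warns that kill() may not actually work on Java 7.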
vanzin (Contributor, Author) commented Nov 11, 2015

@JoshRosen

SparkQA commented Nov 11, 2015

Test build #45656 has finished for PR 9633 at commit 9fcd201.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

JoshRosen (Contributor) commented:

I can confirm that this seems to fix the problem when running locally.

JoshRosen (Contributor) commented:

Based on http://bugs.java.com/view_bug.do?bug_id=4073195, it sounds like many *nix implementations of Process.destroy() work by sending SIGTERM to the child process. I suppose that anything that caused SIGTERM to be swallowed / ignored by one of the child processes could keep this from working on Java 7. PySpark used to be vulnerable to similar problems, so it includes a test case which specifically checks the SIGTERM-handling behavior:

"""Ensure that daemon and workers terminate on SIGTERM."""

I commented out the handle.stop() call and verified that the child process stops almost immediately under Java 7, so it appears that this has fixed the issue. I suppose that we could try adding regression tests, but I'd also be fine doing that as a followup; I'd like to try to get this fix in sooner rather than later given the impact that it will have on Jenkins performance.
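
For anyone reproducing this, here is a small standalone demo (hypothetical, not part of the PR; requires Java 8 and a *nix shell) of why a SIGTERM-trapping child survives destroy() but not destroyForcibly():

```java
import java.util.concurrent.TimeUnit;

// Hypothetical demo, not from this PR: a child that ignores SIGTERM survives
// destroy() (which sends SIGTERM on most *nix JVMs) but not destroyForcibly()
// (which sends SIGKILL).
public class SigtermDemo {
  public static void main(String[] args) throws Exception {
    Process child = new ProcessBuilder("bash", "-c", "trap '' TERM; sleep 600")
        .start();

    child.destroy();
    System.out.println("gone after destroy(): "
        + child.waitFor(2, TimeUnit.SECONDS));          // expected: false

    child.destroyForcibly();
    System.out.println("gone after destroyForcibly(): "
        + child.waitFor(2, TimeUnit.SECONDS));          // expected: true
  }
}
```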

@@ -102,8 +103,20 @@ public synchronized void kill() {
disconnect();
Contributor commented on this diff:

I was initially worried that this needs to be in a try block but it doesn't look like disconnect() is capable of throwing any exceptions.
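
To make that point concrete, here is an illustrative stand-in for the method under review (the real ChildProcAppHandle internals are simplified to stubs, and the names below are not exact):

```java
// Illustrative stand-in, not the actual patch: if disconnect() could throw,
// the process destroy would have to move into a finally block so the child
// is still killed; since it cannot throw, the straight-line order is safe.
class KillSketch {
  private Process childProc;                     // stub for the launched child

  private void disconnect() {
    // closes the launcher connection; assumed never to throw
  }

  public synchronized void kill() {
    disconnect();
    if (childProc != null) {
      childProc.destroyForcibly();               // Java 8; see the reflection sketch above
      childProc = null;
    }
  }
}
```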

vanzin (Contributor, Author) commented Nov 12, 2015

Note that the fix is NOT about whether destroy or destroyForcibly is used. The fix was for a real deadlock in the code; the deadlock was made worse by the destroy call not actually killing the child process, which is what caused the process leak.

With the deadlock out of the way, calling destroy shouldn't really be needed since the child process will exit properly.

vanzin (Contributor, Author) commented Nov 12, 2015

@JoshRosen do you have any extra feedback here? I'll push the change otherwise.

vanzin (Contributor, Author) commented Nov 12, 2015

Merging to master / 1.6; we can do post-review later if needed.

asfgit pushed a commit that referenced this pull request Nov 12, 2015

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9633 from vanzin/SPARK-11655.

(cherry picked from commit 767d288)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
asfgit closed this in 767d288 on Nov 12, 2015
JoshRosen (Contributor) commented:

Sorry for the late / flaky review replies; I've been home sick with strep throat and spent most of the day asleep. This seems fine to me.

JoshRosen (Contributor) commented:

Maybe I'm overlooking something really obvious, but I think it's pretty hard to spot the circular wait condition which led to the deadlock. For posterity, could you post a brief description of the participants in that cycle?

dskrvk pushed a commit to dskrvk/spark that referenced this pull request Nov 13, 2015

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#9633 from vanzin/SPARK-11655.
vanzin (Contributor, Author) commented Nov 13, 2015

LauncherBackend.close() waits for the communication thread to finish execution, so it can't be called from that thread or it will deadlock. (It's a little weird that you're even allowed to do that, but go figure.)
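
A minimal, self-contained illustration of that circular wait (the names below are made up; this is not the real LauncherBackend code): a thread that ends up joining itself blocks forever, which is what happened when the connection thread handled the stop() and then tried to close, and why the fix dispatches the stop() handling to a separate thread.

```java
// Minimal sketch of the self-join deadlock (illustrative names, not Spark code):
// if close() joins the connection thread, calling it from that same thread
// leaves the thread waiting on itself forever.
public class SelfJoinDeadlock {
  public static void main(String[] args) throws Exception {
    final Thread[] holder = new Thread[1];

    Thread connection = new Thread(() -> {
      try {
        // stands in for: handle a "stop" message, then call close(),
        // which joins this very thread
        holder[0].join();
      } catch (InterruptedException ignored) { }
    }, "launcher-connection");

    holder[0] = connection;
    connection.setDaemon(true);    // let this demo JVM exit anyway
    connection.start();

    Thread.sleep(500);
    // prints WAITING: the connection thread is stuck waiting for itself
    System.out.println(connection.getName() + " state: " + connection.getState());
  }
}
```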

vanzin deleted the SPARK-11655 branch November 19, 2015 23:42