[SPARK-27900][CORE] Add uncaught exception handler to the driver #24796
```diff
@@ -861,6 +861,16 @@ private[spark] class SparkSubmit extends Logging {
    * running cluster deploy mode or python applications.
    */
   private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
+    // set the handler as early as possible
+    // we need the driver check because spark-submit is used by mesos to launch
+    // executors (via the SparkLauncher)
+    val isExecutor = args.childArgs.contains("--executor-id")
+    args.deployMode match {
+      case "client" | null if !isExecutor =>
+        Thread.setDefaultUncaughtExceptionHandler(
+          new SparkUncaughtExceptionHandler(isDriver = true))
+    }
+
     val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
     // Let the main class re-initialize the logging system once it starts.
     if (uninitLog) {
```

Review discussion on this change:

> I remember there was an argument that we should not set the default `UncaughtExceptionHandler` for the driver, since Spark doesn't control the whole driver JVM and we should not decide on the user's behalf what should happen in that JVM. They may have their own handler.

> There should be documentation or something about what is provided (Spark on K8s is affected, for example), but even if the user adds a handler, there will be a deadlock here (which could be fixed, I guess). In general I don't see people setting one, and if the same code is implemented in Python (which calls the JVM behind the scenes), you need a default handler in that case, I guess. My 2 cents.

> Yeah, I was actually supporting this, since if a Spark thread crashes, SparkContext and other components will probably not work correctly. I could not find the related PR or JIRA ticket though... One alternative solution is to set up an `UncaughtExceptionHandler` only for Spark-created threads; then this would not impact any other user threads.

> This is currently done only for the fsHistoryProvider (unrelated to Spark core).

> I am not sure how this will help, as we will have to exit from there anyway, no?
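The alternative raised in the discussion above, installing a handler only on Spark-created threads rather than a process-wide default, can be sketched as follows. This is a hypothetical illustration with invented names (`PerThreadHandlerSketch`, `daemonFactory`), not Spark's actual thread-factory code:

```scala
import java.util.concurrent.ThreadFactory
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: install an UncaughtExceptionHandler only on threads
// created through this factory, leaving user threads and the process-wide
// default handler untouched.
object PerThreadHandlerSketch {
  val handlerFired = new AtomicBoolean(false)

  def daemonFactory(prefix: String): ThreadFactory = new ThreadFactory {
    private var count = 0
    override def newThread(r: Runnable): Thread = synchronized {
      count += 1
      val t = new Thread(r, s"$prefix-$count")
      t.setDaemon(true)
      // Per-thread handler: fires only for this thread's uncaught throwables.
      t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
        override def uncaughtException(th: Thread, e: Throwable): Unit =
          handlerFired.set(true) // a real handler would log and decide whether to exit
      })
      t
    }
  }

  def main(args: Array[String]): Unit = {
    val t = daemonFactory("sketch").newThread(new Runnable {
      override def run(): Unit = throw new RuntimeException("boom")
    })
    t.start()
    t.join()
    println(s"handler fired: ${handlerFired.get}")
  }
}
```

Because the handler is attached per thread, it cannot clobber a handler the user has already installed via `Thread.setDefaultUncaughtExceptionHandler`.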
```diff
@@ -45,7 +45,7 @@ private[spark] object ShutdownHookManager extends Logging {
    */
   val TEMP_DIR_SHUTDOWN_PRIORITY = 25

-  private lazy val shutdownHooks = {
+  private[spark] lazy val shutdownHooks = {
     val manager = new SparkShutdownHookManager()
     manager.install()
     manager
@@ -204,6 +204,11 @@ private [util] class SparkShutdownHookManager {
     hooks.synchronized { hooks.remove(ref) }
   }

+  def clear(): Unit = {
+    hooks.synchronized {
+      hooks.clear()
+    }
+  }
 }

 private class SparkShutdownHook(private val priority: Int, hook: () => Unit)
```

Review discussion on `clear()`:

> I'm still not so clear on why you want to forget all the hooks that were registered. What is the scenario where they prevent termination, and can we fix the hook instead? It seems like we would want them to try to complete.

> The scenario is described in the JIRA. I see a deadlock: the DAG event loop thread dies because it tries to schedule a large number of tasks, gets an OOM, and becomes unresponsive (it never wakes from the interrupt). I can point to the full jstack output if we need to dig further into this. In general I am not confident we can predict the state of the JVM after an OOM, or what can go wrong while attempting a safe shutdown; that is why an immediate exit is one good option (maybe not the only one).

> Is it a deadlock? I understand just exiting the driver anyway when an uncaught exception happens; it's just not clear why one would remove the hooks. If they execute, good. Or are you saying they are the issue? If so, is there a fix for the hook instead? I'm not sure it's better not to execute them.

> I tried to fix it, but the event loop thread will not get interrupted, so it cannot complete the join (by the way, it's a daemon thread). I will paste the jstack output shortly so you can have a look; that is my view so far.

> @srowen here it is: https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b
> Blocking forever there might be a problem if something goes wrong. If you check the output, the join will never finish: it waits at the join (EventLoop.scala:81) for the event loop thread, which in turn waits for the shutdown hook to finish, and that hook was invoked by the OOM the event loop thread itself caused. So it's a deadlock. If you check the log, the OOM comes from that thread while it tries to submit 1M tasks ;). `dag-scheduler-event-loop` -> shutdownHook -> calls join from the other thread and waits for `dag-scheduler-event-loop` (deadlock).

> Do you know why the event loop thread won't get interrupted? I tried to reproduce the deadlock locally using this change (zsxwing@b1aecf2) but I could not reproduce the hang; I saw a different behavior: the JVM just exits when reaching ... My JDK version is ...

> @zsxwing interesting... I use the JVM in the alpine docker image used for Spark on K8s (openjdk:8-alpine), which is the latest...

> @zsxwing it's possible your change is not triggering the same thing, because throwing an OOM from within Java code is not the same as an OOM from within the JVM itself (IIRC, when the user throws an OOM themselves, none of the JVM -XX options related to OOMs get triggered).
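To make the semantics of the new `clear()` concrete, here is a minimal stand-alone sketch. The names (`HookManagerSketch`, `runAll`) are invented and this is not Spark's `SparkShutdownHookManager`; it only illustrates the behavior the diff adds: once cleared, a subsequent run of the hooks executes nothing, which is what lets an uncaught-exception path exit immediately without waiting on possibly-deadlocked hooks:

```scala
import scala.collection.mutable

// Minimal stand-in illustrating the clear() semantics added in this diff:
// after clear(), running the hooks executes nothing.
class HookManagerSketch {
  private val hooks = mutable.Buffer[() => Unit]()

  def add(hook: () => Unit): Unit = hooks.synchronized { hooks += hook }

  // Forget every registered hook, mirroring the new clear() above.
  def clear(): Unit = hooks.synchronized { hooks.clear() }

  // Run all remaining hooks; returns how many ran.
  def runAll(): Int = hooks.synchronized {
    hooks.foreach(h => h())
    hooks.size
  }
}

object HookManagerSketchDemo {
  def main(args: Array[String]): Unit = {
    val m = new HookManagerSketch
    var ran = 0
    m.add(() => ran += 1)
    m.clear() // the immediate-exit path drops all hooks
    println(s"hooks executed: ${m.runAll()}, side effects: $ran")
  }
}
```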
```diff
@@ -29,4 +29,6 @@ private[spark] object SparkExitCode {
      OutOfMemoryError. */
   val OOM = 52

+  val VM_ERROR = 53
+
 }
```
Review comment on `VM_ERROR`:

> Not the best, but at this point this is what is available (the alternative is to add it on the SparkContext side, where SparkEnv provides the info on whether we are in the driver JVM).
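For context on how exit codes like these might be consumed, below is a hedged sketch of mapping an uncaught throwable to an exit code. The dispatch logic and the `UNCAUGHT_EXCEPTION` value are illustrative assumptions, not necessarily what `SparkUncaughtExceptionHandler` actually does. One subtlety worth noting: `OutOfMemoryError` is itself a subclass of `VirtualMachineError`, so it must be matched first to keep the distinct `OOM` code:

```scala
// Hypothetical mapping from throwable to exit code; values 52/53 match the
// diff above, UNCAUGHT_EXCEPTION is an assumed generic code for illustration.
object ExitCodeSketch {
  val UNCAUGHT_EXCEPTION = 50
  val OOM = 52
  val VM_ERROR = 53

  def exitCodeFor(t: Throwable): Int = t match {
    // OutOfMemoryError extends VirtualMachineError, so match it first.
    case _: OutOfMemoryError    => OOM
    case _: VirtualMachineError => VM_ERROR // e.g. InternalError, StackOverflowError
    case _                      => UNCAUGHT_EXCEPTION
  }
}
```

A handler would typically log the throwable and then call `System.exit(ExitCodeSketch.exitCodeFor(e))` (or `Runtime.halt` to skip shutdown hooks entirely, per the deadlock discussion above).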