
[SPARK-21146] [CORE] Master/Worker should handle and shutdown when any thread gets UncaughtException #18357

Closed
wants to merge 4 commits into from

Conversation

devaraj-kavali

What changes were proposed in this pull request?

Adding the default UncaughtExceptionHandler to the Worker.

How was this patch tested?

I verified it manually: when any Worker thread gets an uncaught exception, the default UncaughtExceptionHandler now handles it.

@jiangxb1987
Contributor

Could you elaborate on when we need this improvement? In fact I don't see much benefit of having this at first glance.

@devaraj-kavali
Author

@jiangxb1987 Thanks for looking at this.

If any thread in the Worker hits an unhandled exception, that thread is terminated but the process (Worker) keeps running, so the Worker no longer performs the terminated thread's functionality. One instance is described in the JIRA: the dispatcher-event thread got an exception and terminated, yet the Worker kept running without that thread's functionality. I think we should handle this at the process level for all threads: if any thread gets an unhandled exception, the process (Worker) should exit instead of continuing with partial functionality.

This PR sets SparkUncaughtExceptionHandler as the default UncaughtExceptionHandler, so for any unhandled exception in any thread of the process, SparkUncaughtExceptionHandler handles the exception and exits the process with the appropriate exit code.
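For illustration, here is a minimal, self-contained JVM sketch of what `Thread.setDefaultUncaughtExceptionHandler` provides: a process-wide hook that fires whenever any thread dies from an exception nothing else caught. This is plain Java rather than Spark's Scala, and it is not Spark's actual handler; the class and message names are made up for the demo.

```java
import java.util.concurrent.atomic.AtomicReference;

public class UncaughtDemo {
    static final AtomicReference<Throwable> caught = new AtomicReference<>();

    static void run() {
        // Process-wide hook: fires for ANY thread that dies from an unhandled
        // exception. Spark's handler would log and call System.exit here;
        // this demo only records the failure so the process keeps running.
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> caught.set(e));

        Thread worker = new Thread(() -> {
            throw new IllegalStateException("dispatcher-event thread died");
        });
        worker.start();
        try {
            // The handler runs inside the dying thread before it terminates,
            // so join() returning guarantees the handler has fired.
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        run();
        System.out.println("handler saw: " + caught.get().getMessage());
    }
}
```

Without such a default handler, the exception is printed by the JVM's fallback and the rest of the process carries on, which is exactly the partial-functionality situation described above.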

@jiangxb1987
Contributor

AFAIK SparkUncaughtExceptionHandler is designed to make Executors fail fast, because Executors are relatively lightweight, but I'm not sure we can afford to let Workers, or even the Master, fail fast too.
cc @cloud-fan for more comments.

@cloud-fan
Contributor

According to the JIRA, if the worker hits some unrecoverable issue and its state becomes DEAD, it makes sense to terminate the worker since it's no longer functional. But can you double check that this change won't terminate a still-functional worker?

@devaraj-kavali
Author

But can you double check that this change won't terminate a still-functional worker?

The change doesn't affect a fully functional Worker; SparkUncaughtExceptionHandler steps in only when a thread is terminated by an exception it didn't handle.

@zsxwing
Member

zsxwing commented Jun 21, 2017

ok to test

@zsxwing
Member

zsxwing commented Jun 21, 2017

LGTM. I thought we already did it :( I'm taking it back. I didn't realize SparkUncaughtExceptionHandler also kills the process for non-fatal exceptions. That seems dangerous; I'm not sure whether we've forgotten to catch some non-fatal exceptions somewhere. Could you add a flag to enable/disable it?

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78404 has finished for PR 18357 at commit c6e71f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@devaraj-kavali
Author

@zsxwing Thanks for looking into this, I see SparkUncaughtExceptionHandler exits the process for all the exceptions/errors.

Can we modify SparkUncaughtExceptionHandler so that it always kills the process for Errors, but kills it for Exceptions only when a flag is true? The flag could be enabled for Executors and disabled for the other daemons (Master, Worker and HistoryServer).

@zsxwing
Member

zsxwing commented Jun 21, 2017

@devaraj-kavali SparkUncaughtExceptionHandler is a singleton so you cannot get the configuration from SparkConf.

@devaraj-kavali
Author

Can we change SparkUncaughtExceptionHandler into a class and create an instance in each daemon's main(), with a constructor flag controlling whether to kill the process for Exceptions? Something like below:

For Executor :

Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(true)) // Here true denotes to kill the process for Exceptions

For Other Daemons:

Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(false)) // false denotes not to kill the process for Exceptions
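The flag-based design proposed above can be sketched as follows. This is a hypothetical plain-Java illustration, not the actual Scala change; the class name and the `decideExitCode` helper are made up, while the numeric codes mirror Spark's SparkExitCode constants (OOM = 52, UNCAUGHT_EXCEPTION = 50). Errors always terminate the process; plain Exceptions terminate it only when the flag is true.

```java
import java.util.OptionalInt;

// Hypothetical sketch of the flag-based handler: Executors construct it with
// exitOnException = true (fail fast), daemons with false (keep running).
public class FlaggedHandler implements Thread.UncaughtExceptionHandler {
    static final int OOM = 52;                 // mirrors SparkExitCode.OOM
    static final int UNCAUGHT_EXCEPTION = 50;  // mirrors SparkExitCode.UNCAUGHT_EXCEPTION

    private final boolean exitOnException;

    public FlaggedHandler(boolean exitOnException) {
        this.exitOnException = exitOnException;
    }

    // Returns the exit code instead of calling System.exit directly,
    // which keeps the decision logic testable.
    OptionalInt decideExitCode(Throwable e) {
        if (e instanceof OutOfMemoryError) return OptionalInt.of(OOM);
        if (e instanceof Error) return OptionalInt.of(UNCAUGHT_EXCEPTION);          // Errors: always exit
        if (exitOnException) return OptionalInt.of(UNCAUGHT_EXCEPTION);             // Executors: fail fast
        return OptionalInt.empty();                                                 // Daemons: log and keep running
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        decideExitCode(e).ifPresent(System::exit);
    }
}
```

The key property is that the behaviour for Error instances is identical in both modes; only the handling of non-fatal Exceptions depends on the constructor flag.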

(Commit message) Add a flag to kill/not kill the process for unhandled exceptions, with the default value as true; behaviour remains the same for unhandled Error instances.
@SparkQA

SparkQA commented Jun 29, 2017

Test build #78838 has finished for PR 18357 at commit a2f30f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

retest this please.

@SparkQA

SparkQA commented Jun 29, 2017

Test build #78912 has finished for PR 18357 at commit a2f30f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -26,27 +26,34 @@ import org.apache.spark.internal.Logging
*/
private[spark] object SparkUncaughtExceptionHandler
extends Thread.UncaughtExceptionHandler with Logging {
private[this] var exitOnException = true
Member

Instead of using a static variable, I prefer to change SparkUncaughtExceptionHandler to a class.

}
} catch {
case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)
Member

Could you keep this code? It's unlikely to be hit, but since the code is there, it's better not to change it.

Author

@zsxwing, this code is still there, but I moved the try & catch to the block where we invoke System.exit. Do you mean moving the whole code in uncaughtException() into the try block and keeping the catch block?

+      } catch {
+        case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)
+        case t: Throwable => Runtime.getRuntime.halt(SparkExitCode.UNCAUGHT_EXCEPTION_TWICE)

Member

Do you mean moving the whole code in uncaughtException() to try block and having the catch block?

Yes.
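The structure agreed on here can be sketched like this (plain Java, hypothetical class name; the real handler is Scala): the entire handler body sits inside the try block, so a second failure inside the handler itself still halts the JVM instead of being silently swallowed. The numeric codes mirror Spark's SparkExitCode constants (OOM = 52, UNCAUGHT_EXCEPTION_TWICE = 51).

```java
public class SafeHandler implements Thread.UncaughtExceptionHandler {
    static final int OOM = 52;                      // mirrors SparkExitCode.OOM
    static final int UNCAUGHT_EXCEPTION_TWICE = 51; // mirrors SparkExitCode.UNCAUGHT_EXCEPTION_TWICE

    @Override
    public void uncaughtException(Thread thread, Throwable exception) {
        try {
            // The whole handler body lives in the try block: logging, the
            // shutdown-in-progress check, and the System.exit call would all
            // go here.
            System.err.println("Uncaught exception in " + thread.getName() + ": " + exception);
        } catch (OutOfMemoryError oom) {
            // Runtime.halt skips shutdown hooks; it is the last resort if the
            // handler itself runs out of memory.
            Runtime.getRuntime().halt(OOM);
        } catch (Throwable t) {
            // The handler failed while handling a failure: halt immediately.
            Runtime.getRuntime().halt(UNCAUGHT_EXCEPTION_TWICE);
        }
    }
}
```

The halt-based catch clauses matter because throwing out of an UncaughtExceptionHandler is silently ignored by the JVM, which would leave the process running in exactly the half-dead state this PR is trying to prevent.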

Author

@zsxwing Thanks for the clarification.

@devaraj-kavali devaraj-kavali changed the title [SPARK-21146] [CORE] Worker should handle and shutdown when any thread gets UncaughtException [SPARK-21146] [CORE] Master/Worker should handle and shutdown when any thread gets UncaughtException Jun 30, 2017
@SparkQA

SparkQA commented Jun 30, 2017

Test build #78997 has finished for PR 18357 at commit 56c2e4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logError("[Process in shutdown] " + errMsg, exception)
} else if (exception.isInstanceOf[Error] ||
(!exception.isInstanceOf[Error] && exitOnException)) {
logError(errMsg + ". Shutting down now..", exception)
if (exception.isInstanceOf[OutOfMemoryError]) {
System.exit(SparkExitCode.OOM)
} else {
System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)
Member

The current changes are too much. Could you rename exitOnException to exitOnUncaughtException and just change this line to

if (exitOnUncaughtException) {
  System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)
}

@@ -737,6 +737,7 @@ private[deploy] object Worker extends Logging {
val ENDPOINT_NAME = "Worker"

def main(argStrings: Array[String]) {
Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(false))
Member

nit: false => exitOnException = false

*/
private[spark] object SparkUncaughtExceptionHandler
private[spark] class SparkUncaughtExceptionHandler(val exitOnException: Boolean = true)
Member

nit: please also document exitOnException using

@param exitOnException ...

@@ -1037,6 +1037,7 @@ private[deploy] object Master extends Logging {
val ENDPOINT_NAME = "Master"

def main(argStrings: Array[String]) {
Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(false))
Member

nit: false => exitOnException = false

@zsxwing
Member

zsxwing commented Jul 6, 2017

LGTM. Pending tests. Thanks!

@SparkQA

SparkQA commented Jul 6, 2017

Test build #79298 has finished for PR 18357 at commit 3d5106b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@devaraj-kavali
Author

@zsxwing Tests have passed, can you check this? Thanks

@zsxwing
Member

zsxwing commented Jul 11, 2017

@devaraj-kavali Sorry, I forgot this PR. I will trigger a new run since master has been updated a lot, and will set a reminder to merge this PR :)

@zsxwing
Member

zsxwing commented Jul 11, 2017

retest this please

@SparkQA

SparkQA commented Jul 11, 2017

Test build #79537 has finished for PR 18357 at commit 3d5106b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jul 11, 2017

retest this please

@SparkQA

SparkQA commented Jul 12, 2017

Test build #79542 has finished for PR 18357 at commit 3d5106b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Jul 12, 2017

Thanks! Merging to master.

@asfgit asfgit closed this in e16e8c7 Jul 12, 2017