[SPARK-4012] stop SparkContext when the exception is thrown from an infinite loop #5004
Conversation
Test build #28525 has started for PR 5004 at commit
Test build #28525 has finished for PR 5004 at commit
Test PASSed.
Do you think calling
Hi @zsxwing, but I'm not sure whether we need to set the exception handler of all threads to SparkUncaughtExceptionHandler, because that would mean that once there is an uncaught exception, we stop the whole program...
Yes. So I suggest that only threads created by Spark should set SparkUncaughtExceptionHandler. I think if a Spark internal thread throws an uncaught exception, it often means some Spark internal module has crashed. Just my 2 cents about improving robustness. Of course, your changes look good to me.
@zsxwing, sounds reasonable; let's wait for more eyes on this.
I tend to favor this change. I'd like it if, say, @aarondav could comment, since he added one of the lines being changed.
@srowen thanks for the comments. The reason to change FsHistoryProvider is that the runner generated by this function is essentially executed by a thread pool at a fixed rate.
Is it really OK to System.exit() the driver JVM? This may be user code that has an embedded SparkContext. The SparkUncaughtExceptionHandler is suitable for Executors, where we have full control over the JVM, and AppClient for the same reason, but I'm not sure TaskSchedulerImpl should be using it, or ContextCleaner, for instance. For driver shutdowns, it seems safer just to stop() the SparkContext.
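(For context, a minimal sketch of the exit-the-JVM pattern under discussion; this is not the Spark source: ExitSketch and its println-based logging are hypothetical stand-ins for SparkUncaughtExceptionHandler and Spark's Logging trait.)

```scala
import scala.util.control.ControlThrowable

object ExitSketch {
  // Hypothetical stand-in for SparkUncaughtExceptionHandler: log and kill the JVM.
  // Reasonable for executor JVMs that Spark fully owns, but drastic for a driver
  // JVM embedded in user code.
  def uncaughtException(t: Throwable): Unit = {
    System.err.println(s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
    System.exit(50)
  }

  // tryOrExit-style wrapper: any uncaught Throwable terminates the whole process.
  def tryOrExit(block: => Unit): Unit = {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e // Scala control-flow exceptions pass through
      case t: Throwable => uncaughtException(t)
    }
  }
}
```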
Test build #28613 has started for PR 5004 at commit
@aarondav, thanks for the insightful suggestion. I just updated the patch. The change becomes a bit bigger, as I need to create a new method in Utils (tryOrStopSparkContext); also, I need to pass a SparkContext reference to
Test build #28613 has finished for PR 5004 at commit
Test PASSed.
@@ -1156,6 +1156,18 @@ private[spark] object Utils extends Logging {
  }

  /**
   * Execute a block of code that evaluates to Unit, stop SparkContext is any uncaught exception
Add a comment contrasting this to tryOrExit, saying that this method is suitable for the driver while tryOrExit should be used for other JVMs started by Spark, over which we have full control. Also, second part should say something like "stopping the SparkContext if there is any uncaught exception."
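(For reference, a rough sketch of the driver-side helper under review, assuming the tryOrStopSparkContext name mentioned in the PR description; println stands in for Spark's Logging, and the merged code may differ in details.)

```scala
import scala.util.control.ControlThrowable

import org.apache.spark.SparkContext

object StopSketch {
  // Run a block and, instead of exiting the JVM as tryOrExit does, stop the given
  // SparkContext on an uncaught exception, so user code embedding the driver keeps running.
  def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e // never swallow Scala control-flow exceptions
      case t: Throwable =>
        System.err.println(
          s"uncaught error in thread ${Thread.currentThread().getName}, stopping SparkContext: $t")
        if (sc != null) {
          sc.stop()
        }
    }
  }
}
```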
@aarondav thanks for the comments, I just updated the patch.
Test build #28619 has started for PR 5004 at commit
Test build #28619 has finished for PR 5004 at commit
Test PASSed.
        logError(s"uncaught error in thread ${Thread.currentThread().getName}, stopping " +
          "SparkContext", t)
        sc.stop()
      }
How about throwing t again here? So that the user can use an UncaughtExceptionHandler to monitor the uncaught exception. If not, the user won't be aware that sc is shut down until calling runJob next time.
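(A small sketch of the monitoring pattern referred to here: if the wrapper re-threw the Throwable out of the thread, a user-installed default handler such as this hypothetical installMonitoringHandler would observe it.)

```scala
object MonitoringSketch {
  // Hypothetical helper the user application could call at driver startup.
  def installMonitoringHandler(): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      // This fires only for exceptions that actually escape a thread's run(),
      // which is why re-throwing from the wrapper matters for monitoring.
      override def uncaughtException(thread: Thread, t: Throwable): Unit = {
        System.err.println(
          s"[monitor] uncaught ${t.getClass.getName} in ${thread.getName}: ${t.getMessage}")
        // e.g. notify an external monitoring / alerting system here
      }
    })
  }
}
```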
Hi @zsxwing, thanks for the comments. I personally prefer a more conservative way here (the current approach), because the throwables thrown from here can vary in type, and I'm concerned that a Throwable thrown from here, like an OOM, would be mixed with instances of the same type generated by the other components in the user's program; on the other hand, our goal is just to let the user know that SparkContext is stopped. So I prefer letting the user call SparkContext.runJob and get an IllegalStateException("SparkContext has been shutdown"), which (hopefully) will be handled exactly.
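(A sketch of the failure mode described above, as it would surface in user code; nextBatch, sc and data are hypothetical names for the surrounding driver program, and the exception message is the one quoted in the comment.)

```scala
import org.apache.spark.SparkContext

object NextActionSketch {
  // Hypothetical next action in the user's driver program, called after a
  // background loop has already stopped the SparkContext.
  def nextBatch(sc: SparkContext, data: Seq[Int]): Unit = {
    try {
      val count = sc.parallelize(data).count() // count() goes through runJob
      println(s"processed $count records")
    } catch {
      case e: IllegalStateException =>
        // e.g. "SparkContext has been shutdown": surface this to the operator / HA tooling
        System.err.println(s"SparkContext is no longer usable: ${e.getMessage}")
    }
  }
}
```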
What if we catch NonFatal(e) and re-throw other Throwables? Basically saying that fatal errors should be re-thrown, but lesser ones can just stop here; they should only be application-level exceptions, which are our code's concern.
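(A sketch of this suggested variant, built on the earlier tryOrStopSparkContext sketch: non-fatal exceptions stop the context and are swallowed, fatal ones stop it and are re-thrown. Again an approximation, not the merged code.)

```scala
import scala.util.control.{ControlThrowable, NonFatal}

import org.apache.spark.SparkContext

object StopOrRethrowSketch {
  def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e
      case NonFatal(e) =>
        // Application-level failure: stopping the SparkContext is enough.
        System.err.println(s"uncaught exception, stopping SparkContext: $e")
        if (sc != null) sc.stop()
      case t: Throwable =>
        // Fatal error (e.g. OutOfMemoryError): stop the context but keep
        // propagating so the JVM and user-installed handlers still see it.
        if (sc != null) sc.stop()
        throw t
    }
  }
}
```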
I don't have a strong opinion on this one. I suppose there's a question of what may happen if you stop the SparkContext.
Since
Test build #28825 has started for PR 5004 at commit
Test build #28825 has finished for PR 5004 at commit
Test PASSed.
LGTM
LGTM
thanks, guys~
Cool, merging this into master. Thanks!
https://issues.apache.org/jira/browse/SPARK-4012
This patch is a resubmission for #2864
What I am proposing in this patch is that _when the exception is thrown from an infinite loop, we should stop the SparkContext, instead of letting the JVM throw exceptions forever_.
So, in the infinite loops that were originally wrapped with logUncaughtExceptions, I changed the wrapper to tryOrStopSparkContext, so that the Spark component is stopped.

An early-stopped JVM process is helpful for HA scheme design. For example, the user may have a script that checks the existence of the pid of the Spark Streaming driver to monitor availability; with the code before this patch, the JVM process is still alive but not functional when the exceptions are thrown.
@andrewor14, @srowen, mind giving the change further consideration?