[SPARK-4012] stop SparkContext when the exception is thrown from an infinite loop #5004
Conversation
Test build #28525 has started for PR 5004 at commit
Test build #28525 has finished for PR 5004 at commit
Test PASSed.
Do you think calling
Hi @zsxwing, but I'm not sure whether we need to set the exception handler of all threads to SparkUncaughtExceptionHandler, because that would mean that once there is an uncaught exception, we stop the whole program...
Yes. So I suggest that only threads created by Spark should set SparkUncaughtExceptionHandler. I think if a Spark internal thread throws an uncaught exception, it often means some Spark internal module has crashed. Just my 2 cents about improving robustness. Of course, your changes look good to me.
@zsxwing, sounds reasonable; let's wait for more eyes on this.
I tend to favor this change. I'd like it if, say, @aarondav could comment, since he added one of the lines being changed.
@srowen thanks for the comments. The reason to change FsHistoryProvider is that the runner generated by this function is essentially executed by a thread pool at a fixed rate.
Is it really OK to System.exit() the driver JVM? This may be user code that has an embedded SparkContext. The SparkUncaughtExceptionHandler is suitable for Executors, where we have full control over the JVM, and AppClient for the same reason, but I'm not sure TaskSchedulerImpl should be using it, or ContextCleaner, for instance. For driver shutdowns, it seems safer just to stop() the SparkContext.
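(For context, a minimal sketch of the exit-the-JVM pattern under discussion; this is not the Spark source: ExitSketch and its println-based logging are hypothetical stand-ins for SparkUncaughtExceptionHandler and Spark's Logging trait.)

```scala
import scala.util.control.ControlThrowable

object ExitSketch {
  // Hypothetical stand-in for SparkUncaughtExceptionHandler: log and kill the JVM.
  // Reasonable for executor JVMs that Spark fully owns, but drastic for a driver
  // JVM embedded in user code.
  def uncaughtException(t: Throwable): Unit = {
    System.err.println(s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
    System.exit(50)
  }

  // tryOrExit-style wrapper: any uncaught Throwable terminates the whole process.
  def tryOrExit(block: => Unit): Unit = {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e // Scala control-flow exceptions pass through
      case t: Throwable => uncaughtException(t)
    }
  }
}
```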
Test build #28613 has started for PR 5004 at commit
@aarondav, thanks for the insightful suggestion. I just updated the patch. The change becomes a bit bigger, as I need to create a new method in Utils (tryOrStopSparkContext); also, I need to pass a SparkContext reference to
Test build #28613 has finished for PR 5004 at commit
Test PASSed.
@@ -1156,6 +1156,18 @@ private[spark] object Utils extends Logging {
  }

  /**
   * Execute a block of code that evaluates to Unit, stop SparkContext is any uncaught exception
Add a comment contrasting this to tryOrExit, saying that this method is suitable for the driver while tryOrExit should be used for other JVMs started by Spark, over which we have full control. Also, second part should say something like "stopping the SparkContext if there is any uncaught exception."
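(For reference, a rough sketch of the driver-side helper under review, assuming the tryOrStopSparkContext name mentioned in the PR description; println stands in for Spark's Logging, and the merged code may differ in details.)

```scala
import scala.util.control.ControlThrowable

import org.apache.spark.SparkContext

object StopSketch {
  // Run a block and, instead of exiting the JVM as tryOrExit does, stop the given
  // SparkContext on an uncaught exception, so user code embedding the driver keeps running.
  def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e // never swallow Scala control-flow exceptions
      case t: Throwable =>
        System.err.println(
          s"uncaught error in thread ${Thread.currentThread().getName}, stopping SparkContext: $t")
        if (sc != null) {
          sc.stop()
        }
    }
  }
}
```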
@aarondav thanks for the comments, I just updated the patch.
Test build #28619 has started for PR 5004 at commit
Test build #28619 has finished for PR 5004 at commit
Test PASSed.
        logError(s"uncaught error in thread ${Thread.currentThread().getName}, stopping " +
          "SparkContext", t)
        sc.stop()
      }
How about throwing t again here? So that the user can use an UncaughtExceptionHandler to monitor the uncaught exception. If not, the user won't be aware that sc is shut down until calling runJob next time.
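(A small sketch of the monitoring pattern referred to here: if the wrapper re-threw the Throwable out of the thread, a user-installed default handler such as this hypothetical installMonitoringHandler would observe it.)

```scala
object MonitoringSketch {
  // Hypothetical helper the user application could call at driver startup.
  def installMonitoringHandler(): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      // This fires only for exceptions that actually escape a thread's run(),
      // which is why re-throwing from the wrapper matters for monitoring.
      override def uncaughtException(thread: Thread, t: Throwable): Unit = {
        System.err.println(
          s"[monitor] uncaught ${t.getClass.getName} in ${thread.getName}: ${t.getMessage}")
        // e.g. notify an external monitoring / alerting system here
      }
    })
  }
}
```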
Hi @zsxwing, thanks for the comments. I personally prefer a more conservative way here (the current approach), because the throwables thrown from here can vary in type, and I'm concerned that a Throwable thrown from here, like an OOM, would be mixed with instances of the same type generated by the other components in the user's program; on the other hand, our goal is just to let the user know that SparkContext is stopped. So I prefer letting the user call SparkContext.runJob and get an IllegalStateException("SparkContext has been shutdown"), which (hopefully) will be handled exactly.
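(A sketch of the failure mode described above, as it would surface in user code; nextBatch, sc and data are hypothetical names for the surrounding driver program, and the exception message is the one quoted in the comment.)

```scala
import org.apache.spark.SparkContext

object NextActionSketch {
  // Hypothetical next action in the user's driver program, called after a
  // background loop has already stopped the SparkContext.
  def nextBatch(sc: SparkContext, data: Seq[Int]): Unit = {
    try {
      val count = sc.parallelize(data).count() // count() goes through runJob
      println(s"processed $count records")
    } catch {
      case e: IllegalStateException =>
        // e.g. "SparkContext has been shutdown": surface this to the operator / HA tooling
        System.err.println(s"SparkContext is no longer usable: ${e.getMessage}")
    }
  }
}
```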
What if we catch NonFatal(e) and re-throw other Throwables? Basically saying that fatal errors should be re-thrown, but lesser ones can just stop here; they should only be application-level exceptions, which are our code's concern.
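(A sketch of this suggested variant, built on the earlier tryOrStopSparkContext sketch: non-fatal exceptions stop the context and are swallowed, fatal ones stop it and are re-thrown. Again an approximation, not the merged code.)

```scala
import scala.util.control.{ControlThrowable, NonFatal}

import org.apache.spark.SparkContext

object StopOrRethrowSketch {
  def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e
      case NonFatal(e) =>
        // Application-level failure: stopping the SparkContext is enough.
        System.err.println(s"uncaught exception, stopping SparkContext: $e")
        if (sc != null) sc.stop()
      case t: Throwable =>
        // Fatal error (e.g. OutOfMemoryError): stop the context but keep
        // propagating so the JVM and user-installed handlers still see it.
        if (sc != null) sc.stop()
        throw t
    }
  }
}
```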
I don't have a strong opinion on this one. I suppose there's a question of what may happen if you stop the SparkContext.
Since
Test build #28825 has started for PR 5004 at commit
Test build #28825 has finished for PR 5004 at commit
Test PASSed.
LGTM
LGTM
thanks, guys~
Cool, merging this into master. Thanks!
https://issues.apache.org/jira/browse/SPARK-4012
This patch is a resubmission for #2864
What I am proposing in this patch is that _when the exception is thrown from an infinite loop, we should stop the SparkContext, instead of letting the JVM throw exceptions forever_.
So, in the infinite loops that were originally wrapped with logUncaughtExceptions, I changed the wrapper to tryOrStopSparkContext, so that the Spark component is stopped.

An early-stopped JVM process is helpful for HA scheme design. For example, the user may have a script that checks the existence of the pid of the Spark Streaming driver to monitor availability; with the code before this patch, the JVM process is still alive but not functional when the exceptions are thrown.
@andrewor14, @srowen, mind giving the change further consideration?