
[SPARK-22123][CORE] Add latest failure reason for task set blacklist #19338

Closed · wants to merge 8 commits

Conversation

caneGuy (Contributor) commented Sep 25, 2017

What changes were proposed in this pull request?

This patch adds the latest failure reason for the task set blacklist, which can be shown on the Spark UI so users can see the failure reason directly.
Until now, every job aborted by a complete blacklist only showed a log line like the one below, with no further information:

Aborting $taskSet because task $indexInTaskSet (partition $partition) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*.

After this change:

Aborting TaskSet 0.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Some(Lost task 0.1 in stage 0.0 (TID 3,xxx, executor 1): java.lang.Exception: Fake error!
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:73)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:305)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
). 

Blacklisting behavior can be configured via spark.blacklist.*.

How was this patch tested?

Unit tests and manual testing.
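
(For illustration, a minimal sketch of the kind of assertion such a test makes; the message and pattern below are simplified stand-ins, not the actual suite code, which appears in a diff later in this thread.)

```scala
// Simplified stand-in for the real scheduler test: build the expected
// multi-line abort message and check it against a regex pattern.
val abortMessage =
  """
    |Aborting TaskSet 0.0 because task 0 (partition 0)
    |cannot run anywhere due to node and executor blacklist.""".stripMargin

val pattern = (
  """
    |Aborting TaskSet 0.0 because task .*
    |cannot run anywhere due to node and executor blacklist""".stripMargin).r

assert(pattern.findFirstIn(abortMessage).isDefined)
```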

caneGuy (Author) commented Sep 25, 2017

@squito Could you help review this?

squito (Contributor) commented Sep 25, 2017

@caneGuy thanks for working on this; it looks very reasonable to me. I am going to take a closer look at a couple of details, but can you make a couple of updates in the meantime:

  1. Can you open a new JIRA for this and put that in the commit summary? SPARK-21539 refers to something else entirely.
  2. Can you reformat the new exception to look a bit more like the formatting used when there are too many failures of a specific task? Maybe like this:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 0.0 because task 0 (partition 0) cannot run anywhere due to node and executor blacklist. Most recent failure:
Lost task 0.1 in stage 0.0 (TID 3,xxx, executor 1): java.lang.Exception: Fake error!
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:73)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:305)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745) 

Blacklisting behavior can be configured via spark.blacklist.*.

Driver Stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1458)
...

squito (Contributor) commented Sep 25, 2017

Jenkins, ok to test

squito (Contributor) left a review

Thanks @caneGuy, overall this looks good; just some minor stuff to fix.

@@ -94,7 +96,9 @@ private[scheduler] class TaskSetBlacklist(val conf: SparkConf, val stageId: Int,
   private[scheduler] def updateBlacklistForFailedTask(
       host: String,
       exec: String,
-      index: Int): Unit = {
+      index: Int,
+      failureReason: Option[String] = None): Unit = {
squito:

failureReason should always be present in this call, so it shouldn't be an Option as an arg to this method.

(I realize this is a bit of a pain as you have to modify all the call sites in tests, sorry about that).
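
(A sketch of the requested signature, with the surrounding class abbreviated; the latestFailureReason field matches the private var discussed below, and the bookkeeping body is elided as an assumption.)

```scala
package org.apache.spark.scheduler

// Sketch only: the real class carries conf, stageId, clock, and the
// blacklist bookkeeping state.
private[scheduler] class TaskSetBlacklist {
  private var latestFailureReason: String = null

  private[scheduler] def updateBlacklistForFailedTask(
      host: String,
      exec: String,
      index: Int,
      failureReason: String): Unit = { // plain String, not Option[String]
    // Record the most recent reason; callers must pass a non-null value.
    latestFailureReason = failureReason
    // ... existing per-executor / per-node blacklist updates elided ...
  }
}
```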

caneGuy (Author):

Actually, you are right. For completeness I should modify this.

@@ -838,7 +840,7 @@ private[spark] class TaskSetManager(

     if (!isZombie && reason.countTowardsTaskFailures) {
       taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
-        info.host, info.executorId, index))
+        info.host, info.executorId, index, Some(failureReason)))
+      assert (null != failureReason)
squito:

Move the assert(null != failureReason) first and, to go along with the other change, drop the Some wrapper around failureReason.
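
(A sketch of the call site with both suggestions applied, assuming the String signature from the earlier sketch:)

```scala
// Assert first, so a missing reason fails fast before it is recorded.
assert(failureReason != null)
if (!isZombie && reason.countTowardsTaskFailures) {
  taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
    info.host, info.executorId, index, failureReason)) // no Some(...) wrapper
}
```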

SparkQA commented Sep 25, 2017

Test build #82152 has finished for PR 19338 at commit 8134142.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

caneGuy (Author) commented Sep 26, 2017

Thanks for your time @squito. I will open another JIRA for this PR and update the code as soon as possible.

@caneGuy caneGuy changed the title [SPARK-21539][CORE] Add latest failure reason for task set blacklist [SPARK-22123][CORE] Add latest failure reason for task set blacklist Sep 26, 2017
SparkQA commented Sep 26, 2017

Test build #82169 has finished for PR 19338 at commit 57190ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -671,8 +671,9 @@ private[spark] class TaskSetManager(
     if (blacklistedEverywhere) {
       val partition = tasks(indexInTaskSet).partitionId
       abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-        s"cannot run anywhere due to node and executor blacklist. Blacklisting behavior " +
-        s"can be configured via spark.blacklist.*.")
+        s"cannot run anywhere due to node and executor blacklist." +

Can you please prettify the output message a bit? From what I saw in the PR description, it is a bit messy.

@@ -61,6 +61,8 @@ private[scheduler] class TaskSetBlacklist(val conf: SparkConf, val stageId: Int,
   private val blacklistedExecs = new HashSet[String]()
   private val blacklistedNodes = new HashSet[String]()

+  var taskSetLatestFailureReason: String = null
jerryshao:

Can we please avoid public variables here? Also, why not make it less verbose by changing it to latestFailureReason?

caneGuy (Author):

I have thought about the visibility problem, but we need to read this value from TaskSetManager. Would it make sense if I add a def getLatestFailureReason? @jerryshao
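
(A minimal sketch of that accessor, matching the getLatestFailureReason name quoted later in the review:)

```scala
// Inside TaskSetBlacklist: keep the mutable field private and expose a
// read-only accessor so TaskSetManager can embed the reason in its
// abort message.
private var latestFailureReason: String = null

/**
 * Get the most recent failure reason of this TaskSet.
 */
def getLatestFailureReason: String = latestFailureReason
```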

SparkQA commented Sep 26, 2017

Test build #82190 has finished for PR 19338 at commit 2147450.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

squito (Contributor) left a review

@caneGuy can you update the PR description to match the new formatting of the error msg?

@jerryshao I think this is fine, do you have more concerns?

  /**
   * Get the most recent failure reason of this TaskSet.
   * @return
   */
  def getLatestFailureReason: String = {
squito:

@jerryshao the whole class is private[scheduler], so I think it is OK.

jerryshao:

@squito yes, from a scoping level it is fine. My thought is that this exposes the class member to other classes unnecessarily. Yeah, it is not a big deal, just my personal preference.

caneGuy (Author):

Thanks @jerryshao @squito. Could you help trigger another Jenkins test, since the last one had a PySpark failure?

caneGuy (Author) commented Sep 27, 2017

Thanks @squito, I have updated the description.

@@ -671,8 +671,10 @@ private[spark] class TaskSetManager(
     if (blacklistedEverywhere) {
       val partition = tasks(indexInTaskSet).partitionId
       abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-        s"cannot run anywhere due to node and executor blacklist. Blacklisting behavior " +
-        s"can be configured via spark.blacklist.*.")
+        s"cannot run anywhere due to node and executor blacklist.\n" +
jerryshao:

Can we change this to Scala's triple-quoted string interpolation?
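
(For illustration, a self-contained example of the triple-quoted style with stripMargin; the message text is simplified, not the PR's exact wording:)

```scala
val taskSet = "TaskSet 0.0"
val indexInTaskSet = 0
val partition = 0

// The leading | marks the margin; stripMargin removes everything up to
// and including it, so source indentation does not leak into the output.
val msg = s"""
  |Aborting $taskSet because task $indexInTaskSet (partition $partition)
  |cannot run anywhere due to node and executor blacklist.
  |""".stripMargin

println(msg)
```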

caneGuy (Author):

Nice, @jerryshao, I will update as soon as possible.

|Aborting $taskSet because task $indexInTaskSet (partition $partition)
|cannot run anywhere due to node and executor blacklist.
|Most recent failure:
|${taskSetBlacklist.getLatestFailureReason}\n
jerryshao commented Sep 27, 2017:

You don't have to add "\n", and this "\n" will not be escaped in the triple-quoted format. Please try to understand the basics of this API before modifying it.

Also, verify locally before pushing a new commit, and update the PR description to reflect your new format accordingly.

caneGuy (Author):

Actually, I tested locally with something like the below:

scala> val s = s"""
     | sss\n
     | sss"""
scala> print(s)

sss

sss

And I found that there is no need to add \n, since triple quoting is designed to avoid such escapes. Sorry for that; I had already updated before your newest comment. Thanks @jerryshao

jerryshao:

Are you sure?

scala> val s = """fdsafdsa\n\nfdsafdsa"""
s: String = fdsafdsa\n\nfdsafdsa

jerryshao:

NVM, it looks like the result differs depending on whether string interpolation is used.

caneGuy (Author):

It is my fault; the Scala version affected it. My default Scala is 2.10.4, and below is the result I tested with Scala 2.11.2:

scala> val s="""ss\nss
     | sss\n"""
s: String =
ss\nss
sss\n

scala> print(s)
ss\nss
sss\n
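
(For reference, the difference between the two REPL sessions above tracks the s interpolator rather than the Scala version alone: the first session used an s"""...""" interpolated string, while the 2.11 session used a plain triple-quoted literal. The s interpolator processes standard escape sequences even inside triple quotes, while plain triple quotes do not:)

```scala
// Plain triple quotes: no escape processing; \n stays as two characters.
val plain = """a\nb"""
println(plain)   // prints: a\nb

// The s interpolator processes escapes, even inside triple quotes.
val interp = s"""a\nb"""
println(interp)  // prints 'a', a real newline, then 'b'
```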

SparkQA commented Sep 27, 2017

Test build #82215 has finished for PR 19338 at commit 4d906b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 27, 2017

Test build #82217 has finished for PR 19338 at commit 05adc2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jerryshao (Contributor):

There's one related test failure; can you please check?

SparkQA commented Sep 27, 2017

Test build #82220 has finished for PR 19338 at commit 9a2d7ba.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

jerryshao (Contributor):

Jenkins, retest this please.

SparkQA commented Sep 27, 2017

Test build #82226 has finished for PR 19338 at commit 9a2d7ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val pattern = ("Aborting TaskSet 0.0 because task .* " +
"cannot run anywhere due to node and executor blacklist").r
val pattern = (s"""
|Aborting TaskSet 0.0 because task .*

I think this should be a two-space indent, also for the line below.

SparkQA commented Sep 27, 2017

Test build #82241 has finished for PR 19338 at commit d01e112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jerryshao (Contributor):

LGTM, merging to master.

@asfgit asfgit closed this in 3b117d6 Sep 28, 2017
@caneGuy caneGuy deleted the zhoukang/improve-blacklist branch September 28, 2017 01:58