
[SPARK-30359][CORE] Don't clear executorsPendingToRemove at the beginning of CoarseGrainedSchedulerBackend.reset #27017

Closed · wants to merge 7 commits

Conversation

@Ngone51 (Member) commented Dec 26, 2019

What changes were proposed in this pull request?

Remove executorsPendingToRemove.clear() from CoarseGrainedSchedulerBackend.reset().

Why are the changes needed?

Clearing executorsPendingToRemove before removing the executors causes all tasks running on those "pending to remove" executors to count towards task failures. But that is wrong for the case of executorsPendingToRemove(execId) = true.

Besides, executorsPendingToRemove is cleaned up inside removeExecutor() at the end anyway, just like executorsPendingLossReason.
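The change itself is essentially a one-line deletion. As a paraphrased sketch (not the exact Spark source; names follow CoarseGrainedSchedulerBackend), reset() before this PR looked roughly like:

```scala
// Paraphrased sketch of CoarseGrainedSchedulerBackend.reset() before this PR;
// not the exact Spark source.
protected def reset(): Unit = {
  val executors = synchronized {
    requestedTotalExecutors = 0
    numPendingExecutors = 0
    executorsPendingToRemove.clear() // <- the line this PR removes
    executorDataMap.keys.toSet
  }

  // removeExecutor() reads executorsPendingToRemove to decide whether task
  // failures on the lost executor should be counted, and removes the entry
  // itself; clearing the map up front loses that information.
  executors.foreach { eid =>
    removeExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))
  }
}
```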

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test in TaskSetManagerSuite.

SparkQA commented Dec 26, 2019

Test build #115812 has finished for PR 27017 at commit 511058c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Dec 26, 2019

SparkQA commented Dec 26, 2019

Test build #115816 has finished for PR 27017 at commit 3cfd80f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 27, 2019

Test build #115837 has finished for PR 27017 at commit 3368a3e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title [SPARK-30359][CORE] Do not clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset [SPARK-30359][CORE] Move executorsPendingToRemove.clear to the end of CoarseGrainedSchedulerBackend.reset Dec 27, 2019
@Ngone51 Ngone51 changed the title [SPARK-30359][CORE] Move executorsPendingToRemove.clear to the end of CoarseGrainedSchedulerBackend.reset [SPARK-30359][CORE] Do not clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset Dec 27, 2019
@@ -1894,4 +1903,60 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
manager.handleFailedTask(offerResult.get.taskId, TaskState.FAILED, reason)
assert(sched.taskSetsFailed.contains(taskSet.id))
}

test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
@cloud-fan (Contributor) commented Dec 27, 2019
nit: don't clean executorsPendingToRemove at the beginning of 'reset'. We do clear it eventually.

SparkQA commented Dec 27, 2019

Test build #115842 has finished for PR 27017 at commit eae69ce.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title [SPARK-30359][CORE] Do not clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset [SPARK-30359][CORE] Don't clear executorsPendingToRemove at the beginning of CoarseGrainedSchedulerBackend.reset Dec 28, 2019
@jiangxb1987 (Contributor) left a comment

LGTM, only nits

// use local-cluster mode in order to get CoarseGrainedSchedulerBackend
.setMaster("local-cluster[2, 1, 2048]")
// allow to set up at most two executors
.set("spark.cores.max", "2")
Contributor:

why do we still need this config?

@Ngone51 (Member Author):

To create at most 2 executors at the beginning... though this may not be necessary.

backend.reset()

eventually(timeout(10.seconds), interval(100.milliseconds)) {
// executorsPendingToRemove should still be empty after reset()
Contributor:

nit: still -> eventually

assert(manager.invokePrivate(numFailures())(index0) === 0)
assert(manager.invokePrivate(numFailures())(index1) === 1)
}
sc.stop()
Contributor:

This is not necessary because LocalSparkContext stops it after each test case.

@jiangxb1987 (Contributor):

Jenkins retest this please

@@ -1904,8 +1904,7 @@ class TaskSetManagerSuite
assert(sched.taskSetsFailed.contains(taskSet.id))
}

test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
{
test("SPARK-30359: don't clean executorsPendingToRemove at the beginning of 'reset'") {
Contributor:

nit: you still need to mention CoarseGrainedSchedulerBackend.reset

SparkQA commented Dec 28, 2019

Test build #115876 has finished for PR 27017 at commit d12dd45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 28, 2019

Test build #115877 has finished for PR 27017 at commit d12dd45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 28, 2019

Test build #115879 has finished for PR 27017 at commit 77c09e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// task0 on exec0 should not count failures
backend.executorsPendingToRemove(exec0) = true
// task1 on exec1 should count failures
Contributor:

what makes exec1 different from exec0 and count failures?

@Ngone51 (Member Author):

Here, executorsPendingToRemove(exec0) = true while executorsPendingToRemove(exec1) = false. false means the executor's crash may be related to bad tasks running on it, so those tasks should count towards failures. true means the executor was killed by the driver, so its loss has nothing to do with the tasks running on it.
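Loosely, that flag is consumed when the executor is actually removed. A paraphrased sketch (not the exact Spark source) of the relevant logic in CoarseGrainedSchedulerBackend.removeExecutor():

```scala
// Paraphrased sketch, not the exact Spark source: the boolean stored in
// executorsPendingToRemove records whether the executor was deliberately
// killed by the driver.
val killedByDriver: Boolean =
  executorsPendingToRemove.remove(executorId).getOrElse(false)

// true  -> the driver killed the executor; report ExecutorKilled, whose task
//          failures do not count towards spark.task.maxFailures.
// false -> the loss may have been caused by a bad task; keep the original
//          loss reason, so the failures are counted.
val lossReason = if (killedByDriver) ExecutorKilled else reason
scheduler.executorLost(executorId, lossReason)
```

This is why clearing the map before calling removeExecutor() in reset() incorrectly makes every lost executor look like the "count failures" case.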

@jiangxb1987 (Contributor) left a comment

LGTM

SparkQA commented Jan 3, 2020

Test build #116080 has finished for PR 27017 at commit 4fdb7cb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

SparkQA commented Jan 3, 2020

Test build #116084 has finished for PR 27017 at commit 4fdb7cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4a09317 Jan 3, 2020