
[SPARK-30359][CORE] Don't clear executorsPendingToRemove at the beginning of CoarseGrainedSchedulerBackend.reset #27017

Closed · wants to merge 7 commits

Conversation

@Ngone51 (Member) commented Dec 26, 2019

What changes were proposed in this pull request?

Remove executorsPendingToRemove.clear() from CoarseGrainedSchedulerBackend.reset().

Why are the changes needed?

Clearing executorsPendingToRemove before removing the executors causes all tasks running on those "pending to remove" executors to count towards task failures. But that is wrong for the case of executorsPendingToRemove(execId) = true.

Besides, executorsPendingToRemove is cleaned up inside removeExecutor() at the end anyway, just like executorsPendingLossReason.
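The change itself is essentially a one-line deletion. As a paraphrased sketch (not the exact Spark source; names follow CoarseGrainedSchedulerBackend), reset() before this PR looked roughly like:

```scala
// Paraphrased sketch of CoarseGrainedSchedulerBackend.reset() before this PR;
// not the exact Spark source.
protected def reset(): Unit = {
  val executors = synchronized {
    requestedTotalExecutors = 0
    numPendingExecutors = 0
    executorsPendingToRemove.clear() // <- the line this PR removes
    executorDataMap.keys.toSet
  }

  // removeExecutor() reads executorsPendingToRemove to decide whether task
  // failures on the lost executor should be counted, and removes the entry
  // itself; clearing the map up front loses that information.
  executors.foreach { eid =>
    removeExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))
  }
}
```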

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test in TaskSetManagerSuite.

SparkQA commented Dec 26, 2019

Test build #115812 has finished for PR 27017 at commit 511058c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Dec 26, 2019

SparkQA commented Dec 26, 2019

Test build #115816 has finished for PR 27017 at commit 3cfd80f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 27, 2019

Test build #115837 has finished for PR 27017 at commit 3368a3e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title [SPARK-30359][CORE] Do not clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset [SPARK-30359][CORE] Move executorsPendingToRemove.clear to the end of CoarseGrainedSchedulerBackend.reset Dec 27, 2019
@Ngone51 Ngone51 changed the title [SPARK-30359][CORE] Move executorsPendingToRemove.clear to the end of CoarseGrainedSchedulerBackend.reset [SPARK-30359][CORE] Do not clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset Dec 27, 2019
@@ -1894,4 +1903,60 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
manager.handleFailedTask(offerResult.get.taskId, TaskState.FAILED, reason)
assert(sched.taskSetsFailed.contains(taskSet.id))
}

test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
@cloud-fan (Contributor) commented Dec 27, 2019
nit: don't clean executorsPendingToRemove at the beginning of 'reset'. We do clear it eventually.

SparkQA commented Dec 27, 2019

Test build #115842 has finished for PR 27017 at commit eae69ce.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title [SPARK-30359][CORE] Do not clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset [SPARK-30359][CORE] Don't clear executorsPendingToRemove at the beginning of CoarseGrainedSchedulerBackend.reset Dec 28, 2019
@jiangxb1987 (Contributor) left a comment

LGTM, only nits

// use local-cluster mode in order to get CoarseGrainedSchedulerBackend
.setMaster("local-cluster[2, 1, 2048]")
// allow to set up at most two executors
.set("spark.cores.max", "2")
Contributor:

why do we still need this config?

@Ngone51 (Member Author):

To create at most 2 executors at the beginning... though this may not be necessary.

backend.reset()

eventually(timeout(10.seconds), interval(100.milliseconds)) {
// executorsPendingToRemove should still be empty after reset()
Contributor:

nit: still -> eventually

assert(manager.invokePrivate(numFailures())(index0) === 0)
assert(manager.invokePrivate(numFailures())(index1) === 1)
}
sc.stop()
Contributor:

This is not necessary because LocalSparkContext stops it after each test case.

@jiangxb1987 (Contributor):

Jenkins retest this please

@@ -1904,8 +1904,7 @@ class TaskSetManagerSuite
assert(sched.taskSetsFailed.contains(taskSet.id))
}

test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
{
test("SPARK-30359: don't clean executorsPendingToRemove at the beginning of 'reset'") {
Contributor:

nit: you still need to mention CoarseGrainedSchedulerBackend.reset

SparkQA commented Dec 28, 2019

Test build #115876 has finished for PR 27017 at commit d12dd45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 28, 2019

Test build #115877 has finished for PR 27017 at commit d12dd45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 28, 2019

Test build #115879 has finished for PR 27017 at commit 77c09e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// task0 on exec0 should not count failures
backend.executorsPendingToRemove(exec0) = true
// task1 on exec1 should count failures
Contributor:

what makes exec1 different from exec0 and count failures?

@Ngone51 (Member Author):

Here, executorsPendingToRemove(exec0) = true while executorsPendingToRemove(exec1) = false. false means the executor's crash may be related to bad tasks running on it, so those tasks should count towards failures. true means the executor was killed by the driver, so its loss has nothing to do with the tasks running on it.
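Loosely, that flag is consumed when the executor is actually removed. A paraphrased sketch (not the exact Spark source) of the relevant logic in CoarseGrainedSchedulerBackend.removeExecutor():

```scala
// Paraphrased sketch, not the exact Spark source: the boolean stored in
// executorsPendingToRemove records whether the executor was deliberately
// killed by the driver.
val killedByDriver: Boolean =
  executorsPendingToRemove.remove(executorId).getOrElse(false)

// true  -> the driver killed the executor; report ExecutorKilled, whose task
//          failures do not count towards spark.task.maxFailures.
// false -> the loss may have been caused by a bad task; keep the original
//          loss reason, so the failures are counted.
val lossReason = if (killedByDriver) ExecutorKilled else reason
scheduler.executorLost(executorId, lossReason)
```

This is why clearing the map before calling removeExecutor() in reset() incorrectly makes every lost executor look like the "count failures" case.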

@jiangxb1987 (Contributor) left a comment

LGTM

SparkQA commented Jan 3, 2020

Test build #116080 has finished for PR 27017 at commit 4fdb7cb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

SparkQA commented Jan 3, 2020

Test build #116084 has finished for PR 27017 at commit 4fdb7cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4a09317 Jan 3, 2020