[SPARK-30359][CORE] Don't clear executorsPendingToRemove at the beginning of CoarseGrainedSchedulerBackend.reset #27017
Conversation
Test build #115812 has finished for PR 27017 at commit
Test build #115816 has finished for PR 27017 at commit
Test build #115837 has finished for PR 27017 at commit
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -1894,4 +1903,60 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logging
   manager.handleFailedTask(offerResult.get.taskId, TaskState.FAILED, reason)
   assert(sched.taskSetsFailed.contains(taskSet.id))
 }

+ test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
nit: don't clean executorsPendingToRemove at the beginning of 'reset'. We do clear it eventually.
Test build #115842 has finished for PR 27017 at commit
LGTM, only nits.
// use local-cluster mode in order to get CoarseGrainedSchedulerBackend
.setMaster("local-cluster[2, 1, 2048]")
// allow setting up at most two executors
.set("spark.cores.max", "2")
why do we still need this config?
In order to create at most 2 executors at the beginning... though this may not be necessary.
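For illustration, a hedged sketch of how this configuration could be assembled inside a test body. The master string and spark.cores.max come from the snippet above; the app name and variable names are assumptions, not from the PR:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SPARK-30359-demo") // hypothetical app name
  // local-cluster mode so that a real CoarseGrainedSchedulerBackend is used;
  // [2, 1, 2048] = 2 workers, 1 core per worker, 2048 MB memory per worker
  .setMaster("local-cluster[2, 1, 2048]")
  // cap the application at 2 cores, i.e. at most two single-core executors
  .set("spark.cores.max", "2")
val sc = new SparkContext(conf)
```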
backend.reset()

eventually(timeout(10.seconds), interval(100.milliseconds)) {
  // executorsPendingToRemove should still be empty after reset()
nit: still -> eventually
  assert(manager.invokePrivate(numFailures())(index0) === 0)
  assert(manager.invokePrivate(numFailures())(index1) === 1)
}
sc.stop()
This is not necessary because LocalSparkContext would stop it after each test case.
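For reference, a rough sketch of how such a fixture trait typically works, which is why the explicit sc.stop() is redundant. This is a simplified illustration, not necessarily Spark's exact LocalSparkContext implementation:

```scala
import org.scalatest.{BeforeAndAfterEach, Suite}
import org.apache.spark.SparkContext

trait LocalSparkContext extends BeforeAndAfterEach { self: Suite =>
  @transient var sc: SparkContext = _

  override def afterEach(): Unit = {
    try {
      if (sc != null) {
        sc.stop() // stop the context even when the test body failed
        sc = null
      }
    } finally {
      super.afterEach()
    }
  }
}
```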
Jenkins retest this please
@@ -1904,8 +1904,7 @@ class TaskSetManagerSuite
   assert(sched.taskSetsFailed.contains(taskSet.id))
 }

- test("SPARK-30359: Don't clear executorsPendingToRemove in CoarseGrainedSchedulerBackend.reset")
- {
+ test("SPARK-30359: don't clean executorsPendingToRemove at the beginning of 'reset'") {
nit: you still need to mention CoarseGrainedSchedulerBackend.reset
Test build #115876 has finished for PR 27017 at commit
Test build #115877 has finished for PR 27017 at commit
Test build #115879 has finished for PR 27017 at commit
// task0 on exec0 should not count failures
backend.executorsPendingToRemove(exec0) = true
// task1 on exec1 should count failures
what makes exec1 different from exec0, so that it counts failures?
Here, executorsPendingToRemove(exec0) = true while executorsPendingToRemove(exec1) = false. false means the executor's crash may be related to bad tasks running on it, so those tasks' failures should be counted. true means the executor was killed by the driver, which has nothing to do with its tasks.
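A simplified, self-contained sketch of the semantics being described. This is a toy model, not Spark's actual code; the object and method names are made up to mirror the discussion:

```scala
import scala.collection.mutable

object PendingRemoveSemantics extends App {
  // executor id -> "was this executor deliberately killed by the driver?"
  val executorsPendingToRemove = mutable.HashMap[String, Boolean]()

  // true  -> killed by the driver; the loss is unrelated to its tasks, so
  //          their failures should NOT be counted
  // false -> the executor crashed, possibly because of a bad task, so the
  //          failure SHOULD count toward the task's failure limit
  def countsTowardsTaskFailures(execId: String): Boolean =
    !executorsPendingToRemove.remove(execId).getOrElse(false)

  executorsPendingToRemove("exec0") = true  // task0's executor: killed by driver
  executorsPendingToRemove("exec1") = false // task1's executor: unexpected crash

  assert(!countsTowardsTaskFailures("exec0")) // numFailures for task0 stays 0
  assert(countsTowardsTaskFailures("exec1"))  // numFailures for task1 becomes 1
}
```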
LGTM
Test build #116080 has finished for PR 27017 at commit
retest this please
Test build #116084 has finished for PR 27017 at commit
thanks, merging to master!
What changes were proposed in this pull request?

Remove executorsPendingToRemove.clear() from CoarseGrainedSchedulerBackend.reset().

Why are the changes needed?

Clearing executorsPendingToRemove before removing the executors causes all tasks running on those "pending to remove" executors to count toward failures, which is wrong for the case of executorsPendingToRemove(execId) = true. Besides, executorsPendingToRemove will be cleaned up within removeExecutor() at the end, just the same as executorsPendingLossReason.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test in TaskSetManagerSuite.
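To make the ordering issue concrete, here is a self-contained toy model of the reset()/removeExecutor() interaction described above. It is a hedged sketch, not Spark's real classes; field and method names are simplified stand-ins:

```scala
import scala.collection.mutable

object ResetOrderingDemo extends App {
  // Toy stand-ins for the backend's bookkeeping
  val executorDataMap = mutable.HashMap("exec0" -> "data0", "exec1" -> "data1")
  val executorsPendingToRemove = mutable.HashMap("exec0" -> true) // killed by driver
  var countedFailures = 0

  def removeExecutor(execId: String): Unit = {
    // the entry is consumed here, at the end of removal -- which is why an
    // early clear() in reset() is both unnecessary and harmful
    val killedByDriver = executorsPendingToRemove.remove(execId).getOrElse(false)
    if (!killedByDriver) countedFailures += 1 // only unexpected losses count
    executorDataMap.remove(execId)
  }

  def reset(): Unit = {
    // Note: no executorsPendingToRemove.clear() here. Clearing first would
    // make every removal below look like an unexpected crash.
    executorDataMap.keys.toSet.foreach(removeExecutor)
  }

  reset()
  assert(countedFailures == 1) // only exec1; with an early clear() it would be 2
}
```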