[SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors #13603
Conversation
```scala
val index = allPendingTasks(indexOffset)
if (copiesRunning(index) == 0 && !successful(index)) {
  return Some(index)
}
```
I'm pretty sure that we could add

```scala
else {
  // this task has already been scheduled from one of our other task queues, so remove it
  // from this one as well, even though we're not actually scheduling anything here.
  allPendingTasks.remove(indexOffset)
}
```
But it shouldn't be necessary to do here, and I'm just nervous enough about adding it that I opted not to.
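For readers without the full diff in front of them, here is a standalone sketch of the dequeue pattern being discussed, with the suggested `else` branch included. The names mirror the snippet above, but this is a toy model, not the actual `TaskSetManager` code:

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of polling a pending-task queue with lazy removal: entries whose task is
// already running (or finished) via another queue are dropped only when we poll past them.
def dequeueSchedulable(
    allPendingTasks: ArrayBuffer[Int],
    copiesRunning: Int => Int,
    successful: Int => Boolean): Option[Int] = {
  var indexOffset = allPendingTasks.size
  while (indexOffset > 0) {
    indexOffset -= 1
    val index = allPendingTasks(indexOffset)
    if (copiesRunning(index) == 0 && !successful(index)) {
      return Some(index)
    } else {
      // this task has already been scheduled from one of our other task queues, so remove
      // it from this one as well, even though we're not scheduling anything here
      allPendingTasks.remove(indexOffset)
    }
  }
  None
}
```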
Test build #60300 has finished for PR 13603 at commit
q: was blacklist merged in 2.0?
@squito if it's not too painful, would you mind moving the visibility stuff to a separate PR? (I suspect that PR can be merged almost immediately!)
Did you consider instead doing this when a task fails (on line 761 in TaskSetManager)? Instead of just checking if the number of failures is greater than maxTaskFailures, you could add a second check (if blacklisting is enabled) that checks whether the task that just failed could be scheduled anywhere, and if it can't be, fail the task set. This seems simpler to me.

The main drawback I see in that approach is that it could be the case that the task failure was caused by an executor failure, and the cluster manager is in the process of launching a new executor that the task could run on, so it's not correct to fail the task set. My sense is that it's OK-ish to fail in that case, since that seems like it will only happen for jobs that use a super small number of executors, in which case random-ish failures are less likely, so the failure is more likely to be a real issue with the job.
@kayousterhout sure, I'll pull the visibility stuff out.

I did consider trying to do a check on task failure instead. However, I don't think that is sufficient, because you can have an executor fail. Imagine you have task 1 on executor A & task 2 on executor B. Task 1 fails, gets blacklisted from executor A -- but it can still be scheduled on executor B, so you don't fail the stage. Then executor B dies. Task 2 can run on executor A, so that isn't stuck. But task 1 now can't run anywhere. Probably unlikely, but still, having the job just hang is so bad that I think we really should avoid it. Plus it becomes much more likely w/ the new blacklisting I'm working on -- in that case, executor B gets blacklisted for the bad stage because of many task failures, and now there isn't any place for the first failed task to run. I actually ran into that case when testing an early iteration of that change.

This is subtle enough it's probably worth codifying into a test -- I'll work on adding that. (I agree with you that it's OK to fail the task set even if a new executor is just about to launch. Even this version doesn't really avoid something like that.)
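The scenario can be written down as a tiny standalone simulation (toy collections only, nothing from Spark's scheduler classes), which is roughly the condition such a regression test would need to exercise:

```scala
import scala.collection.mutable

// Toy model of the executor-loss scenario above; names are illustrative only.
object BlacklistHangScenario extends App {
  val liveExecutors = mutable.Set("A", "B")
  // per-task blacklist: task index -> executors it may no longer run on
  val blacklist = mutable.Map[Int, Set[String]]().withDefaultValue(Set.empty)

  // Task 1 fails on executor A and gets blacklisted there; it can still run on B,
  // so the stage should not be failed at this point.
  blacklist(1) = Set("A")
  assert(liveExecutors.exists(e => !blacklist(1).contains(e)))

  // Executor B dies. Task 2 (never blacklisted anywhere) can still run on A...
  liveExecutors -= "B"
  assert(liveExecutors.exists(e => !blacklist(2).contains(e)))

  // ...but task 1 now has nowhere left to run: this is exactly the condition the
  // new check needs to catch so the task set is aborted instead of hanging.
  assert(liveExecutors.forall(e => blacklist(1).contains(e)))
}
```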
```scala
 * that is schedulable, and after scheduling all of those, we'll eventually find the unschedulable
 * task.
 */
private[scheduler] def isTaskSetCompletelyBlacklisted(
```
I think it would be cleaner to add this method to the TaskSetManager class (and then you don't need the pollPendingTask method) -- and then just pass in the executorsByHost map. That also makes things a little easier to change in the future, if there gets to be some easier way of checking if a particular task set is completely blacklisted.
yeah I put it here b/c in the blacklisting stuff I'm still working on, I felt it made more sense outside TaskSetManager, since blacklisting extends beyond a single taskset (executor & node blacklisting). But I'll change it here, and we can revisit that discussion when looking at that change.
Ah got it -- agree that in the long term, given the new blacklisting, it might make sense here! But let's put it in the TaskSetManager for now.
Ohh good point, that makes sense re: lost executors. Given that, I agree that this approach seems like the right one.
Test build #60851 has finished for PR 13603 at commit
@kayousterhout sorry for the delays on my end, I've updated with the requested changes. The check is now inside TaskSetManager. I also found something kinda weird about TaskSchedulerImpl.
test("Scheduler does not crash when tasks are not serializable") { | ||
sc = new SparkContext("local", "TaskSchedulerImplSuite") |
Unfortunately, this case will also trigger a failure with the msg "Aborting TaskSet ... due to blacklist". I'm pretty sure this is a bug in `TaskSchedulerImpl`, but I am so shocked by it I'd like a quick sanity check. What's going on here is that no tasks have been accepted, so `executorsByHost` never actually adds the new executor in `resourceOfferSingleTaskSet`. But `executorsByHost` has already added the host at the beginning of `resourceOffers`, just not the executor.

But isn't there a bug in `resourceOffers` -- shouldn't that loop be updating `newExecAvailable` even if an executor is added to an already existing host? I expect this to actually be quite common under dynamic allocation. The end result is that locality preferences aren't properly updated, and `failedEpochs` aren't updated correctly.
I agree that this looks completely wrong, and that (1) resourceOffers should mark a new executor as available in more cases, not just when a new host is added -- and this means the HDFS cache locality hasn't been working when folks have multiple executors per host, and (2) resourceOffers, and not resourceOfferSingleTaskSet, should add the executor to executorsByHost.

Do you have time to file a JIRA / fix this? Seems like a quick fix, and would be nice to do before this PR, because of this weird failure (which seems like something that will be user-visible, since I'm guessing it's not uncommon, for new users, that their first task isn't serializable).
Opened https://issues.apache.org/jira/browse/SPARK-16106. I was just surprised enough that I wanted a sanity check first :)
mentioned this below as well, but just to be clear -- I was mistaken, that bug doesn't affect the case where the tasks aren't serializable. That still correctly fails with an error about serialization. The error I was encountering is in a different case ("multiple CPUs per task", since there you never add the executors, just the hosts), and it still needs a workaround for now, which I've added.
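To make the bookkeeping issue concrete, here is a standalone sketch (toy class, not the actual `TaskSchedulerImpl` code) of offer handling where the executor is recorded when offers are processed and a new executor is flagged even when its host is already known, which is roughly the behavior the comments above argue for:

```scala
import scala.collection.mutable

// Toy sketch only. The point: a new *executor* should be recorded (and flagged as newly
// available) even when its *host* has already been seen, and that recording should happen
// while processing offers, not only once a task is actually launched on the executor.
class OfferBookkeepingSketch {
  val executorsByHost = mutable.HashMap[String, mutable.HashSet[String]]()
  val executorIdToHost = mutable.HashMap[String, String]()

  /** Returns true if this offer introduced an executor we had not seen before. */
  def recordOffer(execId: String, host: String): Boolean = {
    val execsOnHost = executorsByHost.getOrElseUpdate(host, mutable.HashSet[String]())
    val newExecAvailable = !execsOnHost.contains(execId)
    if (newExecAvailable) {
      execsOnHost += execId
      executorIdToHost(execId) = host
    }
    newExecAvailable
  }
}

// Example: a second executor on an already-known host still counts as new.
//   val b = new OfferBookkeepingSketch
//   b.recordOffer("exec1", "hostA")   // true
//   b.recordOffer("exec2", "hostA")   // true, even though hostA was already known
```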
Test build #60856 has finished for PR 13603 at commit
Test build #60855 has finished for PR 13603 at commit
```scala
}
}
}
abort(s"Aborting ${taskSet} because it has a task which cannot be scheduled on any" +
```
Can you include the task ID here?
```diff
@@ -35,7 +35,7 @@ import org.apache.spark.scheduler.SchedulingMode.SchedulingMode
 import org.apache.spark.scheduler.TaskLocality.TaskLocality
 import org.apache.spark.scheduler.local.LocalSchedulerBackend
 import org.apache.spark.storage.BlockManagerId
-import org.apache.spark.util.{AccumulatorV2, ThreadUtils, Utils}
+import org.apache.spark.util._
```
undo this change? (nice to have the explicit imports as long as they're short)
Test build #61385 has finished for PR 13603 at commit
```scala
}
}
abort(s"Aborting ${taskSet} because Task $taskId (partition " +
  s"${tasks(taskId).partitionId}) cannot be scheduled on any executor due to blacklists.")
```
Maybe include the executors here (esp. since this is something users might see)?
s"Aborting ${taskSet} because task
This LGTM -- just a bunch of cosmetic suggestions
Test build #61399 has finished for PR 13603 at commit
```scala
// take any task that needs to be scheduled, and see if we can find some executor it *could*
// run on
pendingTask.foreach { taskId =>
  executors.foreach { exec =>
```
now that executors is an iterable, just do "if (executors.find(executorIsBlacklisted(_, taskId)).isEmpty) { .. abort ...}" here?
good point, in fact I can just use `executors.forall`. Sorry, I keep working on the new blacklist version in between and sometimes don't see some of these obvious simplifications in this version; thanks for catching them.
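A standalone illustration of the equivalence (stand-in names, not the actual `TaskSetManager` members): "this task has no executor it could run on" can be expressed with `forall` rather than `find(...).isEmpty`:

```scala
object ForallSketch extends App {
  // Toy predicate standing in for executorIsBlacklisted(execId, taskId).
  def executorIsBlacklisted(execId: String, taskId: Int): Boolean = execId == "badExec"

  val executors = Seq("badExec")
  val taskId = 0

  // "there is no executor this task could run on", written two ways:
  val viaFind = executors.find(e => !executorIsBlacklisted(e, taskId)).isEmpty
  val viaForall = executors.forall(e => executorIsBlacklisted(e, taskId))
  assert(viaFind && viaForall)  // with every executor blacklisted, both are true
}
```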
```scala
 */
private[scheduler] def abortIfCompletelyBlacklisted(executors: Iterable[String]): Unit = {

  def pendingTask: Option[Int] = {
```
gah sorry one more tiny thing: can this just be a val?
you can't just change this to a `val` with nothing else, b/c of the `return` when we find the task. Though you could make it a `val` by changing the inner logic, with a `var keepGoing` in the while loop or something. I actually changed it once and couldn't really make up my mind which version was cleaner ... in the end I felt an inner function wasn't so bad, but happy to change it.
I pushed a commit changing it to a `val`, so you can see both options. Easy enough to back out that last commit.
what about

```scala
val pendingTask: Option[Int] = allPendingTasks.lastIndexWhere { indexInTaskSet =>
  copiesRunning(indexInTaskSet) == 0 && !successful(indexInTaskSet)
} match {
  case -1 => None
  case indexOffset => Some(allPendingTasks(indexOffset))
}
```
(I realize we're really in the weeds here so whatever you prefer here is fine)
oh I didn't even know about `lastIndexWhere`! thanks, simpler, and despite being a minor point I appreciate learning something new :)

sorry, I think I just got on the wrong track here while thinking about doing the lazy removal as well, and when I decided against it I never stepped back to simplify this.
Test build #61488 has finished for PR 13603 at commit
Test build #61497 has finished for PR 13603 at commit
Test build #61505 has finished for PR 13603 at commit
LGTM!

merged to master. thanks Kay!
…elyBlacklisted

This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run

```
sc.parallelize(1 to 100000, 100000).map(identity).count()
```

then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`.

When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:

```scala
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
  do {
    launchedTask = resourceOfferSingleTaskSet(
      taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
  } while (launchedTask)
}
```

In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement of the call is that the last call to `resourceOfferSingleTaskSet` in the `while` loop will return `false`, implying that `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` will trigger the `abortIfCompletelyBlacklisted` check for every task set.

Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set, in case none of the task set's tasks can be scheduled at any locality level.

Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.

/cc squito and kayousterhout for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14871 from JoshRosen/bail-early-if-no-cpus.
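A rough sketch of the restructuring described above, reusing the names from the quoted snippet (a fragment of the resource-offer loop for illustration only, not necessarily the exact code merged in #14871):

```scala
// Sketch: hoist the blacklist check out of the innermost loop so it runs at most once per
// task set, and only when nothing could be scheduled for that task set at any locality level.
for (taskSet <- sortedTaskSets) {
  var launchedAnyTask = false
  for (maxLocality <- taskSet.myLocalityLevels) {
    var launchedTaskAtCurrentMaxLocality = false
    do {
      launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
        taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      launchedAnyTask = launchedAnyTask || launchedTaskAtCurrentMaxLocality
    } while (launchedTaskAtCurrentMaxLocality)
  }
  if (!launchedAnyTask) {
    // hypothetical call site; the executors argument would come from the scheduler's
    // host -> executors bookkeeping
    taskSet.abortIfCompletelyBlacklisted(executorsByHost.values.flatten)
  }
}
```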
What changes were proposed in this pull request?
Before this change, when you turn on blacklisting with `spark.scheduler.executorTaskBlacklistTime`, but you have fewer than `spark.task.maxFailures` executors, you can end up with a job "hung" after some task failures.

Whenever a taskset is unable to schedule anything in `resourceOfferSingleTaskSet`, we check whether the last pending task can be scheduled on any known executor. If not, the taskset (and any corresponding jobs) are failed.
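A standalone, simplified sketch of that check (toy data structures and stand-in names, not the actual `TaskSetManager` code): pick the last still-pending task and report it only if every known executor is blacklisted for it.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of "can the last pending task run anywhere?". If this returns Some(task),
// the real code would abort the task set instead of letting the job hang.
def completelyBlacklistedTask(
    allPendingTasks: ArrayBuffer[Int],
    copiesRunning: Int => Int,
    successful: Int => Boolean,
    executors: Iterable[String],
    isBlacklisted: (String, Int) => Boolean): Option[Int] = {
  // look from the end of the queue, since failed tasks are put back at the end
  val indexOffset = allPendingTasks.lastIndexWhere { indexInTaskSet =>
    copiesRunning(indexInTaskSet) == 0 && !successful(indexInTaskSet)
  }
  val pendingTask = if (indexOffset == -1) None else Some(allPendingTasks(indexOffset))
  // the task set is stuck only if every known executor is blacklisted for that task
  pendingTask.filter(task => executors.forall(exec => isBlacklisted(exec, task)))
}
```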
How was this patch tested?
Added a unit test which failed before the change, ran the new test 5k times manually, ran all scheduler tests manually, and ran the full suite via Jenkins.