
[SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted #14871

Closed
wants to merge 3 commits into master from bail-early-if-no-cpus

Conversation

@JoshRosen (Contributor) commented Aug 30, 2016

This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run

sc.parallelize(1 to 100000, 100000).map(identity).count()

then most of the time ends up being spent in TaskSetManager.abortIfCompletelyBlacklisted():

[Profiler screenshot: the call tree is dominated by TaskSetManager.abortIfCompletelyBlacklisted]

When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:

    for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        launchedTask = resourceOfferSingleTaskSet(
          taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }

In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a taskSet.abortIfCompletelyBlacklisted call inside resourceOfferSingleTaskSet: whenever resourceOfferSingleTaskSet fails to schedule any tasks, abortIfCompletelyBlacklisted checks whether the remaining tasks are completely blacklisted and therefore can never be scheduled. The problem with this placement is that the do/while loop above only terminates once resourceOfferSingleTaskSet returns false, so the final iteration at each locality level always runs the abortIfCompletelyBlacklisted check. As a result, almost every call to resourceOffers triggers this check for every task set.
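
For context, here is a simplified sketch of where #13603 placed the check (illustrative, not the exact Spark source; hostToExecutors stands in for the scheduler's host-to-executors map):

    private def resourceOfferSingleTaskSet(
        taskSet: TaskSetManager,
        maxLocality: TaskLocality.TaskLocality,
        shuffledOffers: Seq[WorkerOffer],
        availableCpus: Array[Int],
        tasks: Seq[ArrayBuffer[TaskDescription]]): Boolean = {
      var launchedTask = false
      for (i <- 0 until shuffledOffers.size) {
        // ...try to launch one of taskSet's tasks on offer i at the given
        // locality level; set launchedTask = true if anything was scheduled...
      }
      if (!launchedTask) {
        // Reached on the terminating iteration of the caller's do/while loop,
        // i.e. on nearly every round of resource offers, for every task set.
        taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
      }
      launchedTask
    }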

Instead, I think this call should be moved out of the innermost loop and made at most once per task set, only when none of that task set's tasks could be scheduled at any locality level (sketched below).
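
Concretely, the restructured offer loop could look roughly like this (variable names follow the final version after the rename discussed in the review thread below; hostToExecutors again stands in for the scheduler's host-to-executors map):

    for (taskSet <- sortedTaskSets) {
      var launchedAnyTask = false
      var launchedTaskAtCurrentMaxLocality = false
      for (currentMaxLocality <- taskSet.myLocalityLevels) {
        do {
          launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
            taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
          launchedAnyTask = launchedAnyTask || launchedTaskAtCurrentMaxLocality
        } while (launchedTaskAtCurrentMaxLocality)
      }
      if (!launchedAnyTask) {
        // Only reached when this task set could not schedule anything at any
        // locality level, so the check runs at most once per task set per offer round.
        taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
      }
    }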

Before this patch, the microbenchmark above took 35 seconds to run; with this change it takes 15 seconds.

/cc @squito and @kayousterhout for review.

The inline review thread below is attached to these new lines in TaskSchedulerImpl.resourceOffers:

    for (taskSet <- sortedTaskSets) {
      var launchedAnyTask = false
      var launchedTaskAtMaxLocality = false
A reviewer (Contributor) commented on these lines:
I know it's super verbose, but to minimize how confusing this code is, what about calling this launchedTaskAtCurrentMaxLocality, and renaming maxLocality below to currentMaxLocality?

@JoshRosen (Contributor, Author) replied:
Good idea; updated.


SparkQA commented Aug 30, 2016

Test build #64609 has finished for PR 14871 at commit 5d20b44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 30, 2016

Test build #64610 has finished for PR 14871 at commit 321d0c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout (Contributor) commented:

LGTM

Josh, how long does your microbenchmark take if you comment out the call to abortIfCompletelyBlacklisted? Wondering how much that continues to affect performance.

@JoshRosen (Contributor, Author) commented:

@kayousterhout, after this patch's changes, commenting out abortIfCompletelyBlacklisted appears to provide no further performance improvement. However, I have identified a few bottlenecks in Accumulator serialization that can be exploited for an additional ~30% speedup on this microbenchmark; I'll submit those fixes in a separate patch.

@kayousterhout (Contributor) commented:

Ok cool thanks @JoshRosen -- just wanted to make sure that the blacklisting wasn't still hurting performance. Thanks for fixing this and sorry about the oversight originally!

@JoshRosen (Contributor, Author) commented:

Merging this to master. Thanks!

@asfgit closed this in fb20084 on Aug 30, 2016
@JoshRosen deleted the bail-early-if-no-cpus branch on August 30, 2016 at 20:26

squito commented Aug 31, 2016

thanks for finding this and the quick fix @JoshRosen !
