[SPARK-4020] Do not rely on timeouts to remove failed block managers #2865

Closed
wants to merge 1 commit into from

andrewor14
Contributor

If an executor fails without ever having been scheduled to run a task, DAGScheduler does not notify BlockManagerMasterActor that the associated block manager should be removed. Instead, that block manager is expired only after a few rounds of heartbeat timeouts. For removal purposes, there should be no distinction between executors that have been scheduled tasks and those that have not.

The fix, then, is to add all known executors to TaskSchedulerImpl's activeExecutorIds, whether or not they have been scheduled a task. In fact, the existing comment above activeExecutorIds reads

// Which executor IDs we have executors on
val activeExecutorIds = new HashSet[String]

not "Which executors have been scheduled tasks thus far."
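To illustrate the idea, here is a minimal sketch, not the actual patch. `SketchScheduler`, `Offer`, and the method signatures are hypothetical stand-ins for illustration and are not Spark's real TaskSchedulerImpl API. It assumes executors become known to the scheduler when they offer resources, and registers each one before any task assignment happens, so that a later loss is recognized even for an executor that never ran a task.

```scala
import scala.collection.mutable.HashSet

// Hypothetical, simplified types; NOT Spark's real classes.
final case class Offer(executorId: String, cores: Int)

class SketchScheduler {
  // Which executor IDs we have executors on
  val activeExecutorIds = new HashSet[String]

  // Assigns up to `pendingTasks` tasks across the offers and returns
  // the number of tasks given to each executor.
  def resourceOffers(offers: Seq[Offer], pendingTasks: Int): Map[String, Int] = {
    // Register every offering executor first, independent of whether
    // it ends up receiving a task below.
    offers.foreach(o => activeExecutorIds += o.executorId)

    var remaining = pendingTasks
    offers.map { o =>
      val n = math.min(o.cores, remaining)
      remaining -= n
      o.executorId -> n
    }.toMap
  }

  // Returns true if the executor was known. In this sketch, a caller would
  // use that result to decide whether to request block manager removal.
  def executorLost(executorId: String): Boolean =
    activeExecutorIds.remove(executorId)
}
```

With two offers and only two pending tasks, the second executor receives no work but is still tracked, so `executorLost` reports it as known instead of leaving its cleanup to a timeout.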

@andrewor14
Contributor Author

Hey @kayousterhout can you take a look at this at your earliest convenience? This is blocking #2840.

@SparkQA
SparkQA commented Oct 21, 2014

QA tests have started for PR 2865 at commit ff3172b.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Oct 21, 2014

QA tests have finished for PR 2865 at commit ff3172b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21939/

@shaneknapp
Contributor

jenkins, test this please

@SparkQA
SparkQA commented Oct 21, 2014

QA tests have started for PR 2865 at commit ff3172b.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Oct 21, 2014

QA tests have finished for PR 2865 at commit ff3172b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21941/

@kayousterhout
Contributor

@andrewor14 this looks good, and definitely seems to represent the "expected" use of activeExecutorIds.

@asfgit asfgit closed this in 61ca774 Oct 21, 2014
@andrewor14 andrewor14 deleted the active-executors branch October 21, 2014 20:12

5 participants