[SPARK-20163] Kill all running tasks in a stage in case of fetch failure #17485

Closed

sitalkedia

What changes were proposed in this pull request?

Currently, the scheduler does not kill the running tasks in a stage when it encounters a fetch failure. As a result, we might end up running many duplicate tasks in the cluster. There is already a TODO in TaskSetManager to kill all running tasks, which has not been implemented.

How was this patch tested?

Unit tests.

@tgravescs
Contributor

See the discussion on the mailing list. We now have 4 different JIRAs for handling fetch failures. I think we should get a design for the entire thing first.

Personally, I don't want to kill the running ones, as they have done useful work.

sched.backend.killTask(
  attemptInfo.taskId,
  attemptInfo.executorId,
  interruptThread = true,
Contributor

That's not valid. We don't know that this can be done safely, which is why spark.job.interruptOnCancel defaults to false. See SPARK-17064.

Author

I see. @markhamstra, does it make sense to do it only if spark.job.interruptOnCancel is enabled?

Contributor

We can do it then, but there is still the question of whether we should do it. That discussion belongs in SPARK-20178.
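
For readers following the thread, here is a minimal sketch of what gating the interrupt on spark.job.interruptOnCancel might look like. It is not the code from this pull request and does not use Spark's real scheduler classes: `TaskAttempt`, `KillBackend`, `FetchFailureKill`, and the three-argument `killTask` are hypothetical stand-ins invented for illustration. Only the property name and its default of false come from the discussion above. Whether running tasks should be killed at all is still the open question deferred to SPARK-20178.

```scala
import java.util.Properties

// Hypothetical stand-ins for scheduler types; these are NOT Spark's real classes.
final case class TaskAttempt(taskId: Long, executorId: String, running: Boolean)

// Hypothetical backend interface; the real killTask signature may differ.
trait KillBackend {
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit
}

object FetchFailureKill {
  // Property name as cited in the review thread.
  val InterruptOnCancel = "spark.job.interruptOnCancel"

  // Treats a missing property (or missing properties object) as false,
  // matching the default mentioned in the thread.
  def shouldInterrupt(props: Properties): Boolean =
    props != null && props.getProperty(InterruptOnCancel, "false").toBoolean

  // Asks the backend to kill every still-running attempt and returns
  // the ids of the attempts it requested a kill for.
  def killRunning(
      attempts: Seq[TaskAttempt],
      props: Properties,
      backend: KillBackend): Seq[Long] = {
    val interrupt = shouldInterrupt(props)
    val running = attempts.filter(_.running)
    running.foreach(a => backend.killTask(a.taskId, a.executorId, interrupt))
    running.map(_.taskId)
  }
}
```

In this sketch the kill request is always sent and only the thread interrupt is made conditional, so a job that never opted in to interruption is not interrupted as a side effect of a fetch failure.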

@SparkQA

SparkQA commented Mar 30, 2017

Test build #75402 has finished for PR 17485 at commit ec2ac34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia
Author

> See the discussion on the mailing list. We now have 4 different JIRAs for handling fetch failures. I think we should get a design for the entire thing first.

Sure @tgravescs, let me put out a design doc with my initial thoughts on it.

@maropu maropu mentioned this pull request Apr 23, 2017
maropu added a commit to maropu/spark that referenced this pull request Apr 23, 2017
@asfgit asfgit closed this in e9f9715 Apr 24, 2017
@sitalkedia sitalkedia deleted the kill_tasks_on_stage_failure branch April 25, 2017 10:18
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and some of them are stale: either their JIRA tickets have already been closed, or their JIRA tickets do not exist (and they do not seem to be minor issues).

// Open PRs whose JIRA tickets have already been closed
Closes apache#11785
Closes apache#13027
Closes apache#13614
Closes apache#13761
Closes apache#15197
Closes apache#14006
Closes apache#12576
Closes apache#15447
Closes apache#13259
Closes apache#15616
Closes apache#14473
Closes apache#16638
Closes apache#16146
Closes apache#17269
Closes apache#17313
Closes apache#17418
Closes apache#17485
Closes apache#17551
Closes apache#17463
Closes apache#17625

// Open PRs whose JIRA tickets do not exist and which are not minor issues
Closes apache#10739
Closes apache#15193
Closes apache#15344
Closes apache#14804
Closes apache#16993
Closes apache#17040
Closes apache#15180
Closes apache#17238

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5