[SPARK-4105][CORE] regenerate the shuffle file when it is corrupted #12700

Closed
wants to merge 3 commits into from

Conversation

Contributor

@pzzs pzzs commented Apr 26, 2016

I found that some tasks were recomputed before FAILED_TO_UNCOMPRESS happened, and I think the retry operation corrupted the shuffle file and caused this problem. I debugged the code and corrupted the shuffle file before it was read, and the problem happened every time. Maybe we can regenerate the shuffle file when it is corrupted.
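To make the proposal concrete, here is a minimal sketch of "detect a corrupted block on read, regenerate it, and retry". It is not the code in this patch and does not use Spark's API: `encode_block`, `decode_block`, `read_with_regeneration`, and `CorruptBlockError` are hypothetical names, zlib stands in for the Snappy codec behind FAILED_TO_UNCOMPRESS, and the 4-byte CRC32 prefix is an assumed block layout chosen only so that corruption is detected reliably.

```python
import zlib
from typing import Callable


class CorruptBlockError(Exception):
    """Raised when a block fails its integrity check (hypothetical name)."""


def encode_block(records: bytes) -> bytes:
    """Compress records and prefix the payload with its CRC32 (assumed layout)."""
    payload = zlib.compress(records)
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def decode_block(block: bytes) -> bytes:
    """Verify the CRC32 prefix, then decompress the payload."""
    if len(block) < 4:
        raise CorruptBlockError("block too short")
    expected = int.from_bytes(block[:4], "big")
    payload = block[4:]
    if zlib.crc32(payload) != expected:
        raise CorruptBlockError("checksum mismatch")
    try:
        return zlib.decompress(payload)
    except zlib.error as exc:
        raise CorruptBlockError(str(exc)) from exc


def read_with_regeneration(
    fetch: Callable[[], bytes],
    regenerate: Callable[[], None],
    max_attempts: int = 2,
) -> bytes:
    """Fetch and decode a block; on corruption, regenerate it and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return decode_block(fetch())
        except CorruptBlockError:
            if attempt + 1 == max_attempts:
                raise  # still corrupt after regenerating: surface the failure
            regenerate()
    raise AssertionError("unreachable")
```

A sketch like this only hides the symptom: if the writer keeps producing bad blocks, the retry budget runs out and the error surfaces anyway.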

@AmplabJenkins

Can one of the admins verify this patch?

@pzzs pzzs changed the title [SPARK-4105][Core] regenerate the shuffle file when it is corrupted [SPARK-4105][CORE] regenerate the shuffle file when it is corrupted Apr 26, 2016
@srowen
Member

srowen commented Apr 26, 2016

I get that it's just a band-aid, but it isn't solving the underlying problem, right?

@jerryshao
Contributor

and I think the retry operation corrupted the shuffle file and caused this problem

Can you explain more about the problem you encountered?

@pzzs
Contributor Author

pzzs commented Apr 27, 2016

Yeah, I haven't found the root cause yet, and I have been troubled by this problem for a long time. Any ideas about this problem, @srowen?

@pzzs pzzs closed this Apr 27, 2016
@pzzs pzzs reopened this Apr 27, 2016
@pzzs
Contributor Author

pzzs commented Apr 27, 2016

I found that some tasks were recomputed before FAILED_TO_UNCOMPRESS happened, and I think that something like #9610 caused this problem. @jerryshao

@jerryshao
Contributor

some tasks were recomputed before FAILED_TO_UNCOMPRESS happened

What do you mean by this? From the code you changed, it looks like the corrupted file is encountered during shuffle fetch, so what does "task recompute" refer to: a map task or a reduce task?

Also, it would be better to have a simple reproducible case to narrow down the problem and fix it. Otherwise, I don't think the current fix is solid.

@viper-kun
Contributor

@jerryshao @srowen
We hit this problem in Spark 1.4, Spark 1.5, and Spark 1.6, and all we know is that the shuffle file is broken. We can reproduce the problem by modifying the shuffle file, but we don't know the root cause. Any ideas about this problem?
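For readers who want to see what "reproduce by modifying the shuffle file" could look like in isolation, here is a rough sketch outside Spark. Everything in it is an assumption: the file name `shuffle_block.data` and the helpers `flip_byte`, `try_read`, and `reproduce` are hypothetical, the file is a plain zlib stream rather than Spark's real shuffle format, and zlib stands in for Snappy. The point is only that flipping a byte on disk turns a readable block into a decompression failure.

```python
import os
import tempfile
import zlib
from typing import Tuple


def flip_byte(path: str, offset: int) -> None:
    """XOR one byte of the file in place; negative offsets count from the end."""
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        pos = offset if offset >= 0 else size + offset
        f.seek(pos)
        original = f.read(1)
        f.seek(pos)
        f.write(bytes([original[0] ^ 0xFF]))


def try_read(path: str) -> str:
    """Return "ok" if the file decompresses cleanly, otherwise "corrupt"."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        zlib.decompress(data)
        return "ok"
    except zlib.error:
        return "corrupt"


def reproduce(offset: int) -> Tuple[str, str]:
    """Write a compressed file, corrupt one byte, and report before/after status."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shuffle_block.data")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(zlib.compress(b"key,value\n" * 1000))
        before = try_read(path)
        flip_byte(path, offset)
        after = try_read(path)
        return before, after
```

Calling `reproduce(0)` damages the zlib header and `reproduce(-1)` damages the trailing checksum, so both are rejected on read. This reproduces the symptom only; it says nothing about how a file gets corrupted in a real job.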

@jerryshao
Contributor

I haven't hit this problem recently, so I cannot tell exactly what causes it; maybe a race condition, maybe a flush problem.

Since you already have a reproducible case, why not dig into more details?

@pzzs
Contributor Author

pzzs commented Apr 27, 2016

For now, I only know that a corrupted shuffle file can cause this problem, but I do not know why the shuffle file is corrupted. @jerryshao @viper-kun

@pzzs pzzs closed this Jun 27, 2016
@kuixiang

kuixiang commented Jun 5, 2017

Can one of the admins verify this patch?

6 participants