Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AbstractStorage#containsWriteHandler #281

Merged
merged 4 commits into from
Nov 4, 2022

Conversation

xianjingfeng
Copy link
Member

What changes were proposed in this pull request?

Fix AbstractStorage#containsWriteHandler

Why are the changes needed?

It is a bug, and it is obvious.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Feel unnecessary

@@ -143,7 +143,7 @@ public Storage selectStorage(ShuffleDataFlushEvent event) {
event.getStartPartition()));
if (storage.containsWriteHandler(event.getAppId(), event.getShuffleId(), event.getStartPartition())
&& storage.isCorrupted()) {
throw new RuntimeException("storage " + storage.getBasePath() + " is corrupted");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we remove this exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what is the purpose of this exception. And i think it should continue if one storage is corrupted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the Exception seems irreverent with the title, can you revert this change?
In another word, you should change it in another PR if it's really needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DiskErrorToleranceTest#diskErrorTest will fail without this change. https://github.com/apache/incubator-uniffle/actions/runs/3328036207/jobs/5503556615

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is a problem, but maybe we should open another issue to solve it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DiskErrorToleranceTest#diskErrorTest will fail without this change. https://github.com/apache/incubator-uniffle/actions/runs/3328036207/jobs/5503556615

But how do I deal with this problem. I think this ut is reasonable. Merge another pr first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerqi What do you think? fix in another pr or in current pr? if fix in another pr, i will fix DiskErrorToleranceTest#diskErrorTest in current pr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if #297 fixed, this exception should not be thrown. And this exception will never be thrown in original logic. So i think is ok to remove this exception

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we determine that disk is broken?
I think the disk can't be read or written. So I don't think this is a bug.
So we should let the application fail fast. But I already have multiple replicas, so it's unnecessary to fail fast here.
It's ok for me after I think twice.

Copy link
Contributor

@kaijchen kaijchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xianjingfeng for the improvement, I have left some suggestions.

@@ -143,7 +143,7 @@ public Storage selectStorage(ShuffleDataFlushEvent event) {
event.getStartPartition()));
if (storage.containsWriteHandler(event.getAppId(), event.getShuffleId(), event.getStartPartition())
&& storage.isCorrupted()) {
throw new RuntimeException("storage " + storage.getBasePath() + " is corrupted");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the Exception seems irreverent with the title, can you revert this change?
In another word, you should change it in another PR if it's really needed.

@codecov-commenter
Copy link

codecov-commenter commented Nov 3, 2022

Codecov Report

Merging #281 (3e237ea) into master (47effb2) will increase coverage by 0.79%.
The diff coverage is 60.00%.

@@             Coverage Diff              @@
##             master     #281      +/-   ##
============================================
+ Coverage     59.71%   60.50%   +0.79%     
- Complexity     1377     1426      +49     
============================================
  Files           166      175       +9     
  Lines          8918     9085     +167     
  Branches        853      873      +20     
============================================
+ Hits           5325     5497     +172     
+ Misses         3318     3295      -23     
- Partials        275      293      +18     
Impacted Files Coverage Δ
...he/uniffle/server/storage/LocalStorageManager.java 78.15% <0.00%> (+11.76%) ⬆️
...apache/uniffle/storage/common/AbstractStorage.java 46.34% <75.00%> (+35.81%) ⬆️
...ava/org/apache/uniffle/common/RssShuffleUtils.java 0.00% <0.00%> (-95.66%) ⬇️
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java 23.07% <0.00%> (-51.93%) ⬇️
.../java/org/apache/spark/shuffle/RssSparkConfig.java 90.90% <0.00%> (-5.87%) ⬇️
...e/storage/handler/impl/HdfsShuffleReadHandler.java 55.73% <0.00%> (-1.41%) ⬇️
.../java/org/apache/uniffle/common/util/RssUtils.java 69.04% <0.00%> (-0.28%) ⬇️
...pache/uniffle/server/ShuffleServerGrpcService.java 0.87% <0.00%> (-0.01%) ⬇️
.../org/apache/uniffle/common/ShuffleIndexResult.java 100.00% <0.00%> (ø)
...rg/apache/hadoop/mapred/RssMapOutputCollector.java 0.00% <0.00%> (ø)
... and 25 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@kaijchen kaijchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @xianjingfeng for updating the patch.

@jerqi
Copy link
Contributor

jerqi commented Nov 4, 2022

LGTM. thanks @xianjingfeng @kaijchen

@jerqi jerqi merged commit c69f173 into apache:master Nov 4, 2022
kaijchen pushed a commit that referenced this pull request Nov 12, 2022
### What changes were proposed in this pull request?
Fix AbstractStorage#containsWriteHandler

### Why are the changes needed?
 It is a bug, and it is obvious.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Feel unnecessary
@xianjingfeng xianjingfeng deleted the fix_storage_bug branch August 9, 2024 06:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants