Fix AbstractStorage#containsWriteHandler #281
…ehandler when select storage
@@ -143,7 +143,7 @@ public Storage selectStorage(ShuffleDataFlushEvent event) {
        event.getStartPartition()));
    if (storage.containsWriteHandler(event.getAppId(), event.getShuffleId(), event.getStartPartition())
        && storage.isCorrupted()) {
      throw new RuntimeException("storage " + storage.getBasePath() + " is corrupted");
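To make the guard in this hunk concrete, here is a hypothetical, simplified Java model of it. Only `containsWriteHandler`, `isCorrupted`, `getBasePath`, and the exception message come from the diff. `SimpleStorage`, `registerWriteHandler`, and the per-app `"shuffleId_partition"` key layout are assumptions made for illustration. They are not Uniffle's actual `AbstractStorage` and not the fix in this PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, single-threaded model of the check in selectStorage; not Uniffle's real classes.
public class SelectStorageSketch {

  static class SimpleStorage {
    private final String basePath;
    private boolean corrupted = false;
    // Assumption: write handlers are tracked per appId, keyed by "shuffleId_partition".
    private final Map<String, Set<String>> handlers = new HashMap<>();

    SimpleStorage(String basePath) {
      this.basePath = basePath;
    }

    String getBasePath() {
      return basePath;
    }

    boolean isCorrupted() {
      return corrupted;
    }

    void markCorrupted() {
      corrupted = true;
    }

    // Hypothetical helper: records that this storage now holds a writer for the partition.
    void registerWriteHandler(String appId, int shuffleId, int partition) {
      handlers.computeIfAbsent(appId, k -> new HashSet<>()).add(key(shuffleId, partition));
    }

    // True only if a writer was registered for exactly this (appId, shuffleId, partition).
    boolean containsWriteHandler(String appId, int shuffleId, int partition) {
      Set<String> keys = handlers.get(appId);
      return keys != null && keys.contains(key(shuffleId, partition));
    }

    private static String key(int shuffleId, int partition) {
      return shuffleId + "_" + partition;
    }
  }

  // Mirrors the original guard: fail if the partition is already writing to a corrupted storage.
  static SimpleStorage select(SimpleStorage storage, String appId, int shuffleId, int partition) {
    if (storage.containsWriteHandler(appId, shuffleId, partition) && storage.isCorrupted()) {
      throw new RuntimeException("storage " + storage.getBasePath() + " is corrupted");
    }
    return storage;
  }

  public static void main(String[] args) {
    SimpleStorage s = new SimpleStorage("/data1");
    s.registerWriteHandler("app1", 0, 3);
    System.out.println(s.containsWriteHandler("app1", 0, 3)); // true
    System.out.println(s.containsWriteHandler("app1", 0, 4)); // false
    s.markCorrupted();
    try {
      select(s, "app1", 0, 3);
    } catch (RuntimeException e) {
      System.out.println(e.getMessage()); // storage /data1 is corrupted
    }
  }
}
```

In this model the guard only fires when both conditions hold. A corrupted storage with no writer for the partition is still returned.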
Why do we remove this exception?
I don't know what the purpose of this exception is, and I think it should continue if one storage is corrupted.
The removal of the exception seems irrelevant to the title. Can you revert this change?
In other words, you should change it in another PR if it's really needed.
DiskErrorToleranceTest#diskErrorTest will fail without this change: https://github.com/apache/incubator-uniffle/actions/runs/3328036207/jobs/5503556615
OK, it makes sense.
Yes. This is a problem, but maybe we should open another issue to solve it.
> DiskErrorToleranceTest#diskErrorTest will fail without this change: https://github.com/apache/incubator-uniffle/actions/runs/3328036207/jobs/5503556615

But how should I deal with this problem? I think this UT is reasonable. Should we merge another PR first?
@jerqi What do you think? Should this be fixed in another PR or in the current one? If it is fixed in another PR, I will fix DiskErrorToleranceTest#diskErrorTest in the current PR.
I think that once #297 is fixed, this exception should not be thrown. Moreover, this exception is never thrown under the original logic, so I think it is OK to remove it.
How do we determine that a disk is broken? I think it means the disk can't be read or written, so I don't think this is a bug, and we should let the application fail fast.
But there are already multiple replicas, so it's unnecessary to fail fast here.
After thinking twice, it's OK with me.
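The alternative discussed in this thread (keep going when one storage is corrupted, because replicas exist elsewhere) could look roughly like this. It is a hypothetical sketch, not the change merged in #281. `Dir`, the "home storage plus forward probe" order, and returning `Optional.empty()` when every local storage is corrupted are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: skip corrupted storages instead of throwing. Not the code merged in #281.
public class TolerantSelectSketch {

  static final class Dir {
    final String basePath;
    final boolean corrupted;

    Dir(String basePath, boolean corrupted) {
      this.basePath = basePath;
      this.corrupted = corrupted;
    }
  }

  // Start at the partition's "home" storage and probe forward to the first healthy one.
  // An empty result means every local storage is corrupted; the caller could then rely on
  // replicas held by other shuffle servers rather than failing the whole application here.
  static Optional<Dir> select(List<Dir> dirs, int partition) {
    int n = dirs.size();
    if (n == 0) {
      return Optional.empty();
    }
    int home = Math.floorMod(partition, n);
    for (int i = 0; i < n; i++) {
      Dir candidate = dirs.get((home + i) % n);
      if (!candidate.corrupted) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<Dir> dirs = Arrays.asList(
        new Dir("/data1", true), new Dir("/data2", false), new Dir("/data3", false));
    System.out.println(select(dirs, 0).map(d -> d.basePath).orElse("none")); // /data2
    System.out.println(select(dirs, 2).map(d -> d.basePath).orElse("none")); // /data3
  }
}
```

Returning an empty `Optional` leaves the fail-fast decision to the caller. The caller can then weigh it against the replica count, which is the trade-off raised above.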
Thanks @xianjingfeng for the improvement, I have left some suggestions.
storage/src/main/java/org/apache/uniffle/storage/common/AbstractStorage.java
Codecov Report
Coverage Diff

| | master | #281 | +/- |
|---|---|---|---|
| Coverage | 59.71% | 60.50% | +0.79% |
| Complexity | 1377 | 1426 | +49 |
| Files | 166 | 175 | +9 |
| Lines | 8918 | 9085 | +167 |
| Branches | 853 | 873 | +20 |
| Hits | 5325 | 5497 | +172 |
| Misses | 3318 | 3295 | -23 |
| Partials | 275 | 293 | +18 |
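The figures in this report are internally consistent. In both columns, hits + misses + partials equals lines, and coverage is hits divided by lines. A small sketch of that arithmetic follows. It assumes Codecov's default display settings (round down, two decimal places). `CoverageSketch` is a hypothetical helper, not part of Codecov or Uniffle.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper reproducing the coverage percentages in the report above.
public class CoverageSketch {

  // Coverage = hits / (hits + misses + partials), as a percentage.
  // Assumption: truncated (rounded down) to two decimals, matching Codecov's defaults.
  static BigDecimal coverage(long hits, long misses, long partials) {
    long lines = hits + misses + partials;
    return BigDecimal.valueOf(hits)
        .multiply(BigDecimal.valueOf(100))
        .divide(BigDecimal.valueOf(lines), 2, RoundingMode.DOWN);
  }

  public static void main(String[] args) {
    System.out.println(coverage(5325, 3318, 275)); // 59.71 (master, 8918 lines)
    System.out.println(coverage(5497, 3295, 293)); // 60.50 (#281, 9085 lines)
  }
}
```

Rounding down matters for the #281 column. The exact value is about 60.506%, which would display as 60.51% under half-up rounding.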
LGTM, thanks @xianjingfeng for updating the patch.
LGTM. Thanks @xianjingfeng @kaijchen
### What changes were proposed in this pull request?
Fix AbstractStorage#containsWriteHandler.
### Why are the changes needed?
It is a bug, and it is obvious.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Testing felt unnecessary.