-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#808] improvement(spark): Verify the number of written records to enhance data correctness #1558
Conversation
…en records to enhance data accuracy
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1558 +/- ##
============================================
+ Coverage 53.65% 54.57% +0.91%
- Complexity 2818 2819 +1
============================================
Files 436 416 -20
Lines 24549 22195 -2354
Branches 2080 2080
============================================
- Hits 13172 12113 -1059
+ Misses 10554 9330 -1224
+ Partials 823 752 -71 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Left some minor comments.
And I hope this could be covered by test cases. Like throwing exception when encountering records is not same.
client-spark/spark2/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java
Outdated
Show resolved
Hide resolved
client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java
Outdated
Show resolved
Hide resolved
client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java
Outdated
Show resolved
Hide resolved
I think the existing UTs have already covered this. Because this is a base method which will be used in many places. |
Maybe record count isn't enough. We should check block count, too. |
BlockIds will be checked in the method |
No I hope you could mock throwing exception when record is not same. |
I think it's not easy to mock this and might not be very meaningful. The issue in the past was due to a critical bug in the code itself, caused by concurrency problems, which led to data loss even after calling the method The fact that the unit test with the current check passes actually indicates that the current code is working correctly. |
Merged. Thanks @rickyma |
…1756) ### What changes were proposed in this pull request? 1. When the spill ratio is `1.0` , the process of calculating target spill size will be ignored to avoid potential race condition that the `usedBytes` and `inSendBytes` are not thread safe. This could guarantee that the all data is flushed to the shuffle server at the end of task. 2. Adding the `bufferManager's` buffer remaining check ### Why are the changes needed? Due to the #1670 , the partial data held by the bufferManager will not be flushed to shuffle servers in some corner cases, this will make task fail fast rather than silently data loss that should thanks the #1558 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests.
…umber (apache#1756) ### What changes were proposed in this pull request? 1. When the spill ratio is `1.0` , the process of calculating target spill size will be ignored to avoid potential race condition that the `usedBytes` and `inSendBytes` are not thread safe. This could guarantee that the all data is flushed to the shuffle server at the end of task. 2. Adding the `bufferManager's` buffer remaining check ### Why are the changes needed? Due to the apache#1670 , the partial data held by the bufferManager will not be flushed to shuffle servers in some corner cases, this will make task fail fast rather than silently data loss that should thanks the apache#1558 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests.
What changes were proposed in this pull request?
Verify the number of written records to enhance data accuracy.
Make sure all data records are sent by clients.
Make sure bugs like #714 will never be introduced into the code.
Why are the changes needed?
A follow-up PR for #848.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.