Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read #468
Conversation
-    start + countingIn.getByteCount - readerByteBuffer.remaining()
+    start + countingIn.getByteCount -
+      readerByteBuffer.remaining() -
+      (if (readerLeftoverCharFn()) 1 else 0)
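The diff relies on a readerLeftoverCharFn() check that is not shown in this excerpt. Below is a minimal sketch of how such a check could be implemented, assuming reflective access to the reader's internal sun.nio.cs.StreamDecoder; the field names "sd" and "haveLeftoverChar" are taken from OpenJDK sources, are not part of any public API, and may differ on other JVMs.

```scala
import java.io.InputStreamReader
import scala.util.Try

// Sketch only (not the PR's exact code): build a () => Boolean that reports whether
// the reader's internal StreamDecoder is holding one already-decoded char that has
// not yet been returned to the caller. Reflection can fail (different JVM, strong
// encapsulation on newer JDKs), so fall back to "no leftover char".
def leftoverCharFn(reader: InputStreamReader): () => Boolean = {
  val check: Try[() => Boolean] = Try {
    // InputStreamReader delegates to a private sun.nio.cs.StreamDecoder field "sd"
    val sdField = classOf[InputStreamReader].getDeclaredField("sd")
    sdField.setAccessible(true)
    val decoder = sdField.get(reader)
    // StreamDecoder tracks a pending decoded char in its "haveLeftoverChar" flag
    val leftoverField = decoder.getClass.getDeclaredField("haveLeftoverChar")
    leftoverField.setAccessible(true)
    () => leftoverField.getBoolean(decoder)
  }
  check.getOrElse(() => false)
}
```

With such a function in hand, the (if (readerLeftoverCharFn()) 1 else 0) term in the diff subtracts that one pending char from the byte count considered consumed.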
1 buffered char doesn't necessarily mean one byte. I don't see a way to reason about how many bytes the char actually caused to be read. Because we are worried about the case of `<`, which is a single byte in most encodings, this is mostly OK. I truly hope that Hadoop figures out how to not split multibyte chars across splits; otherwise I can't see how any multibyte text can be reliably read when split by Hadoop.
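To make that caveat concrete, here is a small illustrative snippet (not from the PR) showing that the one-byte assumption holds for `<` in UTF-8 but not for arbitrary characters:

```scala
import java.nio.charset.StandardCharsets

// '<' is one byte in UTF-8 and other ASCII-compatible encodings,
// so subtracting exactly 1 for a leftover '<' is safe there.
println("<".getBytes(StandardCharsets.UTF_8).length)   // 1
// Non-ASCII chars can be wider, so a leftover char may hide more than one byte.
println("é".getBytes(StandardCharsets.UTF_8).length)   // 2
println("語".getBytes(StandardCharsets.UTF_8).length)  // 3
```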
@ericsun95 @PeterNmp not sure if either of you have time to try a build from this branch to see if it fixes your issue. I can create an assembly JAR if needed. You're my only known reproducers!
I can give it a try. Could you show me how to build the JAR on top of this, or give the assembly JAR to me?
@ericsun95 sure, are you on Scala 2.11 or 2.12?
2.11
@ericsun95 Try this assembly JAR. Be sure to remove any other versions you have first:
@srowen - thanks, I'll test it too! Do I need to do anything apart from installing the assembly?
It looks really good!!! I've tested two files that fail with 0.9.0.
Good to hear; that was a good guess. If there are no other comments I'll commit and roll a new release shortly. Thank you for testing.
The number is correct. However, when I repartition it throws a null pointer exception; not sure if there is data corruption during this process. I will run more tests to check.
If the NPE is from spark-xml, post it, and I'll try to figure out if it's related or another issue.
Congrats. It was just a small bug in my code. I think this solution works well! Thanks!
FYI I just released 0.10.0 with this and other changes / fixes.
See #450
The idea here is that we have to account for exactly how much data has been read from the underlying split, but not yet evaluated. Buffering is the problem. Previously we accounted for data buffered in StreamDecoder's ByteBuffer. I think we may have to account for the fact that it reads 2 chars at a time, and so may have buffered one more.
It should be extremely rare that this causes a problem: the reader would have to have read a start `<` that is the final char in a split, and have no buffered last char. But it's possible, and indeed see the issue above; in tens of millions of records it might happen. I don't know if this is the solution, but it's an OK guess.
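To make the accounting explicit, here is the expression from the diff restated with the quantities named; the function and parameter names are illustrative, not from the PR:

```scala
// Sketch: the byte position up to which the split has actually been consumed is
// everything the counting stream pulled from the split, minus what is still
// buffered: undecoded bytes in the decoder's ByteBuffer, plus (approximately)
// one byte for an already-decoded char the decoder has not yet handed back.
def consumedPosition(
    start: Long,                     // byte offset where this split begins
    bytesReadFromSplit: Long,        // countingIn.getByteCount
    undecodedBytesBuffered: Int,     // readerByteBuffer.remaining()
    decoderHasLeftoverChar: Boolean  // readerLeftoverCharFn()
): Long = {
  // Assumes the leftover char is one byte, which holds for '<' in most encodings
  // but not for arbitrary multibyte characters (see the review comment above).
  val leftoverBytes = if (decoderHasLeftoverChar) 1 else 0
  start + bytesReadFromSplit - undecodedBytesBuffered - leftoverBytes
}
```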