
Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read #468

Merged 1 commit into databricks:master on Aug 25, 2020

Conversation

srowen
Collaborator

@srowen srowen commented Aug 18, 2020

See #450

The idea here is that we have to account for exactly how much data has been read from the underlying split but not yet consumed. Buffering is the problem. Previously we accounted for data buffered in StreamDecoder's ByteBuffer. I think we also have to account for the fact that it reads 2 chars at a time, and so may have buffered one more char.

It should be extremely rare that this causes a problem: it would require reading a start '<' that is the final char in a split, with no buffered last char. But it's possible, and indeed the issue above shows it happening; over tens of millions of records it can occur.

I don't know for certain that this is the solution, but it's a reasonable guess.
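The accounting described above can be sketched as a standalone function (a minimal illustration, not the actual spark-xml code; the names `start`, `bytesReadFromStream`, `bufferedBytes`, and `hasLeftoverChar` are stand-ins for the real fields):

```scala
// Sketch of the end-of-record position calculation. The reader may
// have pulled more bytes from the split than it has returned as
// characters, for two reasons:
//   1. bytes still sitting in the decoder's ByteBuffer, and
//   2. one already-decoded char the decoder holds back because it
//      decodes two chars at a time (StreamDecoder.hasLeftoverChar).
def currentPosition(start: Long,
                    bytesReadFromStream: Long,
                    bufferedBytes: Int,
                    hasLeftoverChar: Boolean): Long = {
  // Assume the leftover char cost one byte; true for '<' in most
  // encodings (see the discussion of multibyte chars below).
  val leftoverBytes = if (hasLeftoverChar) 1 else 0
  start + bytesReadFromStream - bufferedBytes - leftoverBytes
}
```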

@srowen srowen added this to the 0.10.0 milestone Aug 18, 2020
@srowen srowen self-assigned this Aug 18, 2020
```diff
- start + countingIn.getByteCount - readerByteBuffer.remaining()
+ start + countingIn.getByteCount -
+   readerByteBuffer.remaining() -
+   (if (readerLeftoverCharFn()) 1 else 0)
```
srowen
Collaborator Author

1 buffered char doesn't necessarily mean one byte. I don't see a way to reason about how many bytes the char actually caused to be read. Because we are worried about the case of '<', which is a single byte in most encodings, this is mostly OK. I truly hope that Hadoop figures out how not to split multibyte chars across splits; otherwise, I can't see how any multibyte text can be reliably read when split by Hadoop.
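To illustrate why one buffered char gives no exact byte count (an aside, not code from this PR): UTF-8 encodes characters in one to four bytes, so subtracting 1 is only exact for single-byte characters like '<'.

```scala
import java.nio.charset.StandardCharsets

// A buffered char does not imply a fixed number of buffered bytes:
// UTF-8 uses 1 to 4 bytes per character. Subtracting 1 byte for the
// leftover char is only exact for single-byte chars such as '<'.
val lt     = "<".getBytes(StandardCharsets.UTF_8).length  // 1 byte
val eAcute = "é".getBytes(StandardCharsets.UTF_8).length  // 2 bytes
val han    = "中".getBytes(StandardCharsets.UTF_8).length  // 3 bytes
```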

@srowen srowen marked this pull request as draft August 18, 2020 14:58
@srowen srowen changed the title from "[WIP] Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read" to "Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read" Aug 18, 2020
@srowen srowen linked an issue Aug 18, 2020 that may be closed by this pull request
@srowen srowen added the bug label Aug 18, 2020
@srowen
Collaborator Author

srowen commented Aug 24, 2020

@ericsun95 @PeterNmp not sure if either of you have time to try a build from this branch to see if it fixes your issue. I can create an assembly JAR if needed. You're my only known reproducers!

@srowen srowen marked this pull request as ready for review August 24, 2020 19:18
@ericsun95
Contributor

> @ericsun95 @PeterNmp not sure if either of you have time to try a build from this branch to see if it fixes your issue. I can create an assembly JAR if needed. You're my only known reproducers!

I can give it a try. Could you show me how to build the JAR on top of this branch, or share the assembly JAR with me?

@srowen
Collaborator Author

srowen commented Aug 24, 2020

@ericsun95 sure, are you on Scala 2.11 or 2.12?

@ericsun95
Contributor

> @ericsun95 sure, are you on Scala 2.11 or 2.12?

2.11

@srowen
Collaborator Author

srowen commented Aug 24, 2020

@ericsun95 Try this assembly JAR. Be sure to remove any other versions you have first:
https://drive.google.com/file/d/13rPJoH814VycIbBbLWfPCzFvoJ-KXzT-/view?usp=sharing
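Building such an assembly from the branch locally would typically look like this (a sketch only: it assumes the project's sbt-assembly setup, and the Scala version and output path are illustrative):

```shell
# Fetch the PR branch and build an assembly JAR for Scala 2.11
# (assumes sbt with the sbt-assembly plugin is available).
git fetch origin pull/468/head:pr-468
git checkout pr-468
sbt ++2.11.12 assembly
# The resulting JAR lands under target/scala-2.11/
```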

@PeterNmp

@srowen - thanks, I'll test it too! Do I need to do anything apart from installing the assembly?

@PeterNmp

It looks really good! I've tested two files that fail with 0.9.0.
Loaded both the compressed and uncompressed files. Same number of rows in the dataframes. Subtracting the uncompressed from the compressed and vice versa yields no differences. Looking forward to this being released! Again, thanks for the effort!
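The cross-check described above can be sketched with Spark's `except` (illustrative only: the file paths and `rowTag` value are placeholders, and a running `SparkSession` named `spark` is assumed):

```scala
// Read the same data from uncompressed and compressed copies and
// verify the parsed rows match in both directions.
val uncompressed = spark.read.format("xml")
  .option("rowTag", "row").load("books.xml")
val compressed = spark.read.format("xml")
  .option("rowTag", "row").load("books.xml.gz")

assert(uncompressed.count() == compressed.count())
assert(uncompressed.except(compressed).count() == 0)
assert(compressed.except(uncompressed).count() == 0)
```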

@srowen
Collaborator Author

srowen commented Aug 25, 2020

Good to hear that was a good guess. If there are no other comments I'll commit and roll a new release shortly. Thank you for testing.

@ericsun95
Contributor

> Good to hear that was a good guess. If there are no other comments I'll commit and roll a new release shortly. Thank you for testing.

The number is correct. However, when I repartition, it throws a NullPointerException; not sure if there is data corruption during this process. I will run more tests to check.

@srowen
Collaborator Author

srowen commented Aug 25, 2020

If the NPE is from spark-xml, post it, and I'll try to figure out if it's related or another issue.

@srowen srowen merged commit f28f1d2 into databricks:master Aug 25, 2020
@ericsun95
Contributor

> If the NPE is from spark-xml, post it, and I'll try to figure out if it's related or another issue.

Congrats. It was just a small bug in my code. I think this solution works well! Thanks!

@srowen
Collaborator Author

srowen commented Aug 25, 2020

FYI I just released 0.10.0 with this and other changes / fixes.
https://github.com/databricks/spark-xml/releases/tag/v0.10.0

@srowen srowen deleted the Issue450.2 branch September 1, 2020 13:58
Successfully merging this pull request may close these issues.

Data loss when input file partitioned through rowTag element