
Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read #468

Merged 1 commit into databricks:master on Aug 25, 2020

Conversation

srowen
Collaborator

@srowen srowen commented Aug 18, 2020

See #450

The idea here is that we have to account for exactly how much data has been read from the underlying split but not yet consumed. Buffering is the problem. Previously we accounted for data buffered in StreamDecoder's ByteBuffer. I think we also have to account for the fact that it reads 2 chars at a time, and so may have buffered one more char.

It should be extremely rare that this causes a problem: it would require reading a start '<' that is the final char in a split, with no buffered last char. But it's possible, and indeed the issue above shows it happening; over tens of millions of records it can occur.

I don't know for certain that this is the solution, but it's a reasonable guess.
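The accounting described above can be sketched as a standalone function (a minimal illustration, not the actual spark-xml code; the names `start`, `bytesReadFromStream`, `bufferedBytes`, and `hasLeftoverChar` are stand-ins for the real fields):

```scala
// Sketch of the end-of-record position calculation. The reader may
// have pulled more bytes from the split than it has returned as
// characters, for two reasons:
//   1. bytes still sitting in the decoder's ByteBuffer, and
//   2. one already-decoded char the decoder holds back because it
//      decodes two chars at a time (StreamDecoder.hasLeftoverChar).
def currentPosition(start: Long,
                    bytesReadFromStream: Long,
                    bufferedBytes: Int,
                    hasLeftoverChar: Boolean): Long = {
  // Assume the leftover char cost one byte; true for '<' in most
  // encodings (see the discussion of multibyte chars below).
  val leftoverBytes = if (hasLeftoverChar) 1 else 0
  start + bytesReadFromStream - bufferedBytes - leftoverBytes
}
```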

@srowen srowen added this to the 0.10.0 milestone Aug 18, 2020
@srowen srowen self-assigned this Aug 18, 2020
```diff
- start + countingIn.getByteCount - readerByteBuffer.remaining()
+ start + countingIn.getByteCount -
+   readerByteBuffer.remaining() -
+   (if (readerLeftoverCharFn()) 1 else 0)
```
srowen
Collaborator Author

1 buffered char doesn't necessarily mean one byte. I don't see a way to reason about how many bytes the char actually caused to be read. Because we are worried about the case of '<', which is a single byte in most encodings, this is mostly OK. I truly hope that Hadoop figures out how not to split multibyte chars across splits; otherwise, I can't see how any multibyte text can be reliably read when split by Hadoop.
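To illustrate why one buffered char gives no exact byte count (an aside, not code from this PR): UTF-8 encodes characters in one to four bytes, so subtracting 1 is only exact for single-byte characters like '<'.

```scala
import java.nio.charset.StandardCharsets

// A buffered char does not imply a fixed number of buffered bytes:
// UTF-8 uses 1 to 4 bytes per character. Subtracting 1 byte for the
// leftover char is only exact for single-byte chars such as '<'.
val lt     = "<".getBytes(StandardCharsets.UTF_8).length  // 1 byte
val eAcute = "é".getBytes(StandardCharsets.UTF_8).length  // 2 bytes
val han    = "中".getBytes(StandardCharsets.UTF_8).length  // 3 bytes
```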

@srowen srowen marked this pull request as draft August 18, 2020 14:58
@srowen srowen changed the title from "[WIP] Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read" to "Take into account StreamDecoder.hasLeftoverChar in trying to exactly always correctly determine how much has been read" Aug 18, 2020
@srowen srowen linked an issue Aug 18, 2020 that may be closed by this pull request
@srowen srowen added the bug label Aug 18, 2020
@srowen
Collaborator Author

srowen commented Aug 24, 2020

@ericsun95 @PeterNmp not sure if either of you have time to try a build from this branch to see if it fixes your issue. I can create an assembly JAR if needed. You're my only known reproducers!

@srowen srowen marked this pull request as ready for review August 24, 2020 19:18
@ericsun95
Contributor

> @ericsun95 @PeterNmp not sure if either of you have time to try a build from this branch to see if it fixes your issue. I can create an assembly JAR if needed. You're my only known reproducers!

I can give it a try. Could you show me how to build the JAR on top of this branch, or share the assembly JAR with me?

@srowen
Collaborator Author

srowen commented Aug 24, 2020

@ericsun95 sure, are you on Scala 2.11 or 2.12?

@ericsun95
Contributor

> @ericsun95 sure, are you on Scala 2.11 or 2.12?

2.11

@srowen
Collaborator Author

srowen commented Aug 24, 2020

@ericsun95 Try this assembly JAR. Be sure to remove any other versions you have first:
https://drive.google.com/file/d/13rPJoH814VycIbBbLWfPCzFvoJ-KXzT-/view?usp=sharing
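Building such an assembly from the branch locally would typically look like this (a sketch only: it assumes the project's sbt-assembly setup, and the Scala version and output path are illustrative):

```shell
# Fetch the PR branch and build an assembly JAR for Scala 2.11
# (assumes sbt with the sbt-assembly plugin is available).
git fetch origin pull/468/head:pr-468
git checkout pr-468
sbt ++2.11.12 assembly
# The resulting JAR lands under target/scala-2.11/
```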

@PeterNmp

@srowen - thanks, I'll test it too! Do I need to do anything apart from installing the assembly?

@PeterNmp

It looks really good! I've tested two files that fail with 0.9.0.
Loaded both the compressed and uncompressed files. Same number of rows in the dataframes. Subtracting the uncompressed from the compressed and vice versa yields no differences. Looking forward to this being released! Again, thanks for the effort!
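The cross-check described above can be sketched with Spark's `except` (illustrative only: the file paths and `rowTag` value are placeholders, and a running `SparkSession` named `spark` is assumed):

```scala
// Read the same data from uncompressed and compressed copies and
// verify the parsed rows match in both directions.
val uncompressed = spark.read.format("xml")
  .option("rowTag", "row").load("books.xml")
val compressed = spark.read.format("xml")
  .option("rowTag", "row").load("books.xml.gz")

assert(uncompressed.count() == compressed.count())
assert(uncompressed.except(compressed).count() == 0)
assert(compressed.except(uncompressed).count() == 0)
```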

@srowen
Collaborator Author

srowen commented Aug 25, 2020

Good to hear that was a good guess. If there are no other comments I'll commit and roll a new release shortly. Thank you for testing.

@ericsun95
Contributor

> Good to hear that was a good guess. If there are no other comments I'll commit and roll a new release shortly. Thank you for testing.

The number is correct. However, when I repartition, it throws a NullPointerException; not sure if there is data corruption during this process. I will run more tests to check.

@srowen
Collaborator Author

srowen commented Aug 25, 2020

If the NPE is from spark-xml, post it, and I'll try to figure out if it's related or another issue.

@srowen srowen merged commit f28f1d2 into databricks:master Aug 25, 2020
@ericsun95
Contributor

> If the NPE is from spark-xml, post it, and I'll try to figure out if it's related or another issue.

Congrats. It was just a small bug in my code. I think this solution works well! Thanks!

@srowen
Collaborator Author

srowen commented Aug 25, 2020

FYI I just released 0.10.0 with this and other changes / fixes.
https://github.com/databricks/spark-xml/releases/tag/v0.10.0

@srowen srowen deleted the Issue450.2 branch September 1, 2020 13:58
Successfully merging this pull request may close these issues.

Data loss when input file partitioned through rowTag element