Data loss when input file partitioned through rowTag element #450
Are you sure the counts are actually the same? It's hard to say without any reproduction.
They are uncompressed. I can try processing the compressed .gz files (that did help when we had the problem before version 0.7.0).
Yeah, I'd be interested if the compressed case is different. They are different code paths and both rely a bit on assumptions about the implementation to get it right. The main fix last time was for the uncompressed path indeed.
We are running Databricks on Azure, runtime 6.3 (includes Apache Spark 2.4.4, Scala 2.11).
Hm, OK. What kind of compression? I do have some tests that check compressed files across block boundaries, but there may well be all kinds of corner cases. Really: is it splittable or unsplittable compression?
It's gzip compression. I don't know if it's splittable or not, but the compressed files seem to run slower and require more memory on the nodes.
Yeah, gzip-compressed text is not splittable. I do have a test case for that which appears to work, but who knows; the logic for handling this case is copied from Hadoop, even. To clarify: you have one big file? And how many records do you expect vs. see? That might narrow down a guess at what is going on. If you can, a different compression like bzip2 would probably be better all around (smaller, splittable) and may happen to avoid this.
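As a rough illustration of the suggestion above, here is a minimal sketch of reading a bzip2-compressed input; the path and the rowTag value are placeholders, not taken from this report, and the codec is (as far as I understand) picked by the underlying Hadoop input format from the file extension:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xml-bzip2-read").getOrCreate()

// A .bz2 file can be split across tasks; a .gz file cannot, so it is read by a single task.
// Path and rowTag below are hypothetical placeholders.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Release")
  .load("/data/releases.xml.bz2")

println(df.count())
```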
We process one big file, split by option("rowTag", ...).
Compressed:
Uncompressed:
As you can see, the Release count is off by one when processing the uncompressed file.
Do you have any way of telling which Release doesn't seem to be present? Is it in the middle or at the end? Maybe not. I am not sure how it happens but have some guesses, and I'm not sure how to fix it even if those guesses are right. Certainly you can try recompressing, as you might get better performance; right now this probably runs as just one task because the file isn't splittable.
Sorry, it's been some time since updating this issue. Tried the bzip2 format and it seems to behave like gzip, i.e. no errors, but we do not seem to get the benefit of splitting the files. I looked through the errors we have seen so far and here is a list of where we are missing records: Also, this exclusion of elements seems to happen in different components of the XML file.
OK, so it happens on uncompressed files and misses one record. I'm pretty sure this is the weak point: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/XmlInputFormat.scala#L152 It's hacky, but works fine in general. I would not be surprised if there's a corner case here where somehow the inferred file position is past the end of a partition when it really isn't, and so a tag gets missed. I can't at the moment think of the case where this breaks down, though. Multi-byte characters? Is this any unusual encoding? Any chance you can try Spark 3? No idea whether it helps. Or the latest 0.9.0 version? Yeah, I get that you can't share the files. If you want to spend some time on this, I can advise about how to inspect what might go wrong, but I imagine it's really hard to debug if it requires executing over huge files. Maybe it's a matter of logging the exact state when it decides that the file position is beyond the end. I recall it was fairly tricky to narrow this down on a tiny trivial file, which prompted the original fix. I'm sorry I don't have good ideas now, and I think there probably is a niche but real problem in this hack. If you're able, it seems like compressing the files avoids this code path and might work; is that viable?
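To make the suspected failure mode concrete, here is a simplified sketch of the kind of boundary check involved. This is not the actual XmlInputFormat source, and the names are hypothetical; it only illustrates why an over-reported stream position could drop a record near a split boundary:

```scala
// Simplified, hypothetical sketch; not the library's code.
def keepScanningForStartTag(reportedPos: Long, splitEnd: Long, recordPending: Boolean): Boolean = {
  // The reader stops looking for the next <rowTag> once the reported stream position is
  // past the end of this split, assuming the next record belongs to the following split.
  // If reportedPos overshoots the true position (e.g. due to read-ahead), a record near
  // the boundary can be skipped by this split and never picked up by the next one.
  recordPending || reportedPos <= splitEnd
}
```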
Hi, I also hit a similar problem when reading a large XML file with different row tags. For example, with a file containing row tags A, B, and C, when I generate df_A, df_B, and df_C, the count for each DataFrame varies from run to run. Sometimes one record is missing, sometimes a small chunk of records.
Thanks again!
Encoding won't matter, or at least, UTF-8 should be fine. Of course there are workarounds: no compression, or splittable compression, it seems (right? bzip2 worked?). Those would actually be more compatible with Spark as they are splittable. Knowing that it only affects gzip does help narrow it down, because that would mean it has to do with the non-splittable case. I have a decent theory about why it happens, though it may not be consistent with your findings. To figure out when to stop reading a split, it looks at how much of the underlying file has been read vs. where the split should stop in the file. This is tricky. In the compressed case, I think what happens is that it can only report how much of the compressed file has been read, but the decompressor buffers reads, so it may read more than it has returned. This could cause the logic to prematurely decide there is no more to read. That makes good sense, except that then I would expect you to miss a record or two off the end of each file, not in the middle. Does that make sense? Is that actually what you observe? Fixing that isn't hard; it just means more hacking. I can pull together a POC if that theory sounds right and you're willing to run it.
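A rough illustration of that theory, with hypothetical names (again, not the library's code): with a non-splittable codec, progress can only be measured on the compressed stream, and the decompressor may read ahead in chunks, so a naive end-of-split check can fire early.

```scala
// Hypothetical sketch of the fragile end-of-split check described above.
class SplitEndCheck(compressedBytesConsumed: () => Long, splitEnd: Long) {
  def shouldStop: Boolean = {
    // Fragile: bytes consumed from disk can run ahead of bytes actually decoded and
    // returned to the record reader, so this can trigger before the last records
    // of the split have been emitted.
    compressedBytesConsumed() >= splitEnd
  }
}
```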
Sorry, just pinging you @PeterNmp on this issue too to see if you can test; let me know. If it works, great, I'll make a new release. If not, I'll try to think of something else!
Hi,
Wait, I thought the problem was with compressed files? See comments starting at #450 (comment). Just want to clarify we're even looking in the right place.
Sorry, I can see how that is very confusing!
Here's another theory: #468
Hey @srowen. I also hit the same case when using databricks-xml 0.9.0 with Glue 1.0 (Spark 2.4.3).
If anyone can reproduce this on 0.9.0 and can run a test, build from the change in #468, or I can whip up an assembly JAR. I'm still not sure what's going on, but that is my latest OK guess.
Hi,
Thanks for all the effort put into this library!
We still seem to be having this issue related to #399 with 0.9.0 :(
We have large XML files (10+ GB) with a format like this:
When I count the number of SoundRecording/Release/ReleaseTransactions in the files, it is the same (and should be), but processing the files like this:
spark.read.format("com.databricks.spark.xml").....option("rowTag","SoundRecording")
Gives me different counts of SoundRecording/Release/ReleaseTransactions for some files processed.
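For reference, here is a minimal sketch of the kind of count check described above; the file path and app name are placeholders, not from the original report:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rowtag-count-check").getOrCreate()

// Read one large XML file split into rows by the SoundRecording tag.
val soundRecordings = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "SoundRecording")
  .load("/data/big-file.xml")

// Compare against an independent count of <SoundRecording> start tags in the raw file
// (e.g. with grep) to see whether records are being dropped at partition boundaries.
println(soundRecordings.count())
```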