Data loss when input file partitioned through rowTag element #450
Are you sure the counts are actually the same? It's hard to say without any reproduction.
They are uncompressed. I can try processing the compressed .gz files (that did help when we had the problem before version 0.7.0).
Yeah, I'd be interested if the compressed case is different. They are different code paths and both rely a bit on assumptions about the implementation to get it right. The main fix last time was for the uncompressed path indeed.
We are running Databricks on Azure, runtime 6.3 (includes Apache Spark 2.4.4, Scala 2.11).
Hm, OK. What kind of compression? I do have some tests that check compressed files across block boundaries, but there may well be all kinds of corner cases. Really: is it splittable or unsplittable compression?
It's gzip compression. I don't know if it's splittable or not, but the compressed files seem to run slower and require more memory on the nodes.
Yeah, gzip-compressed text is not splittable. I do have a test case for that which appears to work, but who knows; the logic for handling this case is copied from Hadoop, even. To clarify: you have one big file? And how many records do you expect vs. see? That might narrow down a guess at what is going on. If you can, a different compression like bzip2 would probably be better all around (smaller, splittable) and may happen to avoid this.
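As a rough illustration of the suggestion above, here is a minimal sketch of reading a bzip2-compressed input; the path and the rowTag value are placeholders, not taken from this report, and the codec is (as far as I understand) picked by the underlying Hadoop input format from the file extension:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xml-bzip2-read").getOrCreate()

// A .bz2 file can be split across tasks; a .gz file cannot, so it is read by a single task.
// Path and rowTag below are hypothetical placeholders.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Release")
  .load("/data/releases.xml.bz2")

println(df.count())
```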
We process one big file, split by option("rowTag", ...).
Compressed:
Uncompressed:
As you can see, the Release count is off by one when processing the uncompressed file.
Do you have any way of telling which Release doesn't seem to be present? Is it in the middle or at the end? Maybe not. I am not sure how it happens but have some guesses, and I'm not sure how to fix it even if those guesses are right. Certainly you can try recompressing, as you might get better performance; right now this probably runs as just one task because the file isn't splittable.
Sorry, it's been some time since updating this issue. Tried the bzip2 format and it seems to behave like gzip, i.e. no errors, but we do not seem to get the benefit of splitting the files. I looked through the errors we have seen so far and here is a list of where we are missing records: Also, this exclusion of elements seems to happen in different components of the XML file.
OK, so it happens on uncompressed files and misses one record. I'm pretty sure this is the weak point: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/XmlInputFormat.scala#L152 It's hacky, but works fine in general. I would not be surprised if there's a corner case here where somehow the inferred file position is past the end of a partition when it really isn't, and so a tag gets missed. I can't at the moment think of the case where this breaks down, though. Multi-byte characters? Is this any unusual encoding? Any chance you can try Spark 3? No idea whether it helps. Or the latest 0.9.0 version? Yeah, I get that you can't share the files. If you want to spend some time on this, I can advise about how to inspect what might go wrong, but I imagine it's really hard to debug if it requires executing over huge files. Maybe it's a matter of logging the exact state when it decides that the file position is beyond the end. I recall it was fairly tricky to narrow this down on a tiny trivial file, which prompted the original fix. I'm sorry I don't have good ideas now, and I think there probably is a niche but real problem in this hack. If you're able, it seems like compressing the files avoids this code path and might work; is that viable?
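To make the suspected failure mode concrete, here is a simplified sketch of the kind of boundary check involved. This is not the actual XmlInputFormat source, and the names are hypothetical; it only illustrates why an over-reported stream position could drop a record near a split boundary:

```scala
// Simplified, hypothetical sketch; not the library's code.
def keepScanningForStartTag(reportedPos: Long, splitEnd: Long, recordPending: Boolean): Boolean = {
  // The reader stops looking for the next <rowTag> once the reported stream position is
  // past the end of this split, assuming the next record belongs to the following split.
  // If reportedPos overshoots the true position (e.g. due to read-ahead), a record near
  // the boundary can be skipped by this split and never picked up by the next one.
  recordPending || reportedPos <= splitEnd
}
```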
Hi, I also hit a similar problem when reading a large XML file with different row tags. For example, with a file containing row tags A, B, and C, when I generate df_A, df_B, and df_C, the count for each DataFrame varies from run to run. Sometimes one record is missing, sometimes a small chunk of records.
Thanks again!
Encoding won't matter, or at least, UTF-8 should be fine. Of course there are workarounds: no compression, or splittable compression, it seems (right? bzip2 worked?). Those would actually be more compatible with Spark as they are splittable. Knowing that it only affects gzip does help narrow it down, because that would mean it has to do with the non-splittable case. I have a decent theory about why it happens, though it may not be consistent with your findings. To figure out when to stop reading a split, it looks at how much of the underlying file has been read vs. where the split should stop in the file. This is tricky. In the compressed case, I think what happens is that it can only report how much of the compressed file has been read, but the decompressor buffers reads, so it may read more than it has returned. This could cause the logic to prematurely decide there is no more to read. That makes good sense, except that then I would expect you to miss a record or two off the end of each file, not in the middle. Does that make sense? Is that actually what you observe? Fixing that isn't hard; it just means more hacking. I can pull together a POC if that theory sounds right and you're willing to run it.
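A rough illustration of that theory, with hypothetical names (again, not the library's code): with a non-splittable codec, progress can only be measured on the compressed stream, and the decompressor may read ahead in chunks, so a naive end-of-split check can fire early.

```scala
// Hypothetical sketch of the fragile end-of-split check described above.
class SplitEndCheck(compressedBytesConsumed: () => Long, splitEnd: Long) {
  def shouldStop: Boolean = {
    // Fragile: bytes consumed from disk can run ahead of bytes actually decoded and
    // returned to the record reader, so this can trigger before the last records
    // of the split have been emitted.
    compressedBytesConsumed() >= splitEnd
  }
}
```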
Sorry, just pinging you @PeterNmp on this issue too to see if you can test; let me know. If it works, great, I'll make a new release. If not, I'll try to think of something else!
Hi,
Wait, I thought the problem was with compressed files? See comments starting at #450 (comment). Just want to clarify we're even looking in the right place.
Sorry, I can see how that is very confusing!
Here's another theory: #468
Hey @srowen. I also hit the same case when using databricks-xml 0.9.0 with Glue 1.0 (Spark 2.4.3).
If anyone can reproduce this on 0.9.0 and can run a test, build from the change in #468, or I can whip up an assembly JAR. I'm still not sure what's going on, but that is my latest OK guess.
Hi,
Thanks for all the effort put into this library!
We still seem to be having this issue related to #399 with 0.9.0 :(
We have large XML files (10+ GB) with a format like this:
When I count the number of SoundRecording/Release/ReleaseTransactions in the files, it is the same (and should be), but processing the files like this:
spark.read.format("com.databricks.spark.xml").....option("rowTag","SoundRecording")
Gives me different counts of SoundRecording/Release/ReleaseTransactions for some files processed.
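For reference, here is a minimal sketch of the kind of count check described above; the file path and app name are placeholders, not from the original report:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rowtag-count-check").getOrCreate()

// Read one large XML file split into rows by the SoundRecording tag.
val soundRecordings = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "SoundRecording")
  .load("/data/big-file.xml")

// Compare against an independent count of <SoundRecording> start tags in the raw file
// (e.g. with grep) to see whether records are being dropped at partition boundaries.
println(soundRecordings.count())
```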