No error raised if xml has only one failed row #436

Mimetis · 2020-02-12T14:47:48Z

Hi.

We are making some failure tests with spark-xml package within Databricks
We discovered that a xml file containing only one failing row will throw nothing.

Let me explain it with a simple example.
I've used your /tests samples for reproduction:

Here a good malformatted xml file:

<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id='I'm malformed too!'>
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

The test is using the FAILFAST option, but the default PERMISSIVE option will have the same behavior at the end.

So far, the test is:

# COMMAND ----------
file_location = "/FileStore/tables/books_malformed_attributes-345be.xml"

df = (spark.read.format("xml")
  .option("rowTag", "book")
  .option("mode", "FAILFAST")    
  .load(file_location))

display(df)

and the expected result is:

SparkException: Job aborted due to stage failure: Task 0 in stage 42.0 failed 4 times, most recent failure: Lost task 0.3 in stage 42.0 (TID 73, 10.139.64.4, executor 0): java.lang.IllegalArgumentException: Malformed line in FAILFAST mode: <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>

But now, if I have a file with only one failing row, like this:

<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
</catalog>

Now, with the same test, the expected result should be pretty the same as the last test, but unfortunatelly, the result is:

(2) Spark Jobs
df:pyspark.sql.dataframe.DataFrame
OK

Any thought ?

Additional question : Just to know, do you support the databricks option "badRecordsPath" ?

The text was updated successfully, but these errors were encountered:

srowen · 2020-02-12T18:46:20Z

The issue here is that in the second example, <book> is never closed. It is never even considered a valid input to parse. You end up with no error because there is no input. If you close it, you'll find it throws an error as expected even with just one instance.

Now, arguably that should be an error. If a start tag is found, but no end, something's wrong. Right now it's silently ignored. I think I can see a way to implement it: return whatever text it found after the start tag, which will certainly fail to parse correctly later, and then PERMISSIVE/FAILFAST works.

There's a subtle issue here in any event. If an end tag is missing, it will keep reading. If it's followed by valid row tags, they are considered nested. It will just eat the rest of the file and not recover to parse the valid tags. Allowing nested row tags is a valid use case, so don't want to change that. At least one would get a real chance to handle the error with a change like the above.

Mimetis · 2020-02-13T10:12:08Z

Hey @srowen Thanks for the quick update and PR !
Awesome job.
Will make a new integration test once the 0.9 version will be available.

Thanks !

gpadavala · 2020-02-13T10:14:30Z

Shouldn't it throw an error if it can't find a start tag at all, i mean if a JSON file is there in the path instead of XML, it should throw an Exception?
At least that is how the .format('json') reader works

srowen · 2020-02-13T16:06:13Z

Good question. I hesitate to do it because, well, the current behavior is to just return nothing (which doesn't mean it's correct). The XML data source is a little bit different in that it's meant to potentially just examine a small subset of larger files in general, so maybe it's less unusual to accept not finding any row tags in input sometimes?

What I think we might want to do is throw an error if the schema is auto-inferred, but it's entirely empty. This doesn't seem like a meaningful outcome and could/should be an error?

…ling instead (#437) See #436 (comment) for context. This stops silently swallowing unclosed tags at the end of input and instead propagates the remaining content for normal error handling. Closes #436

…ling instead (#437) See databricks/spark-xml#436 (comment) for context. This stops silently swallowing unclosed tags at the end of input and instead propagates the remaining content for normal error handling. Closes #436

srowen mentioned this issue Feb 13, 2020

Don't silently swallow unclosed tag content; propagate for error handling instead #437

Merged

HyukjinKwon closed this as completed in #437 Feb 14, 2020

srowen added the bug label Feb 29, 2020

srowen self-assigned this Feb 29, 2020

srowen added this to the 0.9.0 milestone Feb 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No error raised if xml has only one failed row #436

No error raised if xml has only one failed row #436

Mimetis commented Feb 12, 2020 •

edited

Loading

srowen commented Feb 12, 2020

Mimetis commented Feb 13, 2020

gpadavala commented Feb 13, 2020 •

edited

Loading

srowen commented Feb 13, 2020

No error raised if xml has only one failed row #436

No error raised if xml has only one failed row #436

Comments

Mimetis commented Feb 12, 2020 • edited Loading

srowen commented Feb 12, 2020

Mimetis commented Feb 13, 2020

gpadavala commented Feb 13, 2020 • edited Loading

srowen commented Feb 13, 2020

Mimetis commented Feb 12, 2020 •

edited

Loading

gpadavala commented Feb 13, 2020 •

edited

Loading