Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No error raised if xml has only one failed row #436

Closed
Mimetis opened this issue Feb 12, 2020 · 4 comments · Fixed by #437
Closed

No error raised if xml has only one failed row #436

Mimetis opened this issue Feb 12, 2020 · 4 comments · Fixed by #437
Assignees
Labels
Milestone

Comments

@Mimetis
Copy link

Mimetis commented Feb 12, 2020

Hi.

We are making some failure tests with spark-xml package within Databricks
We discovered that a xml file containing only one failing row will throw nothing.

Let me explain it with a simple example.
I've used your /tests samples for reproduction:

Here a good malformatted xml file:

<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id='I'm malformed too!'>
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

The test is using the FAILFAST option, but the default PERMISSIVE option will have the same behavior at the end.

So far, the test is:

# COMMAND ----------
file_location = "/FileStore/tables/books_malformed_attributes-345be.xml"

df = (spark.read.format("xml")
  .option("rowTag", "book")
  .option("mode", "FAILFAST")    
  .load(file_location))

display(df)

and the expected result is:

SparkException: Job aborted due to stage failure: Task 0 in stage 42.0 failed 4 times, most recent failure: Lost task 0.3 in stage 42.0 (TID 73, 10.139.64.4, executor 0): java.lang.IllegalArgumentException: Malformed line in FAILFAST mode: <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>

But now, if I have a file with only one failing row, like this:

<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
</catalog>

Now, with the same test, the expected result should be pretty the same as the last test, but unfortunatelly, the result is:

(2) Spark Jobs
df:pyspark.sql.dataframe.DataFrame
OK

image

Any thought ?

Additional question : Just to know, do you support the databricks option "badRecordsPath" ?

@srowen
Copy link
Collaborator

srowen commented Feb 12, 2020

The issue here is that in the second example, <book> is never closed. It is never even considered a valid input to parse. You end up with no error because there is no input. If you close it, you'll find it throws an error as expected even with just one instance.

Now, arguably that should be an error. If a start tag is found, but no end, something's wrong. Right now it's silently ignored. I think I can see a way to implement it: return whatever text it found after the start tag, which will certainly fail to parse correctly later, and then PERMISSIVE/FAILFAST works.

There's a subtle issue here in any event. If an end tag is missing, it will keep reading. If it's followed by valid row tags, they are considered nested. It will just eat the rest of the file and not recover to parse the valid tags. Allowing nested row tags is a valid use case, so don't want to change that. At least one would get a real chance to handle the error with a change like the above.

@Mimetis
Copy link
Author

Mimetis commented Feb 13, 2020

Hey @srowen Thanks for the quick update and PR !
Awesome job.
Will make a new integration test once the 0.9 version will be available.

Thanks !

@gpadavala
Copy link

gpadavala commented Feb 13, 2020

Shouldn't it throw an error if it can't find a start tag at all, i mean if a JSON file is there in the path instead of XML, it should throw an Exception?
At least that is how the .format('json') reader works

@srowen
Copy link
Collaborator

srowen commented Feb 13, 2020

Good question. I hesitate to do it because, well, the current behavior is to just return nothing (which doesn't mean it's correct). The XML data source is a little bit different in that it's meant to potentially just examine a small subset of larger files in general, so maybe it's less unusual to accept not finding any row tags in input sometimes?

What I think we might want to do is throw an error if the schema is auto-inferred, but it's entirely empty. This doesn't seem like a meaningful outcome and could/should be an error?

HyukjinKwon pushed a commit that referenced this issue Feb 14, 2020
…ling instead (#437)

See #436 (comment) for context. This stops silently swallowing unclosed tags at the end of input and instead propagates the remaining content for normal error handling.

Closes #436
@srowen srowen added the bug label Feb 29, 2020
@srowen srowen self-assigned this Feb 29, 2020
@srowen srowen added this to the 0.9.0 milestone Feb 29, 2020
beluisterql added a commit to beluisterql/spark-xml that referenced this issue Aug 4, 2024
…ling instead (#437)

See databricks/spark-xml#436 (comment) for context. This stops silently swallowing unclosed tags at the end of input and instead propagates the remaining content for normal error handling.

Closes #436
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants