-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44732][SQL] Built-in XML data source support #41832
Conversation
This would need an SPIP. |
96156c1
to
b0fdc6a
Compare
b0fdc6a
to
cc0807b
Compare
|
Seems the test failure is flakiness. Mind retriggering https://github.com/sandip-db/spark/actions/runs/5745270542/job/15573043078 please @sandip-db ? |
} | ||
|
||
private def readUntilStartElement(): Boolean = { | ||
currentStartTag = startTag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: is there a existing lib we can use for the parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XMLEventReader is being used to parse fields from XML records. It may be possible to use XMLEventReader to extract records as well. The current scope of the PR is to inline the spark-xml as is.
The following test (unrelated to this PR) is failing consistently. It was supposedly fixed by the Jira/PR. But its continuing to fail even after the fix. How do we get past this? @HyukjinKwon 2023-08-03T17:34:49.5298626Z [info] *** 1 TEST FAILED *** StackTrace: |
- Copy spark-xml sources - Update license - Scala style and format fixes - AnyFunSuite --> SparkFunSuite
… or SQLTestUtils Also converted AnyFunSuite to SharedSparkSession or SQLTestUtils The following test in XmlSuite.scala failed with SharedSparkSession, but works with SQLTestUtils. test("from_xml array basic test")
cc0807b
to
e4a3300
Compare
@sandip-db would you mind adding some comments about what's new codes (e.g., pom.xml)? Then I think it'd be much easier to merge since most of the codes are already exiting one. I took a cursory look, and it seems pretty fine to me. |
Hi @HyukjinKwon and reviewers, |
import org.apache.spark.sql.types._ | ||
import org.apache.spark.unsafe.types.UTF8String | ||
|
||
case class XmlDataToCatalyst( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would need to register this to SQL expression. Can be done in a followup.
/** | ||
* A collection of static functions for working with XML files in Spark SQL | ||
*/ | ||
class XmlReader(private var schema: StructType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think we can remove the whole file but can be done separately in a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably add a couple of methods in DataFrameReader/DataStreamReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a PR ready to be sent out with these and other changes you have suggested. Just have to get this merged first :-)
fs.delete(filesystemPath, true) | ||
} catch { | ||
case e: IOException => | ||
throw new IOException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a followup, we should leverage error framework we added (see QueryExecutionErrors
)
* Support functions for working with XML columns directly. | ||
*/ | ||
// scalastyle:off: object.name | ||
object functions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a followup, we should move this function to org.apache.spark.sql.functions
import org.apache.spark.sql.execution.datasources.xml.util.{InferSchema, XmlFile} | ||
import org.apache.spark.sql.types.{ArrayType, StructType} | ||
|
||
package object xml { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would have to remove this file too, and move those utils such as schema_of_xml
to somewhere else.
@@ -0,0 +1,336 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would have to be placed together with CSVInferSchema at catalyst.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the parsers too
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
package org.apache.spark.sql.execution.datasources.xml.util |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually have it in the codebase already :-).
import org.apache.spark.sql.execution.datasources.xml.{XmlInputFormat, XmlOptions} | ||
import org.apache.spark.sql.execution.datasources.xml.parsers.StaxXmlGenerator | ||
|
||
private[xml] object XmlFile { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we won't likely need this file too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the above comments will be addressed in this sub-task:
https://issues.apache.org/jira/browse/SPARK-44751
@@ -0,0 +1,20 @@ | |||
<people> | |||
<person> | |||
<age born=" 1990-02-24 "> 25 </age> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was so young :-)
public final class JavaXmlSuite { | ||
|
||
private static final int numBooks = 12; | ||
private static final String booksFile = "src/test/resources/test-data/xml-resources/books.xml"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just name it resources
instead of xml-resources
. Or create a sub directory?
I think I'd prefer to merge this first, and address them as a followup so other people can work together. I will review one more time tmr but would be great if others can review this too. |
|
||
def this() = this(Map.empty) | ||
|
||
val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should document it in https://spark.apache.org/docs/latest/sql-data-sources.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a sub-task for documentation: https://issues.apache.org/jira/browse/SPARK-44752
* column is string-valued, or ArrayType[StructType] if column is an array of strings | ||
* @param options key-value pairs that correspond to those supported by [[XmlOptions]] | ||
*/ | ||
def from_xml(e: Column, schema: DataType, options: Map[String, String] = Map.empty): Column = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add Python and sparkR binding including Spark Connect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a sub-task: https://issues.apache.org/jira/browse/SPARK-44753
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no more highlevel comments. I will file a JIRAs for my own comments tmr.
Merged to master. |
@sandip-db I would appreciate if you create sub-tasks JIRAs based on my comments when you find some time. |
### What changes were proposed in this pull request? XML is a widely used data format. An external spark-xml package (https://github.com/databricks/spark-xml) is available to read and write XML data in spark. Making spark-xml built-in will provide a better user experience for Spark SQL and structured streaming. The proposal is to inline code from spark-xml package. The PR has the following commits: i) The first commit has the following: - Copy of spark-xml src files. - Update license - Scala style and format fixes - Change AnyFunSuite to SparkFunSuite ii) Miscellaneous import and scala style fixes. iii) Add library dependencies iv) Resource file path fixes and change AnyFunSuite to SharedSparkSession or SQLTestUtils v) Exclude XML test resource files from license check vi) Change import from scala.jdk.Collections to scala.collection.JavaConverters ### Why are the changes needed? Built-in support for XML data source would provide better user experience than having to import an external package. ### Does this PR introduce _any_ user-facing change? Yes, Add built-in support for XML data source. ### How was this patch tested? Tested the new unit-tests that came with the imported spark-xml package. Also ran ./dev/run-test Closes apache#41832 from sandip-db/spark-xml-master. Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This is the second PR related to the built-in XML data source implementation ([jira](https://issues.apache.org/jira/browse/SPARK-44751)). The previous [PR](#41832) ported the spark-xml package. This PR addresses the following: - Implement FileFormat interface - Address the review comments in the previous [XML PR](#41832) - Moved from_xml and schema_of_xml to sql/functions - Moved ".xml" to DataFrameReader/DataFrameWriter - Removed old APIs like XmlRelation, XmlReader, etc. - StaxXmlParser changes: - Use FailureSafeParser - Convert 'Row' usage to 'InternalRow' - Convert String to UTF8String - Handle MapData and ArrayData for MapType and ArrayType respectively - Use TimestampFormatter to parse timestamp - Use DateFormatter to parse date - StaxXmlGenerator changes: - Convert 'Row' usage to 'InternalRow' - Handle UTF8String for StringType - Handle MapData and ArrayData for MapType and ArrayType respectively - Use TimestampFormatter to format timestamp - Use DateFormatter to format date - Update XML tests accordingly because of the above changes ### Why are the changes needed? These changes are required to bring XML data source capability at par with CSV and JSON and supports features like streaming, which requires FileFormat interface to be implemented. ### Does this PR introduce _any_ user-facing change? Yes, it adds support for XML data source. ### How was this patch tested? - Ran all the XML unit tests. - Github Action Closes #42462 from sandip-db/xml-file-format-master. Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? XML is a widely used data format. An external spark-xml package (https://github.com/databricks/spark-xml) is available to read and write XML data in spark. Making spark-xml built-in will provide a better user experience for Spark SQL and structured streaming. The proposal is to inline code from spark-xml package. The PR has the following commits: i) The first commit has the following: - Copy of spark-xml src files. - Update license - Scala style and format fixes - Change AnyFunSuite to SparkFunSuite ii) Miscellaneous import and scala style fixes. iii) Add library dependencies iv) Resource file path fixes and change AnyFunSuite to SharedSparkSession or SQLTestUtils v) Exclude XML test resource files from license check vi) Change import from scala.jdk.Collections to scala.collection.JavaConverters ### Why are the changes needed? Built-in support for XML data source would provide better user experience than having to import an external package. ### Does this PR introduce _any_ user-facing change? Yes, Add built-in support for XML data source. ### How was this patch tested? Tested the new unit-tests that came with the imported spark-xml package. Also ran ./dev/run-test Closes apache#41832 from sandip-db/spark-xml-master. Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This is the second PR related to the built-in XML data source implementation ([jira](https://issues.apache.org/jira/browse/SPARK-44751)). The previous [PR](apache#41832) ported the spark-xml package. This PR addresses the following: - Implement FileFormat interface - Address the review comments in the previous [XML PR](apache#41832) - Moved from_xml and schema_of_xml to sql/functions - Moved ".xml" to DataFrameReader/DataFrameWriter - Removed old APIs like XmlRelation, XmlReader, etc. - StaxXmlParser changes: - Use FailureSafeParser - Convert 'Row' usage to 'InternalRow' - Convert String to UTF8String - Handle MapData and ArrayData for MapType and ArrayType respectively - Use TimestampFormatter to parse timestamp - Use DateFormatter to parse date - StaxXmlGenerator changes: - Convert 'Row' usage to 'InternalRow' - Handle UTF8String for StringType - Handle MapData and ArrayData for MapType and ArrayType respectively - Use TimestampFormatter to format timestamp - Use DateFormatter to format date - Update XML tests accordingly because of the above changes ### Why are the changes needed? These changes are required to bring XML data source capability at par with CSV and JSON and supports features like streaming, which requires FileFormat interface to be implemented. ### Does this PR introduce _any_ user-facing change? Yes, it adds support for XML data source. ### How was this patch tested? - Ran all the XML unit tests. - Github Action Closes apache#42462 from sandip-db/xml-file-format-master. Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? XML is a widely used data format. An external spark-xml package (https://github.com/databricks/spark-xml) is available to read and write XML data in spark. Making spark-xml built-in will provide a better user experience for Spark SQL and structured streaming. The proposal is to inline code from spark-xml package. The PR has the following commits: i) The first commit has the following: - Copy of spark-xml src files. - Update license - Scala style and format fixes - Change AnyFunSuite to SparkFunSuite ii) Miscellaneous import and scala style fixes. iii) Add library dependencies iv) Resource file path fixes and change AnyFunSuite to SharedSparkSession or SQLTestUtils v) Exclude XML test resource files from license check vi) Change import from scala.jdk.Collections to scala.collection.JavaConverters ### Why are the changes needed? Built-in support for XML data source would provide better user experience than having to import an external package. ### Does this PR introduce _any_ user-facing change? Yes, Add built-in support for XML data source. ### How was this patch tested? Tested the new unit-tests that came with the imported spark-xml package. Also ran ./dev/run-test Closes apache#41832 from sandip-db/spark-xml-master. Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This is the second PR related to the built-in XML data source implementation ([jira](https://issues.apache.org/jira/browse/SPARK-44751)). The previous [PR](apache#41832) ported the spark-xml package. This PR addresses the following: - Implement FileFormat interface - Address the review comments in the previous [XML PR](apache#41832) - Moved from_xml and schema_of_xml to sql/functions - Moved ".xml" to DataFrameReader/DataFrameWriter - Removed old APIs like XmlRelation, XmlReader, etc. - StaxXmlParser changes: - Use FailureSafeParser - Convert 'Row' usage to 'InternalRow' - Convert String to UTF8String - Handle MapData and ArrayData for MapType and ArrayType respectively - Use TimestampFormatter to parse timestamp - Use DateFormatter to parse date - StaxXmlGenerator changes: - Convert 'Row' usage to 'InternalRow' - Handle UTF8String for StringType - Handle MapData and ArrayData for MapType and ArrayType respectively - Use TimestampFormatter to format timestamp - Use DateFormatter to format date - Update XML tests accordingly because of the above changes ### Why are the changes needed? These changes are required to bring XML data source capability at par with CSV and JSON and supports features like streaming, which requires FileFormat interface to be implemented. ### Does this PR introduce _any_ user-facing change? Yes, it adds support for XML data source. ### How was this patch tested? - Ran all the XML unit tests. - Github Action Closes apache#42462 from sandip-db/xml-file-format-master. Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
XML is a widely used data format. An external spark-xml package (https://github.com/databricks/spark-xml) is available to read and write XML data in spark. Making spark-xml built-in will provide a better user experience for Spark SQL and structured streaming. The proposal is to inline code from spark-xml package.
The PR has the following commits:
i) The first commit has the following:
ii) Miscellaneous import and scala style fixes.
iii) Add library dependencies
iv) Resource file path fixes and change AnyFunSuite to SharedSparkSession or SQLTestUtils
v) Exclude XML test resource files from license check
vi) Change import from scala.jdk.Collections to scala.collection.JavaConverters
Why are the changes needed?
Built-in support for XML data source would provide better user experience than having to import an external package.
Does this PR introduce any user-facing change?
Yes, Add built-in support for XML data source.
How was this patch tested?
Tested the new unit-tests that came with the imported spark-xml package.
Also ran ./dev/run-test