Allow custom timestamp with Spark timezone property #621

JorisTruong · 2022-12-26T09:54:03Z

Related to issue #612 and to previous pull request #616.

There are still some issues as spark.sql.session.timeZone uses Java's TimeZone.getDefault.getID according to the source code here, and it can result in a null value.

As a result, setting spark.sql.session.timeZone becomes mandatory; otherwise spark-xml throws a NoSuchElementException when it tries to retrieve the Spark property with the spark.conf.get() method. This can be reproduced by running the XmlPartitioningSuite.

We may still need a default value for the timezone.
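One possible shape for such a default, as a minimal sketch: `conf` is a plain `Map` standing in for the Spark session configuration, and `resolveZone` is a hypothetical helper, not code from this pull request. Falling back to the JVM default zone is an assumption made for illustration only.

```scala
import java.time.ZoneId

object TimezoneDefault {
  // Hypothetical helper: `conf` stands in for the Spark session configuration.
  // If the property is missing, fall back to the JVM default zone instead of throwing.
  def resolveZone(conf: Map[String, String]): ZoneId =
    conf.get("spark.sql.session.timeZone")
      .map(id => ZoneId.of(id))
      .getOrElse(ZoneId.systemDefault())
}
```

Looking the key up as an `Option` avoids the exception path entirely; whether a silent default is desirable is a separate question.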

src/main/scala/com/databricks/spark/xml/DefaultSource.scala

README.md

srowen · 2022-12-26T16:22:25Z

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

@@ -119,7 +119,15 @@ private[xml] object TypeCast {
      map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
    formatters.foreach { format =>
      try {
-        return Timestamp.from(ZonedDateTime.parse(value, format).toInstant)
+        // If format is not in supported and no timezone in format, use default Spark timezone
+        if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) {


Rather than do this, just break this method into two checks - one loop over built-in formats, then the custom format with special TZ handling

Do you mean we should use two methods? If I understand correctly,

    val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
      map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
    formatters.foreach { format =>

This block of code means that the current parseXmlTimestamp() method already loops over the built-in formats plus the custom format. The only problem is that the custom format does not work if it does not contain a timezone. So why break it up if the original idea is to loop over all formats?

If you meant just breaking out of the method, doesn't the if branch already do that?

It's all about the same, but why combine two types of input that need different processing and then re-differentiate them in a loop? Handle the loop, then handle the special case. I think it's more straightforward, maybe not.
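The "loop, then special case" structure suggested here could look roughly like the sketch below. The `builtIn` list is a shortened stand-in for supportedXmlTimestampFormatters, and `TwoPhaseParse.parse` with its parameters is a hypothetical name, not the method that was merged.

```scala
import java.sql.Timestamp
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.{DateTimeFormatter, DateTimeParseException}

object TwoPhaseParse {
  // Stand-in for supportedXmlTimestampFormatters; the real list may differ.
  val builtIn: Seq[DateTimeFormatter] = Seq(
    DateTimeFormatter.ISO_OFFSET_DATE_TIME,
    DateTimeFormatter.ISO_ZONED_DATE_TIME
  )

  def parse(value: String, custom: Option[String], zone: Option[ZoneId]): Option[Timestamp] = {
    def attempt(f: DateTimeFormatter): Option[Timestamp] =
      try Some(Timestamp.from(ZonedDateTime.parse(value, f).toInstant))
      catch { case _: DateTimeParseException => None }

    // Phase 1: built-in formats, tried lazily, with no timezone handling.
    val fromBuiltIn = builtIn.iterator.map(attempt).collectFirst { case Some(ts) => ts }

    // Phase 2: the custom format, with the zone applied only here.
    fromBuiltIn.orElse {
      custom.flatMap { pattern =>
        val base = DateTimeFormatter.ofPattern(pattern)
        attempt(zone.map(z => base.withZone(z)).getOrElse(base))
      }
    }
  }
}
```

Keeping the zone logic out of the first loop means the built-in formatters never need to be compared against the custom one.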

srowen · 2022-12-26T16:23:04Z

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

+        // If format is not in supported and no timezone in format, use default Spark timezone
+        if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) {
+          return Timestamp.from(
+            ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant


Need to deal with timezone not being specified - I suppose throw an error explicitly?

With the current approach of putting the timezone in the XmlRelation of createRelation(), isn't it better to create a default timezone in case spark.sql.session.timeZone is not set?

Hm, maybe, I just feel like that's an error. You are parsing times that are ambiguous without a timezone, and didn't give a timezone - just assuming "UTC" or something doesn't quite seem right vs highlighting the error.
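Raising the error explicitly, rather than letting `options.timezone.get` fail with a bare NoSuchElementException, might look like this minimal sketch. `zoneOrFail` and the message text are hypothetical, chosen only to illustrate the idea.

```scala
import java.time.ZoneId

object RequireZone {
  // Hypothetical helper: turn a missing timezone into a descriptive error.
  def zoneOrFail(timezone: Option[String]): ZoneId =
    timezone.map(id => ZoneId.of(id)).getOrElse(
      throw new IllegalArgumentException(
        "timestampFormat has no zone or offset and no timezone was provided"))
}
```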

JorisTruong · 2022-12-29T17:20:15Z

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

+        // Custom format without timezone or offset
+        return Timestamp.from(
+          ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant
+        )


We can check if there is a timezone but there is no method to determine if an offset is defined

Hm. What if we parse, and if it fails with the error you are facing, add 'withZone'? The only problem is that this may add a huge amount of perf overhead, so we'd have to have some way to test and store the format with the zone, only if it's needed. Not pretty but that would be the way forward.

When parsing fails, it will always throw a DateTimeParseException, whether or not the failure is caused by the problem we are talking about. Rather than parsing the error message, I wrote an isParseableAsZonedDateTime() method and used it during the loop.
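A check of this kind can be sketched as a try/catch around `ZonedDateTime.parse`. The signature below is a guess for illustration, since the method's real parameter list is not shown in full in this thread.

```scala
import java.time.ZonedDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

object ZonedCheck {
  // Hypothetical signature: returns true when `value` parses to a ZonedDateTime
  // with `format` as given, i.e. without adding an override zone.
  def isParseableAsZonedDateTime(value: String, format: DateTimeFormatter): Boolean =
    try { ZonedDateTime.parse(value, format); true }
    catch { case _: DateTimeParseException => false }
}
```

A false result does not distinguish a missing zone from any other mismatch between the value and the pattern, which is the limitation described above.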

src/main/scala/com/databricks/spark/xml/DefaultSource.scala

srowen · 2022-12-30T14:36:40Z

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

@@ -112,15 +113,23 @@ private[xml] object TypeCast {
    // 2002-05-30T21:46:54+06:00
    DateTimeFormatter.ISO_OFFSET_DATE_TIME,
    // 2002-05-30T21:46:54.1234Z
-    DateTimeFormatter.ISO_INSTANT
+    DateTimeFormatter.ISO_INSTANT.withZone(ZoneId.of("UTC"))


This should already be UTC; I don't think we want to change this
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_INSTANT

I haven't been able to run Timestamp.from(ZonedDateTime.parse(value, DateTimeFormatter.ISO_INSTANT).toInstant). The doc also says:

As such, an Instant cannot be formatted as a date or time without providing some form of time-zone.

I feel like ISO_INSTANT should not be in supportedXmlTimestampFormatters.
Same issue here: https://stackoverflow.com/questions/25612129/java-8-datetimeformatter-and-iso-instant-issues-with-zoneddatetime

I think we need to support the format; I think I copied these from a list of standard formats according to XSD specs or something. It's super standard.

We're parsing rather than formatting here. What's the issue - does it not work with ZonedDateTime? Maybe it never did, if so.

Hm OK I see why you made that change, it doesn't seem to think it's "UTC" otherwise, when it should be by nature. OK leave it in

OK different idea - what if we write the parsing without ZonedDateTime, as Timestamp.from(Instant.from(format.parse(value)))? Does that help? It seems simpler, not sure why it wasn't written that way in the first place.
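The difference between the two routes can be sketched as follows. `viaInstant` and `viaZoned` are illustrative names only; the calls themselves are standard java.time and java.sql APIs.

```scala
import java.sql.Timestamp
import java.time.{Instant, ZonedDateTime}
import java.time.format.DateTimeFormatter

object InstantParse {
  // Route proposed here: build an Instant directly from the parsed fields.
  def viaInstant(value: String, format: DateTimeFormatter): Timestamp =
    Timestamp.from(Instant.from(format.parse(value)))

  // Original route: requires the parsed result to carry a zone or an offset.
  def viaZoned(value: String, format: DateTimeFormatter): Timestamp =
    Timestamp.from(ZonedDateTime.parse(value, format).toInstant)
}
```

With `DateTimeFormatter.ISO_INSTANT`, the parsed result carries an instant but no zone, so `viaZoned` fails while `viaInstant` succeeds; formats with an explicit offset work with either route.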

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

srowen · 2022-12-30T14:39:05Z

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

@@ -268,4 +277,17 @@ private[xml] object TypeCast {
      TypeCast.castTo(data, FloatType, options).asInstanceOf[Float]
    }
  }
+
+  private[xml] def isParseableAsZonedDateTime(value: String,


For later, we might have to figure out a way to cache the parsing of a custom format into a pattern, and maybe this check, because it'll happen for every single row
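One way such caching might be approached is to memoize the compiled formatter per pattern string, so `DateTimeFormatter.ofPattern` runs once rather than once per row. `FormatterCache` is a hypothetical name and not part of spark-xml.

```scala
import java.time.format.DateTimeFormatter
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

object FormatterCache {
  private val cache = new ConcurrentHashMap[String, DateTimeFormatter]()

  // Explicit java.util.function.Function so the sketch does not rely on SAM conversion.
  private val compile = new JFunction[String, DateTimeFormatter] {
    override def apply(pattern: String): DateTimeFormatter =
      DateTimeFormatter.ofPattern(pattern)
  }

  // Compile each distinct pattern at most once and reuse the result afterwards.
  def formatterFor(pattern: String): DateTimeFormatter =
    cache.computeIfAbsent(pattern, compile)
}
```

DateTimeFormatter is immutable and thread-safe, so sharing a cached instance across rows and threads is safe.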

srowen · 2022-12-31T01:53:13Z

Take a look at this change -- I think the core of this works? maybe adapt this approach
#624

JorisTruong · 2022-12-31T04:46:44Z

I think you have the best answer; I added some more tests in the pull request. I'll try to look into why tests are failing though

srowen · 2022-12-31T16:02:09Z

I think I figured out the test failure - tiny but subtle issue in handling the param map. See my latest push

srowen

Looking good to me otherwise

srowen · 2023-01-02T15:00:55Z

src/main/scala/com/databricks/spark/xml/util/TypeCast.scala

+    }
+    options.timestampFormat.foreach { formatString =>
+      // Check if there is offset or timezone and apply Spark timeZone if not
+      val hasTemporalInformation = formatString.indexOf("V") +


I'm not sure we need this - I found that the docs for "withZone" say that it's ignored if the pattern contains timezone info. So it seems like it will just be a default. OK to write a test for that though!

Yes I also saw this, but then I encountered some problems between Java 8 and Java 11. We can see in Java 8 here that zone is prioritized over offset, but in Java 9 here, offset is prioritized over zone. I made an example below to see the differences:

Java 8:

Java 11:

So I don't think we can always apply withZone(), and that's why I wanted to use hasTemporalInformation. Unfortunately, I haven't been able to think of a cleaner way to do this.

OK, let's leave it this way for now. If you have a sec, throw in a comment about the Java version
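The approach settled on here, applying the session zone only when the pattern itself has no zone or offset, can be sketched like this. `hasZoneOrOffset` is a hypothetical stand-in for the hasTemporalInformation check, whose full body is not shown in this thread; the set of letters comes from the DateTimeFormatter pattern documentation.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object ZoneAwareFormat {
  // Pattern letters that denote a zone or an offset in DateTimeFormatter patterns.
  private val zoneLetters = Set('V', 'z', 'O', 'X', 'x', 'Z')

  // Simplification: letters inside quoted literals (e.g. 'Z') also count as zone info.
  def hasZoneOrOffset(pattern: String): Boolean =
    pattern.exists(c => zoneLetters.contains(c))

  // Apply the override zone only when the pattern cannot supply one itself,
  // which sidesteps the Java 8 vs Java 9+ difference in zone/offset priority.
  def formatterFor(pattern: String, sessionZone: ZoneId): DateTimeFormatter = {
    val base = DateTimeFormatter.ofPattern(pattern)
    if (hasZoneOrOffset(pattern)) base else base.withZone(sessionZone)
  }

  def parse(value: String, pattern: String, sessionZone: ZoneId): Instant =
    Instant.from(formatterFor(pattern, sessionZone).parse(value))
}
```

Because the override zone is never attached to a pattern that parses its own offset, the result should not depend on which of the two wins on a given Java version.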

JorisTruong · 2023-01-03T15:17:25Z

@srowen thank you so much for your help!

closes #612

feat: allow custom timestamp with spark timezone

b672d80

srowen requested changes Dec 26, 2022

JorisTruong added 6 commits December 29, 2022 11:57

docs: updated README

351b2aa

fix: ability to run without setting spark.sql.session.timeZone

aa65f28

feat: break parseXmlTimestamp method into built-in formats and custom…

53279a7

… format processing

fix: removed unused code

b5435ad

fix: timestampFormat with offset should not use spark.sql.session.tim…

8f48368

…ezone

Merge branch 'master' of https://github.com/databricks/spark-xml into…

1e54f2d

… ISSUE-612

JorisTruong commented Dec 29, 2022

JorisTruong added 2 commits December 30, 2022 16:11

fix: ISO_INSTANT

375f7ea

feat: added isParseableAsZonedDateTime

869558f

srowen requested changes Dec 30, 2022

refactor: isParseableAsZonedDateTime and Spark timeZone

8b53bbb

JorisTruong marked this pull request as ready for review December 30, 2022 15:19

fix: removed sys.exit

ab97af2

feat: use Instant.from() instead of converting a ZonedDateTime to an …

124a90a

…Instant

JorisTruong added 4 commits January 1, 2023 00:44

fix: parameters with timezone

5728d54

fix: apply Spark timeZone only if no temporal information

b8d3e4b

fix: hasTemporalInformation

3789a67

fix: spark config

a88a20f

srowen reviewed Jan 2, 2023

docs: commented for java 8 and 11

89486f1

srowen approved these changes Jan 3, 2023

srowen merged commit d376877 into databricks:master Jan 3, 2023

srowen assigned JorisTruong Jan 3, 2023

srowen added the enhancement label Jan 3, 2023

srowen added this to the 0.16.0 milestone Jan 3, 2023
