Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom timestamp with Spark timezone property #621

Merged
merged 17 commits into from
Jan 3, 2023

Conversation

JorisTruong
Copy link
Contributor

@JorisTruong JorisTruong commented Dec 26, 2022

Related to issue #612 and to previous pull request #616.

There are still some issues as spark.sql.session.timeZone uses Java's TimeZone.getDefault.getID according to the source code here, and it can result in a null value.

As a result, it will be mandatory to set spark.sql.session.timeZone, otherwise spark-xml will throw an NoSuchElementException when trying to retrieve the Spark property with spark.conf.get() method. Can reproduce this when running the XmlPartitioningSuite.

We may still need a default value for the timezone.

README.md Outdated Show resolved Hide resolved
@@ -119,7 +119,15 @@ private[xml] object TypeCast {
map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
formatters.foreach { format =>
try {
return Timestamp.from(ZonedDateTime.parse(value, format).toInstant)
// If format is not in supported and no timezone in format, use default Spark timezone
if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than do this, just break this method into two checks - one loop over built-in formats, then the custom format with special TZ handling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should use two methods? If I understand correctly,

    val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
      map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
    formatters.foreach { format =>

This block of code means that the current parseXmlTimestamp() method already loops over built-in formats + the custom format. It is just that if the specified format does not contains a timezone, the custom format does not work. So why break it if the original idea is to loop over all formats?

If you meant just break in the method, doesn't the if branch does it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all about the same, but why combine two types of input that need different processing and then re-differentiate in a loop? handle the loop, then handle the special case. I think it's more straightforward, maybe not.

// If format is not in supported and no timezone in format, use default Spark timezone
if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) {
return Timestamp.from(
ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to deal with timezone not being specified - I suppose throw an error explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current approach to put the timezone in XmlRelation of createRelation(), isn't it better the create a default timezone in case spark.sql.session.timeZone is not set?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe, I just feel like that's an error. You are parsing times that are ambiguous without a timzeone, and didn't give a timezone - just assuming "UTC" or something doesn't quite seem right vs highlighting the error.

// Custom format without timezone or offset
return Timestamp.from(
ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can check if there is a timezone but there is no method to determine if an offset is defined

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. What if we parse, and if it fails with the error you are facing, add 'withZone'? the only problem is that may be a huge amount of perf overhead, so we'd have to have some way to test and store the format with the zone, only if it's needed. Not pretty but that would be the way forward

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing fails, it will always throw a DateTimeParseException, no matter if it is because of the problem we are talking about or not. Rather than parsing the error message, I wrote a isParseableAsZonedDateTime() method and used it during the loop.

@@ -112,15 +113,23 @@ private[xml] object TypeCast {
// 2002-05-30T21:46:54+06:00
DateTimeFormatter.ISO_OFFSET_DATE_TIME,
// 2002-05-30T21:46:54.1234Z
DateTimeFormatter.ISO_INSTANT
DateTimeFormatter.ISO_INSTANT.withZone(ZoneId.of("UTC"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already be UTC; I don't think we want to change this
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_INSTANT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to run Timestamp.from(ZonedDateTime.parse(value, DateTimeFormatter.ISO_INSTANT).toInstant). The doc also says:

As such, an Instant cannot be formatted as a date or time without providing some form of time-zone.

I feel like that ISO_INSTANT should not be in supportedXmlTimestampFormatters.
Same issue here: https://stackoverflow.com/questions/25612129/java-8-datetimeformatter-and-iso-instant-issues-with-zoneddatetime

Copy link
Collaborator

@srowen srowen Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to support the format; I think I copied these from a list of standard formats according to XSD specs or something. It's super standard.

We're parsing rather than formatting here. What's the issue - does it not work with ZonedDateTime? maybe never did, if so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm OK I see why you made that change, it doesn't seem to think it's "UTC" otherwise, when it should be by nature. OK leave it in

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK different idea - what if we write the parsing without ZonedDateTime? Timestamp.from(Instant.from(format.parse(value))) Does that help? seems to be simpler, not sure why it wasn't written that way in the first place

@@ -268,4 +277,17 @@ private[xml] object TypeCast {
TypeCast.castTo(data, FloatType, options).asInstanceOf[Float]
}
}

private[xml] def isParseableAsZonedDateTime(value: String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later, we might have to figure out a way to cache the parsing of a custom format into a pattern, and maybe this check, because it'll happen for every single row

@JorisTruong JorisTruong marked this pull request as ready for review December 30, 2022 15:19
@srowen
Copy link
Collaborator

srowen commented Dec 31, 2022

Take a look at this change -- I think the core of this works? maybe adapt this approach
#624

@JorisTruong
Copy link
Contributor Author

JorisTruong commented Dec 31, 2022

I think you have the best answer; I added some more tests in the pull request. I'll try to look into why tests are failing though

@srowen
Copy link
Collaborator

srowen commented Dec 31, 2022

I think I figured out the test failure - tiny but subtle issue in handling the param map. See my latest push

Copy link
Collaborator

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me otherwise

}
options.timestampFormat.foreach { formatString =>
// Check if there is offset or timezone and apply Spark timeZone if not
val hasTemporalInformation = formatString.indexOf("V") +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this - I found that the docs for "withZone" say that it's ignored if the pattern contains timezone info. So it seems like it will just be a default. OK to write a test for that though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I also saw this, but then I encountered some problems between Java 8 and Java 11. We can see in Java 8 here that zone is prioritized over offset, but in Java 9 here, offset is prioritized over zone. I made an example below to see the differences:

  • Java 8: image
  • Java 11: image

So I don't think we can always apply withZone(), and that's why I wanted to use hasTemporalInformation. unfortunately, I haven't been able to think of a cleaner way to do this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's leave it this way for now. If you have a sec, throw in a comment about the Java version

@srowen srowen merged commit d376877 into databricks:master Jan 3, 2023
@srowen srowen added this to the 0.16.0 milestone Jan 3, 2023
@JorisTruong
Copy link
Contributor Author

JorisTruong commented Jan 3, 2023

@srowen thank you so much for your help!

closes #612

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants