-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom timestamp with Spark timezone property #621
Conversation
@@ -119,7 +119,15 @@ private[xml] object TypeCast { | |||
map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters) | |||
formatters.foreach { format => | |||
try { | |||
return Timestamp.from(ZonedDateTime.parse(value, format).toInstant) | |||
// If format is not in supported and no timezone in format, use default Spark timezone | |||
if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than do this, just break this method into two checks - one loop over built-in formats, then the custom format with special TZ handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should use two methods? If I understand correctly,
val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
formatters.foreach { format =>
This block of code means that the current parseXmlTimestamp()
method already loops over built-in formats + the custom format. It is just that if the specified format does not contains a timezone, the custom format does not work. So why break it if the original idea is to loop over all formats?
If you meant just break in the method, doesn't the if branch does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all about the same, but why combine two types of input that need different processing and then re-differentiate in a loop? handle the loop, then handle the special case. I think it's more straightforward, maybe not.
// If format is not in supported and no timezone in format, use default Spark timezone | ||
if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) { | ||
return Timestamp.from( | ||
ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to deal with timezone not being specified - I suppose throw an error explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current approach to put the timezone in XmlRelation
of createRelation()
, isn't it better the create a default timezone in case spark.sql.session.timeZone
is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, maybe, I just feel like that's an error. You are parsing times that are ambiguous without a timzeone, and didn't give a timezone - just assuming "UTC" or something doesn't quite seem right vs highlighting the error.
… format processing
// Custom format without timezone or offset | ||
return Timestamp.from( | ||
ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can check if there is a timezone but there is no method to determine if an offset is defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. What if we parse, and if it fails with the error you are facing, add 'withZone'? the only problem is that may be a huge amount of perf overhead, so we'd have to have some way to test and store the format with the zone, only if it's needed. Not pretty but that would be the way forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parsing fails, it will always throw a DateTimeParseException
, no matter if it is because of the problem we are talking about or not. Rather than parsing the error message, I wrote a isParseableAsZonedDateTime()
method and used it during the loop.
@@ -112,15 +113,23 @@ private[xml] object TypeCast { | |||
// 2002-05-30T21:46:54+06:00 | |||
DateTimeFormatter.ISO_OFFSET_DATE_TIME, | |||
// 2002-05-30T21:46:54.1234Z | |||
DateTimeFormatter.ISO_INSTANT | |||
DateTimeFormatter.ISO_INSTANT.withZone(ZoneId.of("UTC")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should already be UTC; I don't think we want to change this
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_INSTANT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't been able to run Timestamp.from(ZonedDateTime.parse(value, DateTimeFormatter.ISO_INSTANT).toInstant)
. The doc also says:
As such, an Instant cannot be formatted as a date or time without providing some form of time-zone.
I feel like that ISO_INSTANT should not be in supportedXmlTimestampFormatters
.
Same issue here: https://stackoverflow.com/questions/25612129/java-8-datetimeformatter-and-iso-instant-issues-with-zoneddatetime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to support the format; I think I copied these from a list of standard formats according to XSD specs or something. It's super standard.
We're parsing rather than formatting here. What's the issue - does it not work with ZonedDateTime? maybe never did, if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm OK I see why you made that change, it doesn't seem to think it's "UTC" otherwise, when it should be by nature. OK leave it in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK different idea - what if we write the parsing without ZonedDateTime? Timestamp.from(Instant.from(format.parse(value)))
Does that help? seems to be simpler, not sure why it wasn't written that way in the first place
@@ -268,4 +277,17 @@ private[xml] object TypeCast { | |||
TypeCast.castTo(data, FloatType, options).asInstanceOf[Float] | |||
} | |||
} | |||
|
|||
private[xml] def isParseableAsZonedDateTime(value: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For later, we might have to figure out a way to cache the parsing of a custom format into a pattern, and maybe this check, because it'll happen for every single row
Take a look at this change -- I think the core of this works? maybe adapt this approach |
I think you have the best answer; I added some more tests in the pull request. I'll try to look into why tests are failing though |
I think I figured out the test failure - tiny but subtle issue in handling the param map. See my latest push |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good to me otherwise
} | ||
options.timestampFormat.foreach { formatString => | ||
// Check if there is offset or timezone and apply Spark timeZone if not | ||
val hasTemporalInformation = formatString.indexOf("V") + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need this - I found that the docs for "withZone" say that it's ignored if the pattern contains timezone info. So it seems like it will just be a default. OK to write a test for that though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I also saw this, but then I encountered some problems between Java 8 and Java 11. We can see in Java 8 here that zone is prioritized over offset, but in Java 9 here, offset is prioritized over zone. I made an example below to see the differences:
So I don't think we can always apply withZone()
, and that's why I wanted to use hasTemporalInformation
. unfortunately, I haven't been able to think of a cleaner way to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's leave it this way for now. If you have a sec, throw in a comment about the Java version
Related to issue #612 and to previous pull request #616.
There are still some issues as
spark.sql.session.timeZone
uses Java'sTimeZone.getDefault.getID
according to the source code here, and it can result in a null value.As a result, it will be mandatory to set
spark.sql.session.timeZone
, otherwisespark-xml
will throw anNoSuchElementException
when trying to retrieve the Spark property withspark.conf.get()
method. Can reproduce this when running theXmlPartitioningSuite
.We may still need a default value for the timezone.