[SPARK-12297] Table timezone correction for Timestamps #19250
Conversation
more appropriate location. Ensure rule works without hive support. Extra checks on when table timezones are set.
@@ -92,7 +92,7 @@ case class CreateHiveTableAsSelectCommand(
   }

   override def argString: String = {
-    s"[Database:${tableDesc.database}}, " +
+    s"[Database:${tableDesc.database}, " +
totally unrelated typo fix, but didn't seem worth an independent pr
Test build #81830 has finished for PR 19250 at commit
Test build #81833 has finished for PR 19250 at commit
Test build #81888 has finished for PR 19250 at commit
Test build #81892 has finished for PR 19250 at commit
Test build #81931 has finished for PR 19250 at commit
Test build #81929 has finished for PR 19250 at commit
Test build #81943 has finished for PR 19250 at commit
also cc @yhuai @liancheng would appreciate a review since you've looked at sql & hive compatibility in the past
Test build #82037 has finished for PR 19250 at commit
@HyukjinKwon you might be interested in this one also
FYI Imran is probably going to be out for a few weeks so I'll try to address the feedback here. It would be nice to have people take a look at this, though.
 * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to
 * users, its only used for internal conversions.
 */
private[spark] case class TimestampTimezoneCorrection(
do we need a whole expression for this? can't we just reuse existing expressions? It's just simple arithmetic, isn't it?
I guess you could use ToUTCTimestamp / FromUTCTimestamp for this, but that would be more expensive since you'd be doing the conversion twice for each value.
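For reference, a minimal sketch (my own illustration, not code from this PR) of what that reuse would look like with the public to_utc_timestamp / from_utc_timestamp functions; the extra cost is visible as the two nested conversions:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{from_utc_timestamp, to_utc_timestamp}

// Reinterpret `ts` from wall-clock time in `fromTz` to wall-clock time in `toTz`
// by round-tripping through UTC, i.e. converting each value twice.
def correctViaExistingExprs(ts: Column, fromTz: String, toTz: String): Column =
  from_utc_timestamp(to_utc_timestamp(ts, fromTz), toTz)
```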
What I'm saying is the analysis rule can just determine the delta, and then just do a simple add/delete.
Hmm, let me see if I can figure that out.
Unfortunately the offset depends on the actual date, so a timezone conversion cannot be simplified to a simple delta.
Daylight saving time starts and ends on different days in different timezones, while some timezones don't have DST changes at all. Additionally, timezone rules have changed over time and keep changing. Both the basic timezone offset and the DST rules themselves could change (and have changed) over time.
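A small standalone example (mine, not from the thread) that makes this concrete: the offset of a single zone changes with the date because of DST, so no single delta can be applied to a whole column.

```scala
import java.time.{Instant, ZoneId}

object OffsetDependsOnDate {
  def main(args: Array[String]): Unit = {
    val zone = ZoneId.of("America/Los_Angeles")
    val winter = Instant.parse("2017-01-15T12:00:00Z")
    val summer = Instant.parse("2017-07-15T12:00:00Z")
    // Prints -08:00 for the January instant and -07:00 for the July one.
    println(zone.getRules.getOffset(winter))
    println(zone.getRules.getOffset(summer))
  }
}
```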
Additionally, timezone rules have changed over time and keep changing.
Ah, yes, that makes sense... it's also why my initial tests were failing at the DST boundaries. :-/
@@ -1015,6 +1020,10 @@ object DateTimeUtils {
     guess
   }

+  def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): SQLTimestamp = {
+    convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone))
performance is going to suck here
I guess caching the value as done by the FromUTCTimestamp expression is the right way to go?
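A rough sketch of that idea (mine, not the PR's code; it leans on the TimeZone-based DateTimeUtils.convertTz overload shown in the diff above): resolve the zones once per expression instance, the way FromUTCTimestamp caches its zone, instead of calling TimeZone.getTimeZone for every row.

```scala
import java.util.TimeZone

import org.apache.spark.sql.catalyst.util.DateTimeUtils

case class CachedTimezoneCorrection(fromZone: String, toZone: String) {
  // Resolved once per expression instance rather than once per row.
  @transient private lazy val from: TimeZone = TimeZone.getTimeZone(fromZone)
  @transient private lazy val to: TimeZone = TimeZone.getTimeZone(toZone)

  // `micros` is Spark's internal timestamp representation (microseconds since the epoch).
  def correct(micros: Long): Long = DateTimeUtils.convertTz(micros, from, to)
}
```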
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
   * @since 1.4.0
   */
  def insertInto(tableName: String): Unit = {
+    extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { tz =>
we don't seem to be doing this type of validity check in general; otherwise we'd need to add a lot more checks here.
The spec requests errors when using invalid time zones. I guess this would still fail with a different error in that case, so I can remove this if you're really against adding it.
Hmm. I tried a couple of things, and while it may be possible to remove some of these checks and replace them with a check in DateTimeUtils.computeTimeZone, that doesn't cover all cases. For example, you could use "ALTER TABLE" with an invalid time zone and that wouldn't trigger the check.
So given the spec I'm inclined to leave the checks as is, unless @zivanfi is open to making the spec more lax in that area. (TimeZone.getTimeZone(invalidId) actually returns the UTC time zone, as unexpected as that behavior may be, so things won't necessarily break without the checks.)
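For context, the kind of strict check being discussed could look like the sketch below (a hypothetical helper, not the PR's exact code); it exists precisely because TimeZone.getTimeZone silently falls back to GMT/UTC for unknown IDs.

```scala
import java.util.TimeZone

object TimestampTimeZoneChecks {
  private val knownIds: Set[String] = TimeZone.getAvailableIDs.toSet

  // Reject typos up front instead of letting them silently become UTC.
  def checkTimeZone(id: String): Unit = {
    if (!knownIds.contains(id)) {
      throw new IllegalArgumentException(s"Invalid time zone id: $id")
    }
  }
}
```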
Although other table properties don't have similar checks, their effect is usually easy to see. The effect of this specific table property however is not immediately apparent: for new data it is only revealed in interoperability with other components, and for existing data it should not have any visible effect if set correctly. Therefore we decided it would be best to be very strict with these checks, because otherwise a typo in the table property value could only be discovered after some data has already been written with irreversible errors. This was the reasoning behind this part of the specs.
@@ -230,6 +230,13 @@ case class AlterTableSetPropertiesCommand(
    isView: Boolean)
  extends RunnableCommand {

+  if (isView) {
+    properties.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { _ =>
is there even a point in setting properties on views? We should either drop this check, or have a more general check.
HiveQL explicitly allows properties on views; I've never used them, though.
 * like TIMESTAMP WITHOUT TIMEZONE. This gives correct behavior if you process data with
 * machines in different timezones, or if you access the data from multiple SQL engines.
 */
private[sql] case class TimestampTableTimeZone(sparkSession: SparkSession)
i need to take a look at this again to make sure i understand what's happening. conceptually what you are doing is pretty simple and it doesn't seem right it needs so many lines of code, but maybe i'm missing something.
I'm also trying to see if this can be simplified; I guess the main thing is Imran's comment about not being able to use transformUp. I need to take a look at whether that's really the case.
This rule also doesn't seem to handle InsertIntoHiveTable, which is in the hive module so can't be handled here. Probably will need a new rule based on this one in the hive module.
This triggers use of InsertIntoHiveTable. Tests currently fail because that command is not yet processed by the new timestamp code.
Tests still don't pass. Kinda hard to figure out which test is failing with current code.
This makes it easier to see what combinations fail, even if the suite itself ends up running for a bit longer. Doesn't seem bad, though.
Also some other minor fixes and cleanups to the code.
I pushed a bunch of changes to address feedback here and also my own feedback that I didn't bother to write down. Main changes:
I have not cleaned up the timestamp correction rule yet; I think I have a better idea of the code now and will do that next.
Test build #82561 has finished for PR 19250 at commit
Running the rule during resolution also made it possible to do all the needed adjustments with a single rule (instead of needing a Hive-specific rule for InsertIntoHiveTable).
Test build #82592 has finished for PR 19250 at commit
Test build #82595 has finished for PR 19250 at commit
Test build #82610 has finished for PR 19250 at commit
@attilajeges has just found a problem with the behavior specified in the requirements:
We can still solve the issue using a file-format-specific table property though. @rxin, I would like to ask your opinion of such a solution:
Would you find such a solution acceptable, given that a file-format-agnostic fix seems infeasible at this point? Thanks, Zoltan
why is this patch so complicated? Based on the fact that data sources accept a "timezone" option for read/write, I'd expect it to be just:
For data sources that don't support the "timezone" option, there is nothing we can do.
@cloud-fan I think you misunderstand the purpose of this change. The primary purpose is actually to deal with parquet, where that option doesn't do anything. We need this for parquet for two reasons:
To be honest, I see limited value in this change for formats other than parquet -- I added it only because I thought Reynold wanted it (for symmetry across formats, I suppose?). As the purpose of this is to undo timezones, you can already achieve something similar in text-based formats by specifying a format which leaves out the timezone. But it doesn't hurt. We could reuse the "timezone" option for parquet for this purpose, but that would be rather strange as it's almost doing the opposite of what that property does for text-based formats: that property is for adding a timezone, and this is for "removing" it. It's doing something special enough that it seems to deserve a more specific name than just "timezone". (This is all discussed at greater length in the design docs, including how this type behaves in other SQL engines, how Spark's behavior is non-standard, and how it changed in 2.0.1.)
What's the interoperability issue with Impala? I think both Spark and Impala store timestamps as parquet INT96, representing nanoseconds from the epoch, so there is no timezone confusion. Internally Spark uses a long to store timestamps, representing microseconds from the epoch, so we don't and shouldn't consider timezone when reading parquet INT96 timestamps. I think your problem may be about display. When Spark displays a timestamp value, via
This behavior, I think, makes sense, but may not be SQL-compliant. A clean solution is to add
Your proposal seems to hack the internal long value and lie to Spark about the microseconds from epoch, which doesn't look good.
The interoperability issue is that Impala follows timezone-agnostic timestamp semantics as mandated by the SQL standard, while SparkSQL follows UTC-normalized semantics instead (which is not SQL-compliant). Please see the design doc for details.

The long-term solution is indeed using separate SQL types for different timestamp semantics, as you suggest. However, until we get there we would like to have a short-term solution that fixes timestamps that are already written.

Please note that the "ms from epoch" value not being constant any more is a consequence of the timezone-agnostic semantics. The SQL standard specifies this in Section 4.6.2 of Part 2: Foundation. Pure TIMESTAMP has to mean TIMESTAMP WITHOUT TIME ZONE, and the latter must have the same hour, minute, second values regardless of the timezone. It also specifies that the TIMESTAMP WITHOUT TIME ZONE to TIMESTAMP WITH TIME ZONE conversion has to happen by subtracting the session-local time zone offset from the TIMESTAMP WITHOUT TIME ZONE value to calculate a corresponding UTC value. It naturally follows that during this conversion the UTC value will change depending on the session-local time zone. It is like parsing dates from text files in SparkSQL, where the UTC value also changes depending on the session-local time zone.
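A small standalone illustration of the conversion described above (mine, not from the thread): under TIMESTAMP WITHOUT TIME ZONE semantics the wall-clock value is fixed, so the corresponding UTC instant, and hence the "ms from epoch" value, shifts with the session-local time zone.

```scala
import java.time.{LocalDateTime, ZoneId}

object TimezoneAgnosticExample {
  def main(args: Array[String]): Unit = {
    val wallClock = LocalDateTime.parse("2016-01-01T00:00:00")
    Seq("UTC", "America/Los_Angeles", "Asia/Tokyo").foreach { tz =>
      // Same hour/minute/second for every reader, but a different epoch value per zone.
      val utcInstant = wallClock.atZone(ZoneId.of(tz)).toInstant
      println(s"$tz -> $utcInstant (epoch millis: ${utcInstant.toEpochMilli})")
    }
  }
}
```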
Ah, now I understand this issue. Yes, Spark doesn't follow the SQL standard; the Spark timestamp is actually TIMESTAMP WITH LOCAL TIME ZONE, which is not SQL standard but is used in some databases like Oracle. Although Impala follows the SQL standard, it doesn't follow the parquet standard; that's why we need to deal with the parquet INT96 issue here. I think we can follow what Hive/Impala did for interoperability, i.e. create a config to interpret parquet INT96 as timezone-agnostic timestamps in Spark's parquet reader. However, I'm less sure about the
If I understand what you are asking correctly, I think this is what went into the original PR:
All three engines were going to make the change, until it was reverted from Spark. Now the process as a whole is blocked on Spark -- if this change (or the prior one) is accepted, then the other engines can move forward too.
Hive and Impala introduced the following workaround for timestamp interoperability a long time ago: the footer of the Parquet file contains metadata about the library that wrote the file. For Hive and Spark this value is parquet-mr; for Impala it is impala itself, since it has its own implementation. Since Hive and Spark write using UTC-normalized semantics and Impala writes using timezone-agnostic semantics, we can deduce the semantics used from the writer info. So, when Hive sees a Parquet file written by Impala, it will adjust timestamps to compensate for the difference. Impala has an option (-convert_legacy_hive_parquet_utc_timestamp) to do the same when it sees a Parquet file written by anything other than Impala. There are two problems with this workaround:
To address both of these issues, we added the recognition and adjustment of Impala-written timestamps to SparkSQL and also added a table property to record the timezone that should be used for these adjustments, so that mixed tables do not lead to unintuitive behaviour any more. We also added this table property to the Impala and Hive logic and tested that they can correctly read each other's timestamps. To address the concerns that led Reynold to revert our initial commit, Imran made three changes compared to our original proposal:
IIUC, using the
I think a better solution is the reader-specific approach. Impala keeps writing timezone-agnostic timestamps, and Spark/Hive keep writing UTC timestamps. At the read path, we should detect who wrote the parquet file and do timestamp adjustment based on this information (there can also be a config to protect this behavior). cc @rxin
Yes, you understand correctly: the table property affects both the read path and the write path, while the current workaround used by Hive and Impala only affects the read path. (Both are Parquet-specific though; I think you meant to write "writer-specific" when referring to the latter.)
Actually I took a look at #16781; it also proposed a table property, instead of a simple Spark config.
Yes, that is correct. We introduced the table property to address the 2nd problem I mentioned above: "The adjustment depends on the local timezone." (details in my previous comment). But I think that a simpler workaround similar to what already exists in Hive would already be a big step forward for interoperability of existing data.
ok @squito can we send a new PR to do it? Basically, in the parquet read task, get the writer info from the footer. If the writer is impala, and a config is set, we treat the seconds as seconds from the epoch of the session local time zone, and adjust them to seconds from the Unix epoch.
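A rough sketch of that reader-side detection (using the parquet-mr footer API; the helper name and any config wiring are assumptions of mine, not the follow-up PR):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

object ImpalaTimestampDetection {
  // True if the file footer's createdBy string says the file was written by Impala; only in
  // that case (and only if the proposed config is enabled) would timestamps be adjusted from
  // session-local wall-clock values to UTC-normalized values.
  def writtenByImpala(conf: Configuration, file: Path): Boolean = {
    val footer = ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER)
    val createdBy = Option(footer.getFileMetaData.getCreatedBy).getOrElse("")
    createdBy.toLowerCase.contains("impala")
  }
}
```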
@cloud-fan yes I can do that, will be next week before I get to it
@squito we can close this right?
What changes were proposed in this pull request?
When reading and writing data, Spark will adjust timestamp data based on the delta between the current session timezone and the table time zone (specified either by a persistent table property, or an option to the DataFrameReader / Writer). This is particularly important for parquet data, so that it can be treated equivalently by other SQL engines (e.g. Impala and Hive). Furthermore, this is useful if the same data is processed by multiple clusters in different time zones, and "timestamp without time zone" semantics are desired.
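A hypothetical usage sketch based on this description (the property key string is a stand-in for whatever TimestampTableTimeZone.TIMEZONE_PROPERTY resolves to, and `spark`/`df` are assumed to already exist):

```scala
// Mark an existing table as storing timestamps in a specific zone; with the property set,
// inserts adjust values from the session zone to the table zone and reads adjust them back,
// approximating "timestamp without time zone" behavior across engines and clusters.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('table.timezone' = 'America/Los_Angeles')")

df.write.insertInto("events")
spark.table("events").show()
```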
Design doc: https://docs.google.com/document/d/1mcbkVo-PSsFh6iOOYx6Rk_34aY25H_zT1E2f7KmLMOU/edit#heading=h.rj5dmsuhw2zg
How was this patch tested?
Unit tests.