[SPARK-31710][SQL] Fail casting numeric to timestamp by default #28593
Conversation
@cloud-fan @bart-samwel @MaxGekk Could you please help me review it?
@@ -59,8 +59,8 @@ object Cast {
    case (StringType, TimestampType) => true
    case (BooleanType, TimestampType) => true
    case (DateType, TimestampType) => true
-   case (_: NumericType, TimestampType) => true
+   case (_: NumericType, TimestampType) => if ( SQLConf.get.numericConvertToTimestampEnable ) true
Just SQLConf.get.numericConvertToTimestampEnable?
Yes, it's a flag. By default we fail casting numeric to timestamp; if the flag is enabled, we provide the two legacy interpretations (seconds or milliseconds) for the user instead of rejecting the cast.
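For context, a minimal sketch of how such a legacy flag could be declared in SQLConf; the key name, default, and doc text here are assumptions based on this thread, not the merged code:

  // Hypothetical entry inside object SQLConf; name and default are assumptions.
  val LEGACY_NUMERIC_CONVERT_TO_TIMESTAMP_ENABLE =
    buildConf("spark.sql.legacy.numericConvertToTimestampEnable")
      .internal()
      .doc("When true, allow casting numeric types to timestamp for backward " +
        "compatibility. When false (the default), such casts fail at analysis time.")
      .booleanConf
      .createWithDefault(false)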
@@ -266,7 +266,12 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
    TypeCheckResult.TypeCheckSuccess
  } else {
    TypeCheckResult.TypeCheckFailure(
-     s"cannot cast ${child.dataType.catalogString} to ${dataType.catalogString}")
+     if ( child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType]) {
Please remove the spaces:
if (child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType])
Take a look at the style guide: https://github.com/databricks/scala-style-guide
Yes, I will correct it.
@@ -454,7 +459,10 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
  }
  // converting seconds to us
- private[this] def longToTimestamp(t: Long): Long = SECONDS.toMicros(t)
+ private[this] def longToTimestamp(t: Long): Long = {
+   if (SQLConf.get.numericConvertToTimestampInSeconds) t * MICROS_PER_SECOND
I don't think checking the flag for every value is a good idea. This will badly impact performance.
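A minimal sketch of the suggested fix, assuming the flag can be read once when the Cast expression is constructed instead of per row (names here are illustrative):

  import java.util.concurrent.TimeUnit.{MILLISECONDS, SECONDS}

  // Capture the conf once at construction time, not on every value.
  private[this] val convertInSeconds: Boolean =
    SQLConf.get.numericConvertToTimestampInSeconds

  private[this] def longToTimestamp(t: Long): Long =
    if (convertInSeconds) SECONDS.toMicros(t) else MILLISECONDS.toMicros(t)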
- private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * (long)$MICROS_PER_SECOND"
+ private[this] def longToTimeStampCode(l: ExprValue): Block = {
+   if (SQLConf.get.numericConvertToTimestampInSeconds) code"" +
+     code"$l * $MICROS_PER_SECOND"
Does string interpolation work well here?
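If the empty code"" concatenation is the concern, here is a sketch of the same branch as a single interpolation per arm; it assumes MICROS_PER_SECOND and MICROS_PER_MILLIS from DateTimeConstants and keeps the original (long) cast style:

  // One Block per branch; no empty-string concatenation needed.
  private[this] def longToTimestampCode(value: ExprValue): Block =
    if (SQLConf.get.numericConvertToTimestampInSeconds) {
      code"$value * (long)$MICROS_PER_SECOND"
    } else {
      code"$value * (long)$MICROS_PER_MILLIS"
    }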
def checkLongToTimestamp(l: Long, expected: Long): Unit = {
  checkEvaluation(cast(l, TimestampType), expected)
}
checkLongToTimestamp(253402272000L, 253402272000000L)
checkLongToTimestamp(-5L, -5000L)
checkLongToTimestamp(1L, 1000L)
checkLongToTimestamp(0L, 0L)
checkLongToTimestamp(123L, 123000L)
Indentation
Yes, all corrected. Thanks @MaxGekk.
ok to test
Test build #122896 has finished for PR 28593 at commit
Test build #122903 has finished for PR 28593 at commit
Test build #122906 has finished for PR 28593 at commit
update master branch
Test build #122909 has finished for PR 28593 at commit
Test build #122910 has finished for PR 28593 at commit
@@ -1277,7 +1285,11 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
    val block = inline"new java.math.BigDecimal($MICROS_PER_SECOND)"
    code"($d.toBigDecimal().bigDecimal().multiply($block)).longValue()"
  }
- private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * (long)$MICROS_PER_SECOND"
+ private[this] def longToTimeStampCode(l: ExprValue): Block = {
+   if (SQLConf.get.numericConvertToTimestampInSeconds) code"" +
Let's change l to something else per https://github.com/databricks/scala-style-guide#variable-naming while we're here.
Yes, let me correct it.
.internal()
.doc("This legacy conf only takes effect when LEGACY_NUMERIC_CONVERT_TO_TIMESTAMP_ENABLE is true. " +
  "When true, the value is interpreted as seconds, which follows the Spark style; " +
  "when false, the value is interpreted as milliseconds, which follows the Hive style.")
Sorry, but I still can't follow why Spark should take care of following the Hive style here. Most likely the legacy users are already depending on this behaviour, and a few users might have had to do the workaround by themselves. I don't think even cast(ts as long) is a standard and widely accepted behaviour. -1 from me.
Hi @HyukjinKwon, thanks for reviewing. We discussed the pain point of moving to Spark in #28568. I mean we can adopt both the compatibility flag and the added functions. To use the functions with the compatibility flag turned off, users need to modify tasks one by one; unfortunately, we have almost a hundred thousand tasks migrating from Hive to Spark. With a flag, we will first fail the tasks that cast numeric to timestamp, and if users really want the behaviour we will suggest the three newly added functions to them. Maybe this is a good way to avoid the case where a task succeeds while the cast result is wrong, which is more serious. Other teams may hit the same headache, so I hope we will embrace Spark better with this patch.
This kind of compatibility isn't fully guaranteed in Spark; see also Compatibility with Apache Hive. Simply following Hive doesn't justify this PR.
There are already a bunch of mismatched behaviours, and I don't like to target more compatibility, in particular by changing basic functionality such as cast and adding such switches to maintain. Why is it difficult to just add ts / 1000? The workaround seems very easy.
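For example, a sketch of that workaround (ms_col and t are hypothetical names, and this assumes the stored value is in milliseconds while Spark's cast reads seconds):

  // Hive reads the numeric as milliseconds; Spark's cast reads it as seconds,
  // so rescaling by hand reproduces the Hive result:
  spark.sql("SELECT CAST(ms_col / 1000 AS TIMESTAMP) FROM t")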
If we aim to get rid of cast(ts as long), adding separate functions is a-okay because it doesn't affect existing users, and other DBMSs generally provide their own such functions. It looks like we will have workarounds once the functions from #28534 are merged, and it seems you can leverage those functions alternatively as well.
I would say no to fixing a basic functionality to add another variant and non-standard behaviour, which could potentially trigger another set of non-standard behaviours in many other places in Spark.
Hi @HyukjinKwon, I agree with you that the functions are a cool design. But to be honest, changing one SQL statement is easy, while we have almost a hundred thousand tasks. Changing all of them, with explicit conversions, implicit conversions, or expression conversions, will be a huge job; some individuals own over a thousand tasks, so it will be a really tough thing. I will jump for joy if we have better ideas. @HyukjinKwon
In practice there's no workload that can be migrated from one system to another without touching anything when full compatibility isn't guaranteed. I don't have a good idea for your workload.
I don't think Hive-to-Spark migration is the only case that needs some fixes. Spark doesn't target full compatibility by design. We could think about some practically non-invasive fixes, but this fix seems not to be one.
+1 to simply forbidding casting long to timestamp in Spark. Hive compatibility is not strong enough to justify the change, as other people may keep adding new behaviors for compatibility with other systems, and this can be endless.
Instead, I think it's better to forbid this non-standard cast. You can find all the places that need to change, with an explicit error from Spark. And you can add a Hive UDF as @bart-samwel suggested, if you need to fall back to Hive.
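To illustrate the end state this suggests (a sketch; the exact error text and any legacy key are assumptions, not quotes from the merged code):

  // With the cast forbidden, each affected query fails loudly at analysis time:
  spark.sql("SELECT CAST(1230219000 AS TIMESTAMP)")
  // AnalysisException: cannot cast int to timestamp (wording assumed)
  // Fix: rewrite as TIMESTAMP_SECONDS(1230219000), or temporarily restore the
  // old behavior via the legacy conf while migrating.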
+1 from me too
Great! It seems I cannot reopen the PR.
reopened
Thanks @cloud-fan, I have reverted the commit. Could you help review?
Let me close this for now.
force-pushed from 7f0ba76 to 8d1deee
OK to test
@@ -59,7 +59,7 @@ object Cast {
    case (StringType, TimestampType) => true
    case (BooleanType, TimestampType) => true
    case (DateType, TimestampType) => true
-   case (_: NumericType, TimestampType) => true
+   case (_: FractionalType, TimestampType) => true
Let's forbid fractional types as well.
and provide a legacy config to restore the old behavior.
OK, will correct it.
Do you need to forbid casting timestamp to numeric types at the same time? Maybe someone will complain about it later.
Test build #124029 has finished for PR 28593 at commit
retest this please
Test build #124042 has finished for PR 28593 at commit
@@ -180,7 +180,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
    "SELECT CAST(CAST('NaN' AS DOUBLE) AS DECIMAL(1,1)) FROM src LIMIT 1")
  createQueryTest("constant null testing",
-   """SELECT
+   """| SELECT
nit: can we revert this change?
Yes, I have reverted it.
@@ -1 +0,0 @@
- 1.2
Sorry, I was wrong: calling TIMESTAMP_SECONDS and casting double to int is not the same as the legacy cast from double to timestamp.
Can we simply remove that test?
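A small worked example of why they differ, assuming the legacy cast kept fractional seconds while CAST(... AS INT) truncates them:

  // Legacy behaviour (assumed): CAST(1.2 AS TIMESTAMP) keeps the 0.2 s fraction.
  val legacyMicros = (1.2 * 1000000L).toLong  // 1200000 microseconds
  // The rewritten test truncates to whole seconds before converting.
  val truncatedMicros = 1.2.toInt * 1000000L  // 1000000 microseconds
  assert(legacyMicros != truncatedMicros)     // so the golden outputs diverge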
If we remove it, createQueryTest will lack coverage for casting from double to timestamp; we may need to enable the legacy conf to let the test case pass.
It's OK. We have tests for casting double to timestamp in CastSuite.
BTW shall we remove this golden file since the test is removed?
Ah, I removed it early this morning, just a few minutes after your comment. It is gone in the last commit, and the follow-up test has also finished.
@@ -0,0 +1 @@
+ 1.0
which test produces this golden file?
The test case "timestamp cast #3"; previously it was used for testing casting double to timestamp.
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
    assert(-1 == res.get(0))
  }

  createQueryTest("timestamp cast #3",
-   "SELECT CAST(CAST(1.2 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1")
+   """
+     |SELECT CAST(TIMESTAMP_SECONDS(CAST(1.2 AS INT)) AS DOUBLE) FROM src LIMIT 1
Let's just remove this test. We can't cast fractional values to timestamp now.
  createQueryTest("timestamp cast #4",
-   "SELECT CAST(CAST(-1.2 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1")
+   """
+     |SELECT CAST(TIMESTAMP_SECONDS(CAST(-1.2 AS INT)) AS DOUBLE) FROM src LIMIT 1
ditto
LGTM except a few comments for the test
Test build #124070 has finished for PR 28593 at commit
Test build #124066 has finished for PR 28593 at commit
Test build #124072 has finished for PR 28593 at commit
thanks, merging to master!
Thanks @cloud-fan!
After rebasing on the recent master, I faced failures of
@since(3.1)
def timestamp_seconds(col):
    """
    >>> from pyspark.sql.functions import timestamp_seconds
Can you set the session timezone? It caused SPARK-32088
Thanks @HyukjinKwon, will fix soon.
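A sketch of the kind of fix being asked for, written in Scala here (the actual doctest fix is in Python): pin the session time zone so the example output is deterministic. UTC is an arbitrary choice for illustration.

  // Pin the session time zone (cf. SPARK-32088, where a machine-dependent
  // zone made the timestamp_seconds doctest flaky).
  spark.conf.set("spark.sql.session.timeZone", "UTC")
  spark.sql("SELECT timestamp_seconds(1230219000) AS ts").show()
  // With UTC pinned, this always renders 2008-12-25 15:30:00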
/**
 * Creates timestamp from the number of seconds since UTC epoch.
 * @group = datetime_funcs
 * @since = 3.1.0
nit:
@group datetime_funcs
@since 3.1.0
Out of curiosity ‒ why do we provide Scala / Python API for
I think it's because
Makes sense, thanks.
What changes were proposed in this pull request?
We fail casting from numeric to timestamp by default.
Why are the changes needed?
Casting from numeric to timestamp is non-standard; meanwhile, it may produce different results between Spark and other systems, for example Hive.
Does this PR introduce any user-facing change?
Yes. Users cannot cast numeric to timestamp directly; they have to use the following functions to achieve the same effect: TIMESTAMP_SECONDS / TIMESTAMP_MILLIS / TIMESTAMP_MICROS.
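For instance, a brief sketch of the replacements (output comments assume a UTC session time zone):

  spark.sql("SELECT TIMESTAMP_SECONDS(1230219000)")       // 2008-12-25 15:30:00
  spark.sql("SELECT TIMESTAMP_MILLIS(1230219000000)")     // same instant, from milliseconds
  spark.sql("SELECT TIMESTAMP_MICROS(1230219000000000)")  // same instant, from microseconds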
How was this patch tested?
Unit tests added.