
[SPARK-31710][SQL] Fail casting numeric to timestamp by default #28593

Closed
wants to merge 101 commits

Conversation

GuoPhilipse
Member

@GuoPhilipse GuoPhilipse commented May 20, 2020

What changes were proposed in this pull request?

We fail casting from numeric types to timestamp by default.

Why are the changes needed?

Casting from numeric types to timestamp is non-standard, and it may produce different results in Spark than in other systems, for example Hive.

Does this PR introduce any user-facing change?

Yes. Users can no longer cast numeric values to timestamp directly; they have to use one of the following functions to achieve the same effect: TIMESTAMP_SECONDS / TIMESTAMP_MILLIS / TIMESTAMP_MICROS.
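For illustration, a minimal sketch of the migration (assuming a SparkSession named spark; the epoch values are arbitrary examples):

// Before this PR, the unit of the numeric value is implicit (seconds in Spark):
spark.sql("SELECT CAST(1590969600 AS TIMESTAMP)")

// After this PR, the cast fails by default; the unit is spelled out explicitly:
spark.sql("SELECT TIMESTAMP_SECONDS(1590969600)")        // seconds since the epoch
spark.sql("SELECT TIMESTAMP_MILLIS(1590969600000)")      // milliseconds since the epoch
spark.sql("SELECT TIMESTAMP_MICROS(1590969600000000)")   // microseconds since the epoch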

How was this patch tested?

unit test added

@GuoPhilipse
Member Author

@cloud-fan @bart-samwel @MaxGekk Could you please help review it?

@@ -59,8 +59,8 @@ object Cast {
case (StringType, TimestampType) => true
case (BooleanType, TimestampType) => true
case (DateType, TimestampType) => true
case (_: NumericType, TimestampType) => true

case (_: NumericType, TimestampType) => if ( SQLConf.get.numericConvertToTimestampEnable ) true
Member

Just SQLConf.get.numericConvertToTimestampEnable?

Member Author

Yes, it's a flag. We fail casting numeric to timestamp by default; if the flag is enabled, we give the user two choices (interpret the value as seconds or as milliseconds), otherwise the cast is rejected.
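A minimal sketch of the gated rule being discussed (the config accessor name follows the draft in this PR, not necessarily the final code):

// In object Cast.canCast: allow numeric -> timestamp only while the legacy flag is on.
case (_: NumericType, TimestampType) =>
  SQLConf.get.numericConvertToTimestampEnable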

@@ -266,7 +266,12 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
TypeCheckResult.TypeCheckSuccess
} else {
TypeCheckResult.TypeCheckFailure(
s"cannot cast ${child.dataType.catalogString} to ${dataType.catalogString}")
if ( child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType]) {
Member

Please, remove spaces:

if (child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType])

Take a look at the style guide: https://github.com/databricks/scala-style-guide

Member Author

Yes, I will correct it.

@@ -454,7 +459,10 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
}

// converting seconds to us
private[this] def longToTimestamp(t: Long): Long = SECONDS.toMicros(t)
private[this] def longToTimestamp(t: Long): Long = {
if (SQLConf.get.numericConvertToTimestampInSeconds) t * MICROS_PER_SECOND
Member

I don't think checking the flag for every value is a good idea. This will badly impact performance.
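One way to avoid the per-row overhead is to read the flag once per expression instance rather than on every call, e.g. (a sketch inside CastBase, reusing the TimeUnit helpers already imported there; names follow the draft in this PR):

// Capture the flag once, when the Cast expression is constructed, not per row.
private[this] val convertInSeconds: Boolean =
  SQLConf.get.numericConvertToTimestampInSeconds

// converting seconds (or, in the legacy Hive-style mode, milliseconds) to us
private[this] def longToTimestamp(t: Long): Long =
  if (convertInSeconds) SECONDS.toMicros(t) else MILLISECONDS.toMicros(t)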

private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * (long)$MICROS_PER_SECOND"
private[this] def longToTimeStampCode(l: ExprValue): Block = {
if (SQLConf.get.numericConvertToTimestampInSeconds) code"" +
code"$l * $MICROS_PER_SECOND"
Member

Does string interpolation work well here?

Comment on lines 1319 to 1326
def checkLongToTimestamp(l: Long, expected: Long): Unit = {
checkEvaluation(cast(l, TimestampType), expected)
}
checkLongToTimestamp(253402272000L, 253402272000000L)
checkLongToTimestamp(-5L, -5000L)
checkLongToTimestamp(1L, 1000L)
checkLongToTimestamp(0L, 0L)
checkLongToTimestamp(123L, 123000L)
Member

Indentation

Member Author

Yes, all corrected. Thanks, MaxGekk.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented May 20, 2020

Test build #122896 has finished for PR 28593 at commit 6c1ffef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2020

Test build #122903 has finished for PR 28593 at commit 3a20aa6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2020

Test build #122906 has finished for PR 28593 at commit 4577fa8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2020

Test build #122909 has finished for PR 28593 at commit a39067d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2020

Test build #122910 has finished for PR 28593 at commit 7f0ba76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1277,7 +1285,11 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
val block = inline"new java.math.BigDecimal($MICROS_PER_SECOND)"
code"($d.toBigDecimal().bigDecimal().multiply($block)).longValue()"
}
private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * (long)$MICROS_PER_SECOND"
private[this] def longToTimeStampCode(l: ExprValue): Block = {
if (SQLConf.get.numericConvertToTimestampInSeconds) code"" +
Member

Let's change l to something else per https://github.com/databricks/scala-style-guide#variable-naming while we're here.

Member Author

Yes, let me correct it.

.internal()
.doc("The legacy only works when LEGACY_NUMERIC_CONVERT_TO_TIMESTAMP_ENABLE is true." +
"when true,the value will be interpreted as seconds,which follow spark style," +
"when false,value is interpreted as milliseconds,which follow hive style")
Member
@HyukjinKwon HyukjinKwon May 21, 2020

Sorry, but I still can't follow why Spark should take care of following Hive style here. Most likely the legacy users are already depending on this behaviour, and a few users might have had to do the workaround by themselves. I don't think even cast(ts as long) is a standard and widely accepted behaviour. -1 from me.

Member Author

Hi @HyukjinKwon,
Thanks for reviewing. We discussed the pain point of moving to Spark in #28568. I mean we can adopt both the compatibility flag and the new functions: to use the functions, users need to modify their tasks one by one with the casting compatibility flag turned off. Unfortunately, we have almost a hundred thousand tasks migrating from Hive to Spark. So with the flag, we would first fail a task if it contains CAST_NUMERIC_TO_TIMESTAMP; if the user really wants the cast, we would suggest the three newly added functions to them. Maybe it's a good way to avoid the case where a task succeeds while the cast result is wrong, which is more serious. Maybe others have met the same headache, so I hope we can embrace Spark better with this patch.

Member

This kind of compatibility isn't fully guaranteed in Spark, see also Compatibility with Apache Hive. Simply following Hive doesn't justify this PR.

There are already a bunch of mismatched behaviours and I don't like to target more compatibility, in particular, by fixing the basic functionalities such as cast and adding such switches to maintain. Why is it difficult to just add ts / 1000? The workaround seems very easy.

If we aim to get rid of cast(ts as long), adding separate functions is a-okay because it doesn't affect existing users, and it also looks like other DBMSs have their own ways of doing this by having such functions in general. Looks like we will have workarounds once the functions from #28534 are merged, and it seems you can leverage those functions alternatively as well.

I would say no to fixing a basic functionality to have another variant, non-standard behaviour, which could potentially trigger another set of non-standard behaviours in many other places in Spark.
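For reference, the ts / 1000 workaround mentioned above, under the legacy seconds-based cast (a sketch; the epoch value is arbitrary):

// Hive interprets CAST(<bigint> AS TIMESTAMP) as milliseconds since the epoch,
// while Spark's legacy cast interprets the value as seconds. Dividing by 1000
// before casting reproduces the Hive result under the legacy behaviour:
spark.sql("SELECT CAST(1590969600000 / 1000 AS TIMESTAMP)")

// With the new functions the unit is explicit instead:
spark.sql("SELECT TIMESTAMP_MILLIS(1590969600000)")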

Member Author

Hi @HyukjinKwon,
I agree with you that the functions are a nice design. But to be honest, while changing one SQL statement would be easy, we have almost a hundred thousand tasks. Changing every task with an explicit conversion, an implicit conversion, or a conversion inside an expression would be a huge undertaking; some individuals own over a thousand tasks, so it would be a really tough thing. I will jump for joy if we have better ideas. @HyukjinKwon

Member

In practice there's no workload that can be migrated from system A to system B without touching anything when full compatibility between them isn't guaranteed. I don't have a good idea for your workload.

I don't think this is the only case that needs some fixes when you migrate from Hive to Spark. Spark doesn't target full compatibility by design. We could think about some non-invasive fixes in practice, but this fix doesn't seem to be one.

Contributor

+1 to simply forbidding the cast of long to timestamp in Spark. Hive compatibility is not strong enough to justify the change, as other people may keep adding new behaviors for compatibility with other systems, and this can be endless.

Instead, I think it's better to forbid this non-standard cast. You can find all the places that need to change, with explicit errors from Spark. And you can add a Hive UDF as @bart-samwel suggested, if you need to fall back to Hive.
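If the Hive UDF fallback route is taken, registration could look roughly like this (hypothetical example only; the function name, class, and jar path are illustrative):

// Register a user-provided Hive UDF that implements the old millisecond-based
// conversion, then call it where the cast used to be.
spark.sql("""
  CREATE TEMPORARY FUNCTION legacy_millis_to_ts
  AS 'com.example.hive.udf.MillisToTimestamp'
  USING JAR '/path/to/legacy-udfs.jar'
""")
spark.sql("SELECT legacy_millis_to_ts(1590969600000)")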

Member

+1 from me too

Member Author

Great! It seems I cannot reopen the PR.

Contributor

reopened

Member Author

Thanks @cloud-fan, I have reverted the commit. Could you help review?

@HyukjinKwon
Member

Let me close this for now.

@cloud-fan cloud-fan reopened this May 22, 2020
@GuoPhilipse GuoPhilipse force-pushed the 31710-fix-compatibility branch from 7f0ba76 to 8d1deee on May 22, 2020 16:23
@GuoPhilipse GuoPhilipse changed the title [SPARK-31710][SQL] Add two compatibility flag to cast long to timestamp [SPARK-31710][SQL] Fail casting integral to timestamp by default May 22, 2020
@cloud-fan
Contributor

OK to test

@@ -59,7 +59,7 @@ object Cast {
case (StringType, TimestampType) => true
case (BooleanType, TimestampType) => true
case (DateType, TimestampType) => true
case (_: NumericType, TimestampType) => true
case (_: FractionalType, TimestampType) => true
Contributor

let's forbid fraction as well.

Contributor

and provide a legacy config to restore the old behavior.
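A sketch of the kind of legacy entry being suggested, in SQLConf style (the config key and wording are illustrative, not necessarily what was finally added):

val LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP =
  buildConf("spark.sql.legacy.allowCastNumericToTimestamp")
    .internal()
    .doc("When true, allow casting numeric types to timestamp, interpreting " +
      "the value as seconds, to restore the behaviour before this change.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false)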

Member Author

OK, will correct it.

Member Author

Do we need to forbid casting timestamp to numeric types at the same time? Maybe someone will complain about it later.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124029 has finished for PR 28593 at commit 8fe1960.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124042 has finished for PR 28593 at commit 8fe1960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -180,7 +180,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
"SELECT CAST(CAST('NaN' AS DOUBLE) AS DECIMAL(1,1)) FROM src LIMIT 1")

createQueryTest("constant null testing",
"""SELECT
"""| SELECT
Contributor

nit: can we revert this change?

Member Author

Yes, I have reverted it.

@@ -1 +0,0 @@
1.2
Contributor

Sorry, I was wrong. Calling TIMESTAMP_SECONDS on a double cast to int is not the same as the legacy cast from double to timestamp.

Can we simply remove that test?
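Concretely, this is why the golden file changes from 1.2 to 1.0 (a sketch under the legacy behaviour this PR removes):

// Legacy cast keeps the fractional seconds through the round trip:
spark.sql("SELECT CAST(CAST(1.2 AS TIMESTAMP) AS DOUBLE)")              // 1.2
// The rewritten test truncates 1.2 to 1 before building the timestamp:
spark.sql("SELECT CAST(TIMESTAMP_SECONDS(CAST(1.2 AS INT)) AS DOUBLE)") // 1.0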

Member Author

If we remove it, there will be a lack of testing for casting from double to timestamp in createQueryTest; we may need to add the legacy flag to let the test case pass.

Contributor

It's OK. We have tests for casting double to timestamp in CastSuite.

Contributor

BTW shall we remove this golden file since the test is removed?

Member Author

Ah, I removed it early this morning, just a few minutes after your comment; it is gone in the last commit, and the follow-up test has also finished.

@@ -0,0 +1 @@
1.0
Contributor

Which test produces this golden file?

Member Author

The test case "timestamp cast #3".
Previously it was used to test casting double to timestamp.

assert(-1 == res.get(0))
}

createQueryTest("timestamp cast #3",
"SELECT CAST(CAST(1.2 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1")
"""
|SELECT CAST(TIMESTAMP_SECONDS(CAST(1.2 AS INT)) AS DOUBLE) FROM src LIMIT 1
Contributor

Let's just remove this test. We can't cast fractional values to timestamp now.


createQueryTest("timestamp cast #4",
"SELECT CAST(CAST(-1.2 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1")
"""
|SELECT CAST(TIMESTAMP_SECONDS(CAST(-1.2 AS INT)) AS DOUBLE) FROM src LIMIT 1
Contributor

ditto

Contributor
@cloud-fan cloud-fan left a comment

LGTM except a few comments for the test

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124070 has finished for PR 28593 at commit 08aee30.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124066 has finished for PR 28593 at commit bc4b62c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124072 has finished for PR 28593 at commit 12b4239.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f0e6d0e Jun 16, 2020
@GuoPhilipse
Member Author

Thanks @cloud-fan !

@MaxGekk
Member

MaxGekk commented Jun 16, 2020

After rebasing on the recent master, I faced failures of DateTimeBenchmark because of this PR. I fixed the issue in PR #28843.

@since(3.1)
def timestamp_seconds(col):
"""
>>> from pyspark.sql.functions import timestamp_seconds
Member

Can you set the session timezone? It caused SPARK-32088

Member Author

Thanks HyukjinKwon, will fix soon

/**
* Creates timestamp from the number of seconds since UTC epoch.
* @group = datetime_funcs
* @since = 3.1.0
Member

nit:

@group datetime_funcs
@since 3.1.0

@zero323
Member

zero323 commented Sep 20, 2020

Out of curiosity ‒ why do we provide Scala / Python API for timestamp_seconds and not for timestamp_millis and timestamp_micros?

@HyukjinKwon
Member

I think it's because timestamp_seconds can be the direct replacement for cast(num as timestamp), but arguably timestamp_millis and timestamp_micros are less common. So I think the other two were not added (per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L45-L56).

@zero323
Member

zero323 commented Sep 21, 2020

Makes sense, thanks.
