
[SPARK-31710][SQL] Fail casting numeric to timestamp by default #28593

Closed
wants to merge 101 commits

Conversation

GuoPhilipse
Member

@GuoPhilipse GuoPhilipse commented May 20, 2020

What changes were proposed in this pull request?

We fail casting from numeric types to timestamp by default.

Why are the changes needed?

Casting from numeric types to timestamp is non-standard, and it may produce different results in Spark than in other systems, for example Hive.

Does this PR introduce any user-facing change?

Yes. Users can no longer cast numeric values to timestamp directly; they have to use one of the following functions to achieve the same effect: TIMESTAMP_SECONDS / TIMESTAMP_MILLIS / TIMESTAMP_MICROS.
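For illustration, a minimal sketch of the migration (assuming a SparkSession named spark; the epoch values are arbitrary examples):

// Before this PR, the unit of the numeric value is implicit (seconds in Spark):
spark.sql("SELECT CAST(1590969600 AS TIMESTAMP)")

// After this PR, the cast fails by default; the unit is spelled out explicitly:
spark.sql("SELECT TIMESTAMP_SECONDS(1590969600)")        // seconds since the epoch
spark.sql("SELECT TIMESTAMP_MILLIS(1590969600000)")      // milliseconds since the epoch
spark.sql("SELECT TIMESTAMP_MICROS(1590969600000000)")   // microseconds since the epoch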

How was this patch tested?

unit test added

@GuoPhilipse
Member Author

@cloud-fan @bart-samwel @MaxGekk Could you please help review it?

@@ -59,8 +59,8 @@ object Cast {
case (StringType, TimestampType) => true
case (BooleanType, TimestampType) => true
case (DateType, TimestampType) => true
case (_: NumericType, TimestampType) => true

case (_: NumericType, TimestampType) => if ( SQLConf.get.numericConvertToTimestampEnable ) true
Member

Just SQLConf.get.numericConvertToTimestampEnable?

Member Author

Yes, it's a flag. We fail casting numeric to timestamp by default; if the flag is enabled, we give the user two choices (interpret the value as seconds or as milliseconds), otherwise the cast is rejected.
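A minimal sketch of the gated rule being discussed (the config accessor name follows the draft in this PR, not necessarily the final code):

// In object Cast.canCast: allow numeric -> timestamp only while the legacy flag is on.
case (_: NumericType, TimestampType) =>
  SQLConf.get.numericConvertToTimestampEnable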

@@ -266,7 +266,12 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
TypeCheckResult.TypeCheckSuccess
} else {
TypeCheckResult.TypeCheckFailure(
s"cannot cast ${child.dataType.catalogString} to ${dataType.catalogString}")
if ( child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType]) {
Member

Please, remove spaces:

if (child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType])

Take a look at the style guide: https://github.com/databricks/scala-style-guide

Member Author

Yes, I will correct it.

@@ -454,7 +459,10 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
}

// converting seconds to us
private[this] def longToTimestamp(t: Long): Long = SECONDS.toMicros(t)
private[this] def longToTimestamp(t: Long): Long = {
if (SQLConf.get.numericConvertToTimestampInSeconds) t * MICROS_PER_SECOND
Member

I don't think checking the flag for every value is a good idea. This will badly impact performance.
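One way to avoid the per-row overhead is to read the flag once per expression instance rather than on every call, e.g. (a sketch inside CastBase, reusing the TimeUnit helpers already imported there; names follow the draft in this PR):

// Capture the flag once, when the Cast expression is constructed, not per row.
private[this] val convertInSeconds: Boolean =
  SQLConf.get.numericConvertToTimestampInSeconds

// converting seconds (or, in the legacy Hive-style mode, milliseconds) to us
private[this] def longToTimestamp(t: Long): Long =
  if (convertInSeconds) SECONDS.toMicros(t) else MILLISECONDS.toMicros(t)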

private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * (long)$MICROS_PER_SECOND"
private[this] def longToTimeStampCode(l: ExprValue): Block = {
if (SQLConf.get.numericConvertToTimestampInSeconds) code"" +
code"$l * $MICROS_PER_SECOND"
Member

Does string interpolation work well here?

Comment on lines 1319 to 1326
def checkLongToTimestamp(l: Long, expected: Long): Unit = {
checkEvaluation(cast(l, TimestampType), expected)
}
checkLongToTimestamp(253402272000L, 253402272000000L)
checkLongToTimestamp(-5L, -5000L)
checkLongToTimestamp(1L, 1000L)
checkLongToTimestamp(0L, 0L)
checkLongToTimestamp(123L, 123000L)
Member

Indentation

Member Author

Yes, all corrected. Thanks, MaxGekk.

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented May 20, 2020

Test build #122896 has finished for PR 28593 at commit 6c1ffef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2020

Test build #122903 has finished for PR 28593 at commit 3a20aa6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2020

Test build #122906 has finished for PR 28593 at commit 4577fa8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2020

Test build #122909 has finished for PR 28593 at commit a39067d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2020

Test build #122910 has finished for PR 28593 at commit 7f0ba76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1277,7 +1285,11 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
val block = inline"new java.math.BigDecimal($MICROS_PER_SECOND)"
code"($d.toBigDecimal().bigDecimal().multiply($block)).longValue()"
}
private[this] def longToTimeStampCode(l: ExprValue): Block = code"$l * (long)$MICROS_PER_SECOND"
private[this] def longToTimeStampCode(l: ExprValue): Block = {
if (SQLConf.get.numericConvertToTimestampInSeconds) code"" +
Member

Let's change l to something else per https://github.com/databricks/scala-style-guide#variable-naming while we're here.

Member Author

Yes, let me correct it.

.internal()
.doc("The legacy only works when LEGACY_NUMERIC_CONVERT_TO_TIMESTAMP_ENABLE is true." +
"when true,the value will be interpreted as seconds,which follow spark style," +
"when false,value is interpreted as milliseconds,which follow hive style")
Member
@HyukjinKwon HyukjinKwon May 21, 2020

Sorry, but I still can't follow why Spark should take care of following Hive style here. Most likely the legacy users are already depending on this behaviour, and a few users might have had to do the workaround by themselves. I don't think even cast(ts as long) is a standard and widely accepted behaviour. -1 from me.

Member Author

Hi @HyukjinKwon,
Thanks for reviewing. We discussed the pain point of moving to Spark in #28568. I mean we can adopt both the compatibility flag and the new functions: to use the functions, users need to modify their tasks one by one with the casting compatibility flag turned off. Unfortunately, we have almost a hundred thousand tasks migrating from Hive to Spark. So with the flag, we would first fail a task if it contains CAST_NUMERIC_TO_TIMESTAMP; if the user really wants the cast, we would suggest the three newly added functions to them. Maybe it's a good way to avoid the case where a task succeeds while the cast result is wrong, which is more serious. Maybe others have met the same headache, so I hope we can embrace Spark better with this patch.

Member

This kind of compatibility isn't fully guaranteed in Spark, see also Compatibility with Apache Hive. Simply following Hive doesn't justify this PR.

There are already a bunch of mismatched behaviours and I don't like to target more compatibility, in particular, by fixing the basic functionalities such as cast and adding such switches to maintain. Why is it difficult to just add ts / 1000? The workaround seems very easy.

If we aim to get rid of cast(ts as long), adding separate functions is a-okay because it doesn't affect existing users, and it also looks like other DBMSs have their own ways of doing this by having such functions in general. Looks like we will have workarounds once the functions from #28534 are merged, and it seems you can leverage those functions alternatively as well.

I would say no to fixing a basic functionality to have another variant, non-standard behaviour, which could potentially trigger another set of non-standard behaviours in many other places in Spark.
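For reference, the ts / 1000 workaround mentioned above, under the legacy seconds-based cast (a sketch; the epoch value is arbitrary):

// Hive interprets CAST(<bigint> AS TIMESTAMP) as milliseconds since the epoch,
// while Spark's legacy cast interprets the value as seconds. Dividing by 1000
// before casting reproduces the Hive result under the legacy behaviour:
spark.sql("SELECT CAST(1590969600000 / 1000 AS TIMESTAMP)")

// With the new functions the unit is explicit instead:
spark.sql("SELECT TIMESTAMP_MILLIS(1590969600000)")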

Member Author

Hi @HyukjinKwon,
I agree with you that the functions are a nice design. But to be honest, while changing one SQL statement would be easy, we have almost a hundred thousand tasks. Changing every task with an explicit conversion, an implicit conversion, or a conversion inside an expression would be a huge undertaking; some individuals own over a thousand tasks, so it would be a really tough thing. I will jump for joy if we have better ideas. @HyukjinKwon

Member

In practice there's no workload that can be migrated from system A to system B without touching anything when full compatibility between them isn't guaranteed. I don't have a good idea for your workload.

I don't think this is the only case that needs some fixes when you migrate from Hive to Spark. Spark doesn't target full compatibility by design. We could think about some non-invasive fixes in practice, but this fix doesn't seem to be one.

Contributor

+1 to simply forbidding the cast of long to timestamp in Spark. Hive compatibility is not strong enough to justify the change, as other people may keep adding new behaviors for compatibility with other systems, and this can be endless.

Instead, I think it's better to forbid this non-standard cast. You can find all the places that need to change, with explicit errors from Spark. And you can add a Hive UDF as @bart-samwel suggested, if you need to fall back to Hive.
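If the Hive UDF fallback route is taken, registration could look roughly like this (hypothetical example only; the function name, class, and jar path are illustrative):

// Register a user-provided Hive UDF that implements the old millisecond-based
// conversion, then call it where the cast used to be.
spark.sql("""
  CREATE TEMPORARY FUNCTION legacy_millis_to_ts
  AS 'com.example.hive.udf.MillisToTimestamp'
  USING JAR '/path/to/legacy-udfs.jar'
""")
spark.sql("SELECT legacy_millis_to_ts(1590969600000)")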

Member

+1 from me too

Member Author

Great! It seems I cannot reopen the PR.

Contributor

reopened

Member Author

Thanks @cloud-fan, I have reverted the commit. Could you help review?

@HyukjinKwon
Member

Let me close this for now.

@cloud-fan cloud-fan reopened this May 22, 2020
@GuoPhilipse GuoPhilipse force-pushed the 31710-fix-compatibility branch from 7f0ba76 to 8d1deee on May 22, 2020 16:23
@GuoPhilipse GuoPhilipse changed the title [SPARK-31710][SQL] Add two compatibility flag to cast long to timestamp [SPARK-31710][SQL] Fail casting integral to timestamp by default May 22, 2020
@cloud-fan
Contributor

OK to test

@@ -59,7 +59,7 @@ object Cast {
case (StringType, TimestampType) => true
case (BooleanType, TimestampType) => true
case (DateType, TimestampType) => true
case (_: NumericType, TimestampType) => true
case (_: FractionalType, TimestampType) => true
Contributor

let's forbid fraction as well.

Contributor

and provide a legacy config to restore the old behavior.
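A sketch of the kind of legacy entry being suggested, in SQLConf style (the config key and wording are illustrative, not necessarily what was finally added):

val LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP =
  buildConf("spark.sql.legacy.allowCastNumericToTimestamp")
    .internal()
    .doc("When true, allow casting numeric types to timestamp, interpreting " +
      "the value as seconds, to restore the behaviour before this change.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false)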

Member Author

OK, will correct it.

Member Author

Do we need to forbid casting timestamp to numeric types at the same time? Maybe someone will complain about it later.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124029 has finished for PR 28593 at commit 8fe1960.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124042 has finished for PR 28593 at commit 8fe1960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -180,7 +180,7 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
"SELECT CAST(CAST('NaN' AS DOUBLE) AS DECIMAL(1,1)) FROM src LIMIT 1")

createQueryTest("constant null testing",
"""SELECT
"""| SELECT
Contributor

nit: can we revert this change?

Member Author

Yes, I have reverted it.

@@ -1 +0,0 @@
1.2
Contributor

Sorry, I was wrong. Calling TIMESTAMP_SECONDS on a double cast to int is not the same as the legacy cast from double to timestamp.

Can we simply remove that test?
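Concretely, this is why the golden file changes from 1.2 to 1.0 (a sketch under the legacy behaviour this PR removes):

// Legacy cast keeps the fractional seconds through the round trip:
spark.sql("SELECT CAST(CAST(1.2 AS TIMESTAMP) AS DOUBLE)")              // 1.2
// The rewritten test truncates 1.2 to 1 before building the timestamp:
spark.sql("SELECT CAST(TIMESTAMP_SECONDS(CAST(1.2 AS INT)) AS DOUBLE)") // 1.0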

Member Author

If we remove it, there will be a lack of testing for casting from double to timestamp in createQueryTest; we may need to add the legacy flag to let the test case pass.

Contributor

It's OK. We have tests for casting double to timestamp in CastSuite.

Contributor

BTW shall we remove this golden file since the test is removed?

Member Author

Ah, I removed it early this morning, just a few minutes after your comment; it is gone in the last commit, and the follow-up test has also finished.

@@ -0,0 +1 @@
1.0
Contributor

Which test produces this golden file?

Member Author

The test case "timestamp cast #3".
Previously it was used to test casting double to timestamp.

assert(-1 == res.get(0))
}

createQueryTest("timestamp cast #3",
"SELECT CAST(CAST(1.2 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1")
"""
|SELECT CAST(TIMESTAMP_SECONDS(CAST(1.2 AS INT)) AS DOUBLE) FROM src LIMIT 1
Contributor

Let's just remove this test. We can't cast fractional values to timestamp now.


createQueryTest("timestamp cast #4",
"SELECT CAST(CAST(-1.2 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1")
"""
|SELECT CAST(TIMESTAMP_SECONDS(CAST(-1.2 AS INT)) AS DOUBLE) FROM src LIMIT 1
Contributor

ditto

Contributor
@cloud-fan cloud-fan left a comment

LGTM except a few comments for the test

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124070 has finished for PR 28593 at commit 08aee30.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124066 has finished for PR 28593 at commit bc4b62c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124072 has finished for PR 28593 at commit 12b4239.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f0e6d0e Jun 16, 2020
@GuoPhilipse
Member Author

Thanks @cloud-fan !

@MaxGekk
Member

MaxGekk commented Jun 16, 2020

After rebasing on the recent master, I faced failures of DateTimeBenchmark because of this PR. I fixed the issue in PR #28843.

@since(3.1)
def timestamp_seconds(col):
"""
>>> from pyspark.sql.functions import timestamp_seconds
Member

Can you set the session timezone? It caused SPARK-32088

Member Author

Thanks HyukjinKwon, will fix soon

/**
* Creates timestamp from the number of seconds since UTC epoch.
* @group = datetime_funcs
* @since = 3.1.0
Member

nit:

@group datetime_funcs
@since 3.1.0

@zero323
Member

zero323 commented Sep 20, 2020

Out of curiosity ‒ why do we provide Scala / Python API for timestamp_seconds and not for timestamp_millis and timestamp_micros?

@HyukjinKwon
Member

I think it's because timestamp_seconds can be the direct replacement for cast(num as timestamp), but arguably timestamp_millis and timestamp_micros are less common. So I think the other two were not added (per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L45-L56).

@zero323
Member

zero323 commented Sep 21, 2020

Makes sense, thanks.
