[SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval #26491

yaooqinn · 2019-11-13T04:53:49Z

What changes were proposed in this pull request?

We currently have two different implementations for converting multi-unit interval strings to CalendarInterval values.

One is used to convert interval string literals to CalendarInterval. This approach re-delegates the interval string to the Spark parser, which handles the string as singleInterval -> multiUnitsInterval and eventually calls IntervalUtils.fromUnitStrings.

The other is used in Cast, which eventually calls IntervalUtils.stringToInterval. This approach is ~10 times faster than the other.

We should unify these two for better performance and simpler logic. This PR uses the second approach.

Why are the changes needed?

We should unify these two for better performance and simpler logic.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Existing unit tests should continue to pass.

yaooqinn · 2019-11-13T04:55:14Z

cc @cloud-fan @MaxGekk @maropu @HyukjinKwon thanks very much in advance.

cloud-fan · 2019-11-13T05:43:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+          v + " " + u
+        }
+        val str = kvs.mkString(" ")
+        IntervalUtils.fromString(str)


By the time we get here, parsing has already been done by ANTLR, so this may make things slower. But it's better to unify the code, as perf doesn't matter that much for interval literals.

Yes, the performance improvement is actually for the type constructor that parses interval string literals, which can't be seen in this code change.

https://github.com/apache/spark/pull/26491/files#diff-9847f5cef7cf7fbc5830fbc6b779ee10R1875

This was measured with a specially modified IntervalBenchmark test that mocks the type constructor logic; it differs from the original only in calling IntervalUtils.fromString directly.

private def addCase(benchmark: Benchmark, cardinality: Long, units: Seq[String]): Unit = {
  Seq(true, false).foreach { withPrefix =>
    val expr = buildString(withPrefix, units).cast("interval")
    val note = if (withPrefix) "w/ interval" else "w/o interval"
    benchmark.addCase(s"${units.length + 1} units $note", numIters = 3) { _ =>
      // doBenchmark(cardinality, expr)
      (0L until cardinality).foreach(_ => IntervalUtils.fromString(units.mkString(" ")))
    }
  }
}

We can see a huge performance improvement here. Anyway, this is only used to parse typed literals, so it's not actually a big deal.

[info] Running case: 1 units w/ interval
[info] Stopped after 3 iterations, 98544 ms
[info] Running case: 1 units w/o interval
[info] Stopped after 3 iterations, 78871 ms
[info] Running case: 2 units w/ interval
[info] Stopped after 3 iterations, 72469 ms
[info] Running case: 2 units w/o interval
[info] Stopped after 3 iterations, 78753 ms

[info] Running case: 1 units w/ interval
[info] Stopped after 3 iterations, 8926 ms
[info] Running case: 1 units w/o interval
[info] Stopped after 3 iterations, 8881 ms
[info] Running case: 2 units w/ interval
[info] Stopped after 3 iterations, 8773 ms
[info] Running case: 2 units w/o interval
[info] Stopped after 3 iterations, 8815 ms

SparkQA · 2019-11-13T06:55:19Z

Test build #113668 has finished for PR 26491 at commit 49b74c3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2019-11-13T06:56:41Z

benchmark test

With val N = 5000000, performance does not seem to be affected much.

before

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                         2198           2231          46          2.3         439.6       1.0X
[info] prepare string w/o interval                        2189           2636         388          2.3         437.8       1.0X
[info] 1 units w/ interval                                2256           2315          76          2.2         451.2       1.0X
[info] 1 units w/o interval                               2106           2182          95          2.4         421.2       1.0X
[info] 2 units w/ interval                                3137           3597         792          1.6         627.5       0.7X
[info] 2 units w/o interval                               3038           3299         241          1.6         607.6       0.7X
[info] 3 units w/ interval                                6747           7466         638          0.7        1349.3       0.3X
[info] 3 units w/o interval                               6905           7583         605          0.7        1381.0       0.3X
[info] 4 units w/ interval                                7621           8126         785          0.7        1524.2       0.3X
[info] 4 units w/o interval                               7323           8823        1305          0.7        1464.5       0.3X
[info] 5 units w/ interval                                8297           8522         378          0.6        1659.4       0.3X
[info] 5 units w/o interval                               8215           8220           6          0.6        1642.9       0.3X
[info] 6 units w/ interval                                9022           9111          78          0.6        1804.4       0.2X
[info] 6 units w/o interval                               9642          12338         NaN          0.5        1928.4       0.2X
[info] 7 units w/ interval                                9343          48099        1662          0.5        1868.6       0.2X
[info] 7 units w/o interval                               8879           8941          54          0.6        1775.7       0.2X
[info] 8 units w/ interval                               10401          10537         225          0.5        2080.1       0.2X
[info] 8 units w/o interval                              10398          10413          20          0.5        2079.6       0.2X
[info] 9 units w/ interval                               10206          10373         189          0.5        2041.2       0.2X
[info] 9 units w/o interval                              10100          10151          45          0.5        2020.0       0.2X
[info] 10 units w/ interval                              10887          11881        1596          0.5        2177.5       0.2X
[info] 10 units w/o interval                             11430          13329        1754          0.4        2286.1       0.2X
[info] 11 units w/ interval                              13046          14058         877          0.4        2609.2       0.2X
[info] 11 units w/o interval                             12059          12297         265          0.4        2411.7       0.2X
[info]

after

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                         2055           2230         202          2.4         411.1       1.0X
[info] prepare string w/o interval                        1894           1938          71          2.6         378.8       1.1X
[info] 1 units w/ interval                                2235           2285          68          2.2         447.1       0.9X
[info] 1 units w/o interval                               2046           2076          38          2.4         409.1       1.0X
[info] 2 units w/ interval                                3024           3624         978          1.7         604.8       0.7X
[info] 2 units w/o interval                               4622           4963         336          1.1         924.3       0.4X
[info] 3 units w/ interval                                6794           7369         499          0.7        1358.7       0.3X
[info] 3 units w/o interval                               6090           7030         815          0.8        1217.9       0.3X
[info] 4 units w/ interval                                8100          10666         NaN          0.6        1619.9       0.3X
[info] 4 units w/o interval                               8063          10349         NaN          0.6        1612.5       0.3X
[info] 5 units w/ interval                                8429           8630         318          0.6        1685.8       0.2X
[info] 5 units w/o interval                               8448           9335        1365          0.6        1689.7       0.2X
[info] 6 units w/ interval                                8671           8707          31          0.6        1734.2       0.2X
[info] 6 units w/o interval                               8613           8638          38          0.6        1722.5       0.2X
[info] 7 units w/ interval                                8924           8981          68          0.6        1784.8       0.2X
[info] 7 units w/o interval                               8858           8889          52          0.6        1771.5       0.2X
[info] 8 units w/ interval                               10640          11768         979          0.5        2127.9       0.2X
[info] 8 units w/o interval                              10990          11551         554          0.5        2197.9       0.2X
[info] 9 units w/ interval                               10783          11849        1110          0.5        2156.5       0.2X
[info] 9 units w/o interval                              10408          12267        1609          0.5        2081.6       0.2X
[info] 10 units w/ interval                              11245          13406         851          0.4        2249.1       0.2X
[info] 10 units w/o interval                             11331          11444         102          0.4        2266.2       0.2X
[info] 11 units w/ interval                              12480          13403         919          0.4        2496.0       0.2X
[info] 11 units w/o interval                             12001          12009           7          0.4        2400.2       0.2X

SparkQA · 2019-11-13T08:05:02Z

Test build #113675 has finished for PR 26491 at commit 803d454.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2019-11-13T08:19:53Z

retest this please

SparkQA · 2019-11-13T11:37:24Z

Test build #113686 has finished for PR 26491 at commit 803d454.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-13T11:51:22Z

The behavior of select interval '.1111111111' second becomes different after this PR

yaooqinn · 2019-11-13T11:55:53Z

The behavior of select interval '.1111111111' second becomes different after this PR

Should I add this to the PR description, or does it need to be documented?

cloud-fan · 2019-11-13T11:59:23Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala

@@ -789,7 +789,7 @@ class DDLParserSuite extends AnalysisTest with SharedSparkSession {
    assertError("select interval '23:61:15' hour to second",
      "minute 61 outside range [0, 59]")
    assertError("select interval '.1111111111' second",
-      "nanosecond 1111111111 outside range")
+      "Invalid interval string")


This actually exposes a problem of stringToInterval: the error reporting becomes worse.

I think the reason is that stringToInterval returns null to indicate invalid input, so there is no chance to know what the exact failure is. Can we make stringToInterval throw an exception? cc @MaxGekk

It was introduced to replace safeFromString. We could wrap it in a try-catch here and use a safe flag to control whether to throw an exception or return null.

instead of a safe flag, I'd like to have 2 methods like before

stringToInterval which throws exception on failure

safeStringToInterval which calls stringToInterval and try-catch to return null on failure
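The two-method split proposed above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code that was merged: `Micros`, `unitToMicros`, and the toy whitespace-splitting parser are hypothetical stand-ins for CalendarInterval and the real state-machine parser in IntervalUtils. The point is just that the throwing variant carries a specific message, and the safe variant is a thin try-catch wrapper that maps failure to null.

```scala
object IntervalParsingSketch {
  // Hypothetical stand-in for CalendarInterval: total microseconds only.
  final case class Micros(value: Long)

  // Only a few units, to keep the sketch short (assumption, not Spark's list).
  private val unitToMicros: Map[String, Long] = Map(
    "second" -> 1000000L,
    "minute" -> 60L * 1000000L,
    "hour" -> 3600L * 1000000L)

  /** Throws IllegalArgumentException with a specific message on failure. */
  def stringToInterval(input: String): Micros = {
    if (input == null) {
      throw new IllegalArgumentException("Interval string cannot be null")
    }
    val words = input.trim.toLowerCase.split(" +").filter(_.nonEmpty)
    if (words.isEmpty || words.length % 2 != 0) {
      throw new IllegalArgumentException(s"Invalid interval string: '$input'")
    }
    // Words come in (value, unit) pairs, e.g. "1 hour 2 seconds".
    val total = words.grouped(2).map { case Array(v, u) =>
      val n = try v.toLong catch {
        case _: NumberFormatException =>
          throw new IllegalArgumentException(s"invalid value '$v'")
      }
      val scale = unitToMicros.getOrElse(
        u.stripSuffix("s"),
        throw new IllegalArgumentException(s"invalid unit '$u'"))
      n * scale
    }.sum
    Micros(total)
  }

  /** Same parsing, but returns null on failure, as Cast expects. */
  def safeStringToInterval(input: String): Micros =
    try stringToInterval(input) catch {
      case _: IllegalArgumentException => null
    }
}
```

With this shape, a caller that wants a precise error (for example, the literal parser) uses `stringToInterval`, while a caller that must return NULL for bad input uses `safeStringToInterval`, and neither needs a flag argument.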

ok, i will check on this right now.

This makes sense. Initially I thought about raising exceptions from stringToInterval and catch them on the next level to convert to null. For now, SGTM.

While working on this part, I encountered another bug: the current string-to-interval cast logic does not support inputs such as cast('.111 second' as interval), which fails in the SIGN state and returns null, when it should actually be 00:00:00.111. I have fixed it in the current PR, but I also wonder whether I should fix the bug separately first.

These are the results of the master branch.

-- !query 63
select interval '.111 seconds'
-- !query 63 schema
struct<0.111 seconds:interval>
-- !query 63 output
0.111 seconds

-- !query 64
select cast('.111 seconds' as interval)
-- !query 64 schema
struct<CAST(.111 seconds AS INTERVAL):interval>
-- !query 64 output
NULL

Seems like the newly added interval parser has a bug...

SparkQA · 2019-11-13T15:21:41Z

Test build #113714 has finished for PR 26491 at commit 79f5892.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-13T15:37:14Z

Test build #113715 has finished for PR 26491 at commit 0814fc4.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-13T15:53:44Z

Test build #113703 has finished for PR 26491 at commit ed3b35f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-13T17:47:31Z

Test build #113718 has finished for PR 26491 at commit 3b1667e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-14T04:16:52Z

Test build #113743 has finished for PR 26491 at commit 5aa09ca.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-15T06:59:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

@@ -482,13 +409,17 @@ object IntervalUtils {
      }
    }

+    def nextWord: UTF8String = {
+      s.substring(i, s.numBytes()).subStringIndex(UTF8String.blankString(1), 1)


This is for error reporting, so perf doesn't matter. Can we make it more readable? like

s.substring(i, s.numBytes()).toString.split("\\s+").head

cloud-fan · 2019-11-15T07:03:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

@@ -547,14 +478,16 @@ object IntervalUtils {
            case ' ' =>
              fraction /= NANOS_PER_MICROS.toInt
              state = TRIM_BEFORE_UNIT
-            case _ => return null
+            case _ if '0' <= b && b <= '9' =>
+              throwIAE(s"invalid value fractional part '$fraction$nextWord' out of range")


How about: interval can only support nanosecond precision, '$nextWord' is out of range

BTW, can we really implement nextWord correctly? We need to know where we started parsing the number, and it seems we don't track that now.

how about interval can only support nanosecond precision, '$nextWord' is out of range

That is not suitable: for 0.9999999999, nextWord will be only 9, whereas '$fraction$nextWord' is exactly 9999999999.

https://github.com/apache/spark/pull/26491/files#diff-105a430951a95bd0c9c899d2a24e3badR107-R115 you can check some test case here.

I'd improve this.
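To make the precision discussion concrete, here is a minimal sketch of how a fractional-seconds part can be accumulated digit by digit with a shrinking scale, so that a tenth digit is detected as out of range. It is an assumption-laden illustration rather than the parser in IntervalUtils: `FractionSketch`, `fractionToMicros`, and the constant names are hypothetical, and the error message simply borrows the wording suggested in the thread, reporting the whole digit string.

```scala
object FractionSketch {
  private val NanosPerSecond = 1000000000
  private val NanosPerMicro = 1000

  /**
   * Converts the digits after the decimal point of a seconds value to
   * microseconds. At most 9 digits (nanosecond precision) are accepted;
   * digits below microsecond precision are truncated.
   */
  def fractionToMicros(digits: String): Int = {
    var fraction = 0                 // accumulated nanoseconds
    var scale = NanosPerSecond / 10  // weight of the first digit: 100,000,000 ns
    digits.foreach { c =>
      if (c < '0' || c > '9') {
        throw new IllegalArgumentException(s"invalid value '$digits'")
      }
      if (scale == 0) {
        // A tenth digit would need sub-nanosecond precision.
        throw new IllegalArgumentException(
          s"interval can only support nanosecond precision, '$digits' is out of range")
      }
      fraction += (c - '0') * scale
      scale /= 10
    }
    fraction / NanosPerMicro
  }
}
```

Because the whole digit string is passed in here, the message can show the full offending number; the streaming parser discussed above only sees one byte at a time, which is why it needs something like `$fraction$nextWord` or a backtracking helper to reconstruct it.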

cloud-fan · 2019-11-15T07:07:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

          }
          state = UNIT_SUFFIX
        case UNIT_SUFFIX =>
          b match {
            case 's' => state = UNIT_END
            case ' ' => state = TRIM_BEFORE_SIGN
-            case _ => return null
+            case _ => throwIAE(s"invalid unit suffix '$nextWord'")


invalid unit '$nextWord' is better if we can implement nextWord correctly. Otherwise we should introduce "currentWord".

nextWord indicates the character at which the error occurs. The only special case is nanoseconds out of range, where fraction + nextWord shows exactly the out-of-range number.

Or we could change the unit-parsing logic to extract the whole unit, like case _ if b >= 'A' && b <= 'z' => unit = s.substring(i, s.numBytes()).subStringIndex(UTF8String.blankString(1), 1), and then do the unit case matching and error capture.

For error reporting, backtracking is usually necessary. For example, it's better to tell users that 123a is not valid, instead of just saying a is not valid.

SparkQA · 2019-11-15T08:05:02Z

Test build #113851 has finished for PR 26491 at commit da7f9e8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2019-11-15T08:37:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

@@ -482,13 +409,19 @@ object IntervalUtils {
      }
    }

+    def currentWord: String = {


@cloud-fan this method should be able to extract the error word

SparkQA · 2019-11-15T11:51:59Z

Test build #113864 has finished for PR 26491 at commit 61626ff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2019-11-15T14:48:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

@@ -482,13 +409,19 @@ object IntervalUtils {
      }
    }

+    def currentWord: UTF8String = {
+      val strings = s.split(UTF8String.blankString(1), -1)
+      val lenLeft = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length


I still use UTF8String here because of how the trim in val s = input.trim.toLowerCase handles '\n', '\t', ...; if we use toString here, we need additional logic to avoid an ArrayIndexOutOfBoundsException.

SparkQA · 2019-11-15T18:32:57Z

Test build #113883 has finished for PR 26491 at commit e12bb86.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-16T06:39:52Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala

-    checkFromInvalidString("1.5 days", "Error parsing interval string")
-    checkFromInvalidString("1. hour", "Error parsing interval string")
+    checkFromInvalidString("1.5 days", "'days' cannot have fractional part")
+    checkFromInvalidString("1. hour", "'hour' cannot have fractional part")


does 1. second work?

Yes, it does. But I guess this should not work; we may have just exposed another bug in the new string-to-interval parser. We may need a follow-up to disallow this.

checked pgsql

cloud0fan=# select interval '1.' second;
ERROR: invalid input syntax for type interval: "1."
LINE 1: select interval '1.' second;

we should fix it later.

OK, I'll file a JIRA and a PR later.

cloud-fan · 2019-11-16T06:42:07Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala

+    checkFromInvalidString("1 Mour", "invalid unit 'mour'")
+    checkFromInvalidString("1 aour", "invalid unit 'aour'")
+    checkFromInvalidString("1a1 hour", "invalid value '1a1'")
+    checkFromInvalidString("1.1a1 seconds", "invalid value '1.1a1' in fractional part")


nit: seems like invalid value '1.1a1' is clearer

cloud-fan · 2019-11-16T09:06:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

+    def currentWord: UTF8String = {
+      val strings = s.split(UTF8String.blankString(1), -1)
+      val lenLeft = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length
+      strings(strings.length - lenLeft)


IIUC lenLeft should be lenRight?

Ok, I guess left is a bit ambiguous here, right is better
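The index arithmetic behind currentWord can be illustrated with plain strings. The sketch below is an assumption: it uses java.lang.String instead of UTF8String, and the object and parameter names are hypothetical. It counts how many space-separated pieces remain from the failing position to the end (lenRight) and uses that count to index back into the pieces of the whole string, which recovers the full word containing the error rather than only its tail.

```scala
object CurrentWordSketch {
  /** The tail of the word starting at index i, up to the next space. */
  def nextWord(s: String, i: Int): String =
    s.substring(i).split(" ", -1).head

  /** The whole space-delimited word that contains index i. */
  def currentWord(s: String, i: Int): String = {
    val words = s.split(" ", -1)
    // Number of pieces from position i to the end, counting the partial one.
    val lenRight = s.substring(i).split(" ", -1).length
    words(words.length - lenRight)
  }
}
```

For the input "1a1 hour" with the parser failing at index 1 (the 'a'), `nextWord` yields "a1" while `currentWord` yields "1a1", which matches the "invalid value '1a1'" message expected by the tests above. Both splits use the same single-space delimiter, so the two piece counts stay consistent with each other.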

SparkQA · 2019-11-16T12:29:03Z

Test build #113918 has finished for PR 26491 at commit 50bfd2e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class DescribeNamespaceStatement(
case class ShowTableStatement(
case class DescribeNamespace(
case class DescribeNamespaceExec(

SparkQA · 2019-11-16T14:08:02Z

Test build #113923 has finished for PR 26491 at commit 32d5194.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-18T07:50:20Z

thanks, merging to master!

[SPARK-29870][SQL] Unify the logic of multi-units interval string to …

49b74c3

…CalendarInterval

cloud-fan approved these changes Nov 13, 2019

cloud-fan reviewed Nov 13, 2019

fix ut

803d454

fix ut

ed3b35f

cloud-fan reviewed Nov 13, 2019

add exception for string to interval

79f5892

fix '.111'

0814fc4

add uts

3b1667e

dongjoon-hyun added the SQL label Nov 13, 2019

yaooqinn added 2 commits November 14, 2019 09:43

Merge branch 'master' into SPARK-29870

8ad53bc

fix build

5aa09ca

cloud-fan reviewed Nov 14, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

cloud-fan reviewed Nov 14, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

yaooqinn mentioned this pull request Nov 14, 2019

[SPARK-29888][SQL] new interval string parser shall handle numeric with only fractional part #26514

Closed

yaooqinn added 2 commits November 15, 2019 13:57

Merge branch 'master' into SPARK-29870

f6a9424

rm state in error and fix ut

da7f9e8

cloud-fan reviewed Nov 15, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

cloud-fan reviewed Nov 15, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

cloud-fan reviewed Nov 15, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

cloud-fan reviewed Nov 15, 2019

improve err msg

61626ff

yaooqinn commented Nov 15, 2019

fix ut

e12bb86

yaooqinn commented Nov 15, 2019

cloud-fan reviewed Nov 16, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

cloud-fan reviewed Nov 16, 2019

cloud-fan approved these changes Nov 16, 2019

yaooqinn added 2 commits November 16, 2019 16:31

nit

89dd64f

Merge remote-tracking branch 'origin/master' into SPARK-29870

50bfd2e

cloud-fan reviewed Nov 16, 2019

naming

32d5194

cloud-fan closed this in 50f6d93 Nov 18, 2019

yaooqinn deleted the SPARK-29870 branch November 18, 2019 07:52

yaooqinn mentioned this pull request Nov 18, 2019

[SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point #26573

Closed
