-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval #26491
Conversation
cc @cloud-fan @MaxGekk @maropu @HyukjinKwon thanks very much in advance. |
v + " " + u | ||
} | ||
val str = kvs.mkString(" ") | ||
IntervalUtils.fromString(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we at here, the parsing is already done by antlr, so this may make things slower. But it's better to unify the code as the perf doesn't matter that much for interval literals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the performance improvement is actually for type constructor to parse interval string literals, which can't be seen via code change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a particular modified IntervalBenchmark
test, which mocks the type constructor logic, which is directly different with an IntervalUtils.fromString
call only.
private def addCase(benchmark: Benchmark, cardinality: Long, units: Seq[String]): Unit = {
Seq(true, false).foreach { withPrefix =>
val expr = buildString(withPrefix, units).cast("interval")
val note = if (withPrefix) "w/ interval" else "w/o interval"
benchmark.addCase(s"${units.length + 1} units $note", numIters = 3) { _ =>
// doBenchmark(cardinality, expr)
(0L until cardinality).foreach(_ => IntervalUtils.fromString(units.mkString(" ")))
}
}
}
we can see huge perfomance improment here. Any way, this is just used to parse typed literals, not a big deal acturally.
info] Running case: 1 units w/ interval
[info] Stopped after 3 iterations, 98544 ms
[info] Running case: 1 units w/o interval
[info] Stopped after 3 iterations, 78871 ms
[info] Running case: 2 units w/ interval
[info] Stopped after 3 iterations, 72469 ms
[info] Running case: 2 units w/o interval
[info] Stopped after 3 iterations, 78753 ms
[info] Running case: 1 units w/ interval
[info] Stopped after 3 iterations, 8926 ms
[info] Running case: 1 units w/o interval
[info] Stopped after 3 iterations, 8881 ms
[info] Running case: 2 units w/ interval
[info] Stopped after 3 iterations, 8773 ms
[info] Running case: 2 units w/o interval
[info] Stopped after 3 iterations, 8815 ms
Test build #113668 has finished for PR 26491 at commit
|
benchmark testwith before[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval 2198 2231 46 2.3 439.6 1.0X
[info] prepare string w/o interval 2189 2636 388 2.3 437.8 1.0X
[info] 1 units w/ interval 2256 2315 76 2.2 451.2 1.0X
[info] 1 units w/o interval 2106 2182 95 2.4 421.2 1.0X
[info] 2 units w/ interval 3137 3597 792 1.6 627.5 0.7X
[info] 2 units w/o interval 3038 3299 241 1.6 607.6 0.7X
[info] 3 units w/ interval 6747 7466 638 0.7 1349.3 0.3X
[info] 3 units w/o interval 6905 7583 605 0.7 1381.0 0.3X
[info] 4 units w/ interval 7621 8126 785 0.7 1524.2 0.3X
[info] 4 units w/o interval 7323 8823 1305 0.7 1464.5 0.3X
[info] 5 units w/ interval 8297 8522 378 0.6 1659.4 0.3X
[info] 5 units w/o interval 8215 8220 6 0.6 1642.9 0.3X
[info] 6 units w/ interval 9022 9111 78 0.6 1804.4 0.2X
[info] 6 units w/o interval 9642 12338 NaN 0.5 1928.4 0.2X
[info] 7 units w/ interval 9343 48099 1662 0.5 1868.6 0.2X
[info] 7 units w/o interval 8879 8941 54 0.6 1775.7 0.2X
[info] 8 units w/ interval 10401 10537 225 0.5 2080.1 0.2X
[info] 8 units w/o interval 10398 10413 20 0.5 2079.6 0.2X
[info] 9 units w/ interval 10206 10373 189 0.5 2041.2 0.2X
[info] 9 units w/o interval 10100 10151 45 0.5 2020.0 0.2X
[info] 10 units w/ interval 10887 11881 1596 0.5 2177.5 0.2X
[info] 10 units w/o interval 11430 13329 1754 0.4 2286.1 0.2X
[info] 11 units w/ interval 13046 14058 877 0.4 2609.2 0.2X
[info] 11 units w/o interval 12059 12297 265 0.4 2411.7 0.2X
[info] after
|
Test build #113675 has finished for PR 26491 at commit
|
retest this please |
Test build #113686 has finished for PR 26491 at commit
|
The behavior of |
Add to the pr desc or need to document it? |
@@ -789,7 +789,7 @@ class DDLParserSuite extends AnalysisTest with SharedSparkSession { | |||
assertError("select interval '23:61:15' hour to second", | |||
"minute 61 outside range [0, 59]") | |||
assertError("select interval '.1111111111' second", | |||
"nanosecond 1111111111 outside range") | |||
"Invalid interval string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually exposes a problem of stringToInterval
: the error reporting becomes worse.
I think the reason is stringToInterval
returns null to indicate invalid input, but then there is no chance to know what is the exact failure. Can we make stringToInterval
throw exception? cc @MaxGekk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it used to replace safeFromString
, we may could wrap a try catch here and use a safe
flag to control throw exception or return null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of a safe flag, I'd like to have 2 methods like before
stringToInterval
which throws exception on failuresafeStringToInterval
which callsstringToInterval
and try-catch to return null on failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i will check on this right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Initially I thought about raising exceptions from stringToInterval
and catch them on the next level to convert to null. For now, SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this part, I encounter another bug here, current string to interval cast logic does not support i.e. cast('.111 second' as interval)
which will fail in SIGN state and return null
, actually, it is 00:00:00.111
. I had fixed in the current pr, I also wonder if I do it separately to fix the bug first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the results of the master branch.
-- !query 63
select interval '.111 seconds'
-- !query 63 schema
struct<0.111 seconds:interval>
-- !query 63 output
0.111 seconds
-- !query 64
select cast('.111 seconds' as interval)
-- !query 64 schema
struct<CAST(.111 seconds AS INTERVAL):interval>
-- !query 64 output
NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the newly added interval parser has a bug...
Test build #113714 has finished for PR 26491 at commit
|
Test build #113715 has finished for PR 26491 at commit
|
Test build #113703 has finished for PR 26491 at commit
|
Test build #113718 has finished for PR 26491 at commit
|
Test build #113743 has finished for PR 26491 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Outdated
Show resolved
Hide resolved
@@ -482,13 +409,17 @@ object IntervalUtils { | |||
} | |||
} | |||
|
|||
def nextWord: UTF8String = { | |||
s.substring(i, s.numBytes()).subStringIndex(UTF8String.blankString(1), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for error reporting, so perf doesn't matter. Can we make it more readable? like
s.substring(i, s.numBytes()).toString.split("\\s+").head
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Outdated
Show resolved
Hide resolved
@@ -547,14 +478,16 @@ object IntervalUtils { | |||
case ' ' => | |||
fraction /= NANOS_PER_MICROS.toInt | |||
state = TRIM_BEFORE_UNIT | |||
case _ => return null | |||
case _ if '0' <= b && b <= '9' => | |||
throwIAE(s"invalid value fractional part '$fraction$nextWord' out of range") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about interval can only support nanosecond precision, '$nextWord' is out of range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW can we really implement nextWord
correctly? We need to know where we start to parse a number, seems we don't track it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
interval can only support nanosecond precision, '$nextWord' is out of range
the is not suitable, 0.9999999999
the nextword will be 9
only but '$fraction$nextWord' is the exact 9999999999
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/spark/pull/26491/files#diff-105a430951a95bd0c9c899d2a24e3badR107-R115 you can check some test case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd improve this.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Outdated
Show resolved
Hide resolved
} | ||
state = UNIT_SUFFIX | ||
case UNIT_SUFFIX => | ||
b match { | ||
case 's' => state = UNIT_END | ||
case ' ' => state = TRIM_BEFORE_SIGN | ||
case _ => return null | ||
case _ => throwIAE(s"invalid unit suffix '$nextWord'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invalid unit '$nextWord'
is better if we can implement nextword
correctly. Or we should introduce "currentWord"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the nextword
represents which character the error occurs, the only special case is for nanoseconds out of range, the fraction + nextword
could just exactly show the out of range number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we change the logic of unit parsing we can extract the whole part like case _ if b>= 'A' $$ b<= 'z' => unit = s.substring(i, s.numBytes()).subStringIndex(UTF8String.blankString(1), 1)
than do unit
case matching and error capture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for error reporting, usually backtracing is necessary. For example, it's better to tell users that 123a
is not valid, instead of just saying a
is not valid.
Test build #113851 has finished for PR 26491 at commit
|
@@ -482,13 +409,19 @@ object IntervalUtils { | |||
} | |||
} | |||
|
|||
def currentWord: String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan this method should be able to extract the error word
Test build #113864 has finished for PR 26491 at commit
|
@@ -482,13 +409,19 @@ object IntervalUtils { | |||
} | |||
} | |||
|
|||
def currentWord: UTF8String = { | |||
val strings = s.split(UTF8String.blankString(1), -1) | |||
val lenLeft = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still use UTF8String
here, because the trim here in val s = input.trim.toLowerCase
does handle '\n' '\t' ..., If we use toString
here, we need to additional logic to avoid ArrayIndexOutOfBoundsException
Test build #113883 has finished for PR 26491 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Show resolved
Hide resolved
checkFromInvalidString("1.5 days", "Error parsing interval string") | ||
checkFromInvalidString("1. hour", "Error parsing interval string") | ||
checkFromInvalidString("1.5 days", "'days' cannot have fractional part") | ||
checkFromInvalidString("1. hour", "'hour' cannot have fractional part") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does 1. second
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it does. But I guess this should not work, we may just expose another bug in the new string to interval parser. we may need a follow-up to disable this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checked pgsql
cloud0fan=# select interval '1.' second;
ERROR: invalid input syntax for type interval: "1."
LINE 1: select interval '1.' second;
we should fix it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll raise a jira and an pr later
checkFromInvalidString("1 Mour", "invalid unit 'mour'") | ||
checkFromInvalidString("1 aour", "invalid unit 'aour'") | ||
checkFromInvalidString("1a1 hour", "invalid value '1a1'") | ||
checkFromInvalidString("1.1a1 seconds", "invalid value '1.1a1' in fractional part") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: seems like invalid value '1.1a1'
is more clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea
def currentWord: UTF8String = { | ||
val strings = s.split(UTF8String.blankString(1), -1) | ||
val lenLeft = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length | ||
strings(strings.length - lenLeft) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC lenLeft
should be lenRight
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I guess left
is a bit ambiguous here, right is better
Test build #113918 has finished for PR 26491 at commit
|
Test build #113923 has finished for PR 26491 at commit
|
thanks, merging to master! |
What changes were proposed in this pull request?
We now have two different implementation for multi-units interval strings to CalendarInterval type values.
One is used to covert interval string literals to CalendarInterval. This approach will re-delegate the interval string to spark parser which handles the string as a
singleInterval
->multiUnitsInterval
-> eventually callIntervalUtils.fromUnitStrings
The other is used in
Cast
, which eventually callsIntervalUtils.stringToInterval
. This approach is ~10 times faster than the other.We should unify these two for better performance and simple logic. this pr uses the 2nd approach.
Why are the changes needed?
We should unify these two for better performance and simple logic.
Does this PR introduce any user-facing change?
no
How was this patch tested?
we shall not fail on existing uts