Add GpuConv operator for the conv 10<->16 expression #8925
Conversation
      case _ =>
        willNotWorkOnGpu(because = "only literal 10 or 16 for from_base and to_base are supported")
    }
    if (SQLConf.get.ansiEnabled) {
ANSI mode only shows up for Conv in 3.4.0+ (https://issues.apache.org/jira/browse/SPARK-42427), and whether the expression is in ANSI mode or not should come from expr, not directly from SQLConf.get.ansiEnabled.
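A hedged sketch of the suggested change, assuming the Spark 3.4.0+ Conv constructor exposes an ansiEnabled field (per SPARK-42427) and glossing over the version shims the plugin would actually need:

    // Sketch only: read ANSI mode from the wrapped Conv expression rather than from
    // SQLConf.get.ansiEnabled; pre-3.4.0 Spark has no such field, so real code would
    // go through a shim layer. Base-literal checks from the existing tagging are omitted.
    override def tagExprForGpu(): Unit = {
      val convExpr = wrapped.asInstanceOf[Conv]
      if (convExpr.ansiEnabled) {
        willNotWorkOnGpu(because = "the GPU has no overflow checking for conv in ANSI mode")
      }
    }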
        willNotWorkOnGpu(because = "only literal 10 or 16 for from_base and to_base are supported")
    }
    if (SQLConf.get.ansiEnabled) {
      willNotWorkOnGpu(because = " the GPU has no overflow checking.")
I'm not sure we can enable this by default. Even in 3.1.x Spark checks for overflow when encoding the value and returns -1 if it sees an overflow; we do not do that.
The ANSI mode fix in 3.4.0 only added an exception instead of returning -1.
The following was run on Spark 3.3.0:
scala> val df = Seq("9223372036854775807", "-9223372036854775808", "9223372036854775808", "-9223372036854775809", "10223372036854775807", "-10223372036854775808", null).toDF
scala> df.repartition(1).selectExpr("conv(value, 10, 16)", "value").show(false)
+-------------------+---------------------+
|conv(value, 10, 16)|value |
+-------------------+---------------------+
|7FFFFFFFFFFFFFFF |9223372036854775807 |
|8000000000000000 |-9223372036854775808 |
|8000000000000000 |9223372036854775808 |
|7FFFFFFFFFFFFFFF |-9223372036854775809 |
|8DE0B6B3A763FFFF |10223372036854775807 |
|721F494C589C0000 |-10223372036854775808|
|null |null |
+-------------------+---------------------+
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> df.repartition(1).selectExpr("conv(value, 10, 16)", "value").show(false)
+-------------------+---------------------+
|conv(value, 10, 16)|value |
+-------------------+---------------------+
|7FFFFFFFFFFFFFFF |9223372036854775807 |
|FFFFFFFFFFFFFFFF |-9223372036854775808 |
|8000000000000000 |9223372036854775808 |
|FFFFFFFFFFFFFFFF |-9223372036854775809 |
|8DE0B6B3A763FFFF |10223372036854775807 |
|FFFFFFFFFFFFFFFF |-10223372036854775808|
|null |null |
+-------------------+---------------------+
I am aware of the -1-equivalent output, i.e. conversion to the string representation of 18446744073709551615 in to_base if to_base is positive, and to -1 if to_base is negative (a case we already fall back to the CPU for).
>>> spark.createDataFrame([('-1',), ('18446744073709551615',), ('18446744073709551616',)], 'a string').selectExpr('conv(a, 10, 10)').show()
+--------------------+
| conv(a, 10, 10)|
+--------------------+
|18446744073709551615|
|18446744073709551615|
|18446744073709551615|
+--------------------+
My reasoning is that since Spark does not let you distinguish whether 18446744073709551615 is the result of the overflow check or comes from the original data, it does not really matter. However, it's true that a customer may have a process in place ensuring that 18446744073709551615 is uniquely due to an overflow, and may filter those rows out with output != '18446744073709551615'.
So we can disable this by default for unlimited StringType and safely enable it for StringType(length) where the length guarantees no overflow.
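As a rough illustration of the "length guaranteed not to overflow" idea (my sketch, not code from the PR): conv treats the input as an unsigned 64-bit value whose maximum, 18446744073709551615, has 20 decimal digits, so a non-negative base-10 string of at most 19 digits can never overflow.

    // Sketch: why a StringType(length) column with length <= 19 (and no sign) cannot
    // overflow the unsigned 64-bit input of conv.
    val unsignedMax = BigInt(2).pow(64) - 1      // 18446744073709551615, 20 digits
    val largest19Digits = BigInt("9" * 19)       // 9999999999999999999
    assert(largest19Digits < unsignedMax)        // so <= 19 digits never overflows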
I agree that is probably the reason Spark added an ANSI mode, because it really is ambiguous. Seeing FFFFFFFFFFFFFFFF in the output when converting to hex is ambiguous, and you just don't know whether an overflow happened without further processing.
For me, I don't really want to enable this by default if it is only partially done, but I can see an argument for allowing it. @sameerz @mattf do you have an opinion on whether we should put this in with conv enabled by default while we work on a better long term solution? Would it be okay to put it in with conv disabled by default and some docs so users know how to enable it despite the potential incompatibilities?
I also ran some performance tests because I was nervous about the use of regular expressions to implement this. I am less concerned now, not because the regular expressions aren't bad, but because the CPU code is really bad too. I generated 1 billion rows of two longs. One column I just cast to a string and the other column I converted to base 16, then wrote the data out to parquet. I then ran a query to get a baseline for reading in the data and doing a min/max on it.
The median of 5 runs for the CPU was 56.607 seconds and the GPU was 10.487 seconds (the GPU, an A6000, is about 5.4x faster than my old 6-core, 12-thread desktop CPU). I then ran a similar query to understand the cost of conv.
The CPU took 568.396 seconds (I only ran it once) and the GPU took 183.554 seconds (again, I only ran it once; the GPU was 100% utilized for the entire query and started to throttle from heat). That works out to a difference of 511.789 seconds for the CPU runs and 173.067 seconds for the GPU runs, so the GPU is about 3 times faster than the CPU for conv.
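For reference, a hedged reconstruction of the kind of benchmark described above (the actual queries were not preserved in this thread; the path and column names below are made up):

    // Roughly: 1 billion rows of two longs, one cast to a decimal string and one
    // converted to base 16, written out to parquet.
    spark.range(1000000000L)
      .selectExpr(
        "CAST(id AS STRING) AS dec_str",
        "conv(CAST(id AS STRING), 10, 16) AS hex_str")
      .write.mode("overwrite").parquet("/data/conv_bench")

    val in = spark.read.parquet("/data/conv_bench")
    // Baseline: min/max over the raw strings.
    in.selectExpr("min(dec_str)", "max(dec_str)", "min(hex_str)", "max(hex_str)").show()
    // conv cost: the same aggregation with conv applied on top of the scan.
    in.selectExpr(
      "min(conv(dec_str, 10, 16))", "max(conv(dec_str, 10, 16))",
      "min(conv(hex_str, 16, 10))", "max(conv(hex_str, 16, 10))").show()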
@revans2 Thanks for doing the measurements. This PR is meant as a stepping stone to prevent CPU fallbacks for the cases that libcudf can already support. I will work on the custom kernel as a follow-on.
Contributes to #8511
The POC only supports 10/16 <-> 10/16 radix conversions. Without overflow checks, it is guaranteed to produce results identical to the CPU only for
Signed-off-by: Gera Shegalov gera@apache.org
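A minimal usage sketch (mine, not from the PR description), assuming the plugin jar is on the classpath and the conv override is enabled:

    // With the RAPIDS plugin active, 10 <-> 16 conversions like these should stay on
    // the GPU; other from_base/to_base combinations fall back to the CPU.
    spark.conf.set("spark.rapids.sql.enabled", "true")
    Seq("255", "4095", "9223372036854775807").toDF("v")
      .selectExpr("conv(v, 10, 16)", "conv(conv(v, 10, 16), 16, 10)")
      .show(false)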