-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Case when performance improvement: reduce the copy_if_else
[databricks]
#10951
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some UTs here.
) { | ||
// return type is string type; all the then and else exprs are Scalars | ||
// avoid to use multiple `computeIfElse`s which will create multiple temp columns | ||
logWarning("==================== Running case with experimental =========== ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove this debug log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove
logWarning("==================== Running case with experimental =========== ") | ||
|
||
val whenBoolCols = branches.safeMap(_._1.columnarEval(batch).getBase).toArray | ||
val firstTrueIndex: ColumnVector = withResource(whenBoolCols) { _ => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: selectivityCol
better name?
Have build error because of depending on JNI PR. |
build |
End to End Perf test result:
Test steps:
3 branches:
|
This PR only improves the case that all This PR only improves the non side effect code. |
Signed-off-by: Chong Gao <res_life@163.com>
case ((predicateExpr, trueExpr), falseRet) => | ||
computeIfElse(batch, predicateExpr, trueExpr, falseRet) | ||
if (caseWhenFuseEnabled && branches.size > 2 && | ||
inputTypesForMerging.head == StringType && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why just strings? The gather used should be generic enough to do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other type, need to test, will update if get perf improvement for other types.
build |
2 similar comments
build |
build |
Perf tests for other types:
|
build |
copy_if_else
copy_if_else
[databricks]
@revans2 Please help review. |
build |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/conditionalExpressions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnVectorUtils.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnVectorUtils.scala
Show resolved
Hide resolved
@@ -296,3 +296,85 @@ def test_conditional_with_side_effects_unary_minus(data_gen, ansi_enabled): | |||
'CASE WHEN a > -32768 THEN -a ELSE null END'), | |||
conf = {'spark.sql.ansi.enabled': ansi_enabled}) | |||
|
|||
_case_when_scalars = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to leverage the data_gen
to build a more robust test case just as other test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case test_case_when_all_then_values_are_scalars
, for the bool columns, the data_gen
is as following:
data_gen = [
("a", boolean_gen),
("b", boolean_gen),
("c", boolean_gen),
("d", boolean_gen),
("e", boolean_gen)
]
sql = """
select case
when a then {}
when b then {}
when c then {}
when d then {}
else {}
end
from tab
"""
The _case_when_scalars
specify the const scalars, data_gen
can not be used.
build |
build |
LGTM |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnVectorUtils.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnVectorUtils.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnVectorUtils.scala
Outdated
Show resolved
Hide resolved
Premerge failure reason: |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Learned something new about the unapply in Spark for decimals. Thanks.
contributes to #2084
Current case when logic: link
Iteratively invoke copy_if_else to merge the tail 2 branches.
This will create lots of temp string columns via
copy_if_else
which will introduce more spend time.new logic
For the
case when
with multiple branches and thethen
values are all string scalars.e.g.:
This PR removed the
copy_if_else
s.Steps:
expr1
expr2...., and get bool columns.For more details, refer to JNI PR: NVIDIA/spark-rapids-jni#2079
Depending on PR: NVIDIA/spark-rapids-jni#2079
Signed-off-by: Chong Gao res_life@163.com