-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
case when improvement: avoid copy_if_else #2079
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
My biggest concern is with side effects. And to be clear that is not an issue with this code. It is a more general problem with case/when I want you to be aware of. Case/when and if/else in Spark are lazy. This means that if an expression in the when part has a side effect, like can throw an exception, then we cannot evaluate it for any row that would cause an exception to be triggered. This appears to be specific to scalars in the case/when so it should be fine. |
Signed-off-by: Chong Gao <res_life@163.com>
@@ -42,6 +44,47 @@ void selectIndexTest() { | |||
} | |||
} | |||
|
|||
public static ColumnVector fromBooleansWithNulls(Boolean... values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
build |
Please provide benchmarks to show off better how much benefit this can provide? |
Will do it. |
For end to end perf result, refer to: @ttnghia is it needed to add benchmark tests? Above is end to end result. |
@ttnghia Help review again. |
src/main/cpp/src/case_when.cu
Outdated
if (row_count == 0) // empty begets empty | ||
return cudf::make_empty_column(cudf::type_id::INT32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (row_count == 0) // empty begets empty | |
return cudf::make_empty_column(cudf::type_id::INT32); | |
if (row_count == 0) { // empty begets empty | |
return cudf::make_empty_column(cudf::type_id::INT32); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/main/cpp/src/case_when.cu
Outdated
cudf::data_type{cudf::type_id::INT32}, row_count, cudf::mask_state::ALL_VALID, stream, mr); | ||
|
||
// select first true index | ||
auto d_table = cudf::table_device_view::create(when_bool_columns, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto d_table = cudf::table_device_view::create(when_bool_columns, stream); | |
auto const d_table_ptr = cudf::table_device_view::create(when_bool_columns, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/main/cpp/src/case_when.hpp
Outdated
* Select strings in scalar column according to index column. | ||
* If index is out of bound, use NULL value | ||
* e.g.: | ||
* scalar column: s0, s1, s2 | ||
* index column: 0, 1, 2, 2, 1, 0, 3 | ||
* output column: s0, s1, s2, s2, s1, s0, NULL | ||
* | ||
*/ | ||
std::unique_ptr<cudf::column> select_from_index( | ||
cudf::strings_column_view const& then_and_else_scalar_column, | ||
cudf::column_view const& select_index_column, | ||
rmm::cuda_stream_view stream = cudf::get_default_stream(), | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait. I just realize that this is just a gather. So we don't need this function at all. Just call cudf::gather
(through Java_ai_rapids_cudf_Table_gather
), which already supports all data types.
@ttnghia Thanks a lot, I will fix. |
Signed-off-by: Chong Gao <res_life@163.com>
build |
@ttnghia Please review again. |
Spark-Rapids corresponding change:
|
build |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry there are redundant headers that need to be removed.
src/main/cpp/src/case_when.cu
Outdated
namespace spark_rapids_jni { | ||
namespace detail { | ||
|
||
/** | ||
* Select the column index for the first true in bool columns for the specified row | ||
*/ | ||
struct select_first_true_fn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We should wrap anything that is locally used in a source file into an anonymous namespace to avoid name clashing in the future with other source files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Added anonymous namespace
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, but what do the performance numbers look like?
The perf numbers |
Yes, please refer to the above link. |
closes #2084
Case when improvement: avoid lots of copy_if_else which generates a string column.
For the following
case when
:Current logic, link:
Iteratively invoke
copy_if_else
to merge the tail 2 branches.This incurs lots of memory(string column) operations, here intruduced 3
copy_if_else
Improvement:
First evaluate all the
when
exprs and get bool columns.Then select the first true in the bool columns and return the bool column index
Then select salars according to the select column.
implement 2 kernels to handle:
Signed-off-by: Chong Gao res_life@163.com