case when improvement: avoid copy_if_else #2079

Merged
res-life merged 10 commits into NVIDIA:branch-24.08 from case-when-perf
Jun 25, 2024

Collaborator

@res-life res-life commented May 28, 2024

closes #2084

Case when improvement: avoid repeated copy_if_else calls, each of which generates a string column.

For the following case when:

select
  case
    when bool_1_expr then "value_1"
    when bool_2_expr then "value_2"
    when bool_3_expr then "value_3"
    ......
    else "value_else"
  end
from tab

Current logic, link:

Iteratively invoke copy_if_else to merge the last two branches.
This incurs many memory (string column) operations; the example above introduces 3 copy_if_else calls.

      val elseRet = elseValue
        .map(_.columnarEvalAny(batch))
        .getOrElse(GpuScalar(null, branches.last._2.dataType))
      val any = branches.foldRight[Any](elseRet) {
        case ((predicateExpr, trueExpr), falseRet) =>
          computeIfElse(batch, predicateExpr, trueExpr, falseRet)
      }
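To make the cost of the fold concrete, here is a minimal host-side sketch in plain C++ (not the plugin's Scala or cuDF code). The names `copy_if_else_model` and `case_when_fold` are hypothetical, null handling is ignored, and each "then" value is assumed to be a scalar as in the query above. It only illustrates that every branch materializes one more full-length string column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using BoolColumn   = std::vector<bool>;
using StringColumn = std::vector<std::string>;

// Hypothetical host-side stand-in for copy_if_else with a scalar "then" value:
// out[i] = mask[i] ? then_value : otherwise[i]. Each call builds a new full column.
StringColumn copy_if_else_model(BoolColumn const& mask,
                                std::string const& then_value,
                                StringColumn const& otherwise)
{
  assert(mask.size() == otherwise.size());
  StringColumn out(mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    out[i] = mask[i] ? then_value : otherwise[i];
  }
  return out;
}

// Fold from the last branch toward the first, mirroring foldRight.
// `passes` reports how many intermediate string columns were materialized.
StringColumn case_when_fold(std::vector<std::pair<BoolColumn, std::string>> const& branches,
                            std::string const& else_value,
                            std::size_t row_count,
                            int& passes)
{
  passes = 0;
  StringColumn result(row_count, else_value);
  for (auto it = branches.rbegin(); it != branches.rend(); ++it) {
    result = copy_if_else_model(it->first, it->second, result);
    ++passes;
  }
  return result;
}
```

With three `when` branches the fold runs three passes, and the first branch is applied last, so it wins when several predicates are true for the same row.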

Improvement:

First, evaluate all the when expressions to get one bool column per branch.
Then, for each row, find the first bool column that holds true and return that column's index.
Finally, select scalars according to the resulting index column.

Implement 2 kernels to handle this:

/**
 * select the first column index with true value.
 * e.g.:
 * column 0 in table: true,  false, false
 * column 1 in table: false, true,  false
 * column 2 in table: false, false, true
 * 
 * return column: 0, 1, 2
*/
std::unique_ptr<cudf::column> select_first_true_index(
  cudf::table_view const& when_bool_columns,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
 * Select strings in scalar column according to index column
 * scalar column: s0, s1, s2
 * index  column: 0,  1,  2,  2,  1,  0,  3
 * output column: s0, s1, s2, s2, s1, s0, null
 * 
*/
std::unique_ptr<cudf::column> select_from_index(
  cudf::strings_column_view const& then_and_else_scalar_column,
  cudf::column_view const& select_index_column,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
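The semantics of the two declarations can be sketched on the host in plain C++. This is not the CUDA implementation: the `_model` function names are hypothetical, nulls in the predicate columns are ignored, and it is an assumption here that a row with no true predicate maps to the column count, which is then out of bounds for a scalar column of the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using BoolColumn = std::vector<bool>;

// For each row, return the index of the first predicate column that holds true.
// Assumption: rows with no true predicate get the column count.
std::vector<std::int32_t> select_first_true_index_model(
  std::vector<BoolColumn> const& when_bool_columns, std::size_t row_count)
{
  auto const num_columns = static_cast<std::int32_t>(when_bool_columns.size());
  for (auto const& column : when_bool_columns) {
    assert(column.size() == row_count);
  }
  std::vector<std::int32_t> out(row_count, num_columns);
  for (std::size_t row = 0; row < row_count; ++row) {
    for (std::int32_t col = 0; col < num_columns; ++col) {
      if (when_bool_columns[col][row]) {
        out[row] = col;
        break;
      }
    }
  }
  return out;
}

// Pick scalars[index] for each row; an out-of-bounds index yields a null (std::nullopt).
std::vector<std::optional<std::string>> select_from_index_model(
  std::vector<std::string> const& scalars, std::vector<std::int32_t> const& indices)
{
  std::vector<std::optional<std::string>> out;
  out.reserve(indices.size());
  for (std::int32_t index : indices) {
    bool const in_bounds = index >= 0 && static_cast<std::size_t>(index) < scalars.size();
    if (in_bounds) {
      out.emplace_back(scalars[index]);
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

Unlike the fold, this builds only one small integer index column and one output string column, regardless of how many branches the case/when has.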

Signed-off-by: Chong Gao res_life@163.com

Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life requested review from revans2 and thirtiseven May 28, 2024 11:17
@revans2
Collaborator

revans2 commented May 28, 2024

My biggest concern is with side effects. To be clear, that is not an issue with this code; it is a more general problem with case/when that I want you to be aware of. Case/when and if/else in Spark are lazy. This means that if an expression in the when part has a side effect, such as throwing an exception, then we cannot evaluate it for any row that would trigger that exception. This change appears to be specific to scalars in the case/when, so it should be fine.
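The laziness concern can be illustrated with a small host-side C++ sketch. It models a query like `CASE WHEN x = 0 THEN 'zero' WHEN 10 / x > 3 THEN 'big' ELSE 'small' END`; the function names are hypothetical, and the throwing `checked_divide` simply stands in for any expression with a side effect. It is not how Spark or the plugin evaluates expressions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Division that raises on a zero divisor, standing in for an expression with a side effect.
int checked_divide(int numerator, int denominator)
{
  if (denominator == 0) { throw std::domain_error("division by zero"); }
  return numerator / denominator;
}

// Eager: evaluate every predicate for every row before selecting anything.
std::vector<std::string> case_when_eager(std::vector<int> const& x)
{
  std::vector<bool> is_zero(x.size());
  std::vector<bool> is_big(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) { is_zero[i] = (x[i] == 0); }
  // Throws as soon as any row has x == 0, even though the first branch claims that row.
  for (std::size_t i = 0; i < x.size(); ++i) { is_big[i] = checked_divide(10, x[i]) > 3; }

  std::vector<std::string> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = is_zero[i] ? "zero" : (is_big[i] ? "big" : "small");
  }
  return out;
}

// Lazy: a later predicate runs only for rows that no earlier branch claimed.
std::vector<std::string> case_when_lazy(std::vector<int> const& x)
{
  std::vector<std::string> out;
  out.reserve(x.size());
  for (int value : x) {
    if (value == 0) {
      out.emplace_back("zero");
    } else if (checked_divide(10, value) > 3) {
      out.emplace_back("big");
    } else {
      out.emplace_back("small");
    }
  }
  assert(out.size() == x.size());
  return out;
}
```

Evaluating all predicates up front is only safe when none of them can fail on rows that an earlier branch would have claimed.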

@winningsix

@res-life created an issue for this. #2084

Chong Gao added 2 commits May 29, 2024 14:14
Signed-off-by: Chong Gao <res_life@163.com>
@@ -42,6 +44,47 @@ void selectIndexTest() {
}
}

public static ColumnVector fromBooleansWithNulls(Boolean... values) {
Collaborator

Collaborator Author

Done.

@res-life res-life changed the base branch from branch-24.06 to branch-24.08 May 29, 2024 14:25
@res-life
Collaborator Author

build

@ttnghia
Collaborator

ttnghia commented May 30, 2024

Could you provide benchmarks to show how much benefit this provides?

@res-life
Collaborator Author

res-life commented Jun 3, 2024

Could you provide benchmarks to show how much benefit this provides?

Will do it.

@res-life
Collaborator Author

res-life commented Jun 3, 2024

For the end-to-end perf results, refer to:
NVIDIA/spark-rapids#10951 (comment)

@ttnghia Do we still need to add benchmark tests? The link above has the end-to-end result.

@res-life
Collaborator Author

res-life commented Jun 4, 2024

@ttnghia Please help review again.

Comment on lines 76 to 77
if (row_count == 0) // empty begets empty
return cudf::make_empty_column(cudf::type_id::INT32);
Collaborator

Suggested change
- if (row_count == 0)  // empty begets empty
-   return cudf::make_empty_column(cudf::type_id::INT32);
+ if (row_count == 0) {  // empty begets empty
+   return cudf::make_empty_column(cudf::type_id::INT32);
+ }

Collaborator Author

done.

cudf::data_type{cudf::type_id::INT32}, row_count, cudf::mask_state::ALL_VALID, stream, mr);

// select first true index
auto d_table = cudf::table_device_view::create(when_bool_columns, stream);
Collaborator

Suggested change
- auto d_table = cudf::table_device_view::create(when_bool_columns, stream);
+ auto const d_table_ptr = cudf::table_device_view::create(when_bool_columns, stream);

Collaborator Author

done.

Comment on lines 62 to 74
* Select strings in scalar column according to index column.
* If index is out of bound, use NULL value
* e.g.:
* scalar column: s0, s1, s2
* index column: 0, 1, 2, 2, 1, 0, 3
* output column: s0, s1, s2, s2, s1, s0, NULL
*
*/
std::unique_ptr<cudf::column> select_from_index(
cudf::strings_column_view const& then_and_else_scalar_column,
cudf::column_view const& select_index_column,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
Collaborator

@ttnghia ttnghia Jun 4, 2024

Oh wait. I just realized that this is just a gather, so we don't need this function at all. Just call cudf::gather (through Java_ai_rapids_cudf_Table_gather), which already supports all data types.

@res-life
Collaborator Author

res-life commented Jun 5, 2024

@ttnghia Thanks a lot, I will fix.

Chong Gao added 2 commits June 18, 2024 10:41
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

@ttnghia Please review again.

@res-life
Collaborator Author

Spark-Rapids corresponding change:
NVIDIA/spark-rapids@7c43e69

// removed
val finalRet = CaseWhen.selectFromIndex(scalarCol, firstTrueIndex)

@res-life
Collaborator Author

build

@ttnghia
Collaborator

ttnghia commented Jun 24, 2024

build

ttnghia
ttnghia previously approved these changes Jun 24, 2024
Collaborator

@ttnghia ttnghia left a comment

Sorry, there are redundant headers that need to be removed.

@ttnghia ttnghia dismissed their stale review June 24, 2024 06:03

Stale review

Comment on lines 26 to 32
namespace spark_rapids_jni {
namespace detail {

/**
* Select the column index for the first true in bool columns for the specified row
*/
struct select_first_true_fn {
Collaborator

Nit: We should wrap anything that is used only locally in a source file in an anonymous namespace to avoid future name clashes with other source files.
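As a small illustration of the pattern being requested, here is a sketch in plain C++. The namespace `example_lib`, the functor `is_out_of_bounds_fn`, and the function `count_null_outputs` are all hypothetical names chosen for this example, not code from the PR. The functor gets internal linkage, so another source file can define a type with the same name without a clash, while the public function stays visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace example_lib {  // hypothetical library namespace
namespace {

// File-local helper: internal linkage, invisible to other translation units.
struct is_out_of_bounds_fn {
  int num_scalars;
  bool operator()(int index) const { return index < 0 || index >= num_scalars; }
};

}  // anonymous namespace

// Public entry point: counts the indices that would produce a null output.
std::size_t count_null_outputs(std::vector<int> const& indices, int num_scalars)
{
  assert(num_scalars >= 0);
  auto const count =
    std::count_if(indices.begin(), indices.end(), is_out_of_bounds_fn{num_scalars});
  return static_cast<std::size_t>(count);
}

}  // namespace example_lib
```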

Collaborator Author

Done.
Added anonymous namespace

@res-life
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment

The code looks good, but what do the performance numbers look like?

@wjxiz1992
Collaborator

The perf numbers
NVIDIA/spark-rapids#10951 (comment)

@res-life
Collaborator Author

res-life commented Jun 25, 2024

Yes, please refer to the above link.
I retested against the latest code and got a similar result.

@res-life res-life merged commit c484470 into NVIDIA:branch-24.08 Jun 25, 2024
3 checks passed
@res-life res-life deleted the case-when-perf branch December 16, 2024 08:40
Development

Successfully merging this pull request may close these issues.

[FEA][Performance] Merge multiple "copy_if_else" for "case when" in the case of multiple branches
6 participants