-
Notifications
You must be signed in to change notification settings - Fork 473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-8557][CH] Collapse nested function calls for And
/Or
for performance optimization
#8558
base: main
Are you sure you want to change the base?
Conversation
Run Gluten Clickhouse CI on x86 |
And
, GetStructField
And
, GetStructField
And
/GetStructField
Run Gluten Clickhouse CI on x86 |
And
/GetStructField
And
/Or
/GetStructField
/GetJsonObject
Run Gluten Clickhouse CI on x86 |
b86d9e7
to
2cc64f2
Compare
Run Gluten Clickhouse CI on x86 |
5 similar comments
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
性能测试
|
.internal() | ||
.doc("Collapse nested functions as one for optimization.") | ||
.stringConf | ||
.createWithDefault("get_struct_field,get_json_object"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about and
, or
@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging { | |||
if (udf.udfName.isEmpty) { | |||
throw new GlutenNotSupportException("UDF name is not found!") | |||
} | |||
val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get) | |||
var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collapsedFunctionsMap和udf有什么关系?看起来逻辑上没必要耦合在一块,最好能解耦
import org.apache.spark.sql.execution.SparkPlan | ||
import org.apache.spark.sql.types.{DataType, DataTypes} | ||
|
||
case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add necessary comment
} | ||
|
||
private def canBeOptimized(plan: SparkPlan): Boolean = plan match { | ||
case p: ProjectExecTransformer => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can expression in generate operator be optimized ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems can not
@@ -462,8 +463,55 @@ class GetJsonObjectImpl | |||
|
|||
static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; } | |||
|
|||
bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgbo-ustc pls review changes related to get_json_object
@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser | |||
DB::ActionsDAG & actions_dag) const override | |||
{ | |||
const auto & args = substrait_func.arguments(); | |||
if (args.size() != 2) | |||
if (args.size() < 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_json_object(get_json_object(d, '$.a'), '$.b') => optimize to get_json_object(d, '$.a', '$.b')
, which may have more than 2 arguments.
解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。 |
建议不同函数的优化拆分到不同的PR里面 |
mutable size_t total_normalized_rows = 0; | ||
|
||
template<typename JSONParser, typename JSONStringSerializer> | ||
void insertResultToColumn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be complex, I guess there should be a simpler implement with less branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should explain which case it is for each branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
9ad3176
to
6e59310
Compare
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work!
.internal() | ||
.doc("Collapse nested functions as one for optimization.") | ||
.stringConf | ||
.createWithDefault("and,or"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KevinyhZou, do you find any unsuitable corner cases? E.g., wrong result, performance degradation. If no, can we always enable this optimization without introducing a config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not found unsuitable case for the ci testing and my own testing. But I'm afraid there maybe some unsuitable case in our online sqls, which the ci testing do not cover,So I think it‘s better to keep them
* ScalaUDF at first, and then pass the ScalaUDF to clickhouse backend. e.g. And(And(a=1, | ||
* b=2),c=3) can be optimized to And(a=1, b=2, c=3),but And(a=1, b=2, c=3) can not be | ||
* supported by spark `And` function, so we need to convert it to ScalaUDF, with name is | ||
* `And`, and have 3 arguments, when pass the `ScalaUDF(#and(a=1,b=2,c=3))` to clickhouse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must we need a ScalaUDF? Can we make an optimized plan just compatible with the backend? Regardless of that it's not compatible with Spark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the ScalaUDF
, and re-implement this by use CHCollapseExpressionTransformer
, help to review this pr ? thanks @PHILO-HE
And
/Or
And
/Or
for performance optimization
Run Gluten Clickhouse CI on x86 |
2 similar comments
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
ba9497b
to
72857c0
Compare
Run Gluten Clickhouse CI on x86 |
2 similar comments
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
f4738d0
to
2c6d6f6
Compare
Run Gluten Clickhouse CI on x86 |
1 similar comment
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also introduce a dedicated rule for this optimization? Maybe, a custom And/Or expression class should be defined when necessary?
case _ => Option.empty[String] | ||
} | ||
|
||
private def canBeOptimized(expr: Expression): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For no nested And/Or
case, is it also viewed as an optimizable case?
|
||
def getExpressionName(expr: Expression): Option[String] = expr match { | ||
case _: And => ExpressionMappings.expressionsMap.get(classOf[And]) | ||
case _: Or => ExpressionMappings.expressionsMap.get(classOf[Or]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check seems redundant to the configuration based check.
case _ => exprCall.children.exists(c => canBeOptimized(c)) | ||
} | ||
case Some(f) => | ||
GlutenConfig.get.getSupportedCollapsedExpressions.split(",").exists(c => c.equals(f)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For mixed And/Or case, is it viewed as optimizable case?
Run Gluten Clickhouse CI on x86 |
fb730ea
to
578808b
Compare
Run Gluten Clickhouse CI on x86 |
578808b
to
2dbad5d
Compare
Run Gluten Clickhouse CI on x86 |
Now I have implement a dedicated rule for this, introduce the |
Run Gluten Clickhouse CI on x86 |
8ac225b
to
68ad6fd
Compare
Run Gluten Clickhouse CI on x86 |
cc @PHILO-HE |
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
(Fixes: #8557)
How was this patch tested?
test by ut