[SPARK-34388][SQL] Propagate the registered UDF name to ScalaUDF, ScalaUDAF and ScalaAggregator #31500

imback82 · 2021-02-06T22:18:40Z

What changes were proposed in this pull request?

This PR proposes to propagate the name used for registering UDFs to ScalaUDF, ScalaUDAF and ScalaAggregator.

Note that PythonUDF gets the name correctly:

spark/python/pyspark/sql/udf.py

Lines 358 to 359 in 466c045

    register_udf = UserDefinedFunction(f, returnType=returnType, name=name,
                                       evalType=PythonEvalType.SQL_BATCHED_UDF)

and the same holds for Hive UDFs:

spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala

Line 67 in 466c045

    udfExpr = Some(HiveSimpleUDF(name, new HiveFunctionWrapper(clazz.getName), input))

Why are the changes needed?

This PR can help in the following scenarios:

1) Better EXPLAIN output
2) By adding def name: String to UserDefinedExpression, we can match an expression by UserDefinedExpression and look up the catalog, a use case needed for [SPARK-34152][SQL] Make CreateViewStatement.child to be LogicalPlan's children so that it's resolved in analyze phase #31273.
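The second scenario can be sketched roughly as follows. This is a minimal, self-contained illustration and not Spark's actual code: `Expr`, `UserDefinedExpr`, `Column`, `Cast`, `ScalaUdfLike` and the `catalog` map are hypothetical stand-ins for the Catalyst expression tree, `UserDefinedExpression`, and the function catalog. It only shows why a `name` member on the common trait makes a type-based match sufficient for a catalog lookup.

```scala
// Simplified stand-ins for Catalyst classes; these are NOT Spark's definitions.
trait Expr { def children: Seq[Expr] }

// Plays the role of UserDefinedExpression with the newly required name member.
trait UserDefinedExpr { def name: String }

case class Column(col: String) extends Expr {
  val children: Seq[Expr] = Nil
}

case class Cast(child: Expr, to: String) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

// A user-defined function node that carries the name it was registered under.
case class ScalaUdfLike(name: String, children: Seq[Expr])
  extends Expr with UserDefinedExpr

object UdfNameSketch {
  // Collect the names of all user-defined expressions in the tree, depth-first.
  def udfNames(e: Expr): Seq[String] = {
    val here: Seq[String] = e match {
      case u: UserDefinedExpr => Seq(u.name) // match on the trait, not a concrete class
      case _ => Nil
    }
    here ++ e.children.flatMap(udfNames)
  }

  // Hypothetical catalog lookup: keep only the names the catalog knows about.
  def lookup(e: Expr, catalog: Map[String, String]): Seq[(String, String)] =
    udfNames(e).flatMap(n => catalog.get(n).map(cls => n -> cls))
}
```

For instance, `UdfNameSketch.udfNames(ScalaUdfLike("test_udf", Seq(Cast(Column("col1"), "bigint"))))` yields the single name `test_udf`, which can then be resolved against whatever catalog structure is in use.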

Does this PR introduce any user-facing change?

The EXPLAIN output involving UDFs will change to use the name under which the UDF was registered.

For example, for the following:

sql("CREATE TEMPORARY FUNCTION test_udf AS 'org.apache.spark.examples.sql.Spark33084'")
sql("SELECT test_udf(col1) FROM VALUES (1), (2), (3)").explain(true)

The output of the optimized plan will change from:

Aggregate [spark33084(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark33084@6906be0f, 1, 1) AS spark33084(col1)#237]
+- LocalRelation [col1#223]

to

Aggregate [test_udf(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark33084@7a62d697, 1, 1, Some(test_udf)) AS test_udf(col1)#237]
+- LocalRelation [col1#223]

How was this patch tested?

Added new tests.

SparkQA · 2021-02-07T04:00:47Z

Test build #134963 has finished for PR 31500 at commit a68b977.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-07T06:12:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39560/

SparkQA · 2021-02-07T07:42:45Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39560/

SparkQA · 2021-02-07T09:48:00Z

Test build #134977 has finished for PR 31500 at commit 56c3001.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imback82 · 2021-02-07T20:50:25Z

@cloud-fan this is from #31273 (comment)

srowen · 2021-02-08T02:25:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

@@ -1088,4 +1088,6 @@ trait ComplexTypeMergingExpression extends Expression {
 * Common base trait for user-defined functions, including UDF/UDAF/UDTF of different languages
 * and Hive function wrappers.
 */
-trait UserDefinedExpression
+trait UserDefinedExpression {
+  def name: String


Maybe default to using the class name or something?

It's an internal trait, seems OK to require it.
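The trade-off discussed in this thread (a default name versus a required one) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `NamedUserDefined`, `AggregatorLike`, `MyAverage` and the fallback to the class's simple name are hypothetical and do not claim to reproduce what the patch does in ScalaUDAF or ScalaAggregator. The point is only that an abstract `def name` forces each implementation to choose its own fallback explicitly.

```scala
// Hypothetical stand-in for the internal trait; not Spark's definition.
trait NamedUserDefined {
  // Abstract on purpose: every implementation must decide what its name is.
  def name: String
}

// An implementation that carries an optional registered name and, as an
// assumed fallback policy, uses the simple class name of the wrapped function.
case class AggregatorLike(fn: AnyRef, registeredName: Option[String] = None)
  extends NamedUserDefined {
  def name: String = registeredName.getOrElse(fn.getClass.getSimpleName)
}

// Hypothetical user-defined aggregation class.
class MyAverage

object NameFallbackSketch {
  // Render a call the way a plan string might show it: name(arg1, arg2, ...).
  def describe(e: NamedUserDefined, args: Seq[String]): String =
    s"${e.name}(${args.mkString(", ")})"
}
```

With a registered name, `NameFallbackSketch.describe(AggregatorLike(new MyAverage, Some("my_avg")), Seq("col1"))` renders as `my_avg(col1)`; without one, the rendering falls back to whatever the implementation chose, which is the decision a default on the trait would have hidden.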

cloud-fan · 2021-02-08T16:02:03Z

thanks, merging to master!

…laUDAF and ScalaAggregator

Closes apache#31500 from imback82/udaf_name.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

maropu · 2021-02-10T03:48:07Z

It seems this PR already has been merged, so I'll close this.

imback82 added 3 commits February 5, 2021 22:12

initial commit

1269ab8

code changes

f2fd94c

Add tests

a68b977

github-actions bot added the SQL label Feb 6, 2021

fix tests

56c3001

srowen reviewed Feb 8, 2021


cloud-fan approved these changes Feb 8, 2021


maropu closed this Feb 10, 2021
