[SPARK-43868][SQL][TESTS] Remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action #41369

LuciferYang · 2023-05-29T11:56:58Z

What changes were proposed in this pull request?

This pr remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on Github Action.

Why are the changes needed?

After SPARK-43225, org.codehaus.jackson:jackson-mapper-asl becomes a test scope dependency, so when using GA to run benchmark, it is not in the classpath because GA uses

spark/.github/workflows/benchmark.yml

Lines 179 to 183 in d61c77c

    
                   bin/spark-submit \ 
        
                     --driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \ 
        
                     --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \ 
        
                     "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \ 
        
                     "${{ github.event.inputs.class }}"

iunstead of the sbt Test/runMain.

ObjectHashAggregateExecBenchmark used TestHive, and TestHive will always call org.apache.hadoop.hive.ql.exec.FunctionRegistry#getFunctionNames to init originalUDFs before this pr, so when we run ObjectHashAggregateExecBenchmark on GitHub Actions, there will be the following exceptions:

Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
	at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132)
	at org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151)
	at org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:322)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:530)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:185)
	at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:133)
	at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:54)
	at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala:53)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.getSparkSession(ObjectHashAggregateExecBenchmark.scala:47)
	at org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark.$init$(SqlBasedBenchmark.scala:35)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.<clinit>(ObjectHashAggregateExecBenchmark.scala:45)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark.main(ObjectHashAggregateExecBenchmark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91)
	at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1025)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1116)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1125)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.type.TypeFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 40 more

Then I found that originalUDFs is a unused val in TestHive now(SPARK-1251 | #6920 introduced it and become unused after SPARK-20667 | #17908), so this pr remove it from TestHive to avoid calling FunctionRegistry#getFunctionNames.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GitHub Actions
Run ObjectHashAggregateExecBenchmark on Github Action:

Before

https://github.com/LuciferYang/spark/actions/runs/5128228630/jobs/9224706982

After

https://github.com/LuciferYang/spark/actions/runs/5128227211/jobs/9224704507

ObjectHashAggregateExecBenchmark run successfully.

LuciferYang · 2023-05-31T04:03:01Z

cc @wangyum @dongjoon-hyun @pan3793 FYI

dongjoon-hyun

+1, LGTM.

wangyum · 2023-05-31T05:09:57Z

Merged to master.

LuciferYang · 2023-05-31T05:33:27Z

Thanks @wangyum @dongjoon-hyun @pan3793 ~

…sure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action ### What changes were proposed in this pull request? This pr remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action. ### Why are the changes needed? After SPARK-43225, `org.codehaus.jackson:jackson-mapper-asl` becomes a test scope dependency, so when using GA to run benchmark, it is not in the classpath because GA uses https://github.com/apache/spark/blob/d61c77cac17029ee27319e6b766b48d314a4dd31/.github/workflows/benchmark.yml#L179-L183 iunstead of the sbt `Test/runMain`. `ObjectHashAggregateExecBenchmark` used `TestHive`, and `TestHive` will always call `org.apache.hadoop.hive.ql.exec.FunctionRegistry#getFunctionNames` to init `originalUDFs` before this pr, so when we run `ObjectHashAggregateExecBenchmark` on GitHub Actions, there will be the following exceptions: ``` Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132) at org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151) at org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519) at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163) at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154) at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:322) at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:530) at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:185) at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:133) at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:54) at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala:53) at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.getSparkSession(ObjectHashAggregateExecBenchmark.scala:47) at org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark.$init$(SqlBasedBenchmark.scala:35) at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.<clinit>(ObjectHashAggregateExecBenchmark.scala:45) at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark.main(ObjectHashAggregateExecBenchmark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128) at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328) at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91) at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1025) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1116) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1125) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.type.TypeFactory at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 40 more ``` Then I found that `originalUDFs` is a unused val in `TestHive` now(SPARK-1251 | apache#6920 introduced it and become unused after SPARK-20667 | apache#17908), so this pr remove it from `TestHive` to avoid calling `FunctionRegistry#getFunctionNames`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Run `ObjectHashAggregateExecBenchmark` on Github Action: **Before** https://github.com/LuciferYang/spark/actions/runs/5128228630/jobs/9224706982 <img width="1181" alt="image" src="https://github.com/apache/spark/assets/1475305/02a58e3c-2dad-4ad4-85e4-f8576a5aabed"> **After** https://github.com/LuciferYang/spark/actions/runs/5128227211/jobs/9224704507 <img width="1282" alt="image" src="https://github.com/apache/spark/assets/1475305/27c70ec6-e55d-4a19-a6c3-e892789b97f7"> `ObjectHashAggregateExecBenchmark` run successfully. Closes apache#41369 from LuciferYang/hive-udf. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>

…sure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action ### What changes were proposed in this pull request? This pr remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action. ### Why are the changes needed? After SPARK-43225, `org.codehaus.jackson:jackson-mapper-asl` becomes a test scope dependency, so when using GA to run benchmark, it is not in the classpath because GA uses https://github.com/apache/spark/blob/d61c77cac17029ee27319e6b766b48d314a4dd31/.github/workflows/benchmark.yml#L179-L183 iunstead of the sbt `Test/runMain`. `ObjectHashAggregateExecBenchmark` used `TestHive`, and `TestHive` will always call `org.apache.hadoop.hive.ql.exec.FunctionRegistry#getFunctionNames` to init `originalUDFs` before this pr, so when we run `ObjectHashAggregateExecBenchmark` on GitHub Actions, there will be the following exceptions: ``` Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132) at org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151) at org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519) at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163) at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154) at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:322) at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:530) at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:185) at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:133) at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:54) at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala:53) at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.getSparkSession(ObjectHashAggregateExecBenchmark.scala:47) at org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark.$init$(SqlBasedBenchmark.scala:35) at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.<clinit>(ObjectHashAggregateExecBenchmark.scala:45) at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark.main(ObjectHashAggregateExecBenchmark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128) at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328) at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91) at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1025) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1116) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1125) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.type.TypeFactory at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 40 more ``` Then I found that `originalUDFs` is a unused val in `TestHive` now(SPARK-1251 | apache#6920 introduced it and become unused after SPARK-20667 | apache#17908), so this pr remove it from `TestHive` to avoid calling `FunctionRegistry#getFunctionNames`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Run `ObjectHashAggregateExecBenchmark` on Github Action: **Before** https://github.com/LuciferYang/spark/actions/runs/5128228630/jobs/9224706982 <img width="1181" alt="image" src="https://github.com/apache/spark/assets/1475305/02a58e3c-2dad-4ad4-85e4-f8576a5aabed"> **After** https://github.com/LuciferYang/spark/actions/runs/5128227211/jobs/9224704507 <img width="1282" alt="image" src="https://github.com/apache/spark/assets/1475305/27c70ec6-e55d-4a19-a6c3-e892789b97f7"> `ObjectHashAggregateExecBenchmark` run successfully. Closes apache#41369 from LuciferYang/hive-udf. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 3472619)

fix

bc86289

LuciferYang marked this pull request as draft May 29, 2023 11:57

github-actions bot added the SQL label May 29, 2023

LuciferYang changed the title ~~[SQL][TESTS] Remove originalUDFs from TestHive~~ [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive May 30, 2023

Merge branch 'apache:master' into hive-udf

fb0639b

LuciferYang changed the title ~~[SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive~~ [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on GA May 31, 2023

LuciferYang marked this pull request as ready for review May 31, 2023 04:05

wangyum approved these changes May 31, 2023

View reviewed changes

pan3793 approved these changes May 31, 2023

View reviewed changes

dongjoon-hyun approved these changes May 31, 2023

View reviewed changes

wangyum closed this in 3472619 May 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-43868][SQL][TESTS] Remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action #41369

[SPARK-43868][SQL][TESTS] Remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action #41369

LuciferYang commented May 29, 2023 •

edited

Loading

LuciferYang commented May 31, 2023

dongjoon-hyun left a comment

wangyum commented May 31, 2023

LuciferYang commented May 31, 2023

	bin/spark-submit \
	--driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
	--jars "`find . -name '-SNAPSHOT-tests.jar' -o -name 'avro*-SNAPSHOT.jar' \| paste -sd ',' -`" \
	"`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
	"${{ github.event.inputs.class }}"

[SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on Github Action #41369

[SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on Github Action #41369

Conversation

LuciferYang commented May 29, 2023 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

LuciferYang commented May 31, 2023

dongjoon-hyun left a comment

Choose a reason for hiding this comment

wangyum commented May 31, 2023

LuciferYang commented May 31, 2023

[SPARK-43868][SQL][TESTS] Remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action #41369

[SPARK-43868][SQL][TESTS] Remove `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action #41369

LuciferYang commented May 29, 2023 •

edited

Loading