[SPARK-8478][SQL] Harmonize UDF-related code to use uniformly UDF instead of Udf #6920
Conversation
Test build #35364 has finished for PR 6920 at commit

pinging @marmbrus

Test build #35589 has finished for PR 6920 at commit

Jenkins, retest this please.

Test build #35606 has finished for PR 6920 at commit

Jenkins, retest this please (#6974)

Test build #35663 has finished for PR 6920 at commit

Test build #36009 has finished for PR 6920 at commit

Jenkins, retest this please.

Test build #36017 has finished for PR 6920 at commit

Thanks, merged to master.
…sure `ObjectHashAggregateExecBenchmark` can run successfully on GitHub Actions

### What changes were proposed in this pull request?

This PR removes `originalUDFs` from `TestHive` to ensure `ObjectHashAggregateExecBenchmark` can run successfully on GitHub Actions.

### Why are the changes needed?

After SPARK-43225, `org.codehaus.jackson:jackson-mapper-asl` became a test-scope dependency, so when GitHub Actions runs a benchmark it is not on the classpath, because the workflow uses https://github.com/apache/spark/blob/d61c77cac17029ee27319e6b766b48d314a4dd31/.github/workflows/benchmark.yml#L179-L183 instead of the sbt `Test/runMain`. `ObjectHashAggregateExecBenchmark` uses `TestHive`, and before this PR `TestHive` always called `org.apache.hadoop.hive.ql.exec.FunctionRegistry#getFunctionNames` to initialize `originalUDFs`, so running `ObjectHashAggregateExecBenchmark` on GitHub Actions failed with the following exception:

```
Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
	at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132)
	at org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151)
	at org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:322)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:530)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:185)
	at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:133)
	at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:54)
	at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala:53)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.getSparkSession(ObjectHashAggregateExecBenchmark.scala:47)
	at org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark.$init$(SqlBasedBenchmark.scala:35)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.<clinit>(ObjectHashAggregateExecBenchmark.scala:45)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark.main(ObjectHashAggregateExecBenchmark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91)
	at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1025)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1116)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1125)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.type.TypeFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 40 more
```

`originalUDFs` is now an unused val in `TestHive` (SPARK-1251 | #6920 introduced it, and it became unused after SPARK-20667 | #17908), so this PR removes it from `TestHive` to avoid calling `FunctionRegistry#getFunctionNames`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Pass GitHub Actions
- Run `ObjectHashAggregateExecBenchmark` on GitHub Actions:

**Before** https://github.com/LuciferYang/spark/actions/runs/5128228630/jobs/9224706982 <img width="1181" alt="image" src="https://github.com/apache/spark/assets/1475305/02a58e3c-2dad-4ad4-85e4-f8576a5aabed">

**After** https://github.com/LuciferYang/spark/actions/runs/5128227211/jobs/9224704507 <img width="1282" alt="image" src="https://github.com/apache/spark/assets/1475305/27c70ec6-e55d-4a19-a6c3-e892789b97f7">

`ObjectHashAggregateExecBenchmark` runs successfully.

Closes #41369 from LuciferYang/hive-udf.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
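The failure mode described above, where an unused field's initializer still forces class loading, can be sketched in plain Java. The class names below are hypothetical stand-ins for `TestHiveSparkSession` and Hive's `FunctionRegistry`, not the real Spark/Hive code:

```java
import java.util.Set;

// Hypothetical stand-in for Hive's FunctionRegistry: the first reference to
// this class runs its static initializer. In the real case that initializer
// registers UDF classes, which throws NoClassDefFoundError when a dependency
// such as jackson-mapper-asl is missing from the classpath.
class RegistrySketch {
    static boolean classInitialized = false;

    static {
        classInitialized = true;
    }

    static Set<String> getFunctionNames() {
        return Set.of("get_json_object", "json_tuple");
    }
}

// Hypothetical stand-in for TestHiveSparkSession: the field initializer runs
// during construction even though nothing ever reads originalUDFs, so merely
// constructing the session forces RegistrySketch's class initialization.
public class TestHiveSketch {
    final Set<String> originalUDFs = RegistrySketch.getFunctionNames();

    public static void main(String[] args) {
        new TestHiveSketch();
        // prints "registry initialized: true"
        System.out.println("registry initialized: " + RegistrySketch.classInitialized);
    }
}
```

Deleting the unused field removes the only reference to the registry class, so its (potentially failing) static initializer never runs; that is the essence of dropping `originalUDFs` from `TestHive`.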
(cherry picked from commit 3472619)
Follow-up of #6902 for consistency between `Udf` and `UDF`.