
SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging #237

Closed · wants to merge 4 commits

Conversation

aarondav (Contributor):

Previously, we decided whether to include the datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical and previously supported build pathway would silently start pulling in Hive jars.

This patch has the following features/bug fixes:

  • Use SPARK_HIVE (default false) to determine whether to include Hive in the assembly jar (see the build-invocation sketch after the TODO list below).
  • Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of the jars produced by Maven).
  • Fix assemble-deps, since we no longer use a different ASSEMBLY_DIR.
  • Avoid adding the log message in compute-classpath.sh to the classpath itself :)

Still TODO before mergeable:

  • We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
  • Spark SQL documentation updates.
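For illustration, here is a minimal sketch of how the two build paths above would be invoked; the launcher script path and the -DskipTests flag are assumptions for brevity, not part of this patch:

```
# SBT: include Hive in the assembly jar only when SPARK_HIVE is set
SPARK_HIVE=true sbt/sbt assembly

# Maven: the analogous switch is the -Phive profile
mvn -Phive -DskipTests clean package
```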

@@ -373,7 +373,6 @@
<groupId>org.apache.derby</groupId>
<artifactId>derby</artifactId>
<version>10.4.2.0</version>
<scope>test</scope>
Contributor Author:

Hive requires derby in compile scope via a transitive dependency on hive-metastore, and this setting was overriding that. Derby does not appear to get pulled into non-hive assembly jars regardless.
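As an aside (not part of the patch), one way to check where Derby enters the dependency graph and at what scope is Maven's dependency tree; only the -Phive profile name comes from this PR, the rest is a stock Maven invocation:

```
# Show how derby is pulled in (e.g. transitively via hive-metastore) and at what scope
mvn -Phive dependency:tree -Dincludes=org.apache.derby
```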

@aarondav (Contributor Author):

@marmbrus @pwendell Please take a look, my build-foo is not strong.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished.

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13466/

<!-- Matches the version of jackson-core-asl pulled in by avro -->
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.8.8</version>
Contributor:

Should the version number be declared in the parent pom, with only the dependency added here? By the way, one reason this might be getting messed up is that we intentionally exclude jackson from a bunch of dependencies, so maybe we are actually excluding it when pulling in Avro.

@pwendell (Contributor):

Thanks Aaron - looks good. From what I can tell you've just copied the model used for ganglia and previously YARN.

The only ugliness is that SPARK_HIVE is now needed at runtime to determine whether to include the Datanucleus jars. And this is conflated with the setting at compile-time which has a different meaning. This is a bit unfortunate - is there no better way here? We could have the assembly name be different if Hive is included, but that's sort of ugly too. We could also just include the datanucleus jars on the classpath if they are present. The only downside is that if someone did a build for hive, then did a normal build, they would still include it. But in that case maybe it's just okay to include them - it's not a widely used library.

@pwendell (Contributor):

We could also include them and log a message like "Including Datanucleus jars needed for Hive"; then, if someone happens not to want them (say, they previously did a Hive assembly), they would at least know what was going on.
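A minimal sketch of that idea, assuming the compute-classpath.sh layout discussed in this PR (FWDIR and CLASSPATH as used there; everything else is illustrative). Logging to stderr is one way to keep the message out of the classpath string, per the bug noted in the PR description:

```
# Illustrative only: if datanucleus jars are present under lib_managed, add them to the
# classpath and log to stderr so the message does not leak into the classpath string.
datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/datanucleus-*.jar 2>/dev/null)
if [ -n "$datanucleus_jars" ]; then
  echo "Including Datanucleus jars needed for Hive" 1>&2
  for jar in $datanucleus_jars; do
    CLASSPATH="$CLASSPATH:$jar"
  done
fi
```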

@marmbrus (Contributor):

One other option is to have a file flag like we do for RELEASE that is used in addition to the env var.

@aarondav (Contributor Author):

One issue with the "include datanucleus if it exists" approach is that at some point we need to download datanucleus for the user. Right now sbt does that via lib_managed, but there is no pathway for getting the datanucleus jars from Maven, so we may need an external script that is run from spark-class. That would require the same kind of runtime option as we have now, or a file flag.
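Purely as a sketch of the "external download script" idea floated here (this helper is hypothetical, not part of the patch; the version and artifact list are placeholders, and only the Maven Central URL layout is standard):

```
# Hypothetical helper: fetch the datanucleus jars into lib_managed/jars for Maven builds
DN_VERSION="3.2.2"   # placeholder version, not what the PR ships
mkdir -p "$FWDIR/lib_managed/jars"
for artifact in datanucleus-core datanucleus-api-jdo datanucleus-rdbms; do
  curl -fsSLo "$FWDIR/lib_managed/jars/$artifact-$DN_VERSION.jar" \
    "https://repo1.maven.org/maven2/org/datanucleus/$artifact/$DN_VERSION/$artifact-$DN_VERSION.jar"
done
```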

@pwendell (Contributor):

Ah, sorry, I didn't read your description. Indeed, this wouldn't work for Maven. Having a file flag seems potentially complicated if someone does multiple builds, some with and some without Hive.

@@ -395,8 +404,6 @@ object SparkBuild extends Build {
// assembly jar.
def hiveSettings = sharedSettings ++ assemblyProjSettings ++ Seq(
Contributor:

Any reason we still need a separate hive assembly target? If hive will be part of the normal assembly jar (based on maybeHive), should these rules be moved into assemblyProjSettings?

Contributor Author:

Good catch, removed!

@AmplabJenkins: Can one of the admins verify this patch?

@aarondav (Contributor Author) commented Apr 5, 2014:

Alright, I changed the solution: Maven now also downloads the datanucleus jars into lib_managed/jars/ (unlike SBT, Maven downloads only these jars, and only when building Spark with Hive). We now include datanucleus on the classpath if the jars are present in lib_managed and the Spark assembly includes Hive (checked by inspecting the assembly with the "jar" command).
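A commented sketch of that runtime check, following the compute-classpath.sh hunk quoted further down in this review; the classpath-append part at the end is illustrative and may not match the final diff exactly:

```
# Count datanucleus jars that the build dropped into lib_managed
num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ 2>/dev/null | grep "datanucleus-.*\.jar" | wc -l)
if [ "$num_datanucleus_jars" -gt 0 ]; then
  # Use whichever assembly is available: the full assembly or the -deps assembly
  AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
  # Ask the "jar" tool whether Hive classes were compiled into that assembly
  num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
  if [ "$num_hive_files" -gt 0 ]; then
    # Hive assembly plus downloaded jars: put datanucleus on the classpath
    for jar in "$FWDIR"/lib_managed/jars/datanucleus-*.jar; do
      CLASSPATH="$CLASSPATH:$jar"
    done
  fi
fi
```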

@AmplabJenkins: Build triggered.

@AmplabJenkins: Build started.

@AmplabJenkins: Build finished.

@AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13809/

aarondav added 3 commits April 5, 2014 17:30
Previously, we decided whether to include the datanucleus jars based on the
existence of a spark-hive-assembly jar, which was incidentally built whenever
"sbt assembly" was run. This meant that a typical and previously supported
pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive
  in the assembly jar.
- Analogous feature in Maven with -Phive.
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR

Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have
  spark-class download them if SPARK_HIVE is set similar to how sbt downloads
  itself.
- Spark SQL documentation updates.
@aarondav (Contributor Author) commented Apr 6, 2014:

Also fixes SPARK-1309.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13811/

@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"

DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
Contributor:

Are you sure this is valid? I seem to remember the Maven and sbt builds name the assembly jar differently(?). If this is the right way to do it, would it make sense to make the check in the if statement consistent with this?

Contributor Author:

Maven doesn't build assemble-deps, as far as I know.

Contributor:

That is correct. Maven doesn't build assemble-deps right now, so it's fine to keep this sbt-specific. It raises the question of if/how we should add assemble-deps to the Maven build, but that's a whole different issue.

Contributor Author:

It may actually not be too hard to add assemble-deps to Maven if we have a build that inherits from "assembly" and simply excludes Spark's groupId. That said, packaging the Maven assembly is roughly 5x faster than with SBT.

num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ | grep "datanucleus-.*\\.jar" | wc -l)
if [ $num_datanucleus_jars -gt 0 ]; then
AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
Contributor:

Also, there's an extra space after tvf, unless it's intentional?

@pwendell (Contributor) commented Apr 6, 2014:

I tested this in maven and sbt builds. Seemed to work for me!

aarondav changed the title from "[WIP] SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging" to "SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging" on Apr 6, 2014
@aarondav (Contributor Author) commented Apr 6, 2014:

Addressed comments and removed WIP.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13821/

@pwendell (Contributor) commented Apr 6, 2014:

LGTM

@pwendell (Contributor) commented Apr 7, 2014:

Thanks a lot for looking at this Aaron, I've merged it.

asfgit closed this in 4106558 on Apr 7, 2014
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Formatting fix

This is a single-line change. The diff appears larger here due to github being out of sync.
(cherry picked from commit 10c3c0c)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Previously, we decided whether to include the datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical and previously supported pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)

Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
- Spark SQL documentation updates.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#237 from aarondav/master and squashes the following commits:

5dc4329 [Aaron Davidson] Typo fixes
dd4f298 [Aaron Davidson] Doc update
dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
davies pushed a commit to davies/spark that referenced this pull request Apr 14, 2015
[SPARKR-154] Phase 2: implement cartesian().
asfgit pushed a commit that referenced this pull request Apr 17, 2015
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
* MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* use environment based secrets for S3Job

* add documentation, remove old secrets test
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Use global env to expose all jobs env vars
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
* MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector
imback82 added a commit to imback82/spark-4 that referenced this pull request Feb 8, 2021
…laUDAF and ScalaAggregator

### What changes were proposed in this pull request?

This PR proposes to propagate the name used for registering UDFs to `ScalaUDF`, `ScalaUDAF` and `ScalaAggregator`.

Note that `PythonUDF` gets the name correctly (https://github.com/apache/spark/blob/466c045bfac20b6ce19f5a3732e76a5de4eb4e4a/python/pyspark/sql/udf.py#L358-L359), and the same holds for Hive UDFs (https://github.com/apache/spark/blob/466c045bfac20b6ce19f5a3732e76a5de4eb4e4a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala#L67).

### Why are the changes needed?

This PR can help in the following scenarios:
1) Better EXPLAIN output
2) By adding `def name: String` to `UserDefinedExpression`, we can match an expression by `UserDefinedExpression` and look up the catalog, a use case needed for apache#31273.

### Does this PR introduce _any_ user-facing change?

The EXPLAIN output involving udfs will be changed to use the name used for UDF registration.

For example, for the following:
```
sql("CREATE TEMPORARY FUNCTION test_udf AS 'org.apache.spark.examples.sql.Spark33084'")
sql("SELECT test_udf(col1) FROM VALUES (1), (2), (3)").explain(true)
```
The output of the optimized plan will change from:
```
Aggregate [spark33084(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330846906be0f, 1, 1) AS spark33084(col1)apache#237]
+- LocalRelation [col1#223]
```
to
```
Aggregate [test_udf(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330847a62d697, 1, 1, Some(test_udf)) AS test_udf(col1)apache#237]
+- LocalRelation [col1#223]
```

### How was this patch tested?

Added new tests.

Closes apache#31500 from imback82/udaf_name.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>