
SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging #237

Closed · wants to merge 4 commits

Conversation

aarondav (Contributor):

Previously, we decided whether to include the datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical and previously supported build pathway would silently start pulling in Hive jars.

This patch has the following features/bug fixes:

  • Use SPARK_HIVE (default false) to determine whether to include Hive in the assembly jar (see the build-invocation sketch after the TODO list below).
  • Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of the jars produced by Maven).
  • Fix assemble-deps, since we no longer use a different ASSEMBLY_DIR.
  • Avoid adding the log message in compute-classpath.sh to the classpath itself :)

Still TODO before mergeable:

  • We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
  • Spark SQL documentation updates.
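For illustration, here is a minimal sketch of how the two build paths above would be invoked; the launcher script path and the -DskipTests flag are assumptions for brevity, not part of this patch:

```
# SBT: include Hive in the assembly jar only when SPARK_HIVE is set
SPARK_HIVE=true sbt/sbt assembly

# Maven: the analogous switch is the -Phive profile
mvn -Phive -DskipTests clean package
```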

@@ -373,7 +373,6 @@
<groupId>org.apache.derby</groupId>
<artifactId>derby</artifactId>
<version>10.4.2.0</version>
<scope>test</scope>
Contributor Author:

Hive requires derby in compile scope via a transitive dependency on hive-metastore, and this setting was overriding that. Derby does not appear to get pulled into non-hive assembly jars regardless.
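As an aside (not part of the patch), one way to check where Derby enters the dependency graph and at what scope is Maven's dependency tree; only the -Phive profile name comes from this PR, the rest is a stock Maven invocation:

```
# Show how derby is pulled in (e.g. transitively via hive-metastore) and at what scope
mvn -Phive dependency:tree -Dincludes=org.apache.derby
```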

@aarondav (Contributor Author):

@marmbrus @pwendell Please take a look, my build-foo is not strong.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished.

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13466/

<!-- Matches the version of jackson-core-asl pulled in by avro -->
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.8.8</version>
Contributor:

Should the version number be declared in the parent pom, with only the dependency added here? By the way, one reason this might be getting messed up is that we intentionally exclude jackson from a bunch of dependencies, so maybe we are actually excluding it when pulling in Avro.

@pwendell (Contributor):

Thanks Aaron - looks good. From what I can tell you've just copied the model used for ganglia and previously YARN.

The only ugliness is that SPARK_HIVE is now needed at runtime to determine whether to include the Datanucleus jars. And this is conflated with the setting at compile-time which has a different meaning. This is a bit unfortunate - is there no better way here? We could have the assembly name be different if Hive is included, but that's sort of ugly too. We could also just include the datanucleus jars on the classpath if they are present. The only downside is that if someone did a build for hive, then did a normal build, they would still include it. But in that case maybe it's just okay to include them - it's not a widely used library.

@pwendell (Contributor):

We could also include them and log a message like "Including Datanucleus jars needed for Hive"; then, if someone happens not to want them (say, they previously did a Hive assembly), they would at least know what was going on.
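A minimal sketch of that idea, assuming the compute-classpath.sh layout discussed in this PR (FWDIR and CLASSPATH as used there; everything else is illustrative). Logging to stderr is one way to keep the message out of the classpath string, per the bug noted in the PR description:

```
# Illustrative only: if datanucleus jars are present under lib_managed, add them to the
# classpath and log to stderr so the message does not leak into the classpath string.
datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/datanucleus-*.jar 2>/dev/null)
if [ -n "$datanucleus_jars" ]; then
  echo "Including Datanucleus jars needed for Hive" 1>&2
  for jar in $datanucleus_jars; do
    CLASSPATH="$CLASSPATH:$jar"
  done
fi
```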

@marmbrus (Contributor):

One other option is to have a file flag like we do for RELEASE that is used in addition to the env var.

@aarondav (Contributor Author):

One issue with the "include datanucleus if it exists" approach is that at some point we need to download datanucleus for the user. Right now sbt does that via lib_managed, but there is no pathway for getting the datanucleus jars from Maven, so we may need an external script that is run from spark-class. That would require the same kind of runtime option as we have now, or a file flag.
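Purely as a sketch of the "external download script" idea floated here (this helper is hypothetical, not part of the patch; the version and artifact list are placeholders, and only the Maven Central URL layout is standard):

```
# Hypothetical helper: fetch the datanucleus jars into lib_managed/jars for Maven builds
DN_VERSION="3.2.2"   # placeholder version, not what the PR ships
mkdir -p "$FWDIR/lib_managed/jars"
for artifact in datanucleus-core datanucleus-api-jdo datanucleus-rdbms; do
  curl -fsSLo "$FWDIR/lib_managed/jars/$artifact-$DN_VERSION.jar" \
    "https://repo1.maven.org/maven2/org/datanucleus/$artifact/$DN_VERSION/$artifact-$DN_VERSION.jar"
done
```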

@pwendell (Contributor):

Ah, sorry, I didn't read your description. Indeed, this wouldn't work for Maven. Having a file flag seems potentially complicated if someone does multiple builds, some with and some without Hive.

@@ -395,8 +404,6 @@ object SparkBuild extends Build {
// assembly jar.
def hiveSettings = sharedSettings ++ assemblyProjSettings ++ Seq(
Contributor:

Any reason we still need a separate hive assembly target? If hive will be part of the normal assembly jar (based on maybeHive), should these rules be moved into assemblyProjSettings?

Contributor Author:

Good catch, removed!

@AmplabJenkins: Can one of the admins verify this patch?

@aarondav (Contributor Author) commented Apr 5, 2014:

Alright, I changed the solution: Maven now also downloads the datanucleus jars into lib_managed/jars/ (unlike SBT, Maven downloads only these jars, and only when building Spark with Hive). We now include datanucleus on the classpath if the jars are present in lib_managed and the Spark assembly includes Hive (checked by inspecting the assembly with the "jar" command).
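A commented sketch of that runtime check, following the compute-classpath.sh hunk quoted further down in this review; the classpath-append part at the end is illustrative and may not match the final diff exactly:

```
# Count datanucleus jars that the build dropped into lib_managed
num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ 2>/dev/null | grep "datanucleus-.*\.jar" | wc -l)
if [ "$num_datanucleus_jars" -gt 0 ]; then
  # Use whichever assembly is available: the full assembly or the -deps assembly
  AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
  # Ask the "jar" tool whether Hive classes were compiled into that assembly
  num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
  if [ "$num_hive_files" -gt 0 ]; then
    # Hive assembly plus downloaded jars: put datanucleus on the classpath
    for jar in "$FWDIR"/lib_managed/jars/datanucleus-*.jar; do
      CLASSPATH="$CLASSPATH:$jar"
    done
  fi
fi
```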

@AmplabJenkins: Build triggered.

@AmplabJenkins: Build started.

@AmplabJenkins: Build finished.

@AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13809/

aarondav added 3 commits April 5, 2014 17:30
Previously, we decided whether to include the datanucleus jars based on the
existence of a spark-hive-assembly jar, which was incidentally built whenever
"sbt assembly" was run. This meant that a typical and previously supported
pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive
  in the assembly jar.
- Analogous feature in Maven with -Phive.
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR

Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have
  spark-class download them if SPARK_HIVE is set similar to how sbt downloads
  itself.
- Spark SQL documentation updates.
@aarondav (Contributor Author) commented Apr 6, 2014:

Also fixes SPARK-1309.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13811/

@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"

DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
Contributor:

Are you sure this is valid? I seem to remember the Maven and sbt builds name the assembly jar differently(?). If this is the right way to do it, would it make sense to make the check in the if statement consistent with this?

Contributor Author:

Maven doesn't build assemble-deps, as far as I know.

Contributor:

That is correct. Maven doesn't build assemble-deps right now, so it's fine to keep this sbt-specific. It raises the question of if/how we should add assemble-deps to the Maven build, but that's a whole different issue.

Contributor Author:

It may actually not be too hard to add assemble-deps to Maven if we have a build that inherits from "assembly" and simply excludes Spark's groupId. That said, packaging the Maven assembly is roughly 5x faster than with SBT.

num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ | grep "datanucleus-.*\\.jar" | wc -l)
if [ $num_datanucleus_jars -gt 0 ]; then
AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
Contributor:

Also, there's an extra space after tvf, unless it's intentional?

@pwendell (Contributor) commented Apr 6, 2014:

I tested this in maven and sbt builds. Seemed to work for me!

aarondav changed the title from "[WIP] SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging" to "SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging" on Apr 6, 2014
@aarondav (Contributor Author) commented Apr 6, 2014:

Addressed comments and removed WIP.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13821/

@pwendell (Contributor) commented Apr 6, 2014:

LGTM

@pwendell (Contributor) commented Apr 7, 2014:

Thanks a lot for looking at this Aaron, I've merged it.

asfgit closed this in 4106558 on Apr 7, 2014
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Formatting fix

This is a single-line change. The diff appears larger here due to github being out of sync.
(cherry picked from commit 10c3c0c)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Previously, we decided whether to include the datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical and previously supported pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)

Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
- Spark SQL documentation updates.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#237 from aarondav/master and squashes the following commits:

5dc4329 [Aaron Davidson] Typo fixes
dd4f298 [Aaron Davidson] Doc update
dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
davies pushed a commit to davies/spark that referenced this pull request Apr 14, 2015
[SPARKR-154] Phase 2: implement cartesian().
asfgit pushed a commit that referenced this pull request Apr 17, 2015
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
* MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* use environment based secrets for S3Job

* add documentation, remove old secrets test
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Use global env to expose all jobs env vars
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
* MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector
imback82 added a commit to imback82/spark-4 that referenced this pull request Feb 8, 2021
…laUDAF and ScalaAggregator

### What changes were proposed in this pull request?

This PR proposes to propagate the name used for registering UDFs to `ScalaUDF`, `ScalaUDAF` and `ScalaAggregator`.

Note that `PythonUDF` gets the name correctly (https://github.com/apache/spark/blob/466c045bfac20b6ce19f5a3732e76a5de4eb4e4a/python/pyspark/sql/udf.py#L358-L359), and the same holds for Hive UDFs (https://github.com/apache/spark/blob/466c045bfac20b6ce19f5a3732e76a5de4eb4e4a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala#L67).

### Why are the changes needed?

This PR can help in the following scenarios:
1) Better EXPLAIN output
2) By adding `def name: String` to `UserDefinedExpression`, we can match an expression by `UserDefinedExpression` and look up the catalog, a use case needed for apache#31273.

### Does this PR introduce _any_ user-facing change?

The EXPLAIN output involving udfs will be changed to use the name used for UDF registration.

For example, for the following:
```
sql("CREATE TEMPORARY FUNCTION test_udf AS 'org.apache.spark.examples.sql.Spark33084'")
sql("SELECT test_udf(col1) FROM VALUES (1), (2), (3)").explain(true)
```
The output of the optimized plan will change from:
```
Aggregate [spark33084(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330846906be0f, 1, 1) AS spark33084(col1)apache#237]
+- LocalRelation [col1#223]
```
to
```
Aggregate [test_udf(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330847a62d697, 1, 1, Some(test_udf)) AS test_udf(col1)apache#237]
+- LocalRelation [col1#223]
```

### How was this patch tested?

Added new tests.

Closes apache#31500 from imback82/udaf_name.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>