[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 #18640

Closed
6 changes: 6 additions & 0 deletions assembly/pom.xml
@@ -220,6 +220,12 @@
<hive.deps.scope>provided</hive.deps.scope>
</properties>
</profile>
<profile>
<id>orc-provided</id>
<properties>
<orc.deps.scope>provided</orc.deps.scope>
</properties>
</profile>
<profile>
<id>parquet-provided</id>
<properties>
3 changes: 3 additions & 0 deletions dev/deps/spark-deps-hadoop-2.6
@@ -2,6 +2,7 @@ JavaEWAH-0.3.2.jar
RoaringBitmap-0.5.11.jar
ST4-4.0.4.jar
activation-1.1.1.jar
aircompressor-0.3.jar
antlr-2.7.7.jar
antlr-runtime-3.4.jar
antlr4-runtime-4.5.3.jar
@@ -148,6 +149,8 @@ netty-3.9.9.Final.jar
netty-all-4.0.43.Final.jar
objenesis-2.1.jar
opencsv-2.3.jar
orc-core-1.4.0-nohive.jar
orc-mapreduce-1.4.0-nohive.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
3 changes: 3 additions & 0 deletions dev/deps/spark-deps-hadoop-2.7
@@ -2,6 +2,7 @@ JavaEWAH-0.3.2.jar
RoaringBitmap-0.5.11.jar
ST4-4.0.4.jar
activation-1.1.1.jar
aircompressor-0.3.jar
antlr-2.7.7.jar
antlr-runtime-3.4.jar
antlr4-runtime-4.5.3.jar
@@ -149,6 +150,8 @@ netty-3.9.9.Final.jar
netty-all-4.0.43.Final.jar
objenesis-2.1.jar
opencsv-2.3.jar
orc-core-1.4.0-nohive.jar
orc-mapreduce-1.4.0-nohive.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
44 changes: 44 additions & 0 deletions pom.xml
@@ -132,6 +132,8 @@
<hive.version.short>1.2.1</hive.version.short>
<derby.version>10.12.1.1</derby.version>
<parquet.version>1.8.2</parquet.version>
<orc.version>1.4.0</orc.version>
<orc.classifier>nohive</orc.classifier>
<hive.parquet.version>1.6.0</hive.parquet.version>
<jetty.version>9.3.20.v20170531</jetty.version>
<javaxservlet.version>3.1.0</javaxservlet.version>
@@ -207,6 +209,7 @@
<flume.deps.scope>compile</flume.deps.scope>
<hadoop.deps.scope>compile</hadoop.deps.scope>
<hive.deps.scope>compile</hive.deps.scope>
<orc.deps.scope>compile</orc.deps.scope>
<parquet.deps.scope>compile</parquet.deps.scope>
<parquet.test.deps.scope>test</parquet.test.deps.scope>

@@ -1677,6 +1680,44 @@
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
<version>${orc.version}</version>
<classifier>${orc.classifier}</classifier>
<scope>${orc.deps.scope}</scope>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
Contributor

So the orc-core module still contains Hive-related code?

Contributor

And to confirm, this exclusion is safe only if we don't use ORC's Hive storage API in sql/core, right?

Member Author
@dongjoon-hyun, Aug 15, 2017

Thank you so much for the review, @cloud-fan.

  • The original orc-core-1.4.0.jar has a hive-storage-api dependency. (Maven Repo)
  • orc-core-1.4.0-nohive.jar is a shaded jar file that includes hive-storage-api under the org.apache.orc namespace.

orc-core-1.4.0-nohive.jar is designed for users and apps that don't want to depend on (or consider) Hive. nohive is a classifier for this purpose.

This PR uses orc-core-1.4.0-nohive only. To avoid any Maven confusion, this exclusion explicitly removes the hive-storage-api dependency from the orc-core artifact.

</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<version>${orc.version}</version>
<classifier>${orc.classifier}</classifier>
Member
@viirya, Aug 15, 2017

Does the classifier for orc-mapreduce serve the same purpose as the one for orc-core?

Member Author
@dongjoon-hyun, Aug 15, 2017

Thank you for the review, @viirya.
Yes, it serves the same purpose. There are both orc-mapreduce-1.4.0.jar and orc-mapreduce-1.4.0-nohive.jar.

Member

Thanks. I think they come from https://issues.apache.org/jira/browse/ORC-174.

Member Author

Right. The wording is a little bit different, but technically those jars come from that JIRA patch.

<scope>${orc.deps.scope}</scope>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-column</artifactId>
@@ -2710,6 +2751,9 @@
<profile>
<id>hive-provided</id>
</profile>
<profile>
<id>orc-provided</id>
</profile>
<profile>
<id>parquet-provided</id>
</profile>
10 changes: 10 additions & 0 deletions sql/core/pom.xml
@@ -86,6 +86,16 @@
<scope>test</scope>
</dependency>

<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
<classifier>${orc.classifier}</classifier>
Contributor

Sorry, a dumb question: what does classifier mean? I don't see it in the rest of this pom file.

Contributor

What exactly is the storage API? I'm confused about this too.

Member Author

In Maven, a classifier distinguishes artifacts that were built from the same POM but differ in their content. Here, nohive is used, so the dependency refers to orc-core-1.4.0-nohive.jar instead of orc-core-1.4.0.jar.
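
As a rough illustration of the point above, here is a minimal dependency declaration, assuming the 1.4.0 version used in this PR is written inline rather than through the `${orc.version}` and `${orc.classifier}` properties. It is a sketch for a hypothetical downstream POM, not part of this patch.

```xml
<!-- Sketch only: a hypothetical downstream POM, not taken from this PR. -->
<dependencies>
  <dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-core</artifactId>
    <version>1.4.0</version>
    <!-- With this element, Maven resolves orc-core-1.4.0-nohive.jar. -->
    <!-- Without it, the same coordinates resolve to orc-core-1.4.0.jar. -->
    <classifier>nohive</classifier>
  </dependency>
</dependencies>
```

The classifier becomes a suffix of the jar file name, after the version, which is why the dev/deps manifests in this PR list orc-core-1.4.0-nohive.jar.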

Member Author
@dongjoon-hyun, Aug 15, 2017

Thank you, @rxin!
Currently, the hive-storage-api jar contains classes such as SearchArgument and PredicateLeaf. Apache ORC is trying to become an independent module like Parquet.

Contributor

OK, good to learn about classifiers. Does it work in SBT too?

Member Author

Yes. sbt understands the classifier in a pom file. We can also use a classifier in an sbt build file.

Contributor

@rxin Storage-API is a separately released artifact from the Hive project. Basically, Storage-API is the in-memory format for Hive's vectorization. You could draw the analogy that Storage-API is for Hive what Arrow is for Drill. It allows formats to read and write directly in the format that is needed by the execution engine.

With the nohive classifier, ORC shades the storage-api jar into the ORC namespace so that it is compatible with any version of Hive.
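
The shading described above might be configured roughly as follows with maven-shade-plugin. This is a hedged sketch, not ORC's actual build file: the source package `org.apache.hadoop.hive` and the relocation target `org.apache.orc.storage` are assumptions made for illustration, and the plugin version is assumed to come from a parent POM.

```xml
<!-- Hypothetical sketch of producing a "nohive" jar; not copied from ORC's build. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Attach the shaded jar next to the regular one, under the "nohive" classifier. -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>nohive</shadedClassifierName>
        <artifactSet>
          <includes>
            <!-- Of the dependencies, bundle only hive-storage-api. -->
            <include>org.apache.hive:hive-storage-api</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- Assumed package names: move the Hive classes under an ORC-owned namespace. -->
            <pattern>org.apache.hadoop.hive</pattern>
            <shadedPattern>org.apache.orc.storage</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Because the relocated classes no longer live under Hive's package names, they should not collide with whichever Hive version is already on the classpath, which is the compatibility property the comment describes.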

Member Author

Thank you, @omalley!

</dependency>
<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<classifier>${orc.classifier}</classifier>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-column</artifactId>