[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 #18640
Test build #79627 has finished for PR 18640 at commit
|
Hi, @rxin , @srowen , @sameeragarwal , @cloud-fan , @hvanhovell , @gatorsmile , @ueshin , @viirya , @kiszk . Could you review this small PR about a dependency change? This is the start of an upgrade to Apache ORC in order to reduce the old Hive dependency in Apache Spark 2.3 for the following issues.
I've heard from @sameeragarwal that Apache Spark will not drop the ORC data source. If so, could we move forward with a small step like this? |
Retest this please. |
Test build #79951 has finished for PR 18640 at commit
|
Retest this please. |
Test build #80055 has finished for PR 18640 at commit
|
Retest this please |
Test build #80221 has finished for PR 18640 at commit
|
Hi, @liancheng , @zhzhan , @rxin , @marmbrus . |
Why are we adding this to core? Why not just the hive module? |
Can we add smaller amount of new code to use this, too? It may help show which modules are relevant to add this. |
Thank you for review, @rxin . |
LGTM, great to see progress on ORC support. |
To the best of my knowledge almost everybody runs with Hive anyway and the vast majority of users that run ORC are Hive users. In hindsight we probably should have put most of the data source dependencies as separate packages similar to Presto. |
I agree with the following, but this does not block those users. This is only better than putting the dependency on
|
Why don't we then create a separate orc module? Just copy a few of the files over? |
Until now, I think ORC is the same as most other data sources (CSV, JDBC, JSON, PARQUET, TEXT), which live inside Or, is there any other reason I didn't catch here? |
I just checked the dependency size. They look pretty reasonable, roughly 2 MBs in total (although I do worry in the future whether ORC would bring in a lot more jars). cc @omalley any guidance on this topic? |
Hi, @rxin . Since ORC 1.4.0, the ORC community provides small shaded jar files to improve general-purpose usability. This PR uses the following.
The size is due to including the following.
In terms of the number of files,
The bottom line is that some source code still comes from |
@rxin The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies. |
@rxin . How should I proceed with this PR now? Could you give me some advice again? |
Thank you again for coming and reviewing this PR, @rxin , @kiszk , @mridulm , @omalley .
I tried to capture and summarize all the advice here, but please let me know if I missed any concerns. |
I would also comment that in the long term, Spark should move to using the vectorized reader in ORC's core. That would remove the dependence on ORC's mapreduce module, which provides row by row shims on top of the vectorized reader. |
Sure. Thank you so much, @omalley ! |
@rxin . Could you make a decision on this PR? Do we need to put this into |
Retest this please. |
Test build #80466 has finished for PR 18640 at commit
|
LGTM; unless @rxin still has some strong objections? |
Thank you so much, @sameeragarwal . |
Hi, @mridulm, @sameeragarwal , and @rxin . |
Retest this please. |
Test build #80576 has finished for PR 18640 at commit
|
Hi, @sameeragarwal and @mridulm . |
</exclusion>
<exclusion>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
so the orc core module still contains hive related stuff?
and to confirm, this exclusion is safe only if we don't use hive storage api of orc in sql/core, right?
Thank you so much for review, @cloud-fan .
- The original orc-core-1.4.0.jar has a hive-storage-api dependency. (Maven Repo)
- orc-core-1.4.0-nohive.jar is a shaded jar file including hive-storage-api under the org.apache.orc namespace.
- orc-core-1.4.0-nohive.jar is designed for users and apps who don't want to depend on (or consider) hive. nohive is a classifier for this purpose.
This PR uses orc-core-1.4.0-nohive only. To avoid Maven confusion, this exclusion makes sure of that by explicitly removing the hive-storage-api dependency from the orc-core artifact.
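As a rough sketch of the setup described in this comment, a dependency on the nohive classifier with the explicit exclusion could look like the following. The coordinates, classifier, and excluded artifact come from this thread; the version and classifier are written inline here purely for illustration, whereas the diff under review uses `${orc.classifier}`.

```xml
<!-- Sketch only: literal values are inlined for readability. -->
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>1.4.0</version>
  <!-- Selects orc-core-1.4.0-nohive.jar instead of orc-core-1.4.0.jar -->
  <classifier>nohive</classifier>
  <exclusions>
    <!-- The nohive jar already bundles these classes under the ORC namespace,
         so the transitive hive-storage-api jar is dropped explicitly. -->
    <exclusion>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-storage-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

As noted earlier in this review, the exclusion is only safe as long as the consuming module does not reference the unshaded hive-storage-api classes directly.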
@@ -87,6 +87,16 @@
</dependency>

<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
<classifier>${orc.classifier}</classifier>
sorry a dumb question, what does classifier mean? I don't see it in the rest of this pom file
what exactly is the storage api? confused about this too ...
In Maven, classifier distinguishes artifacts that were built from the same POM but differ in their content. Here, nohive is used and it refers to orc-core-1.4.0-nohive.jar instead of orc-core-1.4.0.jar.
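To make the classifier lookup concrete, here is a hedged sketch of how the `${orc.version}` and `${orc.classifier}` placeholders seen in the diff might be defined and consumed. Where these properties are declared (for example, in a parent pom) is an assumption; only the values 1.4.0 and nohive come from this thread.

```xml
<project>
  <!-- Assumed location: a properties block visible to the module's pom. -->
  <properties>
    <orc.version>1.4.0</orc.version>
    <orc.classifier>nohive</orc.classifier>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.orc</groupId>
      <artifactId>orc-core</artifactId>
      <version>${orc.version}</version>
      <!-- Maven looks up artifactId-version-classifier.jar, which here is
           orc-core-1.4.0-nohive.jar. Without the classifier element it
           would resolve orc-core-1.4.0.jar. -->
      <classifier>${orc.classifier}</classifier>
    </dependency>
  </dependencies>
</project>
```

Keeping the classifier in a property means the variant can be switched in one place for every ORC artifact that references it.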
Thank you, @rxin !
Currently, the hive-storage-api jar has something like SearchArgument and PredicateLeaf. Apache ORC is trying to become an independent module like Parquet.
ok good to learn the classifier stuff... Does it work in SBT too?
Yes. sbt understands classifier in the pom file. Also, we can use classifier in an sbt build file, too.
@rxin Storage-API is a separately released artifact from the Hive project. Basically, Storage-API is the in-memory format for Hive's vectorization. You could draw the analogy that Storage-API is for Hive what Arrow is for Drill. It allows formats to read and write directly in the format that is needed by the execution engine.
With the nohive classifier, ORC shades the storage-api jar into the ORC namespace so that it is compatible with any version of Hive.
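For readers unfamiliar with shading, the following is a minimal sketch of how a build can produce such a classified, relocated jar with the maven-shade-plugin. This is not ORC's actual build configuration: the relocation `pattern` and `shadedPattern` package names are assumptions chosen for illustration, and only the nohive classifier and the hive-storage-api artifact come from this thread.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Attach the shaded jar as an extra artifact with its own classifier -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>nohive</shadedClassifierName>
        <artifactSet>
          <includes>
            <!-- Bundle the storage API classes into the shaded jar -->
            <include>org.apache.hive:hive-storage-api</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- Assumed package names, for illustration only -->
            <pattern>org.apache.hadoop.hive</pattern>
            <shadedPattern>org.apache.orc.storage</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Because the bundled classes are moved to a different package, they cannot clash with whatever Hive version is already on the classpath, which is the compatibility property described in this comment.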
Thank you, @omalley !
LGTM besides some minor questions, @rxin any more comments on this? |
<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<version>${orc.version}</version>
<classifier>${orc.classifier}</classifier>
Does the classifier for orc-mapreduce serve the same purpose as for orc-core?
Thank you for review, @viirya .
Yes, it serves the same purpose. There are orc-mapreduce-1.4.0.jar and orc-mapreduce-1.4.0-nohive.jar.
Thanks. I think they come from https://issues.apache.org/jira/browse/ORC-174.
Right. The wording is a little bit different, but technically those jars come from that JIRA patch.
LGTM |
Thank you again, @viirya . |
Hi, @cloud-fan , @rxin , @sameeragarwal and @mridulm . |
lgtm |
Thank you so much, @rxin , @cloud-fan , @sameeragarwal , @omalley , @mridulm , @viirya ! |
Thanks! Merging to master. |
Thank you, @gatorsmile !!! |
What changes were proposed in this pull request?
Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4.
Later, we can get the following two key benefits by adding new ORCFileFormat in SPARK-20728 (#17980), too.
How was this patch tested?
Pass the Jenkins tests.