
[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 #18640

Closed
wants to merge 1 commit into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 14, 2017

What changes were proposed in this pull request?

As with Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. Apache ORC 1.4 brings the following key benefits.

  • Stability: Apache ORC 1.4.0 has many fixes, so we can rely more on the ORC community.
  • Maintainability: Reduces the Hive dependency and allows old legacy code to be removed later.

Later, by adding the new ORCFileFormat in SPARK-20728 (#17980), we can also gain the following two key benefits.

  • Usability: Users can read ORC data sources without the hive module, i.e., without the -Phive profile.
  • Speed: Use Spark ColumnarBatch and ORC RowBatch together, which will be faster than the current implementation in Spark.

How was this patch tested?

Passes the Jenkins tests.

@dongjoon-hyun
Member Author

This aims to reduce the review scope for #17980 .
cc @kiszk .

@SparkQA

SparkQA commented Jul 15, 2017

Test build #79627 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin , @srowen , @sameeragarwal , @cloud-fan , @hvanhovell , @gatorsmile , @ueshin , @viirya , @kiszk .

Could you review this small PR about the dependency change?

This is the start of an upgrade to Apache ORC, intended to reduce the old Hive dependency in Apache Spark 2.3 for the following issues.

  • SPARK-20901 Feature parity for ORC with Parquet
  • SPARK-20682 Support a new faster ORC data source based on Apache ORC
  • SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
  • SPARK-16060 Vectorized Orc Reader

I've heard from @sameeragarwal that Apache Spark will not drop the ORC data source. If so, could we move forward with a small step like this?

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 26, 2017

Test build #79951 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 30, 2017

Test build #80055 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80221 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @liancheng , @zhzhan , @rxin , @marmbrus .
I'm pinging you since you worked on #6194 before.

@rxin
Contributor

rxin commented Aug 4, 2017

Why are we adding this to core? Why not just the hive module?

@kiszk
Member

kiszk commented Aug 4, 2017

Can we add a small amount of new code that uses this, too? It may help show which modules are relevant to this addition.

@dongjoon-hyun
Member Author

Thank you for review, @rxin .
We can now use ORC like Parquet. Parquet lives inside sql/core, not sql/hive.

@dongjoon-hyun
Member Author

Thank you for review, @kiszk .
The examples may be #17980 , #17924, and #17943 .
In this PR, if possible, I want to focus only on the ORC dependency issue.

@mridulm
Contributor

mridulm commented Aug 4, 2017

LGTM, great to see progress on ORC support.

@rxin
Contributor

rxin commented Aug 4, 2017

To the best of my knowledge almost everybody runs with Hive anyway and the vast majority of users that run ORC are Hive users. In hindsight we probably should have put most of the data source dependencies as separate packages similar to Presto.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 4, 2017

I agree with the following, but this does not block those users. Putting the dependency in sql/core is better than sql/hive because it also supports other users who use only ML and storage. In addition, when we refactor the data source dependencies, this will help keep the refactoring as clean as Parquet's.

To the best of my knowledge almost everybody runs with Hive anyway and the vast majority of users that run ORC are Hive users.

@rxin
Contributor

rxin commented Aug 4, 2017

Why don't we then create a separate orc module? Just copy a few of the files over?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 4, 2017

Until now, I have considered ORC the same as most other data sources (CSV, JDBC, JSON, Parquet, text), which currently live inside sql/core. If a separate module is the architectural plan for Apache Spark 2.3, I will follow it. Are we going to move all data sources into separate modules, e.g., datasources/parquet, in the Spark 2.3 timeframe?

Or, is there any other reason I didn't catch here?

@rxin
Contributor

rxin commented Aug 4, 2017

I just checked the dependency size. It looks pretty reasonable, roughly 2 MB in total (although I do worry whether ORC will bring in a lot more jars in the future).

cc @omalley any guidance on this topic?

@dongjoon-hyun
Member Author

Hi, @rxin

Since ORC 1.4.0, the ORC community has provided small shaded jars to improve usability for general-purpose use. This PR uses the following:

  • orc-core-1.4.0-nohive.jar (1.4MB)
  • orc-mapreduce-1.4.0-nohive.jar (739KB)

The size is due to shading in the following:

  • com.google.protobuf:protobuf-java
  • org.apache.hive:hive-storage-api

In terms of the number of files:

  • ORC (354 files)
  • ProtoBuf (247 files)
  • Hive Storage API (92 files)

The bottom line is that some source code still originally comes from the org.apache.hive namespace. So I'm wondering if that is why you still want to put this into the sql/hive module and copy source files instead of using the shaded jar?

@omalley
Contributor

omalley commented Aug 7, 2017

@rxin The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies.

@dongjoon-hyun
Member Author

Thank you, @omalley .

@rxin . I think we had better depend on the Apache ORC libraries as-is, as in this PR.

@dongjoon-hyun
Member Author

@rxin . How can I proceed with this PR now? Could you give me some advice again?

@dongjoon-hyun
Member Author

Thank you again for coming and reviewing this PR, @rxin , @kiszk , @mridulm , @omalley .
So far, we have discussed the following.

  1. Why are we adding this to core? Why not just the hive module? (@rxin)

    • The sql/core module gives more benefit than sql/hive.
    • The Apache ORC library (nohive version) is a general and reasonably small library designed for non-Hive apps.
  2. Can we add smaller amount of new code to use this, too? (@kiszk)

  3. Why don't we then create a separate orc module? Just copy a few of the files over? (@rxin)

    • The Apache ORC library is the same as most other data sources (CSV, JDBC, JSON, Parquet, text), which live inside sql/core.
    • It's better to use it as a library instead of copying ORC files, because the Apache ORC shaded jar has many files. We had better rely on the Apache ORC community's effort until an unavoidable reason for copying arises.
  4. I do worry in the future whether ORC would bring in a lot more jars (@rxin)

    • The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies. (@omalley)

I tried to capture and summarize all the advice here, but please let me know if I missed any concerns.

@omalley
Contributor

omalley commented Aug 8, 2017

I would also comment that, in the long term, Spark should move to using the vectorized reader in ORC's core. That would remove the dependence on ORC's mapreduce module, which provides row-by-row shims on top of the vectorized reader.

@dongjoon-hyun
Member Author

Sure. Thank you so much, @omalley !

@dongjoon-hyun
Member Author

@rxin . Could you make a decision on this PR? Do we still need to put this into sql/hive for some reason?

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80466 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member

LGTM; unless @rxin still has some strong objections?

@dongjoon-hyun
Member Author

Thank you so much, @sameeragarwal .

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 13, 2017

Hi, @mridulm, @sameeragarwal , and @rxin .
Please let me know if there is something for me to do here. Thanks!

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 13, 2017

Test build #80576 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 14, 2017

Hi, @sameeragarwal and @mridulm .
I cannot see any clear reason for objection here. Also, there is more positive feedback from @ash211 on the dev@spark mailing list. This PR will definitely bring an improvement. Could you merge this PR so Apache Spark can move forward?

</exclusion>
<exclusion>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
Contributor

so the orc-core module still contains Hive-related stuff?

Contributor

and to confirm: this exclusion is safe only if we don't use ORC's hive-storage-api in sql/core, right?

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2017

Thank you so much for review, @cloud-fan .

  • The original orc-core-1.4.0.jar has a hive-storage-api dependency. (Maven Repo)
  • orc-core-1.4.0-nohive.jar is a shaded jar that includes hive-storage-api under the org.apache.orc namespace.

orc-core-1.4.0-nohive.jar is designed for users and apps that don't want to depend on (or consider) Hive; nohive is the classifier for this purpose.

This PR uses orc-core-1.4.0-nohive only. To avoid Maven confusion, this exclusion makes that explicit by removing the hive-storage-api dependency from the orc-core artifact.
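For reference, a minimal sketch of how a dependency combining both the classifier and the exclusion looks in a Maven POM (the coordinates match this PR; the property names follow the style used elsewhere in Spark's pom.xml):

```xml
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>${orc.version}</version>
  <classifier>${orc.classifier}</classifier>
  <exclusions>
    <!-- The nohive shaded jar already bundles hive-storage-api under
         the org.apache.orc namespace, so the transitive Hive artifact
         is excluded explicitly. -->
    <exclusion>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-storage-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```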

@@ -87,6 +87,16 @@
</dependency>

<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
<classifier>${orc.classifier}</classifier>
Contributor

sorry, a dumb question: what does classifier mean? I don't see it in the rest of this pom file

Contributor

what exactly is the storage api? I'm confused about this too ...

Member Author

In Maven, a classifier distinguishes artifacts that were built from the same POM but differ in content. Here, nohive is used, and it refers to orc-core-1.4.0-nohive.jar instead of orc-core-1.4.0.jar.

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2017

Thank you, @rxin !
Currently, the hive-storage-api jar contains classes such as SearchArgument and PredicateLeaf. Apache ORC is trying to become an independent module like Parquet.

Contributor

ok, good to learn the classifier stuff... Does it work in SBT too?

Member Author

Yes. sbt understands classifiers in POM files, and we can also use classifiers in an sbt build file.
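As a rough sketch of the equivalent declarations in an sbt build file (coordinates taken from this PR; Spark's actual build wiring differs):

```scala
// build.sbt: pull the nohive-classified ORC artifacts explicitly,
// excluding the transitive hive-storage-api that the shaded jar
// already bundles under org.apache.orc.
libraryDependencies ++= Seq(
  ("org.apache.orc" % "orc-core" % "1.4.0" classifier "nohive")
    .exclude("org.apache.hive", "hive-storage-api"),
  "org.apache.orc" % "orc-mapreduce" % "1.4.0" classifier "nohive"
)
```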

Contributor

@rxin Storage-API is a separately released artifact from the Hive project. Basically, Storage-API is the in-memory format for Hive's vectorization. You could draw the analogy that Storage-API is to Hive what Arrow is to Drill: it allows formats to read and write directly in the format needed by the execution engine.

With the nohive classifier, ORC shades the storage-api jar into the ORC namespace so that it is compatible with any version of Hive.

Member Author

Thank you, @omalley !

@cloud-fan
Contributor

LGTM besides some minor questions, @rxin any more comments on this?

<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<version>${orc.version}</version>
<classifier>${orc.classifier}</classifier>
Member

@viirya viirya Aug 15, 2017

Is the classifier for orc-mapreduce for the same purpose as orc-core's?

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2017

Thank you for the review, @viirya .
Yes, it serves the same purpose. There are orc-mapreduce-1.4.0.jar and orc-mapreduce-1.4.0-nohive.jar.

Member

Thanks. I think they come from https://issues.apache.org/jira/browse/ORC-174.

Member Author

Right. The wording is a little different, but technically those jars come from that JIRA patch.

@viirya
Member

viirya commented Aug 15, 2017

LGTM

@dongjoon-hyun
Member Author

Thank you again, @viirya .

@dongjoon-hyun
Member Author

Hi, @cloud-fan , @rxin , @sameeragarwal and @mridulm .
Could you merge this PR?

@rxin
Contributor

rxin commented Aug 16, 2017

lgtm

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 16, 2017

Thank you so much, @rxin , @cloud-fan , @sameeragarwal , @omalley , @mridulm , @viirya !

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 8c54f1e Aug 16, 2017
@dongjoon-hyun
Member Author

Thank you, @gatorsmile !!!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21422 branch August 16, 2017 06:16