
[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 #18640

Closed
wants to merge 1 commit into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 14, 2017

What changes were proposed in this pull request?

As with Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. Apache ORC 1.4 brings the following key benefits.

  • Stability: Apache ORC 1.4.0 has many fixes, so we can rely more on the ORC community.
  • Maintainability: Reduces the Hive dependency and allows old legacy code to be removed later.

Later, by adding the new ORCFileFormat in SPARK-20728 (#17980), we can also gain the following two key benefits.

  • Usability: Users can read ORC data sources without the hive module, i.e., without the -Phive profile.
  • Speed: Use Spark ColumnarBatch and ORC RowBatch together, which will be faster than the current implementation in Spark.

How was this patch tested?

Passes the Jenkins tests.

@dongjoon-hyun
Member Author

This aims to reduce the review scope for #17980 .
cc @kiszk .

@SparkQA

SparkQA commented Jul 15, 2017

Test build #79627 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin , @srowen , @sameeragarwal , @cloud-fan , @hvanhovell , @gatorsmile , @ueshin , @viirya , @kiszk .

Could you review this small PR about the dependency change?

This is the start of an upgrade to Apache ORC, intended to reduce the old Hive dependency in Apache Spark 2.3 for the following issues.

  • SPARK-20901 Feature parity for ORC with Parquet
  • SPARK-20682 Support a new faster ORC data source based on Apache ORC
  • SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
  • SPARK-16060 Vectorized Orc Reader

I've heard from @sameeragarwal that Apache Spark will not drop the ORC data source. If so, could we move forward with a small step like this?

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 26, 2017

Test build #79951 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 30, 2017

Test build #80055 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80221 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @liancheng , @zhzhan , @rxin , @marmbrus .
I'm pinging you since you worked on #6194 before.

@rxin
Contributor

rxin commented Aug 4, 2017

Why are we adding this to core? Why not just the hive module?

@kiszk
Member

kiszk commented Aug 4, 2017

Can we add a small amount of new code that uses this, too? It may help show which modules are relevant to this addition.

@dongjoon-hyun
Member Author

Thank you for review, @rxin .
We can now use ORC like Parquet. Parquet lives inside sql/core, not sql/hive.

@dongjoon-hyun
Member Author

Thank you for review, @kiszk .
The examples may be #17980 , #17924, and #17943 .
In this PR, if possible, I want to focus only on the ORC dependency issue.

@mridulm
Contributor

mridulm commented Aug 4, 2017

LGTM, great to see progress on ORC support.

@rxin
Contributor

rxin commented Aug 4, 2017

To the best of my knowledge almost everybody runs with Hive anyway and the vast majority of users that run ORC are Hive users. In hindsight we probably should have put most of the data source dependencies as separate packages similar to Presto.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 4, 2017

I agree with the following, but this does not block those users. Putting the dependency in sql/core is better than sql/hive because it also supports other users who use only ML and storage. In addition, when we refactor the data source dependencies, this will help keep the refactoring as clean as Parquet's.

To the best of my knowledge almost everybody runs with Hive anyway and the vast majority of users that run ORC are Hive users.

@rxin
Contributor

rxin commented Aug 4, 2017

Why don't we then create a separate orc module? Just copy a few of the files over?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 4, 2017

Until now, I have considered ORC the same as most other data sources (CSV, JDBC, JSON, Parquet, text), which currently live inside sql/core. If a separate module is the architectural plan for Apache Spark 2.3, I will follow it. Are we going to move all data sources into separate modules, e.g., datasources/parquet, in the Spark 2.3 timeframe?

Or, is there any other reason I didn't catch here?

@rxin
Contributor

rxin commented Aug 4, 2017

I just checked the dependency size. It looks pretty reasonable, roughly 2 MB in total (although I do worry whether ORC will bring in a lot more jars in the future).

cc @omalley any guidance on this topic?

@dongjoon-hyun
Member Author

Hi, @rxin

Since ORC 1.4.0, the ORC community has provided small shaded jars to improve usability for general-purpose use. This PR uses the following:

  • orc-core-1.4.0-nohive.jar (1.4MB)
  • orc-mapreduce-1.4.0-nohive.jar (739KB)

The size is due to shading in the following:

  • com.google.protobuf:protobuf-java
  • org.apache.hive:hive-storage-api

In terms of the number of files:

  • ORC (354 files)
  • ProtoBuf (247 files)
  • Hive Storage API (92 files)

The bottom line is that some source code still originally comes from the org.apache.hive namespace. So I'm wondering if that is why you still want to put this into the sql/hive module and copy source files instead of using the shaded jar?

@omalley
Contributor

omalley commented Aug 7, 2017

@rxin The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies.

@dongjoon-hyun
Member Author

Thank you, @omalley .

@rxin . I think we had better depend on the Apache ORC libraries as-is, as in this PR.

@dongjoon-hyun
Member Author

@rxin . How can I proceed with this PR now? Could you give me some advice again?

@dongjoon-hyun
Member Author

Thank you again for coming and reviewing this PR, @rxin , @kiszk , @mridulm , @omalley .
So far, we have discussed the following.

  1. Why are we adding this to core? Why not just the hive module? (@rxin)

    • The sql/core module gives more benefit than sql/hive.
    • The Apache ORC library (nohive version) is a general and reasonably small library designed for non-Hive apps.
  2. Can we add smaller amount of new code to use this, too? (@kiszk)

  3. Why don't we then create a separate orc module? Just copy a few of the files over? (@rxin)

    • The Apache ORC library is the same as most other data sources (CSV, JDBC, JSON, Parquet, text), which live inside sql/core.
    • It's better to use it as a library instead of copying ORC files, because the Apache ORC shaded jar has many files. We had better rely on the Apache ORC community's effort until an unavoidable reason for copying arises.
  4. I do worry in the future whether ORC would bring in a lot more jars (@rxin)

    • The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies. (@omalley)

I tried to capture and summarize all the advice here, but please let me know if I missed any concerns.

@omalley
Contributor

omalley commented Aug 8, 2017

I would also comment that, in the long term, Spark should move to using the vectorized reader in ORC's core. That would remove the dependence on ORC's mapreduce module, which provides row-by-row shims on top of the vectorized reader.

@dongjoon-hyun
Member Author

Sure. Thank you so much, @omalley !

@dongjoon-hyun
Member Author

@rxin . Could you make a decision on this PR? Do we still need to put this into sql/hive for some reason?

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80466 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member

LGTM; unless @rxin still has some strong objections?

@dongjoon-hyun
Member Author

Thank you so much, @sameeragarwal .

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 13, 2017

Hi, @mridulm, @sameeragarwal , and @rxin .
Please let me know if there is something for me to do here. Thanks!

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 13, 2017

Test build #80576 has finished for PR 18640 at commit 0f29656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 14, 2017

Hi, @sameeragarwal and @mridulm .
I cannot see any clear reason for objection here. Also, there is more positive feedback from @ash211 on the dev@spark mailing list. This PR will definitely bring an improvement. Could you merge this PR so Apache Spark can move forward?

</exclusion>
<exclusion>
<groupId>org.apache.hive</groupId>
<artifactId>hive-storage-api</artifactId>
Contributor

so the orc-core module still contains Hive-related stuff?

Contributor

and to confirm: this exclusion is safe only if we don't use ORC's hive-storage-api in sql/core, right?

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2017

Thank you so much for review, @cloud-fan .

  • The original orc-core-1.4.0.jar has a hive-storage-api dependency. (Maven Repo)
  • orc-core-1.4.0-nohive.jar is a shaded jar that includes hive-storage-api under the org.apache.orc namespace.

orc-core-1.4.0-nohive.jar is designed for users and apps that don't want to depend on (or consider) Hive; nohive is the classifier for this purpose.

This PR uses orc-core-1.4.0-nohive only. To avoid Maven confusion, this exclusion makes that explicit by removing the hive-storage-api dependency from the orc-core artifact.
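For reference, a minimal sketch of how a dependency combining both the classifier and the exclusion looks in a Maven POM (the coordinates match this PR; the property names follow the style used elsewhere in Spark's pom.xml):

```xml
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>${orc.version}</version>
  <classifier>${orc.classifier}</classifier>
  <exclusions>
    <!-- The nohive shaded jar already bundles hive-storage-api under
         the org.apache.orc namespace, so the transitive Hive artifact
         is excluded explicitly. -->
    <exclusion>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-storage-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```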

@@ -87,6 +87,16 @@
</dependency>

<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-core</artifactId>
<classifier>${orc.classifier}</classifier>
Contributor

sorry, a dumb question: what does classifier mean? I don't see it in the rest of this pom file

Contributor

what exactly is the storage api? I'm confused about this too ...

Member Author

In Maven, a classifier distinguishes artifacts that were built from the same POM but differ in content. Here, nohive is used, and it refers to orc-core-1.4.0-nohive.jar instead of orc-core-1.4.0.jar.

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2017

Thank you, @rxin !
Currently, the hive-storage-api jar contains classes such as SearchArgument and PredicateLeaf. Apache ORC is trying to become an independent module like Parquet.

Contributor

ok, good to learn the classifier stuff... Does it work in SBT too?

Member Author

Yes. sbt understands classifiers in POM files, and we can also use classifiers in an sbt build file.
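As a rough sketch of the equivalent declarations in an sbt build file (coordinates taken from this PR; Spark's actual build wiring differs):

```scala
// build.sbt: pull the nohive-classified ORC artifacts explicitly,
// excluding the transitive hive-storage-api that the shaded jar
// already bundles under org.apache.orc.
libraryDependencies ++= Seq(
  ("org.apache.orc" % "orc-core" % "1.4.0" classifier "nohive")
    .exclude("org.apache.hive", "hive-storage-api"),
  "org.apache.orc" % "orc-mapreduce" % "1.4.0" classifier "nohive"
)
```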

Contributor

@rxin Storage-API is a separately released artifact from the Hive project. Basically, Storage-API is the in-memory format for Hive's vectorization. You could draw the analogy that Storage-API is to Hive what Arrow is to Drill: it allows formats to read and write directly in the format needed by the execution engine.

With the nohive classifier, ORC shades the storage-api jar into the ORC namespace so that it is compatible with any version of Hive.

Member Author

Thank you, @omalley !

@cloud-fan
Contributor

LGTM besides some minor questions, @rxin any more comments on this?

<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<version>${orc.version}</version>
<classifier>${orc.classifier}</classifier>
Member

@viirya viirya Aug 15, 2017

Is the classifier for orc-mapreduce for the same purpose as orc-core's?

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2017

Thank you for the review, @viirya .
Yes, it serves the same purpose. There are orc-mapreduce-1.4.0.jar and orc-mapreduce-1.4.0-nohive.jar.

Member

Thanks. I think they come from https://issues.apache.org/jira/browse/ORC-174.

Member Author

Right. The wording is a little different, but technically those jars come from that JIRA patch.

@viirya
Member

viirya commented Aug 15, 2017

LGTM

@dongjoon-hyun
Member Author

Thank you again, @viirya .

@dongjoon-hyun
Member Author

Hi, @cloud-fan , @rxin , @sameeragarwal and @mridulm .
Could you merge this PR?

@rxin
Contributor

rxin commented Aug 16, 2017

lgtm

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 16, 2017

Thank you so much, @rxin , @cloud-fan , @sameeragarwal , @omalley , @mridulm , @viirya !

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 8c54f1e Aug 16, 2017
@dongjoon-hyun
Member Author

Thank you, @gatorsmile !!!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21422 branch August 16, 2017 06:16