Sub-partitioning of Parquet file for ADAM #1003

Closed
jpdna opened this issue Apr 18, 2016 · 3 comments

Comments

@jpdna
Member

jpdna commented Apr 18, 2016

The Spark SQL programming guide describes an optimization for Parquet usage that involves splitting a Parquet file into directories corresponding to different column values.

This issue is meant as a place for discussion of this topic and to determine if we should prototype such a parquet directory layout, for example dividing the parquet file into individual files per chromosome.

I look forward to any comments and/or links to earlier discussions of this topic.
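
To make the proposed layout concrete, here is a minimal sketch of the `column=value` directory naming that Spark's partition discovery recognizes. It only models the paths and does not read or write Parquet. The `reads.adam` root, the `contigName` column, and the helper names are assumptions for illustration, not ADAM's actual API.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_dir(root, column, value):
    """Return the Hive-style directory for one partition value."""
    return PurePosixPath(root) / f"{column}={value}"


def group_by_partition(records, column):
    """Group records by the value of the partition column."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[column]].append(rec)
    return dict(groups)


def parse_partition(path):
    """Recover partition values from the column=value path components."""
    found = {}
    for component in PurePosixPath(path).parts:
        if "=" in component:
            key, _, value = component.partition("=")
            found[key] = value
    return found


# Hypothetical records; "contigName" stands in for the chromosome column.
reads = [
    {"contigName": "chr1", "start": 100},
    {"contigName": "chr2", "start": 50},
    {"contigName": "chr1", "start": 900},
]

# Map each partition directory to the records that would be stored under it.
layout = {
    str(partition_dir("reads.adam", "contigName", contig)): recs
    for contig, recs in group_by_partition(reads, "contigName").items()
}
```

In Spark itself, `DataFrameWriter.partitionBy` produces this kind of layout when writing Parquet, so a query that filters on the partition column only needs to read the matching directories.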

@fnothaft
Member

fnothaft commented Jul 6, 2016

Closing as dupe of #651.

@heuermh
Member

heuermh commented Jul 21, 2016

@tomwhite
Member

@jpdna @heuermh that's pretty old now - I think using Spark to do the partitioning is the way forward, and Impala supports nested types so flattening is not necessary. See #651 (comment)


4 participants