Sub-partitioning of Parquet file for ADAM #1003

jpdna · 2016-04-18T12:15:07Z

The Spark SQL programming guide describes an optimization of Parquet usage that involves splitting a Parquet file into directories corresponding to different column values.

This issue is meant as a place to discuss this topic and to determine whether we should prototype such a Parquet directory layout, for example by dividing the Parquet file into individual files per chromosome.

I look forward to any comments and/or links to earlier discussions of this topic.
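
To make the proposed layout concrete, here is a minimal plain-Python sketch of a per-chromosome directory scheme in the `column=value` style that Spark's `DataFrameWriter.partitionBy` produces. It only models the path naming and the resulting directory pruning; it does not write Parquet. The field name `contigName`, the root `reads.adam`, the sample records, and the helper functions are all assumptions made for illustration, not ADAM's actual schema or API.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical records; the field name "contigName" is an assumption for illustration.
records = [
    {"contigName": "chr1", "start": 100, "end": 200},
    {"contigName": "chr1", "start": 5000, "end": 5100},
    {"contigName": "chr2", "start": 42, "end": 142},
    {"contigName": "chrX", "start": 7, "end": 107},
]


def partition_dir(root, column, value):
    """Return the column=value directory for one partition, e.g. root/contigName=chr1."""
    return PurePosixPath(root) / f"{column}={value}"


def layout(records, root, column):
    """Group records by partition column, keyed by the directory they would land in."""
    dirs = defaultdict(list)
    for rec in records:
        # The partition value is encoded in the path, so it is dropped from the stored row.
        row = {k: v for k, v in rec.items() if k != column}
        dirs[str(partition_dir(root, column, rec[column]))].append(row)
    return dict(dirs)


def prune(dirs, column, wanted):
    """Keep only directories whose encoded value matches; the rest are never opened."""
    suffix = f"{column}={wanted}"
    return {d: rows for d, rows in dirs.items() if PurePosixPath(d).name == suffix}


dirs = layout(records, "reads.adam", "contigName")
chr1_only = prune(dirs, "contigName", "chr1")
```

A query restricted to one chromosome would then only need to read the matching directory, which is the benefit the Spark SQL guide attributes to this kind of layout. Whether per-chromosome granularity is the right choice for ADAM is exactly the open question of this issue.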

fnothaft · 2016-07-06T16:02:32Z

Closing as dupe of #651.

heuermh · 2016-07-21T03:40:22Z

@jpdna Take a look at https://github.com/tomwhite/genomics-analytics/blob/master/adam.md

tomwhite · 2016-07-22T09:42:00Z

@jpdna @heuermh that's pretty old now - I think using Spark to do the partitioning is the way forward, and Impala supports nested types so flattening is not necessary. See #651 (comment)

fnothaft closed this as completed Jul 6, 2016

fnothaft added the duplicate label Jul 6, 2016
