added contigName Hive style partitioning to AlignmentRecordRDD #17

jpdna · 2017-07-10T23:25:18Z

I will move this PR to bdgenomics once 1018 is merged.
With this PR, the Parquet output directory contains one subdirectory per chromosome (contigName), laid out like:

_SUCCESS
_common_metadata
_metadata
_rgdict.avro
_seqdict.avro
contigName=1
    -> part-r-00000-f872ea82-3036-455a-a35d-d043ec386db4.gz.parquet
    -> (in the future there will be another layer above the Parquet files:)
         posBin=10000, posBin=20000, ...

Later, we will either add a posBin column to the Avro schema or figure out how to let that column exist in the Parquet/Dataset schema while dropping it from Avro. Either way, this adds another layer of directory hierarchy under the 'contigName=N' dirs that bins start positions into 10000 bp bins (or some other optimal size).
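As a rough illustration of the planned binning, here is a minimal sketch of how a start position could map to a posBin value and a Hive-style directory. The object and method names are hypothetical, and flooring to the lower bin edge is an assumption; neither is part of this PR.

```scala
// Hypothetical helper, not part of this PR: names and conventions are illustrative.
object PosBin {
  // Bin width in base pairs; 10000 matches the example above, the optimal size is still open.
  val DefaultBinSize: Long = 10000L

  // Floor a 0-based start position to the lower edge of its bin (assumed convention).
  def posBin(start: Long, binSize: Long = DefaultBinSize): Long = {
    require(start >= 0, "start must be non-negative")
    require(binSize > 0, "binSize must be positive")
    (start / binSize) * binSize
  }

  // Relative Hive-style partition directory for a record.
  def partitionDir(contigName: String, start: Long, binSize: Long = DefaultBinSize): String =
    s"contigName=$contigName/posBin=${posBin(start, binSize)}"
}
```

If a column computed this way were present on the DataFrame, Spark's `DataFrameWriter.partitionBy("contigName", "posBin")` would produce the nested directory layout; whether that column should also live in the Avro schema is the open question described above.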

As per the discussion in bigdatagenomics#651,
such binning should allow more efficient predicate pushdown for range queries than we may currently get from Parquet.
I'm hoping this strategy is compatible with, and complementary to, the sorted partition mapping system.

The code in this PR can be tested in the shell with:

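import org.bdgenomics.adam.rdd.ADAMContext._
// Assumed import: AlignmentRecordProduct is the alias ADAM uses internally for its Spark SQL case class.
import org.bdgenomics.adam.sql.{ AlignmentRecord => AlignmentRecordProduct }

val rdd = sc.loadAlignments("../adam/adam-core/src/test/resources/small.sam")

// Keep only reads aligned to contig "1".
val x = rdd.transformDataset(ds => {
  import ds.sqlContext.implicits._
  val df = ds.toDF()
  df.where(df("contigName") === "1")
    .as[AlignmentRecordProduct]
})

// Write the result as Parquet, partitioned by contigName.
x.saveAsParquet("test_chr_partitioned_parquet")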
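To make the range-query benefit discussed above more concrete, here is a small sketch of how a genomic interval could be translated into the set of posBin partitions a reader would need to touch. It is only an illustration under stated assumptions: the `RangeBins` object, the half-open interval convention, and the `flank` parameter are all hypothetical and not part of this PR.

```scala
// Hypothetical helper, not part of this PR.
object RangeBins {
  // Lower edges of the bins that the half-open interval [start, end) touches.
  // Because bins are keyed by read start position, a read that starts just before
  // `start` can still overlap the query; `flank` (e.g. a maximum read length)
  // widens the lower bound to cover that case.
  def overlappingBins(start: Long,
                      end: Long,
                      binSize: Long = 10000L,
                      flank: Long = 0L): List[Long] = {
    require(binSize > 0, "binSize must be positive")
    require(flank >= 0, "flank must be non-negative")
    require(start >= 0 && end > start, "need 0 <= start < end")
    val first = (math.max(0L, start - flank) / binSize) * binSize
    val last = ((end - 1) / binSize) * binSize
    (first to last by binSize).toList
  }
}
```

A filter such as `contigName = '1' AND posBin IN (...)` built from this list would let Spark skip every other partition directory before any Parquet file is opened, after which a finer filter on start and end positions can be applied to the surviving rows.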

added contigName Hive style partitioning to AlignmentRecord

8f3c904

jpdna mentioned this pull request Jul 10, 2017

Support Hive-style partitioning bigdatagenomics/adam#651

Closed
