added contigName Hive style partitioning to AlignmentRecordRDD #17

Open. jpdna wants to merge 1 commit into base: issues/1018-dataset-api

@jpdna jpdna commented Jul 10, 2017

I will move this PR to bdgenomics once 1018 is merged.
With this PR, the Parquet output directory contains one subdirectory per chromosome (contigName), like:

_SUCCESS
_common_metadata
_metadata
_rgdict.avro
_seqdict.avro
contigName=1
    -> part-r-00000-f872ea82-3036-455a-a35d-d043ec386db4.gz.parquet
          -> (in the future there will be another layer before the Parquet files:
              posBin=10000, posBin=20000, ...)

Later, we will either add a posBin column to the Avro schema or figure out how to let that column exist in the Parquet/Dataset representation while dropping it from Avro. Either way, this adds another layer of directory hierarchy under the 'contigName=N' directories, binning start positions into 10000 bp bins (or some other optimal size).
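To make the binning idea concrete, here is a minimal sketch of how a start position could map to a posBin value and a Hive-style directory. It is not part of this PR: the `PosBin` object, its method names, and the choice to label each bin by its lower bound are assumptions made for illustration.

```scala
// Hypothetical helper, not part of this PR.
object PosBin {
  // Assumption: a bin is labelled by its lower bound, so with binSize = 10000
  // start positions 10000..19999 all map to posBin=10000.
  def posBin(start: Long, binSize: Long = 10000L): Long = {
    require(start >= 0 && binSize > 0, "start must be >= 0 and binSize > 0")
    (start / binSize) * binSize
  }

  // Relative Hive-style partition directory for a record.
  def partitionDir(contigName: String, start: Long, binSize: Long = 10000L): String =
    s"contigName=$contigName/posBin=${posBin(start, binSize)}"
}
```

For example, `PosBin.partitionDir("1", 25000L)` would place a read starting at position 25000 on contig 1 under `contigName=1/posBin=20000`.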

As discussed in bigdatagenomics#651, such binning should allow more efficient predicate pushdown for range queries than we may currently get from Parquet.
I'm hoping this strategy is compatible with, and complementary to, the sorted partition mapping system.
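As a rough sketch of why binning helps range queries: a query over a region only needs the posBin directories that the region touches, and the rest can be skipped without opening any files. The `BinPruning` object below is hypothetical and not part of this PR. It assumes bins are labelled by their lower bound and that the query interval is half-open. Because bins are keyed on start position, a read that starts in an earlier bin can still overlap the query, so the sketch takes an assumed `maxReadLength` padding parameter.

```scala
// Hypothetical helper, not part of this PR.
object BinPruning {
  // Returns the lower-bound labels of every bin that may hold reads overlapping
  // the half-open query interval [start, end).
  // maxReadLength (an assumption) widens the search to the left so that reads
  // starting before `start` but extending into the interval are not missed.
  def binsForRange(
    start: Long,
    end: Long,
    binSize: Long = 10000L,
    maxReadLength: Long = 0L): Seq[Long] = {
    require(start >= 0 && end > start, "need 0 <= start < end")
    require(binSize > 0 && maxReadLength >= 0, "need binSize > 0 and maxReadLength >= 0")
    val paddedStart = math.max(0L, start - maxReadLength)
    val firstBin = (paddedStart / binSize) * binSize
    val lastBin = ((end - 1) / binSize) * binSize
    (firstBin to lastBin by binSize).toList
  }
}
```

With 10000 bp bins, a query for positions 15000 to 32000 on one contig would read three bin directories (10000, 20000, 30000) rather than every file under that contig.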

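The code here can be tested in the shell with:

import org.bdgenomics.adam.rdd.ADAMContext._
// Assumption: AlignmentRecordProduct is ADAM's alias for its SQL product class
import org.bdgenomics.adam.sql.{ AlignmentRecord => AlignmentRecordProduct }

// Load a small test SAM file
val rdd = sc.loadAlignments("../adam/adam-core/src/test/resources/small.sam")

// Keep only the reads on contig 1, going through the Dataset API
val x = rdd.transformDataset(ds => {
  import ds.sqlContext.implicits._
  val df = ds.toDF()
  df.where(df("contigName") === "1")
    .as[AlignmentRecordProduct]
})

// Write Parquet; the output is split into contigName=N directories
x.saveAsParquet("test_chr_partitioned_parquet")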
