added contigName Hive style partitioning to AlignmentRecordRDD #17
Will move this PR to bdgenomics once 1018 is merged.
Given this PR, the Parquet directory is laid out with a directory per chromosome (contigName) like
Later, we will either add a posBin column to Avro or figure out how to allow that column to exist in Parquet/Dataset but be dropped from Avro. Either way, this will add another layer of directory hierarchy under the 'contigName=N' dirs that bins the start position into 10000 bp bins (or some other optimal size).
As per the discussion in bigdatagenomics#651, such binning should allow more efficient predicate pushdown of range queries than we may currently get from Parquet.
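To make the binning idea concrete, here is a rough sketch in plain Python (not ADAM or Spark code) of how a start position could map to a bin, and which partition directories a range query would need to touch. The `posBin=K` directory name, the helper names, and the half-open `[start, end)` convention are assumptions for illustration only; this PR defines just the `contigName=N` level.

```python
BIN_SIZE = 10_000  # bp per bin; the description leaves the optimal size open


def pos_bin(start: int, bin_size: int = BIN_SIZE) -> int:
    """Bin index for a 0-based start position (integer division)."""
    return start // bin_size


def partition_dir(contig_name: str, start: int, bin_size: int = BIN_SIZE) -> str:
    """Hive-style relative path; the 'posBin' level is hypothetical."""
    return f"contigName={contig_name}/posBin={pos_bin(start, bin_size)}"


def dirs_for_query(contig_name: str, start: int, end: int,
                   bin_size: int = BIN_SIZE) -> list[str]:
    """Directories whose bins overlap the half-open range [start, end).

    Caveat: records are binned by start position, so a read starting in
    the preceding bin can still extend into the range. A real query may
    need to widen the lower bound; this sketch does not.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    first, last = pos_bin(start, bin_size), pos_bin(end - 1, bin_size)
    return [f"contigName={contig_name}/posBin={b}" for b in range(first, last + 1)]
```

With something like this, a query for positions 10000 to 20000 on contig 1 would only need to read one bin directory rather than every file under `contigName=1`, which is the pruning benefit the binning is meant to provide.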
I'm hoping this strategy is compatible and complementary with the sorted partition mapping system.
The code here can be tested in shell with