
Parquet storage of VariantContext #1151

Closed · jpdna opened this issue Sep 6, 2016 · 4 comments

jpdna (Member) commented Sep 6, 2016

Within ADAM we can represent genotypes in a "variant-major" layout in VariantContextRDD, where each row of the RDD is an array of Genotypes (e.g., loaded directly from a multi-sample VCF).

We currently have no way to persist this directly to Parquet. Instead, we transpose it to a GenotypeRDD and save that, since we can write Genotype records to Parquet. We can of course reload the data as a GenotypeRDD and reconvert it to a VariantContextRDD, but this requires a big groupBy and sort.

Would there be value in creating a VariantContext Avro object with an array of Genotype, so that we can persist the equivalent of a multi-sample VCF to Parquet more directly?
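A minimal sketch of what such a record could look like, using plain Scala case classes as hypothetical stand-ins for the bdg-formats Avro records (field names are illustrative, not the actual schema; the real change would add the genotype array to the Avro IDL):

```scala
// Hypothetical, simplified stand-ins for the bdg-formats Avro records.
case class Variant(contigName: String, start: Long, end: Long,
                   referenceAllele: String, alternateAllele: String)

case class Genotype(sampleId: String, alleles: Seq[String])

// One row per variant site, carrying every sample's genotype at that
// site. Parquet would store the genotypes field as a repeated group,
// so a multi-sample VCF persists without the genotype-major transpose.
case class VariantContext(variant: Variant, genotypes: Seq[Genotype])
```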

fnothaft (Member) commented Sep 6, 2016

Personally, I'd rather not. In my ideal world, we'd actually get rid of VariantContext for everything except converting to/from VCF. I think the VariantContext data structure preserves many of the problems inherent in VCF when working with "wide" cohorts.

jpdna (Member, Author) commented Sep 6, 2016

Thanks @fnothaft - for your further consideration, to give some more context: such a persisted "VariantContext" in Parquet is to me basically the persisted result of a "group by variant position, sort by variant position" over an RDD[Genotype], and not a different data structure, at least in terms of processing within Spark.

It's possible that just saving a sorted RDD[Genotype] could accomplish the same goal - though right now Spark has no knowledge of the fact that a Parquet file is sorted.

I'd imagine it should be pretty cheap, and a local non-shuffle operation, to go from a persisted Parquet representation of RDD[VariantContext] to RDD[Genotype], because you just need to unroll the arrays of Genotype. Going the other way, from RDD[Genotype] to RDD[VariantContext], is a shuffle and sort.
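That asymmetry in a minimal sketch, reusing the hypothetical case classes above (here the genotype-major RDD is assumed to be keyed by its variant):

```scala
import org.apache.spark.rdd.RDD

// Cheap direction: unrolling the nested arrays is a narrow flatMap,
// so no shuffle is needed.
def toGenotypes(vcs: RDD[VariantContext]): RDD[Genotype] =
  vcs.flatMap(_.genotypes)

// Expensive direction: rebuilding variant-major rows takes a full
// shuffle (the groupBy) plus a sort by genomic position.
def toVariantContexts(gts: RDD[(Variant, Genotype)]): RDD[VariantContext] =
  gts.groupByKey()
     .map { case (v, gs) => VariantContext(v, gs.toSeq) }
     .sortBy(vc => (vc.variant.contigName, vc.variant.start))
```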

My use case here is also a bit conflated with thinking about #651 (comment):
trying to enable more efficient range-based queries that can retrieve all the genotypes for a given variant, in a given genomic range, quickly while avoiding a shuffle/sort. The HBase work is one approach to this - but I'd also like to see it work in Parquet, both to compare and because I think it might be more viable than HBase for smaller clusters, and for "stand-alone" usage of an ADAM file, more equivalent to BCF.

(In my ideal world, ADAM formats would be as good as BAM/CRAM/VCF/BCF for both stand-alone and cluster usage - I wish we had a tabix-seek equivalent... if we do and I don't realize it, please let me know.)
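On the Parquet side, range queries can already lean on predicate pushdown against position columns. A minimal sketch with the Spark 1.x DataFrame API, assuming the persisted genotype records carry flat contigName/start/end columns and an existing SparkContext sc; the path and region are illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical path; the region 1:200000-240000 is illustrative.
val gts = sqlContext.read.parquet("sample.genotypes.adam")

// Parquet row-group min/max statistics let the scan skip row groups
// that cannot overlap the region, with no shuffle. If the file is
// also position-sorted, nearly all row groups get skipped.
val inRange = gts.filter(
  gts("contigName") === "1" &&
  gts("start") < 240000L &&
  gts("end") > 200000L)
```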

Does the expanded use case/scenario make this any more interesting, Frank?
What limitations or concerns would you see with such a Parquet (variant-major) genotype data object?

I guess we'd need some demonstrated use case (like better performance on the range queries) to justify the complexity of adding it - but do you see it as otherwise deleterious?

fnothaft (Member) commented Sep 6, 2016

I agree with your general gist, but I've always grumbled about what an RDD[VariantContext] represents. Essentially, the RDD[VariantContext] is the Genotype embodiment of the groupBy "anti-pattern" in Spark. I put scare quotes around anti-pattern because it's not that groupBys are necessarily terrible, but they can be really, really terrible.

I think we have a good sense of what the problems with the RDD[Genotype] programming model are. Specifically:

1. "It's possible that just saving a sorted RDD[Genotype] could accomplish the same goal - though right now Spark has no knowledge of the fact that a Parquet file is sorted."

2. "I'd imagine it should be pretty cheap, and a local non-shuffle operation, to go from a persisted Parquet representation of RDD[VariantContext] to RDD[Genotype], because you just need to unroll the arrays of Genotype. Going the other way, from RDD[Genotype] to RDD[VariantContext], is a shuffle and sort."

3. "My use case here is also a bit conflated with thinking about #651 (comment): trying to enable more efficient range-based queries that can retrieve all the genotypes for a given variant, in a given genomic range, quickly while avoiding a shuffle/sort. The HBase work is one approach to this - but I'd also like to see it work in Parquet, both to compare and because I think it might be more viable than HBase for smaller clusters, and for 'stand-alone' usage of an ADAM file, more equivalent to BCF."

I would prefer to fix these problems. I think the HBase work is a good solution to 3*, and a solution to 1 would be 90% of a solution to 2. I think 1 could be solved (or at least worked around) with clever abuse of metadata; I've been mentally working through an approach to 1 for a while.

* I agree we'd need to solve said problem for Parquet as well. I don't think that'd be impossible, but I don't know how much work it'd involve.
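One shape that metadata workaround could take - purely a sketch, using a hypothetical adam.isSorted key that a writer would have to stash in the Parquet footer's key/value metadata; nothing writes this key today:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Check the file footer for a (hypothetical) sortedness flag, so a
// loader could skip the groupBy/sort when rebuilding variant-major
// rows from a file it knows is already position-sorted.
def isSorted(conf: Configuration, file: Path): Boolean = {
  val footer = ParquetFileReader.readFooter(
    conf, file, ParquetMetadataConverter.NO_FILTER)
  val kv = footer.getFileMetaData.getKeyValueMetaData
  "true" == kv.get("adam.isSorted")
}
```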

jpdna (Member, Author) commented Sep 6, 2016

Interesting discussion!
I do see your point that we don't want to elevate the groupBy when we should be streaming through the data in one pass with an accumulator (ideally over data that, once sorted, doesn't have to be sorted again) - that is also much better for growing datasets.
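That single-pass regrouping might look like the sketch below, reusing the hypothetical case classes from the first sketch; it assumes the genotypes arrive sorted by position within each partition and that no variant's genotypes straddle a partition boundary:

```scala
import org.apache.spark.rdd.RDD

// Walk each partition once, batching consecutive genotypes that share
// a variant key. No shuffle: this relies entirely on the assumed
// sortedness and on variant-aligned partition boundaries.
def regroupSorted(gts: RDD[(Variant, Genotype)]): RDD[VariantContext] =
  gts.mapPartitions { iter =>
    new Iterator[VariantContext] {
      private val buf = iter.buffered
      def hasNext: Boolean = buf.hasNext
      def next(): VariantContext = {
        val (v, g) = buf.next()
        val gs = Seq.newBuilder[Genotype]
        gs += g
        while (buf.hasNext && buf.head._1 == v) gs += buf.next()._2
        VariantContext(v, gs.result())
      }
    }
  }
```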

I'll think more on the HBase and the partitioning/bucketing stuff.
Closing this ticket.

jpdna closed this as completed on Sep 6, 2016