Parquet storage of VariantContext #1151
Comments
Personally, I'd rather not. In my ideal world, we'd actually get rid of …
Thanks @fnothaft. For your further consideration, some more context: such a persisted VariantContext in Parquet is, to me, essentially the persisted result of a "group by variant position, sort by variant position" over an RDD[Genotype], not a different data structure, at least in terms of processing within Spark. It's possible that just saving a sorted RDD[Genotype] could accomplish the same goal, though right now Spark has no knowledge that a Parquet file is sorted.

Going from a persisted Parquet representation of RDD[VariantContext] to RDD[Genotype] should be a cheap, local, non-shuffle operation, because you just need to unroll the arrays of Genotype. Going the other way, from RDD[Genotype] to RDD[VariantContext], is a shuffle and a sort (see the sketch below).

My use case here is also a bit conflated with #651 (comment): in my ideal world, ADAM formats would be as good as BAM/CRAM/VCF/BCF for both standalone and cluster usage. I wish we had a tabix-seek equivalent; if we do and I don't realize it, please let me know.

Does the expanded use case / scenario make this any more interesting, Frank? I guess we'd need some demonstrated use case (like better performance on range queries) to justify the added complexity, but do you see it as otherwise deleterious?
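To make the asymmetry concrete, here is a minimal Spark sketch, not ADAM API: `SiteGenotypes` is a hypothetical stand-in for the variant-major row, and it assumes the bdg-formats `Genotype` record exposes its site via `getVariant` (and the usual ADAM Kryo serialization setup for Avro records on shuffle):

```scala
import org.apache.spark.rdd.RDD
import org.bdgenomics.formats.avro.{ Genotype, Variant }

// Hypothetical variant-major row: one entry per site, all calls inlined.
case class SiteGenotypes(variant: Variant, genotypes: Seq[Genotype])

// Variant-major -> genotype-major: a local, shuffle-free flatMap that
// just unrolls each site's genotype array.
def toGenotypeMajor(rdd: RDD[SiteGenotypes]): RDD[Genotype] =
  rdd.flatMap(_.genotypes)

// Genotype-major -> variant-major: a shuffle (group by site) plus, for
// VCF-style output, a sort by position.
def toVariantMajor(rdd: RDD[Genotype]): RDD[SiteGenotypes] =
  rdd.groupBy(_.getVariant)
    .map { case (v, gts) => SiteGenotypes(v, gts.toSeq) }
    .sortBy(sg => (sg.variant.getContigName.toString, sg.variant.getStart.longValue))
```

The point is just that one direction stays within a partition while the other forces a full shuffle and sort, which is the cost the persisted variant-major layout would let us pay once at write time.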
I agree with your general gist, but I've always grumbled about this. I think we have a good sense for what the problems with the current format are:

1. …
2. …
3. …

I would prefer to fix these problems. I think the HBase work is a good solution to 3*, and a solution to 1 would be 90% of a solution to 2. I think 1 could be solved (or at least worked around) with clever abuse of metadata; I've been mentally working through an approach for 1 for a while (a rough sketch follows).

* I agree we'd need to solve that problem for Parquet as well. I don't think that would be impossible, but I don't know how much work it would involve.
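One very rough sketch of what "clever abuse of metadata" might look like for the sortedness problem, using only plain Hadoop filesystem APIs; the `_sorted` marker file is my own invention here, not an ADAM or Parquet convention:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Drop a sidecar marker next to the Parquet files recording that the
// dataset is sorted by position, since Spark/Parquet won't track this.
def markSorted(dir: String, conf: Configuration): Unit = {
  val marker = new Path(dir, "_sorted")
  marker.getFileSystem(conf).create(marker).close()
}

// On load, a reader that finds the marker can skip the re-sort.
def isSorted(dir: String, conf: Configuration): Boolean = {
  val marker = new Path(dir, "_sorted")
  marker.getFileSystem(conf).exists(marker)
}
```

A real solution would want something richer (sort key, partitioning scheme), but even a boolean like this would let a loader avoid the shuffle when the data is known to be ordered.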
Interesting discussion! I'll think more on the HBase and the partitioning/bucketing stuff.
Within ADAM we can represent genotypes in a "variant major" mode in VariantContextRDD, where each row of the RDD holds an array of Genotypes (directly loaded from a multi-sample VCF, for example).

We currently have no way to persist this to Parquet directly; instead we transpose it to a GenotypeRDD and save that, since we can write Genotype records to Parquet. We can of course reload this into a GenotypeRDD and then reconvert to a VariantContextRDD, but that requires a big groupBy and sort.

Would there be value in creating a VariantContext Avro object with an array of Genotype, so that we can more directly persist to Parquet the equivalent of a multi-sample VCF? (A sketch of what such a record might look like is below.)
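For concreteness, a minimal sketch of the proposed record, written as a Scala case class over the existing bdg-formats types; the name and field layout are my own guesses, and the real thing would presumably be an Avro record in bdg-formats with an array<Genotype> field:

```scala
import org.bdgenomics.formats.avro.{ Genotype, Variant }

// Hypothetical variant-major record: one row per site, with every
// sample's call inlined, so a multi-sample VCF round-trips to Parquet
// without needing a groupBy on load.
case class VariantContextRecord(
  variant: Variant,        // the site: contig, position, alleles
  genotypes: Seq[Genotype] // one entry per sample called at this site
)
```

Reading such a file back as RDD[Genotype] would then be a flatMap over the genotypes field rather than a shuffle, which is exactly the asymmetry discussed above.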