How to create a rich(er) VariantContext RDD? Reconstruct VCF INFO fields. #878
A method to copy might be the one used in bcftools merge: https://samtools.github.io/bcftools/bcftools.html#merge. It lets custom methods be specified for handling how the variant annotations are merged.
The situation where merging of the variant annotations is needed is when the genotypes/variants from two separate variant calling experiments are merged in ADAM: first convert the two different VCF files to ADAM/Parquet genotype files, then run toVariantContext on the combined genotype collection.
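As an illustration of such custom merge methods, here is a minimal plain-Scala sketch; the field names, the `Double` value type, and the per-field combine rules are assumptions for illustration, not ADAM or bcftools API:

```scala
// Hypothetical sketch: merging INFO-style annotation maps from two call sets,
// in the spirit of bcftools merge, with a per-field combine rule.
object AnnotationMerge {
  type Info = Map[String, Double]

  // Per-field combine rules (assumed); fields without a rule keep the first value.
  val rules: Map[String, (Double, Double) => Double] = Map(
    "DP" -> ((a: Double, b: Double) => a + b),         // total depth: sum
    "MQ" -> ((a: Double, b: Double) => (a + b) / 2.0)  // mapping quality: average
  )

  def merge(a: Info, b: Info): Info =
    (a.keySet ++ b.keySet).map { k =>
      val v = (a.get(k), b.get(k)) match {
        case (Some(x), Some(y)) => rules.getOrElse(k, (p: Double, _: Double) => p)(x, y)
        case (Some(x), None)    => x
        case (None, Some(y))    => y
        case (None, None)       => 0.0 // unreachable: k comes from the union of keys
      }
      k -> v
    }.toMap
}
```

Under these assumed rules, merging `Map("DP" -> 10.0)` with `Map("DP" -> 5.0, "MQ" -> 60.0)` yields a summed `DP` of 15.0 and the single available `MQ`.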
Should be resolved by #1250.
Fixed by #1288.

```scala
scala> val genotypes = sc.loadGenotypes("adam-core/src/test/resources/sorted.vcf")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD =
GenotypeRDD(MapPartitionsRDD[5] at flatMap at VariantContextRDD.scala:67,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> val variantContexts = genotypes.toVariantContextRDD
variantContexts: org.bdgenomics.adam.rdd.variant.VariantContextRDD =
VariantContextRDD(MapPartitionsRDD[8] at map at GenotypeRDD.scala:68,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="...
```

We should revisit merging genotype annotations after the remainder of the Genotype refactoring is complete (version 0.22.0).
Hi,
Is this code snippet the correct way to convert an RDD of Genotypes to an RDD of VariantContexts?
Are you planning to enrich the VariantContext further with reconstructed VCF INFO fields?
By recomputing some INFO values from all the genotype attributes?
Or by merging the possibly different variantCallingAnnotations over all called genotypes?
This could be done in the buildFromGenotypes constructor.
Possible VCF INFO fields that could be reconstructed are:
Best Regards,
Neill
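For the "recomputing INFO values from the genotype attributes" idea above, a minimal sketch of deriving the site-level AN, AC, and AF values; the `Gt` record and the 0/1/-1 allele encoding (reference, alternate, no call) are assumptions for illustration, not ADAM's schema:

```scala
// Hypothetical sketch: recompute site-level INFO values from per-sample genotypes.
object InfoFromGenotypes {
  // Simplified genotype: a sequence of allele codes (0 = ref, 1 = alt, -1 = no call).
  case class Gt(alleles: Seq[Int])

  // AN: number of called alleles; AC: count of the alternate allele;
  // AF: alternate allele frequency, AC / AN.
  def recompute(gts: Seq[Gt]): (Int, Int, Double) = {
    val called = gts.flatMap(_.alleles).filter(_ >= 0)
    val an = called.size
    val ac = called.count(_ == 1)
    val af = if (an == 0) 0.0 else ac.toDouble / an
    (an, ac, af)
  }
}
```

For three samples called 0/1, 1/1, and 0/. this gives AN = 5, AC = 3, AF = 0.6; a real implementation would also need to handle multi-allelic sites.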
--edit
What I read here is that a groupByKey is something to avoid:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
But it seems to me that converting a Genotype RDD to a VariantContext RDD is something one will have to do frequently when analyzing genotypes, and it can't be replaced with a reduceByKey,
because in the end you want to expose / query / export a genotype and variant matrix, not a genotype array.
Has anyone tested the speed of this conversion for the 1000 genomes chr1 vcf file with 2500 samples?
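To make the shape of that conversion concrete, here is a plain-Scala sketch of the grouping step; the record shapes are simplified assumptions, not ADAM's types. Each genotype is keyed by its variant's position, and all samples' calls at a site are collected into one record; since the whole group is the result you want, this really is a group-by rather than an aggregation that reduceByKey could shortcut:

```scala
// Simplified sketch of a Genotype-to-VariantContext conversion on plain collections.
object ToVariantContext {
  case class Genotype(contig: String, start: Long, sample: String)
  case class VariantContext(contig: String, start: Long, genotypes: Seq[Genotype])

  // Key each genotype by (contig, start) and collect all calls at that site.
  def convert(gts: Seq[Genotype]): Seq[VariantContext] =
    gts.groupBy(g => (g.contig, g.start))
      .map { case ((c, s), gs) => VariantContext(c, s, gs) }
      .toSeq
}
```

With two samples called at 1:100 and one at 2:5, this yields two VariantContext records, one holding both calls at 1:100.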