
How to create a rich(er) VariantContext RDD? Reconstruct VCF INFO fields. #878

Closed
NeillGibson opened this issue Nov 9, 2015 · 3 comments

@NeillGibson
Contributor

Hi,

Is this code snippet the correct way to convert from an RDD of Genotypes to an RDD of VariantContexts?

val genotypes = ac.loadGenotypes("/chr1.adam")
val variantsByPos = genotypes.keyBy(_.getVariant.start.hashCode).groupByKey
val variantContexts = variantsByPos.values.map(gts => VariantContext.buildFromGenotypes(gts.toSeq))

variantContexts.first.position
res5: org.bdgenomics.adam.models.ReferencePosition = ReferenceRegion(1,10221400,10221401,Independent)
variantContexts.first.variant.start
res7: Long = 10221400
variantContexts.first.genotypes.size
res9: Int = 2504
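One caveat with the snippet above: keying by `start.hashCode` alone can collide, since the same start coordinate occurs on different contigs (and distinct values can share a hash). A safer key is the full reference position. A minimal sketch with plain Scala collections, using hypothetical `Pos`/`Gt` case classes standing in for ADAM's records (the RDD analogue is `keyBy(...).groupByKey`):

```scala
// Hypothetical stand-ins for ADAM's position and genotype records.
case class Pos(contig: String, start: Long)
case class Gt(contig: String, start: Long, sample: String)

val gts = Seq(Gt("1", 100L, "NA1"), Gt("2", 100L, "NA2"), Gt("1", 100L, "NA3"))

// Group by the full position, not start.hashCode:
// contigs 1 and 2 at start 100 stay in separate groups.
val byPos: Map[Pos, Seq[Gt]] = gts.groupBy(g => Pos(g.contig, g.start))
```

With `start.hashCode` as the key, the three genotypes above would collapse into one group; keyed on the full position they form two sites, as expected.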

Are you planning on enriching the VariantContext further with reconstructed VCF INFO fields?
By recomputing some INFO values from all the genotype attributes?
Or by merging possibly different variantCallingAnnotations over all called genotypes?
This could be done in the buildFromGenotypes constructor.

Possible VCF INFO fields that could be reconstructed are:

DP : combined depth across samples, e.g. DP=154
AC : allele count in genotypes, for each ALT allele, in the same order as listed
AF : allele frequency for each ALT allele in the same order as listed
VariantCallErrorProbability (Max) ?
VariantFilters (Merge)?
Mapq0Reads (Sum)?
VariantQualityByDepth (Max?)?
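The site-level aggregates in the list are straightforward sums and ratios over the genotypes at a position. A sketch with plain Scala collections and an illustrative `SampleGt` record (the field names are assumptions, not ADAM's schema):

```scala
// Hypothetical per-sample genotype record; fields are illustrative only.
case class SampleGt(depth: Int, altAlleleCount: Int, ploidy: Int)

// All genotypes called at one site.
val site = Seq(SampleGt(30, 1, 2), SampleGt(25, 2, 2), SampleGt(40, 0, 2))

val dp = site.map(_.depth).sum          // DP: combined depth across samples
val ac = site.map(_.altAlleleCount).sum // AC: ALT allele count in genotypes
val an = site.map(_.ploidy).sum         // AN: total number of called alleles
val af = ac.toDouble / an               // AF: allele frequency for the ALT allele
```

A multi-allelic site would carry AC and AF as vectors, one entry per ALT allele in order, but the per-allele arithmetic is the same.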

Best Regards,

Neill

--edit

What I read here is that a groupByKey is something to avoid:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

But it seems to me that converting a Genotype RDD to a VariantContext RDD is something one will have to do frequently when analyzing genotypes, and it can't be replaced with a reduceByKey.

Because in the end you want to expose / query / export a genotype and variant matrix, not a genotype array.

Has anyone tested the speed of this conversion for the 1000 genomes chr1 vcf file with 2500 samples?
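One middle ground, sketched here as an assumption rather than anything ADAM provides: aggregates that are commutative and associative (DP as a sum, QUAL as a max) can be computed with reduceByKey, which combines partial results per partition instead of shuffling whole genotype lists the way groupByKey does. A plain-Scala analogue:

```scala
// Partial per-site aggregate; mergeable pairwise, so it fits reduceByKey.
case class Agg(dp: Int, qualMax: Double)
def merge(a: Agg, b: Agg): Agg = Agg(a.dp + b.dp, math.max(a.qualMax, b.qualMax))

// Keys are illustrative "contig:start" strings.
val keyed = Seq(("1:100", Agg(30, 50.0)), ("1:100", Agg(25, 60.0)), ("1:200", Agg(10, 40.0)))

// Plain-collections equivalent of rdd.reduceByKey(merge):
val reduced = keyed.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(merge) }
```

Building the full VariantContext still needs every genotype at a position, so groupByKey (or a cogroup) remains necessary for the matrix itself; the reduce only helps for the INFO-style summaries.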

@NeillGibson
Contributor Author

A method to copy might be the one used in bcftools merge:

https://samtools.github.io/bcftools/bcftools.html#merge

It takes:
the maximum variant.QUAL,
the sum of variantAnnotations.DP (over distinct variantCallingAnnotations),
the sum of variantAnnotations.DP4 (over distinct variantCallingAnnotations),
and the values in the variantAnnotations of the first genotype for the other fields.

And custom methods to handle the merging of the variant annotations can be specified.

-i, --info-rules -|TAG:METHOD[,…]
    Rules for merging INFO fields (scalars or vectors) or - to disable the default rules. METHOD is one of sum, avg, min, max, join. Default is DP:sum,DP4:sum if these fields exist in the input files. Fields with no specified rule will take the value from the first input file. The merged QUAL value is currently set to the maximum. This behaviour is not user controllable at the moment. 
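The rule dispatch described above is easy to sketch. This is a hypothetical Scala illustration of the bcftools-style methods (sum, avg, min, max, join), not ADAM or bcftools code:

```scala
// Apply one bcftools-style merge METHOD to the per-input values of a single INFO tag.
def mergeInfo(method: String, values: Seq[Double]): Any = method match {
  case "sum"  => values.sum
  case "avg"  => values.sum / values.size
  case "min"  => values.min
  case "max"  => values.max
  case "join" => values.mkString(",")
}

val dp   = mergeInfo("sum", Seq(100.0, 54.0)) // default DP:sum rule
val qual = mergeInfo("max", Seq(30.0, 45.0))  // merged QUAL takes the maximum
```

Tags with no specified rule would fall through to the value from the first input, matching the default described in the quoted documentation.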

The situation where merging of the variantAnnotations is needed is when the genotypes/variants from two separate variant calling experiments are merged in ADAM:

first converting the two VCF files to ADAM/Parquet genotype files,
then combining the genotype records from both ADAM/Parquet files into one Genotype collection,
and finally executing toVariantContext on the combined genotype collection.
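That workflow can be sketched with plain collections standing in for the genotype RDDs (names here are illustrative; the RDD analogue of `++` is `union`):

```scala
// Hypothetical minimal genotype record for the sketch.
case class G(pos: Long, sample: String)

// Genotype records loaded from two separately called ADAM/Parquet files.
val callSet1 = Seq(G(100L, "A"), G(200L, "A"))
val callSet2 = Seq(G(100L, "B"))

val combined = callSet1 ++ callSet2    // RDD analogue: genotypes1.union(genotypes2)
val perSite  = combined.groupBy(_.pos) // then build one VariantContext per site,
                                       // merging annotations where both sets overlap
```

Site 100 ends up with genotypes from both call sets, which is exactly where the annotation-merge rules above would apply.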

@fnothaft
Member

fnothaft commented Nov 8, 2016

Should be resolved by #1250.

@heuermh
Member

heuermh commented Jan 5, 2017

Fixed by #1288.

In adam-shell

scala> val genotypes = sc.loadGenotypes("adam-core/src/test/resources/sorted.vcf")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD = 
GenotypeRDD(MapPartitionsRDD[5] at flatMap at VariantContextRDD.scala:67,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> val variantContexts = genotypes.toVariantContextRDD
variantContexts: org.bdgenomics.adam.rdd.variant.VariantContextRDD = 
VariantContextRDD(MapPartitionsRDD[8] at map at GenotypeRDD.scala:68,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="...

We should revisit merging genotype annotations after the remainder of Genotype refactoring is complete (version 0.22.0).

@heuermh heuermh closed this as completed Jan 5, 2017