
How to create a rich(er) VariantContext RDD? Reconstruct VCF INFO fields. #878

Closed
NeillGibson opened this issue Nov 9, 2015 · 3 comments

@NeillGibson
Contributor

Hi,

Is this code snippet the correct way to convert from an RDD of Genotypes to an RDD of VariantContexts?

val genotypes = ac.loadGenotypes("/chr1.adam")
val variantsByPos = genotypes.keyBy(_.getVariant.start.hashCode).groupByKey
val variantContexts = variantsByPos.values.map(gts => VariantContext.buildFromGenotypes(gts.toSeq))

variantContexts.first.position
res5: org.bdgenomics.adam.models.ReferencePosition = ReferenceRegion(1,10221400,10221401,Independent)
variantContexts.first.variant.start
res7: Long = 10221400
variantContexts.first.genotypes.size
res9: Int = 2504
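One caveat with the snippet above: keying by `start.hashCode` alone can collide, since the same start coordinate occurs on different contigs (and distinct values can share a hash). A safer key is the full reference position. A minimal sketch with plain Scala collections, using hypothetical `Pos`/`Gt` case classes standing in for ADAM's records (the RDD analogue is `keyBy(...).groupByKey`):

```scala
// Hypothetical stand-ins for ADAM's position and genotype records.
case class Pos(contig: String, start: Long)
case class Gt(contig: String, start: Long, sample: String)

val gts = Seq(Gt("1", 100L, "NA1"), Gt("2", 100L, "NA2"), Gt("1", 100L, "NA3"))

// Group by the full position, not start.hashCode:
// contigs 1 and 2 at start 100 stay in separate groups.
val byPos: Map[Pos, Seq[Gt]] = gts.groupBy(g => Pos(g.contig, g.start))
```

With `start.hashCode` as the key, the three genotypes above would collapse into one group; keyed on the full position they form two sites, as expected.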

Are you planning on enriching the VariantContext further with reconstructed VCF INFO fields?
By recomputing some INFO values from all the genotype attributes?
Or by merging possibly different variantCallingAnnotations over all called genotypes?
This could be done in the buildFromGenotypes constructor.

Possible VCF INFO fields that could be reconstructed are:

DP : combined depth across samples, e.g. DP=154
AC : allele count in genotypes, for each ALT allele, in the same order as listed
AF : allele frequency for each ALT allele in the same order as listed
VariantCallErrorProbability (Max) ?
VariantFilters (Merge)?
Mapq0Reads (Sum)?
VariantQualityByDepth (Max?)?
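The site-level aggregates in the list are straightforward sums and ratios over the genotypes at a position. A sketch with plain Scala collections and an illustrative `SampleGt` record (the field names are assumptions, not ADAM's schema):

```scala
// Hypothetical per-sample genotype record; fields are illustrative only.
case class SampleGt(depth: Int, altAlleleCount: Int, ploidy: Int)

// All genotypes called at one site.
val site = Seq(SampleGt(30, 1, 2), SampleGt(25, 2, 2), SampleGt(40, 0, 2))

val dp = site.map(_.depth).sum          // DP: combined depth across samples
val ac = site.map(_.altAlleleCount).sum // AC: ALT allele count in genotypes
val an = site.map(_.ploidy).sum         // AN: total number of called alleles
val af = ac.toDouble / an               // AF: allele frequency for the ALT allele
```

A multi-allelic site would carry AC and AF as vectors, one entry per ALT allele in order, but the per-allele arithmetic is the same.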

Best Regards,

Neill

--edit

What I read here is that a groupByKey is something to avoid:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

But it seems to me that converting a Genotype RDD to a VariantContext RDD is something one will have to do frequently when analyzing genotypes, and it can't be replaced with a reduceByKey.

Because in the end you want to expose / query / export a genotype and variant matrix, not a genotype array.

Has anyone tested the speed of this conversion for the 1000 genomes chr1 vcf file with 2500 samples?
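One middle ground, sketched here as an assumption rather than anything ADAM provides: aggregates that are commutative and associative (DP as a sum, QUAL as a max) can be computed with reduceByKey, which combines partial results per partition instead of shuffling whole genotype lists the way groupByKey does. A plain-Scala analogue:

```scala
// Partial per-site aggregate; mergeable pairwise, so it fits reduceByKey.
case class Agg(dp: Int, qualMax: Double)
def merge(a: Agg, b: Agg): Agg = Agg(a.dp + b.dp, math.max(a.qualMax, b.qualMax))

// Keys are illustrative "contig:start" strings.
val keyed = Seq(("1:100", Agg(30, 50.0)), ("1:100", Agg(25, 60.0)), ("1:200", Agg(10, 40.0)))

// Plain-collections equivalent of rdd.reduceByKey(merge):
val reduced = keyed.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(merge) }
```

Building the full VariantContext still needs every genotype at a position, so groupByKey (or a cogroup) remains necessary for the matrix itself; the reduce only helps for the INFO-style summaries.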

@NeillGibson
Contributor Author

A method to copy might be the one used in bcftools merge:

https://samtools.github.io/bcftools/bcftools.html#merge

It takes:
the maximum variant.QUAL,
the sum of variantAnnotations.DP (over distinct variantCallingAnnotations),
the sum of variantAnnotations.DP4 (over distinct variantCallingAnnotations),
and the values in the variantAnnotations of the first genotype for the other fields.

And custom methods to handle the merging of the variant annotations can be specified.

-i, --info-rules -|TAG:METHOD[,…]
    Rules for merging INFO fields (scalars or vectors) or - to disable the default rules. METHOD is one of sum, avg, min, max, join. Default is DP:sum,DP4:sum if these fields exist in the input files. Fields with no specified rule will take the value from the first input file. The merged QUAL value is currently set to the maximum. This behaviour is not user controllable at the moment. 
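The rule dispatch described above is easy to sketch. This is a hypothetical Scala illustration of the bcftools-style methods (sum, avg, min, max, join), not ADAM or bcftools code:

```scala
// Apply one bcftools-style merge METHOD to the per-input values of a single INFO tag.
def mergeInfo(method: String, values: Seq[Double]): Any = method match {
  case "sum"  => values.sum
  case "avg"  => values.sum / values.size
  case "min"  => values.min
  case "max"  => values.max
  case "join" => values.mkString(",")
}

val dp   = mergeInfo("sum", Seq(100.0, 54.0)) // default DP:sum rule
val qual = mergeInfo("max", Seq(30.0, 45.0))  // merged QUAL takes the maximum
```

Tags with no specified rule would fall through to the value from the first input, matching the default described in the quoted documentation.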

The situation where merging of the variantAnnotations is needed is when the genotypes/variants from two separate variant calling experiments are merged in ADAM:

first converting the two VCF files to ADAM/Parquet genotype files,
then combining the genotype records from both ADAM/Parquet files into one Genotype collection,
and finally executing toVariantContext on the combined genotype collection.
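That workflow can be sketched with plain collections standing in for the genotype RDDs (names here are illustrative; the RDD analogue of `++` is `union`):

```scala
// Hypothetical minimal genotype record for the sketch.
case class G(pos: Long, sample: String)

// Genotype records loaded from two separately called ADAM/Parquet files.
val callSet1 = Seq(G(100L, "A"), G(200L, "A"))
val callSet2 = Seq(G(100L, "B"))

val combined = callSet1 ++ callSet2    // RDD analogue: genotypes1.union(genotypes2)
val perSite  = combined.groupBy(_.pos) // then build one VariantContext per site,
                                       // merging annotations where both sets overlap
```

Site 100 ends up with genotypes from both call sets, which is exactly where the annotation-merge rules above would apply.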

@fnothaft
Member

fnothaft commented Nov 8, 2016

Should be resolved by #1250.

@heuermh
Member

heuermh commented Jan 5, 2017

Fixed by #1288.

In adam-shell

scala> val genotypes = sc.loadGenotypes("adam-core/src/test/resources/sorted.vcf")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD = 
GenotypeRDD(MapPartitionsRDD[5] at flatMap at VariantContextRDD.scala:67,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> val variantContexts = genotypes.toVariantContextRDD
variantContexts: org.bdgenomics.adam.rdd.variant.VariantContextRDD = 
VariantContextRDD(MapPartitionsRDD[8] at map at GenotypeRDD.scala:68,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="...

We should revisit merging genotype annotations after the remainder of Genotype refactoring is complete (version 0.22.0).

@heuermh heuermh closed this as completed Jan 5, 2017