Proposal: VariantCallingAnnotations to be moved to Variant #41

laserson · 2014-11-22T22:10:00Z

This stems from the -onlyvariants flag we added to vcf2adam, which writes out only the variant information. IMO, annotations associated with a variant should be packaged with the variant, not with a genotype. If you're denormalizing the variant information into the Genotype, you shouldn't denormalize these two pieces separately. This is annoying from the -onlyvariants perspective because, at the moment, it stores only minimal info on the variants, when what I really want to do is analyze the metadata on the variants. Thoughts?

laserson · 2014-12-18T19:57:12Z

@massie @tdanford @fnothaft any thoughts on this?

fnothaft · 2014-12-18T21:24:55Z

@laserson Thanks for the ping! I'd totally missed this.

A few thoughts:

The VariantCallingAnnotations should probably be renamed to GenotypeAnnotations or GenotypeStatistics. This object contains sample-specific statistics such as sample allelic depth, VQSR tranche for the sample, genotype likelihoods, etc.
The variant annotations are in DatabaseVariantAnnotation/VariantEffect. These get imported through a different process from the VCF file (specifically, VcfAnnotation2ADAM).

Which sort of annotations are you most interested in? Population genomics annotations (DatabaseVariantAnnotation), VEP annotations (VariantEffect), or genotype statistics (VariantCallingAnnotations)?

CCing @mlinderm, who put much of the original brains into the Variant/Genotype schemas.

fnothaft · 2014-12-18T21:26:36Z

Also, to add to this, IIRC, the variant annotations were split out for performance reasons. If you want to use them along with genotype data, you can group the genotypes together (e.g., via RichVariant, IIRC) and then join them against a variant site. This is arguably preferable when you're processing very large cohorts, as it minimizes replication of the variant annotations.
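The group-then-join pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layouts, field names (`allele_frequency`, `alleles`), and helper functions (`site_key`, `group_genotypes_by_site`, `join_sites_with_annotations`) are hypothetical and are not the ADAM schemas or the RichVariant API. It shows why joining once per variant site keeps a single copy of each annotation instead of one copy per genotype.

```python
from collections import defaultdict

# Hypothetical, simplified records. The real Genotype and annotation
# schemas carry different and richer fields.
genotypes = [
    {"contig": "1", "pos": 100, "ref": "A", "alt": "G",
     "sample": "NA12878", "alleles": ("A", "G")},
    {"contig": "1", "pos": 100, "ref": "A", "alt": "G",
     "sample": "NA12891", "alleles": ("G", "G")},
    {"contig": "2", "pos": 500, "ref": "C", "alt": "T",
     "sample": "NA12878", "alleles": ("C", "T")},
]
annotations = [
    {"contig": "1", "pos": 100, "ref": "A", "alt": "G",
     "allele_frequency": 0.002},
]


def site_key(record):
    """Identify a variant site by contig, position, and alleles."""
    return (record["contig"], record["pos"], record["ref"], record["alt"])


def group_genotypes_by_site(records):
    """Collect all per-sample genotypes that share a variant site."""
    grouped = defaultdict(list)
    for record in records:
        grouped[site_key(record)].append(record)
    return dict(grouped)


def join_sites_with_annotations(grouped, site_annotations):
    """Inner join: each annotated site is paired once with its genotypes."""
    by_site = {site_key(a): a for a in site_annotations}
    return {
        key: (site_genotypes, by_site[key])
        for key, site_genotypes in grouped.items()
        if key in by_site
    }


joined = join_sites_with_annotations(
    group_genotypes_by_site(genotypes), annotations
)
for site, (site_genotypes, annotation) in joined.items():
    print(site, len(site_genotypes), annotation["allele_frequency"])
```

With N samples at a site, the annotation record appears once in the joined result rather than N times, which is the replication saving mentioned above. In a distributed setting the same shape would be a group-by-key followed by a keyed join.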

laserson · 2014-12-18T21:43:29Z

I think this was the use case I had in mind when I submitted this: I had a good chunk of genotype data from the 1000 Genomes Project, but I wanted to filter variants based on annotations in the ExAC VCF file. Ideally, I could ingest the ExAC file using -onlyvariants and still keep all of the annotation data, suitable for performing a join against the genotype data. This is not possible as currently implemented, since -onlyvariants keeps only the minimal variant data.

fnothaft · 2014-12-18T21:45:05Z

Which annotations did you want to import from the ExAC file, though? I think if you import the ExAC file via VcfAnnotation2ADAM, you should then be able to join against genotypes.

laserson · 2014-12-18T22:08:53Z

Hmm, didn't realize we had that command line tool. Perhaps that's just perfectly sufficient then. I'll close this issue for now.

fnothaft · 2014-12-18T22:11:36Z

Wonderful! There is indeed always a method to the madness.

laserson closed this as completed Dec 18, 2014

heuermh mentioned this issue Jul 1, 2016

Cleanup in org.bdgenomics.adam.converters package. bigdatagenomics/adam#1062

Closed
