Proposal: VariantCallingAnnotations to be moved to Variant #41

Closed
laserson opened this issue Nov 22, 2014 · 7 comments

Comments

@laserson
Contributor

This stems from the -onlyvariants flag we added to vcf2adam, which writes out only the variant information. IMO, annotations associated with a variant should be packaged with the variant, not with a genotype. If you're denormalizing the variant information into the Genotype, you shouldn't denormalize these two pieces separately. This is annoying from the -onlyvariants perspective: at the moment it ends up storing minimal info on the variants, when what I really want to do is analyze the metadata on the variants. Thoughts?

@laserson
Contributor Author

@massie @tdanford @fnothaft any thoughts on this?

@fnothaft
Member

@laserson Thanks for the ping! I'd totally missed this.

A few thoughts:

  • The VariantCallingAnnotations should probably be renamed to GenotypeAnnotations or GenotypeStatistics. This object contains sample-specific statistics like sample allelic depth, VQSR tranche for the sample, genotype likelihoods, etc.
  • The variant annotations are in DatabaseVariantAnnotation/VariantEffect. These get imported through a different process from the VCF file (specifically, VcfAnnotation2ADAM).

Which sort of annotations are you most interested in? Population genomics annotations (DatabaseVariantAnnotation), VEP annotations (VariantEffect), or genotype statistics (VariantCallingAnnotations)?

CCing @mlinderm, who put much of the original brains into the Variant/Genotype schemas.

@fnothaft
Member

Also, to add to this, IIRC, the variant annotations were split out for performance reasons. If you want to use them along with genotype data, you can group the genotypes together (e.g., via RichVariant, IIRC) and then join them against a variant site. This is arguably preferable when you're processing very large cohorts, as it minimizes replication of the variant annotations.
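
As a rough illustration of the group-then-join pattern described above, here is a minimal sketch in plain Python. It is not the ADAM or Spark API: `Site`, `Genotype`, `group_by_site`, and `join_with_annotations` are hypothetical names, and the data is invented. The point it tries to show is that each annotation is attached once per variant site rather than once per genotype.

```python
from collections import defaultdict
from typing import NamedTuple


class Site(NamedTuple):
    """Hypothetical key identifying a variant site."""
    contig: str
    position: int
    ref: str
    alt: str


class Genotype(NamedTuple):
    """Hypothetical per-sample genotype call at a site."""
    site: Site
    sample_id: str
    alleles: tuple


def group_by_site(genotypes):
    """Collect all genotype calls that share a variant site."""
    grouped = defaultdict(list)
    for genotype in genotypes:
        grouped[genotype.site].append(genotype)
    return dict(grouped)


def join_with_annotations(grouped, annotations):
    """Inner join: keep sites that have both genotypes and an annotation.

    Each annotation is attached once per site, not once per genotype.
    """
    return {
        site: (annotations[site], calls)
        for site, calls in grouped.items()
        if site in annotations
    }


# Toy data, invented for illustration only.
site_a = Site("1", 100, "A", "T")
site_b = Site("1", 200, "G", "C")
genotypes = [
    Genotype(site_a, "sample1", ("A", "T")),
    Genotype(site_a, "sample2", ("T", "T")),
    Genotype(site_b, "sample1", ("G", "G")),
]
annotations = {site_a: {"source": "toy"}}  # site_b has no annotation

joined = join_with_annotations(group_by_site(genotypes), annotations)
```

With a large cohort, the grouped form carries one annotation record per site regardless of how many samples were genotyped there, which is presumably the replication saving referred to above.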

@laserson
Contributor Author

I think this was the use case I was thinking of when I submitted this: I had a good chunk of genotype data from the 1000 Genomes Project, but I wanted to filter variants based on annotations in the ExAC VCF file. Ideally, I could ingest the ExAC file using -onlyvariants and still maintain all of the annotation data, suitable for performing a join against the genotype data. This is not possible as currently implemented, since -onlyvariants keeps only the minimal variant data.
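
To make the use case concrete, a small sketch of "filter sites by an annotation, then keep only genotypes at those sites" could look like the following. This is plain Python under stated assumptions, not ADAM code: the `allele_frequency` field, the cutoff, the function names, and the toy rows are all hypothetical and do not reflect the actual ExAC or ADAM schemas.

```python
def sites_passing(annotations, max_allele_frequency):
    """Return site keys whose annotated allele frequency is at or below the cutoff."""
    return {
        site
        for site, info in annotations.items()
        if info["allele_frequency"] <= max_allele_frequency
    }


def keep_genotypes_at(genotype_rows, sites):
    """Keep only genotype rows whose site is in the given set (a semi-join)."""
    return [row for row in genotype_rows if row["site"] in sites]


# Toy annotation table keyed by (contig, position, ref, alt); values are invented.
annotation_table = {
    ("1", 100, "A", "T"): {"allele_frequency": 0.001},
    ("1", 200, "G", "C"): {"allele_frequency": 0.25},
}

# Toy genotype rows; the third site has no annotation at all.
genotype_rows = [
    {"site": ("1", 100, "A", "T"), "sample_id": "sample1"},
    {"site": ("1", 200, "G", "C"), "sample_id": "sample1"},
    {"site": ("1", 300, "C", "A"), "sample_id": "sample2"},
]

rare_sites = sites_passing(annotation_table, max_allele_frequency=0.01)
kept = keep_genotypes_at(genotype_rows, rare_sites)
```

Note that this only works if the annotation values survive ingestion, which is the limitation of -onlyvariants being raised here.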

@fnothaft
Member

Which annotations did you want to import from the ExAC file, though? I think if you import the ExAC file via VcfAnnotation2ADAM, you should then be able to join against genotypes.

@laserson
Contributor Author

Hmm, didn't realize we had that command line tool. Perhaps that's just perfectly sufficient then. I'll close this issue for now.

@fnothaft
Member

Wonderful! There is indeed always a method to the madness.
