From 78df6b2f6573b3cd2807a71ec8950d7dfbc9a65d Mon Sep 17 00:00:00 2001
From: sooheelee
Date: Wed, 30 Jan 2019 17:43:54 -0500
Subject: [PATCH] CalculateGenotypePostiors minor updates to javadoc and logger type (#5601)

* CalculateGenotypePostiors minor updates to javadoc and logger type
- Clarify tool documentation:
- Remove statistical notes and provide link to GATK Article#11074 for background and math
- Consolidate Notes and Caveats sections
- Clarify at top the three different sources of priors and tool behavior regarding these
- Clarify for family priors the tool only considers trio groups
- Add tool order of ingestion of annotations MLEAC vs AC and detail on AN requirement
- Add Laura's comment that recent updates allow the tool to appropriately apply priors to indels and version that does this (4.0.5.0)
- Specifically define that trios are mother-father-offspring
- Change logger.info to logger.warn for situation where trio pedigree file is incomplete
- Note that in this situation, in the absence of other refinement, the results are identical to the input
---
 .../CalculateGenotypePosteriors.java | 53 +++++++++++++------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java b/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java
index e1596ea57e8..59a6af5dee0 100644
--- a/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java
+++ b/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java
@@ -26,30 +26,38 @@
  * Calculate genotype posterior probabilities given family and/or known population genotypes
  *
  *

- * This tool calculates the posterior genotype probability for each sample genotype in a VCF of input variant calls,
- * based on the genotype likelihoods from the samples themselves and, optionally, from input VCFs describing allele
- * frequencies in related populations. The input variants must possess genotype likelihoods generated by
- * HaplotypeCaller, UnifiedGenotyper or another source that provides unbiased genotype likelihoods.

+ * The tool calculates the posterior genotype probability for each sample genotype in a given VCF format callset.
+ * The input variants must present genotype likelihoods generated by HaplotypeCaller, UnifiedGenotyper or other
+ * source that provides unbiased genotype likelihoods.

  *
- * Statistical notes
- * The AF field is not used in the calculation as it does not provide a way to estimate the confidence
- * interval or uncertainty around the allele frequency, unlike AN which does provide this necessary information. This
- * uncertainty is modeled by a Dirichlet distribution: that is, the frequency is known up to a Dirichlet distribution
- * with parameters AC1+q,AC2+q,...,(AN-AC1-AC2-...)+q, where "q" is the global frequency prior (typically q << 1). The
- * genotype priors applied then follow a Dirichlet-Multinomial distribution, where 2 alleles per sample are drawn
- * independently. This assumption of independent draws follows from the assumption of Hardy-Weinberg equilibrium (HWE).
- * Thus, HWE is imposed on the likelihoods as a result of CalculateGenotypePosteriors.

+ *

+ * The tool can use priors from three different data sources: (i) one or more supporting germline population callsets
+ * with specific annotation(s) if supplied, (ii) the pedigree for a trio if supplied and if the trio is represented
+ * in the callset under refinement, and/or (iii) the allele counts of the callset samples themselves given at least
+ * ten samples. It is possible to deactivate the contribution of the callset samples with the --ignore-input-samples
+ * flag.
+ *

+ * + *

+ * For more background information and for mathematical details, see GATK forum article at
+ * https://software.broadinstitute.org/gatk/documentation/article?id=11074.
+ * Additional GATK mathematical notes are presented as whitepapers in the gatk GitHub repository docs section
+ * at https://github.com/broadinstitute/gatk/tree/master/docs.
+ *

* *

Inputs

*

*

*

* *

- * Optionally, a collection of VCFs can be provided for the purpose of informing allele frequency priors. Each of
+ * Optionally, a collection of VCFs can be provided for the purpose of informing population allele frequency priors. Each of
  * these resource VCFs must satisfy at least one of the following requirement sets:
  *

  *
@@ -80,6 +88,17 @@
  * For any non-SNP sites in the input callset, flat priors are applied.
  *

* + *

+ * For versions of the tool 4.0.5.0+, the tool appropriately applies priors to indels.
+ *

+ * + *

+ * If applying family priors, only diploid family genotypes are supported. In addition, family priors only apply to
+ * trios represented in both a supplied pedigree and in the callset under refinement. Note, if the pedigree is
+ * incomplete, the tool skips calculating family priors. In this case, and in the absence of other refinement, the
+ * results will be identical to the input.
+ *

+ * *

Usage examples

* *

Refine genotypes based on the discovered allele frequency in an input VCF containing many samples

@@ -258,7 +277,7 @@ public void onTraversalStart() {
         if (!skipFamilyPriors){
             final Set trios = sampleDB.getTrios();
             if(trios.isEmpty()) {
-                logger.info("No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.");
+                logger.warn("No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.");
                 skipFamilyPriors = true;
             }
         }
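
For reference, the statistical notes removed by this patch describe a Dirichlet distribution with parameters AC1+q, AC2+q, ..., (AN-AC1-AC2-...)+q, and genotype priors that follow a Dirichlet-Multinomial distribution with 2 alleles drawn per sample. The sketch below illustrates that description only. It is not the GATK implementation: the class name `GenotypePriorSketch`, the method `diploidPriors`, the placement of the reference allele at index 0, and the VCF-style genotype ordering are all assumptions made for illustration.

```java
/** Illustrative sketch only; NOT the GATK implementation. All names here are hypothetical. */
public class GenotypePriorSketch {

    /**
     * Diploid genotype priors under a Dirichlet-multinomial model with two draws.
     *
     * @param altCounts allele count (AC) for each alternate allele
     * @param an        total allele number (AN)
     * @param q         global frequency prior used as a pseudocount (typically q << 1)
     * @return priors for genotypes j/k with j <= k, stored at index k*(k+1)/2 + j
     *         (allele 0 is assumed to be the reference allele)
     */
    public static double[] diploidPriors(int[] altCounts, int an, double q) {
        int nAlleles = altCounts.length + 1;
        double[] alpha = new double[nAlleles];
        int altTotal = 0;
        for (int i = 0; i < altCounts.length; i++) {
            alpha[i + 1] = altCounts[i] + q;   // ACi + q
            altTotal += altCounts[i];
        }
        if (altTotal > an) {
            throw new IllegalArgumentException("Sum of AC exceeds AN");
        }
        alpha[0] = (an - altTotal) + q;        // (AN - AC1 - AC2 - ...) + q

        double total = 0.0;
        for (double a : alpha) {
            total += a;
        }
        // Normalizer for two draws from a Dirichlet-multinomial: A * (A + 1)
        double denom = total * (total + 1.0);

        double[] priors = new double[nAlleles * (nAlleles + 1) / 2];
        for (int k = 0; k < nAlleles; k++) {
            for (int j = 0; j <= k; j++) {
                int idx = k * (k + 1) / 2 + j;
                priors[idx] = (j == k)
                        ? alpha[k] * (alpha[k] + 1.0) / denom   // homozygous k/k
                        : 2.0 * alpha[j] * alpha[k] / denom;    // heterozygous j/k
            }
        }
        return priors;
    }
}
```

Under these assumptions, `GenotypePriorSketch.diploidPriors(new int[]{3}, 10, 0.001)` returns three priors (hom-ref, het, hom-alt) that sum to 1. A posterior would then be obtained by multiplying each prior by the corresponding genotype likelihood and renormalizing.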