From 78df6b2f6573b3cd2807a71ec8950d7dfbc9a65d Mon Sep 17 00:00:00 2001
From: sooheelee
Date: Wed, 30 Jan 2019 17:43:54 -0500
Subject: [PATCH] CalculateGenotypePosteriors minor updates to javadoc and logger
type (#5601)
* CalculateGenotypePosteriors minor updates to javadoc and logger type
- Clarify tool documentation:
- Remove statistical notes and provide link to GATK Article#11074 for background and math
- Consolidate Notes and Caveats sections
- Clarify at top the three different sources of priors and tool behavior regarding these
- Clarify for family priors the tool only considers trio groups
- Add tool order of ingestion of annotations MLEAC vs AC and detail on AN requirement
- Add Laura's comment that recent updates allow the tool to appropriately apply priors to indels, and name the first version that does this (4.0.5.0)
- Specifically define that trios are mother-father-offspring
- Change logger.info to logger.warn for situation where trio pedigree file is incomplete
- Note that in this situation, in the absence of other refinement, the results are identical to the input
---
.../CalculateGenotypePosteriors.java | 53 +++++++++++++------
1 file changed, 36 insertions(+), 17 deletions(-)
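
Reviewer note: the annotation precedence that the updated javadoc documents (MLEAC is used when available, AC otherwise, and AN is required unless every sample carries a genotype) can be read as the small decision sketched below. This is only an illustration of the documented behavior, not the tool's code: the class AlleleCountChoice, both method names, and the flat Map standing in for a single-alt-allele INFO field are all hypothetical.

```java
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper, not part of GATK: mirrors the documented MLEAC/AC/AN rules. */
final class AlleleCountChoice {

    /**
     * Returns MLEAC when present, otherwise AC, otherwise empty.
     * The map is a simplified stand-in for the INFO field of a site with one alternate allele.
     */
    static OptionalInt pickAlleleCount(Map<String, Integer> info) {
        if (info.containsKey("MLEAC")) {
            return OptionalInt.of(info.get("MLEAC"));
        }
        if (info.containsKey("AC")) {
            return OptionalInt.of(info.get("AC"));
        }
        return OptionalInt.empty();
    }

    /**
     * Assumed reading of "AN is also required unless genotypes are provided for all samples":
     * the requirement is met if AN is annotated or every sample has a genotype.
     */
    static boolean anRequirementMet(Map<String, Integer> info, boolean allSamplesGenotyped) {
        return info.containsKey("AN") || allSamplesGenotyped;
    }
}
```

For a site annotated with both MLEAC and AC, pickAlleleCount returns the MLEAC value; the AC value is only consulted when MLEAC is absent.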
diff --git a/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java b/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java
index e1596ea57e8..59a6af5dee0 100644
--- a/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java
+++ b/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java
@@ -26,30 +26,38 @@
* Calculate genotype posterior probabilities given family and/or known population genotypes
*
*
- * This tool calculates the posterior genotype probability for each sample genotype in a VCF of input variant calls,
- * based on the genotype likelihoods from the samples themselves and, optionally, from input VCFs describing allele
- * frequencies in related populations. The input variants must possess genotype likelihoods generated by
- * HaplotypeCaller, UnifiedGenotyper or another source that provides unbiased genotype likelihoods.
+ * The tool calculates the posterior genotype probability for each sample genotype in a given VCF format callset.
+ * The input variants must present genotype likelihoods generated by HaplotypeCaller, UnifiedGenotyper or another
+ * source that provides unbiased genotype likelihoods.
*
- * Statistical notes
- * The AF field is not used in the calculation as it does not provide a way to estimate the confidence
- * interval or uncertainty around the allele frequency, unlike AN which does provide this necessary information. This
- * uncertainty is modeled by a Dirichlet distribution: that is, the frequency is known up to a Dirichlet distribution
- * with parameters AC1+q,AC2+q,...,(AN-AC1-AC2-...)+q, where "q" is the global frequency prior (typically q << 1). The
- * genotype priors applied then follow a Dirichlet-Multinomial distribution, where 2 alleles per sample are drawn
- * independently. This assumption of independent draws follows from the assumption of Hardy-Weinberg equilibrium (HWE).
- * Thus, HWE is imposed on the likelihoods as a result of CalculateGenotypePosteriors.
+ *
+ * The tool can use priors from three different data sources: (i) one or more supporting germline population callsets
+ * with specific annotation(s) if supplied, (ii) the pedigree for a trio if supplied and if the trio is represented
+ * in the callset under refinement, and/or (iii) the allele counts of the callset samples themselves given at least
+ * ten samples. It is possible to deactivate the contribution of the callset samples with the --ignore-input-samples
+ * flag.
+ *
+ *
+ *
+ * For more background information and for mathematical details, see the GATK forum article at
+ * https://software.broadinstitute.org/gatk/documentation/article?id=11074.
+ * Additional GATK mathematical notes are presented as whitepapers in the gatk GitHub repository docs section
+ * at https://github.com/broadinstitute/gatk/tree/master/docs.
+ *
*
* Inputs
*
*
- * - A VCF with genotype likelihoods, and optionally genotypes, AC/AN fields, or MLEAC/AN fields.
- * - (Optional) A PED pedigree file containing the description of the relationships between individuals.
+ * - A VCF with genotype likelihoods, and optionally genotypes, AC/AN fields, or MLEAC/AN fields.
+ * The tool will use MLEAC if available or AC if MLEAC is not provided. AN is also required unless genotypes are
+ * provided for all samples.
+ * - (Optional) A PED pedigree file containing the description of the relationships between individuals. The
+ * tool considers only trio groups. A trio consists of mother-father-child.
*
*
*
*
- * Optionally, a collection of VCFs can be provided for the purpose of informing allele frequency priors. Each of
+ * Optionally, a collection of VCFs can be provided for the purpose of informing population allele frequency priors. Each of
* these resource VCFs must satisfy at least one of the following requirement sets:
*
*
@@ -65,7 +73,7 @@
* - Genotype posteriors added to the FORMAT fields ("PP")
* - Genotypes and GQ assigned according to these posteriors (note that the original genotype and GQ may change)
* - Per-site genotype priors added to the INFO field ("PG")
- * - (Optional) Per-site, per-trio joint likelihoods (JL) and joint posteriors (JL) given as Phred-scaled probability
+ * - (Optional) Per-site, per-trio joint likelihoods (JL) and joint posteriors (JP) given as Phred-scaled probability
* of all genotypes in the trio being correct based on the PLs for JL and the PPs for JP. These annotations are added to
* the FORMAT fields.
*
@@ -80,6 +88,17 @@
* For any non-SNP sites in the input callset, flat priors are applied.
*
*
+ *
+ * As of version 4.0.5.0, the tool appropriately applies priors to indels.
+ *
+ *
+ *
+ * If applying family priors, only diploid family genotypes are supported. In addition, family priors only apply to
+ * trios represented in both a supplied pedigree and the callset under refinement. Note, if the pedigree is
+ * incomplete, the tool skips calculating family priors. In this case, and in the absence of other refinement, the
+ * results will be identical to the input.
+ *
+ *
* Usage examples
*
* Refine genotypes based on the discovered allele frequency in an input VCF containing many samples
@@ -258,7 +277,7 @@ public void onTraversalStart() {
if (!skipFamilyPriors){
final Set<Trio> trios = sampleDB.getTrios();
if(trios.isEmpty()) {
- logger.info("No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.");
+ logger.warn("No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.");
skipFamilyPriors = true;
}
}
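
Reviewer note: the family-prior caveat in the javadoc above (only mother-father-child trios that appear in both the pedigree and the callset are used, and family priors are skipped when none remain) can be sketched as follows. This is a simplified illustration and not how sampleDB.getTrios() is implemented: the class TrioSelection, the PedRow type, and usableTrios are hypothetical names, and the only PED convention relied on is that "0" marks an unknown parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not GATK code: selects trios usable for family priors. */
final class TrioSelection {

    /** One individual from a PED file; "0" marks an unknown parent, per the PED convention. */
    static final class PedRow {
        final String childId;
        final String fatherId;
        final String motherId;

        PedRow(String childId, String fatherId, String motherId) {
            this.childId = childId;
            this.fatherId = fatherId;
            this.motherId = motherId;
        }
    }

    /** Keeps rows with both parents known and all three members present in the callset. */
    static List<PedRow> usableTrios(List<PedRow> pedigree, Set<String> callsetSamples) {
        List<PedRow> trios = new ArrayList<>();
        for (PedRow row : pedigree) {
            boolean bothParentsKnown = !"0".equals(row.fatherId) && !"0".equals(row.motherId);
            boolean allInCallset = callsetSamples.contains(row.childId)
                    && callsetSamples.contains(row.fatherId)
                    && callsetSamples.contains(row.motherId);
            if (bothParentsKnown && allInCallset) {
                trios.add(row);
            }
        }
        return trios;
    }
}
```

An empty result from usableTrios corresponds to the situation that now logs at warn level: family priors are skipped and, absent other refinement, the output matches the input.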