Commit
CalculateGenotypePosteriors minor updates to javadoc and logger type (#5601)

* CalculateGenotypePosteriors minor updates to javadoc and logger type
- Clarify tool documentation:
    - Remove statistical notes and provide link to GATK Article#11074 for background and math
    - Consolidate Notes and Caveats sections
    - Clarify at top the three different sources of priors and tool behavior regarding these
    - Clarify for family priors the tool only considers trio groups
    - Add tool order of ingestion of annotations MLEAC vs AC and detail on AN requirement
    - Add Laura's comment that recent updates allow the tool to appropriately apply priors to indels, and note the first version that does this (4.0.5.0)
    - Specifically define that trios are mother-father-offspring
- Change logger.info to logger.warn for the situation where the trio pedigree file is incomplete
    - Note that in this situation, in the absence of other refinement, the results are identical to the input
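
The annotation order mentioned above (MLEAC is used when available, otherwise AC, with AN required) can be illustrated with a minimal sketch. This is not GATK code: the class name, the method `chooseAlleleCount`, and the use of a plain `Map` for INFO annotations are assumptions made for illustration, the example is restricted to a single alternate allele, and it ignores the case where AN can be omitted because genotypes are present for all samples.

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical illustration only; not part of the GATK codebase.
class AlleleCountSourceSketch {

    /**
     * Picks the alternate allele count for one biallelic site.
     * Prefers MLEAC, falls back to AC, and returns empty when AN is absent
     * or when neither count annotation is present.
     */
    static OptionalInt chooseAlleleCount(Map<String, Integer> info) {
        if (!info.containsKey("AN")) {
            return OptionalInt.empty();   // AN is needed to interpret any count
        }
        if (info.containsKey("MLEAC")) {
            return OptionalInt.of(info.get("MLEAC"));
        }
        if (info.containsKey("AC")) {
            return OptionalInt.of(info.get("AC"));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Both annotations present: MLEAC is chosen.
        System.out.println(chooseAlleleCount(Map.of("MLEAC", 3, "AC", 5, "AN", 100)));
        // Only AC present: AC is chosen.
        System.out.println(chooseAlleleCount(Map.of("AC", 5, "AN", 100)));
    }
}
```

Returning an empty `OptionalInt` keeps "no usable annotation" distinct from a count of zero, which matters because a zero count with a known AN is still informative.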
sooheelee authored Jan 30, 2019
1 parent 335fac0 commit 78df6b2
Showing 1 changed file with 36 additions and 17 deletions.
@@ -26,30 +26,38 @@
  * Calculate genotype posterior probabilities given family and/or known population genotypes
  *
  * <p>
- * This tool calculates the posterior genotype probability for each sample genotype in a VCF of input variant calls,
- * based on the genotype likelihoods from the samples themselves and, optionally, from input VCFs describing allele
- * frequencies in related populations. The input variants must possess genotype likelihoods generated by
- * HaplotypeCaller, UnifiedGenotyper or another source that provides <b>unbiased</b> genotype likelihoods.</p>
+ * The tool calculates the posterior genotype probability for each sample genotype in a given VCF format callset.
+ * The input variants must present genotype likelihoods generated by HaplotypeCaller, UnifiedGenotyper or other
+ * source that provides unbiased genotype likelihoods.</p>
  *
- * <h4>Statistical notes</h4>
- * <p>The AF field is not used in the calculation as it does not provide a way to estimate the confidence
- * interval or uncertainty around the allele frequency, unlike AN which does provide this necessary information. This
- * uncertainty is modeled by a Dirichlet distribution: that is, the frequency is known up to a Dirichlet distribution
- * with parameters AC1+q,AC2+q,...,(AN-AC1-AC2-...)+q, where "q" is the global frequency prior (typically q << 1). The
- * genotype priors applied then follow a Dirichlet-Multinomial distribution, where 2 alleles per sample are drawn
- * independently. This assumption of independent draws follows from the assumption of Hardy-Weinberg equilibrium (HWE).
- * Thus, HWE is imposed on the likelihoods as a result of CalculateGenotypePosteriors.</p>
+ * <p>
+ * The tool can use priors from three different data sources: (i) one or more supporting germline population callsets
+ * with specific annotation(s) if supplied , (ii) the pedigree for a trio if supplied and if the trio is represented
+ * in the callset under refinement, and/or (iii) the allele counts of the callset samples themselves given at least
+ * ten samples. It is possible to deactivate the contribution of the callset samples with the --ignore-input-samples
+ * flag.
+ * </p>
+ *
+ * <p>
+ * For more background information and for mathematical details, see GATK forum article at
+ * https://software.broadinstitute.org/gatk/documentation/article?id=11074.
+ * Additional GATK mathematical notes are presented as whitepapers in the <i>gatk</i> GitHub repository docs section
+ * at https://github.com/broadinstitute/gatk/tree/master/docs.
+ * </p>
  *
  * <h3>Inputs</h3>
  * <p>
  * <ul>
- * <li>A VCF with genotype likelihoods, and optionally genotypes, AC/AN fields, or MLEAC/AN fields.</li>
- * <li>(Optional) A PED pedigree file containing the description of the relationships between individuals.</li>
+ * <li>A VCF with genotype likelihoods, and optionally genotypes, AC/AN fields, or MLEAC/AN fields.
+ * The tool will use MLEAC if available or AC if MLEAC is not provided. AN is also required unless genotypes are
+ * provided for all samples.</li>
+ * <li>(Optional) A PED pedigree file containing the description of the relationships between individuals. The
+ * tool considers only trio groups. A trio consists of mother-father-child.</li>
  * </ul>
  * </p>
  *
  * <p>
- * Optionally, a collection of VCFs can be provided for the purpose of informing allele frequency priors. Each of
+ * Optionally, a collection of VCFs can be provided for the purpose of informing population allele frequency priors. Each of
  * these resource VCFs must satisfy at least one of the following requirement sets:
  * </p>
  * <ul>
@@ -65,7 +73,7 @@
  * <li>Genotype posteriors added to the FORMAT fields ("PP")</li>
  * <li>Genotypes and GQ assigned according to these posteriors (note that the original genotype and GQ may change)</li>
  * <li>Per-site genotype priors added to the INFO field ("PG")</li>
- * <li>(Optional) Per-site, per-trio joint likelihoods (JL) and joint posteriors (JL) given as Phred-scaled probability
+ * <li>(Optional) Per-site, per-trio joint likelihoods (JL) and joint posteriors (JP) given as Phred-scaled probability
  * of all genotypes in the trio being correct based on the PLs for JL and the PPs for JP. These annotations are added to
  * the FORMAT fields.</li>
  * </ul>
@@ -80,6 +88,17 @@
  * For any non-SNP sites in the input callset, flat priors are applied.
  * </p>
  *
+ * <p>
+ * For versions of the tool 4.0.5.0+, the tool appropriately applies priors to indels.
+ * </p>
+ *
+ * <p>
+ * If applying family priors, only diploid family genotypes are supported. In addition, family priors only apply to
+ * trios represented in both a supplied pedigree and in the callset under refinement. Note, if the pedigree is
+ * incomplete, the tools skips calculating family priors. In this case, and in the absence of other refinement, the
+ * results will be identical to the input.
+ * </p>
+ *
  * <h3>Usage examples</h3>
  *
  * <h4>Refine genotypes based on the discovered allele frequency in an input VCF containing many samples</h4>
@@ -258,7 +277,7 @@ public void onTraversalStart() {
         if (!skipFamilyPriors){
             final Set<Trio> trios = sampleDB.getTrios();
             if(trios.isEmpty()) {
-                logger.info("No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.");
+                logger.warn("No PED file passed or no *non-skipped* trios found in PED file. Skipping family priors.");
                 skipFamilyPriors = true;
             }
         }
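
The "Statistical notes" paragraph removed from the javadoc in this commit describes a Dirichlet distribution with parameters AC1+q, AC2+q, ..., (AN-AC1-AC2-...)+q and genotype priors that follow a Dirichlet-multinomial with two allele draws per sample. A minimal sketch of that arithmetic follows. It uses the textbook Dirichlet-multinomial formulas for two draws and is not taken from the GATK implementation; the class and method names and the example values (AC=10, AN=200, q=0.001) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical worked example; not the GATK implementation.
class DirichletPriorSketch {

    /**
     * Builds the Dirichlet parameters in the order used by the javadoc:
     * one entry per alternate allele (AC_i + q), then the reference allele
     * ((AN - sum of AC_i) + q).
     */
    static double[] dirichletParameters(int[] altAlleleCounts, int alleleNumber, double q) {
        double[] alpha = new double[altAlleleCounts.length + 1];
        int altTotal = 0;
        for (int i = 0; i < altAlleleCounts.length; i++) {
            alpha[i] = altAlleleCounts[i] + q;
            altTotal += altAlleleCounts[i];
        }
        alpha[altAlleleCounts.length] = (alleleNumber - altTotal) + q;
        return alpha;
    }

    /**
     * Dirichlet-multinomial probability of the unordered diploid genotype (i, j):
     * homozygous:   alpha_i * (alpha_i + 1) / (A * (A + 1))
     * heterozygous: 2 * alpha_i * alpha_j   / (A * (A + 1))
     * where A is the sum of all parameters.
     */
    static double genotypePrior(double[] alpha, int i, int j) {
        double total = Arrays.stream(alpha).sum();
        double denominator = total * (total + 1.0);
        if (i == j) {
            return alpha[i] * (alpha[i] + 1.0) / denominator;
        }
        return 2.0 * alpha[i] * alpha[j] / denominator;
    }

    public static void main(String[] args) {
        // Index 0 is the single alternate allele, index 1 is the reference allele.
        double[] alpha = dirichletParameters(new int[] {10}, 200, 0.001);
        System.out.println("hom-alt prior: " + genotypePrior(alpha, 0, 0));
        System.out.println("het prior:     " + genotypePrior(alpha, 0, 1));
        System.out.println("hom-ref prior: " + genotypePrior(alpha, 1, 1));
    }
}
```

The three biallelic genotype priors sum to one, and a larger AN concentrates the prior more tightly around AC/AN. This is the uncertainty information that, per the removed notes, the AF field alone cannot convey.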