Improved implementation of allele-specific new qual #5460

davidbenjamin · 2018-11-29T16:34:47Z

Closes #4833.

This doesn't change the model, just the numerical implementation. It addresses some of the issues holding up #4614. From the discussion there, here is the essence of this PR:

The general rule for avoiding finite precision problems with the qual score is: always calculate probabilities of alleles being absent. The problem with working in terms of alleles being present is that for very good GQs this probability is so close to 1 that quals can become infinite.

In this PR we add up the probabilities of genotypes that don't have the allele, which is small but non-zero and everything works fine.

@ldgauthier care to review? Note that this does not close issue #4614, but it enables a subsequent PR to do so.

…tability and speed

ldgauthier

Looks good, but very cryptic to the uninitiated. Some more docs in GenotypeAlleleCounts would go a long way.

ldgauthier · 2018-12-10T15:24:02Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/GenotypeAlleleCounts.java

@@ -655,6 +656,21 @@ public void forEachAlleleIndexAndCount(final IntBiConsumer action) {
        new IndexRange(0, distinctAlleleCount).forEach(n -> action.accept(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1]));
    }

+    public void forEachAbsentAlleleIndex(final IntConsumer action, final int alleleCount) {


Can you add some docs to the method?

If I understand this correctly, all the alleles in sortedAlleleCounts are required to have counts > 0 (as described in the class docs), so they are all present. Thus we're going to try to apply the action to every allele index between zero and alleleCount, but skip if that allele is present in sortedAlleleCounts. And the index on the sortedAlleleCounts is 2X the index because that array is of the form described above where every other entry is an allele, right?

distinctAlleleCount == sortedAlleleCounts.length/2, right? (Or sortedAlleleCounts.length >> 1 depending on who you ask.) I think it might be clearer to put that comparison in terms of the sortedAlleleCounts length.

After debugging the AlleleFrequencyCalculatorUnitTests I did verify that the sorted array is sorted by allele index. Can you note that somewhere in GenotypeAlleleCounts?

Done and done.

davidbenjamin · 2018-12-10T20:32:55Z

@ldgauthier back to you with docs.

ldgauthier

Docs look good -- one minor comment. Also, would it be easy to add the overflow cases from Sarah's comparison as a regression test or was that too many samples?

ldgauthier · 2018-12-10T21:27:23Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/GenotypeAlleleCounts.java

 * <p>Each allele present in a genotype (count != 0) has a <i>rank</i>, that is the 0-based ordinal of
 * that allele amongst the ones present in the genotype as sorted by their index.</p>
 *
 * <p>For example:</p>
 *
- * <p><b>0/0/2/2</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 2.
+ * <p><b>[0,1,2,1]</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 1.


It would be nice to show the corresponding diploid genotype as well.

Also, would it be easy to add the overflow cases from Sarah's comparison as a regression test or was that too many samples?

Not too many samples, but instead of copying that case I wrote a pedagogical unit test (with just as many samples) that fails spectacularly without the fix and for which you can calculate the correct answer analytically.

codecov-io · 2018-12-11T05:38:25Z

Codecov Report

Merging #5460 into master will increase coverage by 0.004%.
The diff coverage is 97.368%.

@@               Coverage Diff               @@
##              master     #5460       +/-   ##
===============================================
+ Coverage     87.063%   87.068%   +0.004%     
- Complexity     31262     31278       +16     
===============================================
  Files           1922      1922               
  Lines         144312    144344       +32     
  Branches       15918     15922        +4     
===============================================
+ Hits          125643    125677       +34     
+ Misses         12891     12886        -5     
- Partials        5778      5781        +3

Impacted Files	Coverage Δ	Complexity Δ
...rs/genotyper/afcalc/AlleleFrequencyCalculator.java	`85.556% <100%> (+0.328%)`	`31 <4> (+3)`	⬆️
.../tools/walkers/genotyper/GenotypeAlleleCounts.java	`95.187% <100%> (+0.215%)`	`83 <4> (+4)`	⬆️
...yper/afcalc/AlleleFrequencyCalculatorUnitTest.java	`97.546% <100%> (+0.195%)`	`27 <2> (+2)`	⬆️
...alkers/genotyper/GenotypeAlleleCountsUnitTest.java	`87.636% <83.333%> (-0.141%)`	`61 <1> (+1)`
...rs/realignmentfilter/FilterAlignmentArtifacts.java	`88.889% <0%> (-2.914%)`	`20% <0%> (+1%)`
...oadinstitute/hellbender/utils/gcs/BucketUtils.java	`79.878% <0%> (-0.61%)`	`42% <0%> (ø)`
...alignmentfilter/RealignmentArgumentCollection.java	`100% <0%> (ø)`	`1% <0%> (ø)`	⬇️
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java	`60% <0%> (ø)`	`2% <0%> (ø)`	⬇️
...s/walkers/realignmentfilter/RealignmentEngine.java	`93.548% <0%> (+0.215%)`	`36% <0%> (+1%)`	⬆️
...org/broadinstitute/hellbender/utils/MathUtils.java	`79.022% <0%> (+0.611%)`	`213% <0%> (+2%)`	⬆️
... and 1 more

overhauled implementation of allele-specific new qual for numerical s…

d18b922

…tability and speed

davidbenjamin assigned ldgauthier Nov 29, 2018

davidbenjamin requested a review from ldgauthier November 29, 2018 16:34

davidbenjamin mentioned this pull request Dec 5, 2018

Make new qual the default in HaplotypeCaller and GenotypeGVCFs #5484

Merged

ldgauthier requested changes Dec 10, 2018

View reviewed changes

better docs

af2d73c

davidbenjamin force-pushed the db_new_qual branch from 98869fa to af2d73c Compare December 10, 2018 20:32

ldgauthier approved these changes Dec 10, 2018

View reviewed changes

review edits

f2fdf6f

davidbenjamin merged commit a012990 into master Dec 11, 2018

davidbenjamin deleted the db_new_qual branch December 11, 2018 05:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved implementation of allele-specific new qual #5460

Improved implementation of allele-specific new qual #5460

davidbenjamin commented Nov 29, 2018 •

edited

Loading

ldgauthier left a comment

ldgauthier Dec 10, 2018

ldgauthier Dec 10, 2018

davidbenjamin Dec 10, 2018

davidbenjamin commented Dec 10, 2018

ldgauthier left a comment

ldgauthier Dec 10, 2018

davidbenjamin Dec 11, 2018

davidbenjamin Dec 11, 2018

codecov-io commented Dec 11, 2018 •

edited

Loading

Improved implementation of allele-specific new qual #5460

Improved implementation of allele-specific new qual #5460

Conversation

davidbenjamin commented Nov 29, 2018 • edited Loading

ldgauthier left a comment

Choose a reason for hiding this comment

ldgauthier Dec 10, 2018

Choose a reason for hiding this comment

ldgauthier Dec 10, 2018

Choose a reason for hiding this comment

davidbenjamin Dec 10, 2018

Choose a reason for hiding this comment

davidbenjamin commented Dec 10, 2018

ldgauthier left a comment

Choose a reason for hiding this comment

ldgauthier Dec 10, 2018

Choose a reason for hiding this comment

davidbenjamin Dec 11, 2018

Choose a reason for hiding this comment

davidbenjamin Dec 11, 2018

Choose a reason for hiding this comment

codecov-io commented Dec 11, 2018 • edited Loading

Codecov Report

davidbenjamin commented Nov 29, 2018 •

edited

Loading

codecov-io commented Dec 11, 2018 •

edited

Loading