Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved implementation of allele-specific new qual #5460

Merged
merged 3 commits into from
Dec 11, 2018
Merged

Conversation

davidbenjamin
Copy link
Contributor

@davidbenjamin davidbenjamin commented Nov 29, 2018

Closes #4833.

This doesn't change the model, just the numerical implementation. It addresses some of the issues holding up #4614. From the discussion there, here is the essence of this PR:

The general rule for avoiding finite precision problems with the qual score is: always calculate probabilities of alleles being absent. The problem with working in terms of alleles being present is that for very good GQs this probability is so close to 1 that quals can become infinite.

In this PR we add up the probabilities of genotypes that don't have the allele, which is small but non-zero and everything works fine.

@ldgauthier care to review? Note that this does not close issue #4614, but it enables a subsequent PR to do so.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but very cryptic to the uninitiated. Some more docs in GenotypeAlleleCounts would go a long way.

@@ -655,6 +656,21 @@ public void forEachAlleleIndexAndCount(final IntBiConsumer action) {
new IndexRange(0, distinctAlleleCount).forEach(n -> action.accept(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1]));
}

public void forEachAbsentAlleleIndex(final IntConsumer action, final int alleleCount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some docs to the method?

If I understand this correctly, all the alleles in sortedAlleleCounts are required to have counts > 0 (as described in the class docs), so they are all present. Thus we're going to try to apply the action to every allele index between zero and alleleCount, but skip if that allele is present in sortedAlleleCounts. And the index on the sortedAlleleCounts is 2X the index because that array is of the form described above where every other entry is an allele, right?

distinctAlleleCount == sortedAlleleCounts.length/2, right? (Or sortedAlleleCounts.length >> 1 depending on who you ask.) I think it might be clearer to put that comparison in terms of the sortedAlleleCounts length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After debugging the AlleleFrequencyCalculatorUnitTests I did verify that the sorted array is sorted by allele index. Can you note that somewhere in GenotypeAlleleCounts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and done.

@davidbenjamin
Copy link
Contributor Author

@ldgauthier back to you with docs.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs look good -- one minor comment. Also, would it be easy to add the overflow cases from Sarah's comparison as a regression test or was that too many samples?

* <p>Each allele present in a genotype (count != 0) has a <i>rank</i>, that is the 0-based ordinal of
* that allele amongst the ones present in the genotype as sorted by their index.</p>
*
* <p>For example:</p>
*
* <p><b>0/0/2/2</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 2.
* <p><b>[0,1,2,1]</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to show the corresponding diploid genotype as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, would it be easy to add the overflow cases from Sarah's comparison as a regression test or was that too many samples?

Not too many samples, but instead of copying that case I wrote a pedagogical unit test (with just as many samples) that fails spectacularly without the fix and for which you can calculate the correct answer analytically.

@codecov-io
Copy link

codecov-io commented Dec 11, 2018

Codecov Report

Merging #5460 into master will increase coverage by 0.004%.
The diff coverage is 97.368%.

@@               Coverage Diff               @@
##              master     #5460       +/-   ##
===============================================
+ Coverage     87.063%   87.068%   +0.004%     
- Complexity     31262     31278       +16     
===============================================
  Files           1922      1922               
  Lines         144312    144344       +32     
  Branches       15918     15922        +4     
===============================================
+ Hits          125643    125677       +34     
+ Misses         12891     12886        -5     
- Partials        5778      5781        +3
Impacted Files Coverage Δ Complexity Δ
...rs/genotyper/afcalc/AlleleFrequencyCalculator.java 85.556% <100%> (+0.328%) 31 <4> (+3) ⬆️
.../tools/walkers/genotyper/GenotypeAlleleCounts.java 95.187% <100%> (+0.215%) 83 <4> (+4) ⬆️
...yper/afcalc/AlleleFrequencyCalculatorUnitTest.java 97.546% <100%> (+0.195%) 27 <2> (+2) ⬆️
...alkers/genotyper/GenotypeAlleleCountsUnitTest.java 87.636% <83.333%> (-0.141%) 61 <1> (+1)
...rs/realignmentfilter/FilterAlignmentArtifacts.java 88.889% <0%> (-2.914%) 20% <0%> (+1%)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 79.878% <0%> (-0.61%) 42% <0%> (ø)
...alignmentfilter/RealignmentArgumentCollection.java 100% <0%> (ø) 1% <0%> (ø) ⬇️
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...s/walkers/realignmentfilter/RealignmentEngine.java 93.548% <0%> (+0.215%) 36% <0%> (+1%) ⬆️
...org/broadinstitute/hellbender/utils/MathUtils.java 79.022% <0%> (+0.611%) 213% <0%> (+2%) ⬆️
... and 1 more

@davidbenjamin davidbenjamin merged commit a012990 into master Dec 11, 2018
@davidbenjamin davidbenjamin deleted the db_new_qual branch December 11, 2018 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate AS_QD with new QUAL
3 participants