-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved implementation of allele-specific new qual #5460
Conversation
…tability and speed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but very cryptic to the uninitiated. Some more docs in GenotypeAlleleCounts would go a long way.
@@ -655,6 +656,21 @@ public void forEachAlleleIndexAndCount(final IntBiConsumer action) { | |||
new IndexRange(0, distinctAlleleCount).forEach(n -> action.accept(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1])); | |||
} | |||
|
|||
public void forEachAbsentAlleleIndex(final IntConsumer action, final int alleleCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some docs to the method?
If I understand this correctly, all the alleles in sortedAlleleCounts are required to have counts > 0 (as described in the class docs), so they are all present. Thus we're going to try to apply the action to every allele index between zero and alleleCount, but skip if that allele is present in sortedAlleleCounts. And the index on the sortedAlleleCounts is 2X the index because that array is of the form described above where every other entry is an allele, right?
distinctAlleleCount == sortedAlleleCounts.length/2, right? (Or sortedAlleleCounts.length >> 1 depending on who you ask.) I think it might be clearer to put that comparison in terms of the sortedAlleleCounts length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After debugging the AlleleFrequencyCalculatorUnitTests I did verify that the sorted array is sorted by allele index. Can you note that somewhere in GenotypeAlleleCounts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and done.
98869fa
to
af2d73c
Compare
@ldgauthier back to you with docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs look good -- one minor comment. Also, would it be easy to add the overflow cases from Sarah's comparison as a regression test or was that too many samples?
* <p>Each allele present in a genotype (count != 0) has a <i>rank</i>, that is the 0-based ordinal of | ||
* that allele amongst the ones present in the genotype as sorted by their index.</p> | ||
* | ||
* <p>For example:</p> | ||
* | ||
* <p><b>0/0/2/2</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 2. | ||
* <p><b>[0,1,2,1]</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to show the corresponding diploid genotype as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would it be easy to add the overflow cases from Sarah's comparison as a regression test or was that too many samples?
Not too many samples, but instead of copying that case I wrote a pedagogical unit test (with just as many samples) that fails spectacularly without the fix and for which you can calculate the correct answer analytically.
Codecov Report
@@ Coverage Diff @@
## master #5460 +/- ##
===============================================
+ Coverage 87.063% 87.068% +0.004%
- Complexity 31262 31278 +16
===============================================
Files 1922 1922
Lines 144312 144344 +32
Branches 15918 15922 +4
===============================================
+ Hits 125643 125677 +34
+ Misses 12891 12886 -5
- Partials 5778 5781 +3
|
Closes #4833.
This doesn't change the model, just the numerical implementation. It addresses some of the issues holding up #4614. From the discussion there, here is the essence of this PR:
In this PR we add up the probabilities of genotypes that don't have the allele, which is small but non-zero and everything works fine.
@ldgauthier care to review? Note that this does not close issue #4614, but it enables a subsequent PR to do so.