Improved implementation of allele-specific new qual #5460
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,7 @@ | |
import java.util.Arrays; | ||
import java.util.Collections; | ||
import java.util.List; | ||
import java.util.function.IntConsumer; | ||
import java.util.stream.Collectors; | ||
import java.util.stream.IntStream; | ||
|
||
|
@@ -18,19 +19,26 @@ | |
* | ||
* <p>Alleles are represented herein by their indices running from <b>0</b> to <b>N-1</b> where <i>N</i> is the number of alleles.</p> | ||
* | ||
* <p>Genotypes are represented as a single array of alternating alleles and counts, where only alleles with non-zero counts are included: | ||
* [allele 1, count1, allele 2, count2. . .]</p> | ||
* | ||
* <p>Each allele present in a genotype (count != 0) has a <i>rank</i>, that is the 0-based ordinal of | ||
* that allele amongst the ones present in the genotype as sorted by their index.</p> | ||
* | ||
* <p>For example:</p> | ||
* | ||
* <p><b>0/0/2/2</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 2. | ||
* <p><b>[0,1,2,1]</b> has two alleles with indices <b>0</b> and <b>2</b>, both with count 1. | ||
* The rank of <b>0</b> is <i>0</i> whereas the rank of <b>2</b> is <i>1</i>.</p> | ||
* | ||
* <p><b>2/4/4/7</b> has three alleles with indices <b>2</b>, <b>4</b> and <b>7</b>. <b>2</b> and <b>7</b> have count 1 whereas <b>4</b> has count 2. | ||
* <p><b>[2,1,4,2,7,1]</b> has three alleles with indices <b>2</b>, <b>4</b> and <b>7</b>. <b>2</b> and <b>7</b> have count 1 whereas <b>4</b> has count 2. | ||
* The rank of <b>2</b> is <i>0</i>, the rank of <b>4</b> is <i>1</i>, and the rank of <b>7</b> is <i>2</i>.</p> | ||
* | ||
* <p>In contrast, in both examples above, <b>3</b> and <b>10</b> (and many others) are absent and thus have no rank (represented by <i>-1</i> where applicable).</p> | ||
* | ||
* <p><b>[0,0,1,2]</b> is not valid because allele 0 has a count of 0 and should be absent from the array.</p> | ||
* | ||
* <p><b>[1,1,0,1]</b> is not valid because allele 1 comes before allele 0.</p> | ||
* | ||
* <p>{@link GenotypeAlleleCounts} instances have their own index (returned by {@link #index() index()}), which indicates their 0-based ordinal within the possible genotype combinations with the same ploidy.</p> | ||
* | ||
* <p>For example, for ploidy 3:</p> | ||
|
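The alternating allele/count layout and the rank rule described in the class documentation above can be sketched as a small standalone check. This is only an illustration: `SortedAlleleCountsSketch`, `isValid`, `rankOf` and `ploidy` are hypothetical names that do not appear in the PR, and the validity rules are taken from the Javadoc examples (positive counts, alleles in strictly increasing order).

```java
/** Hypothetical helper (not part of the PR): checks and queries the [allele, count, ...] layout. */
final class SortedAlleleCountsSketch {

    /** True if the array alternates allele indices and counts, with strictly increasing alleles and positive counts. */
    static boolean isValid(final int[] sortedAlleleCounts) {
        if (sortedAlleleCounts.length % 2 != 0) {
            return false; // every allele needs a matching count
        }
        int previousAllele = -1;
        for (int i = 0; i < sortedAlleleCounts.length; i += 2) {
            final int allele = sortedAlleleCounts[i];
            final int count = sortedAlleleCounts[i + 1];
            if (allele <= previousAllele || count <= 0) {
                return false; // out of order, repeated, negative, or zero-count allele
            }
            previousAllele = allele;
        }
        return true;
    }

    /** 0-based rank of the allele among those present, or -1 if the allele is absent. */
    static int rankOf(final int[] sortedAlleleCounts, final int allele) {
        for (int rank = 0; 2 * rank < sortedAlleleCounts.length; rank++) {
            if (sortedAlleleCounts[2 * rank] == allele) {
                return rank;
            }
        }
        return -1;
    }

    /** Ploidy is the sum of the counts, which sit at the odd positions. */
    static int ploidy(final int[] sortedAlleleCounts) {
        int sum = 0;
        for (int i = 1; i < sortedAlleleCounts.length; i += 2) {
            sum += sortedAlleleCounts[i];
        }
        return sum;
    }

    public static void main(final String[] args) {
        final int[] genotype = {2, 1, 4, 2, 7, 1}; // 2/4/4/7
        System.out.println(isValid(genotype) + " " + rankOf(genotype, 4) + " " + ploidy(genotype));
    }
}
```

With this sketch, the two invalid arrays from the Javadoc, `[0,0,1,2]` and `[1,1,0,1]`, are both rejected by `isValid`.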
@@ -152,7 +160,7 @@ protected void increase(final int times) { | |
} | ||
|
||
/** | ||
* Updates the genotype counts to match the next genotype. | ||
* Updates the genotype counts to match the next genotype according to the canonical ordering of PLs. | ||
* | ||
* <p> | ||
* This method must not be invoked on cached genotype-allele-counts that are meant to remain constant, | ||
|
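To make "the next genotype according to the canonical ordering of PLs" concrete, here is a minimal sketch of stepping through that ordering. It assumes the ordering meant is the standard VCF genotype order (for diploids: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...) and, for simplicity, holds a genotype as a non-decreasing array of allele indices rather than the allele/count array used by `GenotypeAlleleCounts`. The class and method names are hypothetical, and this is not the body of `increase`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch (not the PR's code): stepping through genotypes in canonical order. */
final class CanonicalGenotypeOrderSketch {

    /** Advances a genotype, held as a non-decreasing array of allele indices, in place to its successor. */
    static void next(final int[] alleles) {
        if (alleles.length == 0) {
            return; // ploidy 0 has a single (empty) genotype
        }
        // find the first position whose allele is strictly below its right neighbour;
        // if there is none, fall through to the last position
        int i = 0;
        while (i < alleles.length - 1 && alleles[i] == alleles[i + 1]) {
            i++;
        }
        alleles[i]++;                  // bump that allele ...
        Arrays.fill(alleles, 0, i, 0); // ... and reset everything to its left to the reference allele
    }

    /** Lists the first {@code howMany} genotypes of the given ploidy as strings such as "0/1". */
    static List<String> firstGenotypes(final int ploidy, final int howMany) {
        final int[] alleles = new int[ploidy]; // starts at the all-reference genotype
        final List<String> result = new ArrayList<>();
        for (int n = 0; n < howMany; n++) {
            result.add(Arrays.stream(alleles).mapToObj(Integer::toString).collect(Collectors.joining("/")));
            next(alleles);
        }
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(firstGenotypes(2, 6)); // [0/0, 0/1, 1/1, 0/2, 1/2, 2/2]
    }
}
```

Under this ordering a genotype's index is simply the number of `next` steps needed to reach it from the all-reference genotype.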
@@ -655,6 +663,31 @@ public void forEachAlleleIndexAndCount(final IntBiConsumer action) { | |
new IndexRange(0, distinctAlleleCount).forEach(n -> action.accept(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1])); | ||
} | ||
|
||
/** | ||
* Perform an action for every allele index not represented in this genotype. For example if the total allele count | ||
* is 4 and {@code sortedAlleleCounts} is [0,1,2,1] then alleles 0 and 2 are present, each with a count of 1, while | ||
* alleles 1 and 3 are absent, so we perform {@code action} on 1 and 3. | ||
*/ | ||
public void forEachAbsentAlleleIndex(final IntConsumer action, final int alleleCount) { | ||
Can you add some docs to the method? If I understand this correctly, all the alleles in sortedAlleleCounts are required to have counts > 0 (as described in the class docs), so they are all present. Thus we're going to try to apply the action to every allele index between zero and alleleCount, but skip if that allele is present in sortedAlleleCounts. And the index on the sortedAlleleCounts is 2X the index because that array is of the form described above where every other entry is an allele, right? distinctAlleleCount == sortedAlleleCounts.length/2, right? (Or sortedAlleleCounts.length >> 1 depending on who you ask.) I think it might be clearer to put that comparison in terms of the sortedAlleleCounts length.
After debugging the AlleleFrequencyCalculatorUnitTests I did verify that the sorted array is sorted by allele index. Can you note that somewhere in GenotypeAlleleCounts?
Done and done. |
||
int presentAlleleIndex = 0; | ||
int presentAllele = sortedAlleleCounts[0]; | ||
|
||
for (int n = 0; n < alleleCount; n++) { | ||
// if we find n in sortedAlleleCounts, it is present, so we move presentAllele to the next | ||
// index in sortedAlleleCounts and skip the allele; otherwise the allele is absent and we perform the action on it. | ||
if (n == presentAllele) { | ||
// if we haven't exhausted all the present alleles, move to the next one. | ||
// Note that distinctAlleleCount == sortedAlleleCounts.length/2 | ||
if (++presentAlleleIndex < distinctAlleleCount) { | ||
// every other entry in sortedAlleleCounts is an allele index; hence we multiply by 2 | ||
presentAllele = sortedAlleleCounts[2 * presentAlleleIndex]; | ||
} | ||
continue; | ||
} | ||
action.accept(n); | ||
} | ||
} | ||
|
||
public double sumOverAlleleIndicesAndCounts(final IntToDoubleBiFunction func) { | ||
return new IndexRange(0, distinctAlleleCount).sum(n -> func.apply(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1])); | ||
} | ||
|
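The walk performed by `forEachAbsentAlleleIndex` can also be phrased directly in terms of the array length, as suggested in the review. The following is a standalone sketch, not the merged code: the class name, the static signature and the `absentAlleles` helper are hypothetical, and the bounds check that also covers an empty array is an addition made here for illustration. It relies on the invariant that `sortedAlleleCounts` is sorted by allele index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical standalone sketch of visiting every allele index that is absent from a genotype. */
final class AbsentAlleleSketch {

    static void forEachAbsentAllele(final int[] sortedAlleleCounts, final int alleleCount, final IntConsumer action) {
        // position of the next present allele; allele indices sit at even positions 0, 2, 4, ...
        int position = 0;
        for (int n = 0; n < alleleCount; n++) {
            if (position < sortedAlleleCounts.length && sortedAlleleCounts[position] == n) {
                position += 2;    // n is present: skip it and move to the next present allele
            } else {
                action.accept(n); // n is absent
            }
        }
    }

    /** Convenience wrapper that collects the absent allele indices into a list. */
    static List<Integer> absentAlleles(final int[] sortedAlleleCounts, final int alleleCount) {
        final List<Integer> result = new ArrayList<>();
        forEachAbsentAllele(sortedAlleleCounts, alleleCount, result::add);
        return result;
    }

    public static void main(final String[] args) {
        // the example from the method's Javadoc: alleles 0 and 2 present, 4 alleles in total
        System.out.println(absentAlleles(new int[]{0, 1, 2, 1}, 4)); // [1, 3]
    }
}
```

Because both the loop variable and the present alleles increase monotonically, a single pass suffices and no lookup structure is needed.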
It would be nice to show the corresponding diploid genotype as well.
done
Not too many samples, but instead of copying that case I wrote a pedagogical unit test (with just as many samples) that fails spectacularly without the fix and for which you can calculate the correct answer analytically.