Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved implementation of allele-specific new qual #5460

Merged
merged 3 commits into from
Dec 11, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

Expand Down Expand Up @@ -655,6 +656,21 @@ public void forEachAlleleIndexAndCount(final IntBiConsumer action) {
new IndexRange(0, distinctAlleleCount).forEach(n -> action.accept(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1]));
}

public void forEachAbsentAlleleIndex(final IntConsumer action, final int alleleCount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some docs to the method?

If I understand this correctly, all the alleles in sortedAlleleCounts are required to have counts > 0 (as described in the class docs), so they are all present. Thus we're going to try to apply the action to every allele index between zero and alleleCount, but skip if that allele is present in sortedAlleleCounts. And the index on the sortedAlleleCounts is 2X the index because that array is of the form described above where every other entry is an allele, right?

distinctAlleleCount == sortedAlleleCounts.length/2, right? (Or sortedAlleleCounts.length >> 1 depending on who you ask.) I think it might be clearer to put that comparison in terms of the sortedAlleleCounts length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After debugging the AlleleFrequencyCalculatorUnitTests I did verify that the sorted array is sorted by allele index. Can you note that somewhere in GenotypeAlleleCounts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and done.

int presentAlleleIndex = 0;
int presentAllele = sortedAlleleCounts[0];

for (int n = 0; n < alleleCount; n++) {
if (n == presentAllele) {
if (++presentAlleleIndex < distinctAlleleCount) {
presentAllele = sortedAlleleCounts[2 * presentAlleleIndex];
}
continue;
}
action.accept(n);
}
}

public double sumOverAlleleIndicesAndCounts(final IntToDoubleBiFunction func) {
return new IndexRange(0, distinctAlleleCount).sum(n -> func.apply(sortedAlleleCounts[2*n], sortedAlleleCounts[2*n+1]));
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.Genotype;
import htsjdk.variant.variantcontext.VariantContext;
import it.unimi.dsi.fastutil.doubles.DoubleArrayList;
import it.unimi.dsi.fastutil.ints.Int2ObjectArrayMap;
import org.apache.commons.math3.util.MathArrays;
import org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeAlleleCounts;
Expand Down Expand Up @@ -85,6 +86,9 @@ public AFCalculationResult getLog10PNonRef(final VariantContext vc, final int de

final boolean spanningDeletionPresent = alleles.contains(Allele.SPAN_DEL);
final Map<Integer, int[]> nonVariantIndicesByPloidy = new Int2ObjectArrayMap<>();

// re-usable buffers of the log10 genotype posteriors of genotypes missing each allele
final List<DoubleArrayList> log10AbsentPosteriors = IntStream.range(0,numAlleles).mapToObj(n -> new DoubleArrayList()).collect(Collectors.toList());
for (final Genotype g : vc.getGenotypes()) {
if (!g.hasLikelihoods()) {
continue;
Expand All @@ -107,26 +111,30 @@ public AFCalculationResult getLog10PNonRef(final VariantContext vc, final int de
log10PNoVariant += Math.min(0,MathUtils.log10SumLog10(nonVariantLog10Posteriors));
}

// per allele non-log space probabilities of zero counts for this sample
// for each allele calculate the total probability of genotypes containing at least one copy of the allele
final double[] log10ProbabilityOfNonZeroAltAlleles = new double[numAlleles];
Arrays.fill(log10ProbabilityOfNonZeroAltAlleles, Double.NEGATIVE_INFINITY);
// if the VC is biallelic the allele-specific qual equals the variant qual
if (numAlleles == 2) {
continue;
}

// for each allele, we collect the log10 probabilities of genotypes in which the allele is absent, then add (in log space)
// to get the log10 probability that the allele is absent in this sample
log10AbsentPosteriors.forEach(DoubleArrayList::clear); // clear the buffers. Note that this is O(1) due to the primitive backing array
for (int genotype = 0; genotype < glCalc.genotypeCount(); genotype++) {
final double log10GenotypePosterior = log10GenotypePosteriors[genotype];
glCalc.genotypeAlleleCountsAt(genotype).forEachAlleleIndexAndCount((alleleIndex, count) ->
log10ProbabilityOfNonZeroAltAlleles[alleleIndex] =
MathUtils.log10SumLog10(log10ProbabilityOfNonZeroAltAlleles[alleleIndex], log10GenotypePosterior));
glCalc.genotypeAlleleCountsAt(genotype).forEachAbsentAlleleIndex(a -> log10AbsentPosteriors.get(a).add(log10GenotypePosterior), numAlleles);
}

for (int allele = 0; allele < numAlleles; allele++) {
// if prob of non hom ref == 1 up to numerical precision, short-circuit to avoid NaN
if (log10ProbabilityOfNonZeroAltAlleles[allele] >= 0) {
log10POfZeroCountsByAllele[allele] = Double.NEGATIVE_INFINITY;
} else {
log10POfZeroCountsByAllele[allele] += MathUtils.log10OneMinusPow10(log10ProbabilityOfNonZeroAltAlleles[allele]);
}
}
final double[] log10PNoAllele = log10AbsentPosteriors.stream()
.mapToDouble(buffer -> MathUtils.log10SumLog10(buffer.toDoubleArray()))
.map(x -> Math.min(0, x)).toArray(); // if prob of non hom ref > 1 due to finite precision, short-circuit to avoid NaN

// multiply the cumulative probabilities of alleles being absent, which is addition of logs
MathUtils.addToArrayInPlace(log10POfZeroCountsByAllele, log10PNoAllele);
}

// for biallelic the allele-specific qual equals the variant qual, and we short-circuited the calculation above
if (numAlleles == 2) {
log10POfZeroCountsByAllele[1] = log10PNoVariant;
}

// unfortunately AFCalculationResult expects integers for the MLE. We really should emit the EM no-integer values
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,7 @@
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.*;

/**
* Test {@link org.broadinstitute.gatk.tools.walkers.genotyper.GenotypeAlleleCounts}
Expand Down Expand Up @@ -184,8 +181,11 @@ private void testPloidyTwoOrMore(final int ploidy) {

GenotypeAlleleCounts current = GenotypeAlleleCounts.first(ploidy);

while (!current.containsAllele(MAXIMUM_ALLELE_INDEX + 1)) {
while (true) {
final GenotypeAlleleCounts next = current.next();
if (next.containsAllele(MAXIMUM_ALLELE_INDEX + 1)) {
break;
}

// test log10CombinationCount
if (ploidy == 2) {
Expand All @@ -203,14 +203,18 @@ private void testPloidyTwoOrMore(final int ploidy) {

//test forEach
final List<Integer> alleleCountsAsList = new ArrayList<>(next.distinctAlleleCount()*2);
final Set<Integer> absentAlleles = new HashSet<>();
next.forEachAlleleIndexAndCount((alleleIndex, alleleCount) -> {
alleleCountsAsList.add(alleleIndex);
alleleCountsAsList.add(alleleCount);});
next.forEachAbsentAlleleIndex(absentAlleles::add, MAXIMUM_ALLELE_INDEX + 1);
final int[] actualAlleleCounts = new int[next.distinctAlleleCount()*2];
next.copyAlleleCounts(actualAlleleCounts, 0);

Assert.assertEquals(alleleCountsAsList.stream().mapToInt(n->n).toArray(), actualAlleleCounts);

Assert.assertEquals(absentAlleles.size(), MAXIMUM_ALLELE_INDEX + 1 - next.distinctAlleleCount());
next.forEachAlleleIndexAndCount((index, count) -> Assert.assertTrue(!absentAlleles.contains(index)));


if (current.distinctAlleleCount() == 1) {
Expand Down