Added a check to ensure previous caching information doesn't affect referenceConfidenceCalls #5911

jamesemery · 2019-05-01T16:48:12Z

I don't doubt that there could be issues caused by reads with previously filled caches. Ultimately this shouldn't have too significant an impact except in very pathological circumstances: highly repetitive regions, or reads that hang beyond a certain length into the next region and happen to have had good-looking indel sites without the CIGAR actually containing any indels for that read. This change should eliminate these circumstances entirely, so we can be sure the cache is clear before every call.

Fixes #5908

…e Indel cache has been cleared from underlying reads

codecov · 2019-05-01T17:21:59Z

Codecov Report

Merging #5911 into master will decrease coverage by 0.017%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##             master     #5911       +/-   ##
==============================================
- Coverage     86.84%   86.823%   -0.018%     
- Complexity    32326     32348       +22     
==============================================
  Files          1991      1993        +2     
  Lines        149342    149470      +128     
  Branches      16482     16505       +23     
==============================================
+ Hits         129689    129774       +85     
- Misses        13646     13679       +33     
- Partials       6007      6017       +10

Impacted Files	Coverage Δ	Complexity Δ
...kers/haplotypecaller/ReferenceConfidenceModel.java	`92.982% <100%> (+0.031%)`	`87 <1> (+1)`	⬆️
...lotypecaller/ReferenceConfidenceModelUnitTest.java	`95.787% <100%> (+0.086%)`	`53 <0> (+3)`	⬆️
...nder/utils/runtime/StreamingProcessController.java	`67.299% <0%> (-0.474%)`	`33% <0%> (ø)`
...transforms/markduplicates/MarkDuplicatesSpark.java	`94.595% <0%> (ø)`	`36% <0%> (ø)`	⬇️
.../tsv/SimpleCSVWriterWrapperWithHeaderUnitTest.java	`48.077% <0%> (ø)`	`7% <0%> (?)`
...nstitute/hellbender/utils/tsv/SimpleXSVWriter.java	`77.273% <0%> (ø)`	`11% <0%> (?)`

droazen

Back to @jamesemery with a question

droazen · 2019-05-03T21:11:06Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModel.java

@@ -189,6 +189,10 @@ public ReferenceConfidenceModel(final SampleList samples,
        final int ploidy = ploidyModel.samplePloidy(0); // the first sample = the only sample in reference-confidence mode.

        final SimpleInterval refSpan = activeRegion.getSpan();
+        // Ensuring that if the underlying reads have any cached indel informativeness data it gets purged before calling the next region.
+        if (USE_CACHED_READ_INDEL_INFORMATIVENESS_VALUES) {
+            readLikelihoods.sampleReads(0).forEach(r -> r.clearTransientAttribute(INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME));


Is sampleReads(0) correct here? Do we have to worry about the multi-sample case in this context?

This behavior is acceptable because the following line appears above: Utils.validateArg(readLikelihoods.numberOfSamples() == 1, () -> "readLikelihoods must contain exactly one sample but it contained " + readLikelihoods.numberOfSamples()); Furthermore, when we create pileups next in AssemblyBasedCallerUtils.getPileupsOverReference(), this is the exact mechanism by which we access the reads.
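
To make the reasoning above concrete, here is a minimal sketch of why indexing sample 0 is safe once a single-sample guard has run. It does not use the GATK classes: `Read`, `clearIndelCache`, `CACHE_KEY`, and the list-of-lists standing in for `readLikelihoods` are all hypothetical stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical stand-ins, not the GATK classes. */
final class SingleSampleCacheClear {
    // Hypothetical attribute name; the real constant lives in ReferenceConfidenceModel.
    static final String CACHE_KEY = "INDEL_INFORMATIVE_BASES_CACHE";

    /** Minimal read that carries transient (in-memory only) attributes. */
    static final class Read {
        private final Map<String, Object> transientAttributes = new HashMap<>();

        void setTransientAttribute(String key, Object value) { transientAttributes.put(key, value); }
        Object getTransientAttribute(String key) { return transientAttributes.get(key); }
        void clearTransientAttribute(String key) { transientAttributes.remove(key); }
    }

    /**
     * Rejects anything other than exactly one sample, then clears the cached
     * attribute on that sample's reads. After the guard, index 0 is the only sample.
     */
    static void clearIndelCache(List<List<Read>> readsBySample) {
        if (readsBySample.size() != 1) {
            throw new IllegalArgumentException(
                "readLikelihoods must contain exactly one sample but it contained " + readsBySample.size());
        }
        readsBySample.get(0).forEach(r -> r.clearTransientAttribute(CACHE_KEY));
    }

    public static void main(String[] args) {
        Read read = new Read();
        read.setTransientAttribute(CACHE_KEY, 7); // pretend an earlier region left a cached value

        List<List<Read>> oneSample = new ArrayList<>();
        oneSample.add(new ArrayList<>(List.of(read)));

        clearIndelCache(oneSample);
        System.out.println("cleared: " + (read.getTransientAttribute(CACHE_KEY) == null));
    }
}
```

In this sketch a second sample would trigger the exception before any read is touched, which is the property the reply relies on.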

jamesemery · 2019-05-03T21:15:56Z

Note to self: move the purging of the cache to the end of execution and test that leftover cache values are never present after invoking referenceConfidenceModel

jamesemery · 2019-05-13T17:31:14Z

@droazen I have pushed the cache removal step down to a more testable point in the code and added the assertion to the existing testing infrastructure. Can you take a quick look at this branch so it can go in at some point?

droazen

Back to @jamesemery with another round of comments.

droazen · 2019-05-14T20:02:51Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModel.java

@@ -217,6 +217,11 @@ public ReferenceConfidenceModel(final SampleList samples,
            }
        }

+        // Ensuring that we remove any indel informativeness data we may have attached to the underlying reads for caching purposes


Add an additional comment explaining briefly why it is necessary to clear the cached data.

droazen · 2019-05-14T20:18:24Z

...roadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModelUnitTest.java

@@ -508,6 +508,10 @@ public void testRefConfidenceBasic(final int nReads, final int extension) {
        final IndependentSampleGenotypesModel genotypingModel = new IndependentSampleGenotypesModel();
        final List<Integer> expectedDPs = Collections.nCopies(data.getActiveRegion().getSpan().size(), nReads);
        final List<VariantContext> contexts = model.calculateRefConfidence(data.getRefHap(), haplotypes, data.getPaddedRefLoc(), data.getActiveRegion(), likelihoods, ploidyModel, calls, false, Collections.emptyList());
+        // Asserting that none of the reads after calculateRefConfidence have indel informativeness caching values attached.
+        for (GATKRead read : data.getActiveRegion().getReads()) {
+            Assert.assertTrue(read.getTransientAttribute(ReferenceConfidenceModel.INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME) == null);


Use Assert.assertNull() here.

Also, should checking for the clearing of this attribute be its own test case, instead of piggybacking on other test cases?

I'm worried about whether the cache gets cleared in every case, so I would prefer to assert that no code path leaves cached values on the reads. That is why I attached this check to all of our tests.
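
The "check after every call" idea can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the code in this branch: `Read`, `informativeBases`, `calculate`, `noCachedValues`, and `CACHE_KEY` are hypothetical names, and the `try`/`finally` is one possible way to guarantee clearing rather than what `calculateRefConfidence` necessarily does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical stand-ins, not the GATK classes. */
final class CacheClearedAfterCall {
    static final String CACHE_KEY = "hypothetical_indel_informative_cache";

    /** Minimal read with a transient attribute map used as a per-read cache. */
    static final class Read {
        final String bases;
        private final Map<String, Object> transientAttributes = new HashMap<>();

        Read(String bases) { this.bases = bases; }

        void setTransientAttribute(String key, Object value) { transientAttributes.put(key, value); }
        Object getTransientAttribute(String key) { return transientAttributes.get(key); }
        void clearTransientAttribute(String key) { transientAttributes.remove(key); }
    }

    /** Stand-in for an expensive per-read computation whose result is cached on the read. */
    static int informativeBases(Read read) {
        Object cached = read.getTransientAttribute(CACHE_KEY);
        if (cached != null) {
            return (Integer) cached;
        }
        int value = read.bases.length(); // placeholder for the real calculation
        read.setTransientAttribute(CACHE_KEY, value);
        return value;
    }

    /** Uses the cache while computing, then clears it on every exit path. */
    static int calculate(List<Read> reads) {
        try {
            int total = 0;
            for (Read r : reads) {
                total += informativeBases(r);
            }
            return total;
        } finally {
            reads.forEach(r -> r.clearTransientAttribute(CACHE_KEY));
        }
    }

    /** The check a test could run after every call: no read still carries a cached value. */
    static boolean noCachedValues(List<Read> reads) {
        return reads.stream().allMatch(r -> r.getTransientAttribute(CACHE_KEY) == null);
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(new Read("ACGT"), new Read("AC"));
        int total = calculate(reads);
        System.out.println("total=" + total + ", cacheEmpty=" + noCachedValues(reads));
    }
}
```

Running the same `noCachedValues` check at the end of each existing test, rather than in one dedicated test, exercises the clearing step across every input those tests already cover.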

droazen · 2019-05-14T20:18:50Z

...roadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModelUnitTest.java

@@ -529,6 +533,10 @@ public void testRefConfidencePartialReads() {
                final List<Integer> expectedDPs = new ArrayList<>(Collections.nCopies(data.getActiveRegion().getSpan().size(), 0));
                for ( int i = start; i < readLen + start; i++ ) expectedDPs.set(i, 1);
                final List<VariantContext> contexts = model.calculateRefConfidence(data.getRefHap(), haplotypes, data.getPaddedRefLoc(), data.getActiveRegion(), likelihoods, ploidyModel, calls);
+                // Asserting that none of the reads after calculateRefConfidence have indel informativeness caching values attached.
+                for (GATKRead read : data.getActiveRegion().getReads()) {
+                    Assert.assertTrue(read.getTransientAttribute(ReferenceConfidenceModel.INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME) == null);


Assert.assertNull()

droazen · 2019-05-14T20:19:40Z

...roadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModelUnitTest.java

@@ -565,6 +573,10 @@ public void testRefConfidenceWithCalls() {

                    final List<Integer> expectedDPs = Collections.nCopies(data.getActiveRegion().getSpan().size(), nReads);
                    final List<VariantContext> contexts = model.calculateRefConfidence(data.getRefHap(), haplotypes, data.getPaddedRefLoc(), data.getActiveRegion(), likelihoods, ploidyModel, calls);
+                    // Asserting that none of the reads after calculateRefConfidence have indel informativeness caching values attached.
+                    for (GATKRead read : data.getActiveRegion().getReads()) {
+                        Assert.assertTrue(read.getTransientAttribute(ReferenceConfidenceModel.INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME) == null);


Assert.assertNull()

jamesemery · 2019-05-14T20:29:21Z

@droazen Responded to your comments. Are there lingering objections to getting this branch in?

droazen

👍 Looks good now -- merging!

jamesemery added 2 commits May 1, 2019 12:40

adding a defensive check to calculateReferenceConfidence to ensure th…

a9d8b15

…e Indel cache has been cleared from underlying reads

adding a defensive check to calculateReferenceConfidence to ensure th…

383cca5

…e Indel cache has been cleared from underlying reads

droazen suggested changes May 3, 2019

droazen assigned jamesemery May 3, 2019

Added a proper test of cache clearing behavior

c556031

droazen suggested changes May 14, 2019

responded to comments

e0a4406

droazen approved these changes May 21, 2019

droazen merged commit 497fef2 into master May 21, 2019

droazen deleted the je_purgeTransientAttributeFieldsBetweenMulticallActiveRegions branch May 21, 2019 14:25
