Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a check to ensure previous caching information doesn't affect referenceConfidenceCalls #5911

Merged

Conversation

jamesemery
Copy link
Collaborator

I don't doubt that there could be issues caused by reads with previously filled caches. Ultimately this shouldn't have too significant an impact except in very pathological circumstances with highly repetitive regions or reads that hang beyond a certain length into the next region and happen to have had good looking indel sites without the cirgar actually containing any indels for that read. This should eliminate any of these circumstances entirely so we can be sure the cache is clear before every call.

Fixes #5908

jamesemery added 2 commits May 1, 2019 12:40
…e Indel cache has been cleared from underlying reads
…e Indel cache has been cleared from underlying reads
@codecov
Copy link

codecov bot commented May 1, 2019

Codecov Report

Merging #5911 into master will decrease coverage by 0.017%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##             master     #5911       +/-   ##
==============================================
- Coverage     86.84%   86.823%   -0.018%     
- Complexity    32326     32348       +22     
==============================================
  Files          1991      1993        +2     
  Lines        149342    149470      +128     
  Branches      16482     16505       +23     
==============================================
+ Hits         129689    129774       +85     
- Misses        13646     13679       +33     
- Partials       6007      6017       +10
Impacted Files Coverage Δ Complexity Δ
...kers/haplotypecaller/ReferenceConfidenceModel.java 92.982% <100%> (+0.031%) 87 <1> (+1) ⬆️
...lotypecaller/ReferenceConfidenceModelUnitTest.java 95.787% <100%> (+0.086%) 53 <0> (+3) ⬆️
...nder/utils/runtime/StreamingProcessController.java 67.299% <0%> (-0.474%) 33% <0%> (ø)
...transforms/markduplicates/MarkDuplicatesSpark.java 94.595% <0%> (ø) 36% <0%> (ø) ⬇️
.../tsv/SimpleCSVWriterWrapperWithHeaderUnitTest.java 48.077% <0%> (ø) 7% <0%> (?)
...nstitute/hellbender/utils/tsv/SimpleXSVWriter.java 77.273% <0%> (ø) 11% <0%> (?)

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to @jamesemery with a question

@@ -189,6 +189,10 @@ public ReferenceConfidenceModel(final SampleList samples,
final int ploidy = ploidyModel.samplePloidy(0); // the first sample = the only sample in reference-confidence mode.

final SimpleInterval refSpan = activeRegion.getSpan();
// Ensuring that if the underlying reads have any cached indel informativeness data it gets purged before calling the next region.
if (USE_CACHED_READ_INDEL_INFORMATIVENESS_VALUES) {
readLikelihoods.sampleReads(0).forEach(r -> r.clearTransientAttribute(INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is sampleReads(0) correct here? Do we have to worry about the multi-sample case in this context?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior is acceptable by virtue of the fact that above there is the following line: Utils.validateArg(readLikelihoods.numberOfSamples() == 1, () -> "readLikelihoods must contain exactly one sample but it contained " + readLikelihoods.numberOfSamples()); Furthermore when we create pileups on the next in AssemblyBasedCallerUtils.getPileupsOverReference() this is the exact mechanism by which we access the reads.

@jamesemery
Copy link
Collaborator Author

Note to self: move the purging of the cache to the end of execution and test that leftover cache values are never present after invoking referenceConfidenceModel

@jamesemery
Copy link
Collaborator Author

@droazen I have pushed the cache removal step down to a more testable point in the code and added the assertion to the existing testing infrastructure. Can you take a quick look at this branch so it can go in at some point?

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to @jamesemery with another round of comments.

@@ -217,6 +217,11 @@ public ReferenceConfidenceModel(final SampleList samples,
}
}

// Ensuring that we remove any indel informativeness data we may have attached to the underlying reads for caching purposes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an additional comment explaining briefly why it is necessary to clear the cached data.

@@ -508,6 +508,10 @@ public void testRefConfidenceBasic(final int nReads, final int extension) {
final IndependentSampleGenotypesModel genotypingModel = new IndependentSampleGenotypesModel();
final List<Integer> expectedDPs = Collections.nCopies(data.getActiveRegion().getSpan().size(), nReads);
final List<VariantContext> contexts = model.calculateRefConfidence(data.getRefHap(), haplotypes, data.getPaddedRefLoc(), data.getActiveRegion(), likelihoods, ploidyModel, calls, false, Collections.emptyList());
// Asserting that none of the reads after calculateRefConfidence have indel informativeness caching values attached.
for (GATKRead read : data.getActiveRegion().getReads()) {
Assert.assertTrue(read.getTransientAttribute(ReferenceConfidenceModel.INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME) == null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Assert.assertNull() here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should checking for the clearing of this attribute be its own test case, instead of piggybacking on other test cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about clearing the cache in every case so I would prefer to assert that it isn't leaving the reads cached in any case, thus I attached this to all of our tests.

@@ -529,6 +533,10 @@ public void testRefConfidencePartialReads() {
final List<Integer> expectedDPs = new ArrayList<>(Collections.nCopies(data.getActiveRegion().getSpan().size(), 0));
for ( int i = start; i < readLen + start; i++ ) expectedDPs.set(i, 1);
final List<VariantContext> contexts = model.calculateRefConfidence(data.getRefHap(), haplotypes, data.getPaddedRefLoc(), data.getActiveRegion(), likelihoods, ploidyModel, calls);
// Asserting that none of the reads after calculateRefConfidence have indel informativeness caching values attached.
for (GATKRead read : data.getActiveRegion().getReads()) {
Assert.assertTrue(read.getTransientAttribute(ReferenceConfidenceModel.INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME) == null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.assertNull()

@@ -565,6 +573,10 @@ public void testRefConfidenceWithCalls() {

final List<Integer> expectedDPs = Collections.nCopies(data.getActiveRegion().getSpan().size(), nReads);
final List<VariantContext> contexts = model.calculateRefConfidence(data.getRefHap(), haplotypes, data.getPaddedRefLoc(), data.getActiveRegion(), likelihoods, ploidyModel, calls);
// Asserting that none of the reads after calculateRefConfidence have indel informativeness caching values attached.
for (GATKRead read : data.getActiveRegion().getReads()) {
Assert.assertTrue(read.getTransientAttribute(ReferenceConfidenceModel.INDEL_INFORMATIVE_BASES_CACHE_ATTRIBUTE_NAME) == null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.assertNull()

@jamesemery
Copy link
Collaborator Author

@droazen Responded to your comments. Are there lingering objections to getting this branch in?

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good now -- merging!

@droazen droazen merged commit 497fef2 into master May 21, 2019
@droazen droazen deleted the je_purgeTransientAttributeFieldsBetweenMulticallActiveRegions branch May 21, 2019 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

value of USE_CACHED_READ_INDEL_INFORMATIVENESS_VALUES changes the result
2 participants