
Memory leak with a simple Cache using only weak keys and weak values #167

Closed
lauredogit opened this issue Jun 26, 2017 · 23 comments

@lauredogit

Hi,

When using a simple Cache built by specifying only .weakKeys() and .weakValues(), where the keys and values are kept alive by other data structures that hold strong references and manage their own lifecycle, we observed a memory leak in production with Caffeine that we had not observed with Guava.

With this configuration, we are using the BoundedLocalCache with a LocalCacheFactory.WI and are creating NodeFactory.WW Nodes.

In the WW Node, there is a WeakValueReference which stores a strong reference to the key.

This strong reference caused a memory leak occupying 2/3 of the heap.
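The pinning effect described here can be sketched with plain `java.lang.ref` types. This is a hypothetical stand-in, not Caffeine's actual class: a strong `keyReference` field on the value's weak reference keeps the key reachable, so the "weak" key can never be reclaimed.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class StrongKeyLeakDemo {
  // Hypothetical stand-in for the buggy WeakValueReference: it stores the
  // raw key strongly instead of the key's own weak reference.
  static final class LeakyValueReference<V> extends WeakReference<V> {
    final Object keyReference; // strong reference to the key

    LeakyValueReference(Object keyReference, V value, ReferenceQueue<V> queue) {
      super(value, queue);
      this.keyReference = keyReference;
    }
  }

  // Returns true when the key survives GC despite only "weak" bookkeeping.
  static boolean keyIsPinned() {
    Object key = new Object();
    WeakReference<Object> weakKey = new WeakReference<>(key);
    LeakyValueReference<String> valueRef =
        new LeakyValueReference<>(key, "value", new ReferenceQueue<>());

    key = null;   // drop the caller's strong reference to the key
    System.gc();  // a hint, but the key is still strongly reachable

    // valueRef.keyReference keeps the key alive, so weakKey is not cleared.
    boolean pinned = weakKey.get() != null;
    return pinned && valueRef.keyReference != null; // keeps valueRef live here
  }

  public static void main(String[] args) {
    System.out.println(keyIsPinned()); // true: the key leaked
  }
}
```

Because the value reference is itself strongly reachable from the map entry, the strong field forms an unbroken chain to the key.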

We did not set up a Cache#cleanUp task, as we expected our cache to be accessed often enough to perform its maintenance tasks without one.

The Cache gets about 11 writes per second in production, mainly of unique keys, i.e., we are using it as a store rather than as a cache.

I guess our use case does not really fit, and we may need to schedule a cleanUp of the Cache every now and then.

If you need additional details, feel free to ask.

Best regards,
Dominique

@ben-manes
Owner

Thanks for the report and sorry for the inconvenience.

As you noted, NodeFactory.WW holds only weak references:

static class WW<K, V> implements Node<K, V> {
  volatile WeakKeyReference<K> key;
  volatile WeakValueReference<V> value;
  ...

The map should be keyed by node.getKeyReference(), i.e. the weak reference rather than the strong one, and lookups should use Reference.LookupKeyReference to match.
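That keying scheme can be illustrated with simplified stand-ins (not Caffeine's real classes): the stored key is a weak reference whose `equals`/`hashCode` delegate to the referent, and probes use a strongly held lookup wrapper with a compatible contract, so a plain `HashMap` can find the entry.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class WeakKeyLookupDemo {
  // Simplified sketch of a weak key whose equality delegates to the referent.
  static class WeakKeyRef<K> extends WeakReference<K> {
    final int hashCode; // cached so the hash survives referent collection

    WeakKeyRef(K key, ReferenceQueue<K> queue) {
      super(key, queue);
      this.hashCode = key.hashCode();
    }

    @Override public int hashCode() { return hashCode; }

    @Override public boolean equals(Object o) {
      if (o instanceof WeakKeyRef) {
        Object other = ((WeakKeyRef<?>) o).get();
        return (other != null) && other.equals(get());
      } else if (o instanceof LookupKeyRef) {
        return ((LookupKeyRef<?>) o).key.equals(get());
      }
      return false;
    }
  }

  // Simplified sketch of a lookup key: holds the key strongly and is used
  // only for map probes, so no WeakReference is allocated per read.
  static final class LookupKeyRef<K> {
    final K key;

    LookupKeyRef(K key) { this.key = key; }

    @Override public int hashCode() { return key.hashCode(); }

    @Override public boolean equals(Object o) {
      if (o instanceof WeakKeyRef) {
        return key.equals(((WeakKeyRef<?>) o).get());
      } else if (o instanceof LookupKeyRef) {
        return key.equals(((LookupKeyRef<?>) o).key);
      }
      return false;
    }
  }

  static String lookup() {
    Map<Object, String> data = new HashMap<>();
    data.put(new WeakKeyRef<>("alpha", new ReferenceQueue<>()), "v1");
    // The probe matches the stored weak key by referent equality.
    return data.get(new LookupKeyRef<>("alpha"));
  }

  public static void main(String[] args) {
    System.out.println(lookup()); // v1
  }
}
```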

ReferenceTest should be asserting this behavior. Unfortunately, it seems to mostly have value-based tests and not enough weak-key tests. When I quickly hack one of the tests to be key-based, it passes:

@Test(dataProvider = "caches")
@CacheSpec(keys = ReferenceType.WEAK, values = {ReferenceType.STRONG, ReferenceType.WEAK},
    expireAfterAccess = Expire.DISABLED, expireAfterWrite = Expire.DISABLED,
    maximumSize = Maximum.DISABLED, weigher = CacheWeigher.DEFAULT,
    population = Population.FULL, stats = Stats.ENABLED, removalListener = Listener.CONSUMING)
public void put_weak(Cache<Integer, Integer> cache, CacheContext context) {
  cache.put(10000000, context.absentValue());
  context.clear();
  GcFinalization.awaitFullGc();

  long count = context.initialSize() - cache.estimatedSize() + 1;
  if (context.population() != Population.SINGLETON) {
    assertThat(count, is(0L));
  }
  assertThat(cache, hasRemovalNotifications(context, count, RemovalCause.COLLECTED));
  verifyWriter(context, (verifier, writer) -> verifier.deletions(count, RemovalCause.COLLECTED));
}

Can you provide a failing test, or direct me to where the strong reference is being held? I agree that at your write rate an explicit cleanUp shouldn't be necessary. Every write should trigger a cleanUp, and empty WeakReference objects should be cheap even if held a little too long.

@lauredogit
Author

lauredogit commented Jun 26, 2017

Hi Ben,

The strong reference is held by com.github.benmanes.caffeine.cache.References$WeakValueReference:

/**
 * The value in a cache that holds values weakly. This class retains a reference to the key in
 * the advent that the value is reclaimed so that the entry can be removed from the cache in
 * constant time.
 */
static final class WeakValueReference<V> extends WeakReference<V>
    implements InternalReference<V> {
  private final Object keyReference;

  public WeakValueReference(@Nonnull Object keyReference,
      @Nullable V value, @Nullable ReferenceQueue<V> queue) {
    super(value, queue);
    this.keyReference = keyReference;
  }

  @Override
  public Object getKeyReference() {
    return keyReference;
  }

  @Override
  public boolean equals(Object object) {
    return referenceEquals(object);
  }

  @Override
  @SuppressWarnings("PMD.UselessOverridingMethod")
  public int hashCode() {
    return super.hashCode();
  }
}

The keyReference field is a strong reference and prevents the GC from reclaiming the keys, which caused our leak.

If I understood correctly, you need the strong reference for the com.github.benmanes.caffeine.cache.BoundedLocalCache#drainValueReferences method; unfortunately, if the cache does not call this method often enough, the memory leak happens.
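The role of that back-pointer can be sketched with plain `java.lang.ref` types (hypothetical stand-ins, not Caffeine's code): when a value is reclaimed, the reference queue hands back only the Reference object itself, so it must carry its key for the dead entry to be removed in constant time. Here the reclamation is simulated with a manual `enqueue()` so the demo is deterministic.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class DrainDemo {
  // Sketch of a weak value that remembers the key it was stored under.
  static final class ValueRef<V> extends WeakReference<V> {
    final Object keyReference;

    ValueRef(Object keyReference, V value, ReferenceQueue<V> queue) {
      super(value, queue);
      this.keyReference = keyReference;
    }
  }

  // Hypothetical drain loop: each reclaimed value points back at its key,
  // so the dead mapping is removed in O(1) without scanning the map.
  static int drain(Map<Object, ValueRef<?>> data, ReferenceQueue<?> queue) {
    int drained = 0;
    Reference<?> ref;
    while ((ref = queue.poll()) != null) {
      data.remove(((ValueRef<?>) ref).keyReference);
      drained++;
    }
    return drained;
  }

  // Simulates the value being reclaimed by enqueueing it manually.
  static int sizeAfterDrain() {
    ReferenceQueue<String> queue = new ReferenceQueue<>();
    Map<Object, ValueRef<?>> data = new HashMap<>();
    ValueRef<String> ref = new ValueRef<>("key1", "value1", queue);
    data.put("key1", ref);

    ref.enqueue(); // stand-in for the GC clearing and enqueueing the value
    drain(data, queue);
    return data.size();
  }

  public static void main(String[] args) {
    System.out.println(sizeAfterDrain()); // 0: the entry was removed
  }
}
```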

@ben-manes
Owner

Oh, thanks!

I think this was a mistake in the generated code. The keyReference field is an Object so that it can be a WeakKeyReference. The generated code is mistakenly:

public final void setValue(V value, ReferenceQueue<V> referenceQueue) {
  ((Reference<V>) getValueReference()).clear();
  UnsafeAccess.UNSAFE.putObject(this, VALUE_OFFSET, new WeakValueReference<V>(key, value, referenceQueue));
}

I think if we changed this to call getKeyReference() instead of key then it would work.

@ben-manes
Owner

I should clarify that keyReference is either key if strongly held or (should be) a WeakKeyReference if weakly held. I think changing the code generator to account for this will resolve the problem. I'll also add unit tests to better cover this case.

It's morning in PST and I won't be able to get to this until the evening. I'll try to get it fixed and released tonight, if that's okay.

@lauredogit
Author

lauredogit commented Jun 26, 2017

Well, it seems so.

In com.github.benmanes.caffeine.cache.BoundedLocalCache#put(K, V, boolean, boolean), you are creating new Nodes with:

          node = nodeFactory.newNode(key, keyReferenceQueue(),
              value, valueReferenceQueue(), newWeight, now);

and with the WW NodeFactory it does:

  WW {
    <K, V> Node<K, V> newNode(K key, ReferenceQueue<K> keyReferenceQueue, V value,
        ReferenceQueue<V> valueReferenceQueue, int weight, long now) {
      return new WW<>(key, keyReferenceQueue, value, valueReferenceQueue, weight, now);
    }

    <K, V> Node<K, V> newNode(Object keyReference, V value, ReferenceQueue<V> valueReferenceQueue,
        int weight, long now) {
      return new WW<>(keyReference, value, valueReferenceQueue, weight, now);
    }

    <K> Object newLookupKey(K key) {
      return new LookupKeyReference<K>(key);
    }

    <K> Object newReferenceKey(K key, ReferenceQueue<K> referenceQueue) {
      return new WeakKeyReference<K>(key, referenceQueue);
    }
  },

which creates a WW Node:

  static class WW<K, V> implements Node<K, V> {
    protected static final long KEY_OFFSET = UnsafeAccess.objectFieldOffset(WW.class, "key");

    protected static final long VALUE_OFFSET = UnsafeAccess.objectFieldOffset(WW.class, "value");

    volatile WeakKeyReference<K> key;

    volatile WeakValueReference<V> value;

    WW(K key, ReferenceQueue<K> keyReferenceQueue, V value, ReferenceQueue<V> valueReferenceQueue,
        int weight, long now) {
      UnsafeAccess.UNSAFE.putObject(this, KEY_OFFSET, new WeakKeyReference<K>(key, keyReferenceQueue));
      UnsafeAccess.UNSAFE.putObject(this, VALUE_OFFSET, new WeakValueReference<V>(key, value, valueReferenceQueue));
    }

And in the WW Node, the key is currently not being wrapped in a WeakReference in new WeakValueReference<V>(key, value, valueReferenceQueue).

Regarding the issue, thank you very much for your prompt reaction.

But no hurry, you can take your time, we are currently applying a workaround in production.

And congratulations for Caffeine, by the way, it's really nice!

@ben-manes
Owner

Thanks. I haven't had a chance to work on this tonight. I'll have it out by Monday if this week remains hectic, but I'll try to get it resolved sooner. I think most of the work here is writing the missing test cases that assert the failure and the fix.

@ben-manes
Owner

Reminder to self: this is also a bug for weakKeys + softValues, unsurprisingly, since the generated code is parameterized.

@ben-manes
Owner

I have an improved unit test that fails on master and passes with the fix (branch). I need to review all of the test cases in ReferenceTest to ensure they are generalized to capture this scenario.

The constructors are now generated as:

WW(K key, ReferenceQueue<K> keyReferenceQueue, V value, ReferenceQueue<V> valueReferenceQueue,
    int weight, long now) {
  this(new WeakKeyReference<K>(key, keyReferenceQueue), value, valueReferenceQueue, weight, now);
}

WW(Object keyReference, V value, ReferenceQueue<V> valueReferenceQueue, int weight, long now) {
  UnsafeAccess.UNSAFE.putObject(this, KEY_OFFSET, keyReference);
  UnsafeAccess.UNSAFE.putObject(this, VALUE_OFFSET, new WeakValueReference<V>(keyReference, value, valueReferenceQueue));
}
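The effect of this fix can be sketched with plain `java.lang.ref` stand-ins (hypothetical, not Caffeine's actual classes): the value reference now holds the key's WeakReference rather than the raw key, so nothing on the value side pins the key and it becomes collectible once callers drop it.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class FixedChainDemo {
  // Sketch of the corrected wiring: keyReference is itself a WeakReference,
  // so the value-side bookkeeping no longer holds the key strongly.
  static final class FixedValueReference<V> extends WeakReference<V> {
    final WeakReference<?> keyReference;

    FixedValueReference(WeakReference<?> keyReference, V value, ReferenceQueue<V> queue) {
      super(value, queue);
      this.keyReference = keyReference;
    }
  }

  // Returns true when the key is reclaimed once no strong reference remains.
  static boolean keyIsCollectible() {
    Object key = new Object();
    WeakReference<Object> weakKey = new WeakReference<>(key);
    FixedValueReference<String> valueRef =
        new FixedValueReference<>(weakKey, "value", new ReferenceQueue<>());

    key = null; // the only strong reference to the key is gone
    for (int i = 0; i < 10 && weakKey.get() != null; i++) {
      System.gc(); // nudge the collector; typically clears promptly
    }
    return weakKey.get() == null && valueRef.keyReference == weakKey;
  }

  public static void main(String[] args) {
    System.out.println(keyIsCollectible());
  }
}
```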

@lauredogit
Author

Hi Ben,

Great news, this is indeed the needed fix for the WW Node!

As you said, you should also apply the same fix to the WSo Node, which uses weak keys and soft values.

  static class WSo<K, V> implements Node<K, V> {
    protected static final long KEY_OFFSET = UnsafeAccess.objectFieldOffset(WSo.class, "key");

    protected static final long VALUE_OFFSET = UnsafeAccess.objectFieldOffset(WSo.class, "value");

    volatile WeakKeyReference<K> key;

    volatile SoftValueReference<V> value;

    WSo(K key, ReferenceQueue<K> keyReferenceQueue, V value, ReferenceQueue<V> valueReferenceQueue,
        int weight, long now) {
      UnsafeAccess.UNSAFE.putObject(this, KEY_OFFSET, new WeakKeyReference<K>(key, keyReferenceQueue));
      UnsafeAccess.UNSAFE.putObject(this, VALUE_OFFSET, new SoftValueReference<V>(key, value, valueReferenceQueue));
    }

    WSo(Object keyReference, V value, ReferenceQueue<V> valueReferenceQueue, int weight, long now) {
      UnsafeAccess.UNSAFE.putObject(this, KEY_OFFSET, keyReference);
      UnsafeAccess.UNSAFE.putObject(this, VALUE_OFFSET, new SoftValueReference<V>(keyReference, value, valueReferenceQueue));
    }

which should become:

    WSo(K key, ReferenceQueue<K> keyReferenceQueue, V value, ReferenceQueue<V> valueReferenceQueue,
        int weight, long now) {
      this(new WeakKeyReference<K>(key, keyReferenceQueue), value, valueReferenceQueue, weight, now);
    }

@ben-manes
Owner

The code generator change also fixed WSo, matching your suggestion. The new unit tests should cover that case, since they are parameterized to run against all configurations matching the specification. Of course, that only helps when the tests are written and not flawed, which is how this case went unnoticed. (It also results in millions of test executions, which now exceeds the maximum build time on Travis.)

Sorry for the delay; it's been an exhausting week. I'll have this released by Sunday night (PST) so that it is available Monday. Thanks for being patient.

@ben-manes
Owner

I finished the tests (see branch), but there are two tooling issues that I need to address, so sorry, this will take another few days to work out. Specifically, I need to split the Travis build to fit within the time limit, and the latest ErrorProne causes a compilation failure:

/Users/ben/projects/caffeine/caffeine/src/test/java/com/github/benmanes/caffeine/cache/MultiThreadedTest.java:100: error: An unhandled exception was thrown by the Error Prone static analysis plugin.
      (cache, key) -> cache.get(key),
                               ^
     Please report this at https://github.com/google/error-prone/issues/new and include the following:

     error-prone version: 2.0.21
     Stack Trace:
     java.lang.NullPointerException
        at com.google.errorprone.matchers.IsLastStatementInBlock.matches(IsLastStatementInBlock.java:31)

@ben-manes
Owner

Seems ErrorProne was fixed in google/error-prone@b9b492f.

@ben-manes
Owner

Sorry, I spent the holiday fighting with TravisCI and haven't gotten it stable yet. Despite splitting the build into multiple jobs, it still exceeds the timeout. That's probably because they run on the same virtual machine, so the parallelism doesn't get any additional resources. I might need to switch to another CI that offers a longer build time.

@ben-manes
Owner

I think I was able to get Travis under control by reducing the number of concurrent jobs to 1. The build is then split into 3 jobs, each of which can run for up to 50 minutes. The only flaw is that code coverage can only be obtained for one of the jobs, and the largest job is near the limit.

If I am able to get a pass on Travis then I'll release sometime tomorrow.

@ben-manes
Owner

TravisCI support agreed to increase the timeout if their idea didn't work, which sadly it didn't. So I'm waiting to hear back; once the change is applied I can fully validate the build and release.

Sorry for the long delay and noisy responses. Just want to keep you informed that I am actively trying to get this released and not ignoring the problem.

@ben-manes
Owner

@davsclaus - LruCache
@benson-git - ChannelPool
@DavyLandman - RascalExecutionContext

According to a GitHub search, your projects may be affected by this bug. I have not evaluated when it was introduced, so it may have always existed, going unnoticed due to flawed tests. This bug causes entries in caches with weakKeys + weak/softValues to not be collected, due to a strong reference and a failed lookup. As you have not reported this issue, hopefully it is a minor problem in your use cases, but please do upgrade promptly. Sorry for the inconvenience.

A release is pending TravisCI support to enable additional compute time so that I can fully validate the build (suite passes locally). I will update this bug when the release is live.

@phraktle
Contributor

phraktle commented Jul 6, 2017

Just an off-topic remark: wow @ben-manes :) Beyond building the best caching library, your level of attentiveness is exemplary. Thanks for your efforts!

@ben-manes
Owner

Thanks, but my view is the opposite. It's an attempt to show respect and gratitude for the patience and understanding shown toward any frustrations my mistakes induced. I can't earn that generosity, but I can try to show that it is not taken for granted.

@davsclaus

davsclaus commented Jul 6, 2017 via email

@DavyLandman

DavyLandman commented Jul 6, 2017

I have to say, if all library authors were as proactive as @ben-manes, we would be a lot less wary of new dependencies. So thanks for the report; updating today or tomorrow (after the update release).

@ben-manes
Owner

Released 2.5.3. Please remember to upgrade.

Thanks for the kind words.

DavyLandman added a commit to usethesource/vallang that referenced this issue Jul 9, 2017
@yasarhike

yasarhike commented May 30, 2024

Hey Ben, when I set the maximumSize property of the cache, the behaviour is unpredictable. Can you please explain the behaviour?

@ben-manes
Owner

Please open a new issue, @yasarhike. Size eviction does include a bit of randomization to protect against denial-of-service attacks, as predictable behavior is exploitable.

jurgenvinju pushed a commit to usethesource/java-air that referenced this issue Mar 4, 2025