
LoadingCache no longer respects maximumSize in 2.6.1 #225

Closed · panghy opened this issue Mar 1, 2018 · 28 comments

panghy commented Mar 1, 2018

This seems pretty serious: since upgrading from 2.6.0 to 2.6.1, we've noticed that the maximum size of a cache is no longer respected (for a synchronous loading cache). Attached is a graph of stats we report from a cache that has a maximum size of 100k elements.

[screenshot: graph of cache.estimatedSize() over time, captured 2018-03-01 12:24 PM, showing the jump after the 2.6.1 deploy]

As the graph shows, before 2.6.1 was deployed the cache stayed at a max size of 100k; after 2.6.1, it climbed into the millions (almost 200M). The size is simply cache.estimatedSize().

The code is as vanilla as it gets:

Caffeine.newBuilder()
    .maximumSize(100000)
    .build(key -> {...});

panghy commented Mar 1, 2018

Trying to inspect the commits since 2.6.0 to see what could have been the cause.

ben-manes (Owner) commented:

I don't recall changing anything in the eviction logic and the EvictionTest#evict passes. Can we get a test case showing it failing? Note that by default eviction is asynchronous so a test either needs to call cleanUp(), use executor(Runnable::run), or wait (e.g. using Awaitility).
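For example, a reproduction along these lines would do (the class name and trivial loader are placeholders; the 100k bound matches the report above). It drains the pending maintenance before printing the size:

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class MaximumSizeRepro {
  public static void main(String[] args) {
    LoadingCache<Integer, Integer> cache = Caffeine.newBuilder()
        .maximumSize(100_000)
        .build(key -> key);  // trivial loader, stands in for the real one

    for (int i = 0; i < 1_000_000; i++) {
      cache.get(i);
    }
    cache.cleanUp();  // run the pending (normally asynchronous) maintenance now
    System.out.println(cache.estimatedSize());  // should settle at or near 100,000
  }
}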

panghy commented Mar 1, 2018

This is actually in production so it seems odd that it stopped evicting entries after days.

panghy commented Mar 1, 2018

But anyhow, looking into a test case.

ben-manes (Owner) commented:

Yeah, even if ForkJoinPool is too busy to do our work, the write buffer should force a clean-up due to back pressure. The maintenance work replays the writes onto the policy and evicts, but the write buffer has a maximum size. When full, it will block on the eviction lock as a fallback for a thrashing or otherwise bad system state. Since there were many writes and your threads are not all blocked, it's not obvious.

Can you get a thread dump just to check?

panghy commented Mar 1, 2018

Unfortunately, I can see commonPool workers awaiting work.

ben-manes (Owner) commented:

I do know many people were burned by the Linux futex bug, which broke locking everywhere. They of course decided not to inform anyone, it went out into many distributions, and it has been a quiet nightmare. If you are on an older kernel then you could be affected.

panghy commented Mar 1, 2018

We are on 4.4.0-1050-aws (Ubuntu); not sure if we upgraded it together with the 2.6.1 rollout (also Java 1.8.0_161).

ben-manes (Owner) commented:

Ubuntu 16.04 should be fine, but earlier were broken. What does the AWS version map to?

panghy commented Mar 1, 2018

We might have also switched instance types as part of the roll-out (m4 -> m5, which is a new AWS base image), so there are some additional variables. Trying to generate a small test case to run directly on that hardware.

ben-manes (Owner) commented:

According to Gil, the issue started being observed on Haswell-based processors. So if you upgraded to that class, you should also get onto a newish kernel. I am only making wild guesses based on the limited information we have so far, though.

panghy commented Mar 1, 2018

Just rolled a very simple test that triggers loading from a cache with a max size of 100k in a loop a million times and prints out estimatedSize(). It works both on my mac and on the production machine that's seeing the issue. :(

panghy commented Mar 1, 2018

Just for some reference, these are the top jmap entries:

   1:      16979066      950827696  [The object that's cached, it's both present as a key and in values]
   2:       7559480      665234240  java.util.concurrent.ConcurrentHashMap$TreeNode
   3:       3824275      438191872  [Ljava.lang.Object;
   4:       7770486      435147216  com.github.benmanes.caffeine.cache.PSMS

panghy commented Mar 1, 2018

We found the issue: this happens if the key is not stable (its hashCode and equality change). Sorry for the false alarm, but it's interesting to note that, as we saw, a size-limited cache behaves this way if keys are changed under the hood.
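To illustrate for anyone else who hits this, a contrived sketch (MutableKey and the demo class are made up for the example, not our actual code): if a key's hashCode changes after insertion, the cache can no longer find the entry to evict it, so it grows well past the bound.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

// A deliberately broken key: equals/hashCode depend on a mutable field.
class MutableKey {
  int id;
  MutableKey(int id) { this.id = id; }
  @Override public boolean equals(Object o) {
    return (o instanceof MutableKey) && ((MutableKey) o).id == id;
  }
  @Override public int hashCode() { return id; }
}

public class UnstableKeyDemo {
  public static void main(String[] args) {
    LoadingCache<MutableKey, Integer> cache = Caffeine.newBuilder()
        .maximumSize(100)
        .executor(Runnable::run)  // keep eviction on the calling thread for the demo
        .build(key -> key.id);

    for (int i = 0; i < 10_000; i++) {
      MutableKey key = new MutableKey(i);
      cache.get(key);
      key.id = i + 10_000;  // mutating the key after insertion breaks lookup and eviction
    }
    cache.cleanUp();
    System.out.println(cache.estimatedSize());  // likely far above the bound of 100
  }
}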

panghy closed this as completed Mar 1, 2018

ben-manes (Owner) commented:

Oh, wonderful!

Yes, unfortunately there is no way for us to detect that without forking the underlying hash table to add a check. The eviction's removal can be unsuccessful due to a concurrent cache.invalidate(key), so we cannot tell whether it failed because of a user bug or a concurrent operation.

robertotru commented Jun 24, 2018

Hello folks.

I have the same issue in version 2.6.2 and I'm not sure if it is due to issues with hash codes or equality as mentioned above.

I've created a very naive example here: https://github.com/robertotru/caffeine-max-size

The key in this example has final fields and equals/hashCode have been autogenerated by IntelliJ using JDK7 standards.

Maybe I'm missing something there, e.g. am I doing something wrong? (See another answer here: #214.)

ben-manes (Owner) commented:

By default the cache defers eviction to ForkJoinPool, which appears to be taking longer than expected to evict the entries. Alternatively you can use executor(Runnable::run) to amortize the work onto the calling threads, like Guava does. Then your test passes.
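i.e. something along these lines (the key/value types here are placeholders):

LoadingCache<String, String> cache = Caffeine.newBuilder()
    .maximumSize(100)
    .executor(Runnable::run)  // maintenance and eviction run on the calling thread
    .build(key -> key.toUpperCase());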

ben-manes (Owner) commented:

It is odd, though. When I switch to Integer keys the test passes without a hitch. So I'm not sure why your key class is more problematic.

robertotru commented:

@ben-manes Thanks for the answer.
I see your point.

Generally speaking, the cache should be instantiated as a class field, and to unit test the class I should not change the way I instantiate the cache. So, to test it using executor(Runnable::run), I should either always use that option (which is bad in production, right?) or resort to some hackish solution that I don't find clean (e.g. a configuration option to set the executor only for testing purposes, or replacing the cache via reflection). I try to avoid changing the code only for testing purposes.

What's your opinion/suggestion on that?

ben-manes (Owner) commented:

Ah, I see. The reason is that maintenance work occurs on a cache access. When the work is deferred to the FJP, the task can run while new work is still being added. This leaves the cache in a dirty state (the new work arrives after the task has processed its batch), so the next access will trigger a new task to be scheduled and the remaining items evicted. When running your test with a removal listener, we see it quickly evict and then do nothing while your test sleeps. If we add a simple read, then it will evict the remaining items.

Potentially the maintenance task could reschedule itself on the executor immediately. That would probably mask this issue.

ben-manes (Owner) commented:

For testing, it is easier not to deal with asynchronous logic and to use a same-thread executor. If you use dependency injection, it is simple to set up.
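e.g. a rough sketch (UserStore and the loader are made up for illustration): the executor is injected, so production keeps the default ForkJoinPool behavior while tests pass Runnable::run.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.Executor;

class UserStore {
  private final LoadingCache<Long, String> cache;

  UserStore(Executor cacheExecutor) {
    this.cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .executor(cacheExecutor)
        .build(id -> loadUser(id));
  }

  String getUser(long id) { return cache.get(id); }

  private String loadUser(long id) { return "user-" + id; }  // stand-in for the real loader
}

// production wiring: new UserStore(java.util.concurrent.ForkJoinPool.commonPool())
// unit test wiring:  new UserStore(Runnable::run)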

For production, the cost isn't that bad. But if you use an expensive removal listener, the caller is penalized. To provide a small and consistent response time, it can be beneficial to defer the work to an executor. Since it's available, we take advantage of it.

I think it wouldn't hurt to more aggressively retrigger the maintenance work if it detects a new run is needed after it completes, rather than defer to the next cache read or write. That won't change much in production, but it avoids confusing cases like the one you saw.

ben-manes (Owner) commented Jun 24, 2018

For clarity, changing your test to the following passes, because the read re-triggers the maintenance work left over from its previous run (not all items had been evicted yet).

// TimeUnit.SECONDS.sleep(5);
TimeUnit.SECONDS.sleep(1);
cache.getIfPresent(cache.asMap().keySet().iterator().next());       
TimeUnit.SECONDS.sleep(1);

robertotru commented Jun 24, 2018

Ah! Interesting.
I totally agree with:

"I think it wouldn't hurt to more aggressively retrigger the maintenance work if it detects a new run is needed after it completes, rather than defer to the next cache read or write."

I did not check your implementation (my fault), but in principle couldn't one check the size at the end of a cache.get() and, if it exceeds the max, re-trigger the maintenance? Or would this be too expensive? Maybe wrapping the return in a try-finally and executing the check in the finally block, roughly as in the sketch below.
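Roughly something like this, I mean (a sketch; the cache field, key type, and the 100_000 bound are placeholders):

String getCached(long id) {
  try {
    return cache.get(id);
  } finally {
    // caller-side check: re-trigger the maintenance if the estimated size
    // has drifted past the configured bound
    if (cache.estimatedSize() > 100_000) {
      cache.cleanUp();
    }
  }
}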

ben-manes (Owner) commented:

Yes, something like that. There is a state machine, set to REQUIRED when the task is idle and another clean-up is necessary. The task could check and reschedule itself, assuming it is not a custom executor. If a custom executor is used, then I'd prefer not doing this because it might over-penalize a caller, e.g. a write storm causing a single user-facing thread to do all of the work. Something like this shouldn't be hard.

robertotru commented:

@ben-manes so, let's say I hack the cache getters inside my class so that I manually retrigger the cleanup when the size reaches the fixed threshold.
I'll stay tuned in case you change something to make this behavior more "reliable" in the near future.

Thanks a lot for the quick answer.

ben-manes (Owner) commented:

You can call cache.cleanUp() if you need to manually retrigger it. Or in your example the same-thread executor shouldn't be a performance problem and you can just use that.

ben-manes (Owner) commented:

Added the prompt rescheduling. Still working on 2.7, but haven't had the time to wrap up those tests and changes.

ben-manes (Owner) commented:

Released 2.7
