
LoadingCache no longer respects maximumSize in 2.6.1 #225

Closed · panghy opened this issue Mar 1, 2018 · 28 comments

panghy commented Mar 1, 2018

This seems pretty serious: since upgrading from 2.6.0 to 2.6.1, we've noticed that the maximum size of a cache is no longer respected (for a synchronous loading cache). Attached is a graph of stats we report from a cache that has a maximum size of 100k elements.

[screenshot: graph of cache.estimatedSize() over time, captured 2018-03-01 12:24 PM, showing the jump after the 2.6.1 deploy]

As the graph shows, before 2.6.1 was deployed the cache stayed at a max size of 100k; after 2.6.1, it climbed into the millions (almost 200M). The size is simply cache.estimatedSize().

The code is as vanilla as it gets:

Caffeine.newBuilder()
    .maximumSize(100000)
    .build(key -> {...});

panghy commented Mar 1, 2018

Trying to inspect the commits since 2.6.0 to see what could have been the cause.

ben-manes (Owner) commented:

I don't recall changing anything in the eviction logic and the EvictionTest#evict passes. Can we get a test case showing it failing? Note that by default eviction is asynchronous so a test either needs to call cleanUp(), use executor(Runnable::run), or wait (e.g. using Awaitility).
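For example, a reproduction along these lines would do (the class name and trivial loader are placeholders; the 100k bound matches the report above). It drains the pending maintenance before printing the size:

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class MaximumSizeRepro {
  public static void main(String[] args) {
    LoadingCache<Integer, Integer> cache = Caffeine.newBuilder()
        .maximumSize(100_000)
        .build(key -> key);  // trivial loader, stands in for the real one

    for (int i = 0; i < 1_000_000; i++) {
      cache.get(i);
    }
    cache.cleanUp();  // run the pending (normally asynchronous) maintenance now
    System.out.println(cache.estimatedSize());  // should settle at or near 100,000
  }
}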

panghy commented Mar 1, 2018

This is actually in production so it seems odd that it stopped evicting entries after days.

panghy commented Mar 1, 2018

But anyhow, looking into a test case.

ben-manes (Owner) commented:

Yeah, even if ForkJoinPool is too busy to do our work, the write buffer should force a clean-up due to back pressure. The maintenance work replays the writes onto the policy and evicts, but the write buffer has a maximum size. When full, it will block on the eviction lock as a fallback for a thrashing or otherwise bad system state. Since there were many writes and your threads are not all blocked, it's not obvious.

Can you get a thread dump just to check?

panghy commented Mar 1, 2018

Unfortunately, I can see commonPool workers awaiting work.

ben-manes (Owner) commented:

I do know many people were burned by the Linux futex bug, which broke locking everywhere. They of course decided not to inform anyone, it went out into many distributions, and it has been a quiet nightmare. If you are on an older kernel then you could be affected.

panghy commented Mar 1, 2018

We are on 4.4.0-1050-aws (Ubuntu); not sure if we upgraded it together with the 2.6.1 rollout (also Java 1.8.0_161).

ben-manes (Owner) commented:

Ubuntu 16.04 should be fine, but earlier were broken. What does the AWS version map to?

panghy commented Mar 1, 2018

We might have also switched instance types as part of the roll-out (m4 -> m5, which is a new AWS base image), so there are some additional variables. Trying to generate a small test case to run directly on that hardware.

ben-manes (Owner) commented:

According to Gil, the issue started being observed on Haswell-based processors. So if you upgraded to that class, you should also get onto a newish kernel. I am only making wild guesses based on the limited information we have so far, though.

panghy commented Mar 1, 2018

Just rolled a very simple test that triggers loading from a cache with a max size of 100k in a loop a million times and prints out estimatedSize(). It works both on my mac and on the production machine that's seeing the issue. :(

panghy commented Mar 1, 2018

Just for some reference, these are the top jmap entries:

   1:      16979066      950827696  [The object that's cached, it's both present as a key and in values]
   2:       7559480      665234240  java.util.concurrent.ConcurrentHashMap$TreeNode
   3:       3824275      438191872  [Ljava.lang.Object;
   4:       7770486      435147216  com.github.benmanes.caffeine.cache.PSMS

panghy commented Mar 1, 2018

We found the issue: this happens if the key is not stable (its hashCode and equality change). Sorry for the false alarm, but it's interesting to note that, as we saw, a size-limited cache behaves this way if keys are changed under the hood.
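To illustrate for anyone else who hits this, a contrived sketch (MutableKey and the demo class are made up for the example, not our actual code): if a key's hashCode changes after insertion, the cache can no longer find the entry to evict it, so it grows well past the bound.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

// A deliberately broken key: equals/hashCode depend on a mutable field.
class MutableKey {
  int id;
  MutableKey(int id) { this.id = id; }
  @Override public boolean equals(Object o) {
    return (o instanceof MutableKey) && ((MutableKey) o).id == id;
  }
  @Override public int hashCode() { return id; }
}

public class UnstableKeyDemo {
  public static void main(String[] args) {
    LoadingCache<MutableKey, Integer> cache = Caffeine.newBuilder()
        .maximumSize(100)
        .executor(Runnable::run)  // keep eviction on the calling thread for the demo
        .build(key -> key.id);

    for (int i = 0; i < 10_000; i++) {
      MutableKey key = new MutableKey(i);
      cache.get(key);
      key.id = i + 10_000;  // mutating the key after insertion breaks lookup and eviction
    }
    cache.cleanUp();
    System.out.println(cache.estimatedSize());  // likely far above the bound of 100
  }
}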

panghy closed this as completed Mar 1, 2018

ben-manes (Owner) commented:

Oh, wonderful!

Yes, unfortunately there is no way for us to detect that without forking the underlying hash table to add a check. The eviction's removal can be unsuccessful due to a concurrent cache.invalidate(key), so we cannot tell whether it failed because of a user bug or a concurrent operation.

robertotru commented Jun 24, 2018

Hello folks.

I have the same issue in version 2.6.2 and I'm not sure if it is due to issues with hash codes or equality as mentioned above.

I've created a very naive example here: https://github.com/robertotru/caffeine-max-size

The key in this example has final fields and equals/hashCode have been autogenerated by IntelliJ using JDK7 standards.

Maybe I'm missing something there, e.g. am I doing something wrong? (See another answer here: #214.)

ben-manes (Owner) commented:

By default the cache defers eviction to ForkJoinPool, which appears to be taking longer than expected to evict the entries. Alternatively you can use executor(Runnable::run) to amortize the work onto the calling threads, like Guava does. Then your test passes.
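i.e. something along these lines (the key/value types here are placeholders):

LoadingCache<String, String> cache = Caffeine.newBuilder()
    .maximumSize(100)
    .executor(Runnable::run)  // maintenance and eviction run on the calling thread
    .build(key -> key.toUpperCase());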

ben-manes (Owner) commented:

It is odd, though. When I switch to Integer keys the test passes without a hitch. So I'm not sure why your key class is more problematic.

robertotru commented:

@ben-manes Thanks for the answer.
I see your point.

Generally speaking, the cache should be instantiated as a class field, and to unit test the class I should not change the way I instantiate the cache. So, to test it using executor(Runnable::run), I should either always use that option (which is bad in production, right?) or resort to some hackish solution that I don't find clean (e.g. a configuration option to set the executor only for testing purposes, or replacing the cache via reflection). I try to avoid changing the code only for testing purposes.

What's your opinion/suggestion on that?

ben-manes (Owner) commented:

Ah, I see. The reason is that maintenance work occurs on a cache access. When the work is deferred to the FJP, the task can run while new work is still being added. This leaves the cache in a dirty state (the new work arrives after the task has processed its batch), so the next access will trigger a new task to be scheduled and the remaining items evicted. When running your test with a removal listener, we see it quickly evict and then do nothing while your test sleeps. If we add a simple read, then it will evict the remaining items.

Potentially the maintenance task could reschedule itself on the executor immediately. That would probably mask this issue.

ben-manes (Owner) commented:

For testing, it is easier not to deal with asynchronous logic and to use a same-thread executor. If you use dependency injection, it is simple to set up.
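e.g. a rough sketch (UserStore and the loader are made up for illustration): the executor is injected, so production keeps the default ForkJoinPool behavior while tests pass Runnable::run.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.Executor;

class UserStore {
  private final LoadingCache<Long, String> cache;

  UserStore(Executor cacheExecutor) {
    this.cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .executor(cacheExecutor)
        .build(id -> loadUser(id));
  }

  String getUser(long id) { return cache.get(id); }

  private String loadUser(long id) { return "user-" + id; }  // stand-in for the real loader
}

// production wiring: new UserStore(java.util.concurrent.ForkJoinPool.commonPool())
// unit test wiring:  new UserStore(Runnable::run)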

For production, the cost isn't that bad. But if you use an expensive removal listener, the caller is penalized. To provide a small and consistent response time, it can be beneficial to defer the work to an executor. Since it's available, we take advantage of it.

I think it wouldn't hurt to more aggressively retrigger the maintenance work if it detects a new run is needed after it completes, rather than defer to the next cache read or write. That won't change much in production, but it avoids confusing cases like the one you saw.

ben-manes (Owner) commented Jun 24, 2018

For clarity, changing your test to the following passes, because the read re-triggers the maintenance work left over from its previous run (not all items had been evicted yet).

// TimeUnit.SECONDS.sleep(5);
TimeUnit.SECONDS.sleep(1);
cache.getIfPresent(cache.asMap().keySet().iterator().next());       
TimeUnit.SECONDS.sleep(1);

robertotru commented Jun 24, 2018

Ah! Interesting.
I totally agree with:

"I think it wouldn't hurt to more aggressively retrigger the maintenance work if it detects a new run is needed after it completes, rather than defer to the next cache read or write."

I did not check your implementation (my fault), but in principle couldn't one check the size at the end of a cache.get() and, if it exceeds the max, re-trigger the maintenance? Or would this be too expensive? Maybe wrapping the return in a try-finally and executing the check in the finally block, roughly as in the sketch below.
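Roughly something like this, I mean (a sketch; the cache field, key type, and the 100_000 bound are placeholders):

String getCached(long id) {
  try {
    return cache.get(id);
  } finally {
    // caller-side check: re-trigger the maintenance if the estimated size
    // has drifted past the configured bound
    if (cache.estimatedSize() > 100_000) {
      cache.cleanUp();
    }
  }
}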

ben-manes (Owner) commented:

Yes, something like that. There is a state machine, set to REQUIRED when the task is idle and another clean-up is necessary. The task could check and reschedule itself, assuming it is not a custom executor. If a custom executor is used, then I'd prefer not doing this because it might over-penalize a caller, e.g. a write storm causing a single user-facing thread to do all of the work. Something like this shouldn't be hard.

robertotru commented:

@ben-manes so, let's say I hack the cache getters inside my class so that I manually retrigger the cleanup when the size reaches the fixed threshold.
I'll stay tuned in case you change something to make this behavior more "reliable" in the near future.

Thanks a lot for the quick answer.

ben-manes (Owner) commented:

You can call cache.cleanUp() if you need to manually retrigger it. Or in your example the same-thread executor shouldn't be a performance problem and you can just use that.

ben-manes (Owner) commented:

Added the prompt rescheduling. Still working on 2.7, but haven't had the time to wrap up those tests and changes.

ben-manes (Owner) commented:

Released 2.7
