LoadingCache no longer respects maximumSize in 2.6.1 #225
Comments
Trying to inspect the commits since 2.6.0 to see what could have been the cause.
I don't recall changing anything in the eviction logic, and the...
This is actually in production, so it seems odd that it stopped evicting entries after days.
But anyhow, looking into a test case.
Yeah, even if... Can you get a thread dump just to check?
I can see commonPool workers awaiting work, unfortunately.
I do know many people were burned by the Linux futex bug, which broke locking everywhere. They of course decided not to inform anyone, it went out into many distributions, and has been a quiet nightmare. If you are on an older kernel then you could be affected.
We are on 4.4.0-1050-aws (Ubuntu), not sure if we upgraded together with the 2.6.1 rollout (also Java 1.8.0_161).
Ubuntu 16.04 should be fine, but earlier releases were broken. What does the AWS version map to?
We might have also switched instance types as part of the roll-out (m4 -> m5), which uses a new AWS base image, so there are some additional variables. Trying to generate a small test case to just test on the bare metal.
According to Gil, the issue started being observed on Haswell-based processors. So if you upgraded to that class, you should also get onto a newish kernel. I am only making wild guesses based on the limited information we have so far, though.
Just rolled a very simple test that triggers loading from a cache of max size 100k in a loop a million times and prints out estimatedSize(). This works both on my Mac and on the production machine that's seeing the issue. :(
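
For reference, a minimal sketch of that kind of loop might look like the following; the identity loader and integer keys are placeholders, not the production code.

    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.LoadingCache;

    public class MaxSizeSmokeTest {
      public static void main(String[] args) {
        // Identity loader as a stand-in for the real CacheLoader.
        LoadingCache<Integer, Integer> cache = Caffeine.newBuilder()
            .maximumSize(100_000)
            .build(key -> key);

        for (int i = 0; i < 1_000_000; i++) {
          cache.get(i);
          if (i % 100_000 == 0) {
            System.out.println(i + " -> estimatedSize=" + cache.estimatedSize());
          }
        }
        // Flush any pending maintenance before the final reading.
        cache.cleanUp();
        System.out.println("final estimatedSize=" + cache.estimatedSize());
      }
    }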
Just for some reference, these are the top jmap entries:
We found the issue: this happens if the key is not stable (its hashCode and equality change). Sorry for the false alarm, but it's interesting to note that I guess this is the behavior for a size-limited cache when keys are changed underneath the hood.
Oh, wonderful! Yes, unfortunately there is no way for us to detect that without forking the underlying hash table to add a check. The eviction's removal can be unsuccessful due to a concurrent...
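
To illustrate the failure mode (this is a hypothetical key class, not the reporter's code): if a key's hashCode and equals change after insertion, eviction can no longer locate the entry in the hash table, so the cache can grow past its maximum.

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    class UnstableKeyDemo {
      // Hypothetical unstable key: mutating 'id' after insertion changes
      // hashCode/equals, so removal during eviction cannot find the entry.
      static final class MutableKey {
        int id;
        MutableKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
          return (o instanceof MutableKey) && ((MutableKey) o).id == id;
        }
        @Override public int hashCode() { return id; }
      }

      public static void main(String[] args) {
        Cache<MutableKey, String> cache = Caffeine.newBuilder()
            .maximumSize(10)
            .build();
        for (int i = 0; i < 100; i++) {
          MutableKey key = new MutableKey(i);
          cache.put(key, "value-" + i);
          key.id = i + 1_000; // mutated after insertion: the entry is now "lost"
        }
        cache.cleanUp();
        // May report far more than the configured maximum of 10.
        System.out.println(cache.estimatedSize());
      }
    }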
Hello folks. I have the same issue in version 2.6.2, and I'm not sure if it is due to issues with hash codes or equality as mentioned above. I've created a very naive example here: https://github.com/robertotru/caffeine-max-size. The key in this example has final fields, and equals/hashCode have been autogenerated by IntelliJ using JDK 7 standards. Maybe I'm missing something there, e.g., am I doing something wrong? (See another answer here: #214.)
By default the cache defers eviction to ForkJoinPool, which appears to be taking longer than expected to evacuate the entries. Alternatively you can use a same-thread executor so that the caller performs the maintenance work.
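
For reference, a same-thread executor can be configured on the builder; Runnable::run makes the maintenance (eviction, removal-listener calls) run on the calling thread instead of ForkJoinPool.commonPool(). A sketch:

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    class SameThreadMaintenance {
      static Cache<String, String> newCache() {
        return Caffeine.newBuilder()
            .maximumSize(100_000)
            .executor(Runnable::run) // maintenance runs on the caller's thread
            .build();
      }
    }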
It is odd, though. When I switch to...
@ben-manes Thanks for the answer. Generally speaking, the cache should be instantiated as a class field, and to unit test the class I should not change the way I instantiate the cache. So, probably testing it using a same-thread executor is not really an option. What's your opinion/suggestion on that?
Ah, I see. The reason is that maintenance work occurs due to a cache access. When the work is deferred to the FJP, the task can run while new work is being added. This can leave the cache in a dirty state after the task has processed its work, so the next access will trigger a new task to be scheduled and the remaining items evicted. When running your test with a removal listener, we see it quickly evict and then do nothing while your test sleeps. If we add a simple read, then it will evict the remaining items. Potentially the maintenance task could reschedule itself on the executor immediately; that would probably mask this issue.
For testing, it is easier not to deal with asynchronous logic and to use a same-thread executor. If you use dependency injection, it is simple to set up. For production, the cost isn't that bad, but if you use an expensive removal listener, the caller is penalized. To provide a small and consistent response time it can be beneficial to defer the work to an executor, and since it's available, we take advantage of it. I think it wouldn't hurt to more aggressively re-trigger the maintenance work if it detects a new run is needed after it completes, rather than defer to the next cache read or write. That won't change much in production, but it avoids confusing cases like the one you saw.
For clarity, changing your test to the following passes, due to the maintenance work being re-triggered as not all items were evicted on its previous run:

    // TimeUnit.SECONDS.sleep(5);
    TimeUnit.SECONDS.sleep(1);
    cache.getIfPresent(cache.asMap().keySet().iterator().next());
    TimeUnit.SECONDS.sleep(1);
Ah! Interesting.
I did not check your implementation (my fault), but in principle can't one check the size at the end of a maintenance run?
Yes, something like that. There is a state machine, set to...
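
As an illustration of the "reschedule after draining" idea (the names and states below are illustrative, not Caffeine's actual internals):

    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicReference;

    // Illustrative sketch only: after a drain completes, check whether new work
    // arrived in the meantime and, if so, reschedule immediately instead of
    // waiting for the next cache access.
    class MaintenanceScheduler {
      enum Status { IDLE, REQUIRED, PROCESSING }

      private final AtomicReference<Status> status = new AtomicReference<>(Status.IDLE);
      private final Executor executor;
      private final Runnable maintenance;

      MaintenanceScheduler(Executor executor, Runnable maintenance) {
        this.executor = executor;
        this.maintenance = maintenance;
      }

      // Called on every cache read/write that may have produced work.
      void afterAccess() {
        if (status.getAndSet(Status.REQUIRED) == Status.IDLE) {
          executor.execute(this::drain);
        }
      }

      private void drain() {
        status.set(Status.PROCESSING);
        maintenance.run();
        // If afterAccess() flipped the status back to REQUIRED while we were
        // draining, run again rather than leaving the cache dirty.
        if (!status.compareAndSet(Status.PROCESSING, Status.IDLE)) {
          executor.execute(this::drain);
        }
      }
    }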
@ben-manes So, let's say I hack the cache getters inside my class so that I manually re-trigger the cleanup when the size reaches the fixed threshold. Thanks a lot for the quick answer.
You can call cleanUp() to run the pending maintenance work directly.
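
For example, a wrapper around the getter could force the pending maintenance when the estimate drifts past the configured maximum (a sketch; the threshold check is only there to avoid paying the cleanup cost on every read):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    class CacheHolder {
      private static final long MAX_SIZE = 100_000;

      private final Cache<String, String> cache = Caffeine.newBuilder()
          .maximumSize(MAX_SIZE)
          .build();

      String get(String key) {
        String value = cache.getIfPresent(key);
        // Run the pending maintenance on this thread if the estimate has
        // drifted past the configured maximum.
        if (cache.estimatedSize() > MAX_SIZE) {
          cache.cleanUp();
        }
        return value;
      }
    }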
Added the prompt rescheduling. Still working on 2.7, but haven't had the time to wrap up those tests and changes. |
Released 2.7 |
This seems to be pretty serious: we've noticed that since upgrading from 2.6.0 to 2.6.1, the maximum size of a cache is no longer respected (for a synchronous loading cache). Attached is a graph of us reporting stats from a cache that has a maximum size of 100k elements.
As the graph shows, before 2.6.1 was deployed the cache stayed at a max size of 100k; after 2.6.1, it grew into the millions (almost 200M). The size is simply cache.estimatedSize(). The code is as vanilla as it gets:
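
For reference, a vanilla LoadingCache setup of the kind described would look roughly like this (key/value types and the loader are placeholders, not the reporter's code):

    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.LoadingCache;

    class ExampleService {
      // Placeholder loader; the real one is whatever the service loads per key.
      private final LoadingCache<String, String> cache = Caffeine.newBuilder()
          .maximumSize(100_000)
          .build(this::load);

      private String load(String key) {
        return "value-for-" + key;
      }
    }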