Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasionally Value is not evicted after expireAfterWrite setting. #212

Closed
kojilin opened this issue Dec 17, 2017 · 23 comments
Closed

Occasionally Value is not evicted after expireAfterWrite setting. #212

kojilin opened this issue Dec 17, 2017 · 23 comments

Comments

@kojilin
Copy link

kojilin commented Dec 17, 2017

Hi, I'm facing some weird problem recently. My caffeine version is 2.5.6.

My cache setting is like this

Caffeine.newBuilder()
    .refreshAfterWrite(300 / 2, TimeUnit.SECONDS)
    .expireAfterWrite(300, TimeUnit.SECONDS)
    .maximumSize(cacheSize)
    .recordStats(...)
    .executor(...)
    .buildAsync((key, ignored) -> getFromRemote(key))

Sometimes 1 of my servers has stale data in caffeine cache. I have taken the heap dump, but I'm not sure if there is a way to let me find the clue(Maybe the problem is my remote calling code, but not found the problem yet).

stale's one (writeTime -9189666743569861076), data was updated 2 days before, and as above, I set 5 mins as expire time.
20171223115107

another data from cache in same instance(writeTime 33948135258891076)
7c1913d049047c8068725aac4432af84

And curious about the differences of writeTime too.
I trace the code and it looks like -9189666743569861076 - Long.MAX_VALUE = 33705293284914733. so it means it doesn't reset on a completion callback?
(33948247-33705293)/60/60/24 = 2.x days.
so it's unfinished processing? But Node.value(CompletableFuture).result has data.

Is there any possibility that can make caffeine not to update/evict?
(I'm not sure, but if the write time keep -9189666743569861076(far future) make it never expired?)

@ben-manes
Copy link
Owner

The timetamps should be updated using whenComplete, e.g. LocalAsyncLoadingCache#get. Can you list the methods you use on the cache so that we can verify each has this completion logic? Also what is your executor set to?

Internally when a future is in-flight it does get the duration of Long.MAX_VALUE to disable expiration. Any updates to that timestamp are done under the entry's lock, so it should never be stampeded on. When the future succeeds it should then callback into the cache using replace(key, future, future) to trigger the weight & expiration time calculations.

@kojilin
Copy link
Author

kojilin commented Dec 17, 2017

Thanks your explain about writeTime.

AsyncLoadingCache<Object, Object> localCache = 
Caffeine.newBuilder()
    ...
    .buildAsync((key, ignored) -> getFromRemote(key));

and we only use

CompletableFuture<Object> future = localCache.get(key);

another is we have localCache.synchronous().stats().hitRate() to get hitRate with metrics gauge. But I think it just get stats.

About thread, we try to use server's event-loop if possible. And usually we use even-loop thread to call #get.

class ServerOrCommonExecutorService ForwardingExecutorService {
        @Override
        protected ExecutorService delegate() {
            if(RequestContext.current() != null){
              return RequestContext.current().eventLoop();
            }
            return ForkJoinPool.commonPool();
        }
}

builder.executor(new ServerOrCommonExecutorService());

Not sure if possible, but I didn't found clue that eventLoop thread has blocking or in-complete from metrics(e.g. active or running thread has 0 value during 2 days. No blocking alert..).

@ben-manes
Copy link
Owner

Do you think that you could write a failing test case? I don’t see the cause yet when trying to reproduce it.

@kojilin
Copy link
Author

kojilin commented Dec 18, 2017

I can't reproduce in my local too. So still try to find the place where I can test.

And curious about if
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1297
may run concurrently with
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/LocalAsyncLoadingCache.java#L121 ?
I'm not totally figure out all the flow of caffeine, so just random thoughts.

@ben-manes
Copy link
Owner

Yes, it can except for the synchronized (node) section, where a node is the entry metadata that wraps a value. The first snippet will ensure the writeTime is at the maximum when in-flight, while the second makes a callback using cache.replace(...) which goes here. That too has a synchronized (node) when updating the write time, so we should expect exclusivity.

@kojilin
Copy link
Author

kojilin commented Dec 18, 2017

I see, I will try to reproduce...

@ben-manes
Copy link
Owner

I wonder if this is due to refreshAfterWrite, which is pretty complicated. That seems the most likely culprit from walking through the code, but I don’t have a scenario yet.

@kojilin
Copy link
Author

kojilin commented Dec 18, 2017

Let me try to remove refreshAfterWrite and test first.

@kojilin
Copy link
Author

kojilin commented Dec 19, 2017

Just feedback, remove refreshAfterWrite but still same result.
at 10AM
2017-12-19 16 56 20

at 3PM
2017-12-19 16 56 32

There is a record with same timestamp.

But still can't reproduce at local.

@ben-manes
Copy link
Owner

ben-manes commented Dec 19, 2017

That’s too bad. I had a theory with refresh, but now I’m unsure. What JVM and OS are you running? I have had issues reported that we found were due to bugs in the platform itself.

@kojilin
Copy link
Author

kojilin commented Dec 19, 2017

CentOS 6.8 and 1.8.0_92. Maybe I can test with new java version.

@ben-manes
Copy link
Owner

I believe #203 is caused by a kernel bug. This causes lost wake ups, in that case a deadlock. Perhaps here it’s resulting in the future not to run its completion handlers. It’s a wild guess but your version is old enough.

@kojilin
Copy link
Author

kojilin commented Dec 19, 2017

ah let me repost again.
If we run this line first, but not get now yet.
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1338

then when complete runs to here and get now first.
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1929

then the expire maybe wrong(because now - far future won't be negative in this case), how do you think?
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1936

I tried again with debug mode thread stop and step by step.

@ben-manes
Copy link
Owner

My belief is that in that case, the worse that could happen is the entry is set with too old of a write time, causing it to expire immediately. Since we use the delta and the timestamp is read on the same thread, there shouldn't be any races or word tearing. A slow computation could result in the premature eviction, but not cause what you are seeing (a wraparound would take 288 years). We should probably move the read, but I don't think its causing this issue.

@kojilin
Copy link
Author

kojilin commented Dec 19, 2017

Sorry I didn't get your point about get too old of a write time.
I think replace's callback will run concurrently with AddTask's run, right?
So if run's get now run after callback's get now
e.g.
AddTask got lock first at https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1336

and enter the if block?

0 at
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1929

and 10 at
https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1338

then hasExpired will be
(0L - (10L + Long.MAX_VALUE))) which will be really large. I think the condition at https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L733 will be true?

@ben-manes
Copy link
Owner

ben-manes commented Dec 19, 2017

It will run concurrently, except for the portions under the synchronized (node) blocks.

Wouldn't hasExpired be very large (since we think the entry is in-flight) until the replace finishes? Then it will set the timestamp to a real value, e.g. 0 in your example? Shouldn't the lock ensure we don't stampede the writeTime due to a context switch?

@kojilin
Copy link
Author

kojilin commented Dec 19, 2017

AddTask got the lock and enter the block of checking in-flight future. Then replace's whenComplete got the System.nanoTime before AddTask. (I'm not sure the thread of AddTask & whenComplete must be same or not.)

And about reset, I think it will return at https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1937
So it has no chance to set back to 0.

@ben-manes
Copy link
Owner

oh! I see what you mean. Yes, that's a real possibility. Lets definitely fix that.

@kojilin
Copy link
Author

kojilin commented Dec 19, 2017

Wow yes! hope this solve the problem. Thanks!!

ben-manes added a commit that referenced this issue Dec 21, 2017
Now passes the TCK tests that were backwards incompatible with 1.0. The
OSGI test is enabled now that jsr107 includes the bindings.

Added a Jigsaw module names (fixes #211) for Java 9 modules. The work
towards a clean Java 9 library (no javax.annotations, no Unsafe) is
pending. This will released as version 3.0 when ready.

Stricter expiration checks for #212. Previously a stale read of the
current time was used where it was perceived as benign, to avoid
excessive (expensive) calls to the system clock. This might have
unintentionally resulted in an AsyncLoadingCache not expiring a
future properly.

Replaces the unnecessary NonReentrantLock with the JDK's rentrant one.
In #203 a deadlock occurs, though it is not reproducable and most likely
is due to a known (and scary) Linux kernel bug in older versions. The
improvements to state management also means the lock is extremely cheap
so any gains in the slimmer version are not detectable.
@ben-manes
Copy link
Owner

I think that I made the proper changes here. The latest build should be in the snapshot repository shortly. Can you use that and see if we fixed the problem?

@kojilin
Copy link
Author

kojilin commented Dec 21, 2017

OK let me try and report here soon, maybe need 1 or 2 days

ben-manes added a commit that referenced this issue Dec 22, 2017
Now passes the TCK tests that were backwards incompatible with 1.0. The
OSGI test is enabled now that jsr107 includes the bindings.

Added a Jigsaw module names (fixes #211) for Java 9 modules. The work
towards a clean Java 9 library (no javax.annotations, no Unsafe) is
pending. This will released as version 3.0 when ready.

Stricter expiration checks for #212. Previously a stale read of the
current time was used where it was perceived as benign, to avoid
excessive (expensive) calls to the system clock. This might have
unintentionally resulted in an AsyncLoadingCache not expiring a
future properly.

Replaces the unnecessary NonReentrantLock with the JDK's rentrant one.
In #203 a deadlock occurs, though it is not reproducable and most likely
is due to a known (and scary) Linux kernel bug in older versions. The
improvements to state management also means the lock is extremely cheap
so any gains in the slimmer version are not detectable.
ben-manes added a commit that referenced this issue Dec 22, 2017
Now passes the TCK tests that were backwards incompatible with 1.0. The
OSGI test is enabled now that jsr107 includes the bindings.

Added a Jigsaw module names (fixes #211) for Java 9 modules. The work
towards a clean Java 9 library (no javax.annotations, no Unsafe) is
pending. This will released as version 3.0 when ready.

Stricter expiration checks for #212. Previously a stale read of the
current time was used where it was perceived as benign, to avoid
excessive (expensive) calls to the system clock. This might have
unintentionally resulted in an AsyncLoadingCache not expiring a
future properly.

Replaces the unnecessary NonReentrantLock with the JDK's rentrant one.
In #203 a deadlock occurs, though it is not reproducable and most likely
is due to a known (and scary) Linux kernel bug in older versions. The
improvements to state management also means the lock is extremely cheap
so any gains in the slimmer version are not detectable.
@kojilin
Copy link
Author

kojilin commented Dec 23, 2017

I tried and let it run 1 day.
Compare to previous test(3 instance, 1 day)
2.5.6: All servers have 1~100 entries' writeTime in far future.
2.6.1-Snapsot: No server has far future writeTime entry.
Current nanoTime is around ~34438964688472572L, and there is no < 0 data.
2017-12-23 10 55 02

I think the fix solve the issue.

@ben-manes
Copy link
Owner

Released

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants