-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fix failure expiry race condition #67
Conversation
88b5494
to
9000ebf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @njhill -- I did run the test, as well as just the new/changed tests with and without the corresponding main source changes so functionally your changes look good as far as I can surmise. I just have two small nit-picky questions/comments
} else { | ||
lastUsed = ce != null ? runtimeCache.getLastUsedTime(modelId) : -1L; | ||
// Use shorter expiry age if model was used in last 3 minutes | ||
final long expiryAge = (lastUsed > 0 && (now - lastUsed) < 180_000L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the last 3 minutes (180_000L
) here be a defined constant/env variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it could be ... let me push a change with that
Motivation A bug was encountered whereby model load failure records would not be cleaned up after expiration while there were continuous (failing) requests for that model, meaning no reload attempt occurred and those requests failed indefinitely. Modifications - Add unit test to reproduce the problem, which required - Making expiry time configurable - Modifying the behaviour of the dummy model so that triggered loading failures only occur on the first attempt - Add some log statements - Fix the main bug: When a failed cache entry without any record in the registry is encountered during a load attempt, remove it rather than treating as a load failure - Fix some other logic related to failure expiry - Use a shorter expiry time (2/3 of configured time) for recently-used models Result Failures expire as intended in all situations Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
9000ebf
to
ac34cfe
Compare
Thanks @ckadner, I pushed an update with the constant, PTAL |
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ckadner, njhill The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Motivation
A bug was encountered whereby model load failure records would not be cleaned up after expiration while there were continuous (failing) requests for that model, meaning no reload attempt occurred and those requests failed indefinitely.
Modifications
Result
Failures expire as intended in all situations
Signed-off-by: Nick Hill nickhill@us.ibm.com