bugfix: check that timer servicing worked #4846
that sets the global timer for ASAP retry
src/ir_passes/await.ml
(* if self-call queue full: expire global timer soon and retry *)
and t_timer_throw context exp =
  t_on_throw context exp
    (blockE [expD (primE
Are we sure there aren't other reasons for the error than queue full? If there are, it might make sense to test the error code and only set the global timer when appropriate...
Checking the Wasm, the only possibility is send failure.
LGTM so far. What will you do about failure to enqueue tasks?
A PR on top of this one, with the logic to insert the failed expirations at the head of the priority queue (in order).
review feedback
Seems it fixes the issue, I can't reproduce it anymore on my branch https://github.com/dfinity/ic/commits/andriy/motoko-timers-repro/
This deals with the (unlikely) possibility that the send queue is not full when the timer servicing action is submitted, but becomes full while submitting the user jobs. Now we catch the failure and re-add (single-expiration) jobs to the start of the priority queue. This is the missing piece to #4846. This is an incremental change, so that we don't have to touch the happy path. A rewrite would be justified to collapse gathering and self-sends. There is an optimisation realised in `@prune`.
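The re-queueing described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not the Motoko runtime code: `SendQueueFull`, `make_send`, `submit_jobs`, and the `(expiry, job_id)` tuples are hypothetical names, and the pending timers are assumed to live in a binary heap ordered by expiry.

```python
import heapq


class SendQueueFull(Exception):
    """Hypothetical stand-in for a self-send rejected by a full queue."""


def make_send(capacity):
    """Build a fake send function that accepts `capacity` messages, then fails."""
    slots = [capacity]

    def send(job_id):
        if slots[0] == 0:
            raise SendQueueFull(job_id)
        slots[0] -= 1

    return send


def submit_jobs(pending, now, send):
    """Try to send every job in `pending` whose expiry is <= now.

    Jobs whose send fails are pushed back with their original expiry,
    so they sit at the head of the priority queue again, in order.
    Returns the ids of the jobs that were actually sent.
    """
    sent, failed = [], []
    while pending and pending[0][0] <= now:
        job = heapq.heappop(pending)
        try:
            send(job[1])
            sent.append(job[1])
        except SendQueueFull:
            failed.append(job)
    for job in failed:
        heapq.heappush(pending, job)
    return sent


pending = [(1, "a"), (2, "b"), (3, "c"), (10, "d")]
heapq.heapify(pending)
sent = submit_jobs(pending, now=5, send=make_send(capacity=1))
```

With room for only one message, job `a` is sent while `b` and `c` return to the queue ahead of the not-yet-expired `d`, so no expiration is lost.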
LGTM
This fixes a disappearing timer situation described by Timo in https://dfinity.slack.com/archives/CPL67E7MX/p1736347600078339.
It turns out that under high message load the `async` timer servicing routine cannot be run. The fix is simple: check whether the self-call succeeded (a failure already causes a `throw`), and if not, set a very near global timer to retry ASAP (in the top-level `catch`).
TODO: `catch` send errors for user workers (and mitigate); see Mitigate timer job submission failures #4852
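As a rough illustration of the retry idea in this description, the Python model below catches a failed self-call and re-arms the global timer for the near future. It is a sketch under assumptions, not the generated Wasm or the compiler pass: `SendQueueFull`, `RETRY_DELAY`, `on_global_timer`, and the callback parameters are all hypothetical.

```python
class SendQueueFull(Exception):
    """Hypothetical stand-in for a self-call rejected by a full queue."""


RETRY_DELAY = 1  # hypothetical "very near" delay, in arbitrary time units


def on_global_timer(now, self_call, set_global_timer):
    """Run the servicing self-call; on failure, schedule an ASAP retry.

    self_call: callable that starts the async servicing routine and
               raises SendQueueFull if the call cannot be enqueued.
    set_global_timer: callable taking the next wake-up time.
    Returns True if servicing was started, False if a retry was scheduled.
    """
    try:
        self_call()
        return True
    except SendQueueFull:
        # Without this re-arm the timer would never fire again.
        set_global_timer(now + RETRY_DELAY)
        return False


def failing_call():
    raise SendQueueFull()


armed = []
started = on_global_timer(100, failing_call, armed.append)
```

In the failing case `armed` records the retry deadline. In the success case the model leaves the global timer alone, on the assumption that the servicing routine re-arms it itself.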