Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: check that timer servicing worked #4846

Merged
merged 11 commits into from
Jan 17, 2025
Merged

bugfix: check that timer servicing worked #4846

merged 11 commits into from
Jan 17, 2025

Conversation

ggreif
Copy link
Contributor

@ggreif ggreif commented Jan 14, 2025

This fixes a disappearing timer situation described by Timo in https://dfinity.slack.com/archives/CPL67E7MX/p1736347600078339.

It turns out that under high message load the async timer servicing routine cannot be run. The fix is simple, check if the self-call succeeded (causes a throw already), and if not, set a very near global timer to retry ASAP (in the top-level catch).

TODO:

@ggreif ggreif changed the title bugfix: check that servicing worked bugfix: check that timer servicing worked Jan 14, 2025
src/prelude/internals.mo Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Jan 14, 2025

Comparing from 607ecfd to 77fa565:
In terms of gas, no changes are observed in 5 tests.
In terms of size, no changes are observed in 5 tests.

@ggreif ggreif self-assigned this Jan 14, 2025
@ggreif ggreif added the build_artifacts Upload moc binary as workflow artifacts label Jan 14, 2025
@ggreif ggreif marked this pull request as ready for review January 14, 2025 19:10
@ggreif ggreif requested a review from a team as a code owner January 14, 2025 19:10
@ggreif ggreif requested review from crusso and berestovskyy January 14, 2025 21:15
@ggreif ggreif marked this pull request as draft January 15, 2025 10:26
@ggreif ggreif marked this pull request as ready for review January 15, 2025 11:41
src/mo_frontend/typing.mli Outdated Show resolved Hide resolved
(* if self-call queue full: expire global timer soon and retry *)
and t_timer_throw context exp =
t_on_throw context exp
(blockE [expD (primE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure there aren't other reasons for the error than queue full? If so, it might make make sense to test the error code of the error and only set the global timer when appropriate...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the Wasm, the only possibility is send failure.

Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far. What will you do about failure to enqueue tasks?

@ggreif
Copy link
Contributor Author

ggreif commented Jan 15, 2025

LGTM so far. What will you do about failure to enqueue tasks?

A PR on top of this one and with the logic to insert the failed expirations into the priority queue's head (in order).

@dfinity dfinity deleted a comment from github-actions bot Jan 15, 2025
Copy link

github-actions bot commented Jan 15, 2025

Download the artifacts for this pull request:

berestovskyy
berestovskyy previously approved these changes Jan 16, 2025
Copy link

@berestovskyy berestovskyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it fixes the issue, I can't reproduce it anymore on my branch https://github.com/dfinity/ic/commits/andriy/motoko-timers-repro/

@ggreif ggreif removed the build_artifacts Upload moc binary as workflow artifacts label Jan 17, 2025
This deals with the (unlikely) possibility that the send queue is not full when the timer servicing action is submitted, but becomes full while submitting the user jobs. Now we catch the failure and re-add (single-expiration) jobs to the start of the priority queue.

This is the missing piece to #4846.

This is an incremental change, so that we don't have to touch the happy path. A rewrite would be justified to collapse gathering and self-sends.

There is an optimisation realised in `@prune`.
@ggreif ggreif requested a review from crusso January 17, 2025 15:00
@ggreif ggreif added Bug Something isn't working automerge-squash When ready, merge (using squash) labels Jan 17, 2025
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT

@mergify mergify bot merged commit a9bc214 into master Jan 17, 2025
11 checks passed
@mergify mergify bot removed the automerge-squash When ready, merge (using squash) label Jan 17, 2025
@mergify mergify bot deleted the gabor/timer-check branch January 17, 2025 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants