Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sleep await duration hang #1121

Conversation

adam-singer
Copy link
Contributor

@adam-singer adam-singer commented Jul 9, 2024

While testing it was observed that state_manager::make_client_keepalive_spawn would enter into the spawn! but hang even after sleep duration passed. Some digging into rust-lang/rust#57391 && https://doc.rust-lang.org/src/core/time.rs.html#110 shows that using const Duration might not be stable in all cases.

After applying this change awaking from sleep does work. Still suspect other issues around creating high number of sleeping threads for client connections.


This change is Reviewable

allada and others added 13 commits June 28, 2024 19:35
In prep to support a distributed/redis scheduler, prepare the state
interface to no longer take mutable references.

This is a partial PR and should be landed immediately with followup PRs
that will remove many of the locking in the SimpleScheduler.

towards: TraceMachina#359
Worker logic should not be visible to StateManager just yet. In the
future this will likely change, but for this phase of the refactor
SimpleScheduler should own all information about workers.

towards: TraceMachina#359
Moves the logic on when the matching enginge trigger gets run to
under the workers struct where easy. This splits the logic of
when a task is changed and matching engine needs to run and when
a task gets run and the matching engine needs to be run.

towards: TraceMachina#359
This is a complete rewrite of way the scheduler works internally.
No changes should be externally visible.
Introduces various code quality improvements to scheduler-v2
which allow bazel test to pass.

Co-authored-by: Zach Birenbaum <zacharyobirenbaum@gmail.com>
…raceMachina#1091)

This will enable the underlying scheduler to intercept the Drop
call allowing easier cleanups of actively listened actions.
ActionStateResult is now wired up to ActionListener allowing
it to be notified of Drop calls. This will be used to do client
operation id cleanups.
Worker stream now properly terminated on action complete.
…ina#1107)

StateManager will now properly remove items from the maps if the
client disconnects after a set amount of time. Currently these
values are hard codded, but will be easy to transition them to
use config variables once we design it out.
In order to make the refactor easier we are removing all metrics.

We will add them back in later.
Removes unused proto field in worker api.
@adam-singer adam-singer requested a review from allada July 9, 2024 06:33
Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and 1 discussions need to be resolved


nativelink-scheduler/src/scheduler_state/state_manager.rs line 905 at r1 (raw file):

    spawn!("client_action_state_result_keepalive", async move {
        loop {
            tokio::time::sleep(Duration::from_secs(KEEPALIVE_DURATION_SECS)).await;

nit: Can we instead just move this to a variable on inner_weak? We are going to need to do this anyway when we move to use configs.

@allada allada force-pushed the dev branch 4 times, most recently from f103f63 to 2954bfa Compare July 16, 2024 20:50
@allada allada deleted the branch TraceMachina:dev July 17, 2024 16:13
@allada allada closed this Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants