Fix sleep await duration hang #1121
In preparation for supporting a distributed/Redis scheduler, prepare the state interface to no longer take mutable references. This is a partial PR and should be landed immediately, with follow-up PRs that will remove much of the locking in the SimpleScheduler. towards: TraceMachina#359
Worker logic should not be visible to StateManager just yet. In the future this will likely change, but for this phase of the refactor SimpleScheduler should own all information about workers. towards: TraceMachina#359
Moves the logic that decides when the matching engine trigger runs under the workers struct where that is easy. This separates two cases: a task changes and the matching engine needs to run, and a task gets run and the matching engine needs to run. towards: TraceMachina#359
This is a complete rewrite of the way the scheduler works internally. No changes should be externally visible.
Introduces various code quality improvements to scheduler-v2 which allow bazel test to pass. Co-authored-by: Zach Birenbaum <zacharyobirenbaum@gmail.com>
…raceMachina#1091) This will enable the underlying scheduler to intercept the Drop call allowing easier cleanups of actively listened actions.
ActionStateResult is now wired up to ActionListener allowing it to be notified of Drop calls. This will be used to do client operation id cleanups.
Worker stream now properly terminated on action complete.
…ina#1107) StateManager will now properly remove items from the maps if the client disconnects after a set amount of time. Currently these values are hard-coded, but it will be easy to transition them to config variables once we design that out.
In order to make the refactor easier we are removing all metrics. We will add them back in later.
Removes unused proto field in worker api.
Reviewable status: 0 of 1 LGTMs obtained, and 1 discussions need to be resolved
nativelink-scheduler/src/scheduler_state/state_manager.rs
line 905 at r1 (raw file):
    spawn!("client_action_state_result_keepalive", async move {
        loop {
            tokio::time::sleep(Duration::from_secs(KEEPALIVE_DURATION_SECS)).await;
nit: Can we instead just move this to a variable on inner_weak? We are going to need to do this anyway when we move to use configs.
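One way to picture the suggestion above is a keepalive loop that reads its interval from the shared state it already holds weakly, rather than from a module-level constant. The following is only a minimal sketch under stated assumptions: `Inner`, `keepalive_duration`, `ticks`, `spawn_keepalive`, and `run_keepalive_demo` are hypothetical names, not the project's actual types, and plain `std::thread` stands in for the `spawn!` macro and `tokio::time::sleep` so the sketch runs without external crates.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Weak};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the state manager's inner struct.
struct Inner {
    /// Keepalive interval held as a runtime value (could later come from config).
    keepalive_duration: Duration,
    /// Counts keepalive ticks so the loop's behaviour can be observed.
    ticks: AtomicUsize,
}

/// Spawns a keepalive loop that holds only a weak reference to `Inner`.
/// The loop exits once every strong reference has been dropped.
fn spawn_keepalive(inner_weak: Weak<Inner>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        // Read the interval from the shared state on each iteration.
        let interval = match inner_weak.upgrade() {
            Some(inner) => inner.keepalive_duration,
            None => return,
        };
        // No strong reference is held while sleeping.
        thread::sleep(interval);
        match inner_weak.upgrade() {
            Some(inner) => {
                inner.ticks.fetch_add(1, Ordering::SeqCst);
            }
            None => return,
        }
    })
}

/// Runs the loop until at least three ticks are seen, then drops the owner
/// and waits for the keepalive thread to notice and exit.
fn run_keepalive_demo() -> usize {
    let inner = Arc::new(Inner {
        keepalive_duration: Duration::from_millis(10),
        ticks: AtomicUsize::new(0),
    });
    let handle = spawn_keepalive(Arc::downgrade(&inner));
    while inner.ticks.load(Ordering::SeqCst) < 3 {
        thread::sleep(Duration::from_millis(5));
    }
    let observed = inner.ticks.load(Ordering::SeqCst);
    drop(inner);
    handle.join().unwrap();
    observed
}
```

Calling `run_keepalive_demo()` returns the number of ticks observed before the owner was dropped. Keeping the interval on the struct means a later config-driven value would only need to change where `Inner` is constructed.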
Force-pushed from f103f63 to 2954bfa.
While testing, it was observed that state_manager::make_client_keepalive_spawn would enter the spawn! but hang even after the sleep duration had passed. Some digging into rust-lang/rust#57391 and https://doc.rust-lang.org/src/core/time.rs.html#110 suggests that using a const Duration might not be stable in all cases.
After applying this change, waking from sleep does work. I still suspect other issues around creating a high number of sleeping threads for client connections.