[core][state] Task events backend - worker task event buffer implementation [1/n] #30867
Local benchmarking:
Some profiling results:
This is a result of running
Looks great. The main comment is about unifying some of the code for SetStatus.
For the performance change, I wonder if we can make the following unrelated optimization and gain back most of the difference: #30872. Basically, serialize the scheduling strategy protos to binary and compare those instead of using the message differencer.
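The idea in the comment above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `SchedulingStrategy` is a hypothetical stand-in dataclass rather than the Ray proto, and `json.dumps` with sorted keys stands in for a deterministic binary serialization. The point is only that serializing once and then comparing bytes can replace a repeated field-by-field comparison.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SchedulingStrategy:
    """Hypothetical stand-in for a scheduling strategy message."""
    kind: str
    node_id: str = ""
    soft: bool = False


def serialize(strategy: SchedulingStrategy) -> bytes:
    # Assumption: equal messages always serialize to identical bytes.
    return json.dumps(asdict(strategy), sort_keys=True).encode("utf-8")


class SchedulingKey:
    """Caches the serialized bytes so equality and hashing are byte operations."""

    def __init__(self, strategy: SchedulingStrategy):
        self._bytes = serialize(strategy)  # serialized once, at construction

    def __eq__(self, other):
        return isinstance(other, SchedulingKey) and self._bytes == other._bytes

    def __hash__(self):
        return hash(self._bytes)
```

This approach only works when the serialization is deterministic for equal messages within the process doing the comparison, which would need to be checked for the real protos.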
…n] (#30979)

Previous PRs:
- [core][state] Task events backend: config and interface definitions [0/n] #30829: Interface and protobuf definitions.
- [core][state] Task events backend - split drop count on worker #30953: Splitting of drop count for various event types on worker.
- [core][state] Task events backend - worker task event buffer implementation [1/n] #30867: TaskEventBuffer implementation.

In this PR:
- Added GcsTaskManager, which stores the task events on the GCS side.
- The GcsTaskManager has its own io service and io thread, separate from the main rpc thread/io_context. Handling of rpcs is posted to its own internal io_service.
- Implementation for the update path.
- Interface for the read path.

Next PRs:
- Implementation for the update path of GcsTaskManager
- Porting of profiling events
- Porting of state api task.
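The threading model described for GcsTaskManager (RPC handlers post work to a dedicated io thread) can be sketched roughly as follows. This is not the C++ implementation: the class name, handler names, event dictionaries, and the choice to key stored events by `(task_id, attempt_number)` are all assumptions for illustration, and a single-worker `ThreadPoolExecutor` plays the role of the internal io_service.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class GcsTaskManagerSketch:
    """Hypothetical sketch: all storage access happens on one dedicated thread."""

    def __init__(self):
        # One worker thread stands in for the manager's own io_service/io thread.
        self._io = ThreadPoolExecutor(
            max_workers=1, thread_name_prefix="task-events-io"
        )
        # Only touched from the io thread, so no lock is needed.
        self._events = {}

    def handle_add_task_event_data(self, events):
        """Called on the RPC thread; posts the update and returns a future."""
        return self._io.submit(self._add_events, list(events))

    def handle_get_task_events(self):
        """Read path: also posted to the io thread so it sees a consistent view."""
        return self._io.submit(lambda: list(self._events.values()))

    def _add_events(self, events):
        for ev in events:
            key = (ev["task_id"], ev["attempt_number"])
            # Merge later updates into the stored entry for the same attempt.
            self._events.setdefault(key, {}).update(ev)
        # Returned only so callers can observe which thread did the work.
        return threading.current_thread().name

    def stop(self):
        self._io.shutdown(wait=True)
```

Because every handler is funneled through one thread, slow task-event processing does not block the thread that accepted the RPC; the caller just holds a future.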
…tation [1/n] (ray-project#30867)

In this PR:
- Added a TaskEventBuffer class, which serves as an abstraction to store task events and push those events to GCS in batches periodically.
- Each CoreWorker will own a single TaskEventBuffer, and events (both task status change events and profiling events) will be added to a local in-memory buffer.
- The TaskEventBuffer also owns its own GCS client and io thread, which are independent of the main io_contexts.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
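To make the buffering behavior concrete, here is a minimal Python sketch of an in-memory event buffer that is flushed in batches from its own thread. It is a rough illustration under stated assumptions, not the C++ class from this PR: `send_batch` is a hypothetical stand-in for the GCS client call, and the size limits, the drop-oldest policy, and the way the drop count is reported are all assumed.

```python
import threading
from collections import deque


class TaskEventBufferSketch:
    """Hypothetical sketch of a bounded buffer flushed in batches."""

    def __init__(self, send_batch, max_buffered=1000, batch_size=100):
        self._send_batch = send_batch  # stand-in for the GCS client call
        self._max_buffered = max_buffered
        self._batch_size = batch_size
        self._events = deque()
        self._dropped = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = None

    def add_event(self, event):
        """Called by the worker; never blocks on the network."""
        with self._lock:
            if len(self._events) >= self._max_buffered:
                # Assumed policy: drop the oldest event and count the drop.
                self._events.popleft()
                self._dropped += 1
            self._events.append(event)

    def flush(self):
        """Send at most one batch; returns the number of events sent."""
        with self._lock:
            n = min(self._batch_size, len(self._events))
            batch = [self._events.popleft() for _ in range(n)]
            dropped, self._dropped = self._dropped, 0
        if batch or dropped:
            self._send_batch(batch, dropped)  # done outside the lock
        return len(batch)

    def start(self, interval_s):
        """Flush periodically from a dedicated thread."""
        def loop():
            while not self._stop.wait(interval_s):
                self.flush()
        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Sending outside the lock keeps `add_event` cheap even when a flush is slow, which is one reason to give the buffer its own thread.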
**Previous PRs:**
- #30829:
- #30953:
- #30867:
- #30979:
- #30934
- #31207

**This PR:**

With the change,
- `list_tasks` now returns tasks with the attempt number as an additional column.
- `get_task` might return multiple task attempt entries if there are retries.

There is also some plumbing in the test and in core (esp in the test logic) given the changes.

Major changes in the PR are:
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.
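The effect of per-attempt entries and a limit can be shown with a small Python sketch over plain dictionaries. The helper names `list_tasks_sketch` and `get_task_sketch`, the field names, and the sort order are hypothetical; they are not the Ray state API, only an illustration of why one task id can map to several rows.

```python
def list_tasks_sketch(events, limit=None):
    """Return one row per (task_id, attempt_number), optionally truncated."""
    rows = sorted(events, key=lambda e: (e["task_id"], e["attempt_number"]))
    return rows if limit is None else rows[:limit]


def get_task_sketch(events, task_id):
    """Return every attempt recorded for a task; retries yield several rows."""
    return [e for e in list_tasks_sketch(events) if e["task_id"] == task_id]
```

For a task that failed once and was retried, `get_task_sketch` returns two rows, one per attempt number.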
…ect#31247)

**Previous PRs:**
- ray-project#30829:
- ray-project#30953:
- ray-project#30867:
- ray-project#30979:
- ray-project#30934
- ray-project#31207

**In This PR:**
- Remove old code for timeline/profiling backend.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Why are these changes needed?
For details of the design and background see this doc
Previous PRs:
In this PR:
Related issue number
Checks
- Sign off every commit (`git commit -s`) in this PR.
- Run `scripts/format.sh` to lint the changes in this PR.