Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][state] Adjust worker side reporting with batches && add debugstring #31840

Merged
merged 14 commits into from
Jan 28, 2023

Conversation

rickyyx
Copy link
Contributor

@rickyyx rickyyx commented Jan 22, 2023

Signed-off-by: rickyyx rickyx@anyscale.com

Why are these changes needed?

  1. This PR introduces a flag RAY_task_events_send_batch_size that controls the number of task events sent to GCS in a batch. With default setting, each core worker will send 10K task events per second to GCS, where GCS could handle 10K task events in ~50 milliseconds.

  2. This PR also adjust the worker side buffer limit to 1M with the new batching setting.

  3. The PR adds some debug informations as well.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx rickyyx marked this pull request as ready for review January 23, 2023 05:55
@rickyyx rickyyx added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 23, 2023
@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 23, 2023
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just sorting and popping? The current impl seems to have a couple tricky issues (read comments below) E.g.,

i = 0

while (i < next_index) {
    // Append  the first entry to the end until the next index
    buffer_.append(buffer_.pop(0))
    i += 1
}
next_index = 0

reported = 0
to_send = []
while (reported < batch_size) {
    to_send.append(buffer_.pop(0))
}

src/ray/core_worker/task_event_buffer.cc Outdated Show resolved Hide resolved
src/ray/core_worker/task_event_buffer.cc Outdated Show resolved Hide resolved
src/ray/core_worker/task_event_buffer.cc Show resolved Hide resolved
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 24, 2023
Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx
Copy link
Contributor Author

rickyyx commented Jan 27, 2023

  • Added test to assert sent order
  • Used circular buffer as the buffer for sending and overwriting task events

Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2023
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx rickyyx added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 27, 2023
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last question!

src/ray/core_worker/task_event_buffer.cc Outdated Show resolved Hide resolved
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2023
@rickyyx rickyyx removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Jan 27, 2023
@rkooo567 rkooo567 merged commit 5d1f2e4 into ray-project:master Jan 28, 2023
@rkooo567
Copy link
Contributor

Awesome! Really happy with the final outcome!

krfricke added a commit that referenced this pull request Jan 28, 2023
rkooo567 pushed a commit that referenced this pull request Jan 28, 2023
rkooo567 pushed a commit that referenced this pull request Feb 1, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…tring (ray-project#31840)

Signed-off-by: rickyyx <rickyx@anyscale.com>

This PR introduces a flag RAY_task_events_send_batch_size that controls the number of task events sent to GCS in a batch. With default setting, each core worker will send 10K task events per second to GCS, where GCS could handle 10K task events in ~50 milliseconds.

This PR also adjust the worker side buffer limit to 1M with the new batching setting.

The PR adds some debug informations as well.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…d debugstring (ray-project#31840)" (ray-project#32024)

This reverts commit 5d1f2e4.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…tring (remerging ray-project#31840) (ray-project#32057)

Remerging ray-project#31840

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants