[CoreWorker] lazy bind core_work's job_config through task spec. #31375

Merged
merged 28 commits into from
Jan 12, 2023

Conversation

@scv119 scv119 commented Dec 30, 2022

Why are these changes needed?

Previously, the worker got its job_config from the raylet on construction, which prevented us from binding job_config to workers lazily. This PR enables lazy binding of job_config by piggybacking job_config on the TaskSpec and initializing it when the worker receives a task execution request (the push_task call).

We also refactor WorkerContext and RayletClient as part of the change.
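
A minimal sketch of the lazy binding described above, assuming simplified stand-ins: `JobConfig`, `TaskSpec`, `WorkerContext::MaybeInitializeJobInfo`, `IsBound`, and `HandlePushTask` are hypothetical here and are not the actual Ray types or signatures. The worker starts unbound and adopts the job ID and job config carried by the first task it receives; later tasks are expected to carry the same values.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for rpc::JobConfig; the real type is a protobuf
// message with many more fields.
struct JobConfig {
  std::string ray_namespace;
  bool operator==(const JobConfig &other) const {
    return ray_namespace == other.ray_namespace;
  }
};

// Hypothetical stand-in for the task spec.
struct TaskSpec {
  int job_id = 0;
  JobConfig job_config;  // Piggybacked on every task.
};

class WorkerContext {
 public:
  // Binds the job on the first call; later calls must agree with it.
  void MaybeInitializeJobInfo(int job_id, const JobConfig &job_config) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!initialized_) {
      current_job_id_ = job_id;
      job_config_ = job_config;
      initialized_ = true;
    }
    assert(current_job_id_ == job_id);
    assert(job_config_ == job_config);
  }

  bool IsBound() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return initialized_;
  }

  JobConfig GetCurrentJobConfig() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return job_config_;
  }

 private:
  mutable std::mutex mutex_;
  bool initialized_ = false;
  int current_job_id_ = 0;
  JobConfig job_config_;
};

// Hypothetical push_task handler: bind lazily, then run the task.
void HandlePushTask(WorkerContext &context, const TaskSpec &spec) {
  context.MaybeInitializeJobInfo(spec.job_id, spec.job_config);
  // ... execute the task ...
}
```

In this sketch the check against the stored config compares the stored value with the incoming one, so a task from a different job would trip the assertion.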

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 changed the title [CoreWorker] populate taskspec with job_config [CoreWorker] populate job_config through task spec. Jan 1, 2023
@scv119 scv119 changed the title [CoreWorker] populate job_config through task spec. [CoreWorker] lazy bind core_work's job_config through task spec. Jan 1, 2023
@scv119 scv119 marked this pull request as ready for review January 1, 2023 01:04
@scv119
Contributor Author

scv119 commented Jan 2, 2023

@liuyang-my the Java test failed, but I'm not quite sure what exactly happened from reading the logs. Do you know what might have gone wrong? (Presumably we are hitting some deadlock issue?) The failed test is https://buildkite.com/ray-project/oss-ci-build-pr/builds/8396#01857127-086b-406b-92da-6f935dcc8447

src/ray/core_worker/context.cc Outdated Show resolved Hide resolved
src/ray/core_worker/context.cc Outdated Show resolved Hide resolved
@@ -298,6 +298,30 @@ message ActorDiedErrorContext {
}
// ---Actor death contexts end----

message JobConfig {
Contributor Author

move JobConfig to common.proto to break the circular dependency

cpp/src/ray/runtime/local_mode_ray_runtime.cc Outdated Show resolved Hide resolved
src/ray/core_worker/context.cc Outdated Show resolved Hide resolved
src/ray/core_worker/context.cc Outdated Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Show resolved Hide resolved
@@ -48,12 +48,15 @@ ObjectID LocalModeTaskSubmitter::Submit(InvocationSpec &invocation,
std::string task_name =
invocation.name.empty() ? functionDescriptor->DefaultTaskName() : invocation.name;

static rpc::JobConfig kDefaultJobConfig;
Member

Is static a premature optimization? Is it possible for this to be called from multiple threads and corrupt the object?
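
On the thread-safety question, a small sketch may help; `JobConfig`, `DefaultJobConfig`, and `CheckDefaultIsShared` below are hypothetical names for illustration, not code from this PR. Since C++11, initialization of a function-local static is guaranteed to happen exactly once even when several threads reach it concurrently. What can still race is later mutation of the object, which is why the sketch makes it `const` and hands out only a const reference.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for rpc::JobConfig.
struct JobConfig {
  std::string ray_namespace;
};

// The static is constructed once on first use. Because it is const and only
// exposed by const reference, concurrent callers can only read it.
const JobConfig &DefaultJobConfig() {
  static const JobConfig kDefaultJobConfig{};
  return kDefaultJobConfig;
}

// Usage: every call observes the same, default-constructed instance.
void CheckDefaultIsShared() {
  assert(&DefaultJobConfig() == &DefaultJobConfig());
  assert(DefaultJobConfig().ray_namespace.empty());
}
```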

@fishbone
Contributor

fishbone commented Jan 6, 2023

Question: will this slow down performance? I think this adds the runtime env to all task specs (previously, only one). Do you mind benchmarking for a perf regression?

Besides this, do you think it would be good to pass the job config to the workers through stdin? Done that way, we could probably limit all the changes to the worker pool.

I'm also thinking that this may need extension in the future. We probably don't want to pass everything through the task spec, I believe.

Btw, I'm OK with this approach if the benchmark with job config is OK. But let's add a comment to the job config proto to let people know it's passed to all tasks repeatedly.

Still reviewing...

@@ -88,7 +88,7 @@ class AbstractRayRuntime : public RayRuntime {

const TaskID &GetCurrentTaskId();

- const JobID &GetCurrentJobID();
+ JobID GetCurrentJobID();
Contributor

@fishbone fishbone Jan 6, 2023

Why update this one but not the rest?

Contributor Author

This calls into the context, which returns by value instead of by reference, so we changed it accordingly.
Changing the rest to return by value is a great idea, but it would double the size of the PR and touch a lot of cpp runtime code, so we prefer not to change them in this PR.

@@ -48,12 +48,15 @@ ObjectID LocalModeTaskSubmitter::Submit(InvocationSpec &invocation,
std::string task_name =
invocation.name.empty() ? functionDescriptor->DefaultTaskName() : invocation.name;

rpc::JobConfig kDefaultJobConfig;
Contributor

nit:
make it a global const static variable?

Not sure, but I always feel kXYZ names are for global const variables.

@@ -149,6 +149,8 @@ JobID TaskSpecification::JobId() const {
return JobID::FromBinary(message_->job_id());
}

rpc::JobConfig TaskSpecification::JobConfig() const { return message_->job_config(); }
Contributor

why not const reference?

Contributor

+1

rpc::Address address;
spec_builder.SetCommonTaskSpec(id,
"dummy_task",
Language::PYTHON,
FunctionDescriptorBuilder::BuildPython("", "", "", ""),
job_id,
config,
Contributor

nit:
rpc::JobConfig()

seems easier to read. Otherwise we need to check what config is in the code.

Comment on lines 113 to 114
JobID current_job_id_ GUARDED_BY(mutex_);
rpc::JobConfig job_config_ GUARDED_BY(mutex_);
Contributor

should we use optional here if it's lazily initialized?

Contributor

+1. Why don't we use optional instead of default config?
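
To make the `optional` suggestion concrete, here is a rough sketch under stated assumptions: `JobConfig` and `LazyJobConfig` are hypothetical illustration types, `std::optional` requires C++17, and the mutex that guards the real members is left out for brevity. The point is that an empty optional distinguishes "not bound yet" from "bound to a default-valued config", which a default-constructed member cannot do.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for rpc::JobConfig.
struct JobConfig {
  std::string ray_namespace;
};

class LazyJobConfig {
 public:
  // Stores the config on first use; returns false if it was already bound.
  bool Bind(const JobConfig &config) {
    if (config_.has_value()) {
      return false;
    }
    config_ = config;
    return true;
  }

  bool IsBound() const { return config_.has_value(); }

  // Reading before binding is treated as a programming error rather than
  // silently returning a default config.
  const JobConfig &Get() const {
    assert(config_.has_value());
    return *config_;
  }

 private:
  std::optional<JobConfig> config_;  // Empty until the first task arrives.
};
```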

@scv119
Contributor Author

scv119 commented Jan 6, 2023

thanks for reviewing!

kicking off benchmark here: https://buildkite.com/ray-project/release-tests-pr/builds/24753

Contributor

@rkooo567 rkooo567 left a comment

This PR doesn't touch the worker pool code. In that case, are the workers started by each job still considered to belong to that job?

Also, I am curious about the behavior changes. Previously,

  1. When the worker starts, it belongs to the job
  2. When the job terminates the workers are killed.

With this change, how are these semantics changed?

job_config_ = job_config;
}
RAY_CHECK(current_job_id_ == job_id);
RAY_CHECK(google::protobuf::util::MessageDifferencer::Equals(job_config_, job_config_));
Contributor

Is this necessary? Is the overhead of MessageDifferencer::Equals small?

return current_job_id_;
}

rpc::JobConfig WorkerContext::GetCurrentJobConfig() const {
Contributor

const reference?

Contributor Author

Returning a reference to state guarded by the mutex would let the caller read it outside the critical section, which is a data race and therefore undefined behavior.
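
A small sketch of the trade-off, with hypothetical names (`JobConfig`, `GuardedConfig`, `SnapshotIsIndependent`) that are not taken from the PR: copying while the lock is held gives the caller a private snapshot, whereas a returned reference outlives the lock and can be read while another thread writes.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for rpc::JobConfig.
struct JobConfig {
  std::string ray_namespace;
};

class GuardedConfig {
 public:
  void Set(const JobConfig &config) {
    std::lock_guard<std::mutex> lock(mutex_);
    config_ = config;
  }

  // Safe: the copy is made while the lock is held.
  JobConfig GetCopy() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return config_;
  }

  // Risky alternative, shown only as a comment: the lock is released on
  // return, so the caller would read config_ unlocked and could race with
  // a concurrent Set().
  // const JobConfig &GetRef() const {
  //   std::lock_guard<std::mutex> lock(mutex_);
  //   return config_;
  // }

 private:
  mutable std::mutex mutex_;
  JobConfig config_;
};

// Usage: the snapshot is unaffected by later writes.
void SnapshotIsIndependent() {
  GuardedConfig guarded;
  guarded.Set(JobConfig{"first"});
  JobConfig snapshot = guarded.GetCopy();
  guarded.Set(JobConfig{"second"});
  assert(snapshot.ray_namespace == "first");
}
```

The cost is one copy per call, which is the overhead the reviewers are weighing against the const-reference signature.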

src/ray/core_worker/context.h Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Show resolved Hide resolved

if (options_.worker_type == WorkerType::DRIVER &&
!options_.serialized_job_config.empty()) {
// Driver populates job_config through worker startup options.
Contributor

IIUC, the driver is not started with worker startup options?

Maybe it should be "Driver populates the job config via initialization. Workers populate it when the first task is received"?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 6, 2023
Contributor

@rkooo567 rkooo567 left a comment

Requesting changes since Cade already approved it.

@scv119 scv119 merged commit 302a7e5 into ray-project:master Jan 12, 2023
@MisterLin1995 MisterLin1995 mentioned this pull request Jan 12, 2023
7 tasks
scv119 pushed a commit that referenced this pull request Jan 12, 2023
Fix and reopen java tests closed in #31375

Co-authored-by: Marcus Zhang <zxl265370@antgroup.com>
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023

Previously, the worker got its job_config from the raylet on construction, which prevented us from binding job_config to workers lazily. This PR enables lazy binding of job_config by piggybacking job_config on the TaskSpec and initializing it when the worker receives a task execution request (the push_task call).

We also refactor WorkerContext and RayletClient as part of the change.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
Fix and reopen java tests closed in #31375

Co-authored-by: Marcus Zhang <zxl265370@antgroup.com>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
5 participants