[Core][Node Labels 3/n]Add node labels to node resources and publish to all node #36009

Merged
merged 1 commit into from
Jun 14, 2023

@larrylian larrylian commented Jun 2, 2023

Why are these changes needed?

Add node labels to node resources and publish them to all nodes.

Related issue number

Enhancing node affinity scheduling feature through node labels #34894
(P1) Parse the configuration parameters for node labels and save them in the NodeInfo data structure.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@larrylian larrylian requested a review from a team as a code owner June 2, 2023 03:36
src/ray/raylet/scheduling/cluster_resource_scheduler.cc (outdated, resolved)
Comment on lines +303 to +302
if (it == nodes_.end()) {
  NodeResources node_resources;
  it = nodes_.emplace(node_id, node_resources).first;
}
Collaborator

It's a bit weird that we can have a node with only labels but empty resources. Shouldn't we be able to set labels and resources together when we add a node?

Contributor Author

I have reconsidered and still think it's better to keep this interface.

  1. It's not weird, because the original code logic already had cases where the total resources of a node were not set.
    image

  2. To update resources, use UpdateResourceCapacity, and to update labels, use UpdateNodeLabels. This way, the interface is more focused, and the code is simpler.
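The two-method split described in this reply can be sketched as follows. This is a minimal illustration, not Ray's implementation: `NodeResourcesSketch` and `ResourceManagerSketch` are hypothetical stand-ins, plain `int` replaces `scheduling::NodeID`, and the assumption that `UpdateNodeLabels` replaces the stored labels wholesale is mine.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for the NodeResources type in this PR.
struct NodeResourcesSketch {
  std::map<std::string, double> total;
  std::map<std::string, std::string> labels;
};

// Hypothetical manager; plain ints stand in for scheduling::NodeID.
class ResourceManagerSketch {
 public:
  // Touches only capacity. Creates the node entry if it does not exist yet.
  void UpdateResourceCapacity(int node_id, const std::string &name, double capacity) {
    nodes_[node_id].total[name] = capacity;
  }

  // Touches only labels. Mirrors the create-if-missing branch quoted above.
  // Assumption: the stored labels are replaced wholesale.
  void UpdateNodeLabels(int node_id, const std::map<std::string, std::string> &labels) {
    auto it = nodes_.find(node_id);
    if (it == nodes_.end()) {
      NodeResourcesSketch node_resources;
      it = nodes_.emplace(node_id, node_resources).first;
    }
    it->second.labels = labels;
  }

  const NodeResourcesSketch &Get(int node_id) const { return nodes_.at(node_id); }

 private:
  std::unordered_map<int, NodeResourcesSketch> nodes_;
};

// Usage: labels may arrive before any resources are known for the node.
inline void FocusedInterfaceExample() {
  ResourceManagerSketch manager;
  manager.UpdateNodeLabels(1, {{"zone", "a"}});
  assert(manager.Get(1).total.empty());
  manager.UpdateResourceCapacity(1, "CPU", 4.0);
  assert(manager.Get(1).labels.at("zone") == "a");
}
```

Because each method creates the entry on demand, neither call depends on the other having run first, which is the property the reply argues for.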

Collaborator

I'm actually wondering when we may not have total resources. Should it be a check instead of a warning? Let me check with the team.

Collaborator

Can we actually change if (it == nodes_.end()) { to a CHECK? That would make the code easier to reason about.

Contributor Author

My current implementation requires me to retain this logic. Since the labels information is already available in 'GcsNodeInfo', I will first call 'ResetNodeLabels()', and then update the resources.
image
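The ordering constraint in this reply (labels are set first, resources afterwards) is why a strict check would not fit. The sketch below contrasts the two behaviours under stated assumptions: `NodeEntry`, `NodeTable`, `SetLabelsLazy`, and `SetLabelsStrict` are hypothetical names, and the strict variant returns `false` at the point where a real CHECK would abort the process.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; these are not Ray's real types.
using Labels = std::map<std::string, std::string>;
struct NodeEntry {
  std::map<std::string, double> total;
  Labels labels;
};
using NodeTable = std::unordered_map<int, NodeEntry>;

// Lazy variant: creates the entry when the node is not known yet.
void SetLabelsLazy(NodeTable &nodes, int node_id, const Labels &labels) {
  nodes[node_id].labels = labels;
}

// Strict variant: models a CHECK by returning false where a CHECK would abort.
bool SetLabelsStrict(NodeTable &nodes, int node_id, const Labels &labels) {
  auto it = nodes.find(node_id);
  if (it == nodes.end()) {
    return false;
  }
  it->second.labels = labels;
  return true;
}

// Usage: with labels arriving first, only the lazy variant succeeds.
inline void LabelsFirstExample() {
  NodeTable nodes;
  assert(!SetLabelsStrict(nodes, 1, Labels{{"zone", "a"}}));
  SetLabelsLazy(nodes, 1, Labels{{"zone", "a"}});
  nodes[1].total["CPU"] = 4.0;  // resources are filled in afterwards
  assert(SetLabelsStrict(nodes, 1, Labels{{"zone", "b"}}));
}
```

The strict variant becomes viable only if the call order is changed so that the node entry always exists before labels are applied.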

NodeResources node_resources;
node_resources.total = ResourceMapToResourceRequest(resource_map_total, false);
node_resources.available = ResourceMapToResourceRequest(resource_map_available, false);
node_resources.labels = node_labels;
Collaborator

After we have labels, the current naming (NodeResources) is no longer that accurate. We should probably have something like

class Node {
  NodeResources resources;
  map<string, string> labels;
};

Contributor Author

  1. Labels can also be regarded as a resource of a node, so NodeResources is a suitable place for them, and putting them there avoids a lot of code changes.
  2. NodeResources itself does not need to be computable; the ResourceRequest inside NodeResources is what is computable. Adding labels to NodeResources does not affect the original available/total fields, etc.
    image

Contributor

Cool! Got the design.
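The design agreed on in this thread can be illustrated with a small sketch: the arithmetic lives in the request type, and labels sit next to it as passive metadata. All names here are hypothetical stand-ins (`ResourceRequestSketch`, `NodeResourcesSketch`, `TryAllocate`); Ray's real `ResourceRequest` is more involved.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the computable part (ResourceRequest in the PR).
struct ResourceRequestSketch {
  std::map<std::string, double> amounts;
};

// Hypothetical stand-in for NodeResources: computable fields plus labels.
struct NodeResourcesSketch {
  ResourceRequestSketch total;
  ResourceRequestSketch available;
  std::map<std::string, std::string> labels;
};

// Subtracts the demand from `available` if every requested amount fits.
// Labels are never read or written by the arithmetic.
bool TryAllocate(NodeResourcesSketch &node, const std::map<std::string, double> &demand) {
  for (const auto &kv : demand) {
    auto it = node.available.amounts.find(kv.first);
    if (it == node.available.amounts.end() || it->second < kv.second) {
      return false;
    }
  }
  for (const auto &kv : demand) {
    node.available.amounts[kv.first] -= kv.second;
  }
  return true;
}

// Usage: allocation changes `available` only; `total` and `labels` stay as set.
inline void LabelsDoNotAffectArithmeticExample() {
  NodeResourcesSketch node;
  node.total.amounts["CPU"] = 4.0;
  node.available.amounts["CPU"] = 4.0;
  node.labels["zone"] = "a";
  assert(TryAllocate(node, {{"CPU", 1.0}}));
  assert(node.available.amounts.at("CPU") == 3.0);
  assert(node.labels.at("zone") == "a");
}
```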

@larrylian larrylian requested review from jjyao and jiwq June 12, 2023 09:21
src/ray/raylet/node_manager.h (outdated, resolved)
src/ray/gcs/gcs_server/gcs_resource_manager.cc (outdated, resolved)
    const scheduling::NodeID &node_id,
    const absl::flat_hash_map<std::string, std::string> &labels) {
  if (labels.empty()) {
    return;
Collaborator

I don't think we should early return here. If labels are empty then we should just set node labels to empty.

Contributor Author

@larrylian larrylian Jun 13, 2023

Same as above.
I've thought about it, and for the scenario where nodes are added, the semantics should be 'reset' instead of 'update', since the labels are static. Therefore, I have changed the interface to 'ResetNodeLabels()'. If dynamic labels are added in the future, I will create a new interface called 'UpdateNodeLabels()'.
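The difference between the early-return behaviour the reviewer objected to and the "reset" semantics adopted here can be shown in a few lines. This is a sketch with hypothetical free functions over a plain `std::map`; the real methods operate on the cluster resource manager's node table.

```cpp
#include <cassert>
#include <map>
#include <string>

using Labels = std::map<std::string, std::string>;

// Early-return variant: an empty input leaves previously stored labels in place.
void UpdateLabelsEarlyReturn(Labels &stored, const Labels &incoming) {
  if (incoming.empty()) {
    return;
  }
  stored = incoming;
}

// Reset variant: the stored labels always end up equal to the input,
// so an empty input clears them.
void ResetLabels(Labels &stored, const Labels &incoming) { stored = incoming; }

// Usage: only the reset variant clears stale labels on empty input.
inline void ResetVersusEarlyReturnExample() {
  Labels stored{{"zone", "a"}};
  UpdateLabelsEarlyReturn(stored, Labels{});
  assert(stored.size() == 1);  // stale label survives
  ResetLabels(stored, Labels{});
  assert(stored.empty());
}
```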

src/ray/raylet/scheduling/cluster_resource_manager.cc (outdated, resolved)
@larrylian larrylian force-pushed the node_labels_3 branch 2 times, most recently from 3781ee3 to 3942b14 Compare June 13, 2023 04:00

absl::flat_hash_map<std::string, std::string> labels(node.labels().begin(),
                                                     node.labels().end());
cluster_resource_manager_.ResetNodeLabels(scheduling_node_id, labels);
Collaborator

nit: SetNodeLabels()

Contributor Author

Fixed
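The snippet in this thread copies the node's labels into a hash map through the iterator-range constructor. The same pattern is sketched below with standard containers only. Here `std::map` stands in for whatever `node.labels()` returns (presumably a protobuf map field, which is an assumption), `std::unordered_map` stands in for `absl::flat_hash_map`, and `CopyLabels` is a hypothetical helper name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Copies every key/value pair from the source map into a fresh hash map
// using the iterator-range constructor.
std::unordered_map<std::string, std::string> CopyLabels(
    const std::map<std::string, std::string> &source) {
  return std::unordered_map<std::string, std::string>(source.begin(), source.end());
}

// Usage: the copy is independent of the source once constructed.
inline void CopyLabelsExample() {
  std::map<std::string, std::string> source{{"zone", "a"}, {"gpu_type", "a100"}};
  auto labels = CopyLabels(source);
  source.clear();
  assert(labels.size() == 2);
  assert(labels.at("zone") == "a");
}
```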

@@ -237,6 +237,7 @@ class NodeManager : public rpc::NodeManagerServiceHandler,
   /// Handler for the addition or updation of a resource in the GCS
   /// \param node_id ID of the node that created or updated resources.
   /// \param createUpdatedResources Created or updated resources.
+  /// \param labels Created or updated labels of this node.
Collaborator

remove this line

Contributor Author

Fixed

@@ -44,10 +44,11 @@ ClusterResourceScheduler::ClusterResourceScheduler(
     const absl::flat_hash_map<std::string, double> &local_node_resources,
     std::function<bool(scheduling::NodeID)> is_node_available_fn,
     std::function<int64_t(void)> get_used_object_store_memory,
-    std::function<bool(void)> get_pull_manager_at_capacity)
+    std::function<bool(void)> get_pull_manager_at_capacity,
+    const absl::flat_hash_map<std::string, std::string> &local_node_labels)
Collaborator

Where is the change that creates the ClusterResourceScheduler and passes in the labels?

Contributor Author

image

all node

Signed-off-by: LarryLian <554538252@qq.com>
@jjyao jjyao merged commit 887c03f into ray-project:master Jun 14, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…to all node(ray-project#36009)

Signed-off-by: LarryLian <554538252@qq.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
3 participants