[core] Only get single node info rather then all when needed #49727

Merged 4 commits on Jan 16, 2025

Conversation

dayshah
Contributor

@dayshah dayshah commented Jan 8, 2025

Why are these changes needed?

GetAllNodeInfo is a heavily hit request, and on large clusters it gets slow because it has to copy and send back info for every node. However, some of these requests only need a single node, looked up by its id or ip address. GetAllNodeInfo already has the option for a filter, so we should use that to only get info for the node we need. Here I'm utilizing the existing node_id filter, adding a node_ip_address filter, and using the filters in the global_state_accessor.
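As a rough illustration of why a filter helps, here is a minimal, self-contained sketch. `NodeInfo`, `NodeFilter`, and both functions are hypothetical stand-ins written for this example, not Ray's protobuf types or GCS API; in the real system the filtering would happen on the server so that only matching entries are copied and sent back.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the node info message; field names are illustrative.
struct NodeInfo {
  std::string node_id;
  std::string node_ip_address;
  bool alive;
};

// Hypothetical filter: every field that is set must match.
struct NodeFilter {
  std::optional<std::string> node_id;
  std::optional<std::string> node_ip_address;
  std::optional<bool> alive;
};

// Returns the nodes that satisfy every set filter field.
// An empty filter matches all nodes, which is the expensive case.
std::vector<NodeInfo> GetAllNodeInfo(const std::vector<NodeInfo> &nodes,
                                     const NodeFilter &filter) {
  std::vector<NodeInfo> result;
  for (const auto &node : nodes) {
    if (filter.node_id && node.node_id != *filter.node_id) continue;
    if (filter.node_ip_address &&
        node.node_ip_address != *filter.node_ip_address) continue;
    if (filter.alive && node.alive != *filter.alive) continue;
    result.push_back(node);
  }
  return result;
}

// Convenience wrapper: look up a single alive node by id.
std::optional<NodeInfo> GetNodeById(const std::vector<NodeInfo> &nodes,
                                    const std::string &node_id) {
  NodeFilter filter;
  filter.node_id = node_id;
  filter.alive = true;
  auto matches = GetAllNodeInfo(nodes, filter);
  assert(matches.size() <= 1);  // Assumes node ids are unique.
  if (matches.empty()) return std::nullopt;
  return matches.front();
}
```

With a filter set, the response holds at most the handful of nodes that match, instead of one entry per node in the cluster.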

GetNodeToConnectForDriver Old Logic:

  • Get all the alive nodes
  • Get the gcs_server_address
  • If we find the node which matches the ip_address we wanted, return that node_info.

Else

  • If we find a node with an ip_address that matches the gcs_server_address, return that node_info.
  • If the ip_address we wanted matches the gcs_server_address and we find a node with ip_address 127.0.0.1, return that node_info.
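The old selection order can be sketched roughly as below. This is an illustration under stated assumptions, not the actual Ray code: `NodeInfo` and `PickNodeOld` are hypothetical names, and keeping the first fallback candidate (rather than the last) is a choice made for this sketch. The key point is that the caller must already hold the full list of alive nodes.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the node info message.
struct NodeInfo {
  std::string node_id;
  std::string node_ip_address;
};

// Old approach: scan the complete list of alive nodes on the client side.
std::optional<NodeInfo> PickNodeOld(const std::vector<NodeInfo> &alive_nodes,
                                    const std::string &wanted_ip,
                                    const std::string &gcs_ip) {
  assert(!wanted_ip.empty());  // The caller must pass a concrete address.
  std::optional<NodeInfo> fallback;
  for (const auto &node : alive_nodes) {
    // An exact match on the requested address always wins.
    if (node.node_ip_address == wanted_ip) return node;
    // Otherwise remember a node on the GCS address, or a localhost node
    // when the requested address is the GCS address.
    const bool on_gcs_address = node.node_ip_address == gcs_ip;
    const bool localhost_case =
        node.node_ip_address == "127.0.0.1" && gcs_ip == wanted_ip;
    if (!fallback && (on_gcs_address || localhost_case)) fallback = node;
  }
  return fallback;
}
```

Every call pays for transferring all alive nodes even though at most one entry is returned.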

GetNodeToConnectForDriver New Logic:

  • Try to get the info for the address we requested
  • If we get it, return that info

Else

  • Get the gcs_server_address
  • Try to get the node info for the gcs_server_address.
  • If the ip_address we wanted matches the gcs_server_address, try to get the node info for 127.0.0.1

The 127.0.0.1 lookup is the lowest-priority request because, in that case, the assumption is that the cluster is running locally and RPC performance shouldn't really matter.
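The new order, with one filtered request per attempt, can be sketched as follows. `NodeInfo`, `LookupByIp`, and `PickNodeNew` are hypothetical names for this example; `LookupByIp` stands in for a filtered GetAllNodeInfo call that returns only alive nodes on the given address.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the node info message.
struct NodeInfo {
  std::string node_id;
  std::string node_ip_address;
};

// Hypothetical filtered lookup: alive nodes whose IP address equals the argument.
using LookupByIp = std::function<std::vector<NodeInfo>(const std::string &)>;

// New approach: issue narrow lookups in priority order and stop at the first hit.
std::optional<NodeInfo> PickNodeNew(const LookupByIp &lookup,
                                    const std::string &wanted_ip,
                                    const std::string &gcs_ip) {
  assert(lookup != nullptr);
  // 1. The address the caller asked for.
  auto nodes = lookup(wanted_ip);
  if (nodes.empty()) {
    // 2. The address of the GCS server.
    nodes = lookup(gcs_ip);
    // 3. Localhost, only when the requested address is the GCS address.
    if (nodes.empty() && wanted_ip == gcs_ip) {
      nodes = lookup("127.0.0.1");
    }
  }
  if (nodes.empty()) return std::nullopt;
  return nodes.front();
}
```

In the common case the first lookup succeeds, so the cost is a single small response. The worst case is three small requests, which only happens on the localhost path.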

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dayshah dayshah changed the title from "[core] Get node info with filter by id" to "[core] Only get single node info rather then all when needed" Jan 13, 2025
@dayshah dayshah marked this pull request as ready for review January 13, 2025 07:28
@dayshah dayshah requested review from a team, pcmoritz and raulchen as code owners January 13, 2025 07:28
Signed-off-by: dayshah <dhyey2019@gmail.com>
// TODO(kfstorm): Do we need to replace `node_ip_address` with
// `get_node_ip_address()`?
if ((ip_address == "127.0.0.1" && gcs_address.first == node_ip_address) ||
ip_address == gcs_address.first) {
Contributor Author

i feel like the error messages don't totally align with the behavior here; we'll just return a node_info as long as it matches the gcs_address we fetched

python/ray/includes/global_state_accessor.pxd (outdated, resolved)
src/ray/gcs/gcs_client/accessor.cc (outdated, resolved)
/// Get information of all nodes from an RPC to GCS synchronously with filter.
///
/// \return All nodes that match the given filter from the gcs without the cache.
virtual StatusOr<std::vector<rpc::GcsNodeInfo>> GetAllNoCacheWithFilter(
Collaborator

nice

src/ray/gcs/gcs_client/accessor.h (outdated, resolved)
src/ray/gcs/gcs_client/global_state_accessor.cc (outdated, resolved)
src/ray/gcs/gcs_client/global_state_accessor.cc (outdated, resolved)
src/ray/gcs/gcs_client/global_state_accessor.cc (outdated, resolved)
src/ray/gcs/gcs_client/global_state_accessor.cc (outdated, resolved)
filter.set_state(rpc::GcsNodeInfo_GcsNodeState::GcsNodeInfo_GcsNodeState_ALIVE);
filter.set_node_id(node_id.Binary());
{
absl::ReaderMutexLock lock(&mutex_);
Collaborator

why lock?

Contributor Author

gcs_client_ is protected by mutex, this is how it's used throughout
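For readers unfamiliar with the pattern, here is a minimal sketch of guarding a shared client pointer with a reader lock. It uses `std::shared_mutex` from the standard library as a stand-in for the `absl::ReaderMutexLock` seen in the diff, and `FakeGcsClient`, `Accessor`, and the address string are hypothetical names and values for this example only.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical client; the address is an illustrative value.
struct FakeGcsClient {
  std::string GetAddress() const { return "10.0.0.1:6379"; }
};

class Accessor {
 public:
  // Writers take an exclusive lock because they replace or destroy the client.
  void Connect() {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    client_ = std::make_unique<FakeGcsClient>();
  }

  void Disconnect() {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    client_.reset();
  }

  bool IsConnected() const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return client_ != nullptr;
  }

  // Readers take a shared lock: the client cannot be reset while the call is
  // in flight, but several readers may still run concurrently.
  std::string GetGcsAddress() const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    assert(client_ != nullptr && "Connect() must be called first");
    return client_->GetAddress();
  }

 private:
  mutable std::shared_mutex mutex_;
  std::unique_ptr<FakeGcsClient> client_;
};
```

The shared lock does not serialize readers against each other; it only keeps `Disconnect()` from destroying the client while a read is still using it.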

Comment on lines +485 to +488
RAY_LOG(INFO) << "This node has an IP address of " << node_ip_address
              << ", but we cannot find a local Raylet with the same address. "
              << "This can happen when you connect to the Ray cluster "
              << "with a different IP address or when connecting to a container.";
Collaborator

??

Contributor Author

previously we were still logging this if we didn't get any node info matching the node_ip_address and had to fall back to the gcs_address we fetched instead of the node_ip_address passed into the function

Signed-off-by: dayshah <dhyey2019@gmail.com>
@rynewang
Contributor

[jokes] Nowadays we are reinventing sql server, step 2: query by index. lol

absl::ReaderMutexLock lock(&mutex_);
auto timeout_ms =
std::max(end_time_point - current_time_ms(), static_cast<int64_t>(0));
node_infos_status =
Contributor

Would RAY_ASSIGN_OR_RETURN work here?

Contributor Author

ya this was super useful ty!
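For context, an assign-or-return macro unwraps a status-or-value result and returns early on error, which removes the manual "check status, then extract value" boilerplate. The sketch below shows the general idea only; `Status`, `StatusOr`, `SKETCH_ASSIGN_OR_RETURN`, `FetchNodes`, and `FirstNode` are hypothetical and are not Ray's definitions of `StatusOr` or `RAY_ASSIGN_OR_RETURN`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Minimal hypothetical status type.
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Error(std::string msg) {
    Status s;
    s.ok_ = false;
    s.message_ = std::move(msg);
    return s;
  }
  bool ok() const { return ok_; }
  const std::string &message() const { return message_; }

 private:
  bool ok_ = true;
  std::string message_;
};

// Minimal hypothetical StatusOr: holds either a value or a non-OK status.
template <typename T>
class StatusOr {
 public:
  StatusOr(T value) : data_(std::move(value)) {}
  StatusOr(Status status) : data_(std::move(status)) {
    assert(!std::get<Status>(data_).ok());  // An OK status carries no value.
  }
  bool ok() const { return std::holds_alternative<T>(data_); }
  Status status() const { return ok() ? Status::OK() : std::get<Status>(data_); }
  T &value() {
    assert(ok());
    return std::get<T>(data_);
  }

 private:
  std::variant<T, Status> data_;
};

#define SKETCH_CONCAT_INNER(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_INNER(a, b)
// Expands to several statements, so use it only at block scope.
#define SKETCH_ASSIGN_OR_RETURN_IMPL(tmp, lhs, expr) \
  auto tmp = (expr);                                 \
  if (!tmp.ok()) return tmp.status();                \
  lhs = std::move(tmp.value())
#define SKETCH_ASSIGN_OR_RETURN(lhs, expr) \
  SKETCH_ASSIGN_OR_RETURN_IMPL(SKETCH_CONCAT(status_or_, __LINE__), lhs, expr)

// Hypothetical RPC that may fail.
StatusOr<std::vector<std::string>> FetchNodes(bool fail) {
  if (fail) return Status::Error("rpc timed out");
  return std::vector<std::string>{"node-a", "node-b"};
}

// The error from FetchNodes propagates without an explicit status check.
StatusOr<std::string> FirstNode(bool fail) {
  SKETCH_ASSIGN_OR_RETURN(auto nodes, FetchNodes(fail));
  if (nodes.empty()) return Status::Error("no nodes");
  return nodes.front();
}
```

The temporary's name is derived from `__LINE__` so that two uses in the same scope do not collide.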

Signed-off-by: dayshah <dhyey2019@gmail.com>
@jjyao jjyao added the "go" label (add ONLY when ready to merge, run all tests) Jan 13, 2025
Collaborator

@jjyao jjyao left a comment

LG

src/ray/gcs/gcs_client/accessor.h (outdated, resolved)
src/ray/gcs/gcs_client/accessor.h (outdated, resolved)
src/ray/gcs/gcs_client/accessor.h (outdated, resolved)
Comment on lines +470 to +471
if (node_infos.empty() && node_ip_address == gcs_address) {
  filters.set_node_ip_address("127.0.0.1");
Collaborator

If you know, can you comment on what this case is?

Contributor Author

@dayshah dayshah Jan 15, 2025

Related code (listen_to_localhost_only sets GrpcServer to listen to 127.0.0.1) and the PR that introduced this: #16810. I don't see a concrete reason to have this last case anywhere; ideally the above case of looking for gcs_address should cover it, but maybe it handles some container case or something similar?

Collaborator

ok, let's keep it for now.

src/ray/gcs/gcs_server/gcs_node_manager.cc (outdated, resolved)
Signed-off-by: dayshah <dhyey2019@gmail.com>
@jjyao jjyao merged commit 6c0fa3d into ray-project:master Jan 16, 2025
5 checks passed
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
4 participants