-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Only get single node info rather then all when needed #49727
Conversation
5048ac9
to
84a2666
Compare
Signed-off-by: dayshah <dhyey2019@gmail.com>
84a2666
to
1bce3af
Compare
// TODO(kfstorm): Do we need to replace `node_ip_address` with | ||
// `get_node_ip_address()`? | ||
if ((ip_address == "127.0.0.1" && gcs_address.first == node_ip_address) || | ||
ip_address == gcs_address.first) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel like the error messages don't totally align with the error messages, here we'll just return a node_info as long it matches the gcs_address we fetched
src/ray/gcs/gcs_client/accessor.h
Outdated
/// Get information of all nodes from an RPC to GCS synchronously with filter. | ||
/// | ||
/// \return All nodes that match the given filter from the gcs without the cache. | ||
virtual StatusOr<std::vector<rpc::GcsNodeInfo>> GetAllNoCacheWithFilter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
filter.set_state(rpc::GcsNodeInfo_GcsNodeState::GcsNodeInfo_GcsNodeState_ALIVE); | ||
filter.set_node_id(node_id.Binary()); | ||
{ | ||
absl::ReaderMutexLock lock(&mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gcs_client_ is protected by mutex, this is how it's used throughout
RAY_LOG(INFO) << "This node has an IP address of " << node_ip_address | ||
<< ", but we cannot find a local Raylet with the same address. " | ||
<< "This can happen when you connect to the Ray cluster " | ||
<< "with a different IP address or when connecting to a container."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously we were still logging this if we don't get any node info that matches with the node_ip_address id and have to resort to using the gcs_address we fetched instead of the node_ip_address passed into the function
Signed-off-by: dayshah <dhyey2019@gmail.com>
[jokes] Nowadays we are reinventing sql server, step 2: query by index. lol |
absl::ReaderMutexLock lock(&mutex_); | ||
auto timeout_ms = | ||
std::max(end_time_point - current_time_ms(), static_cast<int64_t>(0)); | ||
node_infos_status = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would RAY_ASSIGN_OR_RETURN
work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya this was super useful ty!
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
if (node_infos.empty() && node_ip_address == gcs_address) { | ||
filters.set_node_ip_address("127.0.0.1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you know, can you comment on what this case is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related code (listen_to_localhost_only sets GrpcServer to listen to 127.0.0.1) with and pr that introduced this #16810. Don't see a concrete reason to have this last case anywhere, ideally the above case of looking for gcs_address should cover it, but maybe some case of container or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, lets keep it for now.
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Why are these changes needed?
GetAllNodeInfo is a heavily hit request on large clusters that starts to take a long time as clusters get large since it has to copy and send back info for all nodes. However, on some of these requests we're only looking for a single node with its id or ip address, and GetAllNodeInfo already has the option for a filter, so we should use that to only get info for the node we need it for. Here I'm utilizing the node_id filter and adding a node_ip_address filter, and using it in the global_state_accessor.
GetNodeToConnectForDriver Old Logic:
Else
GcsNodeToConnectForDriver New Logic
Else
The 127.0.0.1 is the last prioritized requst because the assumption is that the cluster is running locally and rpc performance shouldn't really matter.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.