Update `DistNeighborSampler` for hetero (#8503)
Codecov Report

```
@@            Coverage Diff             @@
##           master    #8503      +/-   ##
==========================================
+ Coverage   88.90%   89.19%   +0.29%
==========================================
  Files         480      480
  Lines       30563    30619      +56
==========================================
+ Hits        27171    27310     +139
+ Misses       3392     3309      -83
==========================================
```
Because the implementation of distributed training for hetero has changed, the distributed hetero relabel neighborhood function must also change. Related pytorch_geometric PR: [#8503](pyg-team/pytorch_geometric#8503)

Changes made:
- The `num_sampled_neighbors_per_node` dictionary now stores the number of sampled neighbors for each layer separately: `const c10::Dict<rel_type, std::vector<int64_t>>& num_sampled_neighbors_per_node_dict` -> `const c10::Dict<rel_type, std::vector<std::vector<int64_t>>>& num_sampled_neighbors_per_node_dict`
- The method of mapping nodes has also been changed; it is now done layer by layer.
- After each layer, the range of src nodes for each edge type in the next layer is calculated, and the offsets for edge types sharing the same src node type must be equal.
- The src node range for each edge type in a given layer is defined by the `srcs_slice_dict` dictionary. Local src nodes (`sampled_rows`) are created on its basis, and the start value for the next layer is the end value from the previous layer.
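The layout change can be pictured with a small Python sketch (a stand-in for the pyg-lib C++ dictionaries; the relation key and all data values below are made up for illustration):

```python
# Old layout: one flat list of neighbor counts per relation type.
num_sampled_neighbors_flat = {
    "author__writes__paper": [2, 3, 1, 2],  # all layers concatenated
}

# New layout: one inner list per layer, so each layer can be relabeled
# separately and per-layer src-node ranges can be derived.
num_sampled_neighbors_per_layer = {
    "author__writes__paper": [[2, 3], [1, 2]],  # layer 0, layer 1
}

# The src range of the next layer starts where the previous layer's ended:
counts = num_sampled_neighbors_per_layer["author__writes__paper"]
slices, start = [], 0
for layer_counts in counts:
    end = start + sum(layer_counts)
    slices.append((start, end))
    start = end
# slices is now [(0, 5), (5, 8)]
```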
Comment about the change: when no edges were sampled for a given edge type, we do not add edge attributes to the batch.
Thanks Kinga, looks amazing.
Depending on whether #8605 comes in before this one, we might need some adjustments in test_dist_neighbor_loader.py to account for a different RPC shutdown method, or I'll make the corrections in the other PR.
`DistNeighborSampler` for hetero

This PR fixes RPC-related errors caused by premature worker shutdown. Closed #8605 and opened this PR to align with the changes in #8503.

The cause was multiple `atexit` handlers, registered both in `worker_loop` and in `rpc.py`, which led to unpredictable behavior and errors when the `ConcurrentEventLoop` shutdown lagged behind:

```
RuntimeError: EPIPE: broken pipe (this error originated at tensorpipe/transport/uv/connection_impl.cc:157)
[W tensorpipe_agent.cpp:725] RPC agent for mp_sampling_worker-13 encountered error when reading incoming request from mp_sampling_worker-6: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
```

I've also removed `rpc_workers_names` from the loader args, as the current implementation does not use it. Updated tests for the sampler.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
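A minimal sketch of the failure mode and the fix (illustrative only; `close_event_loop`, `shutdown_rpc`, and `shutdown_all` are hypothetical stand-ins, not the actual PyG code). `atexit` runs handlers in reverse registration order, so two independently registered shutdown hooks can tear down shared RPC state in an order that depends on when each module registered; a single hook makes the ordering explicit:

```python
import atexit

shutdown_log = []

def close_event_loop():
    # Drain pending work of the concurrent event loop first.
    shutdown_log.append("event_loop_closed")

def shutdown_rpc():
    # If this runs while the event loop still has in-flight RPC work,
    # peers observe broken pipes (EPIPE), as in the traceback above.
    shutdown_log.append("rpc_shutdown")

def shutdown_all():
    # One handler enforces the ordering explicitly instead of relying
    # on atexit's LIFO order across two separately registered hooks.
    close_event_loop()
    shutdown_rpc()

atexit.register(shutdown_all)
```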
The purpose of this PR is to improve the distributed hetero sampling algorithm.
IMPORTANT INFO: This PR is complementary to #284 from pyg-lib. The pyg-lib one needs to be merged for this one to work properly.
Description: (sorry if too long)
Distributed hetero neighbor sampling is a procedure analogous to homo sampling, but more complicated due to the presence of different types of nodes and edges. Sampling in distributed training imitates the `hetero_neighbor_sample()` function in pyg-lib; therefore, the mechanism of action and the nomenclature of variables are similar.

Because in distributed training the results must be synchronized between machines after sampling each layer, the loop iterating through the layers was implemented in Python. The two main loops iterate sequentially: over layers and over edge types. Inside the loop, the `sample_one_hop()` function is called, which performs sampling for one layer. The input to `sample_one_hop()` is data of a specific type, so its execution is almost identical to homo.

Depending on whether the input nodes are located on the local partition or a remote one, `sample_one_hop()` performs sampling directly or sends an RPC request to the remote machine to do so. The `dist_neighbor_sample()` -> `neighbor_sample()` functions are used for sampling. Nodes are sampled with duplicates so that they can later be used to construct local-to-global node mappings. When all machines have finished sampling, their outputs are merged and synchronized in the same way as for homo.
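The control flow above can be sketched with a self-contained toy (illustrative only: `GRAPH`, the frontier handling, and the helper bodies are made up; only the layer/edge-type loop structure mirrors the description, and the RPC dispatch and cross-machine merge are reduced to comments):

```python
from collections import defaultdict

# Toy hetero graph: edge_type -> {src_node: [neighbors]}.
GRAPH = {
    ("author", "writes", "paper"): {0: [10, 11], 1: [11]},
    ("paper", "cites", "paper"): {10: [12], 11: [12, 13]},
}

def sample_one_hop(srcs, fanout, edge_type):
    """Sample up to `fanout` neighbors per src node, keeping duplicates
    (in the real code, remote srcs trigger an RPC to their partition)."""
    out = []
    for s in srcs:
        out.extend(GRAPH.get(edge_type, {}).get(s, [])[:fanout])
    return out

def node_sample(seed_dict, num_neighbors):
    """Layer loop kept in Python so results can be synchronized
    between machines after every layer."""
    node_dict = {k: list(v) for k, v in seed_dict.items()}
    frontier = {k: list(v) for k, v in seed_dict.items()}
    for fanout in num_neighbors:           # loop over layers
        next_frontier = defaultdict(list)
        for edge_type in GRAPH:            # loop over edge types
            src_type, _, dst_type = edge_type
            sampled = sample_one_hop(frontier.get(src_type, []), fanout, edge_type)
            # (here the real code merges and synchronizes the outputs
            # of all machines before continuing)
            node_dict.setdefault(dst_type, []).extend(sampled)
            next_frontier[dst_type].extend(sampled)
        frontier = next_frontier           # srcs for the next layer
    return node_dict                       # still global ids; relabeling
                                           # to local ids happens afterwards
```

Note how duplicates are kept in `node_dict` (e.g. node 11 sampled twice), matching the point above that duplicated samples are later used to build the local-to-global mappings.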
The results are then returned to the `node_sample()` function, where they are written to the output dictionaries and the src nodes for the next layer are calculated. After going through all the layers, the global node indices are finally mapped to local ones in the `hetero_dist_relabel()` function.

Information about some of the variables used in the `node_sample()` function:
- `node_dict` - class storing information about nodes. It has three fields: `src`, `with_dupl`, and `out`, which are described in more detail in the `distributed/utils.py` file.
- `batch_dict` - class used when sampling with the disjoint option. It stores information about the affiliation of nodes to subgraphs. Just like `node_dict`, it has three fields: `src`, `with_dupl`, and `out`.
- `sampled_nbrs_per_node_dict` - a dictionary that stores the number of neighbors sampled by each src node. To facilitate subsequent operations, each edge type is additionally divided into layers.
- `num_sampled_nodes_dict`, `num_sampled_edges_dict` - needed for HGAM to work.