Update `DistNeighborSampler` for hetero #8503

kgajdamo · 2023-12-01T15:39:06Z

The purpose of this PR is to improve the distributed hetero sampling algorithm.
IMPORTANT INFO: This PR is complementary to #284 from pyg-lib. The pyg-lib PR needs to be merged for this one to work properly.

Description: (sorry if too long)
Distributed hetero neighbor sampling is analogous to homo sampling, but more complicated because there are different types of nodes and edges.
Sampling in distributed training imitates the hetero_neighbor_sample() function in pyg-lib, so the mechanism and the variable names are similar.
Because the results must be synchronized between machines after each layer is sampled, the loop that iterates through the layers is implemented in Python.

The two main loops iterate sequentially over layers and edge types. Inside the loops, the sample_one_hop() function is called, which performs sampling for one layer.
The input to the sample_one_hop() function is data of a single type, so its execution is almost identical to the homo case.
Depending on whether the input nodes are located on the local partition or on a remote one, sample_one_hop() either performs the sampling itself or sends an RPC request to the remote machine to do so. The dist_neighbor_sample()->neighbor_sample() function is used for sampling. Nodes are sampled with duplicates so that they can later be used to construct local to global node mappings.
When all machines have finished sampling, their outputs are merged and synchronized in the same way as for homo.
The results then return to the node_sample() function, where they are written to the output dictionaries and the src nodes for the next layer are calculated.
After all layers have been processed, the global node indices are finally mapped to the local ones in the hetero_dist_relabel() function.
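The control flow described above can be sketched on a single machine. This is a minimal illustration, not the code from this PR: the toy graph, the function signatures, and the use of plain Python lists instead of tensors are all assumptions, and the RPC dispatch and cross-machine merge steps are left out entirely.

```python
import random
from collections import defaultdict


def sample_one_hop(adj, srcs, k, rng):
    """Sample up to k neighbors per source node; duplicates are kept."""
    nbrs, counts = [], []
    for s in srcs:
        cand = adj.get(s, [])
        picked = rng.sample(cand, min(k, len(cand)))
        nbrs.extend(picked)
        counts.append(len(picked))
    return nbrs, counts


def hetero_sample(graph, seeds, num_neighbors, seed=0):
    """graph: (src_type, relation, nbr_type) -> {src id: [neighbor ids]}."""
    rng = random.Random(seed)
    src = {t: list(ids) for t, ids in seeds.items()}  # frontier per node type
    out = {t: list(ids) for t, ids in seeds.items()}  # unique nodes per type
    seen = {t: set(ids) for t, ids in seeds.items()}
    nbrs_per_node = defaultdict(list)  # edge type -> one count list per layer

    for k in num_neighbors:  # outer loop: layers
        with_dupl = defaultdict(list)
        for etype, adj in graph.items():  # inner loop: edge types
            src_type, _, nbr_type = etype
            nbrs, counts = sample_one_hop(adj, src.get(src_type, []), k, rng)
            with_dupl[nbr_type].extend(nbrs)
            nbrs_per_node[etype].append(counts)
        # Nodes seen for the first time become the src nodes of the next layer.
        src = {}
        for t, nodes in with_dupl.items():
            known = seen.setdefault(t, set())
            new = []
            for n in nodes:
                if n not in known:
                    known.add(n)
                    new.append(n)
            out.setdefault(t, []).extend(new)
            src[t] = new

    # Final relabel: global id -> position in the per-type output list.
    local = {t: {g: i for i, g in enumerate(ns)} for t, ns in out.items()}
    return out, dict(nbrs_per_node), local


PA = ("paper", "written_by", "author")
AP = ("author", "writes", "paper")
graph = {PA: {0: [10, 11], 1: [11]}, AP: {10: [0, 2], 11: [1, 3]}}
out, nbrs_per_node, local = hetero_sample(graph, {"paper": [0, 1]}, [2, 2])
```

Keeping the per-layer neighbor counts for every edge type, even when a layer has no source nodes of the matching type, keeps the layer index aligned across edge types.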

Information about some of the variables used in the node_sample() function:
node_dict - class storing information about nodes. It has three fields: src, with_dupl, out, which are described in more detail in the distributed/utils.py file.
batch_dict - class used when sampling with the disjoint option. It stores information about which subgraph each node belongs to. Just like node_dict, it has three fields: src, with_dupl, out.
sampled_nbrs_per_node_dict - a dictionary that stores the number of neighbors sampled by each src node. To facilitate subsequent operations, the entry for each edge type is additionally divided into layers.
num_sampled_nodes_dict, num_sampled_edges_dict - needed for HGAM to work.
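For readers unfamiliar with the three-field layout, here is a hypothetical sketch of how such a container could behave. The class name `NodeDictSketch`, the `update()` helper, and the use of Python lists are assumptions made for illustration; the actual class lives in distributed/utils.py and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeDictSketch:
    # Nodes that act as source nodes in the next layer, per node type.
    src: Dict[str, List[int]] = field(default_factory=dict)
    # Every sampled node, duplicates included, per node type.
    with_dupl: Dict[str, List[int]] = field(default_factory=dict)
    # Output nodes without duplicates, per node type.
    out: Dict[str, List[int]] = field(default_factory=dict)

    def update(self, node_type: str, sampled: List[int]) -> List[int]:
        """Record one layer of sampled nodes and return the new ones."""
        self.with_dupl.setdefault(node_type, []).extend(sampled)
        known = set(self.out.setdefault(node_type, []))
        new = []
        for n in sampled:
            if n not in known:
                known.add(n)
                new.append(n)
        self.out[node_type].extend(new)
        self.src[node_type] = new
        return new


node_dict = NodeDictSketch(out={"paper": [0, 1]})
new_nodes = node_dict.update("paper", [1, 2, 2, 3])
```

A batch_dict for disjoint sampling could follow the same shape, storing subgraph ids instead of node ids.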

codecov · 2023-12-01T15:46:44Z

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (331e5d1) 88.90% compared to head (441df26) 89.19%.

❗ Current head 441df26 differs from pull request most recent head ac9e9f8. Consider uploading reports for the commit ac9e9f8 to get more accurate results

Files	Patch %	Lines
...rch_geometric/distributed/dist_neighbor_sampler.py	90.19%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8503      +/-   ##
==========================================
+ Coverage   88.90%   89.19%   +0.29%     
==========================================
  Files         480      480              
  Lines       30563    30619      +56     
==========================================
+ Hits        27171    27310     +139     
+ Misses       3392     3309      -83


Because the implementation of distributed training for hetero has changed, the dist hetero relabel neighborhood function must change as well. Related pytorch_geometric PR: [#8503](pyg-team/pytorch_geometric#8503)

Changes made:
- The `num_sampled_neighbors_per_node` dictionary now stores the number of sampled neighbors for each layer separately: `const c10::Dict<rel_type, std::vector<int64_t>>&num_sampled_neighbors_per_node_dict` -> `const c10::Dict<rel_type, std::vector<std::vector<int64_t>>>&num_sampled_neighbors_per_node_dict`
- The method of mapping nodes has also changed. Mapping is now done layer by layer.
- After each layer, the range of src nodes for each edge type for the next layer is calculated. The offsets for edge types that share the same src node type must be the same.
- The src node range for each edge type in a given layer is defined by the dictionary `srcs_slice_dict`. Local src nodes (`sampled_rows`) are created from it, and the start value for the next layer is the end value from the previous layer.
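The layer-by-layer slicing can be illustrated with a short Python sketch. It is not the C++ implementation from pyg-lib: the function `build_sampled_rows` and its `num_src_per_layer` argument are hypothetical, and only the idea of per-layer slices whose start equals the previous end is taken from the description above.

```python
def build_sampled_rows(nbrs_per_node, num_src_per_layer):
    """Expand per-node neighbor counts into local source indices.

    nbrs_per_node: edge type -> [layer][i], neighbors sampled by the i-th
        source node of that layer.
    num_src_per_layer: node type -> [layer], number of source nodes of that
        type in that layer.
    """
    rows = {etype: [] for etype in nbrs_per_node}
    begin = {t: 0 for t in num_src_per_layer}
    num_layers = len(next(iter(num_src_per_layer.values())))
    for layer in range(num_layers):
        # Slices are kept per node type, so edge types with the same source
        # node type automatically share the same offsets.
        srcs_slice = {
            t: (begin[t], begin[t] + counts[layer])
            for t, counts in num_src_per_layer.items()
        }
        for etype, per_layer in nbrs_per_node.items():
            start, end = srcs_slice[etype[0]]
            for local_src, c in zip(range(start, end), per_layer[layer]):
                rows[etype].extend([local_src] * c)
        # The next layer starts where this one ended.
        begin = {t: s[1] for t, s in srcs_slice.items()}
    return rows


E = ("n", "to", "n")
rows_single = build_sampled_rows({E: [[2], [1, 1]]}, {"n": [1, 2]})

PA = ("paper", "written_by", "author")
AP = ("author", "writes", "paper")
rows_hetero = build_sampled_rows(
    {PA: [[2, 1], []], AP: [[], [2, 2]]},
    {"paper": [2, 0], "author": [0, 2]},
)
```

In the single-type example, the seed node occupies local index 0 in the first layer, and the two nodes it discovers occupy indices 1 and 2 in the second.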

Comment about the change: when no edges were sampled for a given edge type, we do not add edge attributes to the batch.

JakubPietrakIntel

Thanks Kinga, looks amazing.
Depending on whether #8605 lands before this PR, we might need some adjustments in test_dist_neighbor_loader.py to account for a different RPC shutdown method, or I'll make the corrections in the other PR.

This PR fixes RPC-related errors caused by premature worker shutdown. Closed #8605 and opened this PR to align with changes in #8503.

The cause was multiple `atexit` statements, defined both in `worker_loop` and in `rpc.py`, that led to unpredictable behavior and resulted in errors when the `ConcurrentEventLoop` shutdown was lagging behind.

```
RuntimeError: EPIPE: broken pipe (this error originated at tensorpipe/transport/uv/connection_impl.cc:157)
[W tensorpipe_agent.cpp:725] RPC agent for mp_sampling_worker-13 encountered error when reading incoming request from mp_sampling_worker-6: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
```

Also, I've removed `rpc_workers_names` from the loader args, as we're not using it in the current implementation. Updated tests for the sampler.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
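The description above does not say how the duplicate `atexit` registrations were resolved. One general pattern for this class of problem, shown here only as an assumption and not as the code from #8637, is to route every exit path through a single idempotent shutdown callable. The class `ShutdownGuard` and the `events` list are hypothetical.

```python
import atexit
import threading


class ShutdownGuard:
    """Wrap a shutdown function so that it runs at most once."""

    def __init__(self, shutdown_fn):
        self._shutdown_fn = shutdown_fn
        self._lock = threading.Lock()
        self._done = False

    def __call__(self):
        with self._lock:
            if self._done:
                return False  # a previous caller already shut down
            self._done = True
        self._shutdown_fn()
        return True


events = []
guard = ShutdownGuard(lambda: events.append("shutdown"))

# Registering the same guard from several places is harmless: only the first
# invocation performs the shutdown.
atexit.register(guard)
atexit.register(guard)

first = guard()
second = guard()
```

With this shape, an explicit shutdown in a worker loop and a late `atexit` hook cannot both tear down the same resource.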

kgajdamo added sampler distributed labels Dec 1, 2023

kgajdamo requested review from ZhengHongming888 and JakubPietrakIntel December 1, 2023 15:39

kgajdamo requested review from wsad1 and rusty1s as code owners December 1, 2023 15:39

kgajdamo mentioned this pull request Dec 1, 2023

Update hetero dist relabel pyg-team/pyg-lib#284

Merged

kgajdamo force-pushed the dist-hetero branch from 678c61b to 4b14fd7 Compare December 1, 2023 15:54

kgajdamo changed the title ~~Update DisNeighborSampler for hetero~~ Update DistNeighborSampler for hetero Dec 4, 2023

kgajdamo force-pushed the dist-hetero branch 2 times, most recently from 88b148d to 4bce42a Compare December 4, 2023 12:52

kgajdamo added feature loader labels Dec 5, 2023

rusty1s assigned kgajdamo Dec 6, 2023

rusty1s added the 0 - Priority P0 label Dec 6, 2023

kgajdamo force-pushed the dist-hetero branch 3 times, most recently from 3dbf720 to 39c2cb8 Compare December 11, 2023 09:17

kgajdamo added 5 commits December 11, 2023 10:56

update dist hetero + tests

cba3d40

update CHANGELOG.md

46762fe

remove init process group from dist sampler hetero tests

91e2d31

use context manager functionality to close the socket in tests

77e4336

Enable dist neigbor loader hetero test

39c2cb8

JakubPietrakIntel approved these changes Dec 14, 2023

update

7873a36

rusty1s added the skip-changelog label Dec 14, 2023

rusty1s changed the title ~~Update DistNeighborSampler for hetero~~ Update DistNeighborSampler for hetero Dec 14, 2023

update

a48d04f

update

441df26

rusty1s approved these changes Dec 14, 2023

Merge branch 'master' into dist-hetero

ac9e9f8

rusty1s enabled auto-merge (squash) December 14, 2023 13:58

rusty1s merged commit c5bf8ef into pyg-team:master Dec 14, 2023
12 checks passed

JakubPietrakIntel mentioned this pull request Dec 19, 2023

Fix RPC timeout issues caused by premature worker closing #8637

Merged
