Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DistNeighborSampler for hetero #8503

Merged
merged 9 commits into from
Dec 14, 2023
Merged

Conversation

kgajdamo
Copy link
Contributor

@kgajdamo kgajdamo commented Dec 1, 2023

The purpose of this PR is to improve distributed hetero sampling algorithm.
IMPORTANT INFO: This PR is complementary with #284 from pyg-lib. The pyg-lib one needs to be merged for this one to work properly.

Description: (sorry if too long)
Distributed hetero neighbor sampling is a procedure analogous to homo sampling, but more complicated due to the presence of different types of nodes and edges.
Sampling in distributed training imitates the hetero_neighbor_sample() function in pyg-lib. Therefore, the mechanism of action and the nomenclature of variables are similar.
Due to the fact that in distributed training, after sampling each layer, it is necessary to synchronize the results between machines, the loop iterating through the layers was implemented in Python.

The main two loops iterate sequentially: over layers and edge types. Inside the loop, the sample_one_hop() function is called, which performs sampling for one layer.
The input to the sample_one_hop() function is data of a specific type, so its execution is almost identical to homo.
The sample_one_hop() function, depending on whether the input nodes are located on a given partition or a remote one, performs sampling or sends an RPC request to the remote machine to do so. The dist_neighbor_sample()->neighbor_sample() function is used for sampling. Nodes are sampled with duplicates so that they can later be used to construct local to global node mappings.
When all machines have finished sampling, their outputs are merged and synchronized in the same way as for homo.
Then the results return to the node_sample() function where they are written to the output dictionaries and the src nodes for the next layer are calculated.
After going through all the layers, the global node indices are finally mapped to the local ones in the hetero_dist_relabel() function.

Information about some of the variables used in a node_sample() function:
node_dict - class storing information about nodes. It has three fields: src, with_dupl, out, which are described in more detail in the distributed/utils.py file.
batch_dict - class used when sampling with the disjoint option. It stores information about the affiliation of nodes to subgraphs. Just like node_dict, it has three fields: src, with_dupl, out.
sampled_nbrs_per_node_dict - a dictionary that stores information about the number of sampled neighbors by each src node. To facilitate subsequent operations, for each edge type is additionally divided into layers.
num_sampled_nodes_dict, num_sampled_edges_dict - needed for HGAM to work.

Copy link

codecov bot commented Dec 1, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (331e5d1) 88.90% compared to head (441df26) 89.19%.

❗ Current head 441df26 differs from pull request most recent head ac9e9f8. Consider uploading reports for the commit ac9e9f8 to get more accurate results

Files Patch % Lines
...rch_geometric/distributed/dist_neighbor_sampler.py 90.19% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8503      +/-   ##
==========================================
+ Coverage   88.90%   89.19%   +0.29%     
==========================================
  Files         480      480              
  Lines       30563    30619      +56     
==========================================
+ Hits        27171    27310     +139     
+ Misses       3392     3309      -83     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rusty1s pushed a commit to pyg-team/pyg-lib that referenced this pull request Dec 4, 2023
Due to the fact that the implementation of distributed training for
hetero has changed, it is also necessary to change the dist hetero
relabel neighborhood function.

Related pytorch_geometric PR:
[#8503](pyg-team/pytorch_geometric#8503)

Changes made:
- `num_sampled_neighbors_per_node` dictionary currently store
information about the number of sampled neighbors for each layer
separately:

`const c10::Dict<rel_type,
std::vector<int64_t>>&num_sampled_neighbors_per_node_dict` -> `const
c10::Dict<rel_type,
std::vector<std::vector<int64_t>>>&num_sampled_neighbors_per_node_dict`
- The method of mapping nodes has also been changed. This is now done
layer by layer.
- After each layer, the range of src nodes for each edge type for the
next layer is calculated and the offsets for edge types having the same
src node types must be the same.
- The src node range for each edge type in a given layer is defined by a
dictionary `srcs_slice_dict`. Local src nodes (`sampled_rows`) will be
created on its basis and the starting value of the next layer will be
the end value from the previous layer.
@kgajdamo kgajdamo changed the title Update DisNeighborSampler for hetero Update DistNeighborSampler for hetero Dec 4, 2023
@kgajdamo kgajdamo force-pushed the dist-hetero branch 2 times, most recently from 88b148d to 4bce42a Compare December 4, 2023 12:52
@kgajdamo kgajdamo force-pushed the dist-hetero branch 3 times, most recently from 3dbf720 to 39c2cb8 Compare December 11, 2023 09:17
Copy link
Contributor

@JakubPietrakIntel JakubPietrakIntel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kinga, looks amazing.
Depending if #8605 comes in before that, we might do some adjustments in test_dist_neighbor_loader.py to account for a different RPC shutdown method or I'll make corrections in the other PR.

@rusty1s rusty1s changed the title Update DistNeighborSampler for hetero Update DistNeighborSampler for hetero Dec 14, 2023
@rusty1s rusty1s enabled auto-merge (squash) December 14, 2023 13:58
@rusty1s rusty1s merged commit c5bf8ef into pyg-team:master Dec 14, 2023
12 checks passed
JakubPietrakIntel added a commit that referenced this pull request Dec 20, 2023
This PR fixes RPC-related errors caused by premature worker shutdown. 
Closed #8605 and opened this PR to align with changes in #8503.

The cause were multiple `atexit` statements defined both in
`worker_loop` and in `rpc.py` that lead to unpredicatable behaviors
resulting in errors, when the `ConcurrentEventLoop` shutdown was lagging
behind.
```
RuntimeError: EPIPE: broken pipe (this error originated at tensorpipe/transport/uv/connection_impl.cc:157)
[W tensorpipe_agent.cpp:725] RPC agent for mp_sampling_worker-13 encountered error when reading incoming request from mp_sampling_worker-6: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
```
Also I've removed `rpc_workers_names` from loader args as we're not
using that in current implementation.
Updated tests for the sampler.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants