Update hetero dist relabel #284

Merged
merged 1 commit into from
Dec 4, 2023

@kgajdamo kgajdamo commented Dec 1, 2023

Because the implementation of distributed training for hetero graphs has changed, the distributed hetero relabel neighborhood function must change as well.

Related pytorch_geometric PR: #8503

Changes made:
- The `num_sampled_neighbors_per_node_dict` dictionary now stores the number of sampled neighbors for each layer separately:

`const c10::Dict<rel_type, std::vector<int64_t>>& num_sampled_neighbors_per_node_dict` -> `const c10::Dict<rel_type, std::vector<std::vector<int64_t>>>& num_sampled_neighbors_per_node_dict`
- The node mapping method has also changed: it is now done layer by layer.
- After each layer, the range of src nodes for each edge type in the next layer is calculated. Edge types that share the same src node type must have the same offsets.
- The src node range for each edge type in a given layer is defined by the dictionary `srcs_slice_dict`. Local src nodes (`sampled_rows`) are created from it, and the start value for the next layer is the end value of the previous layer.
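
The layer-by-layer bookkeeping listed above can be illustrated with a small Python sketch. This is not the pyg-lib C++ code: the helper `compute_srcs_slices`, the tuple-style edge types, and the toy numbers are all hypothetical. It assumes that the src node type is the first element of the edge type and that the nodes newly added for a node type in one layer become the src nodes of the next layer.

```python
from typing import Dict, List, Tuple

EdgeType = Tuple[str, str, str]  # (src node type, relation, dst node type)


def compute_srcs_slices(
    edge_types: List[EdgeType],
    num_seed_nodes: Dict[str, int],
    num_new_nodes_per_layer: List[Dict[str, int]],
) -> List[Dict[EdgeType, Tuple[int, int]]]:
    """Hypothetical helper: per layer, the [start, end) range of local
    src nodes for every edge type."""
    node_types = {t for et in edge_types for t in (et[0], et[2])}
    # Layer 0 starts from the seed nodes of each node type.
    node_slice = {t: (0, num_seed_nodes.get(t, 0)) for t in node_types}
    slices_per_layer = []
    for new_nodes in num_new_nodes_per_layer:
        # Edge types with the same src node type share the same range.
        slices_per_layer.append({et: node_slice[et[0]] for et in edge_types})
        # The next layer starts where the previous one ended.
        node_slice = {
            t: (end, end + new_nodes.get(t, 0))
            for t, (_, end) in node_slice.items()
        }
    return slices_per_layer


cites = ("paper", "cites", "paper")
written_by = ("paper", "written_by", "author")
writes = ("author", "writes", "paper")

slices = compute_srcs_slices(
    edge_types=[cites, written_by, writes],
    num_seed_nodes={"paper": 2},
    # Toy counts of new unique nodes added per node type after each layer.
    num_new_nodes_per_layer=[{"paper": 3, "author": 4}, {"paper": 5, "author": 1}],
)

# Toy per-layer neighbor counts in the new nested layout (one inner list
# per layer, one entry per src node of that layer).
num_sampled_neighbors_per_node = {cites: [[2, 1], [1, 0, 2]]}
for layer, counts in enumerate(num_sampled_neighbors_per_node[cites]):
    start, end = slices[layer][cites]
    assert len(counts) == end - start
```

In this toy run, `cites` and `written_by` both start from `paper` nodes, so they get the same range in every layer, while `writes` follows the `author` range.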

codecov bot commented Dec 1, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a5fcc87) 86.47% compared to head (c65a353) 86.47%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #284   +/-   ##
=======================================
  Coverage   86.47%   86.47%           
=======================================
  Files          35       35           
  Lines        1213     1213           
=======================================
  Hits         1049     1049           
  Misses        164      164           

@rusty1s rusty1s merged commit d2370c2 into pyg-team:master Dec 4, 2023
13 of 14 checks passed
rusty1s added a commit to pyg-team/pytorch_geometric that referenced this pull request Dec 14, 2023
The purpose of this PR is to improve the distributed hetero sampling
algorithm.
**IMPORTANT INFO**: This PR is complementary to
[#284](pyg-team/pyg-lib#284) from pyg-lib. The
pyg-lib PR needs to be merged for this one to work properly.


**Description:** (sorry if too long)
Distributed hetero neighbor sampling is a procedure analogous to homo
sampling, but more complicated due to the presence of different types of
nodes and edges.
Sampling in distributed training imitates the `hetero_neighbor_sample()`
function in pyg-lib. Therefore, the mechanism of action and the
nomenclature of variables are similar.
Because the results must be synchronized between machines after each
layer is sampled in distributed training, the loop iterating over the
layers was implemented in Python.

The two main loops iterate over layers and, within each layer, over edge
types.
Inside the loops, the `sample_one_hop()` function is called, which
performs sampling for one layer.
The input to the `sample_one_hop()` function is data of a single type,
so its execution is almost identical to the homo case.
Depending on whether the input nodes are located on the current
partition or a remote one, `sample_one_hop()` either performs the
sampling itself or sends an RPC request to the remote machine to do so.
The `dist_neighbor_sample()`->`neighbor_sample()` function is used for
sampling. Nodes are sampled with duplicates so that they can later be
used to construct local to global node mappings.
When all machines have finished sampling, their outputs are merged and
synchronized in the same way as for homo.
Then the results return to the `node_sample()` function where they are
written to the output dictionaries and the src nodes for the next layer
are calculated.
After going through all the layers, the global node indices are finally
mapped to the local ones in the `hetero_dist_relabel()` function.
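
The flow described above can be sketched on a single machine as follows.
This is a toy illustration only, not the actual implementation: the
function bodies, the plain-list data layout, and the `relabel()` helper
are hypothetical stand-ins, there is no RPC or cross-machine
synchronization, and the sketch assumes that the first element of an
edge type is the src node type whose neighbors are sampled.

```python
from typing import Dict, List, Tuple

EdgeType = Tuple[str, str, str]  # (src node type, relation, dst node type)
Adj = Dict[int, List[int]]       # global src id -> global neighbor ids


def sample_one_hop(adj: Adj, srcs: List[int], fanout: int):
    """Toy stand-in: take up to `fanout` neighbors per src node and keep
    duplicates; also return the number of neighbors per src node."""
    nbrs, counts = [], []
    for s in srcs:
        picked = adj.get(s, [])[:fanout]
        nbrs.extend(picked)
        counts.append(len(picked))
    return nbrs, counts


def node_sample(graph: Dict[EdgeType, Adj], seeds: Dict[str, List[int]],
                fanouts: List[int]):
    node_types = {t for et in graph for t in (et[0], et[2])}
    out = {t: list(seeds.get(t, [])) for t in node_types}  # unique global ids
    src = {t: list(seeds.get(t, [])) for t in node_types}  # srcs of this layer
    with_dupl = {et: [] for et in graph}       # neighbors incl. duplicates
    nbrs_per_node = {et: [] for et in graph}   # one list of counts per layer
    for fanout in fanouts:                     # loop over layers
        new_nodes = {t: [] for t in node_types}
        for et, adj in graph.items():          # loop over edge types
            nbrs, counts = sample_one_hop(adj, src[et[0]], fanout)
            with_dupl[et].extend(nbrs)
            nbrs_per_node[et].append(counts)
            new_nodes[et[2]].extend(nbrs)
        # A distributed version would merge and synchronize results here.
        for t in node_types:                   # src nodes for the next layer
            seen, fresh = set(out[t]), []
            for n in new_nodes[t]:
                if n not in seen:
                    seen.add(n)
                    fresh.append(n)
            out[t].extend(fresh)
            src[t] = fresh
    return out, with_dupl, nbrs_per_node


def relabel(out, with_dupl):
    """Map global neighbor ids to local positions in `out`."""
    to_local = {t: {g: i for i, g in enumerate(ids)} for t, ids in out.items()}
    return {et: [to_local[et[2]][g] for g in nbrs]
            for et, nbrs in with_dupl.items()}


cites = ("paper", "cites", "paper")
written_by = ("paper", "written_by", "author")
graph = {
    cites: {10: [11, 12], 11: [10]},
    written_by: {10: [100], 11: [100], 12: [101]},
}
out, with_dupl, nbrs_per_node = node_sample(graph, {"paper": [10]}, [2, 2])
relabeled = relabel(out, with_dupl)
# relabeled[cites] == [1, 2, 0]; relabeled[written_by] == [0, 0, 1]
```

Keeping the duplicates in `with_dupl` is what lets the final step emit
one local index per sampled edge, while `out` holds each node only once.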

Information about some of the variables used in the `node_sample()`
function:
`node_dict` - class storing information about nodes. It has three
fields: `src`, `with_dupl`, `out`, which are described in more detail in
the distributed/utils.py file.
`batch_dict` - class used when sampling with the disjoint option. It
stores information about the affiliation of nodes to subgraphs. Just
like `node_dict`, it has three fields: `src`, `with_dupl`, `out`.
`sampled_nbrs_per_node_dict` - a dictionary that stores the number of
neighbors sampled by each src node. To facilitate subsequent
operations, the data for each edge type is additionally divided into
layers.
`num_sampled_nodes_dict`, `num_sampled_edges_dict` - needed for HGAM to
work.
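
As a rough picture of these containers, the sketch below uses plain
Python lists. The class name `NodeDictSketch`, the helper
`add_layer_counts()`, and the field comments are assumptions made for
illustration; the authoritative definitions are the ones in
distributed/utils.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

EdgeType = Tuple[str, str, str]


@dataclass
class NodeDictSketch:
    """Hypothetical stand-in with the three fields named in the text."""
    # Assumed meaning: nodes used as src nodes, keyed by node type.
    src: Dict[str, List[int]] = field(default_factory=dict)
    # Assumed meaning: sampled nodes, duplicates kept.
    with_dupl: Dict[str, List[int]] = field(default_factory=dict)
    # Assumed meaning: output nodes, keyed by node type.
    out: Dict[str, List[int]] = field(default_factory=dict)


def add_layer_counts(
    sampled_nbrs_per_node_dict: Dict[EdgeType, List[List[int]]],
    edge_type: EdgeType,
    counts: List[int],
) -> None:
    """Append the per-src-node neighbor counts of one layer."""
    sampled_nbrs_per_node_dict.setdefault(edge_type, []).append(list(counts))


node_dict = NodeDictSketch()
node_dict.src["paper"] = [10]
node_dict.out["paper"] = [10]

sampled_nbrs_per_node_dict: Dict[EdgeType, List[List[int]]] = {}
cites = ("paper", "cites", "paper")
add_layer_counts(sampled_nbrs_per_node_dict, cites, [2])     # layer 0
add_layer_counts(sampled_nbrs_per_node_dict, cites, [1, 0])  # layer 1
```

Storing one inner list per layer means the counts for a given layer can
be read back directly, without recomputing offsets into a flat list.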

---------

Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>