Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

K-truss benchmarking #4593

Open
ChuckHastings opened this issue Aug 6, 2024 · 6 comments
Open

K-truss benchmarking #4593

ChuckHastings opened this issue Aug 6, 2024 · 6 comments
Assignees

Comments

@ChuckHastings
Copy link
Collaborator

ChuckHastings commented Aug 6, 2024

Benchmark k-truss and compare to other existing implementations.

@jnke2016
Copy link
Contributor

jnke2016 commented Feb 4, 2025

In 24.06, we updated our K-Truss implementation by moving from cuHornet to an implementation matching our primitive and the performance improvement was several order of magnitude faster.

As a summary, our earlier implementation run several iteration of our primitive per_v_pair_dst_nbr_intersection to compute the number of triangles incident to every edge in the graph. This can be inefficient if there is only a few subset of the edges to invalidate which will still require to run per_v_pair_dst_nbr_intersection on the entire edgelist and that primitive is not well optimized for K-Truss.
In our new implementation, we do not compute the number of triangles incident to every edge but instead identify weak edges to be deleted and remove their contribution on the affected triangles. This optimization improves our performance for different graph size and k values as it can be seeing in the figure below. These benchmark numbers were obtained by running cuGraph 25.02 on an H100 machine.

Image Image Image Image Image Image Image

On the memory side we are currently limited by the number of edge we can process in nbr_intersection on large graph though we have an internal flag that enables us to perform this operation in chunk. By reducing the intersection size, we can process 2 to 4x larger graph. A study of the algorithm unveiled that there is no need to store the result of nbr_intersection and instead directly update the edge property value (in this case triangle count). This will both significantly reduce our peak memory usage and also improve our runtime (since we will no longer need to read the intersection result stored in CSR format to retrieve and update the edge property)

@ChuckHastings
Copy link
Collaborator Author

This is a great start. But I think we should also show comparison to other implementations (I saw nx, hornet, arkuda in a quick internet search, but any we can find).

@jeaton32
Copy link
Collaborator

jeaton32 commented Feb 4, 2025

These are all single-gpu so far? Which GPU is being used?

@jnke2016
Copy link
Contributor

jnke2016 commented Feb 4, 2025

This is a great start. But I think we should also show comparison to other implementations (I saw nx, hornet, arkuda in a quick internet search, but any we can find).

Correct. This issue will be updated incrementally. In the first part of this k-truss benchmarking issue, I am portraying how the new optimization improved our runtime from our previous one which was already faster than the cuHornet based implementation.

The two fastest implementation in the literature on single GPU are KTrussExplorer and a linear algebraic based implementation which we are targeting to outperform.

@jnke2016
Copy link
Contributor

jnke2016 commented Feb 4, 2025

These are all single-gpu so far

Yes I do also have single node multi GPU numbers which will be added shortly.

Which GPU is being used

NVIDIA H100 80GB HBM3

@jnke2016
Copy link
Contributor

jnke2016 commented Feb 10, 2025

Runtime breakdown on a single GPU for a graph generated at scale 22 and edge factor 16 for different values of k.

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants