[BUG]: Leiden clustering numbering is off #4791
Comments
I believe this is a label numbering problem and not a clustering problem. Leiden is a hierarchical clustering algorithm. At each level we combine clusters, and each cluster is numbered by one of the vertices in it. That assignment is arbitrary (it depends on when the algorithm decides to move things). For example, if vertex 10 is being evaluated and it is determined that it should merge into cluster 5, it is assigned to cluster 5, and cluster 10 is then empty. The vertices/clusters are renumbered when we move to a new level of the hierarchy, but I don't believe we renumber them at any other point. |
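The effect described above can be sketched with a toy model. This is only an illustration in plain Python, not cuGraph's implementation; the `move_vertex` helper and the specific moves are hypothetical.

```python
# Toy model (hypothetical, not cuGraph code): every vertex starts in its
# own cluster, and the cluster ID is the ID of a vertex inside it.
def move_vertex(labels, vertex, target_cluster):
    """Reassign one vertex; the cluster it leaves may end up empty."""
    labels[vertex] = target_cluster


labels = list(range(12))   # vertex i starts in cluster i
move_vertex(labels, 10, 5)  # cluster 10 is now empty
move_vertex(labels, 3, 7)   # cluster 3 is now empty
move_vertex(labels, 0, 1)   # cluster 0 is now empty

# Cluster IDs that still contain at least one vertex.
non_empty = sorted(set(labels))
print(labels)
print(non_empty)
```

No vertex is lost or misassigned here; the only visible consequence is that IDs 0, 3, and 10 no longer appear among the cluster labels.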
Thanks @ChuckHastings for clearing this up. I wonder, though: I didn't see this in the previous version of RAPIDS (24.10), so it may have something to do with the recent update. |
The PR you referenced that fixed Leiden corrected a bug where the convergence criterion was wrong and was causing the algorithm to abort early. It is likely that this bug was masking this effect. Do you have a small example you can share where this occurs? I can try to recreate it to get a better understanding of what you're seeing. |
Hey Chuck, here's the file for the adjacency matrix: https://cedars.box.com/s/4mg82y2u0m77pi8c4i9izt3yzx52xq1c. I use this function (from rapids-singlecell) to build a weighted graph:

    import warnings

    import cudf
    import numpy as np


    def _create_graph(adjacency, use_weights=True):
        from cugraph import Graph

        # Collect the nonzero entries of the adjacency matrix as an edge list.
        sources, targets = adjacency.nonzero()
        weights = adjacency[sources, targets]
        # Sparse-matrix indexing can return np.matrix; flatten it to 1-D.
        if isinstance(weights, np.matrix):
            weights = weights.A1
        df = cudf.DataFrame({"source": sources, "destination": targets, "weights": weights})
        g = Graph()
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            if use_weights:
                g.from_cudf_edgelist(
                    df, source="source", destination="destination", weight="weights"
                )
            else:
                g.from_cudf_edgelist(df, source="source", destination="destination")
        return g

and then run Leiden using the following:

    from cugraph import leiden as culeiden

    # g is the graph returned by _create_graph above.
    leiden_parts, _ = culeiden(
        g,
        resolution=1,
        random_state=0,
        max_iter=100,
    )

which generates the following output:

Thanks for all your help. |
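One way to confirm that a result only has gaps in its numbering is to list the IDs that never occur. The helper below is a minimal sketch in plain Python; `find_skipped_labels` is a hypothetical name, and the sample labels are made up rather than taken from the output in this issue.

```python
def find_skipped_labels(labels):
    """Return the IDs in 0..max(labels) that no vertex is assigned to."""
    if not labels:
        return []
    present = set(labels)
    return [c for c in range(max(present) + 1) if c not in present]


# Made-up partition labels, one entry per vertex.
example = [1, 1, 2, 7, 4, 5, 6, 7, 8, 9, 5, 11]
print(find_skipped_labels(example))  # prints [0, 3, 10]
```

To apply this to the result above, the partition labels would first need to be copied to a host-side Python list, for example from the `partition` column of `leiden_parts`, assuming the result frame is laid out as described in the cuGraph documentation.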
update: @jnke2016 is also taking a look |
I am looking at the issue, which I was able to reproduce on a smaller dataset with 2676 vertices and 20480 edges.
|
@jnke2016 Do you still think it's just a label numbering issue? We are using this algorithm actively and would like to know if this is a bug that can affect downstream analysis. Thanks and Happy New Year! |
@abs51295 thanks and Happy New Year to you too.
Yes, it is indeed a numbering issue. In fact, we already have an internal utility function that relabels the cluster IDs, but it is unused (perhaps for performance reasons; @ChuckHastings or @naimnv can provide more details here). I tested it locally and it resolved the issue on a single GPU. However, I am still debugging the multi-GPU case. |
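Until relabeling happens inside the library, the IDs can be compacted after the fact. The following is a minimal sketch in plain Python, not the internal utility mentioned above; `relabel_contiguous` is a hypothetical helper, and it assumes the labels have been copied to a host-side list.

```python
def relabel_contiguous(labels):
    """Map cluster IDs onto 0..k-1, preserving their relative order."""
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[c] for c in labels], mapping


# Made-up labels with gaps at 0, 3, and 10.
labels = [1, 1, 2, 7, 4, 5, 6, 7, 8, 9, 5, 11]
compact, mapping = relabel_contiguous(labels)
print(compact)  # prints [0, 0, 1, 5, 2, 3, 4, 5, 6, 7, 3, 8]
```

Because the mapping is one-to-one, vertices that shared a cluster before still share one afterwards, so downstream analysis that depends only on cluster membership is unaffected.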
Thanks @jnke2016 for the confirmation. I know this is a different request, but can you give an example (possibly based on my code above) of how you run Leiden on multiple GPUs (ideally more than 2)? |
Yes, it looks like a renumbering issue. |
@abs51295 Sure, If you are looking to run
|
Version
24.12
Which installation method(s) does this occur on?
No response
Describe the bug.
Hey,
We ran Leiden clustering on our dataset after the recent fix #4730 and found that it skips some cluster numbers, seemingly at random. I wonder if it's just a labeling issue and not a problem with the algorithm itself.
Minimum reproducible example
Relevant log output
Environment details
Other/Misc.
No response