Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ligra: Add comments and unit test regarding bug with vertex intersection #21

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

tomtseng
Copy link
Member

@tomtseng tomtseng commented Mar 5, 2020

[a]symmetric_vertex::intersect assumes that the neighbor lists of each vertex are in sorted order. However, this is not necessarily the case. In particular,

  • The AdjacencyGraph format that we read graphs in does not specify that neighbor lists are in sorted order. Therefore, a user could reasonably write a file in which they are not. Then when we read in the graph, the neighbor lists will remain in unsorted order.
  • sym_graph_from_edges in ligra/graph.h does not sort neighbor lists.
    • this is how this issue came to my attention -- I was failing a unit test using a graph created by graph_test::MakeUnweightedSymmetricGraph, which calls sym_graph_from_edges.

This PR is a temporary patch to graph_test::MakeUnweightedSymmetricGraph to make unit test pass. It also adds to comments to symmetric_vertex::intersect documenting this assumption.

However, this issue should really be addressed in a more robust way in some future PR. This is a sneaky assumption in symmetric_vertex::intersect that doesn't cause any seg faults but instead will lead to slightly wrong answers, and a user has no convenient way to validate the assumption that a graph's neighbor lists are all in sorted order. This affects:

  • KTruss
  • TriangleCounting
    • and by extension, CliqueCounting
  • SCAN clustering (my work-in-progress PR)

Some possible solutions:

  • Maintain that all graphs should have sorted neighbor lists
    • Sort neighbor lists in the graph constructor (might be expensive, O(m) time with integer sort)
    • Or sort neighbor lists in graph_io.h when we read in graphs or assert that they're sorted. Add a function comment to [a]symmetric_graph's constructor saying that neighbor lists must be sorted. (this feels error-prone)
  • Add a function for sorting a graph's neighbor lists and let the programmer beware of incorrect intersect calls (also error-prone)
  • Rewrite symmetric_vertex::intersect to not require sorted neighbor lists (expensive, I'd guess this will slow down TriangleCounting by a lot)

Feedback on what solution we should move forward with would be appreciated.

@tomtseng tomtseng requested review from ldhulipala and jeshi96 March 5, 2020 23:38
@ldhulipala
Copy link
Member

This is a good point.

We can add an extra command-line flag that sorts the neighbor lists if the input CSR graph is not sorted. I think the default should be to assume that the neighbor lists are sorted, and to assume that an input graph to the graph constructor has all of the incident adjacency lists sorted. We can have an alternate constructor that sorts.

I think this is the right approach because others who are benchmarking/using our library will probably only use the defaults, and so it's best to make sure the defaults are the fastest/best optimized codepaths. Ideally we would have some assertions that would fail if non-sorted lists are passed into the intersection primitives, which would make this approach both fast and safe.

On a similar note, I think we should make the default graph ingestion method to use binary CSR, read through mmap, and make other graph ingestion methods use explicit flags. Right now the default is to assume a text-based format (which is slow) and to read the file into a buffer instead of mmap'ing, which is slow and wastes memory.

Maybe there is a middle-ground, which is to check whether the neighbor lists are sorted which is dirt-cheap, and sort them if necessary.

WDYT?

@tomtseng
Copy link
Member Author

tomtseng commented Mar 8, 2020

I think the flag would be helpful, though I'm worried that user errors could still happen from someone not paying attention to the flags.

That brings up another question for me though: what is the common case? as in, how are these AdjacencyGraph files usually generated? What I've been doing is downloading SNAP graphs and then using utils/SNAPtoAdj in jshun/ligra. Running ./SNAPtoAdj -s, where the -s flag symmetrizes the graph, generates graphs with sorted neighbor lists, but it seems like this is by coincidence -- when symmetrizing the graph, SNAPtoAdj sorts the edges to remove duplicates. It doesn't look like running SNAPtoAdj without -s gives sorted neighbor lists.

One way we could assert for sortedness in intersect would be to add a boolean field are_neighbors_sorted to the graph, which someone calling the graph constructor can set to be true if they promise that the neighbor lists are sorted. Then the intersect functions can assert on this boolean.

I also do agree that it would probably be nice to have a more compact input format that we can read more quickly. That could also help reduce the chance of hitting this bug if we always save that input format with neighbor lists in sorted order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants