Several years ago Michael Isard, Derek Murray, and I wrote a paper: "Scalability! But at what COST?". The premise of the paper was that the then-current excitement about distributed computation (e.g. Hadoop, Spark) produced implementations that improve when you give them more resources (they "scale") but whose performance never quite gets to where you would be with a simple single-threaded implementation on a laptop.
If that sounds surprising, ... well hello there! Maybe go check out the paper, the matching blog post, a follow-up post about even larger datasets, and another followup post about another paper from the same community a year later.
One conclusion was that the Systems community (which I think of as SOSP/OSDI/+) was overexcited about big data processing, but not especially good at it yet. Specifically, graph processing, which is one place where interesting algorithms start to make a difference (no more word count). One could imagine the folks who study large-scale data processing professionally, over in the databases community, snickering just a bit.
This post is for you, databases community. <3
We are going to look at a few papers from the most recent iterations of VLDB and SIGMOD, which I think of as the premier venues in the database community. I have a few other papers I'd like to use, but those papers don't have quantitative data (they have figures, but no numbers) and the authors haven't responded to inquiries for a week now (clever).
As with the COST paper, we are going to try and understand to what degree the empirical results presented in these papers demonstrate that the implemented systems improve over a single-threaded implementation. That is a pretty low bar, and should be something all reviewers require as a minimal test of non-triviality. Some of the systems will be better than a single-threaded implementation, on some problems, at some scales, but all of the papers have picked comically poor baselines that are nowhere near single-threaded laptop performance.
As a caveat: My code might have bugs, I might be wrong about various things, et cetera. However, I'm doing all of these measurements because the papers declined to do so. None of them present any evidence that they are any better than a single-threaded implementation on any problem at any scale. So, if nothing else the published papers are not yet right. But I suspect they are mostly wrong.
To let you hop around, here are links to the three sections, each covering a different paper:
- Scalable Distributed Subgraph Enumeration is a VLDB 2017 paper about finding undirected pattern graphs in larger undirected data graphs.
- All-in-One: Graph Processing in RDBMSs Revisited is a SIGMOD 2017 paper about using SQL systems to do graph processing.
- Parallelizing Sequential Graph Computation is a SIGMOD 2017 best paper about iterating incremental graph algorithms.
The conclusions are varying levels of violent depending on the paper. The laptop makes a pretty solid showing; databases reviewers come out rather badly.
Our first paper is Scalable Distributed Subgraph Enumeration or "SEED". We will have a future post about rules for picking acronyms.
2018-08-27 Update: I have added measurements for naive distribution using timely dataflow on a cluster, to test the hypothesis that these computations could be distributed well naively. These are the "cluster" lines in each of the tables. The code can be found in this repository. Perhaps importantly, the graph data are pre-distributed to each of the machines.
Subgraph enumeration is a problem in which you get presented with a large undirected "data" graph, and a smaller undirected "pattern" graph. Your goal is to find all instances of the pattern graph in the data graph, where an "instance" is a mapping from pattern vertices to data vertices, for which all pattern edges exist between their corresponding data vertices.
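If you like to see definitions as code, here is a minimal sketch (mine, not the paper's) of what "instance" means; `has_edge` is a hypothetical adjacency test on the undirected data graph.

```rust
// A mapping from pattern vertices to data vertices is an instance exactly when
// every pattern edge maps to an edge of the data graph. `has_edge` is assumed
// to be an adjacency test on the (undirected) data graph.
fn is_instance(
    pattern_edges: &[(usize, usize)],
    mapping: &[u32],
    has_edge: impl Fn(u32, u32) -> bool,
) -> bool {
    pattern_edges.iter().all(|&(a, b)| has_edge(mapping[a], mapping[b]))
}
```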
This particular paper is interested in the following seven pattern graphs:
Notice that there are some constraints written next to the graphs. The `q1` graph, a length-four cycle, has several constraints saying that `v1` needs to be the smallest identifier, and that furthermore `v2` should be less than `v4`. These constraints exist to rule out symmetries: each undirected four-cycle `(v1, v2, v3, v4)` actually corresponds to eight four-cycles, because we could start at any of the four vertices, and go either forwards or backwards through the nodes. For each pattern graph, the constraints ensure that each occurrence of the pattern is discovered only once, rather than once per symmetry, through a canonical ordering on vertex identifiers.
With these queries in mind, we are going to review several measurements from the linked paper, for the systems `PSgL`, `TT`, and `SEED`, the latter being the contribution of the paper. The first two techniques are a Pregel-based approach (`PSgL`) and the authors' prior work TwinTwigJoin (`TT`), the latter of which is introduced as "the state-of-the-art algorithm". The `SEED` algorithm improves on `TT`, and the authors observe:
The results demonstrate that our algorithm outperforms all other state-of-the-art algorithms by more than one order of magnitude.
Now, we are going to introduce some new candidates for "state-of-the-art" algorithms, by writing some for-loops and running them on my laptop. For a few queries (`q1`, `q2`, `q3`, and `q7`) we will have an apples-to-apples comparison; for these I enumerate all solutions and do at least one CPU operation for each output (testing whether equal to `u32::max_value()` before incrementing the count). The queries `q2`, `q5`, and `q6` spend a non-trivial amount of time enumerating, rather than computing, solutions; for these queries we have a `*` variant (which I'll explain in turn) that measures the amount of time to produce a constant-time enumerator over the solutions. The query `q4` looked hard and dull, so I ignored it.
All of our candidate computations parallelize trivially: you partition responsibility for the first vertex `v1` using `v1 % workers`, and each worker independently enumerates solutions in their purview. My laptop has two cores and two hyperthreads, so we will measure 1, 2, and 4 workers. The measurements of prior work are from a cluster with 160 cores, and so we also report on a hypothetical 160-core laptop, where we partition the work 160 ways, and report the maximum time taken across the parts when executed sequentially; this is a totally fictitious measurement, but it represents what one should be able to achieve, rather than improvements over Hadoop/Pregel baselines.
There are two datasets for which the paper reports performance on all queries, a YouTube graph and the standard LiveJournal graph. Remember, the `*` query variants are relaxed versions that just prepare enumerators, and `Laptop*` is the maximum of 160 elapsed times rather than the actual elapsed time of a full run itself.
YouTube | cores | q1 | q2 | q2* | q3 | q4 | q5 | q5* | q6 | q6* | q7 |
---|---|---|---|---|---|---|---|---|---|---|---|
PSgL | 160 | DNF | DNF | - | 279s | DNF | DNF | - | 850s | - | 493s |
TT | 160 | 134s | 612s | - | 63s | 3282s | DNF | - | 229s | - | 129s |
SEED | 160 | 134s | 29s | - | 28s | 780s | 306s | - | 66s | - | 29s |
Laptop | 1 | 320s | 11s | 8.91s | 2.19s | - | - | 4.52s | 106s | 17s | 3.75s |
Laptop | 2 | 177s | 6.01s | 4.49s | 1.11s | - | - | 2.25s | 63s | 8.94s | 1.93s |
Laptop | 2+2 | 123s | 4.02s | 2.91s | 0.76s | - | - | 1.50s | 43s | 6.00s | 1.37s |
Laptop* | 160 | 2.31s | 249ms | 70ms | 31ms | - | - | 67ms | 5.80s | 187ms | 42ms |
Cluster | 160 | 3.16s | 717ms | 281ms | 252ms | - | - | 331ms | 6.36s | 353ms | 241ms |
Scanning through these numbers, independent of what you think about SEED, I hope we can agree that `TT` was never in the running as "state of the art". The stand-out numbers are for queries `q1` and `q6`, where `SEED` is faster than a single thread (though slower than the laptop with four threads), and queries `q4` and `q5`, where I gave up. The `*` measurements are often surprisingly small, and while I'll say more about these later, this implies that it may not be hard to compute the results, but actually enumerating all of them might be a good reason to have many computers.
Let's look at the LiveJournal numbers. I added a row for EmptyHeaded from SIGMOD 2016 (numbers courtesy Chris Aberger), to give a bit more context about what performance you might hope for. We even got the same answers (eventually; I had a bug).
LiveJournal | cores | q1 | q2 | q2* | q3 | q4 | q5 | q5* | q6 | q6* | q7 |
---|---|---|---|---|---|---|---|---|---|---|---|
PSgL | 160 | DNF | DNF | - | 5071s | DNF | DNF | - | DNF | - | DNF |
TT | 160 | 220s | 5206s | - | 1281s | DNF | DNF | - | 6968s | - | DNF |
SEED | 160 | 220s | 107s | - | 60s | 1686s | 5814s | - | 1013s | - | 1206s |
EmptyHeaded | 14 | - | - | - | 10s | - | - | - | - | - | 396s |
Laptop | 1 | 821s | 68s | 38s | 55s | - | - | 35s | - | 369s | 1763s |
Laptop | 2 | 455s | 37s | 20s | 30s | - | - | 17s | - | 193s | 985s |
Laptop | 2+2 | 321s | 30s | 13s | 22s | - | - | 11s | - | 138s | 761s |
Laptop* | 160 | 6.98s | 740ms | 314ms | 472ms | - | - | 258ms | - | 3.13s | 17s |
Cluster | 160 | 8.69s | 2.15s | 933ms | 1.13s | - | - | 1.03s | - | 4.35s | 21s |
These numbers are more interesting. First, for `q1` the laptop is now never faster than the cluster. However, `SEED` is barely faster, and still much slower than a naive 160-way partitioning could be, in principle. Second, I didn't complete any `q6` measurements here, partly because the number of matches is roughly 18 trillion (it would take ~6,000 seconds for one 3GHz core just to perform 18 trillion increment ops); I did count the matches exactly though, in `q6*`. Finally, for `q7` we now have `SEED` in the running (briefly) against the laptop; there are many more 5-cliques in LiveJournal (roughly half a trillion) than in YouTube (roughly 70 million), and it makes sense that more machines might help. Other numbers are similar to the above, with the laptop often faster than `SEED`.
Now, the authors are apparently aware of centralized (i.e. non-Hadoop) solutions, like the ones we implement, but worry that
[...] existing sequential algorithms for subgraph enumeration [8, 16] are not scalable to handle big graphs.
The paper also reports timing measurements on the billion-plus-edge Friendster graph, as evidence that they handle larger graphs. As it turns out, the laptop handles large data too (though it is a near thing, in this case). Unfortunately, the paper only presents measurements for two queries: `q2` and `q7`.
Friendster | cores | q2 | q7 |
---|---|---|---|
TT | 160 | DNF | 9425s |
SEED | 160 | 1144s | 555s |
Laptop | 1 | 4603s | 2532s |
Laptop | 2 | 2385s | 1268s |
Laptop | 2+2 | 1587s | 820s |
Laptop* | 160 | - | - |
Cluster | 160 | 82.5s | 39.5s |
These numbers are for sure more interesting. `SEED` does better than any number I can actually achieve on my laptop. That demonstrates non-triviality, in that their system isn't just a laptop computing the answer and 156 cores mining bitcoin. I would have loved to show you the 160-core measurement, but .. the graph is large enough and the 160-way work partition scattered enough that OS X decides it should page out the 15GB of data. Triangle computation saw a 5x slowdown in "160 core" mode versus one core (reporting about 1/33rd the single-threaded time), and I didn't want to wait the hypothetical ten hours it might have taken to complete `q2` and `q7` in that mode, only to get a number that is mostly measuring random reads to my SSD.
For the actual measurements, we didn't exactly see the claimed scaling failure, and it is a pity that the VLDB reviewers just take text like this at face value. In fact, it looks like these implementations are scaling just fine with naive parallelization. If I had one of those expensive MacBooks with four real cores and four hyperthreads, I suspect we would see faster times on the laptop than on `SEED`. Although, at that point the 160-core cluster might be cheaper.
To start things out, let's talk about how we represent our data. We are going to use an adjacency-list representation. For each vertex, we will have a sorted list of the other vertices to which it is connected. For reasons, we will also track for each vertex the number of neighbors whose identifier is less than that of the vertex, so that we can quickly get both (i) the list of all neighbors, and (ii) the list of neighbors with greater identifier (this helps us restrict our attention to neighbors satisfying the `<` constraints).
As is apparently standard, we re-label the vertices so that the undirected degree of the vertices is non-decreasing as identifiers increase. The high-degree folks will have the highest identifiers. There is a good reason for this, but it is mainly that it makes things go faster and everyone else does it so we will too. What this really means is that when we break symmetries, we break them using degree before identifier.
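So that the code below has something concrete to point at, here is a hypothetical, purely in-memory stand-in for `GraphMap` with the interface we will rely on; the real thing is memory-mapped from the on-disk representation, but the shape is the same: concatenated sorted neighbor lists, plus per-vertex offsets and the count of smaller-identifier neighbors.

```rust
// Hypothetical in-memory stand-in for the memory-mapped `GraphMap`.
struct GraphMap {
    offsets: Vec<usize>, // offsets[v] .. offsets[v + 1] delimits v's neighbor list
    smaller: Vec<usize>, // number of neighbors of v with identifier less than v
    targets: Vec<u32>,   // concatenated neighbor lists, each sorted by identifier
}

impl GraphMap {
    fn nodes(&self) -> u32 {
        (self.offsets.len() - 1) as u32
    }
    /// All neighbors of `v`, in sorted order.
    fn edges(&self, v: u32) -> &[u32] {
        &self.targets[self.offsets[v as usize]..self.offsets[v as usize + 1]]
    }
    /// Neighbors of `v` with identifier greater than `v`.
    fn forward(&self, v: u32) -> &[u32] {
        &self.edges(v)[self.smaller[v as usize]..]
    }
}
```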
Now, let's start with a warm up: computing triangles. I've written a `q0` query that returns the number of undirected triangles using some methods you don't know about yet. It takes as arguments a reference to our `GraphMap` memory-mapped adjacency list, as well as `index` and `peers` numbers that we will find useful for parallelizing the computation (think of them as `0` and `1` for the single-threaded case).
The computation walks through each graph node (`v1`), starting from `index` in steps of size `peers`, and grabs the set of forward neighbors of the node (`v1f`), those neighbors of `v1` with greater identifier. For each such neighbor (`v2`), the computation grabs the set of its forward neighbors and intersects them with the neighbors of `v1` greater than `v2` (which happens to be `&v1f[(index_v2+1)..]`). We count up the number of elements `v3` in the intersection.
```rust
fn q0(graph: &GraphMap, index: u32, peers: u32) -> usize {
    let mut count = 0;
    let mut v1 = index;
    while v1 < graph.nodes() {
        // neighbors of v1 with greater identifier, in sorted order.
        let v1f = graph.forward(v1);
        for (index_v2, &v2) in v1f.iter().enumerate() {
            // v3 must be a forward neighbor of v2, and a forward neighbor of v1 beyond v2.
            intersect_and(&v1f[(index_v2 + 1)..], graph.forward(v2), |_v3| count += 1);
        }
        // worker `index` owns every `peers`-th vertex.
        v1 += peers;
    }
    count
}
```
This is triangle counting!
The only thing I haven't shown you is `intersect_and`, which takes two sorted slices and calls the supplied closure for each element present in both slices. The only cleverness it has is that it iterates through the shorter of the two slices, using exponential search in the other slice, taking time linear in the smaller length and at most logarithmic (per element) in the larger length.
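If you want to see its shape, here is a sketch matching that description (a galloping intersection); it is in the same spirit as, but not necessarily identical to, the code behind the measurements.

```rust
// Calls `action` once for each element present in both sorted slices,
// iterating the shorter slice and galloping (exponential then binary
// search) through the longer one.
fn intersect_and(a: &[u32], b: &[u32], mut action: impl FnMut(u32)) {
    // Always iterate the shorter of the two slices.
    if a.len() > b.len() {
        return intersect_and(b, a, action);
    }
    let mut rest = b;
    for &value in a {
        // Exponential search: double `bound` until rest[bound] exceeds `value`.
        let mut bound = 1;
        while bound < rest.len() && rest[bound] <= value {
            bound *= 2;
        }
        let window = &rest[..std::cmp::min(bound + 1, rest.len())];
        match window.binary_search(&value) {
            Ok(pos) => { action(value); rest = &rest[pos + 1..]; }
            Err(pos) => { rest = &rest[pos..]; }
        }
    }
}
```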
This is a fairly standard approach to finding patterns in larger graphs: traverse, in a depth-first manner, possible bindings to each of the variables, discovering possible next bindings by intersecting the adjacency lists that constrain the next variable.
Does this go fast? Yes.
What about memory overhead, as warned about by the SEED
authors? Other than the GraphMap
itself, which is mmap
applied to the on-disk representation, .. there isn't any. There is .. the stack? The program binary itself takes some memory? There is no dynamically allocated memory in this program, and none of the depth-first search approaches that I know of require any; they can just use cursors on their stack. We will use some memory later on, because it makes writing the programs simpler, but we will only need enough to stash the adjacency list of a vertex or two.
Let's now talk through each of the queries, so that I can explain what got done. In each case, we are enumerating the values for some variable, then traversing them and enumerating the values for the next variable, repeating for all variables. In many cases we can be smart about how we do this. Generally, we try to enumerate the most constrained variables first (most incident edges, inequality constraints) to cut down on the candidates explored.
- Query 1 is a four-cycle. There is relatively little structure here, which means we just explore lots of candidates. We start with `v1`, and from its forward edges consider all possible pairs of `v2` and `v4`, where `v2 < v4`. For each pair, we intersect their neighborhoods to enumerate the possible values for `v3`.

- Query 2 is a two-truss. That might be a new term for you, but it means "edge (`(v1, v3)`) with two incident triangles". We can enumerate these by considering all `v1` and `v3` edges, with `v1 < v3`, and then enumerating all `v2` in the intersection of their neighborhoods, much like undirected triangles (though `v2` need not be greater than `v1` and `v3`). Having done this, we can enumerate all `v2`, but also all `v4`, which have the same constraints and only need to be greater than `v2`.

  This is our first example where the computation (intersecting neighborhoods) can be more efficient than enumeration (enumerating all ordered pairs from the intersection). The intersected neighborhoods completely describe the set of `v2` and `v4`, and it is a trivial nested loop to enumerate them. Each ordered pair is a valid output, and any correct algorithm must pay this cost. In the case of LiveJournal, almost half the time is spent enumerating, rather than computing.

- Query 3 is a four-clique. We enumerate `v1`, `v2`, and `v3` just as we did for undirected triangles above. To enumerate `v4` we would intersect the neighborhoods of `v1`, `v2`, and `v3`, but we already have the intersection of the neighborhoods of `v1` and `v2`, which we used to enumerate `v3`. Rather, we just intersect the forward neighbors of each `v3` with the list of `v3` themselves (a sketch appears a little further below).

- Query 4 is a house (a four-cycle with a triangle on top). This just seemed annoying and expensive. It is like a less constrained version of `q6`, both with respect to edges required and symmetries. As we will see, `q6` is already annoying (and not for good reasons, in my opinion).

- Query 5 is a fan. This is a problematic query. As it turns out, even `SEED` doesn't enumerate all solutions here (I believe). One of their graphs contains a `943`-clique, which induces somewhere north of 300 quadrillion (3x10^17) instances of `q5`. Instead, they produce what they describe as a compressed output, roughly the clique plus assembly instructions. This is why I don't feel too bad about the `*` queries.

  Our approach to `q5` is to observe that it describes exactly a length-four path along edges that complete triangles with `v1`. Our `q5*` measurement produces for each `v1` the set of edges that complete triangles, with "instructions" that you should enumerate all length-four paths (subject to `v3 < v5`). It is relatively fast to compute these sets of edges (as measured), and conceptually simple to enumerate all length-four paths. As with `q2`, every length-four path is a valid output, and any correct algorithm needs to do work proportional to this number.

  Technically, this approach requires storage bounded by the square of the maximum degree, though in practice these sets seemed to max out at half a million edges (in LiveJournal). If nothing else, the edges are clearly a subset of the actual edges, so one bit for each edge would suffice.

- Query 6 is also a house with a roof, but this house is a clique. Another way to view `q6` is as a three-truss (remember them?), an edge (`(v2, v5)`) with three incident triangles (described by `v1`, `v3`, and `v4`), with the additional constraint that there is a forward edge from `v3` to `v4`. As with the two-truss query `q2` we compute and stash all triangle-completing vertices for the edge `(v2, v5)`. We then enumerate `v3` from the stashed list, enumerate `v4` by intersecting forward edges of `v3` with the stashed list, and then for each match we play out all `v1` from the stashed list.

  This query is a good example where it is much more expensive to actually enumerate the solutions than to describe them (by pointing at the stashed list, rather than walking through all `v1` values). Our `q6*` measurement produces all `(v2, v3, v4, v5)` matches and then points at the unchanging stashed list for the `v1`. As with `q2` and `q5`, each of the enumerated results is valid, and any correct algorithm must pay the difference between `q6` and `q6*`. We could eventually enumerate all matches, and using more machines would make it go faster for sure, but the meaningful computation happened long ago.

  We can count the number of solutions by replacing the enumeration of `v1` with the addition of `stashed.len() - 2` to the count of all matches (we need to subtract two because `v1` must be distinct from `v3` and `v4`, which are by construction in the stashed list). In the case of LiveJournal there are over 18 trillion matches (specifically: 18,841,835,780,333), which is much easier to count up to in steps of `stashed.len() - 2` than in steps of one.

- Query 7 is a five-clique. We do the same thing we did for triangles and four-cliques, which is to enumerate all `(v1, v2, v3, v4)` and then enumerate `v5` by intersecting the forward neighbors of `v4` with the list we formed to enumerate the `v4`s themselves. You could see how clique enumeration might generalize.
That's really all. I tried to go with fairly direct implementations that took advantage of obvious structure, but none of which were at all specialized to the datasets themselves. The Friendster dataset is at the limit of what my laptop can process (it's about 15GB on disk, in the representation I used); getting any larger would require more memory.
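As promised up in the `q3` item, here is what the four-clique count can look like in the same style as `q0`, using the hypothetical `GraphMap` and `intersect_and` sketches from above; the stashed intersection doubles as the candidate list for both `v3` and `v4`. This is a sketch of the approach described, not necessarily the code behind the numbers.

```rust
// Count four-cliques (q3): enumerate (v1, v2, v3) as for triangles, then
// enumerate v4 by intersecting the forward neighbors of v3 with the stashed
// list of v3 candidates (the common forward neighbors of v1 and v2).
fn q3(graph: &GraphMap, index: u32, peers: u32) -> usize {
    let mut count = 0;
    let mut stash = Vec::new();
    let mut v1 = index;
    while v1 < graph.nodes() {
        let v1f = graph.forward(v1);
        for (index_v2, &v2) in v1f.iter().enumerate() {
            // stash the candidates for v3: forward neighbors of both v1 and v2.
            stash.clear();
            intersect_and(&v1f[(index_v2 + 1)..], graph.forward(v2), |v3| stash.push(v3));
            for (index_v3, &v3) in stash.iter().enumerate() {
                // v4 must be in the stash and a forward neighbor of v3; stash
                // entries at or before index_v3 cannot be forward neighbors of v3.
                intersect_and(&stash[(index_v3 + 1)..], graph.forward(v3), |_v4| count += 1);
            }
        }
        v1 += peers;
    }
    count
}
```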
Scaling out to multiple workers is pretty trivial; you just specify `peers` to be as many workers as you have, and start up tasks with `index` ranging from `0` to `peers`. That's how I got the numbers I used. Each worker introduces some additional overhead in the form of a few more vectors of adjacency-list stash, some thread overhead, etc., but they all read the same read-only `GraphMap` adjacency lists. If you wanted to use multiple machines, you would just give each machine a copy of the graph; the graphs are mostly small. The resulting times, at least the maximum time over the 160 parts, suggest that naive parallelization could be very efficient, and much faster than the numbers `SEED` produces.
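As a concrete sketch of that (not my actual harness), the whole scaling story fits in a few lines of scoped threads; the `(graph, index, peers) -> count` shape is shared by all the queries above.

```rust
use std::thread;

// Spawn `peers` scoped threads; worker `index` handles the vertices congruent
// to `index` modulo `peers`, and the per-worker counts are summed at the end.
fn run_parallel<G: Sync>(
    graph: &G,
    peers: u32,
    query: impl Fn(&G, u32, u32) -> usize + Sync,
) -> usize {
    let query = &query;
    thread::scope(|scope| {
        (0..peers)
            .map(|index| scope.spawn(move || query(graph, index, peers)))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum()
    })
}

// e.g. `let triangles = run_parallel(&graph, 4, q0);`
```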
I hope we now have a better handle on baselines for subgraph enumeration. The `SEED` paper does have some ideas that I think are fine, in that you should index things other than just edges, if you have the resources to swing it (I don't, for Friendster, but do for the other graphs). At the same time, their absolute numbers are hard to use; when I mocked up their approach of pre-computing and indexing all triangles, the times for `q2` and `q3` dropped by a relatively modest 12 seconds each. When the evaluation of a paper's contribution rests solely on experimental measurement, we need better baselines (and better implementations) to tell if the new ideas produced broad gains; the measurements in this paper tell us mostly that the authors have improved their code base from two years ago.
Many of the COST numbers for this work are unbounded, unfortunately. The authors do have some scaling numbers for `q2` and `q7` on LiveJournal, their Figure 9, in which we can see that the COST for `q7` on LiveJournal is 8x16 cores, and the COST for `q2` is greater than 14x16 cores (their largest configuration). We don't know anything about the COSTs of `q4`, `q5`, and `q6`, because we didn't properly evaluate those here, but I think it is fair to be skeptical.
I found the distinction between the queries and their `*` variants interesting. My intent was to have the starred variants capture the essence of the computation, leaving only rote, necessary, and parallelizable work. In essence, the scaling of the remaining work is intended to be fundamentally uninteresting; if you really need to do it, it is both conceptually easy and relatively efficient (relative to the amount of work you must do). I'm not sure what conclusions to draw from them, if any, but some part of me wants to trivialize the gap. This seems similar to theoretical work in the databases space, where algorithms are allowed additive `|INPUT| + |OUTPUT|` terms for free, and only the additional costs are studied.
Our second paper (!) is the SIGMOD 2017 paper "All-in-One: Graph Processing in RDBMSs Revisited". It is behind a $15 pay-wall, for which I apologize, but by the end of the section you might think that this is a fine place for it.
The premise of the paper is to "revisit the issue how RDBMS can support graph processing at the SQL level." They aim to study how SQL can be used directly to attack graph processing problems like weakly connected components (WCC) and PageRank (PR), among others. What do they conclude? From their abstract:
We conduct extensive performance studies to test 10 graph algorithms using 9 large real graphs in 3 major RDBMSs. We show that RDBMSs are capable of dealing with graph processing in reasonable time.
It can be difficult to quantify "reasonable" time. It will not be faster than a laptop.
One quantification of "unreasonable"-ness could be that the reported times are larger than those in "The case against specialized graph analytics engines", a CIDR 2015 paper proposing that people should do graph processing using SQL, which happens not to be behind a paywall (read it instead).
The CIDR paper, amazingly, reads like a follow-on to the 2017 SIGMOD paper, studying computations on larger graphs and getting better results. Only, it was published more than two years earlier, in the same community. I don't really understand.
The "All-in-One" paper is, however, aware of this work, addressing it in its related work section thusly:
Fan et al. in [18] propose GRAIL, a syntactic layer converting graph queries into SQL script.
So they do know about the work. They just don't have anything more to say about how their work furthers our understanding, and the SIGMOD 2017 PC had the courtesy not to ask.
So we have two papers, both proposing to use SQL for graph computation, which I can assure you is a very exciting and popular approach among people who are more expert in SQL than they are in graph algorithms. Let's look at how their numbers stack up to (i) each other, (ii) some graph processing systems, and (iii) single-threaded laptop-oriented computation.
Fortunately, all of the datasets considered by the two papers are quite small. The CIDR 2015 paper considers at the largest the `uk-2007-05` graph, which has 3.7B edges (though they also have some numbers on larger synthetic data). The SIGMOD 2017 paper considers at the largest the `orkut` dataset, which has 117M edges. Fortunately, the CIDR 2015 paper also considers the orkut dataset, and so we can actually compare the results of the two papers.
Aside: It really shouldn't have to be a !#@$ing lucky coincidence that two papers on the same subject use the same dataset. I shouldn't have to cross-correlate their results and rescale their figures and squint hard to understand how pieces of research relate to each other.
Let's look first at the CIDR measurements. They consider single-source shortest paths (SSSP), weakly connected components (WCC), and ten iterations of PageRank (PR). The interesting measurements are the "SS" measurements, where "SS" is short for "SQL Server" running their approach. You can also see measurements for Giraph and GraphLab.
On average, SS seems a bit better than Giraph, maybe worse than GraphLab, but on the balance not that much different. Their point in the paper is, if it isn't all that much different, maybe just stick with the SQL installation that works and doesn't cost Apple $200M USD. Think, with that kind of money, Apple could have read the SIGMOD paper 13 million times!
How do their numbers, the SIGMOD paper's that is, look by the way?
They are a bit hard to read out of the figure (it is a screenshot, the weird visual scaling is in the paper), but it looks like their three database approaches (no comparison to other systems) take about 1,000 seconds for SSSP and WCC, and just under 2,000 seconds for PageRank (number of iterations unspecified).
It's hard to draw conclusions about PageRank because who knows, they might have run 1,000 iterations (it would still be slow), but for SSSP and WCC, it seems the CIDR paper is respectively about 15x and 5x faster than what the SIGMOD paper proposes. And that is if you manage to stay away from PostgreSQL, which tacks on another 3x-4x slowdown (but comes with a Turing award).
I judge the CIDR 2015 paper interesting, in that the HotOS COST paper hadn't appeared yet, and it was reasonable to be surprised by the questionable performance of the graph processing systems, and want to tell people (like we did). At the same time, it seemed pretty easy to diagnose the problem as resource starvation for the graph processors (slightly larger graphs run out of memory on their single machine set-up).
I judge the SIGMOD 2017 program committee something of a mess. Perhaps there was something thought-provoking in the "All-in-One" paper that I missed; they consider more algorithms? A larger collection of even smaller graphs? They report on different database vendors, to home in on the worst possible configuration for graph processing on SQL?
I said we would see some graph processing system numbers and some laptop numbers. I have to admit, I don't have numbers for the `orkut` graph with timely dataflow. I .. could get them. I guess. First, let's just look at how the orkut numbers from the two papers up above line up with a single-threaded measurement. (NB: I have no idea about the number of iterations for SIGMOD 2017; as we'll see, it doesn't really matter.)
orkut | WCC | PR (iters) |
---|---|---|
CIDR15 | ~180s | ~220s (10x) |
SIGMOD17 | ~1000s | ~2000s (??x) |
LAPTOP | 2.644s | 13.090s (20x) |
The WCC numbers use the same algorithm (label propagation) the other papers use, on an adjacency list representation. If you use the union find algorithm, the WCC number drops down to 1.146 seconds. If you arrange the edges according to a Hilbert curve, the WCC time drops to 0.727 seconds and the PageRank time drops to 3.955 seconds for twenty iterations.
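For reference, here is a minimal union-find sketch of the sort that 1.146 second number refers to (not the COST paper's exact code); it assumes the edges are available as a flat list of `(src, dst)` pairs, and returns a component representative for each node.

```rust
// Union-find connected components with union by rank and path halving.
fn connected_components(nodes: usize, edges: &[(u32, u32)]) -> Vec<u32> {
    let mut roots: Vec<u32> = (0..nodes as u32).collect();
    let mut ranks: Vec<u8> = vec![0; nodes];

    // Find the representative of `x`, halving paths as we go.
    fn find(roots: &mut [u32], mut x: u32) -> u32 {
        while roots[x as usize] != x {
            roots[x as usize] = roots[roots[x as usize] as usize];
            x = roots[x as usize];
        }
        x
    }

    for &(src, dst) in edges {
        let mut a = find(&mut roots, src);
        let mut b = find(&mut roots, dst);
        if a != b {
            // Union by rank: hang the shallower tree under the deeper one.
            if ranks[a as usize] < ranks[b as usize] {
                std::mem::swap(&mut a, &mut b);
            }
            if ranks[a as usize] == ranks[b as usize] {
                ranks[a as usize] += 1;
            }
            roots[b as usize] = a;
        }
    }

    // Point every node directly at its representative.
    for node in 0..nodes as u32 {
        let root = find(&mut roots, node);
        roots[node as usize] = root;
    }
    roots
}
```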
Let's leave the SIGMOD 2017 paper behind, and advance to measurements on graphs that require at least a gigabyte to represent. I'm talking about `twitter_rv` and `uk_2007_05`, both of which the CIDR 2015 paper has measurements for, and on both of which Malte and I have previously evaluated timely PageRank.
First, let's check out the CIDR measurements, which include fewer of the graph processing systems (Giraph and GraphLab) because apparently they run out of memory or crash or whatever. It's ok to make fun; they've all made lots of money.
We are now up in the thousands of seconds! Wow. Big numbers.
Let's break out the timely dataflow numbers for PageRank on a single machine, which also report how long a single thread takes to do the work:
With 16 cores the CIDR system is doing ten iterations in 3,000 and 6,500 seconds, for `twitter_rv` and `uk_2007_05` respectively. With 12 cores timely is doing twenty iterations in about 100 seconds. And even the simple laptop configuration with the vertex order comes in at 300 and 650 seconds, almost exactly one tenth of what it takes a 16-core database to do the same thing.
A factor of 10x isn't a horrible tax to pay for re-using re-usable infrastructure, like your database. A factor of 60x-130x is more debatable, but I think the CIDR paper is totally worth talking about. I'm a bit disappointed that it lacks reasonable baselines, but not as disappointed as I am that the SIGMOD paper doesn't have any baselines at all.
In the past two years the database community has gotten substantially worse at doing graph processing with SQL. I wouldn't recommend it, really. They've gone from a 16-core machine being only 10x slower than a single-threaded `for`-loop, to being up to 1000x slower, but only on graphs up to 500MB in size.
At this rate, it is going to be hard work to move the needle for SIGMOD 2018.
This was SIGMOD 2017's best paper winner.
I like several things about the paper, mostly that it is going in new directions, but I would really recommend the authors' other paper, Incremental Graph Computations: Doable and Undoable (also paywalled; if you surf from SIGMOD 2017's program, you can get in though) where they describe the algorithms they implement. The system itself is not clearly an improvement on existing iterative processors (for example, Timely Dataflow, but really any other iterative system).
First, let's clarify the title. When the authors say "sequential" they do not mean "single-threaded", nor "as opposed to concurrent", nor anything like that. For the purposes of this paper, a "sequential" algorithm is one that can respond to changes to its input, updating its output correctly. No, that is not what I was expecting either.
The gist of the paper is that if you have such algorithms, ones that can absorb changes to the graph and produce correspondingly changed output, you could partition up your graph somehow and iteratively absorb, produce, and exchange changes, until things eventually stop changing. Unlike in Pregel (for people who are only familiar with Pregel), where each vertex is logically isolated from others, this paper says "hey, if you have a better algorithm for your whole hunk of vertices, use it!"
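Very roughly, and with the caveat that this is my reading of the gist rather than the paper's actual interfaces, the shape of the iteration is something like the following: each partition hosts a "sequential" algorithm that absorbs a batch of incoming updates and reports the updates destined for other partitions, and we loop until the inboxes drain.

```rust
// A rough, hypothetical sketch of the iterate-until-fixed-point structure
// described above; the trait and driver are mine, not the paper's API.
trait Sequential {
    type Update;
    /// Absorb incoming updates and return outgoing updates, each keyed by the
    /// index of the partition that should receive it.
    fn absorb(&mut self, updates: Vec<Self::Update>) -> Vec<(usize, Self::Update)>;
}

fn iterate<S: Sequential>(parts: &mut [S], initial: Vec<Vec<S::Update>>) {
    let mut inboxes = initial;
    // Keep going until no partition has pending updates to absorb.
    while inboxes.iter().any(|inbox| !inbox.is_empty()) {
        let mut next: Vec<Vec<S::Update>> = (0..parts.len()).map(|_| Vec::new()).collect();
        for (part, inbox) in parts.iter_mut().zip(inboxes) {
            for (dest, update) in part.absorb(inbox) {
                next[dest].push(update);
            }
        }
        inboxes = next;
    }
}
```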
This totally makes sense (and indeed is how lots of existing systems work). It is such a good idea, we did it for graph connectivity using Naiad back in the COST paper (union find is such a "sequential" algorithm). Our attempt at establishing a COST for graph connectivity was to lean on Naiad until distributed union-find was faster than our single-threaded union find. Naiad is more general than what is proposed in this paper (streaming allows overlapped work in union-find), but the topic was less thoroughly investigated than in this paper (union-find and pagerank aren't very in-depth).
But this paper does graph connectivity too! We can compare things, and see how they line up. I'm going to be a bit lazy and grab the one measurement that I happen to know I have: graph connectivity for LiveJournal. They do many other computations, but I'm not here to do a thorough assessment. Their paper probably would have been the best place for that.
The x-axis `N` is the number of machines, each of which has four cores.
The system presented is `Grape`, which is the second-best line here. Look at those `Blogel` numbers though, amirite? It turns out this is because .. wait for it .. Blogel reportedly computes connected components as part of their "pre-processing". But we aren't here to talk about Blogel. Grape also does some non-trivial pre-processing, and while not as thorough as Blogel's, it gets them down to 1.6 seconds at 24x4 cores. Way better than the state of the art.
Single-threaded time to do graph connectivity on LiveJournal?
Either one second, or half a second, depending on whether you use adjacency lists or that Hilbert curve preprocessing thing. And it goes faster with two threads, because we evaluated that in the COST paper (although admittedly on a much larger graph).
Generally, I like the direction the paper takes; it brings new algorithms into play, and does something less silly than people have been doing before. It would be nice if their system was demonstrated to be better than what people have done before (both single-threaded, and timely), but maybe we save that for next year (or maybe for two years ago, because SIGMOD). I do have more questions about their paper, but they are "what does your theorem actually mean" questions rather than "what neurotoxin did you coat your submission with" questions. I think that is progress.
Working hard to do things well is for other people.
Carlo Curino tells me that the database community prioritizes "novelty", and indeed none of the presented works did things the way I would have. In that sense, novelty accomplished. Or perhaps I misunderstood Carlo; as Merriam-Webster tells us, "novelty" has multiple meanings.
While I wasn't expecting it, and I don't know who H. M. Jones is, that quote thar sums up my feelings pretty well.