Today we are going to look at something that I think is pretty amazing. I've wanted to do this for a while, and while at ETH Zürich I found the time and collaborators (John Liagouris and Zaheer Chothia) to make it happen.
We are probably all familiar with "big data": for some reason you have been gifted with a pile of data so large that you couldn't hope to do anything with it other than download some scalable system, describe some thoughtful computation, and press "GO". Of course, there are many flavors of these computations, and the one we'll be working with is "high-throughput", "low-latency", and allows you to write nice iterative computations: differential dataflow. But really, what we are about to do isn't system dependent: you should be able to do this with any sufficiently powerful system; we'll just use a specific sufficiently powerful one.
So. You went and wrote a big data computation, you pressed "GO", and depending on your system you are now either waiting, or perhaps watching a stream of results. Look at all those answers coming out! How did they get there? Imagine picking up any one output and asking "why this result, and not another?". Your social network analysis tells you who can reach whom, but how can you explain the result as anything other than "my computation applied to a terabyte of data"?
We're going to look today at the problem of explaining outputs. For all of the exciting and meaningful results you computed, how can individual result elements be explained in terms of inputs to the computation? What does that even mean?
We have a tech report online, in case you are the sort of person who prefers that type of reading. It has more details and fewer flip comments.
Let's take a simple example from above: determining the connected components of a social graph like the Twitter @mention relationship. Maybe once we have computed the results we need to explain some things. Maybe we want to know how to get two connected people in touch, or explain why you are so well connected to Justin Bieber, or debug a logical inconsistency in the data (e.g. "there should be no path from A to B; what's up?").
We've seen before how to perform a reachability computation: individuals propose labels to their neighbors, and each remembers (and re-proposes) the smallest label it has ever seen. In the fullness of time, any two people who can reach each other will eventually have the same label. The results of this computation are pairs (person, label)
, where two people with the same label can reach one another.
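If it helps to see the shape of the computation, here is a minimal, self-contained sketch of the propagation loop in ordinary Rust (not differential dataflow; the names, like propagate_round, are made up for illustration):

```rust
use std::collections::HashMap;

/// One round of label propagation: every node proposes its current label to
/// its neighbors, and each node keeps the smallest label it has ever seen.
/// Plain-Rust stand-in for illustration, not differential dataflow.
fn propagate_round(edges: &[(u32, u32)], labels: &mut HashMap<u32, u32>) -> bool {
    let mut proposals = Vec::new();
    for &(src, dst) in edges {
        // propose in both directions: an @mention connects both parties
        if let Some(&l) = labels.get(&src) { proposals.push((dst, l)); }
        if let Some(&l) = labels.get(&dst) { proposals.push((src, l)); }
    }
    let mut changed = false;
    for (node, label) in proposals {
        let current = labels.entry(node).or_insert(label);
        if label < *current { *current = label; changed = true; }
    }
    changed
}

fn main() {
    let edges = vec![(3, 5), (5, 7), (2, 9)];
    // every node starts out with its own identifier as its label
    let mut labels: HashMap<u32, u32> =
        edges.iter().flat_map(|&(a, b)| vec![(a, a), (b, b)]).collect();
    while propagate_round(&edges, &mut labels) {}
    // nodes 3, 5, 7 end up with label 3; nodes 2, 9 end up with label 2
    println!("{:?}", labels);
}
```

The differential dataflow version expresses the same propagation as an iterative dataflow over the edge collection, which is what lets it update incrementally as edges change.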
Imagine we run the computation and see among our outputs (frank, label)
. You might want to know why; what part of the input leads the computation to produce this particular result? How did I get that label?
If you are familiar with the algorithm, there is a fairly simple answer: label
traverses a sequence of graph edges to get to frank
. This sequence of edges is just a path in the graph, and is a great explanation for why we see this result: the output is concise, and it has the property that if we re-run the computation on just this input (the path) we get (frank, label)
in the output.
Consider the following two ways of looking at your data: with all the data, and with only input elements that explain the output:
Explanations look like they might be pretty handy for reachability, right?
I mean, we would still have to write a weird system for finding paths in the graph, and do more bespoke work if you ever want to change your query (e.g. maybe you decide to wait for a second copy of the label, making the explanations be "two paths", or something weird like that). But at least we understand what an explanation kinda looks like, right?
By the end of the post we'll see how to do the above against constantly updating billion edge graphs with millisecond latencies, without writing any new system (or even much/any new code, in principle).
What we will work through today is a framework for explaining the outputs of any differential dataflow computation, using differential dataflow itself. This means that we'll end up with computations that can be interactively interrogated to get output explanations, and these explanations update in realtime as the explanatory inputs change. If the paths from label
to frank
change, we will observe the stream of these changes (and if the label for frank
changes, we will see the paths swing to connect to the new label).
So we automatically get interactive explanations for arbitrary outputs of any computation, in realtime. What's not to like? Well, there are some overheads, and we will see these, but I'm glad to have your attention.
Before going too far, let's specify the problem we would like to solve. Imagine we have an input collection, a computation, and a corresponding output collection. For any subset of the output domain, we want to identify a subset of the input collection whose output under the computation agrees with the actual output, on the specified subset. This was a bit wordy, because we want to be able to explain not only output records, but the absence of output records. For some set of records that may or may not exist, we want a subset of the input that faithfully reproduces what actually happened for those records. Even if this means preventing some records from being produced.
We know there is at least one solution to this problem: the entire input collection itself reproduces the entire output collection. However, we are interested in more concise explanations, and will work towards minimizing the amount of input we actually need. This is one place where differential dataflow helps out a great deal: naïve adaptations of existing work can result in enormous explanations. Even differential dataflow can provide large explanations, and we will need to introduce some "optimizations" to streamline the explanations. For example, we may just want one path between two nodes, rather than all paths, or even "all shortest paths".
We promised interactive updates to explanations for interactively updated computations, so let's clear that up now.
Recall that differential dataflow incrementally updates computations as their inputs change. We will produce explanations in the spirit above, using differential dataflow. As such, that computation (the explanation computation) will interactively update as well. At each moment in time, we will have computed an explanation for the current snapshot of the data, where the stream of updates to the explanations moves us from one correct answer (explanation) to the next, tracking the changes to the input.
Let's do a quick recap of what differential dataflow is, because this will be important for understanding the role it plays in explaining things. We could just start with the idea that it is good at doing big-data-like computations, and that it quickly updates those computations. This is a great start, but we'll want to dig a bit deeper to really grok what is going on.
Differential dataflow computations can be viewed as directed dataflow computations: each vertex in the dataflow graph names a collection, and directed edges from one vertex to another indicate that the source is used in the definition of the target.
For example, here is a picture (it is for breadth-first search, but the graph structure is basically the same for label propagation):
This structure has several important consequences:
Each differential dataflow collection is defined only by its inputs, which is an important part of explaining the collection.
A collection is explained by the collections that are its dataflow inputs.
This isn't very helpful yet, because we have identified entire collections as useful for explaining outputs. If we just report that you pretty much need all the data, you wouldn't be too impressed.
Fortunately, most big data computations (and differential dataflow specifically) use data-parallel operators: each collection is not an arbitrary function of its inputs, but rather the result of partitioning the elements of its input collections, and applying a function to each part. For example, when we determine the minimum label a graph node receives, we partition the proposed labels by "proposee"; at this point, each part yields only the smallest label, independent of what the other parts do.
This independence is part of what makes differential dataflow go fast: we know that changes only affect one part, and can minimize the amount of recomputation we do when a change arrives. For exactly the same reason, independence helps us provide concise explanations: we know that only so many records played a role in determining some output.
A collection element is explained by the input collection elements in the part that would produce it.
This is now a bit more helpful. If we need to explain some records in the output of an operator, we just need to look up the inputs to the corresponding parts. They do a good job explaining the output.
While each operator can name input records required to explain output records, the dataflow graph shows us how to find explanations for these input records: they were output records of some other vertex in the dataflow graph, or perhaps they are inputs to the computation itself. In either case, we can chase backwards through the dataflow graph, expanding the set of tuples of which we require explanations, until we have some subset of the input records and no additional internal records to explain.
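As a mental model only (the actual mechanism is itself a dataflow, described below), the chase is essentially a worklist algorithm: keep a frontier of records that still need explaining, replace each internal record with the operator-input records that explain it, and keep whatever bottoms out at the computation's inputs. A hypothetical sketch, with made-up types:

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical record type for illustration: each record remembers which
// dataflow vertex produced it.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Record { vertex: String, data: String }

/// Chase explanation requests backwards through the dataflow graph.
/// `explain_one_step` stands in for the per-operator logic that maps an
/// output record to the operator-input records its part needed.
fn explain(
    requested: Vec<Record>,
    explain_one_step: impl Fn(&Record) -> Vec<Record>,
    is_computation_input: impl Fn(&Record) -> bool,
) -> HashSet<Record> {
    let mut explanation = HashSet::new();   // computation inputs we will report
    let mut seen = HashSet::new();          // don't explain the same record twice
    let mut todo: VecDeque<Record> = requested.into();
    while let Some(record) = todo.pop_front() {
        if !seen.insert(record.clone()) { continue; }
        if is_computation_input(&record) {
            explanation.insert(record);     // nothing further upstream to chase
        } else {
            todo.extend(explain_one_step(&record));
        }
    }
    explanation
}
```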
Looking back at the picture up above: if we had to explain the outputs of group
, we would push them back to concat
, which would probably require explanations from map
, etc.
There is the sneaky issue that differential dataflow allows loops, which means we might have to chase these dependencies for a while. This is true, but we'll use differential dataflow's timestamps (in the next section) to make sure that we only chase causal dependencies: records that could actually play a role in the production of outputs.
Differential dataflow transcribes evolving collections as an append-only log of changes to the collection. The collection might initially have some records r1, r2, r3
in it, but as the computation proceeds perhaps r1
departs (for reasons unknown). This would be recorded by differential dataflow as something like:
(r1, +1, 0)
(r2, +1, 0)
(r3, +1, 0)
(r1, -1, 7)
where the shape of these tuples is (record, change, time)
: the record has some change to the number of times it occurs in the collection, where the changes occur at some logical time. Here we've had the first three changes take effect immediately (at time 0
), with the removal of r1
at time 7
, which I just made up arbitrarily.
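A small sketch of how to read such a log (again plain Rust, not the actual differential dataflow types): the collection as of some logical time is just the accumulation of all changes at times less than or equal to it.

```rust
use std::collections::HashMap;

/// Accumulate a change log of (record, change, time) tuples up to and
/// including logical time `t`. Illustrative only; differential dataflow's
/// own representation is more involved.
fn collection_at<'a>(log: &[(&'a str, i64, u64)], t: u64) -> HashMap<&'a str, i64> {
    let mut counts = HashMap::new();
    for &(record, delta, time) in log {
        if time <= t {
            *counts.entry(record).or_insert(0) += delta;
        }
    }
    counts.retain(|_, count| *count != 0); // keep only records actually present
    counts
}

fn main() {
    let log = [("r1", 1, 0), ("r2", 1, 0), ("r3", 1, 0), ("r1", -1, 7)];
    assert_eq!(collection_at(&log, 0).len(), 3); // r1, r2, r3 all present
    assert_eq!(collection_at(&log, 7).len(), 2); // r1 has departed by time 7
}
```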
Each differential dataflow operator maps whatever collection logic it has across all of the times at which its inputs change. For the example above, it could produce output changes at times 0
and 7
. Importantly, the computation for time 0
ignores the input changes for time 7
(or other times greater than 0
); consequently, they will not play a role in explaining such outputs.
The important feature here is the timestamp: when we explain records in a differential dataflow computation, we will ask for explanations at a specific timestamp. For example, if we want to know about r1
, we need to clearly indicate whether we want it explained at time 0
, when the record still exists, or at time 7
, when it no longer exists. In the first case, we can ignore the (r1, -1, 7)
tuple; it doesn't explain the output, and we don't need to have it explained. In the second case, we totally need that subtraction, because without it we wouldn't produce the right answer (and you wouldn't be impressed with the explanation).
Timestamps also allow us to break "explanation loops" that might otherwise exist in chasing records around a cyclic dataflow graph: here we guarantee that the explanatory timestamps only decrease as we chase them through the dataflow graph.
We are going to start with a very simple approach: for each differential dataflow operator, we'll make a second, new operator with the same behavior, but additional inputs and outputs. For each output of the original operator, we'll add a new "explain me" input: this input sees a collection of operator outputs that should be explained. For each input of the original operator, we'll add a new "explanation" output: this output produces the subset of the corresponding input needed to explain the requested outputs. Because the figures work this way, we'll add the suffix _q
to these new "query" collections.
We'll then wire up our new operators in a reflection of the original dataflow graph: wherever there is an edge from op1
to op2
, we will add an edge in the reverse direction, from op2_q
back to op1_q
. This will turn explanation requests made of op2
into explanation requests for op1
, which in turn can pass requests back to its input operators.
Here is a schematic picture, which isn't particularly accurate. :)
For timely dataflow nerds, these new operators will live in a second scope, parallel to the original. This indicates to timely that information only flows in one direction, from our computation to the explanation infrastructure, and not back; this prevents explanation work from blocking actual computation. What isn't shown in the picture is the flow of information from the first scope to the second; we'll want the actual computation to inform the explanations, and that isn't happening yet.
We will start with simple, general versions of operator explanations, which should make our approach clear (I hope), but we'll want to move on to some smarter versions to streamline the computation. We will do that just afterwards, so if it all seems hopelessly naïve don't worry (at least, not yet). We'll also need to add a bit more complexity to deal with the big secret I haven't mentioned yet.
The simplest sort of operator in differential dataflow is the group
operator, which allows you to specify a key for each input record, and some logic to apply to each group of records (grouped by that key). The output collection is the collation of all of the outputs produced by the groups.
We are going to make our lives a little easier here, and require that the outputs from group
have the form (key, val)
; that is, they explicitly record the key used to produce the record. We could work around this by keeping a bit more information on the side, but we'll see that most of our use cases do this, and it is basically free to "discard" the key if we actually want to.
Let's try to implement group_q
: we have records input to group
that look like (key, val)
and output records that look like (key, val)
. For any requested output records (inputs to group_q
) we want to find the group
input records with the same key (because those are exactly the inputs that produced the output). It kind of looks like we want a join
between the actual input and the requested output. Let's do that.
The group
operator on top is what we see on the left side of the picture above. The join
below it just gets called group_q
, because it's all we need to fish out the relevant inputs for any given outputs.
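If it helps, here is that join written out as plain Rust over ordinary vectors (the real group_q is a differential join over evolving collections; these types are just for illustration):

```rust
use std::collections::{HashMap, HashSet};

/// group_q as a join: for each requested output key, return the input records
/// in the corresponding part; they are exactly what produced that output.
/// Plain-Rust stand-in for the differential join.
fn group_q(
    inputs: &[(u32, String)],          // group's input records: (key, val)
    requests: &[(u32, String)],        // outputs we were asked to explain: (key, val)
) -> Vec<(u32, String)> {
    // index the input by key, as the join would maintain it
    let mut by_key: HashMap<u32, Vec<String>> = HashMap::new();
    for (key, val) in inputs {
        by_key.entry(*key).or_default().push(val.clone());
    }
    // one lookup per requested key; emit that part's input records
    let keys: HashSet<u32> = requests.iter().map(|(key, _)| *key).collect();
    keys.into_iter()
        .flat_map(|key| {
            by_key.get(&key).cloned().unwrap_or_default()
                  .into_iter().map(move |val| (key, val))
        })
        .collect()
}
```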
This is a simple way to determine the inputs; what are the overheads involved? A join
in differential dataflow is roughly like a hash-join: each input is kept indexed by its key, and when records (changes to records) arrive on either input they are matched against existing records on the other input. In our case, the input collection will be maintained indexed by key, and as new requests arrive they result in one look-up, and the emission of input records with the same key.
This approach maintains a second copy of the input data, indexed by key. That can be a fair amount of space. At run-time, lookups for explanations are relatively cheap; the work done is proportional to the amount of explanation required, which is probably as good as it gets. As it turns out, the data are indexed by the join the same way group indexes them, and we could try to share this data. There are some logistical issues at the moment (the data have different timestamp types), but in theory it is possible.
The other operator we need to get a primitive MapReduce computation up is map
, which allows you to take each input record to a collection of output records, with no requirements on how you do it. In principle it is easy to explain this operator, because each output record comes from some specific input record. In practice, you have to just write down all these pairs, which is easy but annoying. We will improve it, but for now the picture looks like this:
The join
here is like the one for group_q
, but there are probably a lot more keys here. Plus, we expected to do a fair bit of work when we wrote group
, but we probably thought map
would be cheap. We'll fix this up in some special cases in the smarter variants.
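For concreteness, the general version amounts to remembering an (output, input) pair for everything the map produces, something like this plain-Rust stand-in (MapQ and its methods are made up for illustration):

```rust
use std::collections::HashMap;

/// General map_q: record an (output, input) pair for every record the map
/// produces, then answer explanation requests by lookup. Easy, but the state
/// grows with everything that flows through the map.
struct MapQ<I, O> {
    provenance: HashMap<O, Vec<I>>,
}

impl<I: Clone, O: Clone + Eq + std::hash::Hash> MapQ<I, O> {
    fn new(inputs: &[I], logic: impl Fn(&I) -> Vec<O>) -> Self {
        let mut provenance: HashMap<O, Vec<I>> = HashMap::new();
        for input in inputs {
            for output in logic(input) {
                provenance.entry(output).or_default().push(input.clone());
            }
        }
        MapQ { provenance }
    }
    /// Which inputs are needed to explain this requested output?
    fn explain(&self, requested: &O) -> Vec<I> {
        self.provenance.get(requested).cloned().unwrap_or_default()
    }
}
```

This is exactly the "easy but annoying" bookkeeping described above, which is why the invertible special case below is worth having.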
These are going to be some smarter approaches, but also some admissions that what was done above wasn't sufficient. For example, we don't really treat binary operators above (you could generalize the approach without much work), and we haven't dealt with "keyless" operators like map
which can be expensive without the key structure. Maybe think of this section as making you smarter, rather than us being smart.
Explanations for map
can be simplified dramatically when the map logic is invertible. That is, if we can go automatically from the output to the input, then we don't need to write anything down. You might think this is a rare case, but in our examples at least we use map
mostly to re-arrange a record to exchange which parts are key and which parts are value.
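For instance, the re-keying maps we actually use are their own inverses, so the smarter map_q is nothing more than the following (a plain-Rust sketch, names made up):

```rust
/// When the map logic is invertible, map_q keeps no state at all: it just
/// applies the inverse to each requested output. Here the map swaps which
/// field is the key, and the inverse of a swap is the same swap.
fn swap_key_and_value((key, val): (u32, u32)) -> (u32, u32) { (val, key) }

fn map_q_invertible(requests: &[(u32, u32)]) -> Vec<(u32, u32)> {
    requests.iter().map(|&request| swap_key_and_value(request)).collect()
}
```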
In case it isn't clear, we have a picture indicating the smarter dataflow:
The join
operator takes two collections (key, val1)
and (key, val2)
and produces pairs (key, (val1, val2))
when it finds a matching pair. Requesting an explanation for (key, (val1, val2))
ends up being very easy, as we have all the information about what we need in front of us. We just request an explanation for (key, val1)
from the first input and (key, val2)
for the second input.
No additional state needs to be retained for join
operators. At runtime, each explanation request just turns into one explanation request for each of the inputs to the join. This is a neat example where a painful operator (join
) turns into very simple explanation infrastructure, sort of the dual to map
.
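In the same plain-Rust shorthand, join_q is just a record split, with no indexing and no state:

```rust
/// join_q: a request to explain (key, (val1, val2)) becomes one explanation
/// request for each input of the join. Illustrative types only.
fn join_q(
    requests: &[(u32, (String, String))],
) -> (Vec<(u32, String)>, Vec<(u32, String)>) {
    let left  = requests.iter().map(|(key, (val1, _))| (*key, val1.clone())).collect();
    let right = requests.iter().map(|(key, (_, val2))| (*key, val2.clone())).collect();
    (left, right)
}
```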
The group
operator is a general-purpose aggregator, but often we know there is more specific logic to be applied. One example we saw several times in our programs is the topk
operator, which partitions its input by some key and for each key retains the top few values in the part.
For example, label propagation repeatedly circulates label proposals, with each participant retaining only the smallest label. We would usually write this with a group
operator, because we want to bring together all the proposals destined for one participant. We know that we only plan to keep the smallest label, but as long as this remains our secret any general-purpose explanation infrastructure will only know to ask for explanations of all of a parts input.
Instead, if we make clear that we are using a special form of group
, we should be able to ask for the smallest record (or records) as the explanation. These records explain what actually happens, as all other records are simply discarded. To make this work correctly, we need to join any requests (key, val)
with the actual top few records, produced as output. This join is important because the actual answer may differ from our request; recall that we want to explain absences as well as actual records (why was that again? keep reading!).
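A sketch of that logic, again in plain Rust as a stand-in for the differential version: index the operator's output (the few retained records per key), and answer each request with that key's actual winners, whether or not the requested record was among them.

```rust
use std::collections::{HashMap, HashSet};

/// topk_q: requests (for present or absent records alike) are joined against
/// the operator's *output*, so the explanation is the record(s) that actually
/// won, which also accounts for everything that was discarded.
fn topk_q(
    outputs: &[(u32, u32)],   // topk's output: the retained (key, val) records
    requests: &[(u32, u32)],  // (key, val) records we were asked to explain
) -> Vec<(u32, u32)> {
    let mut retained: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(key, val) in outputs {
        retained.entry(key).or_default().push(val);
    }
    let keys: HashSet<u32> = requests.iter().map(|&(key, _)| key).collect();
    keys.into_iter()
        .flat_map(|key| {
            retained.get(&key).cloned().unwrap_or_default()
                    .into_iter().map(move |val| (key, val))
        })
        .collect()
}
```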
The overhead of topk
is like that of group
, except that we index the output of the operator rather than its input. This can be substantially smaller, especially if the topk
reduces the data a lot. At runtime, each explanation request turns into one lookup, like with group
, but probably produces fewer explanation requests of the operator's inputs.
When we write a differential dataflow program, we also assemble a shadow copy of the dataflow graph, one that draws in various collections produced by the real computation, as well as requests for explanations of its output records. This shadow dataflow graph circulates explanation requests until they have all resolved into requests made of the input collections themselves. These inputs are then the explanations for the outputs.
Both of these dataflow graphs are running at the same time, both continually updating in response to changes in the inputs (data, and requested explanations).
This sounds great, and we are going to look at how well it works on the graph connectivity problem. Then we will talk a bit about how it doesn't actually work as promised for general computation, we'll fix it, and then we'll move to a more awesome problem (stable matching).
We are going to take some "real" graphs, a Livejournal linkage graph and a Twitter follower graph, spin up a graph reachability computation, and start asking for explanations about their outputs. These datasets are both static, so we will fake out some changes when it comes time to see how responsive the computations are.
There aren't a lot of techniques for automatically explaining iterative computations. Believe me when I say it was hard enough just finding systems that produce the correct answers in the first place. There are two strawman implementations we could consider, but don't take these as critiques of their associated systems; neither were designed for this.
If you think of our iterative computations as repeated batch computation, you could apply some recent work on data provenance to determine, for each iteration, which input tuples play a role in producing each output tuple. The reasoning we did for group
isn't new, and you can just apply it on an operator by operator basis.
If you try this, you pretty quickly run aground on a few issues. First, Spark isn't great at iterative computations in the first place, even without explaining its outputs. Second, and more importantly, in each iteration the labels for each node get updated based on new proposals and old values. Nothing in the computation reveals that "staying put" doesn't require an explanation. If a label hasn't changed, don't bother tracking down the explanations for all those pointless proposals, right? Unfortunately, if you require explanations for these pointless proposals, you are going to bring in the edges they traversed, transitively, which ends up being all the edges in the connected component. Accurate, but not yet helpful.
The Spark-style computational model doesn't reveal the relation between successive iterates, which is what lets differential dataflow determine that it only needs explanations for times less than or equal to those of the requests, and stop asking questions about records when nothing changes.
Another approach is to invoke provenance work done in the datalog community. If you aren't familiar with datalog, think of it as a version of SQL with recursion and a less shouty syntax. The language inherently supports recursion and joins, which is great for label propagation, but it doesn't natively support aggregations like min
or topk
. Also, for good reasons, the explanations produced in these systems include all explanations, corresponding to all paths from label
to frank
in our example from way above.
This isn't as bad as the Spark-style "literally every path anywhere" explanations, but in a connected graph you are still going to get all the edges in your connected component. You'll produce fewer intermediate requests though, so yay.
Both of these approaches end up asking for pretty much all the edges in the graph. This isn't helpful. We aren't going to evaluate these.
We'll start with a simple differential dataflow approach, based on a direct port of the connected component code in the differential dataflow repository. This code uses a group
to collect the labels each iteration. The differential computation does reveal that it is the same collection changing each iteration, which allows differential to isolate the first time a label (e.g. (frank, label)
) appears, explain the appearance, and then stop talking about it afterwards. On the other hand, this simple version isn't bright enough to use the topk
instead of group
, and that has some performance issues we will check out.
Our plan is to load up the Twitter data, as a batch computation but with the potential to update the edges if we like, and then start asking for explanations of random observed outputs. We'll get our answers, but the amount of time taken for the responses may (will) vary. Let's look at that.
Ok, let's explain this figure. We did this experiment 1,000 times, recording the amount of time taken to explain random outputs, and we are plotting the distribution of these 1,000 measurements in this figure. Basically, we sort the values and plot them, and that gives the empirical complementary cumulative distribution function, or something like that.
What the figure tells us is that we mostly take at least one second, and we get as large as one thousand seconds (though not so often). That doesn't sound brilliant, does it? You were promised more!
The explanation lies in the next figure, in which we plot the time taken against the size of the explanation: the number of input edges produced.
This is just a scatterplot, but it shows us that there is a pretty serious correlation between the amount of time taken and the size of the explanation. And, wtf why are the explanations so big so often? Some are nearly one million edges. That sounds wrong, right?
The problem here is that group
is just pulling in all contributions to each computation. This isn't as bad as it sounds, in that if you get label 0
in round 5, group
is smart enough to only ask for inputs in round 4
, and then round 3
etc. Because the differential dataflow label propagation is smart (it propagates small labels first), this actually means that we are producing edges corresponding to shortest paths from, e.g. label
to frank
.
The problem is that there can be lots of these shortest paths. Many nodes have large numbers (hundreds of thousands) of paths of the same short length back to the node that produced their label. The group
operator keeps all of them, because ... well, you didn't realize this would be a problem either, did you?
So at least part of the problem is that we are producing too much output. It isn't just that there are 1M records to work with (that doesn't take 1,000 seconds); all the weird chasing around of data and looking things up is just a mess with such casual explanations.
Let's swing in the topk
operator, which is smarter than group
and only requires one element (label proposal, in this case) as an explanation. As everyone is actually proposing the same label in the example above, what this actually produces as an explanation is the shortest path along the smallest node identifiers. Kind of weird, but it makes for a unique explanation where otherwise there would be many.
Let's look at the latency plot. We took this one with multiple workers, because... science.
These numbers look a lot better. We are well below one second, and typically down around 10-100ms, depending on the number of workers we use. The fewer workers the better, right? It makes sense because there is very little parallel work to do here, but the throughput of the actual system would (does, figure incoming) suffer if we don't use multiple workers. The 32 worker numbers look a bit dodgy, in part because this is the number of cores the machine has, and if anything else needs to run the measurements get screwed up.
We also took some measurements where we update the graph to see how those latencies are affected. In this case we added and removed 10 random edges and measured the time to resolve both the graph connectivity answer and any changes to explanations.
They basically tell us that updates are still millisecond or less, except for some idiot measurement out in the tail that spiked to 1,000 seconds. No clue there; we think it is start-up or a glitchy measurement, but didn't want to just delete the data point (again, science).
What about overheads? Like, how much are we paying extra when we turn on this explanation infrastructure? Yeah, some. Actually a fair bit. We ran the computation with and without explanations, with various numbers of workers, and compared it to Myria, a recent big data system that does graph connectivity. We tried to use SociaLite but it didn't produce the right answers, and GraphX doesn't run on the resources we used (one machine).
There is overhead, and it is non-trivial, but we still end up going faster than other systems. So, "boo overhead" and "yay differential dataflow". Possibly also "boo not giving Myria a fair shot because we don't really understand it". Myria folks, call me if these numbers are surprising. As another data point, GraphX on the same dataset takes about 400 seconds on a 16 node (128 core) cluster, somewhere between our 2- and 4- core numbers, and it neither incrementally updates nor explains anything (except your AWS bill).
The overheads are mostly due to the fact that the reference differential computation is pretty lean. All of the internal indices it uses are keyed by dense unsigned integers, and just use direct indexing into arrays. Some of the explanation operators require indices keyed by pairs, and fall back to a much slower HashMap
implementation.
These numbers are great. Zoom. Pow!
There was something about how it doesn't work, wasn't there?
The label propagation example works just fine. But it is a special case of computation that everyone seems to fixate on, called a monotonic computation. If you add more inputs to such a computation, you just get more (or better) outputs. Additional inputs don't suppress output records.
It turns out if you are dealing with non-monotonic computations, the explanations we produce don't necessarily explain the results. Although we did a great job of finding inputs that played a role in producing your outputs, they may not have the property that if you re-run your computation on them you actually get the output. That would suck, right? But yeah it happens.
In SQL land you don't see this problem because SQL land things are monotonic. Same with Datalog. We think what we are about to do is perhaps the first time folks have derived explanatory input for large-scale non-monotonic computations (but yell at me if this is wrong).
To get a sense for what goes wrong, let's look at another cool algorithm that differential dataflow supports: stable matching.
Stable matching is a problem where you have a bipartite graph, so some nodes are blue and some are red, and each node has some preferences for some of the nodes of the other color, which we indicate by an edge with some preference on it (small values are better). The idea is to find a matching, a subset of these edges, so that no blue and red node would prefer to ditch their current assignment and hook-up with each other instead.
There are a bunch of algorithms for this problem, but there is a really simple one which proceeds in rounds, where each blue node proposes to its most preferred red node that hasn't yet rejected it. Red nodes then look at their proposals and go with whichever is best, walking away from proposals in the last round if they've gotten something better.
Now, grumpy people might say this requires a lot of nerve on the part of the blue nodes, and is bad behavior on the part of the red nodes, but you can prove that the process ends up with blue-optimal and red-pessimal assignments, so stfu and put yourself out there.
Anyhow, let's write the algorithm in differential dataflow, starting from a collection of quadruples
(name_1, pref_1, name_2, pref_2)
The algorithm repeatedly thins out these possible assignments by proposing the best ones, and discarding any that are rejected. It does this by re-keying the tuples by each name in turn (abbreviated a and b in the code below, with preferences pa and pb), putting the corresponding preference first, then calling topk with a k value of 1:
    initial.iterate(|active| {
        // key records by a, order by pa.
        let props =
            active.map(|x| (x.a, (x.pa, x.b, x.pb)))
                  .topk(1, |x| x);
        // key records by b, order by pb.
        let accpt =
            props.map(|x| (x.b, (x.pb, x.a, x.pa)))
                 .topk(1, |x| x);
        // discard unaccepted proposals.
        active.except(props.except(accpt))
    })
The process terminates when there are no remaining options for unmatched nodes.
Let's say we run the computation and we see in the output
(frank, mr_scruffles)
Ok, how did that happen? I had lots of people on my list, and put mr_scruffles
at the end as a joke. If we use our explanation infrastructure, we discover as an explanation (trust me on this) my full list of preferences that I went through, and the preferences for mr_scruffles
, which he must have keyed in with his cute, furry paws. What we don't see is a good explanation for why everyone else on my list told me to take a hike.
The problem here is that no one actually told me to take a hike. There is no positive data for rejection here. The computation knows that if it doesn't receive a response, it should cross off the proposal. This is the right thing to do, but it baffles the explanation infrastructure, which thought that if it got all the input data it would produce the right output. Nope. There are some absences that are just as important (specifically, from the accpt
collection).
Although part of the problem here is that we need to explain the absence of records, we know how to do that (at least, the infrastructure already does that). The bigger problem is that we don't know that we need to explain the absence of a record until we run the computation forward and fail to get the right answer. We can't just request an explanation for every absent record though, because, well most of them are absent. Like, infinitely many.
If we just took my preference list and that of mr_scruffles
, the provided explanation, and ran the computation forward, we would get the tuples
(frank, **REDACTED**)
(frank's_leg, mr_scruffles)
because these were the top preferences for frank
and mr_scruffles
, respectively, and the other involved parties had no better options among the inputs we identified. This isn't the right answer, though.
Of course, REDACTED had better proposals (perhaps so did frank's_leg
), and those other proposals are an important part of explaining this unfortunate incident. We just need to automatically discover them, as appropriate.
There is a relatively simple way to fix things, and amazingly it actually works (in differential). We are going to do a few things, which I will enumerate:
- We instantiate a second copy of the computation (here: "stable matching") whose input will be the explanation we identify. We use this second copy to try out our explanation and look for issues. As we fix the issue, we'll update the computation, maybe finding more issues to fix.
- We merge the outputs of these second copies of operators with those of the reference computation, feeding into the explanation infrastructure. If this second copy outputs the same records as in the reference computation, great, we have matching output. If it outputs BS records that don't exist in the reference computation, those records must be explained. They don't exist, so their "explanation" will serve to correct their existence.
- This may happen multiple times, so we run this process to fixed point, adding (never removing) records to the explanation until the BS records are cleaned up (a schematic sketch follows this list).
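Schematically, and only schematically (the real thing is itself a differential dataflow, so it updates incrementally rather than literally re-running), the fixed point looks like this:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Grow the explanation until re-running the computation on it no longer
/// produces records that the reference computation did not produce. The
/// closures stand in for "the computation" and "the explanation
/// infrastructure"; none of this is actual differential dataflow API.
fn correct_explanation<R: Clone + Eq + Hash>(
    mut explanation: HashSet<R>,                   // explanation found so far
    reference_output: &HashSet<R>,                 // what the real computation produced
    rerun: impl Fn(&HashSet<R>) -> HashSet<R>,     // run the computation on a candidate input
    explain: impl Fn(&HashSet<R>) -> HashSet<R>,   // explain records (or their absence)
) -> HashSet<R> {
    loop {
        let trial_output = rerun(&explanation);
        // records the trial produces that the reference did not: their absence
        // must be explained, which pulls more input records into the explanation.
        let bogus: HashSet<R> =
            trial_output.difference(reference_output).cloned().collect();
        if bogus.is_empty() {
            return explanation;  // the explanation now reproduces the output faithfully
        }
        explanation.extend(explain(&bogus));  // records are only ever added, never removed
    }
}
```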
Yeah, it is a bit crazy having all these additional loops around. It works because math. Math says "you're welcome".
All those numbers you saw up above for explaining reachability were using this weird infrastructure. Differential means it has approximately zero overhead when you don't need it (i.e. no iterations required).
In the stable matching examples above, we would get
(frank, **REDACTED**)
(frank's_leg, mr_scruffles)
as tuples that need to be explained. In this case that would mean asking REDACTED to explain accepting (frank, **REDACTED**)
which obviously never happened. This "not happening" is explained by the tuple containing REDACTED's actual choice, so the explanation comes back with that corrected input, which should explain the actual pairing involving REDACTED.
This won't immediately fix the problem, as the output might now be the second person on my list, SERIOUSLY?. Perhaps SERIOUSLY? wasn't interested either, and we'll have to bring in the matched tuple there, too. The process repeats, potentially for many rounds (maybe SERIOUSLY? didn't get his or her first choice either; we may need to explain why not).
With things fixed, let's start looking at some stable matching numbers. You don't see lots of stable matching in papers, partly because as a non-monotonic computation there are a bunch of cheats you can't do. Asynchronous computation doesn't work as well with non-monotonic logic, and it isn't a super clean "think like a vertex" program. Point being, I don't actually know what good numbers should be, but we will see what they are in our case.
The Twitter graph ends up being a bit of a pain here. We put random preferences on each edge, but because some nodes have degrees in the millions, we see some stupidly large explanations. To make our lives easier, we threw away vertices with degree greater than 1,000. That seems like a lot of people to have asked out.
We didn't do this with the Livejournal dataset, so you can check out those numbers in the paper if you are keen. They are qualitatively similar.
We did a query latency experiment where we picked random output matchings from the computation and asked to have them explained.
You can see that a good chunk of the examples require just a few milliseconds. I'm guessing these are folks that didn't have lots of choices, and require no iteration to sort out why they got what they got: limited potential. We do see a tail where some explanations require up to 10s to figure out; they probably have their own reality tv show.
Like with the naïve connected components, much of this comes down to complicated explanations. For reachability, we know that there are short paths and that this should mean small explanations, but we really have no good intuition about what explanations for matchings should look like. Here is a scatterplot of the times versus the sizes of the explanations:
Some of the explanations are big, but not so big; thousands at most. What is going on? Let's do the same scatterplot, but with number of rounds of derivation rather than size of explanation.
The explanations aren't very big, but they do require many rounds to sort out. With random preferences, it is really easy to have a cascade of new work whenever a previously accepted proposal is declined. The next proposal on the list may be the best thing in the world for the recipient, and everything gets shaken up. I anticipate you'd see a lot less of this if there were some background ordering that most of the lists reflected.
In any case, most of the explanations come back with sub-second latency, so you might still call them interactive. They don't screw up the update latency either, so I can maintain a continually updated view of the many reasons I am at home with mr_scruffles
tonight.
The overhead looks a lot like for connected components.
The overheads are some multiples of the elapsed time to compute the matchings. Weirdly, almost all of this is in except
, the part where we remember that things didn't happen. If you thin them down a bit, you recover about 2x at the cost of slightly bigger explanations (you may ask for things that didn't happen, but you probably need them explained anyhow in stable matching, if not other algorithms).
This is totally still on-going research. We have a pile of code we'd like to release. It actually isn't very large, but it is a bit tangled at the moment. There are some macros. Don't look at me like that; you've done this sort of thing before too.
There are in my mind lots of great questions to ask about explaining computations. I think relatively few big data computations don't want explanations in some part of their workflow. They are just too cool and useful. But while we do now have a way to compute and maintain explanations, we have so much to learn about what we really need for explanations.
Are our stable matching explanations actually "over-explanations"; could we thin them down further, improving both performance, but also the value of the explanations?
What are other good examples of computations where explanations would help? Don't say machine learning / deep nets (we tried to warn you!).
Can explanations be used as an information disclosure limitation mechanism? How little do I need to know about a dataset to know why it produced output that affects me?
As always, feel free to ask questions, propose new directions, demand the code be cleaned up faster. I love those things. ;)