The Advent of Code (AoC) is an annual on-line programming event, where each day you get a new bite-sized programming problem. You need to produce the correct answer to a personalized input problem, and if you do you are rewarded with .. a second related problem. But if you do that you get the programming equivalent of an advent calendar window: some ASCII art.
I like the AoC because it stretches my legs a bit with coding, and works the kinks out. It also reminds me of all the different types of programs people like (or hate) to write, and that not all of them are graph processing problems.
I thought this year I would try and write solutions to each of the day's problems in differential dataflow, to exercise a bit and see the pain points that have not yet been explored. At the moment I'm doing them each day, and they have varying levels of goodness of fit with a incremental, distributed, data-parallel programming framework. Nonetheless, I thought it might be helpful to talk through some of the solutions (and perhaps eventually all of the solutions).
Some of the problems really aren't a great fit, so some of these solutions exist just to make a point, rather than to represent great victories in the field of scalable interactive computation. I'll try to make clear which are the most egregious violations.
Also, the descriptions of the problems are linked for each day, and they are rather cute. The explanations there are probably more engaging than the summaries I'll write, which will have a bit less flavor and be more to (what I felt was) the point.
Some quick random access to various days. Also, check out the solution repository.
Day 01: Matching up elements in a sequence.
Day 02: Compute checksums of a matrix of numbers.
Day 03: Iteratively explore the integer lattice.
Day 04: Check for duplicates and anagrams.
Day 09: Parsing sequences of characters.
Day 10: Applying rotations in sequence.
Before starting all of this, I want to go through a bit of boiler-plate that all programs will have in common. This structure will include the appropriate libraries (timely and differential dataflow), and define a main
method that just starts up a timely dataflow computation; on a single machine this will spin up some number of worker threads, who then collaboratively compute the result.
extern crate timely;
extern crate differential_dataflow;
fn main() {
// start timely dataflow execution, described by closure.
timely::execute_from_args(std::env::args(), |worker| {
// sort out who we are.
let index = worker.index();
let peers = worker.peers();
// determine the worker's subset of the input.
// write a dataflow computation:
worker.dataflow(|scope| {
// your code here!
});
}).unwrap();
}
Of course, we will write more code, and may tweak this structure as the days' requirements call for it. But this is roughly how most programs are going to look.
The first day tasks us with processing a sequence of numbers, and finding the sum of all elements that are identical to the next element in the list, including the last element if it is identical to the first element.
Data-parallel computations don't usually deal with sequences of elements, in that they usually treat their data as unordered. However, you can totally hack this in by working with the collection of elements of type (usize, Element)
, where the usize
is the index of the element in the sequence. We will work in this model, where the data are introduced into the computation with their index.
If someone gave you a sequence of digits as text, perhaps as a String
or a &str
or a Vec<u8>
or however you like, it isn't too hard to swing through the elements and extract a sequence of integers.
// example input from day 01.
let input = b"91212129";
let characters =
input
.iter()
.map(|digit| digit - b'0')
.enumerate();
The value characters
is an iterator that produces elements of type (usize, u8)
indicating a position and value of the element at that position.
Before doing any computation, we will want to partition responsibility for the input among the workers; we don't want each of them introducing the input, as then we will have as many copies as we have workers, which is no good (all of our sums will be increased by that factor). As we have index
and peers
at hand from our boilerplate, indicating each worker's identity and number of colleagues, we can restrict the input to each worker by having them round-robin through the sequence:
let worker_input =
characters
.filter(|&(pos, digit)| pos % peers == index);
This works even if the sequence isn't dense, for example if there are gaps for some unfortunate reason. It is a pretty handy pattern to use.
To get this data into our dataflow, we'll use the new_collection_from()
operator, defined on dataflow scopes, inside the dataflow
closure:
worker.dataflow::<(),_,_>(|scope| {
// define a new collection from `worker_input`.
let digits = scope.new_collection_from(worker_input).1;
You may notice we have a .1
at the end, which is how you get at the second element of a pair. The first element of the pair is a handy helper that would let us interactively drive the computation around, but we aren't going to use that yet.
Now, let's say we had a collection of (pos, digit)
pairs and we want to find those that match (pos+1, digit)
; how might we do this? One relatively easy way is to transform each (pos, digit)
into (pos+1, digit)
and then intersect with the original list: we'll find matches exactly where digit
occurs at both pos
and pos+1
.
digits
.map(|(pos, digit)| (((pos + 1) % length, digit), 1))
.semijoin(&digits);
This fragment takes digits
and adds one to its position (taken modulo length
to wrap around), and then takes a semijoin
with digits, which retains only those ((pos+1)%length, digit)
that match something in digits
. You might also notice that we've tacked a 1
on as data, and that is there because this will be the answer to part one.
Part two asks for the same type of computation, but instead of the next element in the sequence we want to match the element half-way around the sequence. That sounds pretty far out, but it's basically the same thing where instead of +1
we use +length/2
.
digits
.map(|(pos, digit)| (((pos + length/2) % length, digit), 2))
.semijoin(&digits);
See how I put a 2
in there? Because this is the answer to part two. In fact, we can roll these two together into one computation, doing both parts one and two at the same time, which looks like
let part1 = digits.map(move |(pos, digit)| (((pos + 1) % length, digit), 1));
let part2 = digits.map(move |(pos, digit)| (((pos + length/2) % length, digit), 2));
part1
.concat(&part2) // merge collections.
.semijoin(&digits) // restrict to matches.
.explode(|((_, digit), part)| Some((part, digit as isize))) // `part` with weight `digit`.
.consolidate() // consolidate weights by `part`.
.inspect(|elt| println!("part {} sum: {:?}", elt.0, elt.2)); // check out answers.
That's the whole solution to day 01!
There are a few extra operators you may not have seen before: explode
, consolidate
, and inspect
. The explode
operator (taking applications for new names) replaces each element with a list of (element, count)
entries, which we use to produce digit
many copies of part
for each match we find. The consolidate
operator adds up all occurrences of the same element, and at this point in the computation we have just 1
and 2
, the only data we produced in explode
; this operator is what sums up the contributions. Finally, inspect
shows us each of the changes flowing by, which should just be the cumulative increases for 1
and 2
, which should be their sums!
There are of course several other ways you could have done this; for example, you could use the group
operator to group each of the digits by part, and take their sum in the reducer logic. This has some down-sides, and perhaps we will get around to discussing them later on!
Day two has us take a grid of numbers and compute a "checksum" in two different ways.
I've represented the input as a collection of triples (val, row, col)
, much like you might represent a sparse matrix. As it turns out, each of the checksums is a per-row aggregation (done differently for the two different problems of the day), followed by an aggregation over rows. So, we just discard the column identifier and group by row, producing the outputs 1
and 2
with some weights we will need to produce the answer:
entries
.map(|(val, row, _col)| (row, val))
.group(|_row, vals, output| {
// Part one solution (difference of max and min within rows)
let mut min = vals[0].0;
let mut max = vals[0].0;
for &(val, _) in vals.iter() {
if min > val { min = val; }
if max < val { max = val; }
}
output.push((1, (max - min)));
// Part two solution (ratio of two even divisors in the row)
for &(val1, _) in vals.iter() {
for &(val2, _) in vals.iter() {
if (val1 % val2) == 0 && val1 != val2 {
output.push((2, val1 / val2));
}
}
}
})
.map(|(_, part)| part)
.consolidate()
.inspect(|elt| println!("part {} checksum: {:?}", elt.0, elt.2));
Among the values in each row, the first part of the problem wants the difference between the maximum and minimum values, and the second part wants the ratio of the (guaranteed) two elements one of which cleanly divides the other. The two lines
output.push((1, (max - min)));
output.push((2, val1 / val2));
each produce their corresponding output value with the weight we want to accumulate. We then ditch the row identifier keeping only the part identifier, and accumulate and print the results.
Again we use consolidate
, rather than an operator like group
. The consolidate
operator has the special property that it only accumulates the counts for records, and so supports a specialized implementation: whenever it gets a collection of input differences, it just accumulates those differences and reports them. This contrasts with group
which could perform arbitrary logic, but with greater power come greater complexity of implementation: group
must re-assemble its full input collection whenever something changes.
The short version of this is that consolidate
keeps very little state and responds quickly to changes, whereas group
maintains its accumulated inputs and must reform its inputs on input changes to determine output changes.
Day 03 is a bit of a chaotic mess. The intent is that you start on the integer grid at (0, 0), and start spiraling outwards (right first, then up and around counter-clockwise). As you go, you determine a value for each new location as a function of the values of the surrounding cell (those whose values have been set). For the first part, your function is max + 1
, essentially counting up by one. For the second part, your function is the sum of the set values.
I've implemented this as an iterative process, where each cell advertises its value to each of its eight neighbors, but these advertisements result in a set value only when they contain an advertisement for the predecessor cell in the sequence. This is a bit like how you might write Conway's Game of Life in differential dataflow, another fun exercise. As it turns out, the sequencing means there isn't much parallelism here, but it works out (if slowly).
Rather than force the code on you, I'm just going to link it at you. It seems to run correctly, but I'm not sure I would recommend it for this specific task.
This was the first (and only) day that I finished quickly enough that the submission site told me what number I was for each part.
The problems of the day are: given a list of lines of text, how many lines have no duplicate words (part 1) and how many lines have no words that are anagrams of other words (part 2). These are classic interview questions, where you can solve part 1 by sorting the list of words and deduplicating, then checking the length. You can then solve part 2 by first sorting the characters of each of the words (so that anagrams are now the same word), then sorting and deduplicating the text.
phrases
.filter(|phrase| {
let mut words = phrase.split_whitespace().map(|x| x.as_bytes().to_owned()).collect::<Vec<_>>();
let len = words.len();
// comment for part 1, uncomment for part 2.
// for word in words.iter_mut() { word.sort(); }
words.sort();
words.dedup();
words.len() == len
})
.map(|_| 1)
.consolidate()
.inspect(|elt| println!("part {} checksum: {:?}", elt.0, elt.2));
It looks like I was computing part 1 in this bit of code, and then uncommented things to compute the answer to part two. Also, the Rust people would be very cross at me for turning each &str
into an array slice of u8
bytes; this answer is not UTF-8 respecting, and Rust gives you lots of good tools to work with that. Perhaps x.as_bytes().to_owned()
should be x.chars().collect::<Vec<_>>()
.
The ninth day of problems relate to parsing sequences of characters. You have a string formed out of {
, }
, <
, >
, and !
characters, plus a bunch of other stuff, and you need to follow some rules to properly parse out the nested {, }
structure (roughly: !
says "ignore the next character" and <
and >
begin and end non-nesting commented text).
This is a problem that initially looks hopelessly sequential, and I was half-tempted to just ignore this day and go back to bed. While in bed, I one of the first differential dataflow programs not written by me: Derek Murray (@mrry) wrote a parallel prefix sum implementation, which even if you ignore the differential dataflow stuff is a pretty neat thing to learn about.
Let's start with an explanation of parallel prefix sum. Your computer has inside it lots of little things called adders, each of which takes two numbers in binary and produces their sum, also in binary. These are typically made out of one-bit adders, a little teensy circuit with two input data bits and one output data bit. For them to be useful, though, they need an additional "carry" input and output bit, which indicates whether the bits below them overflowed and whether this addition will also overflow. You could go and write the eight-element truth table for the two output bits to get a sense for what it looks like.
A one-bit adder with carries is pretty smart, but if you try and make a 64-bit adder out of one of these you'll probably end up with a sequence of 64 of them, right? The carry-out for the least bit adder is wired to the carry-in for the second-least bit adder, which leads to the third-least bit adder, and so on. This is bad from a hardware point of view, because it means that a signal may need to travel along a relatively long path in order to produce the correct answer, and that means that the hardware clock will need to mellow while this happens, which means slowness.
Rather than put all the one-bit adders in sequence, we can do something smarter. We can make a 64-bit adder out of two 32-bit adders where doesn't take a carry-in bit and produce a carry-out bit, but rather it produces both sets of answers (a 32-bit sum and a carry-out) for each possible carry-in bit. It can do all of that work without waiting for the carry-in bit, and then when the bit arrives we just select which of the two 32-bit answers (and carry out) we want to let through. If we do this hierarchically, with 16-bit adders, and 8-bit adders, and so on, we end up with a logarithmic depth circuit and lots of independent parallel work.
This can be generalized to arbitrary sequences of associative operations. If we have an associative operator +
, and for a sequence xi
of inputs we want to compute the sum yi
of all the element up to xi
, for each xi
, we could do this in sequence:`
y0 = x0
y1 = y0 + x1
y2 = y1 + x2
..
yi = y(i-1) + xi
We could also do it hierarchically, by computing the sums for ranges of xi
that start at single elements (the xi
themselves), and grow by factors of two. For example, we could index y
by two numbers: the depth of the hierarchy (initially 0
) and the first element in its range.
y(0,0) = x0
y(0,1) = x1
y(0,2) = x2
..
y(0,i) = xi
The next level would add up y
terms from
y(1,0) = y(0,0) + y(0,1)
y(1,2) = y(0,2) + y(0,3)
..
And so on, until we have computed y(64,0)
.
Having done all of this, we can now reconstruct the sum at any position using at most a logarithmic number of terms. You can also reconstruct the sum at many positions using at most a linear number of terms, sharing much of the work at the higher levels of the hierarchy.
Importantly, the only thing you need to know about +
is that it is associative, a nice mathematical property that many operators have, and that by knowing about you can start to bust out more efficient algorithms like parallel prefix sum.
Let's get back to the parsing problem at hand: sequences of {
, }
, <
, >
, and !
characters. The rules are
- A
!
character causes the next character to be ignored (even if it is a!
too). - A
<
character commences a "garbage" region (like a comment) that stops at the first>
. - A
{
character increases the nesting depth by one, and}
decreases it by one.
There are several potential gotchas here, like that multiple <
symbols still only require one >
to close them, and that !
prevents lot of otherwise simple local parsing. We can't just look at a >
or a }
and see whether there is a !
before it, because there could be another !
before it, canceling it, or another !
before all of them, and it is just a mess.
What we could do is try and cast this as a parallel prefix sum problem. The way I've done this (there are others) is to start with the problem of determining whether characters should be ignored, and whether they are garbage (rules 1 and 2 above). We will deal with the nesting depth later.
I assert that there are essentially four states that the parser has with respect to ignores and garbage, and they are a cross product of "ignore the next character or not" and "in a garbage scope or not". I'm going to encode these as
0: don't ignore the next character, not garbage either.
1: don't ignore the next character, but in garbage scope.
2: ignore the next character, but not in garbage scope.
3: ignore the next character, and in garbage scope.
Each of the characters represent a transition from each of the four possible input states to the four possible output states. I'm going to represent them as a [u8; 4]
, an array of four bytes where byte i contains the next state (out of four) if we started in state i.
Here are the rules as I think they should be:
let transitions =
input
.map(|(pos, character)|
(pos, match character {
'<' => [1, 1, 0, 1], // start garbage if not ignored.
'>' => [0, 0, 0, 1], // end garbage if not ignored.
'!' => [2, 3, 0, 1], // toggle ignore bit; don't change garbage.
_ => [0, 1, 0, 1], // consume ignore bit, don't change garbage.
})
);
The rules are roughly that !
toggles the ignore "bit", either asking the next character to be ignored or consuming the ignore bit if it should be ignored, and the <
and >
characters start and end garbage respectively, if they are not ignored. All other characters leave the garbage bit unchanged but consume the ignore bit if it is set. As a sanity check, the third and fourth columns are identical, which seems sane if they represent what should happen when the character should be ignored (perhaps we just need a [u8;2]
?).
Why generate all of these transitions? Because what we would like most in the world is the cumulative application of these transition functions for each prefix of the sequence of characters. Right? If we had this for each pos
we could read out the first entry and it would reveal the parser state as we hit (pos, character)
. In principle parallel prefix sum should apply, as long as we can write down a +
operator on the state transitions.
|t1, t2| [t2[t1[0]], t2[t1[1]], t2[t1[2]], t2[t1[3]]]
Well, ok. This is not at all clear. This is an anonymous function that takes two arguments, t1
and t2
, and .. produces a new state transition function that first steps according to t1
and then according to t2
. It's a bit gross, but that is what it does. Here is also where we learn that a [u8;2]
would not be sufficient to summarize the action of multiple characters; the last two elements do not need to be 0
and 1
as they are for single characters.
At this point, if we had a handy parallel_prefix
function we could call it and away we go. Kinda like this:
// determine the transitions for intervals of positions, then the state starting from zero.
let ranges = pp_aggregate(transitions, |t1, t2| [t2[t1[0]], t2[t1[1]], t2[t1[2]], t2[t1[3]]]);
let values = pp_broadcast(ranges, 0, [0, 1, 2, 3], |state, trans| trans[*state]);
That's the actual code from the project, but it doesn't tell us what pp_aggregate
and pp_broadcast
do. If you believe that they work, great! You are a kind and trusting soul. Independently, let's look at what they are.
The first method, pp_aggregate
builds up the summaries of the geometrically increasing ranges. We are going to describe each range by a pair (pos, log)
which corresponds to the interval [pos << log, (pos + 1) << log]
. Roughly, log
contains the level of hierarchy (or log of its size) and pos
contains the starting index shifted right by log
.
If we start with an arbitrary collection
of pairs (usize, D)
for some data type D
, and a combine
function which maps pairs of D
to a new D
, the following hunk of code produces the collection of ((usize, usize), D)
entries whhich correspond to the aggregated regions, keyed by that (pos, log)
representation up above.
/// Accumulate data in `collection` into all powers-of-two intervals containing them.
fn pp_aggregate<G, D, F>(collection: Collection<G, (usize, D)>, combine: F) -> Collection<G, ((usize, usize), D)>
where
G: Scope,
G::Timestamp: Lattice,
D: Data,
F: Fn(D, &D) -> D + 'static,
{
// initial ranges are at each index, and with width 2^0.
let unit_ranges = collection.map(|(index, data)| ((index, 0), data));
unit_ranges
.iterate(|ranges|
// Each available range, of size less than usize::max_value(), advertises itself as the range
// twice as large, aligned to integer multiples of its size. Each range, which may contain at
// most two elements, then summarizes itself using the `combine` function. Finally, we re-add
// the initial `unit_ranges` intervals, so that the set of ranges grows monotonically.
ranges
.filter(|&((_, log), _)| log < 64)
.map(|((pos, log), data)| ((pos >> 1, log + 1), (pos, data)))
.group(move |_, input, output| {
let mut result = (input[0].0).1.clone();
if input.len() > 1 { result = combine(result, &(input[1].0).1); }
output.push((result, 1));
})
.concat(&unit_ranges.enter(&ranges.scope()))
)
}
The comments in the code are meant to explain, but informally: iteratively, each range coarsens itself by a factor of two, leaving a bit of data (pos
) to indicate its order among others in the same coarsened range. We group coarsened regions together and summarize the action by taking the first transition function and combining it with a second if it exists. We add in the original unit ranges because we want this to converge to a fixed point, and they would otherwise be lost as they do not result from any coarsened regions.
This all basically works. At least, it produces the correct answers for me, though there might be other bugs along the way.
The second half of the computation is taking all of these aggregates and re-flowing them down the tree we implicitly formed by aggregation. This produces the prefix sums at each location, rather than just at the end. This is conceptually similar, but with a few sticky details having to do with ranges that might have been empty. For whatever reason, I wrote this more like a fold
operator that starts from some seed: B
of a type possibly distinct from D
.
/// Produces the accumulated values at each of the `usize` locations in `aggregates` (and others).
fn pp_broadcast<G, D, B, F>(
ranges: Collection<G, ((usize, usize), D)>,
seed: B,
zero: D,
combine: F) -> Collection<G, (usize, B)>
where
G: Scope,
G::Timestamp: Lattice+Ord+::std::fmt::Debug,
D: Data,
B: Data+::std::hash::Hash,
F: Fn(&B, &D) -> B + 'static,
{
// Each parent range proposes an empty first child, to provide for its second child if it has no sibling.
let zero_ranges =
ranges
.filter(|&((_, log),_)| log > 0)
.map(move |((pos, log),_)| ((pos << 1, log - 1), zero.clone()))
.antijoin(&ranges.map(|((pos, log),_)| (pos, log)));
let aggregates = ranges.concat(&zero_ranges);
let init_state =
Some(((0, seed), Default::default(), 1))
.to_stream(&mut aggregates.scope())
.as_collection();
init_state
.iterate(|state| {
aggregates
.filter(|&((_, log),_)| log < 64) // the log = 64 interval doesn't help us here (overflows).
.enter(&state.scope())
.map(|((pos, log), data)| (pos << log, (log, data)))
.join_map(state, move |&pos, &(log, ref data), state| (pos + (1 << log), combine(state, data)))
.concat(&init_state.enter(&state.scope()))
.distinct()
})
.consolidate()
}
After defining some empty zero_ranges
that link up the state at the beginning of a parent range to the beginning of its second child, if there was no first child to do so, we whip up some initial_state
containing the supplied seed
. I think is technically sketchy, as each worker will introduce this seed making its multiplicity high. We eventually hit things with distinct
, but it could be cleaned up.
The main hunk of logic is the iterate
, which repeatedly develops a collection of (usize, B)
values describing the state at each position. We repeatedly take the state at known positions and apply existing summaries (including the zero_ranges
summaries of empty ranges) to produce new values at the other end of the ranges. This is going on mostly in the line
.join_map(state, move |&pos, &(log, ref data), state| (pos + (1 << log), combine(state, data)))
which matches the start point of ranges (created by pos << log
in the preceding line) with the known state, and applies the combine
function to produce the output a bit further along.
That diversion aside (it's totally awesome, by the way; go read the code again), let's get back to the problem we need to solve. It isn't that many more lines of code.
We have been asked to sum up the total depths of all {, }
pairs, which we can do by identifying the non-ignored, non-garbage }
symbols and extracting their depths. How do we get their depths? It's just another parallel prefix computation, it turns out (or, it could have been the same one, if we folded it in). Recall that we produced values
containing the state (ignore, garbage) heading in to each position:
// line up (position, symbol, state).
let symbols_state = input.join(&values);
// restrict the positions down to those that are neither '!' nor themselves cancelled.
let active = symbols_state.filter(|&(_, symbol, state)| symbol != '!' && (state == 0 || state == 1));
// part 1: accumulate for each '}' its depth.
let parens = active.filter(|&(_, symbol, state)| state == 0 && (symbol == '{' || symbol == '}'));
let depths = parens.map(|(pos, symbol, _)| (pos, if symbol == '{' { 1 } else { -1 }));
let ranges = pp_aggregate(depths, |sum1, sum2| sum1 + sum2);
let values = pp_broadcast(ranges, 0, 0, |sum1, sum2| sum1 + sum2);
parens
.filter(|&(_pos, symbol, _)| symbol == '}')
.map(|(pos, symbol, _)| (pos, symbol))
.join(&values)
.explode(|(_pos, _sym, sum)| Some(((), sum)))
.consolidate()
.inspect(|x| println!("part1: {:?}", x.2));
Let me call out the clever lines:
let depths = parens.map(|(pos, symbol, _)| (pos, if symbol == '{' { 1 } else { -1 }));
let ranges = pp_aggregate(depths, |sum1, sum2| sum1 + sum2);
let values = pp_broadcast(ranges, 0, 0, |sum1, sum2| sum1 + sum2);
Here we turn each {
into a +1
and each }
into a -1
, and then ask for the prefix sum. We go on to add up the sums at each }
, which gives us the depths.
One thing I think is really cool is that the collection of (pos, '{')
and (pos, '}')
is a potentially sparse set. We don't have entries at each pos
, and because of the magic of mathematics or programs or something, we don't have to. We get the prefix sum for an arbitrary set of indexed elements, whether they form a dense sequence or not. It would have been a huge pain to have to take the {
and }
symbols and determine their new dense indices.
Actually, it would just be another parallel prefix sum, right? We are just counting up the number of elements in the total sequence that match their description, and want to know the prefix sum at each position. The (pos, data)
representation for sparse sequences is super useful, and parallel prefix sum plays a big role in that. Languages like NESL rocked this back in the 80s, but it doesn't seem so popular with the modern word-counting crowd.
The AoC input for day 09 is about 20k characters, and even with this many you do not need parallelism to parse things efficiently. The code I've written takes about 350ms to compute the correct answer (300ms with two threads), whereas lean native Rust should probably take tens of microseconds instead. Perhaps with a larger volume of text, imagine your least favorite codebase, you could imagine wanting some parallelism, but most of these sorts of projects are naturally hierarchical using separate files and folders.
Here is a different use case: imagine you are editing a few megabytes of text formatted according to these rules, where ignored characters are red, and garbage is blue, normal characters are just black. What happens as you edit text? You typeity-type and cause some characters to go away and other characters to appear, and what; do we re-run the parser with each keystroke?
Fortunately, differential dataflow is magically incremental. All of it. The parallel-prefix computations we did, totally incrementalized.
Differential dataflow will propagate all of the changes in each of the collections we've defined, and that could mean lots and lots of changes, which wouldn't go very fast at all. Imagine editing a bunch of text and adding your first <
, causing all of the rest of the text to turn blue; maybe you don't like that and delete the <
turning it back from blue. Lots of changes happen, and we can't fix that for you (yet). But, because of the formatting rules, >
closes a garbage region no matter where it started; changes you make locally may not cascade throughout the file.
Let's try. I'm going to start up the computation, and then start making some single character edits, replacing 20 locations with !
and then putting the original character back. I used locations 1000, 2000, .. up to 20,000, because I have run out of good ideas by this point. I'm also just computing the (ignore, garbage) state, rather than the parens depth nesting.
Duration { secs: 0, nanos: 144267517 } parsed seen: 20696
Duration { secs: 0, nanos: 145662500 } changed ('{') seen: 14
Duration { secs: 0, nanos: 147368288 } changed (',') seen: 18
Duration { secs: 0, nanos: 148992387 } changed ('!') seen: 4
Duration { secs: 0, nanos: 149751358 } changed ('{') seen: 4
Duration { secs: 0, nanos: 150866971 } changed ('>') seen: 6
Duration { secs: 0, nanos: 151639222 } changed ('!') seen: 2
Duration { secs: 0, nanos: 152455813 } changed ('e') seen: 4
Duration { secs: 0, nanos: 153391716 } changed ('}') seen: 8
Duration { secs: 0, nanos: 154497041 } changed ('{') seen: 8
Duration { secs: 0, nanos: 155412734 } changed ('!') seen: 4
Duration { secs: 0, nanos: 155885640 } changed ('!')
Duration { secs: 0, nanos: 158067036 } changed ('i') seen: 38
Duration { secs: 0, nanos: 160065928 } changed ('>') seen: 40
Duration { secs: 0, nanos: 161284636 } changed (',') seen: 6
Duration { secs: 0, nanos: 162042381 } changed (',') seen: 8
Duration { secs: 0, nanos: 162709188 } changed ('!') seen: 4
Duration { secs: 0, nanos: 164150206 } changed ('<') seen: 16
Duration { secs: 0, nanos: 165755020 } changed ('\"') seen: 20
Duration { secs: 0, nanos: 166512474 } changed ('!') seen: 4
Each line reports the elapsed time (post dataflow construction, but pre-computation), what happened, and how many changes were observed. You can see a few things here:
-
It takes about 144ms to produce the (ignore, garbage) state for each location, starting from a string of 20k characters. That goes down to 100ms with two threads (yay), but you can't see that here.
-
Once we start doing single character changes, the per-update time drops. It takes about 20ms to do 20 changes, or roughly one millisecond per change on average.
-
The smallest change happens when we do two
!
characters in a row, representing zero new changes made and zero old changes reverted. In that case, where the input doesn't change and we just need to verify as much, it takes about 400 microseconds (oof). If we are eating 400us to do nothing, there is probably room to shave off about 400us from each of these numbers (which would be significant!).
Generally, though, the incremental updates here are happening with lower latency than your keyboard can register keys pressed, and higher throughput than you can type.
This is a silly hash computation thing, so I just wrote the vanilla Rust to do it.
fn knot_step<I: Iterator<Item=u8>+Clone>(iter: I, rounds: usize) -> Vec<u8> {
let mut state = (0 .. 256u16).map(|x| x as u8).collect::<Vec<_>>();
let mut cursor = 0;
let mut step = 0;
for _ in 0 .. rounds {
for width in iter.clone() {
let width = width as usize;
for swap in 0 .. width / 2 {
state.swap((cursor + swap) % 256, (cursor + width - swap - 1) % 256);
}
cursor = (cursor + width + step) % 256;
step += 1;
}
}
state
}
Not everything needs to be dataflow.
In principle you could do this with parallel prefix sum, as each range of width
values from the iterator can be described by the permutation of 0 .. 256
they induce, using 256 bytes, but the test sequence is only about 40 characters long.
Maybe this can be homework!