diff --git a/assets/tries/bfs.png b/assets/tries/bfs.png
new file mode 100644
index 0000000..611eb5a
Binary files /dev/null and b/assets/tries/bfs.png differ
diff --git a/assets/tries/scc.png b/assets/tries/scc.png
new file mode 100644
index 0000000..bcb6c90
Binary files /dev/null and b/assets/tries/scc.png differ
diff --git a/posts/2017-02-21.md b/posts/2017-02-21.md
new file mode 100644
index 0000000..b0298ac
--- /dev/null
+++ b/posts/2017-02-21.md
@@ -0,0 +1,728 @@
+
+## Modular data organization
+
+This post is about organizing data in differential dataflow. It is also about organizing organizing data, in that I got tired of writing so many different ways of organizing data and put together a modular framework for building different types of data layout.
+
+These could be useful to you, or (for the more cynical) you could point out the obvious blunders I've made. Something good must come of it.
+
+I'm optimistic!
+
+### 0. Some motivation
+
+Hey, is differential dataflow fast or slow? Let's profile it.
+
+Strongly connected components is one computation I run a lot, because it is cool and complicated, not because anyone ever asked for continually updated strongly connected components. Not once.
+
+But if you fire up a 10M node, 20M edge computation, you get a profile that looks a bit like this:
+
+![scc](https://github.com/frankmcsherry/blog/blob/master/assets/tries/scc.png)
+
+What it is telling us, because seriously what the heck, is that we are spending a lot of time (20%) futzing around with memory management (`memmove`) and lots of time in methods called things like `done` and `extend` and `consolidate`. These are all methods related to building up collections, which makes sense because the `scc` computation produces a bunch of intermediate state (it is a really cool dataflow) but doesn't do much with it (one iteration is often enough).
+
+But, building up collections is part of what is holding `scc` back.
+
+What about something else? Let's look at a continually updated breadth-first search computation. This is another reference computation; I'm going to do 1M nodes and 10M edges, and continually apply 1,000 edge updates to the computation.
+
+![bfs](https://github.com/frankmcsherry/blog/blob/master/assets/tries/bfs.png)
+
+This time around we see that we spend lots of time in `advance`, which we might not have met yet but will meet soon. There is also some `seek_key` and `tidy_vals`, both related to navigating around traces. This makes sense for bfs because there are relatively few updates as the edges change, but we do need to run around and check that things have stayed the same.
+
+So, seeking around in collections looking for keyed data is part of what is holding bfs back.
+
+Not shown: a screenshot of `Instruments.app` beachballing, because apparently it locked up OS X's screenshot-taking mechanism too.
+
+### 1. The problem
+
+Recall that differential dataflow tracks lots of four-tuples,
+
+    (key, val, time, diff)
+
+where `key` and `val` are probably ordered, `time` is kind of a mess, and `diff` is some sort of integer (I've put on my grown-up pants and switched it from an `i32` to an `isize`, because big data).
+
+All of this stuff also applies more broadly if your scenario has more interesting relations with more dimensions (or fewer, as we will get to).
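+
+In code, the raw material looks something like the following sketch. These are illustrative stand-ins rather than the real types; in particular the real `time` type is rather more structured than a type alias suggests.
+
+```rust
+/// Illustrative stand-ins for the four-tuple above; not the actual types.
+type Diff = isize;
+type Update<K, V, T> = (K, V, T, Diff);
+
+/// Batches of these arrive over time, and we want to index them for later
+/// exploration by key, then by value, with access to the (time, diff) pairs.
+type InputBatch<K, V, T> = Vec<Update<K, V, T>>;
+```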
+
+Without worrying about the details, I'm going to claim that we want to navigate the data in this order too: we want to sniff around by `key`, within each key we want to move around the `val`s, and finally we want to crank through a bunch of `(time, diff)` pairs for each value. This may change as the underlying differential dataflow algorithms change, but let's take it as a given for now.
+
+Our tuples arrive in batches of varying sizes. At the beginning of the computation we might get a billion edges (hi `twitter_rv`!), and as the computation proceeds these batches might wobble around and be as small as single records. In any case, we want to append these tuples to our collection, so that when we sniff around in the future, we see them.
+
+There is some [magic about compacting differences for equivalent times](https://github.com/frankmcsherry/blog/blob/master/posts/2017-02-11.md#compacting-logical-times) which you can read about, but for now let's ignore this.
+
+So, we repeatedly accept batches of tuples, and need to support interactive exploration by keys and vals, and .. not really by time, but we do need access to it.
+
+### 2. A bespoke implementation
+
+We'll start with the custom implementation, before we get fancy and generalize it. This should call out the intended structure of the implementation, and then we'll talk about swapping in clever bits more modularly. It is [the same implementation](https://github.com/frankmcsherry/blog/blob/master/posts/2017-02-11.md#new-data-structures) as mentioned in [the recent status report post](https://github.com/frankmcsherry/blog/blob/master/posts/2017-02-11.md), but let's review it anyhow.
+
+#### Step 1: An immutable collection
+
+Our first building block is going to be an immutable collection. If we are presented with a batch of data and are told "just this", how might we represent it?
+
+There are a few choices, but I happened to choose the following:
+
+```rust
+struct Batch<K, V, T> {
+    keys: Vec<(K, usize)>,
+    vals: Vec<(V, usize)>,
+    times: Vec<(T, isize)>,
+}
+```
+
+The idea here is that `keys` is ordered by the `K` field, and has as its second field an offset into `vals` where the values associated with that key *end*. The values start, of course, at the offset where the previous key's values end (or zero, if this is the first key). Similarly, each interval of `vals` is sorted by `V` and has an offset into `times` bounding the associated range of `(T, isize)` entries.
+
+This works. It has some performance issues that we will get to, but it is not so bad to navigate through.
+
+You could walk around in these fields on your own, but I thought it would be good to put together a trait that allowed us to hide the details behind a simple `Cursor` interface:
+
+```rust
+pub trait Cursor<K, V, T> {
+
+    // move the cursor around..
+    fn step_key(&mut self);
+    fn seek_key(&mut self, key: K);
+    fn step_val(&mut self);
+    fn seek_val(&mut self, val: V);
+
+    // see what the cursor points at..
+    fn valid_key(&self) -> bool;
+    fn valid_val(&self) -> bool;
+    fn key(&self) -> &K;
+    fn val(&self) -> &V;
+
+    // do stuff with the associated (time, diff) data.
+    fn map_times<F: FnMut(&T, isize)>(&self, func: F);
+}
+```
+
+Actually the [real trait](https://github.com/frankmcsherry/differential-dataflow/blob/fed8680daf7c9b98df64fbd771976389afbe3eb7/src/trace/cursor/mod.rs#L12-L56) is a bit more complicated (rewinding, peeking, etc).
+
+Nonetheless, the trait above gives us some direction about what we'll need to be able to do with our data organization. Walk around, jump around, look at things.
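+
+To make the intended use a little more concrete, here is a sketch of how operator code might consume this interface. The helper itself is hypothetical, not something from the differential dataflow codebase, and it is written against the simplified trait above rather than the real one.
+
+```rust
+/// Hypothetical consumer: the net count for one specific (key, val) pair.
+fn net_count<K: Ord + Clone, V: Ord + Clone, T, C: Cursor<K, V, T>>(
+    cursor: &mut C,
+    key: &K,
+    val: &V,
+) -> isize {
+    let mut sum = 0;
+    cursor.seek_key(key.clone());
+    if cursor.valid_key() && cursor.key() == key {
+        cursor.seek_val(val.clone());
+        if cursor.valid_val() && cursor.val() == val {
+            cursor.map_times(|_time, diff| sum += diff);
+        }
+    }
+    sum
+}
+```
+
+Since the `seek_*` methods only move forward, a consumer with many keys to look up would sort them and re-use one cursor, which is exactly the pattern the measurements later in this post lean on.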
+
+Our ordered lists allow us to use binary search for keys and values, but there is something a bit more clever we can do too. Exponential or "galloping" search starts from an existing cursor and moves forward in exponentially increasing steps until it would pass the value, then takes exponentially decreasing steps until it gets there. This is the version I use:
+
+```rust
+/// Reports the number of elements to pass until `function` becomes false.
+pub fn advance<T, F: Fn(&T)->bool>(slice: &[T], function: F) -> usize {
+
+    // start with no advance
+    let mut index = 0;
+    if index < slice.len() && function(&slice[index]) {
+
+        // advance in exponentially growing steps.
+        let mut step = 1;
+        while index + step < slice.len() && function(&slice[index + step]) {
+            index += step;
+            step = step << 1;
+        }
+
+        // advance in exponentially shrinking steps.
+        step = step >> 1;
+        while step > 0 {
+            if index + step < slice.len() && function(&slice[index + step]) {
+                index += step;
+            }
+            step = step >> 1;
+        }
+
+        index += 1;
+    }
+
+    index
+}
+```
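+
+To pin down the contract, here is a quick check of what `advance` reports, using the function above: it counts the leading elements for which the predicate holds, which is also the index of the first element for which it does not.
+
+```rust
+fn main() {
+    let keys = [1, 3, 3, 5, 7, 9];
+    // index of the first element that is not less than 5
+    assert_eq!(advance(&keys, |k| k < &5), 3);
+    // if every element passes the test, we get the length of the slice
+    assert_eq!(advance(&keys, |k| k < &100), keys.len());
+}
+```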
+
+All this jumping around is fun, but it isn't cheap. Stomping around in large amounts of memory can be pretty painful. One thing that improves life is breaking the `keys` vector into two vectors:
+
+```rust
+    keys: Vec<K>,
+    offs: Vec<usize>,
+```
+
+so that we can jump around in `keys` and then look things up in `offs`. Saving the 8 bytes of the `usize` entry helps a lot for small keys, as we jump around in a relatively smaller amount of memory while looking for the key.
+
+This is most of what I wanted to say about the immutable collection. We are going to use it, but there isn't all that much to this implementation so far.
+
+#### Step 2: An append-only collection
+
+We have an approach for immutable collections, but for some reason we don't get all the data at once. We keep getting more and more immutable collections handed to us, which probably doesn't seem like the immutability we were promised.
+
+But, there are standard ways of dealing with this. The one I thought was most natural was just to keep the immutable collections as they come in,
+
+```rust
+struct Spine<K, V, T> {
+    batches: Vec<Rc<Batch<K, V, T>>>,
+}
+```
+
+---
+
+**Side notes**: yes, we will merge the immutable collections and create new immutable collections. Also, the `Rc` there is a reference-counting smart pointer, which we use so we can share the batches without taking ownership from others.
+
+---
+
+We also need a newer and smarter cursor implementation which would keep its fingers in each of the immutable collections, perhaps by using their cursors.
+
+What is a good way of merging several cursors? I had [a pretty horrible way before](https://github.com/frankmcsherry/trie/blob/master/src/merge.rs), but I came up with something that I find pretty tasteful. It has probably been invented before, so if you know the name, holler!
+
+The rough idea is that we'll keep all of our cursors in a list, which is a smart place to keep them, but we keep them sorted first by each cursor's current key, and then among the cursors with the smallest key we keep them sorted by each cursor's current value. Let's imagine we also write down how many cursors currently have the same key and, within them, the same value, so we don't have to constantly re-evaluate this.
+
+```rust
+struct CursorMerge<C> {
+    cursors: Vec<C>,
+    key_prefix: usize,
+    val_prefix: usize,
+}
+```
+
+The current key and value can be read out of the first element in the list. To navigate around, we need to ask some of the cursors to navigate, but we only need to work with the prefix of the cursors whose keys or values, appropriately, are smaller than the target. (Note: this only works as we seek forward, which is going to be how we look for keys.)
+
+#### Step 3: Merging and Compaction
+
+I said I wasn't going to say much about these, and I'll try not to. When we need to, we merge immutable collections into new merged collections. The results are (i) one collection and cursor, and (ii) less space used. As we do this, we can apply `advance_by` logic to each of the `time` fields, which gives us the potential to consolidate `diff` fields and occasionally cancel tuples out of the collection.
+
+It's still append-only, kinda.
+
+### 3. Growing pains
+
+The bespoke approach above is fine. It has decent worst-case performance, but isn't going to win any awards (unless there are prizes for being impressed with oneself).
+
+There are a variety of ways we might want to tweak this implementation to accommodate structural properties of the data, or to take aim at better performance. Let's go through a few examples:
+
+#### Collections without values
+
+What if your collection isn't `(key, val)` pairs, but is just `val` values?
+
+This is not uncommon in, for example, Datalog computations where the collections are all set-valued. The pervasive `distinct` operator doesn't know anything about keys and values, nor does half of the `semijoin` operator (which passes those records whose keys exist in a second set).
+
+In this setting, while we could just take our `V` type to be `()`, the zero-sized unit type, we feel pretty silly keeping around
+
+    vals: Vec<((), usize)>,
+
+when we could have just put the offsets into `times` up in the `keys` vector.
+
+Instead, it makes sense to have a different data structure:
+
+```rust
+struct Batch<K, T> {
+    keys: Vec<(K, usize)>,
+    times: Vec<(T, isize)>,
+}
+```
+
+Of course, we'll need a different cursor now too. Copy/paste?
+
+#### Static collections
+
+Some collections don't change. Or, maybe they just have few enough times at which they change that we could tolerate an immutable collection for each time.
+
+If we want to do this, it neither makes sense to record the times in the collection, nor to allow `vals` to describe ranges (as there will only be one `diff`). Instead, we'd like a struct like
+
+```rust
+struct Batch<K, V, T> {
+    time: T,
+    keys: Vec<(K, usize)>,
+    vals: Vec<(V, isize)>,
+}
+```
+
+This saves us a bunch on the footprint (those times aren't small, nor is `usize`). It also simplifies some of the cursor logic, which we'll have to rewrite of course.
+
+#### Hashing vs Searching
+
+Ordered lists are pretty handy for a few reasons.
+
+1. We can search for elements in logarithmic time.
+2. We can merge lists quite efficiently.
+3. Flat data layout is appealing for copying, serialization.
+
+One potential improvement is to use an order-preserving hash table. My personal favorite is [Robin Hood Hashing](https://en.wikipedia.org/wiki/Hash_table#Robin_Hood_hashing), which keeps all elements in sorted order of their hash codes, but puts elements no earlier than the index their hash code would suggest. It leaves elements in sorted order, but with some gaps that in essence absorb some of the pain of bucket collisions. It still meets most of the properties up above, but only applies when your type implements `Hash` in addition to `Ord`.
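+
+To make the invariant a bit more concrete, here is a toy sketch of a lookup under that discipline: entries live in increasing order of hash code, never before the slot their hash suggests. This is purely illustrative (it ignores wrap-around and just stores bare `(hash, key)` pairs); it is not the implementation used later in the post.
+
+```rust
+/// Toy lookup in a "sorted by hash code" Robin Hood table; returns the slot
+/// index of `key` if present. Slots hold `(hash, key)`; `None` means empty.
+fn lookup(slots: &[Option<(u64, u64)>], hash: u64, key: u64) -> Option<usize> {
+    // preferred slot: scale the hash down using its high-order bits.
+    let mut index = ((hash as u128 * slots.len() as u128) >> 64) as usize;
+    // entries appear in increasing hash order at or after their preferred slot,
+    // so scan forward until we find the hash, pass it, or hit an empty slot.
+    while index < slots.len() {
+        match slots[index] {
+            None => return None,
+            Some((h, k)) if h == hash && k == key => return Some(index),
+            Some((h, _)) if h > hash => return None,
+            _ => index += 1,
+        }
+    }
+    None
+}
+```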
+
+The changes to our batch type are not wildly complicated:
+
+```rust
+struct Batch<K, V, T> {
+    keys: Vec<Option<(K, usize)>>,
+    vals: Vec<Option<(V, usize)>>,
+    times: Vec<(T, isize)>,
+}
+```
+
+The only change here is that we've wrapped each of our entries in an `Option` type, which allows for the absence of data with its `None` variant.
+
+Of course, there is a substantially different cursor to write now.
+
+#### Hybridizing
+
+These two examples are a good start, but maybe what we actually want is hashing for the keys, and ordered lists for the values, because we jump around in keys mostly and then scan values. Or maybe we want hashing for that neat key-free version that `distinct` uses. Awesome!
+
+Do I really have to write out the full cross-product of these variants? What about when we invent a new variant that doesn't use `isize` but instead has a binary polarity for the whole batch and uses the number of occurrences to indicate the count?
+
+You could write all these things out, and that would be one way to spend several months.
+
+### 4. A modular approach
+
+The bespoke approach quickly becomes painful to maintain. There are obviously many improvements, but the cost of implementing them (and maintaining them, if there are bugs, lol) makes it not that exciting for me at least.
+
+Let's do something smarter, and make things faster in the process.
+
+#### Layers on layers
+
+Our various representations above varied mostly in which fields (`keys`, `vals`, `times`) we kept, and how we navigated through each of these layers. In fact, there is a pretty simple contract between the layers: each layer is segmented into distinct regions by `usize`d boundaries known one layer up. We can think about writing generic "layers" that we then plug on top of other layers.
+
+Let me give you a taste of what this looks like:
+
+```rust
+struct OrdLayer<K: Ord, L> {
+    keys: Vec<(K, usize)>,
+    vals: L,
+}
+```
+
+Ok, this generic struct has `keys` like before, but then it just has `vals` which is some type `L: Layer`. What does that mean?
+
+The generic struct means that we can plug in our favorite type that implements some `Layer` trait, and we get the ability to navigate around by key `K`, and perhaps then dive down into `vals` using the `usize` offsets associated with the keys.
+
+Let's do another one, with less cleverness:
+
+```rust
+struct BottomLayer<T: Ord> {
+    vals: Vec<(T, isize)>,
+}
+```
+
+I've suggestively named the layer and its generic parameter to connect this with the last level of our first `Batch` implementation. Speaking of which,
+
+```rust
+type Batch<K, V, T> = OrdLayer<K, OrdLayer<V, BottomLayer<T>>>;
+```
+
+Ooo. That was pretty easy, wasn't it? Let's do that value-free (ha) version that something like `distinct` would use:
+
+```rust
+type Batch<K, T> = OrdLayer<K, BottomLayer<T>>;
+```
+
+Oh neat. What about that hashing stuff? We could certainly write out something like
+
+```rust
+struct HashLayer<K: Ord, L> {
+    keys: Vec<(K, usize)>,
+    vals: L,
+}
+```
+
+which lets us choose to use our Robin Hood hashing on a layer-by-layer basis. We could do
+
+```rust
+type Batch<K, V, T> = HashLayer<K, OrdLayer<V, BottomLayer<T>>>;
+```
+
+or
+
+```rust
+type Batch<K, V, T> = OrdLayer<K, HashLayer<V, BottomLayer<T>>>;
+```
+
+or
+
+```rust
+type Batch<K, T> = HashLayer<K, BottomLayer<T>>;
+```
+
+or whatever we want, as long as what we want is to type in lots of different batch definitions.
+
+But there's more, isn't there? Just writing the structs is the easy part; there is *code* to write too.
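+
+Before we get to that code, here is one plausible shape for the contract between layers, just to fix ideas. To be clear, this trait is a hypothetical sketch and not the one in the differential dataflow code; the important part is only that a layer is addressed by `usize` offsets handed down from the layer above, and can copy ranges of itself (and everything it owns below) around.
+
+```rust
+/// A hypothetical layer contract (illustrative; not the actual trait).
+trait Layer {
+    /// Number of entries in this layer.
+    fn len(&self) -> usize;
+    /// Append entries `lower .. upper` of `other`, along with whatever they
+    /// own in the layers below, onto the end of `self`.
+    fn copy_range(&mut self, other: &Self, lower: usize, upper: usize);
+}
+```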
+
+#### Cursors
+
+Probably the scariest thing to worry about is the cursors. That is where all the logic exploiting our cleverness lives. But, cursors really aren't all that hard.
+
+Here is the trait that I'm currently using:
+
+```rust
+pub trait Cursor {
+    type Key: Ord;
+    fn step(&mut self);
+    fn seek(&mut self, key: &Self::Key);
+    fn key(&self) -> &Self::Key;
+    fn valid(&self) -> bool;
+    fn reposition(&mut self, lower: usize, upper: usize);
+}
+```
+
+The trait has an associated type `Key` which is just the cursor's way of saying "you need to use this type to talk with me". Ok. It has `step` and `seek`, which do the things you would expect (jump around). It has `key` and `valid` methods to know what you are looking at. Finally, there is a `reposition` method that, let's be honest, you aren't supposed to call. That is for parent cursors to call when they jump around. Hands off.
+
+Ok, this moves us around one level, but there is nothing in the trait definition about diving down to lower values and such. How do we do that?
+
+I've just kept it out of the trait definition, though perhaps it could live there too. It seemed easier just to do something like
+
+```rust
+pub struct OrdCursor<K: Ord, L: Cursor> {
+    layer: OrdLayer<K, L>,
+    pos: usize,
+    bounds: (usize, usize),
+    pub vals: L,
+}
+```
+
+This cursor has a link to the layer, a current position, and bounds on the interval it is currently exploring (so you don't wander into the `K`s for some other `G`, or whatever lives above you).
+
+In addition, the cursor has a public field, `vals`, which is just the cursor for the next level down. You can use that as you see fit. Importantly, none of the `vals` navigation informs the `keys` cursor; once the vals cursor runs out of values to look at (or whatever), it just sits in an invalid state. The only protocol between the two is that as we advance in the parent `OrdCursor`, we need to reposition the child `vals` to the corresponding range of values (hence `reposition`).
+
+The reality is that I'm going to have to implement the `Cursor` trait from way back above (with `step_key` and `seek_val`) for each of these types. No magic there, yet. But that code should be relatively simple, and mostly just makes associations between the trait methods and those of the weird type that I've built.
+
+As long as all of our batches are the same type, their cursors will be the same type and we can just throw them in that `CursorMerge` struct, which was written to be generic over the cursor implementation.
+
+#### Merging
+
+This part is pretty cool. I think.
+
+We are going to have to merge some of our `Batch` collections. No sweat, but let's at least try and do a good job here. Batches are sorted by *something* (either key or hash), and we can walk through them doing a merge.
+
+We all know how to write merge sort, maybe. We repeatedly look at the next element in each cursor, and take the smaller or, if there is a tie, take them both. Actually, it's a bit more complicated with batches: if there are tied keys, we need to merge the associated values. If there are tied values we need to merge the associated times. Ok, it's like merge sort but a bit weirder.
+
+Actually, it is like merge sort but awesomer, in a few important ways.
+
+Let me describe a variant of merge sort that works on arrays, and uses our `advance` method from up above. Remember that one? It skips forward in ordered sequences to find the first element greater than some target, using a logarithmic number of steps.
+
+Here is a merge sort fragment that compares two elements, and based on which one wins uses `advance` to zip through the one with lesser values, looking for a number `step` of elements to copy over. It then uses `extend_from_slice`, which Rust folks will know is one of the few ways to trick LLVM into using `memcpy`.
+
+```rust
+use std::cmp::Ordering;
+
+// merge sort, kinda.
+fn merge<T: Ord + Copy>(mut source1: &[T], mut source2: &[T], target: &mut Vec<T>) {
+    while source1.len() > 0 && source2.len() > 0 {
+        match source1[0].cmp(&source2[0]) {
+            Ordering::Less => {
+                let step = 1 + advance(&source1[1..], |k| k < &source2[0]);
+                target.extend_from_slice(&source1[..step]);
+                source1 = &source1[step..];
+            }
+            Ordering::Equal => {
+                target.push(source1[0]);
+                target.push(source2[0]);
+                source1 = &source1[1..];
+                source2 = &source2[1..];
+            }
+            Ordering::Greater => {
+                let step = 1 + advance(&source2[1..], |k| k < &source1[0]);
+                target.extend_from_slice(&source2[..step]);
+                source2 = &source2[step..];
+            }
+        }
+    }
+
+    target.extend_from_slice(&source1);
+    target.extend_from_slice(&source2);
+}
+```
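+
+As a quick sanity check of the behavior (a toy example using the `merge` and `advance` functions above), runs of elements from either input get copied over in bulk:
+
+```rust
+fn main() {
+    let first = [0, 1, 2, 7, 8];
+    let second = [3, 4, 5, 6, 9];
+    let mut merged = Vec::new();
+    // [0, 1, 2] and then [3, 4, 5, 6] are each copied over as single runs.
+    merge(&first, &second, &mut merged);
+    assert_eq!(merged, vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);
+}
+```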
+
+This is perhaps a bit over-engineered. Maybe these runs aren't very common in sorting benchmarks, but they show up quite a lot in our batches. Wait, really?
+
+In our batches, the analogue of `extend_from_slice` doesn't just copy over some keys, it needs to move over all of the associated values, and for each of those values the associated times. Even if we only get a few keys in a run, we are likely to want to bulk copy a bunch of values and times.
+
+Fortunately, we can make our `Layer` implementations expose a `copy_range` interface, which efficiently pulls over large blocks of data from existing tries. This makes merging hurt a lot less, and we only have to write it once for each `Layer` implementation, rather than for each of our weird cross-product implementations.
+
+### 5. Evaluation
+
+Let's see how this goes. We are initially going to compare against differential dataflow's default implementation, which was the "not very impressive" implementation whose performance limitations we hinted at back at the beginning. We *expect* to see improvements here.
+
+Specifically, we'd like to get speedy random access, high-throughput merging, and more efficient memory layout where appropriate. The things we will measure are
+
+1. Time to build collections from unsorted tuples.
+2. Time to merge collections of varying sizes.
+3. Time to look up query sets in collections, each of varying sizes.
+
+We'll actually do a few flavors of queries, taking random query sets of varying sizes and sorting them to see how performance improves as we get larger batches of queries. All of our implementations are designed to go faster as we get larger batches of queries to look up.
+
+As data, I'm not going to work too hard here and just introduce a bunch of `(u32, (usize, isize))` tuples, where the key ranges from `0` up to some limit, and there is just one `usize` value for each key. Our goal is to stress out the key layer, not the layer below it. I think we could probably do some more interesting measurements, but we would start to conflate a bunch of performance issues and likely not understand the behavior of these layers as easily.
+
+Let's see!
+
+#### Baseline performance
+
+To start, let's consider using Rust's `HashMap` to drive a layer. I'm not really sure how this would work modularly, but we can use it for the top layer (keys), and get a sense for how "random access" could look, and what building and merging such a structure looks like (in fairness, it is optimized for neither).
+ +```rust + HashMap +``` + +For increasing numbers of keys, we are going to stash `(key, (0u64, 0u64))` in the hash map. This is less work than the other layers are going to do, as they have variable numbers of values to maintain too, but since `HashMap` doesn't do that we won't measure it. Instead, we will build up the hash map, merge the hash map with itself, and seke around in the hash map. In each case we report elapsed times divided by the number of records involved. + +| keys | build | merge | seek | +|------------:|------:|------:|------:| +| 1,000 | 137ns | 82ns | 19ns | +| 10,000 | 102ns | 49ns | 24ns | +| 100,000 | 153ns | 108ns | 69ns | +| 1,000,000 | 256ns | 154ns | 77ns | +| 10,000,000 | 232ns | 177ns | 83ns | +| 100,000,000 | 283ns | 228ns | 177ns | + +Hash maps are meant to have good `seek` latencies at the expense of building and merging overhead; hash map construction stabs around in memory, and stutters a bit once we get more and more keys. We do see the performance drop off a bit once we get to 100 million keys, a result of stabbing around randomly in quite a bit of memory. + +#### `OrdLayer` performance + +Our first candidate, and roughly equivalent to what we are currently using in differential dataflow, is the ordered layer. We are going to be using: + +```rust + OrdLayer> +``` + +Let's start by looking at the time it takes to build up an instance of the layer, starting from randomly ordered data (explicitly shuffled). We'll go through each key size, where we just have a single value associated with each (as we are trying to test this layer, not others). There are two costs to building the layer: sorting the input tuples, and then feeding them into the layer. + +| keys | sort | build | merge | +|------------:|------:|------:|------:| +| 1,000 | 82ns | 9ns | 10ns | +| 10,000 | 84ns | 11ns | 16ns | +| 100,000 | 96ns | 24ns | 17ns | +| 1,000,000 | 136ns | 31ns | 23ns | +| 10,000,000 | 166ns | 40ns | 20ns | +| 100,000,000 | 180ns | 43ns | 24ns | + +The measurements indicate that sorting dominates, and if we could improve this we would see some gains. Or perhaps viewed differently, the building process is relatively efficient, which is good news because merging looks more like building than it looks like sorting. This merge also happens to be a pessimal merge, in that we merge identical traces with each other, meaning we never get a chance to use our fancy `memcpy`. The performance is still pretty good, as is notably much better than the `HashMap` implementation. + +Up next, let's evaluate the time it takes to issue a bunch of queries. We'll randomly shuffle all the keys, and then issue queries in batches of varying sizes. The implementation can (and will) sort the queries to use one forward scan through the layer, which is important because our `advance` method only goes forward. Let's plot the average number of nanoseconds for a `seek_key()` call across all the keys, batched at different sizes: + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|--------:|--------:|-----:| +| 1,000 | 61ns | 51ns | 56ns | 44ns | 44ns | +| 10,000 | 57ns | 51ns | 52ns | 50ns | 49ns | +| 100,000 | 207ns | 174ns | 112ns | 86ns | 67ns | +| 1,000,000 | 513ns | 484ns | 384ns | 261ns | 78ns | +| 10,000,000 | 1,168ns | 1,290ns | 1,004ns | 707ns | 86ns | +| 100,000,000 | 1,812ns | 2,014ns | 1,646ns | 1,185ns | 96ns | + +We see a general trend toward lower latencies with higher batch sizes, though this only really shows up at larger batch sizes. 
The latencies at large batch sizes are *quite* high, and notably much *worse* than the `HashMap` implementation (by about 10x). + +For informational purposes, let's re-do this measurement where we don't charge the cost of sorting to the query. This makes some sense in cases where the keys are pre-sorted, as when they arrive in a pre-formed layer (which they do, in differential dataflow). + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|--------:|--------:|-----:| +| 1,000 | 56ns | 40ns | 22ns | 9ns | 4ns | +| 10,000 | 56ns | 42ns | 28ns | 18ns | 4ns | +| 100,000 | 209ns | 119ns | 87ns | 34ns | 5ns | +| 1,000,000 | 494ns | 501ns | 364ns | 215ns | 5ns | +| 10,000,000 | 1,135ns | 1,257ns | 975ns | 651ns | 4ns | +| 100,000,000 | 1,661ns | 1,931ns | 1,538ns | 1,183ns | 4ns | + +Here the improvement as batch sizes increase becomes much clearer, even for small numbers of keys. On the other hand, the performance for large numbers of keys stays pretty unbearable. + +#### `HashLayer` performance + +Our next candidate is the Robin Hood hashing layer, + +```rust + HashLayer> +``` + +This evaluation taught me something about how the [FNV](https://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function) hash function works, or rather *doesn't* work on consecutive integers. It's fast on small inputs, though, so let's use it for now. I'm using a trait `Hashed` with two methods that reveal a hash value, and how many of the bits are legit. The latter is most commonly `64`, but can be less for `32` bit hash values (bear with me on this). Here is the default implementation for things that can be hashed. + +```rust +impl Hashed for T { + fn hashed(&self) -> u64 { + let mut h: fnv::FnvHasher = Default::default(); + self.hash(&mut h); + h.finish() + } + fn bits(&self) -> usize { 64 } +} +``` + +If we spin up the computation as with `OrdLayer`, assembling different numbers of distinct keys, we get numbers like: + + +| keys | sort | build | merge | +|------------:|------:|------:|------:| +| 1,000 | 421ns | 165ns | 28ns | +| 10,000 | 129ns | 69ns | 27ns | +| 100,000 | 122ns | 104ns | 43ns | +| 1,000,000 | 95ns | 68ns | 45ns | +| 10,000,000 | 91ns | 64ns | 32ns | +| 100,000,000 | 98ns | 74ns | 37ns | + +The sorting has improved by a factor of 2x over `OrdLayer`, for non-small key counts, but the build times have increased (which makes sense, as we are trying to be more clever). These numbers also call out the overhead my radix sorter has for small key sizes (there is a fixed cost proportional to `256`). + +Up next, let's evaluate the time it takes to issue a bunch of queries. We'll randomly shuffle all the keys, and then issue queries in batches of varying sizes. The implementation can (and will) sort the queries to use one forward scan through the layer, which is important because our `advance` method only goes forward. Let's plot the average number of nanoseconds for a `seek_key()` call across all the keys: + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|--------:|------:|------:| +| 1,000 | 63ns | 69ns | 115ns | 118ns | 143ns | +| 10,000 | 43ns | 46ns | 84ns | 120ns | 151ns | +| 100,000 | 112ns | 89ns | 113ns | 145ns | 183ns | +| 1,000,000 | 145ns | 90ns | 120ns | 147ns | 218ns | +| 10,000,000 | 174ns | 158ns | 194ns | 216ns | 248ns | +| 100,000,000 | 279ns | 279ns | 309ns | 346ns | 324ns | + +We see a different trend here than in `OrdLayer`. Namely, as the batch size gets bigger things get slower. 
But, this seems mostly due to the fact that sorting has a cost, and the random look-ups we are doing are already pretty fast (for large key sizes and small batches, almost 10x faster than `OrdLayer`). The cost of sorting can be decreased with radix sorting, but my implementation isn't especially brilliant on small batches (i.e. those other than `ALL`). + +Let's check again on the case where the cost of sorting is not counted against the cost of doing a query, simulating performance when the data arrive pre-sorted. + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|------:|------:|------:| +| 1,000 | 51ns | 32ns | 35ns | 17ns | 19ns | +| 10,000 | 40ns | 21ns | 15ns | 16ns | 12ns | +| 100,000 | 82ns | 88ns | 91ns | 40ns | 17ns | +| 1,000,000 | 140ns | 65ns | 54ns | 55ns | 17ns | +| 10,000,000 | 163ns | 145ns | 133ns | 133ns | 28ns | +| 100,000,000 | 268ns | 240ns | 235ns | 227ns | 68ns | + +These numbers look a lot better. If the queries come in sorted, we will be almost as happy as with `OrdLayer`, and much happier for the small batch sizes (by almost a factor of 10x for large key spaces). + +But, these numbers still aren't as good at we would like. The `ALL` measurement is noticeably larger than the single-digit nanoseconds of the `OrdLayer` measurements. + +--- + +It turns out that the janky performance can be attributed (I think) to poor distribution of hash values. If we instrument the implementation, some keys are displaced 40 positions from their intended location, and the variance of displacement is something like 24. These are much larger than they are supposed to be (the variance is meant to be 1-ish). + +If you dig into how the FNV hash function works, it treats its arguments as a sequence of bytes, and repeatedly x-ors in a byte and then multiplies by some large prime. Unfortunately, that prime is basically a large power of two plus some small junk; specifically, for 64 bit hashes it is + + 2^40 + 2^8 + 0xb3 + +The problem (for us) is that something as simple as consecutive integers differ only in their final byte, and by not very much. The result is that the high-order bits of the hash will differ only by some small multiple of `2^40`, which doesn't quite reach the highest-order bits of the hash. If we look primarily at the high-order bits to determine intended locations, we'll have a bunch of collisions when our keys are consecutive integers. + +For example, let's take the numbers `0 .. 128`, compute the FNV hash, and look at the top byte. There are 256 possible values, think hash buckets, so there should be plenty of space for 128 values. Unfortunately, they collide a lot. There are only 24 distinct bytes we see, so buckets will get five-ish entries on average. And worse, the occupied buckets form runs; here are the first six occupied buckets, and their counts: + + counts[0x0B]: 4 + counts[0x0C]: 6 + counts[0x0D]: 6 + counts[0x2B]: 4 + counts[0x2C]: 6 + counts[0x2D]: 6 + ... + +So, all of those entries are just going to pile up, aren't they. At least, at this small scale. Apparently similar things happen at the larger scales too. + +--- + +One fix that seems crazy, but works pretty great, is to ask all the nice graph processing people to take their very thoughtfully produced consecutive integer identifiers, and turn them into distinct random numbers. We can then just use this number as the hash, which is pretty handy (we skip the FNV evaluation, and radix sorting is cheaper because we have half the bytes in the key). 
+ +```rust +#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Default)] +pub struct Uniform { + val: u32, +} + +impl Hashed for Uniform { + fn hashed(&self) -> u64 { self.val as u64 } + fn bits(&self) -> usize { 32 } +} +``` + +Our layer is now going to be + +```rust + HashLayer> +``` + +which is roughly the same as our previous `HashLayer`, just with a different `hashed()` implementation. If we re-run these measurements with random `u32` identifiers rather than `0 .. nodes`, we see a maximum displacement of 10 (rather than 40), and a variance of 1, (rather than 24). That feels better. + +Let's look at how this version fares. Building and merging shouldn't be much different, as these are largely agnostic to collisions. However, we save on the sorting (because radix sort is smart enough to notice that `u32` has fewer bytes than `u64`), and we might save a bit on the FNV evaluation, which we aren't doing now. + +| keys | sort | build | merge | +|------------:|------:|------:|------:| +| 1,000 | 310ns | 213ns | 30ns | +| 10,000 | 68ns | 57ns | 26ns | +| 100,000 | 53ns | 92ns | 37ns | +| 1,000,000 | 44ns | 63ns | 36ns | +| 10,000,000 | 49ns | 65ns | 35ns | +| 100,000,000 | 52ns | 84ns | 31ns | + +These sort times are about half what they are for the data when we use `u64` hash values to sort. The radix sort is smart and understands how many bytes are in the key, based on its type, and only does that many passes. Building and merging both seem to be about the same amount of time as before, which makes sense (perhaps fewer FNV invocations, but not enough to matter). + +When we start to do look-ups, we notice generally improved times. + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|------:|------:|------:| +| 1,000 | 41ns | 43ns | 65ns | 69ns | 76ns | +| 10,000 | 39ns | 24ns | 41ns | 67ns | 87ns | +| 100,000 | 68ns | 97ns | 60ns | 69ns | 105ns | +| 1,000,000 | 125ns | 61ns | 70ns | 83ns | 122ns | +| 10,000,000 | 138ns | 54ns | 61ns | 78ns | 132ns | +| 100,000,000 | 191ns | 89ns | 96ns | 127ns | 158ns | + +As before, a lot of overhead here is due to sorting, but we've also cut into the actual query times. If we check out the times without counting sorting against the queries, we see just how much is sorting: + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|------:|------:|------:| +| 1,000 | 26ns | 19ns | 10ns | 11ns | 11ns | +| 10,000 | 34ns | 17ns | 14ns | 12ns | 8ns | +| 100,000 | 82ns | 23ns | 22ns | 37ns | 10ns | +| 1,000,000 | 65ns | 42ns | 27ns | 25ns | 6ns | +| 10,000,000 | 134ns | 42ns | 34ns | 33ns | 8ns | +| 100,000,000 | 189ns | 76ns | 65ns | 64ns | 9ns | + +These times are now much closer to `OrdLayer` for large query batches, especially `ALL`, and are up to an order of magnitude faster than `OrdLayer` for queries against large key sizes. In fact, the apparently rough single-query batches seem to be an artifact of re-initializing the cursor more than anything; if we re-use the same cursor (the `HashCursor` supports forward and backward navigation) the times become roughly the same as the batch size `10` numbers; this makes some sense because 10 keys out of 100 million don't really have any more locality than one key out of 100 million. What doesn't make sense is why construcing a cursor breaks performance (it does very little). + +Our times are also similar to, and occasionally better than the `HashMap` times. 
My understanding of `HashMap` is that it cynically optimizes for "misses", by storing values separately from keys. This means that when we do successful look-ups and want the value, there is another memory dereference to do. We stash the values with the key, so once we've found the key we have the data at hand. That's my best interpretation of the difference between the two. We also don't use siphash, which could make a difference. This isn't meant to be a critique of `HashMap`, so don't get too excited about the difference. + +#### `HashLayer>>` + +Let's try out what happens when we put a spurious bogus layer in place. We are going to use a collection of type + +```rust + HashLayer>> +``` + +which just adds a bogus `()` value. This is meant to represent the world in which we just have keys, no values, but have to suffer due to an implementation that assumes we must have values. + +| keys | sort | build | merge | +|------------:|------:|------:|------:| +| 1,000 | 320ns | 204ns | 28ns | +| 10,000 | 68ns | 58ns | 27ns | +| 100,000 | 53ns | 82ns | 55ns | +| 1,000,000 | 44ns | 73ns | 43ns | +| 10,000,000 | 49ns | 75ns | 41ns | +| 100,000,000 | 52ns | 60ns | 67ns | + +These numbers aren't actually that bad. I was hoping they would be worse, actually. Oh well. The build and merge seem to creep upwards slighly, except for some heroics building the 100 million element collection. But it isn't as bad as we might have worried. + +Looking things up will not behave any differently until we actually try to track down some values, which we haven't done just yet (we've just been calling `seek` and confirming we found the key). But, it's a pretty easy experiment to do, so lets compare the two. I've done the version where we don't charge sorting to the query, to try and make the distinction the clearest. + +First, the previous implementation without the bogus `OrdLayer<(),_>` gumming up the works: + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|------:|------:|------:| +| 1,000 | 41ns | 22ns | 16ns | 12ns | 13ns | +| 10,000 | 42ns | 19ns | 16ns | 14ns | 9ns | +| 100,000 | 142ns | 31ns | 35ns | 53ns | 10ns | +| 1,000,000 | 205ns | 63ns | 46ns | 43ns | 9ns | +| 10,000,000 | 233ns | 68ns | 74ns | 54ns | 9ns | +| 100,000,000 | 351ns | 146ns | 126ns | 113ns | 11ns | + +Notice that these numbers are for sure worse than when we didn't look at any of the associated data. This is due to the additional dereference we need to do to find the data. It becomes less of a problem with `ALL` keys, because our sequential scan of the keys also turns in to a sequential scan of the values. + +Now with the implementation with the bogus layer containing `()` values: + +| keys \ batch | 1 | 10 | 100 | 1,000 | ALL | +|-------------:|--------:|--------:|------:|------:|------:| +| 1,000 | 50ns | 27ns | 16ns | 14ns | 24ns | +| 10,000 | 54ns | 25ns | 23ns | 18ns | 13ns | +| 100,000 | 206ns | 129ns | 93ns | 62ns | 14ns | +| 1,000,000 | 287ns | 103ns | 87ns | 78ns | 11ns | +| 10,000,000 | 440ns | 128ns | 103ns | 119ns | 11ns | +| 100,000,000 | 515ns | 232ns | 191ns | 197ns | 13ns | + +These numbers are definitely worse. Not catastrophic, but definitely worse. Add to this the fact that we hold on to an extra `usize` for each key, 800 MB for the 100 million key case, and I'm comfortable with the effort spent making it possible to use the leaner representation. + +#### Fancy merging + +What about our fancy `memcpy` merging? 
Let's evaluate that by doing some merging on two different pairs of collections. The first pair of collections will have all the even keys in one collection, and all the odd keys in the other collection. The second pair of collections will have the first `nodes` keys in one collection, and the second `nodes` keys in the other collection. The former collections alternate keys and will be annoying to merge, whereas the latter can largely just copy their ranges over.
+
+| keys        | alternating | contiguous |
+|------------:|------------:|-----------:|
+| 1,000       | 73ns        | 36ns       |
+| 10,000      | 57ns        | 22ns       |
+| 100,000     | 59ns        | 23ns       |
+| 1,000,000   | 53ns        | 26ns       |
+| 10,000,000  | 55ns        | 25ns       |
+| 100,000,000 | 67ns        | 31ns       |
+
+This shows off what sorts of gains we might pick up as we copy over larger regions. We have picked the very best possible case to show this off, so it remains to be seen how cool it is in practice.
+
+### 6. Conclusions, next steps.
+
+It seems like we have a decent toolkit to start building up interesting index structures. My first plan is to try and slot in a
+
+```rust
+    HashLayer<K, OrdLayer<V, BottomLayer<T>>>
+```
+
+into differential dataflow and see how that looks. Of course, there is a bunch of glue to write, and perhaps some re-factoring of interfaces if pain points get discovered.
\ No newline at end of file