serial trees

frankmcsherry · Aug 25, 2024 · 9c4d548 · 9c4d548
1 parent dd802bb
commit 9c4d548
Showing 1 changed file with 369 additions and 0 deletions.
diff --git a/posts/2024-08-25.md b/posts/2024-08-25.md
@@ -0,0 +1,369 @@
+## More serial layout in Rust
+
+I have been poking again at laying out complicated Rust types in a simpler serial layout.
+The idea, [discussed in an earlier post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-24.md), is to repeatedly find simpler surrogates for `Vec<T>` for various types `T`, each time making the surrogate into lists of simpler types.
+For example,
+1.  For primitive types `T`, `Vec<T>` can stay how it is!
+2.  For product types, e.g. `Vec<(S, T)>` can be replaced by the pair `(Vec<S>, Vec<T>)`.
+3.  For sum types, e.g. `Vec<Result<S, T>>` can be replaced by the triple `(Vec<S>, Vec<T>, Vec<Result<usize, usize>>)`.
+    This one could be succincter, for sure.
+4.  For list types, e.g. `Vec<Vec<T>>` can be replaced by the pair `(Vec<usize>, Vec<T>)`.
+
+If you repeat this process, you end up with a `struct` that has only `Vec<T>` for primitive `T`.
+That is, assuming you started with a product, sum, or list combination of primitive types.
+If you start with a type that doesn't reduce down in these ways for some reason, you get stuck.
+In fact, the first case, of just using `Vec<T>`, works for all types, but you don't get any benefit.
+
+But there's a very idiomatic way to build types that end up irreducible.
+The venerable tree. 
+```rust
+struct Tree<T> {
+    data: T,
+    kids: Vec<Tree<T>>,
+}
+```
+When you pull apart a `Tree<T>`, you don't find simpler types inside.
+You find .. more `Tree<T>`.
+
+However, there's a fairly easy way to lay out trees like this in serial form.
+We're going to walk through a way to build up a surrogate for `Vec<Tree<T>>`.
+We'll follow that, in detail that depends on our complexity budget, with a way to build up a surrogate for `Vec<Json>`, which happens to be a tree with a bunch of weird complexity.
+We'll end up with a representation that trades away mutability (sorry!) for performance:
+```
+Read 1000 records
+705.291µs       json_vals cloned
+41.209µs        json_cols cloned
+603.5µs         json_vals ser/de
+223.5µs         json_cols ser/de
+```
+Here we load 1000 JSON records shamelessly lifted from [a fasterthanlime post](https://fasterthanli.me/articles/small-strings-in-rust).
+We then both clone the 1000 JSON values themselves, and the 1000 values stored in a columnar representation.
+The simpler layout makes this substantially easier.
+Serialization and deserialization are also better, round tripping through `bincode`, though the time is certainly higher than the `memcpy`s it could be (presumably it should look more like the cloning time).
+
+---
+
+Though I had to cheat because apparently `serde_json` and `bincode` don't work together?
+I swapped out the `serde_json::Number` type for a `usize` for the serialization tests; I guess it uses some tagging features that leave `bincode` wrong-footed.
+I'm a bit surprised that these core Rust things don't work together off the shelf.
+
+---
+
+### A layout for trees
+
+Laying trees out in a serial order is not new, and may not be a surprise to many of you, or even any of you!
+But, I like writing things down, because a few years from now I'll try to re-invent this and it will be helpful to have the conclusions already at hand.
+
+The type we are going to start with is a tree with some `data: T` at each tree node, and a variable number of children.
+```rust
+/// A tree; some data.
+struct Tree<T> {
+    data: T,
+    kids: Vec<Tree<T>>,
+}
+```
+
+In conventional Rust layout, each of the non-empty `kids` will be backed by their own allocations.
+This is great, because each can be independently manipulated, and you can implement all sorts of tree algorithms that require mutation.
+
+We are going to focus on trees that shall not be mutated.
+I think you'll be able to change the `data`, but certainly not the `kids`.
+The motivation is inherited from the previous work: many cases where I would like to ship some data around, from some source of truth to interested recipients.
+The recipients are there to observe the data, not to change it.
+
+Our plan is not very sophisticated: we will perform a pre-order traversal and write the nodes down.
+When we write them down, we'll obviously record `data` but we'll need to write *something* for `kids`.
+In this case, as we are building a container for multiple `Tree<T>` instances, we'll jot down the bounds that describe where to find the children in the container.
+The only challenge is that, as a pre-order traversal, we haven't written them anywhere yet.
+Fortunately, this will be a non-issue based on our traversal order!
+
+```rust
+/// A stand-in for `Vec<Tree<T>>`
+#[derive(Clone)]
+pub struct ColumnTree<T> {
+    pub groups: Vec<usize>,     // inserted tree delimiters.
+    pub bounds: Vec<usize>,     // node child delimiters.
+    pub values: Vec<T>,         // node values. 
+}
+```
+
+Both `bounds` and `values` have ~the same length, and each index for both will correspond to a tree node.
+The node's data are found in `values`, and the `bounds` indicate where to look for the node's children.
+When we insert a tree, we'll populate as many of these as we have instances of `Tree<T>` in our tree.
+The offsets in `groups` will delimit ranges of indexes corresponding to whole trees that have been inserted;
+we need this because we insert multiple trees, and want to be able to find their roots.
+The first index in each range delimited by `groups` will be the root of the tree.
+
+The plan is to implement a `ColumnTree::push` function that accepts trees and transcribes them.
+```rust
+impl<T> ColumnTree<T> {
+    // Pushes a tree containing data onto `self`.
+    pub fn push(&mut self, tree: Tree<T>) {
+        // Our plan is to repeatedly transcribe tree nodes, enqueueing
+        // any child nodes for transcription. When we do, we'll need to
+        // know where they will be written, to leave a forward reference.
+        // We'll derive this by adding `values.len()` and `todo.len()`.
+        let mut todo = std::collections::VecDeque::default();
+        todo.push_back(tree);
+        while let Some(node) = todo.pop_front() {
+            // Children will land at positions in `self.values` determined
+            // by its current length, plus `todo.len()`, plus one (for node).
+            let cursor = self.values.len() + todo.len() + 1;
+            self.values.push(node.data);
+            self.bounds.push(cursor + node.kids.len());
+            for child in node.kids.into_iter() {
+                todo.push_back(child);
+            }
+        }
+        // Commit the new number of nodes to the delimiters.
+        self.groups.push(self.values.len());
+    }
+}
+```
+
+The interesting-to-me moment is when we need to write down something is `self.bounds` to indicate where we might find the node's children.
+Although we may still have many nodes to transcribe, we know where we are now (`values.len()`) and how much outstanding work there is (`todo.len()`), and putting these together we can figure out where each enqueued child will end up.
+This is because we are doing a breadth-first traversal, which isn't necessarily the best order as it maintains the most intermediate state, and perhaps with some more thinking we could rework this.
+The write-once approach we have here is appealing in a setting where we might ship (to disk, or network) the parts of the `ColumnTree` we've written.
+
+To navigate the tree we'll mostly just follow pointers around, or rather *offsets* from pointers.
+We'll use a handy helper type for this:
+```rust
+/// A stand-in for `&Tree<T>`
+pub struct ColumnTreeRef<'a, T> {
+    value: &'a T,
+    lower: usize,
+    upper: usize,
+    nodes: &'a ColumnTree<T>
+}
+```
+Here the value is clearly just a reference to the value, and the children are described by the lower and upper node indexes that can be found in the `nodes` reference.
+```rust
+impl<'a, T> ColumnTreeRef<'a, T> {
+    /// A reference to the node's value.
+    pub fn value(&self) -> &T { self.value }
+    /// A reference to a specific child.
+    pub fn child(&self, index: usize) -> ColumnTreeRef<'a, T> {
+        assert!(index < self.upper - self.lower);
+        let child = self.lower + index;
+        ColumnTreeRef {
+            value: &self.nodes.values[child],
+            lower: self.nodes.bounds[child],
+            upper: self.nodes.bounds[child+1],
+            nodes: self.nodes,
+        }
+    }
+    /// The number of children.
+    pub fn kids(&self) -> usize { self.upper - self.lower }
+}
+```
+I even went and implemented `PartialEq<Tree<T>>` for this type, to make validation easier.
+```rust
+impl<'a, T: PartialEq> PartialEq<Tree<T>> for ColumnTreeRef<'a, T> {
+    fn eq(&self, other: &Tree<T>) -> bool {
+        let mut todo = vec![(*self, other)];
+        while let Some((this, that)) = todo.pop() {
+            if this.value != &that.data { 
+                return false; 
+            } else if (this.upper - this.lower) != that.kids.len() {
+                return false;
+            } else {
+                for (index, child) in that.kids.iter().enumerate() {
+                    todo.push((this.child(index), child));
+                }
+            }
+        }
+        true
+    }
+}
+```
+This is iterative rather than recursive, but that's usually a smart thing to do if you don't like stack overflows.
+It also relies on a `Copy` implementation I didn't show you; not complicated, but needlessly verbose.
+
+A few more implementation details round out our `ColumnTree` implementation.
+```rust
+impl<T> ColumnTree<T> {
+    /// Access the root of the `index`th tree.
+    pub fn index(&self, index: usize) -> ColumnTreeRef<'_, T> {
+        let root = self.groups[index];
+        ColumnTreeRef {
+            value: &self.values[root],
+            lower: self.bounds[root]+1, // +1 because ...
+            upper: self.bounds[root+1],
+            nodes: self,
+        }
+    }
+    /// Create an empty `ColumnTree`.
+    pub fn new() -> Self {
+        Self {
+            groups: vec![0],    // init delimiters with zero.
+            bounds: vec![0],    // init delimiters with zero.
+            values: Vec::default(),
+        }
+    }
+}
+```
+The nit above is that strictly speaking the root of each tree lists itself as a child.
+This is .. a quirk of how we've written things down, using delimiters rather than twice the memory to have explicit lower and upper bounds.
+It's not the end of the world, but does require care at a few moments.
+
+### Trying out trees
+
+We'll make some nonsense trees, with lots of nodes.
+```rust
+let mut tree = tree::Tree { data: 0, kids: vec![] };
+for i in 0 .. 11 {
+    let mut kids = Vec::with_capacity(i);
+    for _ in 0 .. i {
+        kids.push(tree.clone());
+    }
+    tree.data = i;
+    tree.kids = kids;
+}
+```
+Each level in this tree has as many children as the level, so the tree size goes up .. factorially?
+
+We'll do a few things with the tree itself, just to start.
+We'll sum up the values, and clone the tree, to get a sense for navigating and copying the type.
+```rust
+let timer = std::time::Instant::now();
+let sum = tree.sum();
+let time = timer.elapsed();
+println!("{:?}\ttree summed: {:?}", time, sum);
+
+let timer = std::time::Instant::now();
+let clone = tree.clone();
+let time = timer.elapsed();
+println!("{:?}\ttree cloned", time);
+```
+We'll then form the columnar tree, and validate the equivalence.
+```rust
+let timer = std::time::Instant::now();
+let mut cols = tree::ColumnTree::new();
+cols.push(tree);
+let time = timer.elapsed();
+println!("{:?}\tcols formed", time);
+
+let timer = std::time::Instant::now();
+if cols.index(0) != clone {
+    println!("UNEQUAL!!!");
+}
+let time = timer.elapsed();
+println!("{:?}\tcompared", time);
+```
+Finally, we'll sum the values in the serial layout, and clone the column tree.
+```rust
+let timer = std::time::Instant::now();
+let sum = cols.values.iter().sum::<usize>();
+let time = timer.elapsed();
+println!("{:?}\tcols summed: {:?}", time, sum);
+
+let timer = std::time::Instant::now();
+let _ = cols.clone();
+let time = timer.elapsed();
+println!("{:?}\tcols cloned", time);
+```
+Bit of a cheat in that last one that we sum the values without navigating the tree.
+Take that with the same grain of salt as when I tell you that I don't know whether the recursive (not iterative) traversal I coded for `Tree<T>` is breadth first or depth first.
+
+The results!
+```
+19.399209ms     tree summed: 9864100
+132.551834ms    tree cloned
+423.755666ms    cols formed
+49.098042ms     compared
+2.623875ms      cols summed: 9864100
+16.077542ms     cols cloned
+```
+First, the sums are the same, and we didn't see `UNEQUAL!!!` which is a good start.
+The times for manipulating the serial layout are generally smaller, but the time to form the serial layout is surprisingly high.
+I'm going to look at that next, but my guess is that it is a combination of 1. taking owned data and needing to de-allocate as it goes, and 2. not pre-allocating sizes, and so repeatedly doubling and copying.
+
+There's a lot more to test here, especially around navigation.
+We've given up the ability to push, pop, or otherwise alter the children of nodes.
+In exchange, it's a lot easier to move the result around.
+
+### JSON in serial
+
+JSON as represented by `serde_json` is .. just a complicated tree!
+```rust
+/// Stand in for JSON, from `serde_json`.
+pub enum Value {
+    Null,
+    Bool(bool),
+    Number(Number),
+    String(String),
+    Array(Vec<Value>),
+    /// Idk what the `Map` type is, so sorted list for us.
+    Object(Vec<(String, Value)>),
+}
+```
+The "children" here are in the `Array` and `Object` variants, where there are variable numbers of `Value` instances.
+It's not exactly the tree we have above, but .. it could be made to be.
+I didn't actually do this, but I think it's a good thought experiment, and we'll pretend that it is what I did so that I don't have to show you.
+
+```rust
+/// The non-recursive content of Value
+pub enum ValuePayload {
+    Null,
+    Bool(bool),
+    Number(Number),
+    String(String),
+    Array,
+    Object(Vec<String>),
+};
+
+/// Value is a tree, with some payload.
+type Value = Tree<ValuePayload>;
+```
+
+We are able to do the same transformation as we did with `Tree<T>`, where `T = ValuePayload`.
+I haven't shown you how you flatten out things like `Vec<String>`, but [the previous post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-24.md) goes in to some detail about that.
+And let's just say that I did this, mostly by hand unfortunately.
+I'm writing this out mostly to learn things like the above, and try and draw out the recurring patterns.
+The above transform doesn't *always* work out, and we were lucky that each variant had either zero or one `Vec<Value>` in it.
+
+We'll load up 1,000 JSON records taken from [the aforementioned fasterthanlime post](https://fasterthanli.me/articles/small-strings-in-rust).
+It's really small; about 244KB.
+We'll do the same things as with the `Tree`, minus the summing of values.
+I supposed we could do that, or add up some tree lengths, or what have you, but we did not.
+```
+Read 1000 records
+689.375µs       json_vals cloned
+372.166µs       json_cols formed
+107.083µs       compared
+29.792µs        json_cols cloned
+547.708µs       json_vals ser/de
+191.25µs        json_cols ser/de
+```
+Different numbers from above because I ran it again.
+But you can see that in this case even forming the serial layout is faster than cloning the thousand reconds.
+Cloning is about 20x faster in this case, and serialization is nearly 3x faster (though again, it should be closer to the 30us cloning time to `memcpy` the several primitive vectors).
+
+I googled "large json file" and found [a 25MB one](https://github.com/json-iterator/test-data/blob/master/large-file.json), and re-ran things with that.
+```
+Read 11351 records
+76.386042ms     json_vals cloned
+24.9265ms       json_cols formed
+12.32275ms      compared
+4.444583ms      json_cols cloned
+62.579458ms     json_vals ser/de
+42.471792ms     json_cols ser/de
+```
+As before, forming the serial layout is faster than cloning the data, and cloning the serial layout is faster still.
+There's less of a gap between serialization now, perhaps it is limited by the volume of data, but again the "right number" is probably closer to the cloning number that moves bytes around.
+
+I don't want to say that you should be using something like our serial layout, but I do a lot more reading of immutable JSON than I do creating and mutating JSON.
+It's a pretty good fit for the sort of things that I do.
+With a bit more spit and polish, and perhaps a `serde_json` deserialization routine that skips the owned types entirely, there might be something here.
+
+### Wrapping up
+
+I'm still as interested as ever, perhaps even more so, in laying data out so that it is easy to work with.
+This fights against some other goals, notably those around automatic management of mutable memory.
+At least, when you end up spending most of your computer slogging through pointer after pointer looking for what you really need, you start to think about alternatives.
+
+This a great moment to plug [the Broom paper](https://www.ionelgog.org/data/papers/2015-hotos-broom.pdf) from HotOS 2015, by several former colleagues.
+They look at big data systems and conclude that your memory management might be best in a few different modalities: some ephemerally collected, some long-lived ownership based, and some region allocated.
+The stuff we're talking about here are the bits suited for region allocation: the communication of bulk information between otherwise independent workers.
+We haven't even looked at the overheads of Rust's resource management across threads, where reconstituting the alloctations is that much more stressful for the  underlying allocator.