Skip to content

Commit

Permalink
serial trees
Browse files Browse the repository at this point in the history
  • Loading branch information
frankmcsherry committed Aug 25, 2024
1 parent dd802bb commit 9c4d548
Showing 1 changed file with 369 additions and 0 deletions.
369 changes: 369 additions & 0 deletions posts/2024-08-25.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,369 @@
## More serial layout in Rust

I have been poking again at laying out complicated Rust types in a simpler serial layout.
The idea, [discussed in an earlier post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-24.md), is to repeatedly find simpler surrogates for `Vec<T>` for various types `T`, each time making the surrogate into lists of simpler types.
For example,
1. For primitive types `T`, `Vec<T>` can stay how it is!
2. For product types, e.g. `Vec<(S, T)>` can be replaced by the pair `(Vec<S>, Vec<T>)`.
3. For sum types, e.g. `Vec<Result<S, T>>` can be replaced by the triple `(Vec<S>, Vec<T>, Vec<Result<usize, usize>>)`.
This one could be succincter, for sure.
4. For list types, e.g. `Vec<Vec<T>>` can be replaced by the pair `(Vec<usize>, Vec<T>)`.

If you repeat this process, you end up with a `struct` that has only `Vec<T>` for primitive `T`.
That is, assuming you started with a product, sum, or list combination of primitive types.
If you start with a type that doesn't reduce down in these ways for some reason, you get stuck.
In fact, the first case, of just using `Vec<T>`, works for all types, but you don't get any benefit.

But there's a very idiomatic way to build types that end up irreducible.
The venerable tree.
```rust
struct Tree<T> {
data: T,
kids: Vec<Tree<T>>,
}
```
When you pull apart a `Tree<T>`, you don't find simpler types inside.
You find .. more `Tree<T>`.

However, there's a fairly easy way to lay out trees like this in serial form.
We're going to walk through a way to build up a surrogate for `Vec<Tree<T>>`.
We'll follow that, in detail that depends on our complexity budget, with a way to build up a surrogate for `Vec<Json>`, which happens to be a tree with a bunch of weird complexity.
We'll end up with a representation that trades away mutability (sorry!) for performance:
```
Read 1000 records
705.291µs json_vals cloned
41.209µs json_cols cloned
603.5µs json_vals ser/de
223.5µs json_cols ser/de
```
Here we load 1000 JSON records shamelessly lifted from [a fasterthanlime post](https://fasterthanli.me/articles/small-strings-in-rust).
We then both clone the 1000 JSON values themselves, and the 1000 values stored in a columnar representation.
The simpler layout makes this substantially easier.
Serialization and deserialization are also better, round tripping through `bincode`, though the time is certainly higher than the `memcpy`s it could be (presumably it should look more like the cloning time).

---

Though I had to cheat because apparently `serde_json` and `bincode` don't work together?
I swapped out the `serde_json::Number` type for a `usize` for the serialization tests; I guess it uses some tagging features that leave `bincode` wrong-footed.
I'm a bit surprised that these core Rust things don't work together off the shelf.

---

### A layout for trees

Laying trees out in a serial order is not new, and may not be a surprise to many of you, or even any of you!
But, I like writing things down, because a few years from now I'll try to re-invent this and it will be helpful to have the conclusions already at hand.

The type we are going to start with is a tree with some `data: T` at each tree node, and a variable number of children.
```rust
/// A tree; some data.
struct Tree<T> {
data: T,
kids: Vec<Tree<T>>,
}
```

In conventional Rust layout, each of the non-empty `kids` will be backed by their own allocations.
This is great, because each can be independently manipulated, and you can implement all sorts of tree algorithms that require mutation.

We are going to focus on trees that shall not be mutated.
I think you'll be able to change the `data`, but certainly not the `kids`.
The motivation is inherited from the previous work: many cases where I would like to ship some data around, from some source of truth to interested recipients.
The recipients are there to observe the data, not to change it.

Our plan is not very sophisticated: we will perform a pre-order traversal and write the nodes down.
When we write them down, we'll obviously record `data` but we'll need to write *something* for `kids`.
In this case, as we are building a container for multiple `Tree<T>` instances, we'll jot down the bounds that describe where to find the children in the container.
The only challenge is that, as a pre-order traversal, we haven't written them anywhere yet.
Fortunately, this will be a non-issue based on our traversal order!

```rust
/// A stand-in for `Vec<Tree<T>>`
#[derive(Clone)]
pub struct ColumnTree<T> {
pub groups: Vec<usize>, // inserted tree delimiters.
pub bounds: Vec<usize>, // node child delimiters.
pub values: Vec<T>, // node values.
}
```

Both `bounds` and `values` have ~the same length, and each index for both will correspond to a tree node.
The node's data are found in `values`, and the `bounds` indicate where to look for the node's children.
When we insert a tree, we'll populate as many of these as we have instances of `Tree<T>` in our tree.
The offsets in `groups` will delimit ranges of indexes corresponding to whole trees that have been inserted;
we need this because we insert multiple trees, and want to be able to find their roots.
The first index in each range delimited by `groups` will be the root of the tree.

The plan is to implement a `ColumnTree::push` function that accepts trees and transcribes them.
```rust
impl<T> ColumnTree<T> {
// Pushes a tree containing data onto `self`.
pub fn push(&mut self, tree: Tree<T>) {
// Our plan is to repeatedly transcribe tree nodes, enqueueing
// any child nodes for transcription. When we do, we'll need to
// know where they will be written, to leave a forward reference.
// We'll derive this by adding `values.len()` and `todo.len()`.
let mut todo = std::collections::VecDeque::default();
todo.push_back(tree);
while let Some(node) = todo.pop_front() {
// Children will land at positions in `self.values` determined
// by its current length, plus `todo.len()`, plus one (for node).
let cursor = self.values.len() + todo.len() + 1;
self.values.push(node.data);
self.bounds.push(cursor + node.kids.len());
for child in node.kids.into_iter() {
todo.push_back(child);
}
}
// Commit the new number of nodes to the delimiters.
self.groups.push(self.values.len());
}
}
```

The interesting-to-me moment is when we need to write down something is `self.bounds` to indicate where we might find the node's children.
Although we may still have many nodes to transcribe, we know where we are now (`values.len()`) and how much outstanding work there is (`todo.len()`), and putting these together we can figure out where each enqueued child will end up.
This is because we are doing a breadth-first traversal, which isn't necessarily the best order as it maintains the most intermediate state, and perhaps with some more thinking we could rework this.
The write-once approach we have here is appealing in a setting where we might ship (to disk, or network) the parts of the `ColumnTree` we've written.

To navigate the tree we'll mostly just follow pointers around, or rather *offsets* from pointers.
We'll use a handy helper type for this:
```rust
/// A stand-in for `&Tree<T>`
pub struct ColumnTreeRef<'a, T> {
value: &'a T,
lower: usize,
upper: usize,
nodes: &'a ColumnTree<T>
}
```
Here the value is clearly just a reference to the value, and the children are described by the lower and upper node indexes that can be found in the `nodes` reference.
```rust
impl<'a, T> ColumnTreeRef<'a, T> {
/// A reference to the node's value.
pub fn value(&self) -> &T { self.value }
/// A reference to a specific child.
pub fn child(&self, index: usize) -> ColumnTreeRef<'a, T> {
assert!(index < self.upper - self.lower);
let child = self.lower + index;
ColumnTreeRef {
value: &self.nodes.values[child],
lower: self.nodes.bounds[child],
upper: self.nodes.bounds[child+1],
nodes: self.nodes,
}
}
/// The number of children.
pub fn kids(&self) -> usize { self.upper - self.lower }
}
```
I even went and implemented `PartialEq<Tree<T>>` for this type, to make validation easier.
```rust
impl<'a, T: PartialEq> PartialEq<Tree<T>> for ColumnTreeRef<'a, T> {
fn eq(&self, other: &Tree<T>) -> bool {
let mut todo = vec![(*self, other)];
while let Some((this, that)) = todo.pop() {
if this.value != &that.data {
return false;
} else if (this.upper - this.lower) != that.kids.len() {
return false;
} else {
for (index, child) in that.kids.iter().enumerate() {
todo.push((this.child(index), child));
}
}
}
true
}
}
```
This is iterative rather than recursive, but that's usually a smart thing to do if you don't like stack overflows.
It also relies on a `Copy` implementation I didn't show you; not complicated, but needlessly verbose.

A few more implementation details round out our `ColumnTree` implementation.
```rust
impl<T> ColumnTree<T> {
/// Access the root of the `index`th tree.
pub fn index(&self, index: usize) -> ColumnTreeRef<'_, T> {
let root = self.groups[index];
ColumnTreeRef {
value: &self.values[root],
lower: self.bounds[root]+1, // +1 because ...
upper: self.bounds[root+1],
nodes: self,
}
}
/// Create an empty `ColumnTree`.
pub fn new() -> Self {
Self {
groups: vec![0], // init delimiters with zero.
bounds: vec![0], // init delimiters with zero.
values: Vec::default(),
}
}
}
```
The nit above is that strictly speaking the root of each tree lists itself as a child.
This is .. a quirk of how we've written things down, using delimiters rather than twice the memory to have explicit lower and upper bounds.
It's not the end of the world, but does require care at a few moments.

### Trying out trees

We'll make some nonsense trees, with lots of nodes.
```rust
let mut tree = tree::Tree { data: 0, kids: vec![] };
for i in 0 .. 11 {
let mut kids = Vec::with_capacity(i);
for _ in 0 .. i {
kids.push(tree.clone());
}
tree.data = i;
tree.kids = kids;
}
```
Each level in this tree has as many children as the level, so the tree size goes up .. factorially?

We'll do a few things with the tree itself, just to start.
We'll sum up the values, and clone the tree, to get a sense for navigating and copying the type.
```rust
let timer = std::time::Instant::now();
let sum = tree.sum();
let time = timer.elapsed();
println!("{:?}\ttree summed: {:?}", time, sum);

let timer = std::time::Instant::now();
let clone = tree.clone();
let time = timer.elapsed();
println!("{:?}\ttree cloned", time);
```
We'll then form the columnar tree, and validate the equivalence.
```rust
let timer = std::time::Instant::now();
let mut cols = tree::ColumnTree::new();
cols.push(tree);
let time = timer.elapsed();
println!("{:?}\tcols formed", time);

let timer = std::time::Instant::now();
if cols.index(0) != clone {
println!("UNEQUAL!!!");
}
let time = timer.elapsed();
println!("{:?}\tcompared", time);
```
Finally, we'll sum the values in the serial layout, and clone the column tree.
```rust
let timer = std::time::Instant::now();
let sum = cols.values.iter().sum::<usize>();
let time = timer.elapsed();
println!("{:?}\tcols summed: {:?}", time, sum);

let timer = std::time::Instant::now();
let _ = cols.clone();
let time = timer.elapsed();
println!("{:?}\tcols cloned", time);
```
Bit of a cheat in that last one that we sum the values without navigating the tree.
Take that with the same grain of salt as when I tell you that I don't know whether the recursive (not iterative) traversal I coded for `Tree<T>` is breadth first or depth first.

The results!
```
19.399209ms tree summed: 9864100
132.551834ms tree cloned
423.755666ms cols formed
49.098042ms compared
2.623875ms cols summed: 9864100
16.077542ms cols cloned
```
First, the sums are the same, and we didn't see `UNEQUAL!!!` which is a good start.
The times for manipulating the serial layout are generally smaller, but the time to form the serial layout is surprisingly high.
I'm going to look at that next, but my guess is that it is a combination of 1. taking owned data and needing to de-allocate as it goes, and 2. not pre-allocating sizes, and so repeatedly doubling and copying.

There's a lot more to test here, especially around navigation.
We've given up the ability to push, pop, or otherwise alter the children of nodes.
In exchange, it's a lot easier to move the result around.

### JSON in serial

JSON as represented by `serde_json` is .. just a complicated tree!
```rust
/// Stand in for JSON, from `serde_json`.
pub enum Value {
Null,
Bool(bool),
Number(Number),
String(String),
Array(Vec<Value>),
/// Idk what the `Map` type is, so sorted list for us.
Object(Vec<(String, Value)>),
}
```
The "children" here are in the `Array` and `Object` variants, where there are variable numbers of `Value` instances.
It's not exactly the tree we have above, but .. it could be made to be.
I didn't actually do this, but I think it's a good thought experiment, and we'll pretend that it is what I did so that I don't have to show you.

```rust
/// The non-recursive content of Value
pub enum ValuePayload {
Null,
Bool(bool),
Number(Number),
String(String),
Array,
Object(Vec<String>),
};

/// Value is a tree, with some payload.
type Value = Tree<ValuePayload>;
```

We are able to do the same transformation as we did with `Tree<T>`, where `T = ValuePayload`.
I haven't shown you how you flatten out things like `Vec<String>`, but [the previous post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-24.md) goes in to some detail about that.
And let's just say that I did this, mostly by hand unfortunately.
I'm writing this out mostly to learn things like the above, and try and draw out the recurring patterns.
The above transform doesn't *always* work out, and we were lucky that each variant had either zero or one `Vec<Value>` in it.

We'll load up 1,000 JSON records taken from [the aforementioned fasterthanlime post](https://fasterthanli.me/articles/small-strings-in-rust).
It's really small; about 244KB.
We'll do the same things as with the `Tree`, minus the summing of values.
I supposed we could do that, or add up some tree lengths, or what have you, but we did not.
```
Read 1000 records
689.375µs json_vals cloned
372.166µs json_cols formed
107.083µs compared
29.792µs json_cols cloned
547.708µs json_vals ser/de
191.25µs json_cols ser/de
```
Different numbers from above because I ran it again.
But you can see that in this case even forming the serial layout is faster than cloning the thousand reconds.
Cloning is about 20x faster in this case, and serialization is nearly 3x faster (though again, it should be closer to the 30us cloning time to `memcpy` the several primitive vectors).

I googled "large json file" and found [a 25MB one](https://github.com/json-iterator/test-data/blob/master/large-file.json), and re-ran things with that.
```
Read 11351 records
76.386042ms json_vals cloned
24.9265ms json_cols formed
12.32275ms compared
4.444583ms json_cols cloned
62.579458ms json_vals ser/de
42.471792ms json_cols ser/de
```
As before, forming the serial layout is faster than cloning the data, and cloning the serial layout is faster still.
There's less of a gap between serialization now, perhaps it is limited by the volume of data, but again the "right number" is probably closer to the cloning number that moves bytes around.

I don't want to say that you should be using something like our serial layout, but I do a lot more reading of immutable JSON than I do creating and mutating JSON.
It's a pretty good fit for the sort of things that I do.
With a bit more spit and polish, and perhaps a `serde_json` deserialization routine that skips the owned types entirely, there might be something here.

### Wrapping up

I'm still as interested as ever, perhaps even more so, in laying data out so that it is easy to work with.
This fights against some other goals, notably those around automatic management of mutable memory.
At least, when you end up spending most of your computer slogging through pointer after pointer looking for what you really need, you start to think about alternatives.

This a great moment to plug [the Broom paper](https://www.ionelgog.org/data/papers/2015-hotos-broom.pdf) from HotOS 2015, by several former colleagues.
They look at big data systems and conclude that your memory management might be best in a few different modalities: some ephemerally collected, some long-lived ownership based, and some region allocated.
The stuff we're talking about here are the bits suited for region allocation: the communication of bulk information between otherwise independent workers.
We haven't even looked at the overheads of Rust's resource management across threads, where reconstituting the alloctations is that much more stressful for the underlying allocator.

0 comments on commit 9c4d548

Please sign in to comment.