Skip to content

Latest commit

 

History

History
278 lines (226 loc) · 14.7 KB

2024-09-10.md

File metadata and controls

278 lines (226 loc) · 14.7 KB

A Fascinating Diversion into Compression

I've recently been working on columnar data representations in Rust. The intent is that if you have lists of some complex type T, say a struct or an enum or a list itself (Vec), you might have better options than storing the types themselves in a list. The columnar repository contains the work, and a previous post goes in to detail about the process.

What I want to talk about today is the curious case of Result<S, T>.

We'll start with a pretty standard intro to how you might represent this in a columnar form, but by the end of the post we'll have developed a way to compress JSON objects. I was suprised and delighted by the connection, which I stumbled upon more than discovered, and would love to hear more about if you understand where this came from.

Re-orienting Result to columnar form

The Result<S, T> type is a sum type, each instance of which can either be Ok(S) or Err(T). That is, it is either one type or the other type, but not both. It is different from a pair (S, T) which always has both types. In addition to either S or T, it also has to communicate which of the two it is (they might be the same, or the same size).

The size of a Result is determined by the sizes of S and T, but it clearly needs to be able to hold either of the two, and a spare bit to indicate which it is. For example, the size of a Result<u64, i64> is 16 bytes, but the size of a Result<u64, u8> is also 16 bytes. These might be larger than you expect because of alignment: types in Rust end up sized at least to an integral multiple of their alignment, which is generally at least the size of the largest member. If you have a sequence of Result<u64, u8> you'll need as 16 bytes times as many elements as you have, even if many of them are the u8 variant.

An alternate way to represent the same information as a sequence of results, is to demultiplex the two variants into two sequences, leaving enough information to re-interleave them.

/// Replacement for `Vec<Result<S, T>>`.
struct ResultColumns<S, T> {
    /// Variant, and corresponding offset.
    indexes: Vec<Result<usize, usize>>,
    /// The `S` variants in order.
    s_store: Vec<S>,
    /// The `T` variants in order.
    t_store: Vec<T>,
}

In this example, the backing data are stored in s_store and t_store, and elements can be retrieved from the information in indexes:

impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>` but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        match self.indexes[index] {
            Ok(index) => Ok(&self.s_store[index]),
            Err(index) => Err(&self.t_store[index]),
        }
    }
}

This pattern allows us to store S and T records separately, but with some overhead. Each element also requires a Result<usize, usize>, which .. is the same size as the hypothetical Result<u64, u8> we started with. So that's not great.

Succincter data structures

It turns out that indexes stores way more information than we strictly need. We need to be able to see which variant, Ok or Err, an element is, and figure out where to find it in the corresponding store.

We could alternately use a bit vector, a bit like a Vec<bool>, to record the information in indexes. To implement get(index) we would look in the bit vector to determine which variant is at that location. Then, to determine where to find it .. ah .. we could count the number of occurences of that bit that precedes the one we found.

impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>` but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        let bit = self.indexes[index];
        let pos = self.indexes
                      .iter()
                      .take(index)
                      .filter(|x| == bit)
                      .count();
        match bit {
            0 => Ok(&self.s_store[pos]),
            1 => Err(&self.t_store[pos]),
        }
    }
}

And you are probably already screaming about how inefficient this is.

This implementation is quite snug with respect to space. Beyond the S and T data itself, we store one additional bit for every record. It's hard to imagine using any less, but it's also hard to imagine going any slower than scanning all of self.indexes. The good news is that if we allow the memory to creep up just a .. bit .. we can recover random access.

In the field of succinct data structures folks study among other things the problem of succinct indexable dictionaries. We will think of these as bit sequences, with additional support for a rank function that says for each position how many bits are set before it. It turns out, this is exactly what we want: the rank function (combined with index) tells us exactly where to find our data! If we make the arbitrary decision that 1 corresponds to S, then it looks like this:

impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>` but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        let bit = self.indexes[index];
        let rank = self.indexes.rank(index);
        match bit {
            0 => Ok(&self.s_store[rank]),
            1 => Err(&self.t_store[index - rank]),
        }
    }
}

The cost of these succinct indexable dictionaries is "barely more than a bit". Formally, it needs to be 1 + o(1) for each element, so in the limit basically just a bit.

They are pretty complicated, so I implemented one that has exactly two bits for each element. This is not "succinct" in the technical sense, but instead "compact", according to Wikipedia. Whatever it is, you can draw the overhead down close to one bit, at some cost (Guy Jacobson's PhD thesis is ~$40 from ProQuest).

Compacter data structures

I thought I'd talk through my implementation, which is not very smart but is very easy.

struct CompactBits {
    counts: Vec<u64>,   // counts ones in preceding `values`.
    values: Vec<u64>,   // contains bits packed together.
    last_word: u64,     // in-progress bits, not yet 64 of them.
    last_bits: u8,      // number of in-progress bits.
}

Without going in to great detail, we pack all the bits into a u64, and put those in a list (values). At the same time, we also maintain the running sum of the number of ones in these blocks (counts). There may be some bits that aren't cleanly packed into a u64, and we hold on to these separately.

To randomly access bits we look things up either in values or in last_word. To determine the rank, we get the running sum from counts, and then count various bits in the word holding our bit.

As you can see, it's pretty simple and only two bits for each element because the counts pace the bits themselves. In fact, we could drop down to 1.5 bits by just using u32 for the running counts, because .. maybe we don't plan to hold more than 4B elements? Let's not do that now, but perhaps by the end of the post you'll have a better idea.

Adaptive Dictionary Compression

We are about to escalate things, but it's actually a very easy and pleasant path to take. Don't get stressed by the intimidating section heading!

When we implemented our get(index) function we were able to find our record in either variant storage. We did this by looking up its position, and then looking in the storage. You might have noticed that we either used rank or index - rank, depending on which variant it was. The one that corresponds to our variant .. was the answer we wanted! Tada!

What is at the other location, though?

For any index that results in rank, both s_store[rank] and t_store[index - rank] contain data. One of the two of them is the data we indexed to find. The other one of them is .. the most recent other variant in the list? What could you possibly want to do with that?

Let's imagine we have a sequence of S that we thought might have repetitions in it. There are probably some interesting ways to encode this, but here's a really easy one using ResultColumns<S, ()>:

  1. If the item is not a repeat of the item before it, insert Ok(item).
  2. If the item is a repeat of the item before it, insert Err(()).

We inserted repeats using the almost meaningless Err(()), an error variant containing the empty tuple (which takes no space and stores no information). We inserted non-repeats using Ok(item), which will land item into s_store. If we go and look things up the normal way, we'll find the non-repeats, and find nothing meaningful for the repeats.

Instead let's look things up a non-normal way.

  1. If we find an Ok(item) we will produce item.
  2. If we find an Err(()) we will .. instead look up the most recent Ok(item) at the time of the index's insertion and return item.

When we find the Err variant we actually find the data we are looking for in the other store. The encoding used the most recent value to lead us to choose the error variant, and we can decode with the same context.

The ResultColumns<S, ()> ends up storing two bits per element, and only the deduplicated S values (although only removing adjacent duplicates). That's potentially pretty handy compression, and we didn't even have to invent anything to do it. It just sort of happens.

Adaptive Dictionary Compression, part 2

Deduplication isn't exactly dictionary compression, so let's fix that.

ResultColumns<S, u8>

Boom. Ok. Done here.

Ah, in more detail, then. Rather than use Err(()) to encode "direct repetition", we'll use Err(offset) to indicate "recent repetition". The offset will tell us how far back to go in s_store to find our value. An offset of zero indicates "the previous value" and an offset of 10 indicates "go back ten values".

/// Insert an item by first checking the previous 256.
fn push(&mut self, item: S) {
    // Look backwards for a matching value.
    let offset = self.inner
                     .s_store
                     .iter()
                     .take(256)
                     .position(|x| x == item);

    if let Some(back) = offset {
        self.inner.push(Err(back));
    } else {
        self.inner.push(Ok(item));
    }
}

Although it may look like we are just looking at the current s_store for values, we'll be able to return to this exact point with index and rank.

/// Retrieve an item reference by index.
fn get(&self, index: usize) -> &S { 
    match self.inner.get(index) {
        Ok(item) => item,
        Err(back) => {
            // Go backwards from `s_store` at time of insertion.
            let pos = self.inner.indexes.rank(index) - 1;
            self.inner.s_store.get(pos - (*back as usize))
        },
    }
}

We've encoded elements in a sequence of S with (ideally) a u8 back-reference to a matching element, or if that fails then the element itself. Ideally this is often a u8 rather than whatever S is. Informally, we are using the previous 256 distinct S values as a dictionary (yes distinct, because how could they repeat?). As the sequence moves along, the dictionary we use adapts, admittedly in a primitive but not unhelpful way.

Trying it out

I have a bunch of JSON that I downloaded from the internet, and am trying to get to work with columnar. One thing you might know about JSON is that a JSON value can be an "object": a map from strings to other JSON values. Commonly, these strings are field or attribute names, and they do not always demonstrate a rich heterogeneity.

Without wanting to overwhelm you with detail (JSON is structurally recursive, and columnarization is complicated), let's look at how we might store objects. Our columnar JSON container has a member that initially looks like:

/// Columnar representation of `Vec<(String, Value)>`.
objects: VecColumns<(StringColumns, Vec<ValueIdx>)>,

There are a few details here, but the important ones are:

  1. StringColumns stores as many strings as you like, by concatenating their bytes and recording offsets.
  2. VecColumns stores as many list of things as you like, by concatenating the lists and recording offsets.
  3. There is a sneaky ( , ) combinator in there that stores pairs of things in pairs of storage.

The tl;dr is that all of the Strings across all of the objects will be packed in sequence in the StringColumns.

If you load up the 11,351 JSON records (26.1MB), you end up with 9,460,926 bytes in the StringColumns. These are unsurprisingly largely repetitions of things like "id", "login", and "gravatar_id". Ideally these will compress up nicely.

We'll make a minor modification to the type, wrapping StringColumns with a LookbackColumns wrapper that employs the techniques above.

/// Columnar representation of `Vec<(String, Value)>`.
objects: VecColumns<(LookbackColumns<StringColumns>, Vec<ValueIdx>)>,

This ends up with 772,022 bytes data in the LookbackColumns. Of those bytes, almost none are text. They break down as:

  • indexes: 153,840 bytes (two bits per element)
  • s_store: 2,958 bytes text (all non-repeat elements)
  • t_store: 615,224 bytes (one byte per repeat element)

The bytes in t_store are surprisingly varied. They go as large as 168, for which there are 10,162 entries (out of at least 615,224 strings used as names in objects). Clearly this will depend on your data, as will the efficacy of the approach generally.

Rounding up

I basically stumbled on the approach above while poking around learning about succinct data structures (which the thing I implemented is not). I was impressed by how little overhead there can be to recording variants of different sizes, given how accustomed to it I have become to the bloat in Rust, and all without having to give up random access.

But I was more flabbergasted by the ability to use the other variants as a way of compressing information. Just recording offsets backwards results in a natural dictionary encoding that requires no auxiliary structures or code or what have you. It just kind of works.

I have a few other things planned, to try and fit other techniques into the same framework. For example, many of the string and vector offsets are (or are nearly) linear functions of index. This happens when you are actually depositing fixed lengths that are not know a priori. It seems entirely reasonable to encode the offsets as Result<usize, u8> indicating either a fixed position, or an edit to a linear interpolation from the previous fixed position. Or (goodness) some other more complex interpolation from the previous several points.