QP tries are faster than crit-bit tries because they extract up to four bits of information out of the key per branch, whereas crit-bit tries only get one bit per branch. If branch nodes have more children, then the trie should be shallower on average, which should mean lookups are faster.
So if we extract even more bits of information from the key per branch, it should be even better, right? But, there are downsides to wider fanout.
I originally chose to index by nibbles for two reasons: they are easy to pull out of the key, and the bitmap easily fits in a word with room for a key index.
If we index keys by 5-bit chunks we pay a penalty for more complicated indexing.
If we index keys by 6-bit chunks, trie nodes have to be bigger to fit a bigger bitmap, so we pay a memory overhead penalty.
How do these costs compare to the benefits of wider branches?
(The next two sections expand on the downsides; skip to "results" for the tl;dr.)
If we use the key 5 bits at a time, the bitmap is 2^5 == 32 bits. This leaves room in a 64 bit word for a 28 bit index (limiting keys to 256 Mbytes), three bits to store the shift/alignment of the 5 bit field relative to 8 bit bytes, and a 1 bit branch/leaf flag.
When indexing by 4 bit chunks, we only need to look at one byte of key at a time; the chunk can be in the top half or the bottom half. For larger chunks we often need to extract a chunk that overlaps two bytes. To identify a 5-bit chunk we store the byte index of the first byte it overlaps, and how much to shift the pair of bytes to extract the chunk we want. (The aim is to avoid complicated arithmetic in the inner loops.) This gives us a lot more freedom to identify chunks than we actually use, because although there are 8 possible shifts at any particular byte index, we only use one or two of them. The shifts have to be consistent with representing the key as a string of successive 5-bit chunks.
The diagram below shows the correspondence between byte indexes (i
)
and valid chunk alignment shifts (s
). The shift value is how much to
shift the chunk left to move it to the top of its byte, so shift
values and indexes increase from left to right. This fact is used when
working out how deep a new branch node is; for instance, when i%5==1
the possible shifts are 2 and 7, so the branch for the chunk with
shift 2 has to be a parent of 7. Although shift 5 overlaps the same
byte, it can only occur when i%5=0
.
i%5==0 i%5==1 i%5==2 i%5==3 i%5==4
| | | | | | bytes
7654321076543210765432107654321076543210
| | | | | | | | | chunks
s=0 s=5 s=2 s=7 s=4 s=1 s=6 s=3
When we are working out which is the first chunk where two keys differ (so we can work out where to insert a new branch) we start off by scanning to find the first byte that differs. But, if the first differing bits are in the top of the byte, they can be in a chunk which overlaps the preceding byte. In those cases we have to subtract one from the index byte.
If we use the key in 6-bit chunks, we get most of the lookup complexity of 5-bit chunks, plus added size problems.
The bitmap now is 2^6 == 64 bits, a whole word. So we have to expand trie nodes to three words: bitmap, index, and twigs pointer. This means there's now a word of wasted space in the leaf nodes, because they have to be the same size as branch nodes, but they only need two words for key and value pointers.
My code borrows the caller's existing pointer to the key, rather than taking a copy of the key for its own use. If I change it to take a copy, then I could save the wasted space in big leaf nodes by using it to store short keys inline. But for now I've just done enough of a proof of concept to see what is the effect of wider branches, without re-doing the memory ownership model.
The other potential loss for 6-bit chunks is the potential size of the twig array, up to 3864 = 1536 bytes. This is 24 cache lines, so prefetching is unlikely to be much help if the twig index is large, because the prefetch guess is way off.
The data sets I use are:
-
in-b9
: words extracted from the BIND-9 source code- 63k lines, 784k bytes, av. len. 12.3
- lots of common prefixes
-
in-dns
: list of domain names under cam.ac.uk.- 314k lines, 14 049k bytes, av. len. 44.7
-
in-rdns
: same as above, but all keys reversed- huge common prefixes
-
in-usdw
:/usr/share/dict/words
- 236k lines, 2 493k bytes, av. len. 10.7
- English dictionary
-
top-1m
: the Alexa top million web site list- 1000k lines, 15 559k bytes, av. len. 15.6
- numeric rankings removed
A benchmark run takes four time measurements (lower is better): read the data into memory (without timing), "load" it into the trie, "search" for 1 000 000 pseudorandomly selected keys, "mutate" (i.e. add or delete) 1 000 000 pseudorandomly selected keys, reload any missing items (without timing), then "free" the entire trie one key at a time. A benchmark run has an explicit PRNG seed.
Each benchmark iteration has a fixed PRNG seed for all runs. It performs benchmark runs with every data set and every version of the data structure under test (5 x 4 runs). The versions of the data structure are:
- cb: crit-bit tries
- qp: quadbit popcount patricia tries
- fp: 5-bit version of qp tries
- wp: 6-bit version of qp tries
Benchmarks are iterated and we take the mean.
The following table gives some information about the sizes of the data
structures: number of leaf nodes, number of branch nodes, "overhead"
i.e. number of words of storage per leaf (not counting key and value
pointers), and the average depth of the trie. The in-dns
file seems
to be particularly weird.
leaves branches OH depth
cb 63603 63602 2.00 29.93 in-b9
qp 63603 35414 1.11 12.45
fp 63603 32490 1.02 10.56
wp 63603 31311 2.48 9.27
cb 314309 314308 2.00 42.19 in-dns
qp 314309 95094 0.61 17.53
fp 314309 119916 0.76 16.87
wp 314309 95866 1.92 14.20
cb 314309 314308 2.00 38.53 in-rdns
qp 314309 188252 1.20 20.67
fp 314309 186714 1.19 18.78
wp 314309 173500 2.66 17.93
cb 235882 235881 2.00 26.50 in-usdw
qp 235882 169434 1.44 12.46
fp 235882 154272 1.31 10.34
wp 235882 144357 2.84 9.02
cb 1000000 999999 2.00 30.76 top-1m
qp 1000000 617027 1.23 12.52
fp 1000000 560985 1.12 10.45
wp 1000000 514113 2.54 8.95
Overall, 5-bit chunks win over 4-bit chunks. 6-bit chunks are not a win on my slow computers, but are a win on a bigger Xeon. (All these machines are pretty old.)
- laptop
- MacBook Core 2 Duo P8600 2.4GHz, Mac OS 10.11.1
- Apple LLVM version 7.0.0 (clang-700.0.72)
-O3 -march=native -mtune=native
__builtin_popcount
in-b9 | in-dns | in-rdns | in-usdw | top-1m | ||
---|---|---|---|---|---|---|
cb | 0.030 | 0.173 | 0.151 | 0.095 | 1.438 | free |
qp | 0.028 | 0.153 | 0.155 | 0.099 | 1.017 | |
fp | 0.026 | 0.165 | 0.144 | 0.096 | 0.924 | |
wp | 0.029 | 0.164 | 0.155 | 0.100 | 0.945 | |
cb | 0.052 | 0.181 | 0.199 | 0.081 | 1.425 | load |
qp | 0.053 | 0.250 | 0.276 | 0.131 | 1.170 | |
fp | 0.051 | 0.233 | 0.265 | 0.129 | 1.102 | |
wp | 0.053 | 0.231 | 0.279 | 0.130 | 1.127 | |
cb | 0.489 | 1.314 | 1.221 | 1.064 | 1.855 | mutate |
qp | 0.461 | 0.997 | 1.046 | 0.833 | 1.292 | |
fp | 0.425 | 0.988 | 1.005 | 0.773 | 1.163 | |
wp | 0.461 | 0.995 | 1.051 | 0.822 | 1.131 | |
cb | 0.465 | 1.090 | 1.079 | 0.861 | 1.756 | search |
qp | 0.387 | 0.883 | 0.921 | 0.708 | 1.139 | |
fp | 0.358 | 0.836 | 0.888 | 0.657 | 1.002 | |
wp | 0.426 | 0.828 | 0.932 | 0.710 | 0.973 |
- desktop
- Dell Core 2 Duo E8500 3.16GHz, FreeBSD 8
- gcc 4.2.1
-O3 -march=native -mtune=native
HAVE_SLOW_POPCOUNT
in-b9 | in-dns | in-rdns | in-usdw | top-1m | ||
---|---|---|---|---|---|---|
cb | 0.017 | 0.085 | 0.071 | 0.041 | 0.809 | free |
qp | 0.020 | 0.102 | 0.087 | 0.056 | 0.643 | |
fp | 0.020 | 0.102 | 0.085 | 0.055 | 0.603 | |
wp | 0.020 | 0.103 | 0.091 | 0.057 | 0.669 | |
cb | 0.027 | 0.125 | 0.142 | 0.051 | 0.940 | load |
qp | 0.029 | 0.164 | 0.188 | 0.079 | 0.734 | |
fp | 0.028 | 0.159 | 0.186 | 0.080 | 0.693 | |
wp | 0.028 | 0.151 | 0.194 | 0.080 | 0.733 | |
cb | 0.380 | 0.889 | 0.829 | 0.642 | 1.307 | mutate |
qp | 0.356 | 0.711 | 0.764 | 0.518 | 0.940 | |
fp | 0.342 | 0.717 | 0.749 | 0.481 | 0.879 | |
wp | 0.345 | 0.741 | 0.815 | 0.561 | 0.907 | |
cb | 0.301 | 0.740 | 0.730 | 0.521 | 1.202 | search |
qp | 0.262 | 0.592 | 0.661 | 0.443 | 0.798 | |
fp | 0.251 | 0.596 | 0.644 | 0.412 | 0.730 | |
wp | 0.255 | 0.610 | 0.704 | 0.488 | 0.737 |
- server
- Generic Xeon E5620 2.40GHz, SLES 11 SP3
- gcc 4.3.4
-O3 -march=native -mtune=native -mpopcnt
__builtin_popcnt
in-b9 | in-dns | in-rdns | in-usdw | top-1m | ||
---|---|---|---|---|---|---|
cb | 0.020 | 0.106 | 0.095 | 0.041 | 0.789 | free |
qp | 0.014 | 0.087 | 0.073 | 0.042 | 0.457 | |
fp | 0.013 | 0.089 | 0.074 | 0.042 | 0.414 | |
wp | 0.013 | 0.081 | 0.071 | 0.039 | 0.415 | |
cb | 0.038 | 0.200 | 0.225 | 0.073 | 1.172 | load |
qp | 0.026 | 0.160 | 0.191 | 0.077 | 0.667 | |
fp | 0.027 | 0.160 | 0.193 | 0.080 | 0.612 | |
wp | 0.024 | 0.139 | 0.183 | 0.071 | 0.615 | |
cb | 0.481 | 0.865 | 0.794 | 0.641 | 1.261 | mutate |
qp | 0.285 | 0.495 | 0.550 | 0.385 | 0.721 | |
fp | 0.270 | 0.520 | 0.542 | 0.355 | 0.662 | |
wp | 0.256 | 0.494 | 0.534 | 0.350 | 0.634 | |
cb | 0.473 | 0.880 | 0.844 | 0.611 | 1.292 | search |
qp | 0.297 | 0.522 | 0.607 | 0.391 | 0.737 | |
fp | 0.275 | 0.526 | 0.584 | 0.360 | 0.668 | |
wp | 0.264 | 0.506 | 0.582 | 0.361 | 0.644 |
Written by Tony Finch dot@dotat.at https://dotat.at/; You may do anything with this. It has no warranty. https://creativecommons.org/publicdomain/zero/1.0/