Flat Storage for Trie Nodes #9443

pugachAG · 2023-08-16T16:59:53Z

pugachAG
Aug 16, 2023
Collaborator

TLDR: Storing trie nodes in Flat Storage could make our storage 3x to 5x faster!

Context

This discussion outlines the Flat Nodes idea and prototyping results for improving storage write performance. It assumes that we keep trie structure as it is and do not change storage related gas costs.

Prototype code can be found in the fast-writes-proto branch.

Benchmark

In order to check prototype solutions it is important to have a robust, reproducible and quick-to-run benchmark.
"token.sweat" is the mainnet contract with one of the most storage-heavy loads. So it was decided to simulate its storage load as a benchmark. The idea is to take K random keys from its ContractData and issue trie storage reads for those keys (skipping flat storage). This way we reproduce read-for-writes trie nodes fetching.
Also worth mentioning that we need to do quite a few "warmup" requests to make sure that all sst files for underlying RocksDB column are loaded, otherwise the results are not realistic.
In order to make the benchmark repeatable we want to flush OS disk page cache before every execution, which is implemented as flush_disk_cache function.

Code: test_sweat.rs
The result for K=800 with the current storage based on State column is ~3.4 seconds.

execution logs

> neard database test-sweat
...
Flush disk page cache
Generate contract data keys
Read keys from the fs cache
Start tries warmup with 100 keys
Elapsed trie: 2.153015292s
Executing get_ref for 800 keys
Node reads: TrieNodesCount { db_reads: 3844, mem_reads: 19136 }, total gas: 15TGas
Elapsed trie: 3.431353493s
Finished SWEAT test

Flat Nodes

One idea is to essentially store trie nodes as part of Flat Storage. It means we store RawTrieNodeWithSize keyed by trie path to that node.
Code: flat_nodes.rs

Results

Benchmark result: ~1.26 seconds which is a 2.8x improvement. Running the benchmark with different --request-count argument values gives improvement in the range of 2x to 3x.

execution logs

> neard database test-sweat --flat-lookup-mode flat_nodes
...
Flush disk page cache
Generate contract data keys
Read keys from the fs cache
Start tries warmup with 100 keys
Elapsed trie: 2.313224252s
Elapsed flat: 542.043316ms
Executing get_ref for 800 keys
Node reads: TrieNodesCount { db_reads: 3844, mem_reads: 19136 }, total gas: 15TGas
Elapsed trie: 3.628824664s
Elapsed flat: 1.261878484s (2.88x)
flat db reads avg = 319.516µs
flat db reads cache hit ratio 0.4037460978147763
flat db reads p0.5 = 327.988µs
flat db reads p0.75 = 577.213µs
flat db reads p0.9 = 671.563µs
flat db reads p0.95 = 723.885µs
flat db reads p0.99 = 876.983µs
flat db reads p0.999 = 1.714274ms
node size avg = 335
node size p0.1 = 78
node size p0.25 = 79
node size p0.5 = 395
node size p0.75 = 523
node size p0.9 = 523
node size p0.95 = 523
Finished SWEAT test

Performance improvements attribution

Performance improvements can be explained by the following factors:

Better data locality on disk.
More RocksDB performance friendly data.

Better data locality on disk

For sparse trie nodes the child nodes are stored close to the parent one. It means that the next node on the trie path belongs to the same RocksDB block as the previous one. This way we save IO requests.
The effect of this can be measured by checking the number of RocksDB requests that didn't go to disk. In the benchmark we count such requests as the ones with the latency below 100us. In the execution logs above we see flat db reads cache hit ratio 0.4037460978147763 line which means that ~40% of RocksDB requests hit block cache.

More RocksDB performance friendly data

FlatNodes column is considerably smaller than State and it doesn't have a merge operator attached which can potentially negatively affect read performance. In order to test this improvement in isolation from data locality we create SmallState column which has the same structure as State, but only contains the nodes for a single state root and doesn't have a merge operator. The result is ~1.5x performance improvement.

execution logs

> neard database test-sweat --flat-lookup-mode small_state
...
Flush disk page cache
Generate contract data keys
Read keys from the fs cache
Start tries warmup with 100 keys
Elapsed trie: 2.330421097s
Elapsed flat: 1.309398059s
Executing get_ref for 800 keys
Node reads: TrieNodesCount { db_reads: 3844, mem_reads: 19136 }, total gas: 15TGas
Elapsed trie: 3.872136358s
Elapsed flat: 2.513826673s (1.54x)
flat db reads avg = 645.117µs
flat db reads cache hit ratio 0.003902185223725286
flat db reads p0.5 = 635.937µs
flat db reads p0.75 = 730.626µs
flat db reads p0.9 = 848.487µs
flat db reads p0.95 = 1.002263ms
flat db reads p0.99 = 1.688801ms
flat db reads p0.999 = 6.26145ms
node size avg = 335
node size p0.1 = 78
node size p0.25 = 79
node size p0.5 = 395
node size p0.75 = 523
node size p0.9 = 523
node size p0.95 = 523
Finished SWEAT test

Parallel nodes prefetching

Using node trie path instead of its hash as storage key makes it possible to prefetch nodes with all possible lookup key prefixes while we traverse the trie. Reads for non-existent prefixes should be fast since RocksDB uses bloom filters to avoid IO when the entry does not exist. This approach was implemented as Prefetcher struct.

Using FlatNodes along with the prefetching approach with 5 threads gives 5x improvement comparing to the baseline. Overall this approach give improvements in the range from 3x to 5x depending on the test requests count.

execution logs

> neard database test-sweat --flat-lookup-mode flat_nodes_prefetch
Start SWEAT: 800 test keys, 100 warmup keys, flat lookup mode: Some(FlatNodesWithPrefetcher { prefetcher_threads: 5 })
...
Flush disk page cache
Generate contract data keys
Read keys from the fs cache
Start tries warmup with 100 keys
Elapsed trie: 1.912040917s
Elapsed flat: 352.397263ms
Executing get_ref for 800 keys
Node reads: TrieNodesCount { db_reads: 3844, mem_reads: 19136 }, total gas: 15TGas
Elapsed trie: 3.984700418s
Elapsed flat: 790.573851ms (5.04x)
flat db reads avg = 196.231µs
flat db reads cache hit ratio 0.6693548387096774
flat db reads p0.5 = 960ns
flat db reads p0.75 = 338.709µs
flat db reads p0.9 = 729.169µs
flat db reads p0.95 = 807.965µs
flat db reads p0.99 = 940.075µs
flat db reads p0.999 = 1.261384ms
node size avg = 335
node size p0.1 = 78
node size p0.25 = 79
node size p0.5 = 395
node size p0.75 = 523
node size p0.9 = 523
node size p0.95 = 523
Finished SWEAT test

Further improvements

RocksDB tuning

Performance of FlatNodes column is more sensitive to RocksDB tuning: we play with block size, block cache size, block_restart_interval etc.

Better data layout

The default lexicographic order of keys for FlatNodes is still not great as it essentially implies nodes sorting in depth-first search order. Having node's all close descendants located near that node on the disk should be a more cache-friendly data layout. This can be achieved by recursively inlining node's children in breadth-first search order until we reach some size threshold (something like disk page cache size), though it is not clear how to support updates for such structure.

Challenges

Additional disk space and RAM

Maintaining another copy of state requires more resources:

~20GB of additional disk space
440MB of RAM for RocksDB column filter and index blocks
TBD amount of RAM for block cache, but we can reallocate some of the shard cache memory for that

This was measured by column_stats.rs

execution logs

> neard database column-stats -c FlatNodes
...
total size: keys 9.512161892GB, values 25.280195331GB
rocksdb.live-sst-files-size=19808004392
rocksdb.estimate-live-data-size=19808004392
rocksdb.compaction-pending=0
rocksdb.num-running-compactions=0
rocksdb.estimate-pending-compaction-bytes=0
rocksdb.estimate-table-readers-mem=440167526
rocksdb.block-cache-capacity=2197815296
rocksdb.block-cache-usage=2197170996
rocksdb.cur-size-active-mem-table=2048
rocksdb.size-all-mem-tables=2048
rocksdb.num-files-at-level0=0
rocksdb.num-files-at-level1=0
rocksdb.num-files-at-level2=0
rocksdb.num-files-at-level3=0
rocksdb.num-files-at-level4=0
rocksdb.num-files-at-level5=0
rocksdb.num-files-at-level6=302

Implementation complexity

We need to handle access to non-final blocks by maintaining those in a similar way as flat state deltas.

robin-near · 2023-08-22T16:13:19Z

robin-near
Aug 22, 2023
Collaborator

Awesome work, @pugachAG !! I was able to reproduce this on my gcp node. I also tried increasing the number of warmups and lookups both to 10000, and then for small_state there's only a 1.03x improvement whereas for flat_nodes (without prefetching), there was a 2.48x improvement. The latter is consistently high with other benchmark settings too.

Do I understand correctly that the difference between small_state and flat_nodes is that the former is encoded with hash but the latter is encoded with prefixes? If so, it would appear that most of the savings does come from data locality.

1 reply

pugachAG Aug 27, 2023
Collaborator Author

Thank you, @robin-near!

I was able to reproduce this on my gcp node.

That is great! And thanks for taking time to reproduce the results.

Do I understand correctly that the difference between small_state and flat_nodes is that the former is encoded with hash but the latter is encoded with prefixes? If so, it would appear that most of the savings does come from data locality.

Yes, that is the only difference between those columns. So yeah, to me it also looks like with a sufficiently high number of warmup requests all latency improvements come from data locality.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flat Storage for Trie Nodes #9443

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Flat Storage for Trie Nodes #9443

pugachAG Aug 16, 2023 Collaborator

Context

Benchmark

Flat Nodes

Results

Performance improvements attribution

Better data locality on disk

More RocksDB performance friendly data

Parallel nodes prefetching

Further improvements

RocksDB tuning

Better data layout

Challenges

Additional disk space and RAM

Implementation complexity

Replies: 1 comment · 1 reply

robin-near Aug 22, 2023 Collaborator

pugachAG Aug 27, 2023 Collaborator Author

pugachAG
Aug 16, 2023
Collaborator

Replies: 1 comment 1 reply

robin-near
Aug 22, 2023
Collaborator

pugachAG Aug 27, 2023
Collaborator Author