Replies: 1 comment 1 reply
Awesome work, @pugachAG!! I was able to reproduce this on my GCP node. I also tried increasing both the number of warmups and the number of lookups to 10000: for small_state there's only a 1.03x improvement, whereas for flat_nodes (without prefetching) there is a 2.48x improvement. The latter stays consistently high with other benchmark settings too. Do I understand correctly that the difference between small_state and flat_nodes is that the former keys nodes by hash while the latter keys them by trie-path prefix? If so, it would appear that most of the savings come from data locality.
-
TLDR: Storing trie nodes in Flat Storage could make our storage 3x to 5x faster!
Context
This discussion outlines the Flat Nodes idea and prototyping results for improving storage write performance. It assumes that we keep the trie structure as it is and do not change storage-related gas costs.
Prototype code can be found in the fast-writes-proto branch.
Benchmark
In order to check prototype solutions it is important to have a robust, reproducible and quick-to-run benchmark.
"token.sweat" is the mainnet contract with one of the most storage-heavy loads. So it was decided to simulate its storage load as a benchmark. The idea is to take
K
random keys from itsContractData
and issue trie storage reads for those keys (skipping flat storage). This way we reproduce read-for-writes trie nodes fetching.Also worth mentioning that we need to do quite a few "warmup" requests to make sure that all sst files for underlying RocksDB column are loaded, otherwise the results are not realistic.
In order to make the benchmark repeatable we want to flush OS disk page cache before every execution, which is implemented as
flush_disk_cache
function.Code: test_sweat.rs
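A minimal sketch of the benchmark shape described above, assuming Linux for the page-cache flush; the trie lookup itself is passed in as a closure, since the real trie-walking code lives in test_sweat.rs:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::process::Command;
use std::time::Instant;

/// Drop the OS disk page cache so every run starts cold (Linux only,
/// requires root). Run `sync` first so dirty pages are written out.
fn flush_disk_cache() -> std::io::Result<()> {
    Command::new("sync").status()?;
    OpenOptions::new()
        .write(true)
        .open("/proc/sys/vm/drop_caches")?
        .write_all(b"3")
}

/// `trie_get` stands in for a trie storage read that skips flat storage.
fn run_benchmark<F: FnMut(&[u8]) -> Option<Vec<u8>>>(
    mut trie_get: F,
    warmup_keys: &[Vec<u8>],
    keys: &[Vec<u8>],
) {
    // Warmup requests make RocksDB open and index all SST files of the
    // column; without them the measured numbers are not realistic.
    for key in warmup_keys {
        let _ = trie_get(key);
    }
    // Flush the page cache so the measured reads actually hit the disk.
    flush_disk_cache().expect("flushing the disk cache requires root");
    let started = Instant::now();
    for key in keys {
        let _ = trie_get(key);
    }
    println!("{} lookups took {:?}", keys.len(), started.elapsed());
}
```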
The result for K=800 with the current storage based on the State column is ~3.4 seconds.
execution logs
Flat Nodes
One idea is to essentially store trie nodes as part of Flat Storage: we store each RawTrieNodeWithSize keyed by the trie path to that node.
Code: flat_nodes.rs
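To illustrate what "keyed by trie path" could look like, here is a hypothetical key encoding (the actual scheme in flat_nodes.rs may differ): pack the nibble path two nibbles per byte and append the path length, so that a parent sorts right before its descendants and nodes on one lookup path end up adjacent on disk:

```rust
/// A trie path is a sequence of nibbles (4-bit values) from the root.
/// Hypothetical encoding: two nibbles per byte, zero-padded, plus a
/// trailing length byte to disambiguate odd/even-length paths (paths
/// longer than 255 nibbles would need a wider length field).
fn flat_nodes_key(nibbles: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(nibbles.len() / 2 + 1);
    for pair in nibbles.chunks(2) {
        let hi = pair[0] << 4;
        let lo = pair.get(1).copied().unwrap_or(0);
        key.push(hi | lo);
    }
    key.push(nibbles.len() as u8);
    key
}
```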
Results
Benchmark result: ~1.26 seconds, which is a 2.8x improvement. Running the benchmark with different --request-count argument values gives improvements in the range of 2x to 3x.
execution logs
Performance improvements attribution
Performance improvements can be explained by the following factors:
Better data locality on disk
With path-based keys, the children of sparse trie nodes are stored close to their parent, which means the next node on a trie path often belongs to the same RocksDB block as the previous one. This way we save IO requests.
The effect of this can be measured by counting the RocksDB requests that didn't go to disk. In the benchmark we treat a request as cached if its latency is below 100us. In the execution logs above, the line "flat db reads cache hit ratio 0.4037460978147763" means that ~40% of RocksDB requests hit the block cache.
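A sketch of this measurement technique, with `db_get` as a stand-in for a raw RocksDB column read: any read that completes in under 100us is assumed to have been served from the block cache (or OS page cache) rather than the disk:

```rust
use std::time::{Duration, Instant};

/// Latency threshold below which a read is assumed to be a cache hit.
const CACHE_HIT_THRESHOLD: Duration = Duration::from_micros(100);

/// Returns the fraction of reads that (by the latency heuristic)
/// never reached the disk.
fn cache_hit_ratio<F: FnMut(&[u8]) -> Option<Vec<u8>>>(
    mut db_get: F,
    keys: &[Vec<u8>],
) -> f64 {
    let mut hits = 0usize;
    for key in keys {
        let started = Instant::now();
        let _ = db_get(key);
        if started.elapsed() < CACHE_HIT_THRESHOLD {
            hits += 1;
        }
    }
    hits as f64 / keys.len() as f64
}
```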
More RocksDB performance friendly data
The FlatNodes column is considerably smaller than State and it doesn't have a merge operator attached (a merge operator can potentially hurt read performance). In order to test this improvement in isolation from data locality we create a SmallState column which has the same structure as State but only contains the nodes for a single state root and doesn't have a merge operator. The result is a ~1.5x performance improvement.
execution logs
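A sketch of how SmallState could be populated; all three accessors are hypothetical stand-ins for the real column plumbing:

```rust
/// Copy every trie node reachable from `root` into the new column.
/// `get_state_node` reads a raw node from State by hash,
/// `decode_children` parses the child hashes out of the raw node bytes,
/// `put_small_state_node` writes the node into SmallState.
fn build_small_state(
    root: [u8; 32],
    get_state_node: &dyn Fn(&[u8; 32]) -> Vec<u8>,
    decode_children: &dyn Fn(&[u8]) -> Vec<[u8; 32]>,
    put_small_state_node: &mut dyn FnMut(&[u8; 32], &[u8]),
) {
    let mut stack = vec![root];
    while let Some(hash) = stack.pop() {
        let raw = get_state_node(&hash);
        put_small_state_node(&hash, &raw);
        stack.extend(decode_children(&raw));
    }
}
```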
Parallel nodes prefetching
Using a node's trie path instead of its hash as the storage key makes it possible to prefetch nodes for all possible lookup key prefixes while we traverse the trie. Reads for non-existent prefixes should be fast since RocksDB uses bloom filters to avoid IO when an entry does not exist. This approach is implemented as the Prefetcher struct.
Using FlatNodes along with the prefetching approach with 5 threads gives a 5x improvement compared to the baseline. Overall this approach gives improvements in the range of 3x to 5x depending on the test request count.
execution logs
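A sketch of the prefetching idea (not the actual Prefetcher implementation), using scoped threads; `db_get` stands in for a thread-safe FlatNodes column read. Every prefix of the lookup path is read ahead of the sequential trie walk, so the walk finds its nodes already cached, while prefixes that don't correspond to real nodes are cheap misses thanks to the bloom filters:

```rust
use std::thread;

fn prefetch_path<F>(db_get: &F, lookup_path: &[u8], threads: usize)
where
    F: Fn(&[u8]) -> Option<Vec<u8>> + Sync,
{
    // All prefixes of the lookup path: the nodes on the path to the
    // value are stored under exactly these keys in FlatNodes.
    let prefixes: Vec<&[u8]> =
        (1..=lookup_path.len()).map(|end| &lookup_path[..end]).collect();
    let chunk_size = prefixes.len().div_ceil(threads.max(1)).max(1);
    thread::scope(|scope| {
        for chunk in prefixes.chunks(chunk_size) {
            scope.spawn(move || {
                for prefix in chunk {
                    // A hit warms the block cache; a miss is cheap
                    // because the bloom filter avoids disk IO.
                    let _ = db_get(prefix);
                }
            });
        }
    });
}
```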
Further improvements
RocksDB tuning
Performance of the FlatNodes column is more sensitive to RocksDB tuning: we could play with block size, block cache size, block_restart_interval etc.
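For illustration, the relevant knobs look roughly like this in the `rocksdb` crate (API as of recent crate versions; the values are placeholders, not tuned optima):

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

/// Example options for the FlatNodes column family.
fn flat_nodes_options() -> Options {
    let mut block_opts = BlockBasedOptions::default();
    // Smaller blocks reduce read amplification for point lookups.
    block_opts.set_block_size(4 * 1024);
    // Dedicated block cache sized for the hot part of the column.
    block_opts.set_block_cache(&Cache::new_lru_cache(512 * 1024 * 1024));
    // More restart points speed up binary search within a block at the
    // cost of slightly larger blocks.
    block_opts.set_block_restart_interval(4);
    // Bloom filters keep negative lookups (e.g. prefetch misses) cheap.
    block_opts.set_bloom_filter(10.0, false);

    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```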
Better data layout
The default lexicographic order of keys for FlatNodes is still not great, as it essentially sorts nodes in depth-first search order. Having all of a node's close descendants located near that node on disk should be a more cache-friendly layout. This can be achieved by recursively inlining a node's children in breadth-first search order until we reach some size threshold (something like the disk page cache size), though it is not clear how to support updates for such a structure; a sketch of the packing follows.
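A sketch of the breadth-first inlining, with `Node` and `load_node` as hypothetical stand-ins: descendants are appended in BFS order until the blob reaches the size threshold, so a single disk read brings in a node together with its nearest descendants:

```rust
use std::collections::VecDeque;

/// Hypothetical in-memory view of a trie node.
struct Node {
    raw: Vec<u8>,            // serialized RawTrieNodeWithSize
    children: Vec<[u8; 32]>, // hashes (or paths) of child nodes
}

/// Pack a node and its nearest descendants, in BFS order, into one
/// blob of at most `max_bytes` (nodes left out get their own blobs).
fn pack_subtree(
    root: [u8; 32],
    load_node: &dyn Fn(&[u8; 32]) -> Node,
    max_bytes: usize,
) -> Vec<u8> {
    let mut blob = Vec::new();
    let mut queue = VecDeque::from([root]);
    while let Some(hash) = queue.pop_front() {
        let node = load_node(&hash);
        if !blob.is_empty() && blob.len() + node.raw.len() > max_bytes {
            break; // threshold reached
        }
        blob.extend_from_slice(&node.raw);
        queue.extend(node.children);
    }
    blob
}
```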
Challenges
Additional disk space and RAM
Maintaining another copy of the state requires more resources. This was measured by column_stats.rs.
execution logs
Implementation complexity
We need to handle access to non-final blocks by maintaining their node changes in a similar way to flat state deltas; a sketch of such deltas follows.
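A sketch of what such deltas could look like, mirroring the flat state deltas design: each non-final block keeps its node changes in memory, and a lookup walks the chain of deltas from the queried block back towards the final block before falling through to the on-disk column:

```rust
use std::collections::HashMap;

type TriePath = Vec<u8>;
type BlockHash = [u8; 32];

/// Node changes introduced by one non-final block;
/// `None` marks a deleted node.
struct FlatNodesDelta {
    prev_block: Option<BlockHash>,
    changes: HashMap<TriePath, Option<Vec<u8>>>,
}

/// `read_final` stands in for a read from the on-disk FlatNodes column.
fn lookup_node(
    deltas: &HashMap<BlockHash, FlatNodesDelta>,
    mut block: BlockHash,
    path: &TriePath,
    read_final: &dyn Fn(&TriePath) -> Option<Vec<u8>>,
) -> Option<Vec<u8>> {
    while let Some(delta) = deltas.get(&block) {
        if let Some(change) = delta.changes.get(path) {
            return change.clone(); // most recent change wins
        }
        match delta.prev_block {
            Some(prev) => block = prev,
            None => break,
        }
    }
    read_final(path)
}
```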