Add trie log pruning triggered after trie log persist #6026
Conversation
@@ -203,6 +204,10 @@ public Optional<byte[]> getTrieLog(final Hash blockHash) {
    return trieLogStorage.get(blockHash.toArrayUnsafe());
  }

  public Stream<byte[]> streamTrieLogs() {
    return trieLogStorage.streamKeys();
Used to preload the pruner. Need to test the feasibility of this on real nodes with lots of trie logs, might need to limit it.
Yes, seems dangerous to do that without a limit.
agreed. If we do not close the RocksIterator, that is going to lead to problems. I would suggest passing in a Consumer or something like that so we can compartmentalize the iterator open/close. Or do like streamFlat* methods and supply a Predicate or row limit, collect to a map and return.
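A minimal sketch of the consumer-based shape suggested here, so the iterator's open/close stays inside the storage layer; the interface and method names below are assumptions, not the actual Besu API:

```java
import java.util.function.Consumer;
import java.util.stream.Stream;

class TrieLogKeyStreamingSketch {

  // Stand-in for the key-value storage; assumed to wrap a RocksIterator that must be closed.
  interface KeyStorage {
    Stream<byte[]> streamKeys();
  }

  private final KeyStorage trieLogStorage;

  TrieLogKeyStreamingSketch(final KeyStorage trieLogStorage) {
    this.trieLogStorage = trieLogStorage;
  }

  // The caller only supplies a Consumer, so the iterator is opened and closed here
  // and cannot leak out of the storage layer.
  public void forEachTrieLogKey(final long limit, final Consumer<byte[]> consumer) {
    try (Stream<byte[]> keys = trieLogStorage.streamKeys().limit(limit)) {
      keys.forEach(consumer);
    }
  }
}
```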
Now I'm simply using streamKeys().limit(..), hopefully that is safe! More info about the new approach is in the updated description.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TrieLogPruner {
I would rather see the pruning implemented in a subclass of CachedWorldStorageManager. This feels like a management function that could benefit from knowing the current state of the cache. We could ensure that DEFAULT_MAX_BLOCKS_TO_PRUNE doesn't conflict with AbstractTrieLogManager.RETAINED_LAYERS, etc.
Having a separate pruner class breaks encapsulation imo.
I think the main problem with this is that at the CachedWorldStorageManager level, the cachedWorldStatesByHash cache only maintains canonical blocks, not forks... in other words, CachedWorldStorageManager.addCachedLayer (and therefore scrubCachedLayers) is only called for canonical blocks (when the head is moved during an FCU AFAICT). So we'd miss out on pruning forks this way.
Am still investigating potential benefits of the pruning being aware of this cache.
As discussed in person, these concepts (and therefore caches) are separated out now. There may still be a task to align these in code rather than relying on configuration.
trieLogPruner.rememberTrieLogKeyForPruning(
    forBlockHeader.getNumber(), forBlockHeader.getBlockHash().toArrayUnsafe());
since block number isn't unique for trielogs, we will get collisions during reorgs that could result in stale trielogs hanging around. depending on how the initial/startup cleanup is implemented, this might not be a major concern.
Yep, the data structure holds forks for pruning too (it's a multimap).
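For illustration, a minimal sketch of a fork-tolerant queue along these lines; the class and field names are assumptions rather than the actual Besu code:

```java
import java.util.Arrays;
import java.util.Comparator;

import com.google.common.collect.Multimap;
import com.google.common.collect.TreeMultimap;

class ForkAwarePruneQueueSketch {

  // Keyed by block number, highest first; a reorged/orphaned block at the same height
  // simply adds a second value under that key instead of overwriting the canonical entry.
  private final Multimap<Long, byte[]> trieLogKeysByBlockNumber =
      TreeMultimap.<Long, byte[]>create(Comparator.reverseOrder(), Arrays::compare);

  void rememberTrieLogKeyForPruning(final long blockNumber, final byte[] trieLogKey) {
    trieLogKeysByBlockNumber.put(blockNumber, trieLogKey);
  }
}
```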
Test results for this first iteration (with no startup loadingLimit) on some nodes: most of the work is in streaming all the trie log entries into the pruning queue. The pruning windows themselves were performant since RocksDB marks entries for deletion; the range of time taken:
TrieLogPruner loads all trie logs on startup. Each time a trie log is persisted, the pruner is run and uses a pruning window (currently hardcoded to 1000) to chip away at an initially large backlog of trie logs. Once the backlog is cleared, each prune run should just be a single trie log. Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Force-pushed from 757ab62 to 161b05b
Defaults to 0 = unlimited threshold === pruning disabled. Wire in the TrieLogPruner based on this. Initialize the pruner on startup to preload the cache using the loadingLimit (currently hardcoded as 1000). Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Testing various trie log limits for preloading the pruning cache. Tested on a 3-month-old node. (Throwaway code for this test: #6108)
Rename method. Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Ready for code review, but leaving in draft until testing is complete.
(...instead of --Xbonsai-trie-log-retention-threshold=0.) Use 512 as default and also minimum value for --Xbonsai-trie-log-retention-threshold. Validate that --Xbonsai-trie-log-prune-limit is positive. Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
…imit Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Renames and logging. Refactor test. Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Separate list is for logging reasons. Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Second and third iterations tested on the following deployments:
Last 12 hours (charts not reproduced): New Payload, FCU, Java Memory Used, Java Memory Committed. The mean response times and memory usage appear to be randomly spread across the different deployment versions, so I don't think there's a significant difference between them.
LGTM as it's an experimental feature; we will have to do other PRs for flag renaming and to call DB optimization.
lines.add("Trie log pruning enabled:"); | ||
lines.add(" - retention threshold: " + trieLogRetentionThreshold + " blocks"); | ||
if (trieLogPruningLimit != null) { | ||
lines.add(" - prune limit: " + trieLogPruningLimit + " blocks"); | ||
} |
nit: seems a little verbose to be using three lines in the config logging. Can it be shortened to one line?
Done this here: 6b27522
Looks like this:
# Trie log pruning enabled: retention: 512; prune limit: 30000 #
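For reference, a sketch of how the condensed single-line summary above could be built; the variable names mirror the earlier diff, and the surrounding config-overview builder code is assumed:

```java
import java.util.ArrayList;
import java.util.List;

class ConfigSummarySketch {

  // Builds the condensed one-line summary shown above instead of three separate lines.
  static List<String> pruningSummary(
      final long trieLogRetentionThreshold, final Long trieLogPruningLimit) {
    final List<String> lines = new ArrayList<>();
    lines.add(
        "Trie log pruning enabled: retention: "
            + trieLogRetentionThreshold
            + "; prune limit: "
            + trieLogPruningLimit);
    return lines;
  }
}
```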
I set up a reorg test based on this hive test, with some modifications to send finalized blocks trailing the head by 10 blocks; note it uses a mock CL. The test was set up with an artificially low retention threshold, and the abbreviated logs (not reproduced here) show the pruning behaviour through the reorg.
(I intend to turn this into a Besu AT in a future PR.)
LGTM, seems to be fine with the finalized block modification.
@BeforeEach
public void setup() {
  Configurator.setLevel(LogManager.getLogger(TrieLogPruner.class).getName(), Level.TRACE);
Why is this one needed for the test?
It's not needed - purely convenience for comprehending the tests when you run them. I thought it was useful enough to leave in but happy to remove.
        .addArgument(loadingLimit)
        .log();
    try {
      final Stream<byte[]> trieLogKeys = rootWorldStateStorage.streamTrieLogKeys(loadingLimit);
Think this should be done in a try-with-resources so that we close the stream
done 13f0040
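A minimal sketch of the caller-side try-with-resources shape referenced by that commit; the storage interface below is a stand-in, not the real Besu type:

```java
import java.util.stream.Stream;

class PreloadSketch {

  // Stand-in for the storage facade used in the diff above.
  interface WorldStateKeyValueStorage {
    Stream<byte[]> streamTrieLogKeys(long limit);
  }

  void preload(final WorldStateKeyValueStorage rootWorldStateStorage, final long loadingLimit) {
    // The stream wraps a RocksDB iterator, so try-with-resources guarantees it is closed.
    try (Stream<byte[]> trieLogKeys = rootWorldStateStorage.streamTrieLogKeys(loadingLimit)) {
      trieLogKeys.forEach(
          key -> {
            // add each key to the pruner queue (elided in this sketch)
          });
    }
  }
}
```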
Tested the pruner cache/queue size: with the default retention of 512 blocks, the size of the cache/queue is ~140 KB.
In a non-finality event this queue is unbounded, however since the size is small I don't think this will be a problem; more likely other parts of Besu would break first. For context, the recent non-finality event lasted a max of 9 epochs, at a max of 32 blocks per epoch = 288 blocks. To reach megabyte magnitude, the non-finality event would have to be 3,636 blocks long (1 MB / 275 bytes) = 114 epochs = 12 hours of non-finality.
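A quick back-of-envelope check of those numbers (all values taken from the comment above):

```java
public class QueueSizeEstimate {
  public static void main(final String[] args) {
    final long bytesPerEntry = 275; // implied by ~140 KB for a 512-entry queue
    final long retention = 512;
    System.out.println(retention * bytesPerEntry);   // 140_800 bytes, roughly 140 KB
    final long blocksForOneMb = 1_000_000 / bytesPerEntry;
    System.out.println(blocksForOneMb);              // 3_636 blocks to reach 1 MB
    System.out.println(blocksForOneMb / 32);         // ~113 epochs (the comment rounds to 114)
    System.out.println(blocksForOneMb * 12 / 3600);  // ~12 hours at 12 s per slot
  }
}
```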
LGTM
Checking the performance and the implementation, I think this should be done asynchronously so as not to add overhead to the block processing time. Also, I think this could be handled on either the onBlockAdded event or the onTrieLogAdded event, processing the event asynchronously.
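A minimal sketch of the asynchronous handling suggested here; the event hook and method names are hypothetical, not the actual Besu event API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncPruningSketch {

  private final ExecutorService pruningExecutor = Executors.newSingleThreadExecutor();

  // Hypothetical hook, invoked when a trie log is persisted or a block is added.
  void onTrieLogAdded(final long blockNumber, final byte[] trieLogKey) {
    // Hand the work to a background thread so block import latency is unaffected.
    pruningExecutor.submit(() -> pruneIfEligible(blockNumber, trieLogKey));
  }

  private void pruneIfEligible(final long blockNumber, final byte[] trieLogKey) {
    // pruning logic elided in this sketch
  }
}
```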
Test on the existing validator canary once code is finalized (which should have orphaned trielogs). TL;DR: it works 🎉. Tested on a Sepolia validator node; trie log count before enabling pruning:
As expected, there are orphaned trielogs due to block creation. The number is high because this sepolia node runs 20 / 1972 validators = 1% of the network, so it does an unusual amount of block creation compared to mainnet nodes.
Log snippets:
Example of the initial load picking up a batch of trielogs. Note, we don't control the order of the trielog loading due to the way they are stored - it appears to be quite random. The upshot is that not all of them are eligible for pruning immediately because they may fall inside the retention window...
After that (and every restart), there's a period of no pruning while the queue builds up its 512-block retention window. This period lasts ~2 hours. This is only the case for pre-existing nodes with a larger backlog of trielogs; if the total number of trielogs is <= 30,000 then we will prune most on load and have a full retention window to begin with. Note pruning triggers on both the block processing thread and the block building thread...
During this time we prune the odd trielog, which I believe occurs when a queue entry that was preloaded during startup becomes eligible. Once the queue size reaches the 512-block retention window, we start pruning ~1 block every block (one in, one out).
When we propose a block, we actually save multiple trie logs, but only the latest one makes it into the chain (hence the orphaned trielogs). During pruning (512 blocks later), these will show up as a list of trie logs stored against the same block number (similar to forks).
- Toggled with --Xbonsai-trie-log-pruning-enabled
- Introduces TrieLogPruner which loads a limited number of trie logs on startup to preload the pruner queue, based on loadingLimit with a default value of 30,000 blocks (configured with --Xbonsai-trie-log-pruning-limit).
- Each time a trie log is persisted it is added to the pruner queue and then the pruner is run against the queue, which will prune trie logs associated with block numbers below the --Xbonsai-trie-log-retention-threshold (default 512).
- Once the retention threshold is reached, each prune run should just be a single trie log.
- Prune any orphaned trielogs that were created during block creation.
- Don't prune non-finalized blocks for PoS chains.

---------
Signed-off-by: Simon Dudley <simon.dudley@consensys.net>
Signed-off-by: Justin Florentine <justin+github@florentine.us>
First piece of #5390
Feature enabled with --Xbonsai-trie-log-pruning-enabled (since renamed to --Xbonsai-limit-trie-logs-enabled).

TrieLogPruner loads a limited number of trie logs on startup to preload the pruner queue, based on loadingLimit with a default value of 30,000 blocks (configured with --Xbonsai-trie-log-prune-limit, since renamed to --Xbonsai-trie-logs-pruning-window-size). We immediately prune anything that is eligible for pruning following the load.

Each time a trie log is persisted it is added to the pruner queue and then the pruner is run against the queue, which will prune trie logs associated with block numbers below the retention threshold, originally --Xbonsai-trie-log-retention-threshold (which defaulted to 0 === disabled) and now using --bonsai-historical-block-limit.

The retention-threshold is approximately the size of the trie log pruning queue (it is measured in blocks but the multimap queue may contain forks, so they can differ).

For PoS networks, we read the finalizedBlock hash from the blockchain, load the header and use that to ensure we don't prune non-finalized blocks which might be subject to reorg. Since the minimum retention is currently 512, there would have to be a long non-finality event and reorg for this check to be needed. It is a nice safety net to have though, and opens the possibility to prune closer to head in the future.
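A rough sketch of the retention plus finality check described above; the names and exact comparison semantics are assumptions, not the actual Besu implementation:

```java
final class PruneEligibilitySketch {

  // A trie log is prunable only if its block is both outside the retention window
  // and not newer than the finalized block (so a reorg cannot resurrect it).
  static boolean shouldPrune(
      final long blockNumber,
      final long chainHeadNumber,
      final long retentionThreshold,
      final long finalizedBlockNumber) {
    final boolean outsideRetentionWindow = blockNumber < chainHeadNumber - retentionThreshold;
    final boolean finalized = blockNumber <= finalizedBlockNumber;
    return outsideRetentionWindow && finalized;
  }
}
```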
Trie log backlog management
There are two reasons the database might contain old trie logs that don't exist in the queue; one is that restarting Besu empties the queue but leaves the database with the trie logs associated with the blocks above the retention-threshold.

The backlog is gradually pruned: a single set of trie logs is loaded each time Besu is started up, which ensures pruneable trie logs aren't forgotten when Besu gets restarted, whilst avoiding any complicated logic for progressively managing the backlog.
A loadingLimit > retention-threshold is desirable since, if Besu is restarted, we want to at least preload the amount of trie logs that were forgotten by the queue. Once the backlog is cleared, each prune run should just be a single trie log.
Depending on the size of the backlog, it is possible that it will never be cleared. This is particularly true for nodes that have been running before this feature was enabled. To mitigate this, we can offer one of two options.
Tasks:
- Prune orphaned trie logs (trie logs that exist in the database but not on the blockchain; neither canonical nor a fork)
- Prevent block building from saving orphaned trie logs: see below