feat(state-sync): DB Snapshots #9090

Merged: 168 commits into master, Jun 20, 2023
Conversation

@nikurt (Contributor) commented May 22, 2023

FlatStorage provides a fast way to generate state parts, but it needs a consistent view of the state at the beginning of the epoch.
This PR makes a snapshot of the whole DB and deletes the unused columns, giving the node a read-only database with exactly the view of Flat State that is needed for State Sync.
The snapshot is made in a separate actix Actor to avoid blocking ClientActor.
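For illustration, a minimal sketch of the separate-actor approach using actix's `SyncArbiter`; the actor and message names here are placeholders, not the actual nearcore types:

```rust
use actix::prelude::*;

// Hypothetical message asking the actor to snapshot at a given block.
#[derive(Message)]
#[rtype(result = "()")]
struct MakeSnapshot {
    prev_block_hash: Vec<u8>,
}

struct StateSnapshotActor;

impl Actor for StateSnapshotActor {
    // SyncContext runs the actor on a dedicated thread, so the
    // long-running snapshot work does not block other actors.
    type Context = SyncContext<Self>;
}

impl Handler<MakeSnapshot> for StateSnapshotActor {
    type Result = ();

    fn handle(&mut self, msg: MakeSnapshot, _ctx: &mut Self::Context) {
        // Checkpoint the DB, drop unused columns, compact, etc.
        println!("taking state snapshot at {:?}", msg.prev_block_hash);
    }
}

// Started from client code (inside a running actix System); snapshot
// requests are then fire-and-forget from ClientActor's point of view.
fn start_snapshot_actor() -> Addr<StateSnapshotActor> {
    SyncArbiter::start(1, || StateSnapshotActor)
}
```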

To speed up iteration during development, this PR also adds a testing mechanism: a config option that triggers a state snapshot every N blocks. That makes no sense for state sync itself, but it is very useful for observing enough snapshot events.

Tested that a node can process blocks while snapshotting is in progress.
Without compaction, a snapshot can require ~100GB of extra disk space; enabling compaction reduces that overhead to ~10GB.

@nikurt force-pushed the nikurt-seize-the-state branch from c234bf3 to b21eab5 on May 22, 2023 19:19
@nikurt requested a review from Longarithm on May 24, 2023 12:32
@nikurt marked this pull request as ready for review on May 24, 2023 12:32
@nikurt requested a review from a team as a code owner on May 24, 2023 12:32
near-bulldozer bot pushed a commit that referenced this pull request May 25, 2023
Generate the inner part of a state part using flat storage, based on the idea presented in #8984.

In short, if the flat storage head corresponds to the state root for which we sync state, it is enough to read only the boundary nodes, and the inner trie part can be reconstructed from a range of KV pairs in flat storage. The main logic for this is contained in `Trie::get_trie_nodes_for_part_with_flat_storage`.

It requires a couple of minor changes:
* we now allow creating "view" `Trie`s with flat storage as well; as before, we want to avoid creating non-view `Trie`s, because `TrieCache` accesses may block chunk processing
* `get_head_hash` and `shard_uid` methods for `FlatStorage`, allowing correct range queries to flat storage
* `FlatStateValue` moved to `primitives` to allow more general access
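A conceptual sketch of the composition described above, with stand-in types (a `BTreeMap` plays the role of flat storage; none of these names are nearcore's API):

```rust
use std::collections::BTreeMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// Stand-in for flat storage: a sorted KV view at the flat head.
struct FlatState {
    kvs: BTreeMap<Key, Value>,
}

impl FlatState {
    /// Range query over flat state, the kind of access enabled by the
    /// `get_head_hash`/`shard_uid` additions.
    fn range(&self, from: &[u8], to: &[u8]) -> Vec<(Key, Value)> {
        self.kvs
            .range(from.to_vec()..to.to_vec())
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect()
    }
}

/// A state part is then the boundary trie nodes (read from the trie
/// itself) plus the inner KV pairs (read cheaply from flat storage).
fn build_state_part(
    flat: &FlatState,
    boundary_nodes: Vec<Vec<u8>>,
    part_from: &[u8],
    part_to: &[u8],
) -> (Vec<Vec<u8>>, Vec<(Key, Value)>) {
    (boundary_nodes, flat.range(part_from, part_to))
}
```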


## TODO
* prometheus metrics
* integration test checking that flat storage is used during normal block processing on client (or wait for #9090)
 
## Testing

https://nayduck.near.org/#/run/3023

A big sanity test for `get_trie_nodes_for_part_with_flat_storage`, covering all the scenarios I could think of:
* results with/without flat storage must match
* result with incorrect flat storage must be an error
* result with flat storage and missing intermediate node should be still okay
@Longarithm (Member) commented:
I like the code, it's very clear. But I have a general concern.

We shouldn't put state_snapshot fields in NightshadeRuntime. It is designed to be stateless and hold only the logic for transaction processing. My rule of thumb is: the fewer methods we have in RuntimeAdapter, the better, as it is already bloated, and there was a huge effort to separate EpochManager from it. The logic for making snapshots and getting state parts doesn't really fit in Runtime, because it is not related to transaction processing. I think that having obtain_state_part inside RuntimeAdapter was a mistake, and it is causing this unnatural dependency of the state parts logic on Runtime.

I don't have an opinion yet on where this logic should live - maybe it's a separate struct like StatePartManager in Client, maybe part of StateSync. But it is more or less clear that it doesn't need to know about Runtime. As a short-term hack we could pass StateSnapshot to obtain_state_part to make the code work. Happy to discuss it here or offline.

@Longarithm (Member) commented:
Maybe this logic fits well into ShardTries, because that struct owns state store updates and provides state views. It looks like it already knows all the needed context.

nikurt pushed a commit that referenced this pull request May 31, 2023
@nikurt (Contributor, Author) commented Jun 1, 2023

@Longarithm Please take another look.
Moved the data and logic to ShardTries. Maybe this logic needs to be in a separate class?
Also, I butchered some logic in shard_tries.rs, please advise.

Many tests are failing because they use an epoch length that is too short. It takes a couple of seconds to:

* delete an old snapshot
* checkpoint
* delete columns
* compact

And if snapshots are requested too often, it's not happy. Handled that by ignoring consecutive requests if I can't lock flat storage.
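A rough sketch of the steps listed above using the `rocksdb` crate directly (nearcore wraps its storage differently; the paths and column-family names below are placeholders):

```rust
use rocksdb::{checkpoint::Checkpoint, Options, DB};

fn make_snapshot(db: &DB, snapshot_path: &str) -> Result<(), rocksdb::Error> {
    // 1. Delete an old snapshot, if any.
    let _ = std::fs::remove_dir_all(snapshot_path);

    // 2. Checkpoint: SST files are hard-linked, so this is cheap at
    //    first and only diverges as the live DB compacts.
    Checkpoint::new(db)?.create_checkpoint(snapshot_path)?;

    // 3. Reopen the checkpoint (all existing column families must be
    //    listed) and drop the ones state sync does not need.
    let opts = Options::default();
    let mut snap = DB::open_cf(&opts, snapshot_path, ["default", "col_unused"])?;
    snap.drop_cf("col_unused")?;

    // 4. Compact to actually reclaim disk space (~100GB -> ~10GB here).
    snap.compact_range(None::<&[u8]>, None::<&[u8]>);
    Ok(())
}
```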

@Longarithm (Member) commented:
Looking.

> Many tests are failing because they use an epoch length that is too short.

I see... We can disable removing columns and compaction for tests, because the state there is very small. Could that simplify the changes?

@nikurt (Contributor, Author) commented Jun 2, 2023

> I see... We can disable removing columns and compaction for tests, because the state there is very small. Could that simplify the changes?

It would help, but there are still limits to it, and I don't want to introduce more configuration flags.

@nikurt (Contributor, Author) commented Jun 5, 2023

@Longarithm Please take a look

Comment on lines 101 to 105
assert_ne!(
    flat_storage_manager.set_flat_state_updates_mode(false),
    Some(false),
    "Failed to lock flat state updates"
);
Reviewer (Member):
The assertion looks fine (would it even be possible to propagate an error from this actor to the main actor?).
But let's change the `set_flat_state_updates_mode` return type to `Result<bool, StorageError>`.

Author reply (Contributor):
Changed this to check whether `set_flat_state_updates_mode()` succeeded. If it didn't, the snapshotting request is ignored.
I don't return an error, because the caller doesn't care about one; the blocks need to be processed regardless. I prefer the fail-open (aka fail-safe) approach, because snapshots are not critical to any individual node, though they are critical to the network as a whole.
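A self-contained sketch of this fail-open handling, combined with the `Result` return type suggested above (the `StorageError` type and method names are stand-ins, not the exact nearcore API):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug)]
pub struct StorageError(pub String);

pub struct FlatStorageManager {
    available: bool,
    updates_enabled: AtomicBool,
}

impl FlatStorageManager {
    /// Per the review suggestion: return Result instead of Option so
    /// callers can tell "mode set (previous value)" from "unavailable".
    pub fn set_flat_state_updates_mode(
        &self,
        enabled: bool,
    ) -> Result<bool, StorageError> {
        if !self.available {
            return Err(StorageError("flat storage not ready".into()));
        }
        Ok(self.updates_enabled.swap(enabled, Ordering::SeqCst))
    }
}

/// Fail-open caller: if we cannot lock flat state updates, skip this
/// snapshot request; block processing must continue regardless.
pub fn maybe_snapshot(mgr: &FlatStorageManager) {
    match mgr.set_flat_state_updates_mode(false) {
        Ok(_prev) => { /* proceed with the snapshot */ }
        Err(err) => eprintln!("skipping state snapshot: {:?}", err),
    }
}
```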

Comment on lines 169 to 171
Some(v) => assert_eq!(
    prev_shard_value, v,
    "All FlatStorage are expected to have the same value of `move_head_enabled`"
Reviewer (Member):
Similar to the comment below `get_make_snapshot_callback` - let's make it an error and propagate it further. It looks like we can't return None from here anyway.

Author reply (Contributor):
Please see an updated version of this function.

@@ -2547,11 +2568,40 @@ fn test_catchup_gas_price_change() {
    genesis.config.min_gas_price = min_gas_price;
    genesis.config.gas_limit = 1000000000000;
    let chain_genesis = ChainGenesis::new(&genesis);

    let tmp_dir1 = tempfile::tempdir().unwrap();
Reviewer (Member):
Why can't we reuse real_epoch_managers and nightshade_runtimes as before?

Author reply (Contributor):
nightshade_runtimes() initializes the runtime with home_dir ../../../../, where I can't create a new directory. Therefore, I create a runtime manually.
real_epoch_managers() by itself is fine, but if I use it, I don't have an epoch manager to pass to the runtime I'm initializing manually.

@nikurt (Contributor, Author) commented Jun 19, 2023

Nayduck tests are fine, but a few tests fail occasionally due to reasons unrelated to this PR.

@near-bulldozer bot merged commit a5ede1d into master on Jun 20, 2023
@near-bulldozer bot deleted the nikurt-seize-the-state branch on June 20, 2023 14:07
ppca added a commit that referenced this pull request Sep 19, 2023
marcelo-gonzalez added a commit to marcelo-gonzalez/nearcore that referenced this pull request Mar 15, 2024
This was added in near#9090 to provide a way to reduce the size of snapshots, as the commit message said. But that's not needed anymore now that we just drop the unneeded column families and have a smaller snapshot to begin with.
github-merge-queue bot pushed a commit that referenced this pull request Dec 15, 2024
…pshots (#12589)

`test_resharding_v3_shard_shuffling_slower_post_processing_tasks`
exposes a bug that can be triggered if child flat storages are not split
after a resharding by the time we want to take a state snapshot. Then
the state snapshot code will fail because the flat storage is not ready,
but will not retry. To fix it, we add a `want_snapshot` field that will
be set when we decide to take a state snapshot. We also add a
`split_in_progress` field to the `FlatStorageManager` that will be set
to `true` when a resharding is started, and back to false when it's
finished and the catchup code has progressed to a height close to the
desired snapshot height. The state snapshot code will wait until
`split_in_progress` is false to proceed, and the flat storage catchup
code will wait until `want_snapshot` is cleared if it has already
advanced to the desired snapshot hash, so that we don't advance past the
point that was wanted by the state snapshot. The first one is the one
actually causing the test failure, but the second one is also required.

We implement this waiting by rescheduling the message sends in the
future. A Condvar would be a very natural choice, but it unfortunately
doesn't seem to work in testloop, since actors that are normally running
on different threads are put on the same thread, and a blocker on a
Condvar won't be woken up.

Here we are making a change to the behavior of the old
`set_flat_state_updates_mode()`, which used to refuse to proceed if the
update mode was already set to the same value. This seems to be an
artifact of the fact that when state snapshots were implemented in
#9090, this extra logic was added
because there was another user of this function
(`inline_flat_state_values()` added in
#9037), but that function has since
been deleted, so the state snapshot code is now the only user of
`set_flat_state_updates_mode()`.
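A simplified sketch of the two coordination flags described above (field and method names are illustrative, not the exact nearcore API):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Default)]
pub struct SnapshotCoordination {
    /// Set when a state snapshot has been requested.
    want_snapshot: AtomicBool,
    /// Set while a resharding split (and its catchup) is in progress.
    split_in_progress: AtomicBool,
}

impl SnapshotCoordination {
    /// The snapshot code polls this and reschedules itself (e.g. by
    /// re-sending its message in the future) until the split is done.
    pub fn ready_to_snapshot(&self) -> bool {
        !self.split_in_progress.load(Ordering::SeqCst)
    }

    /// The flat storage catchup code checks this before advancing past
    /// the desired snapshot height.
    pub fn should_pause_catchup(&self) -> bool {
        self.want_snapshot.load(Ordering::SeqCst)
    }

    pub fn request_snapshot(&self) {
        self.want_snapshot.store(true, Ordering::SeqCst);
    }

    pub fn snapshot_done(&self) {
        self.want_snapshot.store(false, Ordering::SeqCst);
    }
}
```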