Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blocks log replay fix, enhance libtester to support customizing startup options, and add blocks log replay tests #133

Merged
merged 18 commits into from
Jul 10, 2024

Conversation

linh2931
Copy link
Member

@linh2931 linh2931 commented May 13, 2024

  • Fix a bug in blocks log replay where reversible blocks are removed if replay is interrupted.
    • chain_head now saved to chain_head.dat in state directory. This allows replay to be interrupted and restarted without impacting fork database.
  • Enhance libtester to support customizing startup options such that shutdown can be triggered at particular block.
  • Add tests for
    1. replay through blocks log and irreversible blocks
    2. replay stopping in the middle of blocks log and resuming
    3. replay stopping in the middle of reversible blocks and resuming
  • Update replay integration tests to test relaunch after a replay or start from snapshot.

Note to devs: Restart from snapshot required for any existing running node.

Resolves #7

@linh2931 linh2931 requested review from heifner and greg7mdp May 13, 2024 17:18
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
@heifner
Copy link
Member

heifner commented May 14, 2024

I don't see how this fixes the issue. For example:

rm -rf main-d
cp /mnt/c/Users/kevin/Downloads/snapshot-2024-05-14-10-eos-v6-0372933677.bin.zst .
zstd -d snapshot-2024-05-14-10-eos-v6-0372933677.bin.zst
./nodeos --data-dir main-d --config-dir mainconfig --snapshot snapshot-2024-05-14-10-eos-v6-0372933677.bin
[let it sync up to head]
rm -rf main-d/state
./nodeos --data-dir main-d --config-dir mainconfig --snapshot snapshot-2024-05-14-10-eos-v6-0372933677.bin

Wait for: existing block log, attempting to replay from 372933678 to 372954549 blocks and ctrl-c.

info  2024-05-14T13:05:26.800 nodeos    controller.cpp:1556           replay_block_log     ] existing block log, attempting to replay from 372933678 to 372954549 blocks
info  2024-05-14T13:05:27.224 nodeos    platform_timer_accurac:62     compute_and_print_ti ] Checktime timer accuracy: min:1us max:758us mean:7us stddev:16us
^Cinfo  2024-05-14T13:05:36.173 nodeos    main.cpp:165                  operator()           ] appbase quit called
info  2024-05-14T13:05:36.191 nodeos    controller.cpp:1610           replay_block_log     ] quitting from replay_block_log because of shutdown
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1621           replay_block_log     ] 256 irreversible blocks replayed
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1624           replay_block_log     ] replayed 256 blocks in 9 seconds, 36.83050196078431782 ms/block
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1651           replay               ] quitting from replay because of shutdown
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1787           startup              ] Finished initialization from snapshot (snapshot load time was 300s)
info  2024-05-14T13:05:36.192 nodeos    chain_plugin.cpp:1130         plugin_startup       ] starting chain in read/write mode
info  2024-05-14T13:05:36.192 nodeos    chain_plugin.cpp:1138         plugin_startup       ] Blockchain started; head block is #372933933
warn  2024-05-14T13:05:36.204 nodeos    fork_database.cpp:723         close                ] fork_database is in a bad state when closing; not writing out '/home/heifner/ext/leap/cmake-build-relwithdebinfo/bin/main-d/blocks/reversible/fork_db.dat', legacy_valid=false, savanna_valid=false
info  2024-05-14T13:05:36.808 nodeos    main.cpp:165                  operator()           ] appbase quit called

unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
unittests/blocks_log_replay_tests.cpp Outdated Show resolved Hide resolved
@linh2931
Copy link
Member Author

I don't see how this fixes the issue. For example:

rm -rf main-d
cp /mnt/c/Users/kevin/Downloads/snapshot-2024-05-14-10-eos-v6-0372933677.bin.zst .
zstd -d snapshot-2024-05-14-10-eos-v6-0372933677.bin.zst
./nodeos --data-dir main-d --config-dir mainconfig --snapshot snapshot-2024-05-14-10-eos-v6-0372933677.bin
[let it sync up to head]
rm -rf main-d/state
./nodeos --data-dir main-d --config-dir mainconfig --snapshot snapshot-2024-05-14-10-eos-v6-0372933677.bin

Wait for: existing block log, attempting to replay from 372933678 to 372954549 blocks and ctrl-c.

info  2024-05-14T13:05:26.800 nodeos    controller.cpp:1556           replay_block_log     ] existing block log, attempting to replay from 372933678 to 372954549 blocks
info  2024-05-14T13:05:27.224 nodeos    platform_timer_accurac:62     compute_and_print_ti ] Checktime timer accuracy: min:1us max:758us mean:7us stddev:16us
^Cinfo  2024-05-14T13:05:36.173 nodeos    main.cpp:165                  operator()           ] appbase quit called
info  2024-05-14T13:05:36.191 nodeos    controller.cpp:1610           replay_block_log     ] quitting from replay_block_log because of shutdown
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1621           replay_block_log     ] 256 irreversible blocks replayed
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1624           replay_block_log     ] replayed 256 blocks in 9 seconds, 36.83050196078431782 ms/block
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1651           replay               ] quitting from replay because of shutdown
info  2024-05-14T13:05:36.192 nodeos    controller.cpp:1787           startup              ] Finished initialization from snapshot (snapshot load time was 300s)
info  2024-05-14T13:05:36.192 nodeos    chain_plugin.cpp:1130         plugin_startup       ] starting chain in read/write mode
info  2024-05-14T13:05:36.192 nodeos    chain_plugin.cpp:1138         plugin_startup       ] Blockchain started; head block is #372933933
warn  2024-05-14T13:05:36.204 nodeos    fork_database.cpp:723         close                ] fork_database is in a bad state when closing; not writing out '/home/heifner/ext/leap/cmake-build-relwithdebinfo/bin/main-d/blocks/reversible/fork_db.dat', legacy_valid=false, savanna_valid=false
info  2024-05-14T13:05:36.808 nodeos    main.cpp:165                  operator()           ] appbase quit called

Thanks for running the branch. The new tests (replaying from genesis and in Savanna mode) works. It seems some additional fixes are required for replaying from snapshot or in Legacy mode?

@linh2931
Copy link
Member Author

Do I miss something? I ctrl-c during the replay of blocks log:

info  2024-05-14T14:53:49.541 nodeos    controller.cpp:1556           replay_block_log     ] existing block log, attempting to replay from 2 to 2272 blocks
^Cinfo  2024-05-14T14:53:49.953 nodeos    main.cpp:165                  operator()           ] appbase quit called
info  2024-05-14T14:53:49.957 nodeos    platform_timer_accurac:62     compute_and_print_ti ] Checktime timer accuracy: min:0us max:1341us mean:1us stddev:15us
info  2024-05-14T14:53:49.957 nodeos    controller.cpp:1610           replay_block_log     ] quitting from replay_block_log because of shutdown
info  2024-05-14T14:53:49.957 nodeos    controller.cpp:1621           replay_block_log     ] 1 irreversible blocks replayed

I then restarted the nodeos. It replayed successfully:

info  2024-05-14T14:55:27.693 nodeos    controller.cpp:1516           initialize_blockchai ] Initializing new blockchain with genesis state
info  2024-05-14T14:55:27.716 nodeos    controller.cpp:1556           replay_block_log     ] existing block log, attempting to replay from 2 to 2272 blocks
info  2024-05-14T14:55:28.131 nodeos    platform_timer_accurac:62     compute_and_print_ti ] Checktime timer accuracy: min:0us max:6us mean:1us stddev:1us
info  2024-05-14T14:55:28.233 nodeos    controller.cpp:1614           replay_block_log     ] 500 of 2272
info  2024-05-14T14:55:28.336 nodeos    controller.cpp:1614           replay_block_log     ] 1000 of 2272
info  2024-05-14T14:55:28.439 nodeos    controller.cpp:1614           replay_block_log     ] 1500 of 2272
info  2024-05-14T14:55:28.542 nodeos    controller.cpp:1614           replay_block_log     ] 2000 of 2272
info  2024-05-14T14:55:28.598 nodeos    controller.cpp:1621           replay_block_log     ] 2271 irreversible blocks replayed
info  2024-05-14T14:55:28.598 nodeos    controller.cpp:1622           replay_block_log     ] replayed 2271 blocks in 0 seconds, 0.38843127753303969 ms/block
warn  2024-05-14T14:55:28.598 nodeos    controller.cpp:1687           operator()           ] No existing chain state. Initializing fresh blockchain state.
info  2024-05-14T14:55:28.598 nodeos    controller.cpp:1709           operator()           ] fork database head 2273, root 2272
info  2024-05-14T14:55:28.598 nodeos    controller.cpp:1736           operator()           ] 1 reversible blocks replayed
info  2024-05-14T14:55:28.598 nodeos    chain_plugin.cpp:1130         plugin_startup       ] starting chain in read/write mode

@heifner
Copy link
Member

heifner commented May 14, 2024

rm -rf main-d/state
./nodeos --data-dir main-d --config-dir mainconfig --snapshot snapshot-2024-05-14-10-eos-v6-0372933677.bin
[ctrl-c]
./nodeos --data-dir main-d --config-dir mainconfig
[ctrl-c]
info  2024-05-14T19:25:50.659 nodeos    main.cpp:165                  operator()           ] appbase quit called
info  2024-05-14T19:25:50.659 nodeos    main.cpp:158                  operator()           ] nodeos version v1.0.0-dev v1.0.0-dev-08e4701ae217fc3c219b05d71ab275f4dee187c2-dirty
info  2024-05-14T19:25:50.660 nodeos    main.cpp:62                   log_non_default_opti ] Non-default options: data-dir = main-d, config-dir = mainconfig, chain-state-db-size-mb = 65536, api-accept-transactions = false, p2p-accept-transactions = false, p2p-peer-address = peer.main.alohaeos.com:9876, p2p-peer-address = p2p.eosargentina.io:9876, p2p-peer-address = p2p.genereos.io:9876, p2p-peer-address = eos.defibox.xyz:9876, p2p-peer-address = eos.edenia.cloud:9876, p2p-peer-address = p2p.eos.cryptolions.io:9876, p2p-peer-address = p2p.donates2eden.io:9876, p2p-peer-address = p2p.eos42.io:9876, p2p-peer-address = mainnet.eosamsterdam.net:9876, p2p-peer-address = mainnet.eosarabia.net:3571, p2p-peer-address = p2p2.eoseoul.io:30333, p2p-peer-address = p2p.eosflare.io:9876, p2p-peer-address = p2p.bitmars.one:8080, p2p-peer-address = p2p.eos.detroitledger.tech:1337, p2p-peer-address = peer.eosio.sg:9876, p2p-peer-address = eos.seed.eosnation.io:9876, p2p-peer-address = peer1.eosphere.io:9876, p2p-peer-address = p2p.eossweden.org:9876, p2p-peer-address = eos.p2p.eosusa.io:9882
error 2024-05-14T19:25:50.660 nodeos    main.cpp:217                  main                 ] 3020000 fork_database_exception: Fork database exception
fork database head (372954877) is inconsistent with state (372936075)
    {"db":372936075,"head":372954877}
    nodeos  controller.cpp:1895 init

    {}
    nodeos  chain_plugin.cpp:1152 plugin_startup

@ericpassmore
Copy link
Contributor

Note:start
group: IF
category: INTERNALS
summary: More robust block log replay through refactoring head block state management.
Note:end

@heifner heifner requested a review from greg7mdp July 1, 2024 21:05
@arhag arhag self-requested a review July 3, 2024 20:04
Copy link
Member Author

@linh2931 linh2931 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heifner Thanks for finishing the PR! It looks good. You can approve it.

@@ -199,9 +199,9 @@ namespace eosio::testing {
open(snapshot);
}

void base_tester::init(controller::config config, const genesis_state& genesis) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may prefer an enum class do_statrtup_t { no, yes } instead of the bool here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make the changes to use enum class do_statrtup_t { no, yes }.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done by d08e3ae

@greg7mdp
Copy link
Contributor

greg7mdp commented Jul 9, 2024

It seemed somewhat cleaner to have all the state we need in fork_db rather than adding an extra file. Couldn't we have an extra member in fork_db to store the chain_head, that we would update whenever the chain_head in controller changes?

@heifner
Copy link
Member

heifner commented Jul 9, 2024

It seemed somewhat cleaner to have all the state we need in fork_db rather than adding an extra file. Couldn't we have an extra member in fork_db to store the chain_head, that we would update whenever the chain_head in controller changes?

The block state is variable length, so would require loading the fork database and then writing it back out on an interrupted replay. If we did move it, seems we would want to consider moving chain_head to fork_database as well and that doesn't seem correct. As of #347, fork database is just a database of possible forks. I'm now thinking we should keep savanna_lib_id in controller and store it in the chain_head.dat file.

@greg7mdp
Copy link
Contributor

greg7mdp commented Jul 9, 2024

so would require loading the fork database and then writing it back out on an interrupted replay.

This doesn't seem so bad to me.

fork database is just a database of possible forks

Well, you can see it that way. Or you could rename it reversible_state and have it contain all the non-final info we need to restart (in addition to the state file).

I think it is somewhat worse for users to add a new file, and I don't think we would have a convincing explanation as to why it is needed.

…ment value in init() and open() to have clearer intention of parameters
@heifner heifner merged commit d6e94f2 into main Jul 10, 2024
36 checks passed
@heifner heifner deleted the replay_test branch July 10, 2024 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor head block state management
4 participants