Implement Savanna disaster recovery unit tests. #444

greg7mdp · 2024-07-30T18:28:26Z

Partially resolves #380

Implemented here:

unit tests: Disaster recovery

Single finalizer goes down

[sd0] recovery when nodes go down

shutdown C
A produces 4 more blocks. Verify that lib advances by 4
restart C
push blocks A -> C
verify that C votes again (strong) and that lib continues to advance
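
The "verify that lib advances by N" steps rely on a helper that runs an action and reports how far lib moved while it ran. A minimal sketch of that pattern follows. `toy_node` and its members are hypothetical stand-ins for illustration, not the savanna_cluster API, and the toy ignores the lag between head and lib that a real chain has.

```cpp
#include <cstdint>
#include <utility>

// Toy stand-in for a cluster node (hypothetical, not the savanna_cluster API).
// It only tracks block numbers, which is enough to show the measuring pattern.
struct toy_node {
   uint32_t head_num         = 0;    // head block number
   uint32_t lib_num          = 0;    // last irreversible block number
   bool     quorum_available = true; // assumption: lib follows head only while a quorum votes

   // Produce `n` blocks; lib advances with head only when a quorum is available.
   void produce_blocks(uint32_t n) {
      for (uint32_t i = 0; i < n; ++i) {
         ++head_num;
         if (quorum_available)
            ++lib_num;
      }
   }

   // Run `f` and return how many blocks lib advanced while it ran.
   template <class F>
   uint32_t lib_advances_by(F&& f) {
      const uint32_t before = lib_num;
      std::forward<F>(f)();
      return lib_num - before;
   }
};
```

A test can then wrap any action in a lambda, for example `A.lib_advances_by([&]() { A.produce_blocks(4); })`, and compare the result with the expected advance.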

[sd1] Recover a killed node with old finalizer safety info

save C's fsi
A produces 2 blocks
take snapshot of C
A produces 2 blocks
shutdown C
A produces 2 blocks, verify lib continues to advance
remove C's state, replace C's fsi with previously saved file
restart C from previously taken snapshot
push blocks A -> C
A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd2] Recover a killed node with deleted finalizer safety info

A produces 2 blocks
take snapshot of C
A produces 2 blocks
shutdown C
A produces 2 blocks, verify lib continues to advance
remove C's state and fsi
restart C from previously taken snapshot
push blocks A -> C
A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd3] Recover a killed node while retaining up to date finalizer safety info

A produces 2 blocks
take snapshot of C
A produces 2 blocks
shutdown C
A produces 2 blocks, verify lib continues to advance
remove C's state, leave fsi alone
restart C from previously taken snapshot
push blocks A -> C
A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

All but one finalizer nodes go down

Tests are similar to those above, except that C is replaced by the set { B, C, D }, and lib stops advancing when { B, C, D } are shut down

[md0] recovery when nodes go down

shutdown { B, C, D }
A produces 4 more blocks. Verify that lib advances by 1
restart { B, C, D }
push blocks A -> { B, C, D }
verify that { B, C, D } vote again (strong) and that lib continues to advance
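
The difference between the single-node and multi-node cases comes down to whether the remaining finalizers still reach a quorum. A rough sketch, assuming four equally weighted finalizers and a threshold of floor(2n/3) + 1 (the threshold formula is an assumption made for illustration, and both function names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Assumed threshold for n equally weighted finalizers: floor(2n/3) + 1.
constexpr std::size_t quorum_threshold(std::size_t n) {
   return n * 2 / 3 + 1;
}

// True when the number of running finalizers reaches the assumed threshold.
inline bool has_quorum(const std::vector<bool>& running) {
   std::size_t up = 0;
   for (bool r : running) {
      if (r)
         ++up;
   }
   return up >= quorum_threshold(running.size());
}
```

Under these assumptions, { A, B, D } running with C down still reaches the threshold of 3, while A running alone does not.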

[md1] Recover a killed node with old finalizer safety info

save { B, C, D }'s fsi
A produces 2 blocks
take snapshot of C
A produces 2 blocks
shutdown { B, C, D }
A produces 2 blocks, verify lib continues to advance
remove { B, C, D }'s state, replace { B, C, D }'s fsi with previously saved file
restart { B, C, D } from previously taken snapshot
push blocks A -> { B, C, D }
A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md2] Recover a killed node with deleted finalizer safety info

A produces 2 blocks
take snapshot of C
A produces 2 blocks
shutdown { B, C, D }
A produces 2 blocks, verify lib continues to advance
remove { B, C, D }'s state and fsi
restart { B, C, D } from previously taken snapshot
push blocks A -> { B, C, D }
A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md3] Recover a killed node while retaining up to date finalizer safety info

A produces 2 blocks
take snapshot of C
A produces 2 blocks
shutdown { B, C, D }
A produces 2 blocks, verify lib continues to advance
remove { B, C, D }'s state, leave fsi alone
restart { B, C, D } from previously taken snapshot
push blocks A -> { B, C, D }
A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

All nodes are shutdown with reversible blocks lost

[rv0] nodes shutdown with reversible blocks lost

A produces 2 blocks
take snapshot of C
A produces enough blocks so the snapshot block becomes irreversible and the snapshot is created.
verify that all nodes have the same last irreversible block ID (lib_id) and head block ID (h_id) - the snapshot block
split network { A, B } and { C, D }
A produces two more blocks, so A and B will vote strong but finality will not advance
remove network split
shutdown all four nodes
delete the state and the reversible data for all nodes, but do not delete the fsi or blocks log
restart all four nodes from previously saved snapshot. A and B finalizers will be locked on lib_id's child which was lost
A produces 4 blocks
verify that head is advancing on all nodes
verify that lib does not advance and is stuck at lib_id (because validators are locked on a reversible block which has been lost, they cannot vote anymore: the claim on the lib block is just copied forward and will always be on a block with a timestamp earlier than that of the lock block in the fsi)
verify that A and B aren't voting
shutdown all four nodes again
delete every node's fsi
restart all four nodes
A produces 4 blocks, verify that every node is voting strong again on each new block and that lib advances
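
The stuck-then-recovered behaviour in [rv0] can be illustrated with a simplified voting decision. This is a sketch under stated assumptions, not the actual finalizer code: the types and field names are hypothetical, and the real rules include further checks that are left out here.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified view of the lock stored in a finalizer's fsi.
struct toy_lock {
   uint32_t timestamp; // timestamp of the block the finalizer is locked on
};

// Hypothetical, simplified view of a block proposal.
struct toy_proposal {
   uint32_t qc_claim_timestamp; // timestamp of the block named by the proposal's QC claim
   bool     extends_lock;       // whether the proposal descends from the locked block
};

// Simplified decision: with no fsi there is no lock, so voting is allowed.
// Otherwise vote only if the proposal extends the lock, or if its QC claim
// is newer than the lock.
inline bool can_vote(const std::optional<toy_lock>& lock, const toy_proposal& p) {
   if (!lock)
      return true;
   const bool newer_claim = p.qc_claim_timestamp > lock->timestamp;
   return newer_claim || p.extends_lock;
}
```

In this model, A and B after the restart hold a lock on the lost child of lib_id, every new proposal carries the older claim on lib_id and does not descend from the lost block, so `can_vote` stays false until the fsi is deleted.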

After base_tester::open(): need to reconnect the signals and re-initialize the node finalizers. Also add a couple working tests.

libraries/chain/controller.cpp

unittests/savanna_cluster.hpp

unittests/savanna_transition.cpp

heifner · 2024-07-31T18:02:21Z

unittests/savanna_disaster_recovery.cpp

+         }
+      }
+   }));
+} FC_LOG_AND_RETHROW()


I love how clean all these test cases are. Well done!


ericpassmore · 2024-08-01T15:36:36Z

Note:start
group: STABILITY
category: TEST
summary: Implement Savanna disaster recovery unit tests.
Note:end

unittests/savanna_cluster.hpp

libraries/testing/include/eosio/testing/tester.hpp

linh2931 · 2024-08-02T13:17:22Z

unittests/savanna_disaster_recovery.cpp

+
+using namespace eosio::chain;
+using namespace eosio::testing;
+


The PR description is very informative. Consider copying some high-level info to the beginning of each test to describe its purpose.

I tried to have the test be almost as readable as the list of steps in the description above, and I feel it is better to have one source of truth.

When duplicating the code logic into comments, they often become inconsistent as the code is changed while the comments are not. This is why I did not want to add the list of steps as a block comment on top. However, let me see if I can add a more general comment describing the intent of the test.

You don't need to add the exact list of steps here, just a short sentence at the beginning would be good.

unittests/savanna_disaster_recovery.cpp

linh2931 · 2024-08-02T13:24:00Z

unittests/savanna_disaster_recovery.cpp

+   BOOST_REQUIRE_EQUAL(A.lib_advances_by([&]() { A.produce_blocks(4);  }), 4); // lib still advances with 3 finalizers
+   C.open();
+   A.push_blocks_to(C);
+   BOOST_REQUIRE_EQUAL(A.lib_advances_by([&]() { A.produce_blocks(4);  }), 4); // all 4 finalizers should be back voting


The comment can be improved. This line only checks that LIB is advancing; the next line checks that C is back to voting.

Yes, that comment really applies to the next line, and the `// let's make sure of that` implies that it is only verified on that line.

linh2931 · 2024-08-02T13:25:52Z

unittests/savanna_disaster_recovery.cpp

+   auto& A=_nodes[0]; auto& C=_nodes[2];
+
+   C.close();
+   BOOST_REQUIRE_EQUAL(A.lib_advances_by([&]() { A.produce_blocks(4);  }), 4); // lib still advances with 3 finalizers


Can we verify only 3 finalizers (not including C) are voting?

We don't have an API for doing this directly... let me see if I can add one.

unittests/savanna_disaster_recovery.cpp

linh2931 · 2024-08-02T15:08:08Z

libraries/testing/include/eosio/testing/tester.hpp

      }

      void set_node_finalizers() {
         if (node_num_keys)
-            t.set_node_finalizers({&key_names.at(node_first_key), node_num_keys});
+            t.set_node_finalizers({&key_names.at(node_first_key_idx), node_num_keys});
      }

      // updates the finalizer_policy to the `fin_policy_size` keys starting at `first_key`


The comment about first_key should be changed too. But OK not to touch them this time.

greg7mdp added 19 commits July 24, 2024 17:18

Start work on new savanna_cluster tests.

8e9e7aa

Update savanna_cluster to take into account shutdown nodes.

df1cced

Rename push_blocks member function

116801c

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

525333c

Add cluster support for snapshot and fsi changes.

9a8e391

Add support for removing finalizer safety information.

13178dc

Complete disaster recovery tests when single finalizer goes down.

9ae8303

Move implementation of require_lib_advancing_by to node_t.

014b826

Add a couple assert.

cd6a60a

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

00f3a0f

Fix issues with multiple down nodes

bf7cef3

After base_tester::open(): need to reconnect the signals and re-initialize the node finalizers. Also add a couple working tests.

Implement recover_killed_nodes_with_deleted_fsi test.

5b71d38

Implement recover_killed_nodes_while_retaining_fsi test.

873238f

wip test all_nodes_shutdown_with_reversible_blocks_lost

809570c

Finish implementing savanna disaster recovery tests.

a8d0f2f

minor changes (mostly comments and whitespace).

c144c41

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

108c1dd

Fix blocks_log_replay_tests.

0c62358

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

98b395a

greg7mdp requested review from heifner and linh2931 July 30, 2024 20:28

greg7mdp added 6 commits July 30, 2024 16:31

Remove comment and whitespace.

7415f7a

Start working on finalizer transition tests.

d9b9d15

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

1aa6029

Simplify savanna_cluster esp. management of finalizer keys.

42f1900

Merge branch 'gh_380_2' of github.com:AntelopeIO/spring into gh_380

f0d0259

Fix initial savanna_transition test.

1011082

heifner requested changes Jul 31, 2024

libraries/chain/controller.cpp

libraries/chain/controller.cpp Outdated

unittests/savanna_cluster.hpp Outdated

unittests/savanna_transition.cpp

heifner reviewed Jul 31, 2024


Make constructor explicit

4a79595

greg7mdp added 2 commits July 31, 2024 14:30

Update comments.

71e9d5d

Allow controller::last_irreversible_block_id() to assert if fork_db's root not set.

889d08c

heifner approved these changes Jul 31, 2024


greg7mdp added 2 commits August 1, 2024 00:02

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

959babd

Merge branch 'gh_380' of github.com:AntelopeIO/spring into gh_380

50af859

heifner approved these changes Aug 1, 2024


heifner reviewed Aug 1, 2024

unittests/savanna_cluster.hpp Outdated

greg7mdp added 2 commits August 1, 2024 18:07

Fix whitespace.

1af2ca2

Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

8c11dfc

heifner approved these changes Aug 2, 2024


Merge branch 'main' of github.com:AntelopeIO/spring into gh_380

37eed66

linh2931 reviewed Aug 2, 2024


greg7mdp added 2 commits August 2, 2024 10:34

Update variable names according to PR comment.

4480e7e

Add comments.

6b171c5

linh2931 reviewed Aug 2, 2024


linh2931 approved these changes Aug 2, 2024


greg7mdp added 3 commits August 2, 2024 11:12

Switch order of arguments to BOOST_REQUIRE_EQUAL.

5c7e9f6

Add comments.

b4a36c5

Update comment and parameter name.

1fa1610

linh2931 approved these changes Aug 2, 2024


greg7mdp merged commit d429a51 into main Aug 2, 2024
36 checks passed

greg7mdp deleted the gh_380 branch August 2, 2024 16:28
