
Implement Savanna disaster recovery unit tests. #444

Merged
merged 38 commits into from
Aug 2, 2024


greg7mdp
Contributor

Partially resolves #380

Implemented here:

unit tests: Disaster recovery

Single finalizer goes down

[sd0] recovery when nodes go down

  • shutdown C
  • A produces 4 more blocks. Verify that lib advances by 4
  • restart C
  • push blocks A -> C
  • verify that C votes again (strong) and that lib continues to advance
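
The "verify that lib advances by N" steps amount to reading the last irreversible block number before and after some action. Below is a minimal, self-contained sketch of that pattern. Only the name lib_advances_by comes from the test code in this PR; toy_node, its quorum_voting flag and its one-block-per-block finality rule are hypothetical simplifications, not the savanna_cluster fixture.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical toy node for illustration; not the real test fixture.
struct toy_node {
   std::uint32_t head = 0;            // head block number
   std::uint32_t lib  = 0;            // last irreversible block number
   bool          quorum_voting = true; // toy flag: are enough finalizers voting?

   // Toy rule (an assumption): each produced block finalizes one more block
   // while a quorum is voting, and finalizes nothing otherwise.
   void produce_blocks(std::uint32_t n) {
      for (std::uint32_t i = 0; i < n; ++i) {
         ++head;
         if (quorum_voting)
            ++lib;
      }
   }

   // Runs `f` and returns how far lib moved while it ran.
   template <class F>
   std::uint32_t lib_advances_by(F&& f) {
      const std::uint32_t before = lib;
      std::forward<F>(f)();
      return lib - before;
   }
};
```

Passing the action as a callable keeps the "before" and "after" reads next to each other, so a test line can state both the action and the expected lib delta in one expression.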

[sd1] Recover a killed node with old finalizer safety info

  • save C's fsi
  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state, replace C's fsi with previously saved file
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd2] Recover a killed node with deleted finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state and fsi
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

[sd3] Recover a killed node while retaining up to date finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown C
  • A produces 2 blocks, verify lib continues to advance
  • remove C's state, leave fsi alone
  • restart C from previously taken snapshot
  • push blocks A -> C
  • A produces 2 blocks, verify that C votes again (strong) and that lib continues to advance

All but one finalizer nodes go down

Tests are similar to those above, except that C is replaced by the set { B, C, D }, and lib stops advancing when { B, C, D } are shut down
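
Why losing C is harmless while losing { B, C, D } stalls lib can be sketched with quorum arithmetic. This assumes four equally weighted finalizers and a threshold of strictly more than two thirds of them; the function names are hypothetical and the formula is an assumption about the test setup, not something stated in this PR.

```cpp
#include <cstddef>

// Assumed rule: a quorum needs strictly more than two thirds of the finalizers.
constexpr std::size_t quorum_threshold(std::size_t num_finalizers) {
   return num_finalizers * 2 / 3 + 1;
}

// True when the finalizers still voting are enough to form a quorum.
constexpr bool has_quorum(std::size_t num_voting, std::size_t num_finalizers) {
   return num_voting >= quorum_threshold(num_finalizers);
}

static_assert(quorum_threshold(4) == 3, "4 finalizers need 3 votes");
static_assert(has_quorum(3, 4),  "A, B and D still form a quorum without C");
static_assert(!has_quorum(1, 4), "A alone cannot form a quorum");
```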

[md0] recovery when nodes go down

  • shutdown { B, C, D }
  • A produces 4 more blocks. Verify that lib advances by 1
  • restart { B, C, D }
  • push blocks A -> { B, C, D }
  • verify that { B, C, D } vote again (strong) and that lib continues to advance

[md1] Recover a killed node with old finalizer safety info

  • save { B, C, D }'s fsi
  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state, replace { B, C, D }'s fsi with previously saved file
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md2] Recover a killed node with deleted finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state and fsi
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

[md3] Recover a killed node while retaining up to date finalizer safety info

  • A produces 2 blocks
  • take snapshot of C
  • A produces 2 blocks
  • shutdown { B, C, D }
  • A produces 2 blocks, verify lib continues to advance
  • remove { B, C, D }'s state, leave fsi alone
  • restart { B, C, D } from previously taken snapshot
  • push blocks A -> { B, C, D }
  • A produces 2 blocks, verify that { B, C, D } vote again (strong) and that lib continues to advance

All nodes are shutdown with reversible blocks lost

[rv0] nodes shutdown with reversible blocks lost

  • A produces 2 blocks
  • take snapshot of C
  • A produces enough blocks so the snapshot block becomes irreversible and the snapshot is created.
  • verify that all nodes have the same last irreversible block ID (lib_id) and head block ID (h_id) - the snapshot block
  • split network { A, B } and { C, D }
  • A produces two more blocks, so A and B will vote strong but finality will not advance
  • remove network split
  • shutdown all four nodes
  • delete the state and the reversible data for all nodes, but do not delete the fsi or blocks log
  • restart all four nodes from previously saved snapshot. A and B finalizers will be locked on lib_id's child which was lost
  • A produces 4 blocks
  • verify that head is advancing on all nodes
  • verify that lib does not advance and is stuck at lib_id (validators are locked on a reversible block which has been lost, so they cannot vote: the claim on the lib block is just copied forward and will always be on a block with a timestamp earlier than that of the lock block in the fsi)
  • verify that A and B aren't voting
  • shutdown all four nodes again
  • delete every node's fsi
  • restart all four nodes
  • A produces 4 blocks, verify that every node is voting strong again on each new block and that lib advances
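
The locking behaviour in the steps above can be sketched as a simplified vote decision. This assumes a finalizer may vote when the proposal either extends its lock block or carries a claim on a block newer than the lock; all types and names are hypothetical, and other checks a real finalizer performs are deliberately left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// All types here are hypothetical simplifications for illustration.
struct lock_ref {
   std::uint64_t block_id;  // block the finalizer is locked on
   std::uint32_t timestamp; // timestamp of that block
};

struct safety_info {
   std::optional<lock_ref> lock; // empty models a deleted (fresh) fsi
};

struct proposal {
   std::uint32_t              claimed_block_timestamp; // timestamp of the block its claim refers to
   std::vector<std::uint64_t> ancestor_ids;            // ids of the blocks it builds on
};

// True when `p` builds on the block with the given id.
inline bool extends(const proposal& p, std::uint64_t id) {
   return std::find(p.ancestor_ids.begin(), p.ancestor_ids.end(), id) != p.ancestor_ids.end();
}

inline bool can_vote(const safety_info& fsi, const proposal& p) {
   if (!fsi.lock)
      return true; // no lock recorded, nothing prevents voting
   const bool newer_claim  = p.claimed_block_timestamp > fsi.lock->timestamp;
   const bool extends_lock = extends(p, fsi.lock->block_id);
   return newer_claim || extends_lock;
}
```

In the rv0 scenario the lock block no longer exists, so no new block can extend it, and the copied-forward claim stays older than the lock, so both conditions fail until the fsi is deleted.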

@greg7mdp greg7mdp requested review from heifner and linh2931 July 30, 2024 20:28
}
}
}));
} FC_LOG_AND_RETHROW()
Member

I love how clean all these test cases are. Well done!

@ericpassmore
Contributor

Note:start
group: STABILITY
category: TEST
summary: Implement Savanna disaster recovery unit tests.
Note:end


using namespace eosio::chain;
using namespace eosio::testing;

Member

The PR description is very informative. Copy some high level info to the beginning of each test to describe test purpose.

Contributor Author

I tried to have the test be almost as readable as the list of steps in the description above, and I feel it is better to have one source of truth.

When duplicating the code logic into comments, they often become inconsistent as the code is changed while the comments are not. This is why I did not want to add the list of steps as a block comment on top. However, let me see if I can add a more general comment describing the intent of the test.

Member

You don't need to add the exact list of steps here, just a short sentence at the beginning would be good.

Member

Thanks!

BOOST_REQUIRE_EQUAL(A.lib_advances_by([&]() { A.produce_blocks(4); }), 4); // lib still advances with 3 finalizers
C.open();
A.push_blocks_to(C);
BOOST_REQUIRE_EQUAL(A.lib_advances_by([&]() { A.produce_blocks(4); }), 4); // all 4 finalizers should be back voting
Member

The comment could be improved. This line only checks that LIB advances; the next one checks that C is back to voting.

Contributor Author

Yes, that comment really applies to the next line, and the // let's make sure of that implies that it is only verified on that line.

auto& A=_nodes[0]; auto& C=_nodes[2];

C.close();
BOOST_REQUIRE_EQUAL(A.lib_advances_by([&]() { A.produce_blocks(4); }), 4); // lib still advances with 3 finalizers
Member

Can we verify only 3 finalizers (not including C) are voting?

Contributor Author

We don't have an API for doing this directly... let me see if I can add one.

}

  void set_node_finalizers() {
     if (node_num_keys)
-       t.set_node_finalizers({&key_names.at(node_first_key), node_num_keys});
+       t.set_node_finalizers({&key_names.at(node_first_key_idx), node_num_keys});
  }

// updates the finalizer_policy to the `fin_policy_size` keys starting at `first_key`
Member

The comment about first_key should be changed too. But OK not to touch them this time.

@greg7mdp greg7mdp merged commit d429a51 into main Aug 2, 2024
36 checks passed
@greg7mdp greg7mdp deleted the gh_380 branch August 2, 2024 16:28
Development

Successfully merging this pull request may close these issues.

Create Savanna unittests modeled after the fast testnet wave tests.
4 participants