
test: add tests for SnapshotHostInfo broadcast behavior #10204

Merged
merged 12 commits into near:master from snapshot_broadcast_tests on Nov 27, 2023

Conversation

jancionear
Contributor

Add a few tests verifying that SnapshotHostInfo is properly broadcast between peers.

SnapshotHostInfo lets other peers know that a certain peer has a state snapshot at a specific point in time, which is later used for state sync. Nodes publish this information periodically and it should be broadcast to all nodes in the network.

Fixes: #10172
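For context, here is a rough sketch of the payload being broadcast, reconstructed from how the tests in this PR construct it. The field names and placeholder type aliases are assumptions for illustration, not the exact nearcore definitions.

```rust
// Sketch only: approximate shape of SnapshotHostInfo, inferred from the
// arguments passed to SnapshotHostInfo::new() in the tests in this PR.
// Placeholder type aliases stand in for the real near-primitives types.
type PeerId = [u8; 32];
type CryptoHash = [u8; 32];
type EpochHeight = u64;
type ShardId = u64;
type Signature = Vec<u8>;

pub struct SnapshotHostInfoSketch {
    pub peer_id: PeerId,           // peer that hosts the snapshot
    pub sync_hash: CryptoHash,     // point in the chain the snapshot was taken at
    pub epoch_height: EpochHeight, // epoch height of the snapshot
    pub shards: Vec<ShardId>,      // shards included in the snapshot
    pub signature: Signature,      // signed with the host's node key so receivers can verify it
}
```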

@jancionear jancionear requested a review from a team as a code owner November 17, 2023 14:26
Collaborator

@saketh-are saketh-are left a comment

Thanks for adding these tests; this looks great overall! Left a few minor suggestions.

@@ -2,5 +2,6 @@ mod accounts_data;
mod connection_pool;
mod nonce;
mod routing;
mod sync_snapshot_hosts;
Collaborator

nit: snapshot_hosts would be more consistent with the other modules here.

/// 3. fcntl() duplicating one end of some globally shared socketpair()
/// 4. fcntl() duplicating epoll socket created in (2)
/// This gives 5 file descriptors per PeerActor (4 + 1 TCP socket).
const FDS_PER_PEER: usize = 5;
Collaborator

This constant and the associated commentary is now duplicated here and in chain/network/src/peer_manager/tests/accounts_data.rs. Consider moving it to chain/network/src/peer_manager/testonly.rs and importing it instead.
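A minimal sketch of that deduplication, assuming the constant can simply be made pub(crate) in testonly.rs; exact visibility and module paths may need adjusting:

```rust
// chain/network/src/peer_manager/testonly.rs (sketch)
// Each PeerActor consumes 5 file descriptors in total
// (1 TCP socket + 4 others, per the comment quoted above).
pub(crate) const FDS_PER_PEER: usize = 5;

// In the snapshot hosts test module (sketch):
// use crate::peer_manager::testonly::FDS_PER_PEER;
```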

tracing::info!(target:"test", "Send a SyncSnapshotHosts message from peer1, make sure that all peers receive it.");

let info1 =
make_snapshot_host_info(&peer1_config.node_id(), 123, vec![0, 1], &peer1_config.node_key);
Collaborator

Since your tests aren't at all sensitive to the contents of SnapshotHostInfo, it may help readability to pass rng into make_snapshot_host_info and generate a random epoch_height and shards inside it rather than inlining arbitrary values here.
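A sketch of the suggested shape of the helper, roughly matching what the later revision of this PR does. The parameter order, the node key type, and the exact SnapshotHostInfo::new signature are taken from code quoted elsewhere in this review and may not be exact.

```rust
// Sketch: let the helper generate arbitrary epoch_height/shards itself,
// so call sites don't need to inline magic values.
fn make_snapshot_host_info(
    peer_id: &PeerId,
    node_key: &SecretKey,
    rng: &mut impl rand::Rng,
) -> Arc<SnapshotHostInfo> {
    let epoch_height: EpochHeight = rng.gen();
    let num_shards: usize = rng.gen_range(1..16);
    let shards: Vec<ShardId> = (0..num_shards).map(|_| rng.gen()).collect();
    let sync_hash = CryptoHash::hash_borsh(epoch_height);
    Arc::new(SnapshotHostInfo::new(peer_id.clone(), sync_hash, epoch_height, shards, node_key))
}
```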

let peer4_sync_msg = peer4.events.recv_until(take_sync_snapshot_msg).await;
assert_eq!(peer4_sync_msg.hosts, vec![info1.clone()]);

tracing::info!(target:"test", "Publish another piece of snapshot information, check that it's also broadcasted.");
Collaborator

Nice coverage of all the interesting cases in this test.

pms[1].actix.addr.send(message.with_span_context()).await.unwrap();

tracing::info!(target:"test", "Make sure that the message sent from #1 reaches #2 on the other side of the ring.");
let received_message = pms[2].events.recv_until(take_sync_snapshot_msg_manager).await;
Collaborator

What if we moved this inside peer_manager/testonly.rs and actually checked the internal state of the SnapshotHostCache, along the lines of:

// Awaits until the accounts_data state matches `want`.
pub async fn wait_for_accounts_data(&self, want: &HashSet<Arc<SignedAccountData>>) {
    self.wait_for_accounts_data_pred(|cache| {
        &cache.data.values().cloned().collect::<HashSet<_>>() == want
    })
    .await
}

That is, verify that the event is not only received by the peer manager actor but actually processed correctly into the local state?
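For the snapshot-hosts analog, a hedged sketch of what such a helper could look like, mirroring the accounts_data example above; the `wait_for_snapshot_hosts_pred` helper and the `hosts` accessor on the cache are assumed names, not the actual internals.

```rust
// Sketch: await until the local SnapshotHostsCache state matches `want`.
// `wait_for_snapshot_hosts_pred` and `cache.hosts` are assumed to mirror the
// accounts_data equivalents; the names are illustrative only.
pub async fn wait_for_snapshot_hosts(&self, want: &HashSet<Arc<SnapshotHostInfo>>) {
    self.wait_for_snapshot_hosts_pred(|cache| {
        &cache.hosts.values().cloned().collect::<HashSet<_>>() == want
    })
    .await
}
```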

}

/// Used to consume peer manager events until there's an event of type SyncSnapshotHosts
fn take_sync_snapshot_msg_manager(
Collaborator

See comment at the end of the file; I think this lies in a weird middle ground between consuming just the network events (as done in the first tests with fake peer managers) and what is possible in the last test (in the full peer managers there exists a SnapshotHostsCache instance and we can verify its state).

assert_eq!(got1.hosts, vec![info2.clone()]);

tracing::info!(target:"test", "Connect another peer, check that it receieves all of the published information.");

Collaborator

nit: Consider removing this newline; having a single chunk starting with the tracing message and ending with the assert helps readability in these long tests with multiple sections.

@jancionear
Contributor Author

Thank you for the feedback!
I fixed all the issues; please take another look.

Collaborator

@saketh-are saketh-are left a comment

Looks great, thanks!

) -> Arc<SnapshotHostInfo> {
let epoch_height: EpochHeight = rng.gen::<EpochHeight>();
let shards_num: usize = rng.gen_range(1..16);
let shards: Vec<ShardId> = (0..shards_num).map(|_| rng.gen()).collect();
Collaborator

It shouldn't matter for these tests, but small, distinct values would be more realistic for shard ids. It works like this in practice:

  • The shard layout for the protocol has shards labelled 0, 1, ... N-1. Today N = 4.
  • A snapshot contains some subset of those shards.

The components you are testing here (network infra and the SnapshotHostsCache) don't really care about these properties and can handle arbitrary values, so I don't necessarily think these tests need to change.

Unless/until we introduce additional verification of these values within the network side of things, the downstream consumers of this content should be able to handle unexpected values for ShardId anyway. @VanBarbascu
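As an illustration of the "small, distinct values" point, a minimal sketch of generating a realistic shard set by sampling a subset of a shard layout with `n` shards; the helper name and parameter are made up for this example and don't correspond to anything in the PR.

```rust
use rand::seq::SliceRandom;
use rand::Rng;

// Sketch: pick a non-empty, distinct, sorted subset of shard ids 0..n
// (assumes n >= 1), instead of fully random u64 values.
fn realistic_shards(rng: &mut impl Rng, n: u64) -> Vec<u64> {
    let all: Vec<u64> = (0..n).collect();
    let count = rng.gen_range(1..=all.len());
    let mut shards: Vec<u64> = all.choose_multiple(rng, count).copied().collect();
    shards.sort_unstable();
    shards
}
```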

Contributor Author

I intentionally didn't limit the values here because I wanted to see what happens :P
Maybe the cache allocates vec![0; max_shard_id+1] and the program would crash? I felt that it would make the test more thorough. After all, malicious peers could send messages with large shard values, so it should be handled gracefully.

Collaborator

@saketh-are saketh-are Nov 20, 2023

I'll throw out one other idea (but please feel free to merge this as-is and move on), which would be to conduct the "happy path" tests on valid data and add a separate test test_unexpected_shard_ids which passes large shard values to see what happens. Then in the future, if we happen to introduce some downstream consumer which does something like you are imagining (e.g. allocate a vec of size max_shard_id + 1), the failing test would pinpoint the exact problem.

Contributor Author

Good idea, I made shard generation more realistic and added a separate test which makes sure that PeerManager handles big shard ids correctly.

Contributor Author

@jancionear jancionear Nov 20, 2023

One thing that worries me a bit is that this test doesn't pass when I send the message as the peer manager itself. It looks like a PeerManager doesn't process its own messages. Is this a concern, or is it working as expected?
Here's a failing test; it hangs on the last line:

/// Send a SyncSnapshotHosts message with very large shard ids.
/// Makes sure that PeerManager processes large shard ids without any problems.
#[tokio::test]
async fn large_shard_id_in_cache_pure_manager() {
    init_test_logger();
    let mut rng = make_rng(921853233);
    let rng = &mut rng;
    let mut clock = time::FakeClock::default();
    let chain = Arc::new(data::Chain::make(&mut clock, rng, 10));

    tracing::info!(target:"test", "Create a peer manager.");
    let pm = peer_manager::testonly::start(
        clock.clock(),
        near_store::db::TestDB::new(),
        chain.make_config(rng),
        chain.clone(),
    )
    .await;

    tracing::info!(target:"test", "Send a SnapshotHostInfo message with very large shard ids.");
    let big_shard_info = Arc::new(SnapshotHostInfo::new(
        pm.peer_info().id,
        CryptoHash::hash_borsh(1234_u64),
        1234,
        vec![0, 1232232, ShardId::MAX - 1, ShardId::MAX],
        &pm.cfg.node_key,
    ));

    let message = PeerManagerMessageRequest::NetworkRequests(NetworkRequests::SnapshotHostInfo {
        sync_hash: big_shard_info.sync_hash,
        epoch_height: big_shard_info.epoch_height,
        shards: big_shard_info.shards.clone(),
    });

    pm.actix.addr.send(message.with_span_context()).await.unwrap();

    tracing::info!(target:"test", "Make sure that the message is received and processed without any problems.");
    let want: HashSet<Arc<SnapshotHostInfo>> = std::iter::once(big_shard_info).collect();
    pm.wait_for_snapshot_hosts(&want).await;
}

Collaborator

OK, interesting case. It is technically working as expected, because NetworkRequests::SnapshotHostInfo triggers a network message to the peers of the node:

NetworkRequests::SnapshotHostInfo { sync_hash, epoch_height, shards } => {
    // Sign the information about the locally created snapshot using the keys in the
    // network config before broadcasting it
    let snapshot_host_info = SnapshotHostInfo::new(
        self.state.config.node_id(),
        sync_hash,
        epoch_height,
        shards,
        &self.state.config.node_key,
    );
    self.state.tier2.broadcast_message(Arc::new(PeerMessage::SyncSnapshotHosts(
        SyncSnapshotHosts { hosts: vec![snapshot_host_info.into()] },
    )));
    NetworkResponses::NoResponse
}

However, it sounds reasonable to also insert the SnapshotHostInfo to the local cache here.

Contributor Author

Should we do something with it? I don't feel qualified to decide, since you're the expert on this code.
Can I merge the PR, or should I go and "fix" this "issue"?

Collaborator

@saketh-are saketh-are Nov 27, 2023

I like the idea of inserting to the local cache here (in addition to broadcasting to all peers) if you want to sneak that in with this PR. It should just be a one-line change (and then you can keep the test as-is).
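For concreteness, a hedged sketch of roughly where that insertion could go in the handler quoted earlier; `self.state.snapshot_hosts` and its `insert` method are assumed names, not necessarily the actual nearcore API.

```rust
// Sketch only: hypothetical addition inside the NetworkRequests::SnapshotHostInfo
// handler quoted above; `self.state.snapshot_hosts` and `insert` are assumed names.
let snapshot_host_info = Arc::new(snapshot_host_info);
// Record our own snapshot info in the local cache, in addition to broadcasting it.
self.state.snapshot_hosts.insert(snapshot_host_info.clone());
self.state.tier2.broadcast_message(Arc::new(PeerMessage::SyncSnapshotHosts(
    SyncSnapshotHosts { hosts: vec![snapshot_host_info] },
)));
```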

Contributor Author

Eh, for now I'll just merge this PR, as I need to have these changes on master.
The insertion can be done in a follow-up PR.


codecov bot commented Nov 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2b24bc8) 71.87% compared to head (07505e2) 71.88%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10204      +/-   ##
==========================================
+ Coverage   71.87%   71.88%   +0.01%     
==========================================
  Files         707      707              
  Lines      141791   141815      +24     
  Branches   141791   141815      +24     
==========================================
+ Hits       101907   101943      +36     
+ Misses      35171    35153      -18     
- Partials     4713     4719       +6     
| Flag | Coverage Δ |
| --- | --- |
| backward-compatibility | 0.08% <ø> (ø) |
| db-migration | 0.08% <ø> (ø) |
| genesis-check | 1.23% <ø> (ø) |
| integration-tests | 36.16% <0.00%> (-0.01%) ⬇️ |
| linux | 71.78% <100.00%> (+0.04%) ⬆️ |
| linux-nightly | 71.64% <95.83%> (-0.02%) ⬇️ |
| macos | 55.74% <ø> (-0.01%) ⬇️ |
| pytests | 1.46% <ø> (ø) |
| sanity-checks | 1.26% <ø> (ø) |
| unittests | 68.29% <100.00%> (+0.07%) ⬆️ |
| upgradability | 0.13% <ø> (ø) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@jancionear
Contributor Author

Hmm, something went wrong with code coverage; I will rebase onto the latest master.

  • Add a test which spawns a single PeerManager and a few peers connected to it. The peers send out SyncSnapshotHosts messages and the PeerManager should broadcast them to all of the connected peers.
  • PeerManager shouldn't broadcast SyncSnapshotHosts messages that contain an invalid signature.
  • When one PeerManager receives a SyncSnapshotHosts message, it should propagate it to all of the connected PeerManagers. The message should spread through the network, eventually reaching PeerManagers that aren't directly connected to the sender.
  • I hit the file descriptor limit with this test, so I reused the logic from accounts_data.rs to increase the limit.
  • Now we're waiting for the right state, so there is no need to consume the empty messages at the beginning.
  • Previously, shard_ids were totally random, which isn't very realistic. The new code chooses at most 16 unique random shard ids between 0 and 32, which matches real-world examples better.
@jancionear jancionear force-pushed the snapshot_broadcast_tests branch from 3ef55d9 to 07505e2 on November 20, 2023 20:16
@jancionear jancionear added this pull request to the merge queue Nov 27, 2023
Merged via the queue into near:master with commit 0d64bc4 Nov 27, 2023
@jancionear jancionear deleted the snapshot_broadcast_tests branch November 27, 2023 20:14