
test: add tests for SnapshotHostInfo broadcast behavior #10204

Merged
merged 12 commits into near:master from snapshot_broadcast_tests on Nov 27, 2023

Conversation

jancionear
Contributor

Add a few tests verifying that SnapshotHostInfo is properly broadcast between peers.

SnapshotHostInfo lets other peers know that a certain peer has a state snapshot at a specific point in time, which is later used for state sync. Nodes publish this information periodically and it should be broadcast to all nodes in the network.

Fixes: #10172
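For context, here is a rough sketch of the payload being broadcast, reconstructed from how the tests in this PR construct it. The field names and placeholder type aliases are assumptions for illustration, not the exact nearcore definitions.

```rust
// Sketch only: approximate shape of SnapshotHostInfo, inferred from the
// arguments passed to SnapshotHostInfo::new() in the tests in this PR.
// Placeholder type aliases stand in for the real near-primitives types.
type PeerId = [u8; 32];
type CryptoHash = [u8; 32];
type EpochHeight = u64;
type ShardId = u64;
type Signature = Vec<u8>;

pub struct SnapshotHostInfoSketch {
    pub peer_id: PeerId,           // peer that hosts the snapshot
    pub sync_hash: CryptoHash,     // point in the chain the snapshot was taken at
    pub epoch_height: EpochHeight, // epoch height of the snapshot
    pub shards: Vec<ShardId>,      // shards included in the snapshot
    pub signature: Signature,      // signed with the host's node key so receivers can verify it
}
```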

@jancionear jancionear requested a review from a team as a code owner November 17, 2023 14:26
Collaborator

@saketh-are saketh-are left a comment

Thanks for adding these tests; this looks great overall! Left a few minor suggestions.

@@ -2,5 +2,6 @@ mod accounts_data;
mod connection_pool;
mod nonce;
mod routing;
mod sync_snapshot_hosts;
Collaborator

nit: snapshot_hosts would be more consistent with the other modules here.

/// 3. fcntl() duplicating one end of some globally shared socketpair()
/// 4. fcntl() duplicating epoll socket created in (2)
/// This gives 5 file descriptors per PeerActor (4 + 1 TCP socket).
const FDS_PER_PEER: usize = 5;
Collaborator

This constant and the associated commentary is now duplicated here and in chain/network/src/peer_manager/tests/accounts_data.rs. Consider moving it to chain/network/src/peer_manager/testonly.rs and importing it instead.
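A minimal sketch of that deduplication, assuming the constant can simply be made pub(crate) in testonly.rs; exact visibility and module paths may need adjusting:

```rust
// chain/network/src/peer_manager/testonly.rs (sketch)
// Each PeerActor consumes 5 file descriptors in total
// (1 TCP socket + 4 others, per the comment quoted above).
pub(crate) const FDS_PER_PEER: usize = 5;

// In the snapshot hosts test module (sketch):
// use crate::peer_manager::testonly::FDS_PER_PEER;
```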

tracing::info!(target:"test", "Send a SyncSnapshotHosts message from peer1, make sure that all peers receive it.");

let info1 =
make_snapshot_host_info(&peer1_config.node_id(), 123, vec![0, 1], &peer1_config.node_key);
Collaborator

Since your tests aren't at all sensitive to the contents of SnapshotHostInfo, it may help readability to pass rng into make_snapshot_host_info and generate a random epoch_height and shards inside it rather than inlining arbitrary values here.
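A sketch of the suggested shape of the helper, roughly matching what the later revision of this PR does. The parameter order, the node key type, and the exact SnapshotHostInfo::new signature are taken from code quoted elsewhere in this review and may not be exact.

```rust
// Sketch: let the helper generate arbitrary epoch_height/shards itself,
// so call sites don't need to inline magic values.
fn make_snapshot_host_info(
    peer_id: &PeerId,
    node_key: &SecretKey,
    rng: &mut impl rand::Rng,
) -> Arc<SnapshotHostInfo> {
    let epoch_height: EpochHeight = rng.gen();
    let num_shards: usize = rng.gen_range(1..16);
    let shards: Vec<ShardId> = (0..num_shards).map(|_| rng.gen()).collect();
    let sync_hash = CryptoHash::hash_borsh(epoch_height);
    Arc::new(SnapshotHostInfo::new(peer_id.clone(), sync_hash, epoch_height, shards, node_key))
}
```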

let peer4_sync_msg = peer4.events.recv_until(take_sync_snapshot_msg).await;
assert_eq!(peer4_sync_msg.hosts, vec![info1.clone()]);

tracing::info!(target:"test", "Publish another piece of snapshot information, check that it's also broadcasted.");
Collaborator

Nice coverage of all the interesting cases in this test.

pms[1].actix.addr.send(message.with_span_context()).await.unwrap();

tracing::info!(target:"test", "Make sure that the message sent from #1 reaches #2 on the other side of the ring.");
let received_message = pms[2].events.recv_until(take_sync_snapshot_msg_manager).await;
Collaborator

What if we moved this inside peer_manager/testonly.rs and actually checked the internal state of the SnapshotHostCache, along the lines of:

// Awaits until the accounts_data state matches `want`.
pub async fn wait_for_accounts_data(&self, want: &HashSet<Arc<SignedAccountData>>) {
    self.wait_for_accounts_data_pred(|cache| {
        &cache.data.values().cloned().collect::<HashSet<_>>() == want
    })
    .await
}

That is, verify that the event is not only received by the peer manager actor but actually processed correctly into the local state?
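For the snapshot-hosts analog, a hedged sketch of what such a helper could look like, mirroring the accounts_data example above; the `wait_for_snapshot_hosts_pred` helper and the `hosts` accessor on the cache are assumed names, not the actual internals.

```rust
// Sketch: await until the local SnapshotHostsCache state matches `want`.
// `wait_for_snapshot_hosts_pred` and `cache.hosts` are assumed to mirror the
// accounts_data equivalents; the names are illustrative only.
pub async fn wait_for_snapshot_hosts(&self, want: &HashSet<Arc<SnapshotHostInfo>>) {
    self.wait_for_snapshot_hosts_pred(|cache| {
        &cache.hosts.values().cloned().collect::<HashSet<_>>() == want
    })
    .await
}
```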

}

/// Used to consume peer manager events until there's an event of type SyncSnapshotHosts
fn take_sync_snapshot_msg_manager(
Collaborator

See comment at the end of the file; I think this lies in a weird middle ground between consuming just the network events (as done in the first tests with fake peer managers) and what is possible in the last test (in the full peer managers there exists a SnapshotHostsCache instance and we can verify its state).

assert_eq!(got1.hosts, vec![info2.clone()]);

tracing::info!(target:"test", "Connect another peer, check that it receieves all of the published information.");

Collaborator

nit: Consider removing this newline; having a single chunk starting with the tracing message and ending with the assert helps readability in these long tests with multiple sections.

@jancionear
Contributor Author

Thank you for the feedback!
I fixed all the issues; please take another look.

Collaborator

@saketh-are saketh-are left a comment

Looks great, thanks!

) -> Arc<SnapshotHostInfo> {
let epoch_height: EpochHeight = rng.gen::<EpochHeight>();
let shards_num: usize = rng.gen_range(1..16);
let shards: Vec<ShardId> = (0..shards_num).map(|_| rng.gen()).collect();
Collaborator

It shouldn't matter for these tests, but small, distinct values would be more realistic for shard ids. It works like this in practice:

  • The shard layout for the protocol has shards labelled 0, 1, ... N-1. Today N = 4.
  • A snapshot contains some subset of those shards.

The components you are testing here (network infra and the SnapshotHostsCache) don't really care about these properties and can handle arbitrary values, so I don't necessarily think these tests need to change.

Unless/until we introduce additional verification of these values within the network side of things, the downstream consumers of this content should be able to handle unexpected values for ShardId anyway. @VanBarbascu
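As an illustration of the "small, distinct values" point, a minimal sketch of generating a realistic shard set by sampling a subset of a shard layout with `n` shards; the helper name and parameter are made up for this example and don't correspond to anything in the PR.

```rust
use rand::seq::SliceRandom;
use rand::Rng;

// Sketch: pick a non-empty, distinct, sorted subset of shard ids 0..n
// (assumes n >= 1), instead of fully random u64 values.
fn realistic_shards(rng: &mut impl Rng, n: u64) -> Vec<u64> {
    let all: Vec<u64> = (0..n).collect();
    let count = rng.gen_range(1..=all.len());
    let mut shards: Vec<u64> = all.choose_multiple(rng, count).copied().collect();
    shards.sort_unstable();
    shards
}
```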

Contributor Author

I intentionally didn't limit the values here because I wanted to see what happens :P
Maybe the cache allocates vec![0; max_shard_id+1] and the program would crash? I felt that it would make the test more thorough. After all, malicious peers could send messages with large shard values, so it should be handled gracefully.

Collaborator

@saketh-are saketh-are Nov 20, 2023

I'll throw out one other idea (but please feel free to merge this as-is and move on), which would be to conduct the "happy path" tests on valid data and add a separate test test_unexpected_shard_ids which passes large shard values to see what happens. Then in the future, if we happen to introduce some downstream consumer which does something like you are imagining (e.g. allocate a vec of size max_shard_id + 1), the failing test would pinpoint the exact problem.

Contributor Author

Good idea, I made shard generation more realistic and added a separate test which makes sure that PeerManager handles big shard ids correctly.

Contributor Author

@jancionear jancionear Nov 20, 2023

One thing that worries me a bit is that this test doesn't pass when I send the message as the peer manager itself. It looks like a PeerManager doesn't process its own messages. Is this a concern, or is it working as expected?
Here's a failing test; it hangs on the last line:

/// Send a SyncSnapshotHosts message with very large shard ids.
/// Makes sure that PeerManager processes large shard ids without any problems.
#[tokio::test]
async fn large_shard_id_in_cache_pure_manager() {
    init_test_logger();
    let mut rng = make_rng(921853233);
    let rng = &mut rng;
    let mut clock = time::FakeClock::default();
    let chain = Arc::new(data::Chain::make(&mut clock, rng, 10));

    tracing::info!(target:"test", "Create a peer manager.");
    let pm = peer_manager::testonly::start(
        clock.clock(),
        near_store::db::TestDB::new(),
        chain.make_config(rng),
        chain.clone(),
    )
    .await;

    tracing::info!(target:"test", "Send a SnapshotHostInfo message with very large shard ids.");
    let big_shard_info = Arc::new(SnapshotHostInfo::new(
        pm.peer_info().id,
        CryptoHash::hash_borsh(1234_u64),
        1234,
        vec![0, 1232232, ShardId::MAX - 1, ShardId::MAX],
        &pm.cfg.node_key,
    ));

    let message = PeerManagerMessageRequest::NetworkRequests(NetworkRequests::SnapshotHostInfo {
        sync_hash: big_shard_info.sync_hash,
        epoch_height: big_shard_info.epoch_height,
        shards: big_shard_info.shards.clone(),
    });

    pm.actix.addr.send(message.with_span_context()).await.unwrap();

    tracing::info!(target:"test", "Make sure that the message is received and processed without any problems.");
    let want: HashSet<Arc<SnapshotHostInfo>> = std::iter::once(big_shard_info).collect();
    pm.wait_for_snapshot_hosts(&want).await;
}

Collaborator

OK, interesting case. It is technically working as expected, because NetworkRequests::SnapshotHostInfo triggers a network message to the peers of the node:

NetworkRequests::SnapshotHostInfo { sync_hash, epoch_height, shards } => {
    // Sign the information about the locally created snapshot using the keys in the
    // network config before broadcasting it
    let snapshot_host_info = SnapshotHostInfo::new(
        self.state.config.node_id(),
        sync_hash,
        epoch_height,
        shards,
        &self.state.config.node_key,
    );
    self.state.tier2.broadcast_message(Arc::new(PeerMessage::SyncSnapshotHosts(
        SyncSnapshotHosts { hosts: vec![snapshot_host_info.into()] },
    )));
    NetworkResponses::NoResponse
}

However, it sounds reasonable to also insert the SnapshotHostInfo to the local cache here.

Contributor Author

Should we do something with it? I don't feel qualified to decide, since you're the expert on this code.
Can I merge the PR, or should I go and "fix" this "issue"?

Collaborator

@saketh-are saketh-are Nov 27, 2023

I like the idea of inserting to the local cache here (in addition to broadcasting to all peers) if you want to sneak that in with this PR. It should just be a one-line change (and then you can keep the test as-is).
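For concreteness, a hedged sketch of roughly where that insertion could go in the handler quoted earlier; `self.state.snapshot_hosts` and its `insert` method are assumed names, not necessarily the actual nearcore API.

```rust
// Sketch only: hypothetical addition inside the NetworkRequests::SnapshotHostInfo
// handler quoted above; `self.state.snapshot_hosts` and `insert` are assumed names.
let snapshot_host_info = Arc::new(snapshot_host_info);
// Record our own snapshot info in the local cache, in addition to broadcasting it.
self.state.snapshot_hosts.insert(snapshot_host_info.clone());
self.state.tier2.broadcast_message(Arc::new(PeerMessage::SyncSnapshotHosts(
    SyncSnapshotHosts { hosts: vec![snapshot_host_info] },
)));
```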

Contributor Author

Eh, for now I'll just merge this PR, as I need to have these changes on master.
The insertion can be done in a follow-up PR.


codecov bot commented Nov 20, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2b24bc8) 71.87% compared to head (07505e2) 71.88%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10204      +/-   ##
==========================================
+ Coverage   71.87%   71.88%   +0.01%     
==========================================
  Files         707      707              
  Lines      141791   141815      +24     
  Branches   141791   141815      +24     
==========================================
+ Hits       101907   101943      +36     
+ Misses      35171    35153      -18     
- Partials     4713     4719       +6     
| Flag | Coverage Δ |
| --- | --- |
| backward-compatibility | 0.08% <ø> (ø) |
| db-migration | 0.08% <ø> (ø) |
| genesis-check | 1.23% <ø> (ø) |
| integration-tests | 36.16% <0.00%> (-0.01%) ⬇️ |
| linux | 71.78% <100.00%> (+0.04%) ⬆️ |
| linux-nightly | 71.64% <95.83%> (-0.02%) ⬇️ |
| macos | 55.74% <ø> (-0.01%) ⬇️ |
| pytests | 1.46% <ø> (ø) |
| sanity-checks | 1.26% <ø> (ø) |
| unittests | 68.29% <100.00%> (+0.07%) ⬆️ |
| upgradability | 0.13% <ø> (ø) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@jancionear
Contributor Author

Hmm, something went wrong with code coverage; I will rebase onto the latest master.

  • Add a test which spawns a single PeerManager and a few peers connected to it. The peers send out SyncSnapshotHosts messages and the PeerManager should broadcast them to all of the connected peers.
  • PeerManager shouldn't broadcast SyncSnapshotHosts messages that contain an invalid signature.
  • When one PeerManager receives a SyncSnapshotHosts message, it should propagate it to all of the connected PeerManagers. The message should spread through the network, eventually reaching PeerManagers that aren't directly connected to the sender.
  • I hit the file descriptor limit with this test, so I reused the logic from accounts_data.rs to increase the limit.
  • Now we're waiting for the right state, so there is no need to consume the empty messages at the beginning.
  • Previously, shard_ids were totally random, which isn't very realistic. The new code chooses at most 16 unique random shard ids between 0 and 32, which matches real-world examples better.
@jancionear jancionear force-pushed the snapshot_broadcast_tests branch from 3ef55d9 to 07505e2 on November 20, 2023 20:16
@jancionear jancionear added this pull request to the merge queue Nov 27, 2023
Merged via the queue into near:master with commit 0d64bc4 Nov 27, 2023
@jancionear jancionear deleted the snapshot_broadcast_tests branch November 27, 2023 20:14