test: add tests for SnapshotHostInfo broadcast behavior #10204
Conversation
Thanks for adding these tests, this looks great overall! Left a few minor suggestions.
@@ -2,5 +2,6 @@ mod accounts_data;
mod connection_pool;
mod nonce;
mod routing;
mod sync_snapshot_hosts;
nit: snapshot_hosts would be more consistent with the other modules here.
/// 3. fcntl() duplicating one end of some globally shared socketpair()
/// 4. fcntl() duplicating epoll socket created in (2)
/// This gives 5 file descriptors per PeerActor (4 + 1 TCP socket).
const FDS_PER_PEER: usize = 5;
This constant and the associated commentary are now duplicated here and in chain/network/src/peer_manager/tests/accounts_data.rs. Consider moving it to chain/network/src/peer_manager/testonly.rs and importing it instead.
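For illustration, the shared definition could look roughly like this in testonly.rs. This is only a sketch of the reviewer's suggestion; the visibility, placement, and doc comment wording are assumptions, not the actual nearcore code.

// Sketch of a shared constant in chain/network/src/peer_manager/testonly.rs
// (assumed placement; the full fd-accounting commentary would move here too).
/// Approximate number of file descriptors consumed by each PeerActor in tests.
pub const FDS_PER_PEER: usize = 5;

// Both test modules could then import it instead of redefining it:
// use crate::peer_manager::testonly::FDS_PER_PEER;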
tracing::info!(target:"test", "Send a SyncSnapshotHosts message from peer1, make sure that all peers receive it.");

let info1 =
    make_snapshot_host_info(&peer1_config.node_id(), 123, vec![0, 1], &peer1_config.node_key);
Since your tests aren't at all sensitive to the contents of SnapshotHostInfo, it may help readability to pass rng into make_snapshot_host_info and generate some random epoch_height and shards inside of it rather than in-lining arbitrary values here.
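For illustration, the suggested helper might look something like this. It is only a sketch: the parameter order, the rng type, and the value ranges are assumptions rather than the PR's actual code, and the imports are the same ones the test module already uses.

// Sketch: generate arbitrary SnapshotHostInfo contents inside the helper so the
// call sites don't need to invent epoch heights or shard ids themselves.
fn make_snapshot_host_info(
    peer_id: &PeerId,
    node_key: &SecretKey,
    rng: &mut impl rand::Rng,
) -> Arc<SnapshotHostInfo> {
    let epoch_height: EpochHeight = rng.gen();
    let num_shards = rng.gen_range(1..16);
    let shards: Vec<ShardId> = (0..num_shards).map(|_| rng.gen()).collect();
    let sync_hash = CryptoHash::hash_borsh(epoch_height);
    Arc::new(SnapshotHostInfo::new(peer_id.clone(), sync_hash, epoch_height, shards, node_key))
}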
let peer4_sync_msg = peer4.events.recv_until(take_sync_snapshot_msg).await;
assert_eq!(peer4_sync_msg.hosts, vec![info1.clone()]);

tracing::info!(target:"test", "Publish another piece of snapshot information, check that it's also broadcasted.");
Nice coverage of all the interesting cases in this test.
pms[1].actix.addr.send(message.with_span_context()).await.unwrap();

tracing::info!(target:"test", "Make sure that the message sent from #1 reaches #2 on the other side of the ring.");
let received_message = pms[2].events.recv_until(take_sync_snapshot_msg_manager).await;
What if we moved this inside peer_manager/testonly.rs and actually checked the internal state of the SnapshotHostCache, along the lines of:

nearcore/chain/network/src/peer_manager/testonly.rs, lines 404 to 410 in ed3dc7e:

// Awaits until the accounts_data state matches `want`.
pub async fn wait_for_accounts_data(&self, want: &HashSet<Arc<SignedAccountData>>) {
    self.wait_for_accounts_data_pred(|cache| {
        &cache.data.values().cloned().collect::<HashSet<_>>() == want
    })
    .await
}
That is, verify that the event is not only received by the peer manager actor but actually processed correctly into the local state?
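A rough sketch of such a helper, modeled on wait_for_accounts_data. The existence of a wait_for_snapshot_hosts_pred helper and the way the SnapshotHostsCache exposes its entries are assumptions, not the actual API.

// Sketch only: assumes a wait_for_snapshot_hosts_pred helper exists and that the
// SnapshotHostsCache can enumerate the SnapshotHostInfo entries it currently holds.
pub async fn wait_for_snapshot_hosts(&self, want: &HashSet<Arc<SnapshotHostInfo>>) {
    self.wait_for_snapshot_hosts_pred(|cache| {
        &cache.hosts().into_iter().collect::<HashSet<_>>() == want
    })
    .await
}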
}

/// Used to consume peer manager events until there's an event of type SyncSnapshotHosts
fn take_sync_snapshot_msg_manager(
See comment at the end of the file; I think this lies in a weird middle ground between consuming just the network events (as done in the first tests with fake peer managers) and what is possible in the last test (in the full peer managers there exists a SnapshotHostsCache instance and we can verify its state).
assert_eq!(got1.hosts, vec![info2.clone()]);

tracing::info!(target:"test", "Connect another peer, check that it receives all of the published information.");

nit: Consider removing this newline; having a single chunk starting with the tracing message and ending with the assert helps readability in these long tests with multiple sections.
Thank you for the feedback!
Looks great, thanks!
) -> Arc<SnapshotHostInfo> {
    let epoch_height: EpochHeight = rng.gen::<EpochHeight>();
    let shards_num: usize = rng.gen_range(1..16);
    let shards: Vec<ShardId> = (0..shards_num).map(|_| rng.gen()).collect();
It shouldn't matter for these tests, but small, distinct values would be more realistic for shard ids. It works like this in practice:
- The shard layout for the protocol has shards labelled 0, 1, ..., N-1. Today N = 4.
- A snapshot contains some subset of those shards.

The components you are testing here (network infra and the SnapshotHostsCache) don't really care about these properties and can handle arbitrary values, so I don't necessarily think these tests need to change. Unless/until we introduce additional verification of these values within the network side of things, the downstream consumers of this content should be able to handle unexpected values for ShardId anyway. @VanBarbascu
I intentionally didn't limit the values here because I wanted to see what happens :P
Maybe the cache allocates vec![0; max_shard_id+1] and the program would crash? I felt that it would make the test more thorough. After all, malicious peers could send messages with large shard values, so that case should be handled gracefully.
I'll throw out one other idea (but please feel free to merge this as-is and move on), which would be to conduct the "happy path" tests on valid data and add some separate test test_unexpected_shard_ids which passes large shard values to see what happens. Then in the future, if we happen to introduce some downstream consumer which does something like you are imagining (e.g. allocating a vec of size max_shard_id + 1), the failing test would pinpoint the exact problem.
Good idea, I made shard generation more realistic and added a separate test which makes sure that PeerManager handles big shard ids correctly.
One thing that worries me a bit is that this test doesn't pass when I send the message as the peer manager itself. It looks like a PeerManager doesn't process its own messages. Is this a concern, or is it working as expected?
Here's a failing test, it hangs on the last line:
/// Send a SyncSnapshotHosts message with very large shard ids.
/// Makes sure that PeerManager processes large shard ids without any problems.
#[tokio::test]
async fn large_shard_id_in_cache_pure_manager() {
    init_test_logger();
    let mut rng = make_rng(921853233);
    let rng = &mut rng;
    let mut clock = time::FakeClock::default();
    let chain = Arc::new(data::Chain::make(&mut clock, rng, 10));

    tracing::info!(target:"test", "Create a peer manager.");
    let pm = peer_manager::testonly::start(
        clock.clock(),
        near_store::db::TestDB::new(),
        chain.make_config(rng),
        chain.clone(),
    )
    .await;

    tracing::info!(target:"test", "Send a SnapshotHostInfo message with very large shard ids.");
    let big_shard_info = Arc::new(SnapshotHostInfo::new(
        pm.peer_info().id,
        CryptoHash::hash_borsh(1234_u64),
        1234,
        vec![0, 1232232, ShardId::MAX - 1, ShardId::MAX],
        &pm.cfg.node_key,
    ));
    let message = PeerManagerMessageRequest::NetworkRequests(NetworkRequests::SnapshotHostInfo {
        sync_hash: big_shard_info.sync_hash,
        epoch_height: big_shard_info.epoch_height,
        shards: big_shard_info.shards.clone(),
    });
    pm.actix.addr.send(message.with_span_context()).await.unwrap();

    tracing::info!(target:"test", "Make sure that the message is received and processed without any problems.");
    let want: HashSet<Arc<SnapshotHostInfo>> = std::iter::once(big_shard_info).collect();
    pm.wait_for_snapshot_hosts(&want).await;
}
OK, interesting case. It is technically working as expected, because NetworkRequest::SnapshotHostInfo triggers a network message to the peers of the node:

nearcore/chain/network/src/peer_manager/peer_manager_actor.rs, lines 787 to 802 in bf7ba11:
NetworkRequests::SnapshotHostInfo { sync_hash, epoch_height, shards } => {
    // Sign the information about the locally created snapshot using the keys in the
    // network config before broadcasting it
    let snapshot_host_info = SnapshotHostInfo::new(
        self.state.config.node_id(),
        sync_hash,
        epoch_height,
        shards,
        &self.state.config.node_key,
    );
    self.state.tier2.broadcast_message(Arc::new(PeerMessage::SyncSnapshotHosts(
        SyncSnapshotHosts { hosts: vec![snapshot_host_info.into()] },
    )));
    NetworkResponses::NoResponse
}
However, it sounds reasonable to also insert the SnapshotHostInfo into the local cache here.
Should we do something with it? I don't feel qualified to decide, since you're the expert on this code.
Can I merge the PR, or should I go and "fix" this "issue"?
I like the idea of inserting to the local cache here (in addition to broadcasting to all peers) if you want to sneak that in with this PR. It should just be a 1 line change (and then you can keep the test as is).
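For illustration, the change could look roughly like this inside the match arm quoted earlier. The snapshot_hosts field name and its insert method are assumptions about the cache API, not confirmed from the source.

// Hypothetical: keep our own snapshot info in the local cache before broadcasting it.
let snapshot_host_info: Arc<SnapshotHostInfo> = snapshot_host_info.into();
self.state.snapshot_hosts.insert(snapshot_host_info.clone()); // assumed cache API
self.state.tier2.broadcast_message(Arc::new(PeerMessage::SyncSnapshotHosts(
    SyncSnapshotHosts { hosts: vec![snapshot_host_info] },
)));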
Eh, for now I'll just merge this PR, as I need to have these changes on master.
The insertion can be done in a follow-up PR.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #10204      +/-   ##
==========================================
+ Coverage   71.87%   71.88%   +0.01%
==========================================
  Files         707      707
  Lines      141791   141815      +24
  Branches   141791   141815      +24
==========================================
+ Hits       101907   101943      +36
+ Misses      35171    35153      -18
- Partials     4713     4719       +6

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Hmm, something went wrong with code coverage, I will rebase onto the latest master.
Add a test which spawns a single PeerManager and a few peers connected to it. The peers send out SyncSnapshotHosts messages and the PeerManager should broadcast them to all of the connected peers.
PeerManager shouldn't broadcast SyncSnapshotHosts messages that contain an invalid signature.
When one PeerManager receives a SyncSnapshotHosts message, it should propagate it to all of the connected PeerManagers. The message should spread through the network, eventually reaching PeerManagers that aren't directly connected to the sender. I hit the file descriptor limit with this test, so I reused the logic from accounts_data.rs to increase the limit.
Now we're waiting for the right state, so there is no need to consume the empty messages at the beginning.
Previously shard_ids were totally random, which isn't very realistic. The new code chooses at most 16 unique random shard ids between 0 and 32, which matches real-world examples better.
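For illustration, that kind of generation could look roughly like this. The function name and exact bounds are illustrative, not the PR's actual code.

use rand::seq::SliceRandom;

// Sketch: pick at most 16 distinct shard ids from the range 0..32.
fn make_realistic_shards(rng: &mut impl rand::Rng) -> Vec<ShardId> {
    let candidates: Vec<ShardId> = (0..32).collect();
    let num_shards = rng.gen_range(1..=16);
    candidates.choose_multiple(rng, num_shards).cloned().collect()
}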
Force-pushed from 3ef55d9 to 07505e2.
Add a few tests which check that SnapshotHostInfo is properly broadcast between peers. SnapshotHostInfo lets other peers know that a certain peer has a state snapshot at a specific point in time, which is later used for state sync. Nodes publish this information periodically and it should be broadcast to all nodes in the network.

Fixes: #10172