Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): reconnect disconnected peer #5303

Closed
wants to merge 2 commits into from

Conversation

aoyako
Copy link
Contributor

@aoyako aoyako commented Feb 5, 2025

Summary

This PR addresses scenarios like those described in #5286, where peers may not appear in online peer lists, causing them to passively wait for incoming connections (become isolated).

Changes

  • During the gossip process, if a peer disconnects, it will now actively update the online peer list with previously connected peers and those in TRUSTED_PEERS. On the other hand, on the restart, peers will reference the TRUSTED_PEERS list.

Checks are added during peer gossip, so there are certain delays in online peer updates. This shouldn't cause problems with too many requests.

Signed-off-by: Lohachov Mykhailo <lohachov@soramitsu.co.jp>
Signed-off-by: Lohachov Mykhailo <lohachov@soramitsu.co.jp>
@aoyako aoyako changed the title fix(bug): reconnect disconnected peer fix(core): reconnect disconnected peer Feb 5, 2025
@dima74
Copy link
Contributor

dima74 commented Feb 6, 2025

I think this should be fixed differently. There is an existing mechanism when peer constantly tries to connect to other peers -NetworkBase::update_topology(). It relies on NetworkBase::current_peers_addresses, which is updated by PeersGossiper::network_update_peers_addresses(), which depends on PeersGossiper::initial_peers, PeersGossiper::gossip_peers and NetworkBaseHandle::online_peers(). The bug is that network_update_peers_addresses() is not called when online_peers() are updated. So I think it should be fixed by changing

() = self.network.wait_online_peers_update(|_| ()) => {
self.gossip_peers();
}

to

() = self.network.wait_online_peers_update(|_| ()) => {
    self.gossip_peers();
    self.network_update_peers_addresses();
}

@dima74 dima74 self-assigned this Feb 6, 2025
@aoyako
Copy link
Contributor Author

aoyako commented Feb 7, 2025

I think this should be fixed differently. There is a mechanism when peer constantly tries to connect to other peers -NetworkBase::update_topology(). It relies on NetworkBase::current_peers_addresses, which is updated by PeersGossiper::network_update_peers_addresses(), which depends on PeersGossiper::initial_peers, PeersGossiper::gossip_peers and NetworkBaseHandle::online_peers(). The bug is that network_update_peers_addresses() is not called when online_peers() are updated. So I think it should be fixed by changing

() = self.network.wait_online_peers_update(|_| ()) => {
self.gossip_peers();
}

to

() = self.network.wait_online_peers_update(|_| ()) => {
    self.gossip_peers();
    self.network_update_peers_addresses();
}

I agree that network_update_peers_addresses can be used instead of the new sync_online_peers since it covers more functionality, including peer connections and disconnections.
However, it seems that network_update_peers_addresses attempts to also connect to gossip_peers, which can only grow (assuming the topology remains unchanged, i.e., no unregisters). This may lead to repeated connection attempts after each peer connects or disconnects (once wait_online_peers_update is received).

I believe in the described case, there should be a check after wait_online_peers_update to determine if there are zero connected peers. If so, a forced reconnect bynetwork_update_peers_addresses might be necessary. In this situation, I am unsure of the best approach: should we trigger an immediate reconnect or wait for the next gossip period?

@dima74
Copy link
Contributor

dima74 commented Feb 7, 2025

However, it seems that network_update_peers_addresses attempts to also connect to gossip_peers, which can only grow (assuming the topology remains unchanged, i.e., no unregisters). This may lead to repeated connection attempts after each peer connects or disconnects (once wait_online_peers_update is received).

Sorry I don't understand the problem, could you give an example? NetworkBase::update_topology() tries to connect to peer only if it is not yet connected, and no connection attempt is in process:

&& !self.peers.contains_key(id)
&& !self
.connecting_peers
.values()
.any(|peer| (peer.id(), peer.address()) == (id, address))

@aoyako
Copy link
Contributor Author

aoyako commented Feb 10, 2025

Sorry I don't understand the problem, could you give an example? NetworkBase::update_topology() tries to connect to a peer only if it is not yet connected and no connection attempt is in the process:

Suppose peers in the topology are divided into three categories:

  1. Always online (ON)
  2. Were previously connected but currently disconnected (OFF)
  3. Randomly Connecting/Disconnecting (CD)

So when some peer connects, it will trigger changes in online_peers, and
network_update_peers_addresses will send peers to update:
update_peers = initial_peers + gossip_peers - online_peers
Note that gossip_peers also contains OFF peers.

update peers will call update_topology with received peers, and it will try to connect:
update_peers - ready_peers - connecting_peers

If I understand correctly, network_update_peers_addresses will include OFF and CD peers, causing OFF peers to be present in update_topology.
My concern is that these reconnection attempts will be made every time some peer connects/disconnects.

Please correct me if I am wrong.

@dima74
Copy link
Contributor

dima74 commented Feb 10, 2025

My concern is that these reconnection attempts will be made every time some peer connects/disconnects.

This might be a problem. In this case I propose to remove self.update_topology() from NetworkBase::set_current_peers_addresses() (in addition to this change), so that any reconnection will happen at most once a second. This will fix #5286 and it will be exactly the same behaviour as we had before #5176 (which introduced bug #5286). And I propose to open separate issue to investigate/discuss whether reconnection once a second is ok or should be adjusted further

@aoyako
Copy link
Contributor Author

aoyako commented Feb 17, 2025

This might be a problem. In this case I propose to remove self.update_topology() from NetworkBase::set_current_peers_addresses() (in addition to this change), so that any reconnection will happen at most once a second. This will fix #5286 and it will be exactly the same behaviour as we had before #5176 (which introduced bug #5286). And I propose to open separate issue to investigate/discuss whether reconnection once a second is ok or should be adjusted further

I am not sure what the best solution in this case will be. self.update_topology is not triggered from the regular self.gossip_peers(), so it works fine at this moment. Is there a reason why current behavior should be changed?

@dima74
Copy link
Contributor

dima74 commented Feb 17, 2025

I think this should be fixed this way. I think problem with current implementation in this PR is that it introduces duplicated mechanism for peer address update, and that it fixes only case when online_peers are empty (#5286), however there might be other cases (the root problem is that network_update_peers_addresses() is not called when online_peers() are updated)

@s8sato
Copy link
Contributor

s8sato commented Feb 21, 2025

A smaller change would be preferred for this bug fix, especially since it will be a backport to RC1.
Could you consider this approach #5303 (comment) @aoyako ?
If you have any ideas about P2P logic, you can create a separate PR for refactoring or performance improvements

@s8sato s8sato self-assigned this Feb 21, 2025
@aoyako aoyako marked this pull request as draft February 21, 2025 10:26
@aoyako aoyako closed this Feb 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants