Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoNAT: multiple opened connection #2515

Closed
hamamo opened this issue Feb 14, 2022 · 9 comments
Closed

AutoNAT: multiple opened connection #2515

hamamo opened this issue Feb 14, 2022 · 9 comments

Comments

@hamamo
Copy link

hamamo commented Feb 14, 2022

It seems that AutoNAT opens peer connections as part of its probing loop but does not close them again, so after some time connected peers have dozens of open connections between them. I only have a very rudimentary understanding of the implications but it doesn't look quite right :-)

@elenaf9
Copy link
Contributor

elenaf9 commented Feb 14, 2022

From what I understand, connections don't need to be explicitly closed.
Instead, a connection is generally only kept alive while the handler of at least one network behaviour reports KeepAlive::True (ref). The AutoNAT protocol internally wraps the request-response protocol: a probe is an outbound request, and the dial-response from the server is the response. After the server sent the response, the handler in the request-response protocol should not be keeping the connection alive anymore (only for 10s, as this is its default connection timeout), hence it should close. But in case that you are using any other network behaviour protocols in combination with AutoNAT, it could be that they keep the connection alive?
@hamamo Could you share a bit more information about the behaviours that you are using?
Or maybe I am missing something here, so please correct if that is the case.

@hamamo
Copy link
Author

hamamo commented Feb 18, 2022

@elenaf9 I'm using ping to keep connections alive, that may be the cause of the observed behavior. Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.
This may be something I change when I have a better sync mechanism for historical data, and nodes can catch up when they reconnect, but I'd rather have autonat and ping behave well together :-)

@rkuhn
Copy link
Contributor

rkuhn commented Feb 18, 2022

same here

@elenaf9
Copy link
Contributor

elenaf9 commented Feb 18, 2022

Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.

I am not too familiar with the details of gossipsub, but doesn't the gossipsub protocol itself already takes care that you stay connected with the peer in a mesh? What's the benefit of also using the ping for keep-alive?

// If we have joined the mesh, keep the connection alive.
GossipsubHandlerIn::JoinedMesh => {
self.in_mesh = true;
self.keep_alive = KeepAlive::Yes;
}

But independent of that, I still agree that autonat and ping should behave well together. What I am not so sure about is whether this is the responsibility of ping, or of autonat.

The whole logic around keep-alive follows more the idea of "every behaviour protocol decides for themself if they need the substream or not". Afaik none of the protocols explicitly closes a connection? Instead, even after the protocol that initiated the connection doesn't need that connection anymore, it is kept open "in case that any other protocol needs it". And imho network behaviours should not rely on other logic to explicitly close connections again. It could also be the case that the user frequently calls Swarm::dial for whatever reason, in which case we would run into the same problem.

I think ping is really the only protocol that just blindly keeps every substream alive. And the description for its keep-alive actually states that it keeps only the connection alive, not every substream. So what I would propose instead is to change Ping so that it only keeps at least one connection to a peer alive, but not every single one.

Overall, I am myself a bit undecided on this. If you still think that AutoNAT should explicitly clean up the connection that are opened in a dial-attempt I am fine with that as well, as I definitely see the reason for it. But my above point still stands.
What do you think?

@mxinden
Copy link
Member

mxinden commented Feb 18, 2022

I'm using ping to keep connections alive, that may be the cause of the observed behavior.

Do I understand correctly that you are setting PingConfig::with_keep_alive(true) @hamamo?

/// Whether the connection should generally be kept alive unless
/// `max_failures` occur.
keep_alive: bool,

Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.

Note that connections are kept alive, even without PingConfig::keep_alive(true), as long as they are being used. E.g., as @elenaf9 mentioned above, in the case of gossipsub that means that a connection is kept alive as long as the peer is part of the corresponding mesh.

@hamamo if you can provide a small reproducer, or point us to your code itself, that would be great.

I know we are not doing a great job documenting the connection lifecycle mechanism in libp2p. Let me know, I can extend the documentation by a paragraph or two in the next couple of days.

@hamamo
Copy link
Author

hamamo commented Feb 26, 2022

Today I tried setting keep_alive in PingConfig to false, but that causes the connection to be closed, and GossipSub events to be missed. So if a mesh is established but not permanently used it does not seem to keep connections alive, and GossipSub also doesn't attempt to re-dial peers known to be part of the mesh.

@mxinden: The code is at https://github.com/hamamo/reputation-net/tree/tokio (the master branch where I used async_std isn't updated yet). I know this is a big ball of spaghetti, didn't have the time yet to come up with a clean small example showing the issue.

@mxinden
Copy link
Member

mxinden commented Mar 1, 2022

So if a mesh is established but not permanently used it does not seem to keep connections alive

That is surprising to me. Once a peer joins a mesh, the in_mesh flag on the corresponding ConnectionHandler should be set to true and thus the connection should be kept alive.

https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/handler.rs#L132-L134

Could you check whether a peer successfully joins a mesh but is still disconnected? Also could you post the logs (RUST_LOG=debug) of a small reputation-net reproducing the issue @hamamo?

@mxinden
Copy link
Member

mxinden commented Mar 1, 2022

Just a wild guess, seeing logic in reputation_net/mod.rs related to awaiting storage operations within the custom NetworkBehaviour. Do these operations block the main libp2p task for longer period of time? If so, this might cause connections to time out. Let's see what the logs say.

https://github.com/hamamo/reputation-net/blob/tokio/src/reputation_net/mod.rs

@thomaseizinger
Copy link
Contributor

We are reworking idle connection management, see #4306.

I am closing this as stale.

@thomaseizinger thomaseizinger closed this as not planned Won't fix, can't repro, duplicate, stale Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants