not connected to peer after disconnect #616

offerm · 2018-10-30T08:20:02Z

My node was up connected to 3 peers and have peer orders.

wifi down for 3 minutes, wifi back up.

peers are missing (no reconnect).

kilrau · 2018-10-30T08:34:40Z

Maybe add logs @offerm

offerm · 2018-10-30T08:45:53Z

Not really needed. This is not a bug - we don't [YET] have the code that suppose to do it (IMHO). Log only shows the disconnects and that is it.

…

On Tue, Oct 30, 2018 at 10:34 AM Kilian Rausch ⚡️ ***@***.***> wrote: Maybe add logs @offerm <https://github.com/offerm> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#616 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJQ0cnA7YiaRCDqaYa5k0tXVFLwoWuaAks5uqA8jgaJpZM4YBZTX> .

kilrau · 2018-10-30T09:15:35Z

Hmm shouldn't #392 (#311) catch it where we defined a connection retry logic for a momentarily unavailable peer?

moshababo · 2018-10-31T13:35:44Z

@kilrau #392 is for the initial connection retries, not after disconnection. We talked before about re-connecting after a disconnection, but decided not to do so.

This specific disconnection happens because peer is stalling ping/pong response. It doesn't make sense to define re-connection from the side which closed the connection. Instead we can check
network connectivity before closing peers for stalling responses.

But the other side might close the connection as well, and we won't know about it until we're back online. In this case re-connecting make sense. To implement this, we need to define specific cases for reconnecting after peer closed the connection, because in some cases it can be considered as DoS.

To do this, we first need to implement #152.
I expect that we'll get this packet when we'll be back online, but we need to verify it.

sangaman · 2018-10-31T14:14:35Z

Would the idea then be to attempt a reconnection after receiving a DISCONNECTING packet? And I'm guessing the packet should specify that the reason for the disconnection is because our node became unresponsive?

offerm · 2018-10-31T14:40:55Z

Kind of inventing the wheel here, no? See how this started, I had a wifi issue, not a disconnect by a remote peer. IMHO, if I connected to a peer and disconnected, there should be a simple retry mechanism that try to connect in a similar way we are doing when LND disconnected. Now, if the connection was rejected (reject package) by a peer there is no need to try an reconnect. IMHO, peer rejection and peer banning have some time... wifi problems and even closing my my laptop going to `sleep` are here already.

…

On Wed, Oct 31, 2018 at 4:14 PM Daniel McNally ***@***.***> wrote: Would the idea then be to attempt a reconnection after receiving a DISCONNECTING packet? And I'm guessing the packet should specify that the reason for the disconnection is because our node became unresponsive? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#616 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJQ0cleSG_WnjE2KoTVZT6_Un0nH4B4Gks5uqbBPgaJpZM4YBZTX> .

sangaman · 2018-10-31T14:52:30Z

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

offerm · 2018-10-31T15:07:09Z

TCP socket will close when TCP layers notice it. If you have a short drop that is recovered quickly, TCP me recover without the peers notice. If TCP detected the problem and fail to revoced the peer will not be able to recover. So, when see the error on the application level, it is a done deal.

…

On Wed, Oct 31, 2018 at 4:54 PM Daniel McNally ***@***.***> wrote: I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#616 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJQ0ciRx5JuGrI7qPTMwfIbNAP7d3dPlks5uqbmQgaJpZM4YBZTX> .

sangaman · 2018-10-31T15:23:51Z

Thinking about this some more, your node would likewise see that pings are failing and would disconnect from the peer even if it is technically you that are having connectivity issues. So I'm thinking we wouldn't receive a DISCONNECTING packet in either case.

My question is, how are we going to tell the difference between ourselves going temporarily offline and a peer going offline? We can also attempt pings to google.com or something , but that wouldn't always work if we are on a LAN or something.

If we detect that we've lost connectivity and then regained it, we can just attempt to reconnect to all nodes like we do on startup.

Another approach would be to just apply the same retry logic we use during initial connection to a node anytime we disconnect to a node due to loss of connectivity and packets timing out.

offerm · 2018-10-31T15:47:26Z

Which disconnect packet do you expect to received? When you disconnect it is final. No need to look for the reason. Just attempt to reconnect with the peer (if you created the connection or if you know how to connect it (not using the temporaty port). Just as LND is doing :-)

…

On Wed, Oct 31, 2018 at 5:29 PM Daniel McNally ***@***.***> wrote: Thinking about this some more, your node would likewise see that pings are failing and would disconnect from the peer even if it is technically you that are having connectivity issues. So I'm thinking we wouldn't receive a DISCONNECTING packet in either case. My question is, how are we going to tell the difference between ourselves going temporarily offline and a peer going offline? We can also attempt pings to google.com or something , but that wouldn't always work if we are on a LAN or something. If we detect that we've lost connectivity and then regained it, we can just attempt to reconnect to all nodes like we do on startup. Another approach would be to just apply the same retry logic we use during initial connection to a node anytime we disconnect to a node due to loss of connectivity and packets timing out. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#616 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJQ0cgXtNCCqhoM0pnWqfevWixGkQMymks5uqcG_gaJpZM4YBZTX> .

sangaman · 2018-10-31T15:59:06Z

We don't have a disconnection packet currently, it's just an issue/idea. The reason is if the peer intentionally disconnected with you, it's not really related to this topic.

I'd be fine with automatically attempting to reconnect if we outgoing connection information, seems like the simplest solution.

moshababo · 2018-11-01T08:18:18Z

@offerm
according to you’re suggestion, if i’m getting a socket close event, i’ll just try to reconnect. it means that the peer cannot chose to disconnect from me, because every time he tries to, i’m reconnecting.

@sangaman
If we can't detect our connectivity, we can't detect if packets are timing out due to stalling, and we can only assume it's due to loss of connectivity.

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

correct, and in this case you'll probably get the PING/DISCONNECTING packets & socket close event when you’ll be back online. but i'm not sure we can rely on that.

offerm · 2018-11-01T10:58:26Z

@moshababo
For socket close (TCP/IP), and disconnect due to timeouts and missing pings, we for sure should try to reconnect.
if you want a peer to "reject a connection" that should be done by a dedicated packet on the XUD level.
If a peer is just stopping, we should try to reconnect until it is back up.

Only time when we are not trying to reconnect is after we got this special purpose packet that should basically block the peer. That packet should not be used by the peer when the peer is going down.

We should also use that packet during handshake it the peer is banned.

kilrau · 2018-11-02T15:31:39Z

If a peer is just stopping, we should try to reconnect until it is back up agree with the same intervals from Connection retry logic of an momentarily unavailable peer #311.
Also agree on the special purpose packet.

Anything that speaks against this? @moshababo @sangaman

moshababo · 2018-11-03T09:56:44Z

special purpose packet

@offerm Is there something in what you’r suggesting which differs from what’s specified in #152, besides the latter being more general-purpose?

So the idea is:

if peer is stalling responses, xud will send the p2p packet: disconnecting #152 packet (for stalling), close the socket, and immediately attempt to reconnect (with all the retry logic from Connection retry logic of an momentarily unavailable peer #311). It means that we're trying to reconnect after we actively closed the socket - a bit weird, but we can live with that
If peer sent the p2p packet: disconnecting #152 packet and then closed the socket, attempt reconnect only on specific cases (like for stalling, but not for being banned)
send the p2p packet: disconnecting #152 packet instead of HELLO if peer is banned, before closing the socket
if we're getting the p2p packet: disconnecting #152 packet for being banned instead of HELLO, we should ban the peer as well (at least for outgoing connections)
if peer closed the socket without sending the p2p packet: disconnecting #152 packet, after sending HELLO, attempt to reconnect
if peer closed the socket without sending the p2p packet: disconnecting #152 packet or HELLO, don't attempt to reconnect (peer doesn't follow the protocol), but don't ban (he might recover)

Do you agree? @kilrau @offerm @sangaman

offerm · 2018-11-05T16:18:26Z

I don't see a need for stalling packet. Just close the connection and retry.

I would try to reconnect also after the last point. The time between reconnect attempt should reset only after successful handshake.

I wouldn't implement the #152 packet until the situation is stable.

kilrau · 2018-11-05T18:03:26Z

Let's assume we are still in @offerm's scenario from above. His laptop connected to one of our cloud servers. Wifi issues, connection breaks on tcp level. When he comes back online, I am having a hard time to imagine how he even gets a #152 packet from his peers if the connection was closed on tcp level, we cannot rely on it to still be open as you wrote above. The stalling packet makes sense for some advanced use cases like "I'm going to ban you know because of reason XYZ" as described in #152 but not here I think. So I agree with @offerm to wait with #152 .

In short:
If I go offline, come back online and notice my sockets are closed, I'll initiate the reconnect from my side to all my previous peers.

if peer closed the socket without sending the #152 packet or HELLO, don't attempt to reconnect (peer doesn't follow the protocol), but don't ban (he might recover)

That's exactly the situation from a peers point of view, when I loose my wifi connection. It's perfectly fine to do nothing here because as described above I will initiate the reconnect which is much more efficient.

moshababo · 2018-11-06T14:46:36Z

@offerm

I would try to reconnect also after the last point. The time between reconnect attempt should reset only after successful handshake.

can you clarify this?

--

My assumption is that we don't know our network connectivity status. so if the socket got closed, I don't know whether it's because I was offline and stalling. If it's because I got banned, and i'll try to reconnect, it's a problem because we'll get a feedback loop.

So without #152 packet, I see it reasonable to implement only the first bullet from my suggestion:

if peer is stalling responses, ~~xud will send the #152 packet (for stalling)~~, close the socket, and immediately attempt to reconnect (with all the retry logic from #311). It means that we're trying to reconnect after we actively closed the socket - a bit weird, but we can live with that

I think this would be fine as we are expected to inspect the peer stalling responses before the TCP socket will timeout. After closing the socket, and reconnecting, we might still be offline, but connection retries will trigger as well.

kilrau · 2018-11-07T11:15:44Z

I agree on the necessity for #152 to implement this properly. Since handling TCP layer connectivity issues isn't focus currently, I suggest to do dependency #152 first and then pick up this one in the next milestone. In the meanwhile we try to find a consensus on what we expect from the implementation of this issue.

Summary by @moshababo with some fixes:

if peer is stalling responses, xud will send the p2p packet: disconnecting #152 packet (for stalling), close the socket and don't try to reconnect. Peer will try to reconnect (next two points).
xud comes back online, receives the p2p packet: disconnecting #152 packet (socket was still working) and then attempts to reconnect in specific cases (like for stalling, but not for being banned)
xud comes back online, didn't receive p2p packet: disconnecting #152 from peer (e.g. socket not working) = doesn't know why peer closed the connection. xud tries to reconnect with logic from Connection retry logic of an momentarily unavailable peer #311
xud sends the p2p packet: disconnecting #152 packet instead of HELLO if peer which is trying to connect is banned, then close the socket

EDIT: some of the more complicated #152 behavior moved to new issue #693

moshababo · 2018-11-07T11:58:21Z

So the basic thing is that if the socket got closed but we didn't receive #152 - we reconnect (assume network issue). Otherwise if we did received the #152, we reconnect on specific cases.

Sounds fine to me.

moshababo · 2018-11-07T12:03:58Z

For banning a node solely for outgoing connections, we can simply delete the node from our repository. But if we want to keep it there for historical data, we need to add a new banning mode. We can look into it before implementing though.

If a peer got disconnected from us post-handshake due to stalling or without specifying any reason, reconnect to him if we have a listening address for him. Closes #616. Closes #698.

kilrau added this to the 1.0.0-alpha.2 milestone Oct 30, 2018

kilrau added p2p Peer to peer networking P1 top priority labels Oct 30, 2018

kilrau modified the milestones: 1.0.0-alpha.2, 1.0.0-alpha.3 Oct 30, 2018

kilrau assigned moshababo Oct 30, 2018

kilrau modified the milestones: 1.0.0-alpha.4, 1.0.0-alpha.3 Oct 30, 2018

kilrau modified the milestones: 1.0.0-alpha.3, 1.0.0-alpha.4 Nov 5, 2018

kilrau mentioned this issue Nov 7, 2018

Advertise version #154

Closed

kilrau removed this from the 1.0.0-alpha.4 milestone Nov 7, 2018

kilrau added this to the 1.0.0-alpha.5 milestone Nov 7, 2018

kilrau mentioned this issue Nov 7, 2018

Ban by IP address (fail2ban) #458

Open

moshababo mentioned this issue Nov 14, 2018

feat: disconnecting packet #658

Merged

kilrau modified the milestones: 1.0.0-alpha.5, 1.0.0-alpha.6 Nov 18, 2018

kilrau mentioned this issue Nov 20, 2018

Implement behavior for #152 disconnect package "banned" #693

Open

moshababo mentioned this issue Nov 20, 2018

feat(p2p): reconnect to peer after disconnection #695

Merged

ghost added the in progress label Nov 20, 2018

offerm mentioned this issue Dec 2, 2018

fix(rpc): close streaming calls on shutdown #717

Merged

sangaman closed this as completed in #695 Dec 4, 2018

ghost removed the in progress label Dec 4, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

not connected to peer after disconnect #616

not connected to peer after disconnect #616

offerm commented Oct 30, 2018

kilrau commented Oct 30, 2018

offerm commented Oct 30, 2018 via email

kilrau commented Oct 30, 2018

moshababo commented Oct 31, 2018

sangaman commented Oct 31, 2018

offerm commented Oct 31, 2018 via email

sangaman commented Oct 31, 2018

offerm commented Oct 31, 2018 via email

sangaman commented Oct 31, 2018

offerm commented Oct 31, 2018 via email

sangaman commented Oct 31, 2018

moshababo commented Nov 1, 2018

offerm commented Nov 1, 2018

kilrau commented Nov 2, 2018

moshababo commented Nov 3, 2018

offerm commented Nov 5, 2018

kilrau commented Nov 5, 2018 •

edited

Loading

moshababo commented Nov 6, 2018

kilrau commented Nov 7, 2018 •

edited

Loading

moshababo commented Nov 7, 2018

moshababo commented Nov 7, 2018

not connected to peer after disconnect #616

not connected to peer after disconnect #616

Comments

offerm commented Oct 30, 2018

kilrau commented Oct 30, 2018

offerm commented Oct 30, 2018 via email

kilrau commented Oct 30, 2018

moshababo commented Oct 31, 2018

sangaman commented Oct 31, 2018

offerm commented Oct 31, 2018 via email

sangaman commented Oct 31, 2018

offerm commented Oct 31, 2018 via email

sangaman commented Oct 31, 2018

offerm commented Oct 31, 2018 via email

sangaman commented Oct 31, 2018

moshababo commented Nov 1, 2018

offerm commented Nov 1, 2018

kilrau commented Nov 2, 2018

moshababo commented Nov 3, 2018

offerm commented Nov 5, 2018

kilrau commented Nov 5, 2018 • edited Loading

moshababo commented Nov 6, 2018

kilrau commented Nov 7, 2018 • edited Loading

moshababo commented Nov 7, 2018

moshababo commented Nov 7, 2018

kilrau commented Nov 5, 2018 •

edited

Loading

kilrau commented Nov 7, 2018 •

edited

Loading