Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

not connected to peer after disconnect #616

Closed
offerm opened this issue Oct 30, 2018 · 21 comments
Closed

not connected to peer after disconnect #616

offerm opened this issue Oct 30, 2018 · 21 comments
Assignees
Labels
P1 top priority p2p Peer to peer networking
Milestone

Comments

@offerm
Copy link
Contributor

offerm commented Oct 30, 2018

My node was up connected to 3 peers and have peer orders.

wifi down for 3 minutes, wifi back up.

peers are missing (no reconnect).

@kilrau kilrau added this to the 1.0.0-alpha.2 milestone Oct 30, 2018
@kilrau kilrau added p2p Peer to peer networking P1 top priority labels Oct 30, 2018
@kilrau
Copy link
Contributor

kilrau commented Oct 30, 2018

Maybe add logs @offerm

@offerm
Copy link
Contributor Author

offerm commented Oct 30, 2018 via email

@kilrau kilrau modified the milestones: 1.0.0-alpha.2, 1.0.0-alpha.3 Oct 30, 2018
@kilrau
Copy link
Contributor

kilrau commented Oct 30, 2018

Hmm shouldn't #392 (#311) catch it where we defined a connection retry logic for a momentarily unavailable peer?

@moshababo
Copy link
Collaborator

@kilrau #392 is for the initial connection retries, not after disconnection. We talked before about re-connecting after a disconnection, but decided not to do so.

This specific disconnection happens because peer is stalling ping/pong response. It doesn't make sense to define re-connection from the side which closed the connection. Instead we can check
network connectivity before closing peers for stalling responses.

But the other side might close the connection as well, and we won't know about it until we're back online. In this case re-connecting make sense. To implement this, we need to define specific cases for reconnecting after peer closed the connection, because in some cases it can be considered as DoS.

To do this, we first need to implement #152.
I expect that we'll get this packet when we'll be back online, but we need to verify it.

@sangaman
Copy link
Collaborator

Would the idea then be to attempt a reconnection after receiving a DISCONNECTING packet? And I'm guessing the packet should specify that the reason for the disconnection is because our node became unresponsive?

@offerm
Copy link
Contributor Author

offerm commented Oct 31, 2018 via email

@sangaman
Copy link
Collaborator

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

@offerm
Copy link
Contributor Author

offerm commented Oct 31, 2018 via email

@sangaman
Copy link
Collaborator

Thinking about this some more, your node would likewise see that pings are failing and would disconnect from the peer even if it is technically you that are having connectivity issues. So I'm thinking we wouldn't receive a DISCONNECTING packet in either case.

My question is, how are we going to tell the difference between ourselves going temporarily offline and a peer going offline? We can also attempt pings to google.com or something , but that wouldn't always work if we are on a LAN or something.

If we detect that we've lost connectivity and then regained it, we can just attempt to reconnect to all nodes like we do on startup.

Another approach would be to just apply the same retry logic we use during initial connection to a node anytime we disconnect to a node due to loss of connectivity and packets timing out.

@offerm
Copy link
Contributor Author

offerm commented Oct 31, 2018 via email

@sangaman
Copy link
Collaborator

We don't have a disconnection packet currently, it's just an issue/idea. The reason is if the peer intentionally disconnected with you, it's not really related to this topic.

I'd be fine with automatically attempting to reconnect if we outgoing connection information, seems like the simplest solution.

@moshababo
Copy link
Collaborator

@offerm
according to you’re suggestion, if i’m getting a socket close event, i’ll just try to reconnect. it means that the peer cannot chose to disconnect from me, because every time he tries to, i’m reconnecting.

@sangaman
If we can't detect our connectivity, we can't detect if packets are timing out due to stalling, and we can only assume it's due to loss of connectivity.

I'm not 100% sure but as I understand it the TCP socket will not necessarily close if you lose wifi connectivity temporarily, but your peers will see that you have become unresponsive and disconnect from you. So the actual socket connection would be closed by the peer in this case. Correct me if I'm wrong.

correct, and in this case you'll probably get the PING/DISCONNECTING packets & socket close event when you’ll be back online. but i'm not sure we can rely on that.

@offerm
Copy link
Contributor Author

offerm commented Nov 1, 2018

@moshababo
For socket close (TCP/IP), and disconnect due to timeouts and missing pings, we for sure should try to reconnect.
if you want a peer to "reject a connection" that should be done by a dedicated packet on the XUD level.
If a peer is just stopping, we should try to reconnect until it is back up.

Only time when we are not trying to reconnect is after we got this special purpose packet that should basically block the peer. That packet should not be used by the peer when the peer is going down.

We should also use that packet during handshake it the peer is banned.

@kilrau
Copy link
Contributor

kilrau commented Nov 2, 2018

Anything that speaks against this? @moshababo @sangaman

@moshababo
Copy link
Collaborator

special purpose packet

@offerm Is there something in what you’r suggesting which differs from what’s specified in #152, besides the latter being more general-purpose?

So the idea is:

Do you agree? @kilrau @offerm @sangaman

@offerm
Copy link
Contributor Author

offerm commented Nov 5, 2018

I don't see a need for stalling packet. Just close the connection and retry.

I would try to reconnect also after the last point. The time between reconnect attempt should reset only after successful handshake.

I wouldn't implement the #152 packet until the situation is stable.

@kilrau
Copy link
Contributor

kilrau commented Nov 5, 2018

Let's assume we are still in @offerm's scenario from above. His laptop connected to one of our cloud servers. Wifi issues, connection breaks on tcp level. When he comes back online, I am having a hard time to imagine how he even gets a #152 packet from his peers if the connection was closed on tcp level, we cannot rely on it to still be open as you wrote above. The stalling packet makes sense for some advanced use cases like "I'm going to ban you know because of reason XYZ" as described in #152 but not here I think. So I agree with @offerm to wait with #152 .

In short:
If I go offline, come back online and notice my sockets are closed, I'll initiate the reconnect from my side to all my previous peers.

if peer closed the socket without sending the #152 packet or HELLO, don't attempt to reconnect (peer doesn't follow the protocol), but don't ban (he might recover)

That's exactly the situation from a peers point of view, when I loose my wifi connection. It's perfectly fine to do nothing here because as described above I will initiate the reconnect which is much more efficient.

@kilrau kilrau modified the milestones: 1.0.0-alpha.3, 1.0.0-alpha.4 Nov 5, 2018
@moshababo
Copy link
Collaborator

@offerm

I would try to reconnect also after the last point. The time between reconnect attempt should reset only after successful handshake.

can you clarify this?

--

My assumption is that we don't know our network connectivity status. so if the socket got closed, I don't know whether it's because I was offline and stalling. If it's because I got banned, and i'll try to reconnect, it's a problem because we'll get a feedback loop.

So without #152 packet, I see it reasonable to implement only the first bullet from my suggestion:

if peer is stalling responses, xud will send the #152 packet (for stalling), close the socket, and immediately attempt to reconnect (with all the retry logic from #311). It means that we're trying to reconnect after we actively closed the socket - a bit weird, but we can live with that

I think this would be fine as we are expected to inspect the peer stalling responses before the TCP socket will timeout. After closing the socket, and reconnecting, we might still be offline, but connection retries will trigger as well.

@kilrau kilrau removed this from the 1.0.0-alpha.4 milestone Nov 7, 2018
@kilrau kilrau added this to the 1.0.0-alpha.5 milestone Nov 7, 2018
@kilrau
Copy link
Contributor

kilrau commented Nov 7, 2018

I agree on the necessity for #152 to implement this properly. Since handling TCP layer connectivity issues isn't focus currently, I suggest to do dependency #152 first and then pick up this one in the next milestone. In the meanwhile we try to find a consensus on what we expect from the implementation of this issue.

Summary by @moshababo with some fixes:

EDIT: some of the more complicated #152 behavior moved to new issue #693

@moshababo
Copy link
Collaborator

So the basic thing is that if the socket got closed but we didn't receive #152 - we reconnect (assume network issue). Otherwise if we did received the #152, we reconnect on specific cases.

Sounds fine to me.

@moshababo
Copy link
Collaborator

For banning a node solely for outgoing connections, we can simply delete the node from our repository. But if we want to keep it there for historical data, we need to add a new banning mode. We can look into it before implementing though.

@kilrau kilrau modified the milestones: 1.0.0-alpha.5, 1.0.0-alpha.6 Nov 18, 2018
@ghost ghost added the in progress label Nov 20, 2018
sangaman pushed a commit that referenced this issue Dec 4, 2018
If a peer got disconnected from us post-handshake due to stalling or
without specifying any reason, reconnect to him if we have a listening
address for him.

Closes #616. Closes #698.
@ghost ghost removed the in progress label Dec 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 top priority p2p Peer to peer networking
Projects
None yet
Development

No branches or pull requests

4 participants