connection retries #392
Merged

sangaman merged 1 commit into master from conn_retries on Sep 8, 2018

Conversation

moshababo
Collaborator

Solves #311

  • retries are blocking initConnection (a rough sketch of this flow follows below)
  • retries are not activated when manually trying to connect to a peer. If they were, the user wouldn't get a response for a very long time upon failure, since the retries are blocking.
  • if the max retries period is exceeded, and the peer was previously connected (exists in the nodes table),
    the peer is supposed to get banned. Banning doesn't currently work for pre-handshake peers, but that can be fixed. I'm not sure banning is the right solution, because in the future it might also block inbound connections. Deleting or soft-deleting the node entry might not be the right thing either. Perhaps penalizing, which we haven't implemented yet.
    Anyway, that will hardly ever happen, because it triggers only after 1 week, and only for previously connected peers.
    The main thing now is to verify that retrying over an entire week for every node we hear about is reasonable. I think it's too much for nodes we never successfully connected with.
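
For context, here is a rough, simplified sketch of the blocking retry flow described above. The constant names mirror the ones in this PR's diff, but the helper and overall structure are assumptions for illustration, not the actual Peer.initConnection code:

```ts
// Simplified, hypothetical sketch of a blocking connect-with-retries flow.
// Values match the constants discussed in this PR; tryConnectOnce is a placeholder.
const CONNECTION_RETRIES_MIN_DELAY = 5000;        // 5 seconds
const CONNECTION_RETRIES_MAX_DELAY = 3600000;     // 1 hour
const CONNECTION_RETRIES_MAX_PERIOD = 604800000;  // 1 week

const tryConnectOnce = async (host: string, port: number): Promise<void> => {
  // Placeholder for a single socket connection attempt that rejects on error or timeout.
  throw new Error(`could not connect to ${host}:${port}`);
};

const connectWithRetries = async (host: string, port: number): Promise<void> => {
  const startTime = Date.now();
  let retryDelay = CONNECTION_RETRIES_MIN_DELAY;

  while (true) {
    try {
      await tryConnectOnce(host, port);
      return; // connected
    } catch (err) {
      // Give up once the total retry budget (one week) is exhausted.
      if (Date.now() - startTime + retryDelay > CONNECTION_RETRIES_MAX_PERIOD) {
        throw new Error('connection retries max period exceeded');
      }
      // The caller is blocked until the next attempt, which is why retries block initConnection.
      await new Promise<void>(resolve => setTimeout(resolve, retryDelay));
      // Exponential backoff, capped at the max delay.
      retryDelay = Math.min(retryDelay * 2, CONNECTION_RETRIES_MAX_DELAY);
    }
  }
};
```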

@ghost ghost assigned moshababo Aug 23, 2018
@ghost ghost added the in progress label Aug 23, 2018
@sangaman
Collaborator

if the max retries period is exceeded, and the peer was previously connected (exists in the nodes table),
the peer is supposed to get banned. Banning doesn't currently work for pre-handshake peers, but that can be fixed. I'm not sure banning is the right solution, because in the future it might also block inbound connections.

I definitely don't think we want to ban. I like the idea of a soft delete. Maybe we add another boolean column to the nodes table, and don't try connecting to them again unless it's done manually or we hear about them again from a NODES packet from another peer. If we're able to successfully reconnect, then we undelete them. Another option for a soft delete is to delete their known listening addresses. I figure this matter can be dealt with in another PR though.
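
To make the soft-delete idea concrete, here is a hypothetical sketch of the flag and the decision logic it would drive; the field and function names are assumptions for illustration, not existing xud code:

```ts
// Hypothetical sketch of the soft-delete idea: a boolean flag on the node record
// that is set when retries are exhausted and cleared when we reconnect or hear
// about the node again. Field and function names are assumptions, not xud code.
type NodeRecord = {
  nodePubKey: string;
  addresses: { host: string, port: number }[];
  softDeleted: boolean;
};

const shouldAttemptOutbound = (node: NodeRecord, manual: boolean, heardFromPeer: boolean): boolean => {
  // Skip soft-deleted nodes unless the user asks explicitly or a peer re-advertises them.
  return !node.softDeleted || manual || heardFromPeer;
};

const markRetriesExhausted = (node: NodeRecord): void => {
  node.softDeleted = true;
};

const markReconnected = (node: NodeRecord): void => {
  // "Undelete" on a successful reconnection.
  node.softDeleted = false;
};
```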

Also, I agree that we don't need to retry connecting to nodes that we've never connected with. I think one try is enough. If it's a manual connection, failing lets the user know to retry once the issue blocking the connection is resolved. If it's automatic, like a node we learned about from a peer, #386 makes it so that we should only be learning about online nodes from peers anyway.

I see retrying as valuable when attempting to reconnect to a peer who has temporarily gone down, or whom we've connected to successfully in the past but was offline at the time xud started.

@sangaman sangaman left a comment (Collaborator)

Looks mostly good. Have you tested this yet?

Another thing that occurred to me while I was going through this: how do we handle the case where we have a timer set to retry a connection to a node, and in the meantime that node successfully connects to us? Maybe we need to make some check before retrying a connection to ensure we're not already connected?

Also if I follow correctly, Pool will contain Peers that we're not currently connected to, but instead are only pending reconnection attempts. I think that may affect some other logic that treats all the peers in Pool as actively connected. Which makes me also realize that it's possible we'd have some pre-handshake peers in Pool and we should check for those before doing things like sending them orders. Maybe I will address that in a separate PR, by checking that we're in post handshake state with peers before we attempt to send them packets.

lib/p2p/Peer.ts Outdated
/** Connection retries min delay. */
private static CONNECTION_RETRIES_MIN_DELAY = 5000;
/** Connection retries max delay. */
private static CONNECTION_RETIES_MAX_DELAY = 3600000;
Collaborator

RETRIES typo.

lib/p2p/Peer.ts Outdated
/** Connection retries max delay. */
private static CONNECTION_RETIES_MAX_DELAY = 3600000;
/** Connection retries max period. */
private static CONNECTION_RETIES_MAX_PERIOD = 604800000;
Collaborator

RETRIES typo.

lib/p2p/Peer.ts Outdated
@@ -239,6 +245,10 @@ class Peer extends EventEmitter {
}

return new Promise((resolve, reject) => {
const startTime = Date.now();
let retryDelay: number | undefined;
Collaborator

How about we initialize retryDelay by setting it to CONNECTION_RETRIES_MIN_DELAY, and then move the line that doubles the retry delay to after the retryDelay > Peer.CONNECTION_RETIES_MAX_PERIOD check. It feels a bit cleaner that way, and saves a bunch of null checks on retryDelay.
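
A minimal sketch of the ordering being suggested, assuming the constant names from the diff; the scheduleRetry helper is illustrative rather than the actual PR code:

```ts
// Illustrative sketch of the suggested ordering: retryDelay starts at the minimum,
// the max-period check runs first, and the doubling happens only after that check,
// so no undefined/null handling of retryDelay is needed.
const CONNECTION_RETRIES_MIN_DELAY = 5000;
const CONNECTION_RETRIES_MAX_DELAY = 3600000;
const CONNECTION_RETRIES_MAX_PERIOD = 604800000;

const startTime = Date.now();
let retryDelay = CONNECTION_RETRIES_MIN_DELAY; // initialized up front

const scheduleRetry = (retryConnection: () => void, reject: (err: Error) => void) => {
  // Check the total retry budget first...
  if (Date.now() - startTime + retryDelay > CONNECTION_RETRIES_MAX_PERIOD) {
    reject(new Error('connection retries max period exceeded'));
    return;
  }
  // ...then schedule the next attempt and double the delay afterwards.
  setTimeout(retryConnection, retryDelay);
  retryDelay = Math.min(retryDelay * 2, CONNECTION_RETRIES_MAX_DELAY);
};
```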

Collaborator Author

That's what I did originally, but it forced me to update the next value in the timeout function, which I found to be less optimal. I can change it back.

Collaborator

What do you mean it forced you to update the next value?

Collaborator Author

I changed it, so you can look now. Nothing serious.

lib/p2p/Peer.ts Outdated
: Math.min(Peer.CONNECTION_RETIES_MAX_DELAY, retryDelay * 2);

if (Date.now() - startTime + retryDelay > Peer.CONNECTION_RETIES_MAX_PERIOD) {
this.close();
Collaborator

Here too I don't think we need to close because we're not connected in the first place.

Collaborator Author

See my comment above.

Before my recent changes, this.close() was called even if the connection wasn't established yet. I changed that, and that's why I'm calling it here as well.

lib/p2p/Peer.ts Outdated
}, retryDelay);
};

const listen = () => {
Collaborator

"listen" seems a bit inaccurate here, it makes it sound like we're listening for incoming connections. Maybe "bind"?

lib/p2p/Peer.ts Outdated
return;
}

this.logger.info(
Collaborator

I think debug or perhaps verbose would be better here, this feels like we'd have a lot of messages in the logs that aren't particularly useful.

@moshababo
Collaborator Author

have you tested this yet?

I tested it manually. I'd like to add some automatic tests as well, after we settle on everything.

Another thing that occurred to me while I was going through this: how do we handle the case where we have a timer set to retry a connection to a node, and in the meantime that node successfully connects to us? Maybe we need to make some check before retrying a connection to ensure we're not already connected?

It's not handled at the moment. If the inbound node suddenly becomes available to us for outbound as well, then a duplicate connection is supposed to be detected. If it's not available to us, it will continue to retry. I can add some checks to avoid that; hopefully it won't become too verbose.

Also if I follow correctly, Pool will contain Peers that we're not currently connected to, but instead are only pending reconnection attempts. I think that may affect some other logic that treats all the peers in Pool as actively connected. Which makes me also realize that it's possible we'd have some pre-handshake peers in Pool and we should check for those before doing things like sending them orders. Maybe I will address that in a separate PR, by checking that we're in post handshake state with peers before we attempt to send them packets.

The peer object will exist, but it's not accessible and not on PeerList, because peers are added there only after they are "opened", while initConnection with its retries is blocking that.

@sangaman
Collaborator

It's not handled at the moment. If the inbound node suddenly becomes available to us for outbound as well, then a duplicate connection is supposed to be detected. If it's not available to us, it will continue to retry. I can add some checks to avoid that; hopefully it won't become too verbose.

Right, I think currently it would get to the point of attempting a handshake and then fail, not a huge deal. It would be nice to not even attempt the outgoing connection if we're already connected, but as you said if it's too complicated or verbose then don't worry about it. Nothing obvious jumps to mind since the Peer object isn't aware of other peers and connections.

The peer object will exist, but it's not accessible and not on PeerList, because peers are added there only after they are "opened", while initConnection with its retries is blocking that.

Yes you are right, I'd overlooked that. Never mind.

lib/p2p/Pool.ts Outdated
@@ -155,7 +155,7 @@ class Pool extends EventEmitter {
private connectNode = async ({ addresses, nodePubKey }: NodeConnectionInfo) => {
for (let n = 0; n < addresses.length; n += 1) {
try {
await this.addOutbound(addresses[n], nodePubKey);
await this.addOutbound(addresses[n], nodePubKey, true);
Collaborator

One more thought: it looks like we'd wait for the full timeout (currently a week, although I'm thinking we should probably shorten that) before we attempt the next address for a node that has multiple listening addresses. What do you think about making it so we retry each address once before delaying and retrying? That would definitely require a bit of refactoring, but I wanted to see what you thought.

Collaborator Author

Sounds like a good idea, and it shouldn't require serious refactoring: just one iteration without retries, and another one with.

Collaborator

OK, that should work, but then we'd basically only be retrying a single address from that point on, right?

Collaborator Author

Yes. I continued to discuss the option of doing the retries in parallel on the main comments thread.

Contributor

Answered in the main thread. In short: yes, retry logic PER listening address (otherwise this can take a very long time)

@moshababo
Collaborator Author

Right, I think currently it would get to the point of attempting a handshake and then fail, not a huge deal. It would be nice to not even attempt the outgoing connection if we're already connected, but as you said if it's too complicated or verbose then don't worry about it. Nothing obvious jumps to mind since the Peer object isn't aware of other peers and connections.

This indeed breaks the encapsulation, although technically it can be implemented simply by Pool passing Peer a callback. But I didn't want to do it now, since you already raised some similar issues on gitter, which might require a different solution.
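
A hypothetical sketch of that callback approach, where Pool hands Peer a predicate it can consult before retrying; all names and signatures here are assumptions for illustration only:

```ts
// Pool gives Peer a predicate so the retry loop can skip attempts when the
// node has already connected to us inbound in the meantime.
type IsConnected = (nodePubKey: string) => boolean;

class PeerSketch {
  constructor(private nodePubKey: string, private isAlreadyConnected: IsConnected) {}

  public retryConnection = (connect: () => Promise<void>): Promise<void> | undefined => {
    if (this.isAlreadyConnected(this.nodePubKey)) {
      // The node connected to us inbound while we were waiting; skip the outbound retry.
      return undefined;
    }
    return connect();
  }
}

// Usage idea: Pool constructs peers with a closure over its own peer map, e.g.
// const peer = new PeerSketch(pubKey, pk => pool.peers.has(pk));
```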

Other topics:

  1. don't retry connecting with a node which we never had a successful connection with
  2. when iterating on the node addresses, don't retry connecting until all addresses have been tried. If all failed, retry connecting on the first address
  3. soft-delete nodes which exceeded the max retry period by deleting their addresses list (although it might turn out to be empty also because the nodes failed to retrieve their external ip address)

@kilrau let's all agree on that before I go ahead and implement it

@sangaman
Collaborator

I'd agree with all 3. Retrying connections on all their addresses would be nice but not a big deal.

@moshababo
Collaborator Author

We can do retries on multiple addresses in parallel. Verifying that the pubkey wasn't already connected before each retry shouldn't be a big deal.

@sangaman
Collaborator

OK, if you think you can get parallel connection attempts working that would be a nice bonus, but it's not necessary for now. You could also approach that in a different PR if you'd like.

@kilrau kilrau mentioned this pull request Aug 27, 2018
@sangaman
Collaborator

I'm actually thinking that maybe we want to keep the connection attempts serial, so that we don't have to deal with multiple attempts succeeding at once and thereby creating multiple connections to the same node. I brought up that issue in #398.

@moshababo Are you planning to make any more changes to this PR or should I review it again as is?

@moshababo
Collaborator Author

I don't see a problem with it, if we manage the pending peers correctly. But we can indeed avoid this complexity for now.

There's nothing major to review yet. I was waiting for more feedback on my comment. I'll go ahead and implement it anyway tomorrow morning.

@sangaman
Collaborator

OK, here are my thoughts after reading/thinking some more.

don't retry connecting with a node which we never had a successful connection with

Yes, and in fact I think ultimately we may want to only retry connections to the last address we've successfully connected to.

when iterating on the node addresses, don't retry connecting until all addresses have been tried. If all failed, retry connecting on the first address

Yes, and like above, I think maybe the "first address" in this case should be the last successful address we've connected to.

soft-delete nodes which exceeded the max retry period by deleting their addresses list (although it might turn out to be empty also because the nodes failed to retrieve their external ip address)

Yes, and maybe we can delete particular addresses that have failed from the list.

@kilrau
Contributor

kilrau commented Aug 27, 2018

  1. don't retry connecting with a node which we never had a successful connection with

Agree, once we've gone through the whole retry cycle and never managed to connect, we don't retry again (even after reboot). This node gets soft-deleted, which needs to be stored permanently. As described in #365 (comment) we should mark successful connections and only apply retry logic to those.

  2. when iterating on the node addresses, don't retry connecting until all addresses have been tried. If all failed, retry connecting on the first address

I'm actually thinking that maybe we want to keep the connection attempts serial, so that we don't have to deal with multiple attempts succeeding at once and thereby creating multiple connections to the same node. I brought up that issue in #398.

Ehm, not sure if I get you right, but I think we should try addresses in parallel, otherwise this procedure will take ages if a peer tells me 10 listeningAddresses...

t1: try listeningAddress1, wait for delay x, try listeningAddress1, wait for delay x*2
t2: try listeningAddress2, wait for delay x, try listeningAddress2, wait for delay x*2

we can just make peers in pendingPeers be pubKey@IP:port, see #398 (comment)

EDIT: just saw your comment @sangaman. I think I'm not getting this one right. Could you please clarify?

  3. soft-delete nodes which exceeded the max retry period by deleting their addresses list (although it might turn out to be empty also because the nodes failed to retrieve their external ip address)

Agree, soft-deleting makes sense to me now, and also storing 1. permanently. Not sure if deleting the listening addresses is the way to do it though; you decide on that.

@sangaman
Collaborator

Ehm, not sure if I get you right, but I think we should try addresses in parallel, otherwise this procedure will take ages if a peer tells me 10 listeningAddresses...

t1: try listeningAddress1, wait for delay x, try listeningAddress1, wait for delay x*2
t2: try listeningAddress2, wait for delay x, try listeningAddress2, wait for delay x*2

I'm thinking we try each address only once with no delay and no retry, one at a time, and if they all fail, start the wait-retry routine only with the first address (or preferably last successful address).
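
A rough sketch of the flow being converged on here, with illustrative names and types (not the actual Pool.connectNode code): one serial pass over all advertised addresses with no retries, and only if every attempt fails do we start the wait-and-retry routine on a single address.

```ts
// Illustrative sketch: try every advertised address once, then retry with backoff
// only on the last successfully connected address (if we ever connected before).
type Address = { host: string, port: number, lastConnected?: number };

const connectNodeSketch = async (
  addresses: Address[],
  tryOnce: (addr: Address) => Promise<void>,
  retryWithBackoff: (addr: Address) => Promise<void>,
): Promise<void> => {
  for (const address of addresses) {
    try {
      await tryOnce(address); // single attempt, no delay, no retries
      return;
    } catch (err) {
      // ignore and move on to the next advertised address
    }
  }
  // All addresses failed; fall back to the wait-and-retry routine on the address
  // we most recently connected to successfully, if there is one.
  const sorted = [...addresses].sort((a, b) => (b.lastConnected || 0) - (a.lastConnected || 0));
  if (sorted.length && sorted[0].lastConnected) {
    await retryWithBackoff(sorted[0]);
  }
};
```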

@kilrau
Contributor

kilrau commented Aug 27, 2018

... and just don't care about all other listeningAddresses? Or will we try them at some point? If yes, when?

@sangaman
Collaborator

We try them all once up front, and only continue retrying with a single one.

@moshababo
Collaborator Author

moshababo commented Aug 28, 2018

If we want to designate the "first address" to be the one we had a successful connection with, it doesn't work well with deleting it upon connection retry failure.

Doing retries on different addresses in a serial way is obviously infeasible. So either we do it in parallel, or we try only one address.
If the latter, then either we don't delete upon failure, or we first fix the "first address" problem (by adding another property to the address object, and maybe also moving the array to a different table).
If we want to do it in parallel, then the same applies, but it also affects #398, so we'd better finalize that one first.

I pushed some WIP stuff, basically just implementing suggestions 1 & 2, relying on the first address being the one we had a successful connection with (which might not be the case).

Let's talk about this in our weekly dev chat.

@kilrau
Contributor

kilrau commented Aug 28, 2018

As agreed, only retry on the latest successful address. @moshababo

@sangaman
Collaborator

One consideration for adding another property to the address object is that we're updating the addresses in the nodes table each time we get a HelloPacket. A peer could technically remove and add the same address from one handshake to the next, or we could connect to a peer using an address that it doesn't advertise in the handshake. The advantage of tracking the last connected address (or addresses) separately in the database would be to separate it from handshake data.

@moshababo
Collaborator Author

moshababo commented Aug 30, 2018

@sangaman I missed seeing your comment before implementing it. I encountered what you described, and solved it by not overriding existing addresses when they are updated from the handshake state (add/remove still works). It doesn't solve the problem of xud being able to connect to a peer using an address which wasn't advertised by the peer, but I don't think that's a problem, because we're not re-using this address anyway at the moment.

We can start separating it as you suggested, and this will also give us the ability to re-use this address on future bootstrapping. But I'm not sure that wouldn't be overkill for now. Let me know what you think.

Other stuff:

  • retrying only on bootstrap nodes, only on latest connected address
  • if the max retry period is exceeded, the address is removed
  • Pool.addOutbound changed to start throwing errors. The string-based interface wasn't a good design, and I wanted to change it before, but now I had to, mostly for Pool.connectNode. This required me to refactor the p2p sanity tests. For gRPC access, the error is caught by GrpcService, so it doesn't crash.
  • continued to apply the tryX naming convention, for functions which swallow errors

@kilrau kilrau added this to the 1.0.0-prealpha.3 milestone Sep 5, 2018
@kilrau
Contributor

kilrau commented Sep 5, 2018

"Reconnect to last successful address of a peer first" in a separate issue, so taking it out of this one #442

Other stuff:

  • retrying only on bootstrap nodes, only on latest connected address
  • if the max retry period is exceeded, the address is removed
  • Pool.addOutbound changed to start throwing errors. The string-based interface wasn't a good design, and I wanted to change it before, but now I had to, mostly for Pool.connectNode. This required me to refactor the p2p sanity tests. For gRPC access, the error is caught by GrpcService, so it doesn't crash.
  • continued to apply the tryX naming convention, for functions which swallow errors

Sounds all right to me. Are these questions?

@moshababo
Collaborator Author

"Reconnect to last successful address of a peer first" in a separate issue, so taking it out of this one #442

I assumed you referred here to the initial connection sequence (iterating through all addresses), not to the connection retries.
We already had the retries working only on the last successful address, so not much was needed to support what you suggested, and I already added it here.

The other stuff I wrote describes what I already implemented.

@sangaman sangaman left a comment (Collaborator)

Sorry for not doing another review sooner, I didn't realize it was ready. Comments below, nothing major but a few things are probably worth changing.

node.addresses = addresses;
// avoid overriding the `lastConnected` field for existing matching addresses unless a new value was set
node.addresses = addresses.map((newAddress) => {
const oldAddress = node.addresses.find(address => address.host === newAddress.host && address.port === newAddress.port);
Collaborator

Might be nice to have an equality method in addressUtils for this logic since we're using it in a couple of places.
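
For illustration, the equality helper could look roughly like this; addressUtils exists in xud, but this particular function and its signature are assumptions:

```ts
// Sketch of an address-equality helper for reuse wherever host/port pairs are compared.
type Address = { host: string, port: number, lastConnected?: number };

const addressesAreEqual = (a: Address, b: Address): boolean => {
  return a.host === b.host && a.port === b.port;
};

// The snippet above could then read:
// const oldAddress = node.addresses.find(address => addressesAreEqual(address, newAddress));
```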

lib/p2p/Peer.ts Outdated
@@ -234,17 +242,22 @@ class Peer extends EventEmitter {
* Ensure we are connected (for inbound connections) or listen for the `connect` socket event (for outbound connections)
* and set the [[connectTime]] timestamp. If an outbound connection attempt errors or times out, throw an error.
*/
private initConnection = async () => {
private initConnection = (retry?: boolean): Promise<void> => {
Collaborator

Making this function async and not returning an explicit Promise.resolve() below was an intentional change that @michael1011 and I made in a separate PR. I think the code is a bit cleaner that way. If you agree, line 252 would just return and 255 would return a Promise<void>.

Another option is to keep the method signature as is, and move the if (this.connected) logic into the new Promise logic, and then resolve() and return in there.
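
A minimal standalone illustration of the two options described above; the connected flag and attempt callbacks are stand-ins for the real Peer internals, not the actual code:

```ts
// Option 1: keep the method async and simply return early when already connected;
// the Promise<void> is implicit.
const initConnectionAsync = async (connected: boolean, attempt: () => Promise<void>): Promise<void> => {
  if (connected) {
    return;
  }
  return attempt();
};

// Option 2: keep the explicit Promise signature and move the connected check
// inside the executor, resolving immediately when no attempt is needed.
const initConnectionExplicit = (
  connected: boolean,
  attempt: (resolve: () => void, reject: (err: Error) => void) => void,
): Promise<void> => {
  return new Promise<void>((resolve, reject) => {
    if (connected) {
      resolve();
      return;
    }
    attempt(resolve, reject);
  });
};
```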

Collaborator Author

I overrode this method signature when I resolved conflicts. Will return it to how it was on master.

lib/p2p/Peer.ts Outdated

this.connectTimeout = setTimeout(onTimeout, Peer.CONNECTION_TIMEOUT);
const bind = () => {
Collaborator

This would be a bit easier to read if bind were declared before onError which uses the bind method, so just move this block of code up a bit.

Collaborator Author

But then we'll have the same problem, because bind uses onError ;)

lib/p2p/Pool.ts Outdated
@@ -11,6 +11,7 @@ import { OutgoingOrder, OrderIdentifier, StampedPeerOrder } from '../types/order
import { Models } from '../db/DB';
import Logger from '../Logger';
import { HandshakeState, Address, NodeConnectionInfo } from '../types/p2p';
import { NodeInstance } from '../types/db';
Collaborator

I don't think NodeInstance is actually needed in this file. I pulled down the branch and removed all references and it compiled fine. Is there another reason for this, or can it just be removed?

Collaborator Author

I used it to make it clear that connectNodes can also get a db node entry. I ended up not doing anything with it, but still left this definition. I'll remove it.

if (retryConnecting && sortedAddresses.length && sortedAddresses[0].lastConnected) {
try {
await this.addOutbound(sortedAddresses[0], nodePubKey, true);
} catch (err) {}
Collaborator

Same here for logging.

lib/p2p/Pool.ts Outdated
@@ -331,7 +355,7 @@ class Pool extends EventEmitter {
}
case PacketType.NODES: {
const nodes = (packet as packets.NodesPacket).body!;
await this.connectNodes(nodes);
await this.connectNodes(nodes, false, false);
Collaborator

I'd prefer this line without explicitly setting the boolean parameters to false since that's their default, it's just less verbose.

lib/p2p/Pool.ts Outdated
// if connection is inbound, then the port is discarded. This might result in multiple addresses getting updated
const addresses = peer.addresses!.map((address) => {
if (peer.socketAddress.host === address.host &&
peer.inbound ? true : peer.socketAddress.port === address.port
Collaborator

I don't think we should be updating lastConnected on inbound connections. As you point out, it might update multiple addresses, but mainly the inbound connection doesn't actually verify that we're able to connect to their external address, which is what I'm thinking lastConnected should track.
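
A hypothetical sketch of the change being suggested, stamping lastConnected only for outbound connections; the types and field names here are illustrative, not the actual Pool code:

```ts
// Only mark an address as "last connected" when we actually dialed out to it,
// since an inbound connection doesn't prove the peer's advertised address is reachable.
type Address = { host: string, port: number, lastConnected?: number };
type PeerInfo = { inbound: boolean, socketAddress: Address, addresses: Address[] };

const stampLastConnected = (peer: PeerInfo): Address[] => {
  if (peer.inbound) {
    // Inbound: leave the advertised addresses untouched.
    return peer.addresses;
  }
  return peer.addresses.map((address) => {
    if (address.host === peer.socketAddress.host && address.port === peer.socketAddress.port) {
      return { ...address, lastConnected: Date.now() };
    }
    return address;
  });
};
```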

Collaborator Author

Makes sense, I'll change it.

@@ -29,6 +32,10 @@ const errors = {
message: `could not retrieve external IP: ${err.message}`,
code: errorCodes.EXTERNAL_IP_UNRETRIEVABLE,
}),
CONNECTING_RETRIES_MAX_PERIOD_EXCEEDED: {
message: `Connection retries max period exceeded`,
Collaborator

Maybe make this a bit more descriptive, like "Connection retry attempts to peer exceeded maximum time allotment"

});

after(async () => {
await Promise.all([nodeOne['db'].models.Node.truncate(), nodeTwo['db'].models.Node.truncate()]);
await Promise.all([nodeOne['shutdown'](), nodeTwo['shutdown']()]);
});

Collaborator

Can drop this extra blank line.

@moshababo moshababo requested a review from sangaman September 7, 2018 10:06
@sangaman
Collaborator

sangaman commented Sep 7, 2018

Looks like it just needs to resolve a small conflict and be squashed, then this is ready for merge. Great work.

@sangaman
Collaborator

sangaman commented Sep 8, 2018

To get this in for our release I squashed and resolved the conflict myself, merging.

@sangaman sangaman merged commit 1f61aae into master Sep 8, 2018
@ghost ghost removed the in progress label Sep 8, 2018
@sangaman sangaman deleted the conn_retries branch September 8, 2018 03:07
@sangaman sangaman mentioned this pull request Sep 8, 2018