connection retries #392
Merged

sangaman merged 1 commit into master from conn_retries on Sep 8, 2018

Conversation

moshababo
Collaborator

Solves #311

  • retries are blocking initConnection (a rough sketch of this flow follows below)
  • retries are not activated when manually trying to connect to a peer. If they were, the user wouldn't get a response for a very long time upon failure, since the retries are blocking.
  • if the max retries period is exceeded, and the peer was previously connected (exists in the nodes table),
    the peer is supposed to get banned. Banning doesn't currently work for pre-handshake peers, but that can be fixed. I'm not sure banning is the right solution, because in the future it might also block inbound connections. Deleting or soft-deleting the node entry might not be the right thing either. Perhaps penalizing, which we haven't implemented yet.
    Anyway, that will hardly ever happen, because it triggers only after 1 week, and only for previously connected peers.
    The main thing now is to verify that retrying over an entire week for every node we hear about is reasonable. I think it's too much for nodes we never successfully connected with.
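
For context, here is a rough, simplified sketch of the blocking retry flow described above. The constant names mirror the ones in this PR's diff, but the helper and overall structure are assumptions for illustration, not the actual Peer.initConnection code:

```ts
// Simplified, hypothetical sketch of a blocking connect-with-retries flow.
// Values match the constants discussed in this PR; tryConnectOnce is a placeholder.
const CONNECTION_RETRIES_MIN_DELAY = 5000;        // 5 seconds
const CONNECTION_RETRIES_MAX_DELAY = 3600000;     // 1 hour
const CONNECTION_RETRIES_MAX_PERIOD = 604800000;  // 1 week

const tryConnectOnce = async (host: string, port: number): Promise<void> => {
  // Placeholder for a single socket connection attempt that rejects on error or timeout.
  throw new Error(`could not connect to ${host}:${port}`);
};

const connectWithRetries = async (host: string, port: number): Promise<void> => {
  const startTime = Date.now();
  let retryDelay = CONNECTION_RETRIES_MIN_DELAY;

  while (true) {
    try {
      await tryConnectOnce(host, port);
      return; // connected
    } catch (err) {
      // Give up once the total retry budget (one week) is exhausted.
      if (Date.now() - startTime + retryDelay > CONNECTION_RETRIES_MAX_PERIOD) {
        throw new Error('connection retries max period exceeded');
      }
      // The caller is blocked until the next attempt, which is why retries block initConnection.
      await new Promise<void>(resolve => setTimeout(resolve, retryDelay));
      // Exponential backoff, capped at the max delay.
      retryDelay = Math.min(retryDelay * 2, CONNECTION_RETRIES_MAX_DELAY);
    }
  }
};
```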

@ghost ghost assigned moshababo Aug 23, 2018
@ghost ghost added the in progress label Aug 23, 2018
@sangaman
Collaborator

if the max retries period is exceeded, and the peer was previously connected (exists in the nodes table),
the peer is supposed to get banned. Banning doesn't currently work for pre-handshake peers, but that can be fixed. I'm not sure banning is the right solution, because in the future it might also block inbound connections.

I definitely don't think we want to ban. I like the idea of a soft delete. Maybe we add another boolean column to the nodes table, and don't try connecting to them again unless it's done manually or we hear about them again from a NODES packet from another peer. If we're able to successfully reconnect, then we undelete them. Another option for a soft delete is to delete their known listening addresses. I figure this matter can be dealt with in another PR though.
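
To make the soft-delete idea concrete, here is a hypothetical sketch of the flag and the decision logic it would drive; the field and function names are assumptions for illustration, not existing xud code:

```ts
// Hypothetical sketch of the soft-delete idea: a boolean flag on the node record
// that is set when retries are exhausted and cleared when we reconnect or hear
// about the node again. Field and function names are assumptions, not xud code.
type NodeRecord = {
  nodePubKey: string;
  addresses: { host: string, port: number }[];
  softDeleted: boolean;
};

const shouldAttemptOutbound = (node: NodeRecord, manual: boolean, heardFromPeer: boolean): boolean => {
  // Skip soft-deleted nodes unless the user asks explicitly or a peer re-advertises them.
  return !node.softDeleted || manual || heardFromPeer;
};

const markRetriesExhausted = (node: NodeRecord): void => {
  node.softDeleted = true;
};

const markReconnected = (node: NodeRecord): void => {
  // "Undelete" on a successful reconnection.
  node.softDeleted = false;
};
```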

Also, I agree that we don't need to retry connecting to nodes that we've never connected with. I think one try is enough. If it's a manual connection, failing lets the user know to retry once the issue blocking the connection is resolved. If it's automatic, like a node we learned about from a peer, #386 makes it so that we should only be learning about online nodes from peers anyway.

I see retrying as valuable when attempting to reconnect to a peer who has temporarily gone down, or whom we've connected to successfully in the past but was offline at the time xud started.

@sangaman sangaman left a comment (Collaborator)

Looks mostly good. Have you tested this yet?

Another thing that occurred to me while I was going through this: how do we handle the case where we have a timer set to retry a connection to a node, and in the meantime that node successfully connects to us? Maybe we need to make some check before retrying a connection to ensure we're not already connected?

Also if I follow correctly, Pool will contain Peers that we're not currently connected to, but instead are only pending reconnection attempts. I think that may affect some other logic that treats all the peers in Pool as actively connected. Which makes me also realize that it's possible we'd have some pre-handshake peers in Pool and we should check for those before doing things like sending them orders. Maybe I will address that in a separate PR, by checking that we're in post handshake state with peers before we attempt to send them packets.

lib/p2p/Peer.ts Outdated
/** Connection retries min delay. */
private static CONNECTION_RETRIES_MIN_DELAY = 5000;
/** Connection retries max delay. */
private static CONNECTION_RETIES_MAX_DELAY = 3600000;
Collaborator

RETRIES typo.

lib/p2p/Peer.ts Outdated
/** Connection retries max delay. */
private static CONNECTION_RETIES_MAX_DELAY = 3600000;
/** Connection retries max period. */
private static CONNECTION_RETIES_MAX_PERIOD = 604800000;
Collaborator

RETRIES typo.

lib/p2p/Peer.ts Outdated
@@ -239,6 +245,10 @@ class Peer extends EventEmitter {
}

return new Promise((resolve, reject) => {
const startTime = Date.now();
let retryDelay: number | undefined;
Collaborator

How about we initialize retryDelay by setting it to CONNECTION_RETRIES_MIN_DELAY, and then move the line that doubles the retry delay to after the retryDelay > Peer.CONNECTION_RETIES_MAX_PERIOD check. It feels a bit cleaner that way, and saves a bunch of null checks on retryDelay.
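
A minimal sketch of the ordering being suggested, assuming the constant names from the diff; the scheduleRetry helper is illustrative rather than the actual PR code:

```ts
// Illustrative sketch of the suggested ordering: retryDelay starts at the minimum,
// the max-period check runs first, and the doubling happens only after that check,
// so no undefined/null handling of retryDelay is needed.
const CONNECTION_RETRIES_MIN_DELAY = 5000;
const CONNECTION_RETRIES_MAX_DELAY = 3600000;
const CONNECTION_RETRIES_MAX_PERIOD = 604800000;

const startTime = Date.now();
let retryDelay = CONNECTION_RETRIES_MIN_DELAY; // initialized up front

const scheduleRetry = (retryConnection: () => void, reject: (err: Error) => void) => {
  // Check the total retry budget first...
  if (Date.now() - startTime + retryDelay > CONNECTION_RETRIES_MAX_PERIOD) {
    reject(new Error('connection retries max period exceeded'));
    return;
  }
  // ...then schedule the next attempt and double the delay afterwards.
  setTimeout(retryConnection, retryDelay);
  retryDelay = Math.min(retryDelay * 2, CONNECTION_RETRIES_MAX_DELAY);
};
```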

Collaborator Author

That's what I did originally, but it forced me to update the next value in the timeout function, which I found to be less optimal. I can change it back.

Collaborator

What do you mean it forced you to update the next value?

Collaborator Author

I changed it, so you can look now. Nothing serious.

lib/p2p/Peer.ts Outdated
: Math.min(Peer.CONNECTION_RETIES_MAX_DELAY, retryDelay * 2);

if (Date.now() - startTime + retryDelay > Peer.CONNECTION_RETIES_MAX_PERIOD) {
this.close();
Collaborator

Here too I don't think we need to close because we're not connected in the first place.

Collaborator Author

See my comment above.

Before my recent changes, this.close() was called even if the connection wasn't established yet. I changed that, and that's why I'm calling it here as well.

lib/p2p/Peer.ts Outdated
}, retryDelay);
};

const listen = () => {
Collaborator

"listen" seems a bit inaccurate here, it makes it sound like we're listening for incoming connections. Maybe "bind"?

lib/p2p/Peer.ts Outdated
return;
}

this.logger.info(
Collaborator

I think debug or perhaps verbose would be better here, this feels like we'd have a lot of messages in the logs that aren't particularly useful.

@moshababo
Collaborator Author

have you tested this yet?

I tested it manually. I'd like to add some automatic tests as well, after we settle on everything.

Another thing that occurred to me while I was going through this: how do we handle the case where we have a timer set to retry a connection to a node, and in the meantime that node successfully connects to us? Maybe we need to make some check before retrying a connection to ensure we're not already connected?

It's not handled at the moment. If the inbound node suddenly becomes available to us for outbound as well, then a duplicate connection is supposed to be detected. If it's not available to us, it will continue to retry. I can add some checks to avoid that; hopefully it won't become too verbose.

Also if I follow correctly, Pool will contain Peers that we're not currently connected to, but instead are only pending reconnection attempts. I think that may affect some other logic that treats all the peers in Pool as actively connected. Which makes me also realize that it's possible we'd have some pre-handshake peers in Pool and we should check for those before doing things like sending them orders. Maybe I will address that in a separate PR, by checking that we're in post handshake state with peers before we attempt to send them packets.

The peer object will exist, but it's not accessible and not on PeerList, because peers are added there only after they are "opened", while initConnection with its retries is blocking that.

@sangaman
Collaborator

It's not handled at the moment. If the inbound node suddenly becomes available to us for outbound as well, then a duplicate connection is supposed to be detected. If it's not available to us, it will continue to retry. I can add some checks to avoid that; hopefully it won't become too verbose.

Right, I think currently it would get to the point of attempting a handshake and then fail, not a huge deal. It would be nice to not even attempt the outgoing connection if we're already connected, but as you said if it's too complicated or verbose then don't worry about it. Nothing obvious jumps to mind since the Peer object isn't aware of other peers and connections.

The peer object will exist, but it's not accessible and not on PeerList, because peers are added there only after they are "opened", while initConnection with its retries is blocking that.

Yes you are right, I'd overlooked that. Never mind.

lib/p2p/Pool.ts Outdated
@@ -155,7 +155,7 @@ class Pool extends EventEmitter {
private connectNode = async ({ addresses, nodePubKey }: NodeConnectionInfo) => {
for (let n = 0; n < addresses.length; n += 1) {
try {
await this.addOutbound(addresses[n], nodePubKey);
await this.addOutbound(addresses[n], nodePubKey, true);
Collaborator

One more thought: it looks like we'd wait for the full timeout (currently a week, although I'm thinking we should probably shorten that) before we attempt the next address for a node that has multiple listening addresses. What do you think about making it so we retry each address once before delaying and retrying? That would definitely require a bit of refactoring, but I wanted to see what you thought.

Collaborator Author

Sounds like a good idea, and it shouldn't require serious refactoring: just one iteration without retries, and another one with.

Collaborator

OK, that should work, but then we'd basically only be retrying a single address from that point on, right?

Collaborator Author

Yes. I continued to discuss the option of doing the retries in parallel on the main comments thread.

Contributor

Answered in the main thread. In short: yes, retry logic PER listening address (otherwise this can take a very long time)

@moshababo
Collaborator Author

Right, I think currently it would get to the point of attempting a handshake and then fail, not a huge deal. It would be nice to not even attempt the outgoing connection if we're already connected, but as you said if it's too complicated or verbose then don't worry about it. Nothing obvious jumps to mind since the Peer object isn't aware of other peers and connections.

This indeed breaks the encapsulation, although technically it can be implemented simply by Pool passing Peer a callback. But I didn't want to do it now, since you already raised some similar issues on gitter, which might require a different solution.
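
A hypothetical sketch of that callback approach, where Pool hands Peer a predicate it can consult before retrying; all names and signatures here are assumptions for illustration only:

```ts
// Pool gives Peer a predicate so the retry loop can skip attempts when the
// node has already connected to us inbound in the meantime.
type IsConnected = (nodePubKey: string) => boolean;

class PeerSketch {
  constructor(private nodePubKey: string, private isAlreadyConnected: IsConnected) {}

  public retryConnection = (connect: () => Promise<void>): Promise<void> | undefined => {
    if (this.isAlreadyConnected(this.nodePubKey)) {
      // The node connected to us inbound while we were waiting; skip the outbound retry.
      return undefined;
    }
    return connect();
  }
}

// Usage idea: Pool constructs peers with a closure over its own peer map, e.g.
// const peer = new PeerSketch(pubKey, pk => pool.peers.has(pk));
```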

Other topics:

  1. don't retry connecting with a node which we never had a successful connection with
  2. when iterating on the node addresses, don't retry connecting until all addresses have been tried. If all failed, retry connecting on the first address
  3. soft-delete nodes which exceeded the max retry period by deleting their addresses list (although it might turn out to be empty also because the nodes failed to retrieve their external ip address)

@kilrau let's all agree on that before I go ahead and implement it

@sangaman
Collaborator

I'd agree with all 3. Retrying connections on all their addresses would be nice but not a big deal.

@moshababo
Collaborator Author

We can do retries on multiple addresses in parallel. Verifying that the pubkey wasn't already connected before each retry shouldn't be a big deal.

@sangaman
Collaborator

OK, if you think you can get parallel connection attempts working that would be a nice bonus, but it's not necessary for now. You could also approach that in a different PR if you'd like.

@kilrau kilrau mentioned this pull request Aug 27, 2018
@sangaman
Collaborator

I'm actually thinking that maybe we want to keep the connection attempts serial, so that we don't have to deal with multiple attempts succeeding at once and thereby creating multiple connections to the same node. I brought up that issue in #398.

@moshababo Are you planning to make any more changes to this PR or should I review it again as is?

@moshababo
Collaborator Author

I don't see a problem with it, if we manage the pending peers correctly. But we can indeed avoid this complexity for now.

There's nothing major to review yet. I was waiting for more feedback on my comment. I'll go ahead and implement it anyway tomorrow morning.

@sangaman
Collaborator

OK, here are my thoughts after reading/thinking some more.

don't retry connecting with a node which we never had a successful connection with

Yes, and in fact I think ultimately we may want to only retry connections to the last address we've successfully connected to.

when iterating on the node addresses, don't retry connecting until all addresses have been tried. If all failed, retry connecting on the first address

Yes, and like above, I think maybe the "first address" in this case should be the last successful address we've connected to.

soft-delete nodes which exceeded the max retry period by deleting their addresses list (although it might turn out to be empty also because the nodes failed to retrieve their external ip address)

Yes, and maybe we can delete particular addresses that have failed from the list.

@kilrau
Contributor

kilrau commented Aug 27, 2018

  1. don't retry connecting with a node which we never had a successful connection with

Agree, once we've gone through the whole retry cycle and never managed to connect, we don't retry again (even after reboot). This node gets soft-deleted, which needs to be stored permanently. As described in #365 (comment) we should mark successful connections and only apply retry logic to those.

  2. when iterating on the node addresses, don't retry connecting until all addresses have been tried. If all failed, retry connecting on the first address

I'm actually thinking that maybe we want to keep the connection attempts serial, so that we don't have to deal with multiple attempts succeeding at once and thereby creating multiple connections to the same node. I brought up that issue in #398.

Ehm, not sure if I get you right, but I think we should try addresses in parallel, otherwise this procedure will take ages if a peer tells me 10 listeningAddresses...

t1: try listeningAddress1, wait for delay x, try listeningAddress1, wait for delay x*2
t2: try listeningAddress2, wait for delay x, try listeningAddress2, wait for delay x*2

we can just make peers in pendingPeers be pubKey@IP:port, see #398 (comment)

EDIT: just saw your comment @sangaman. I think I'm not getting this one right. Could you please clarify?

  3. soft-delete nodes which exceeded the max retry period by deleting their addresses list (although it might turn out to be empty also because the nodes failed to retrieve their external ip address)

Agree, soft-deleting makes sense to me now, and also storing 1. permanently. Not sure if deleting the listening addresses is the way to do it though; you decide on that.

@sangaman
Collaborator

Ehm, not sure if I get you right, but I think we should try addresses in parallel, otherwise this procedure will take ages if a peer tells me 10 listeningAddresses...

t1: try listeningAddress1, wait for delay x, try listeningAddress1, wait for delay x*2
t2: try listeningAddress2, wait for delay x, try listeningAddress2, wait for delay x*2

I'm thinking we try each address only once with no delay and no retry, one at a time, and if they all fail, start the wait-retry routine only with the first address (or preferably last successful address).
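
A rough sketch of the flow being converged on here, with illustrative names and types (not the actual Pool.connectNode code): one serial pass over all advertised addresses with no retries, and only if every attempt fails do we start the wait-and-retry routine on a single address.

```ts
// Illustrative sketch: try every advertised address once, then retry with backoff
// only on the last successfully connected address (if we ever connected before).
type Address = { host: string, port: number, lastConnected?: number };

const connectNodeSketch = async (
  addresses: Address[],
  tryOnce: (addr: Address) => Promise<void>,
  retryWithBackoff: (addr: Address) => Promise<void>,
): Promise<void> => {
  for (const address of addresses) {
    try {
      await tryOnce(address); // single attempt, no delay, no retries
      return;
    } catch (err) {
      // ignore and move on to the next advertised address
    }
  }
  // All addresses failed; fall back to the wait-and-retry routine on the address
  // we most recently connected to successfully, if there is one.
  const sorted = [...addresses].sort((a, b) => (b.lastConnected || 0) - (a.lastConnected || 0));
  if (sorted.length && sorted[0].lastConnected) {
    await retryWithBackoff(sorted[0]);
  }
};
```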

@kilrau
Contributor

kilrau commented Aug 27, 2018

... and just don't care about all other listeningAddresses? Or will we try them at some point? If yes, when?

@sangaman
Collaborator

We try them all once up front, and only continue retrying with a single one.

@moshababo
Collaborator Author

moshababo commented Aug 28, 2018

If we want to designate the "first address" to be the one we had a successful connection with, it doesn't work well with deleting it upon connection retry failure.

Doing retries on different addresses in a serial way is obviously infeasible. So either we do it in parallel, or we try only one address.
If the latter, then either we don't delete upon failure, or we first fix the "first address" problem (by adding another property to the address object, and maybe also moving the array to a different table).
If we want to do it in parallel, then the same applies, but it also affects #398, so we'd better finalize that one first.

I pushed some WIP stuff, basically just implementing suggestions 1 & 2, relying on the first address being the one we had a successful connection with (which might not be the case).

Let's talk about this in our weekly dev chat.

@kilrau
Contributor

kilrau commented Aug 28, 2018

As agreed, only retry on the latest successful address. @moshababo

@sangaman
Collaborator

One consideration for adding another property to the address object is that we're updating the addresses in the nodes table each time we get a HelloPacket. A peer could technically remove and add the same address from one handshake to the next, or we could connect to a peer using an address that it doesn't advertise in the handshake. The advantage of tracking the last connected address (or addresses) separately in the database would be to separate it from handshake data.

@moshababo
Collaborator Author

moshababo commented Aug 30, 2018

@sangaman I missed seeing your comment before implementing it. I encountered what you described, and solved it by not overriding existing addresses when they are updated from the handshake state (add/remove still works). It doesn't solve the problem of xud being able to connect to a peer using an address which wasn't advertised by the peer, but I don't think that's a problem, because we're not re-using this address anyway at the moment.

We can start separating it as you suggested, and this will also give us the ability to re-use this address on future bootstrapping. But I'm not sure that wouldn't be overkill for now. Let me know what you think.

Other stuff:

  • retrying only on bootstrap nodes, only on latest connected address
  • if the max retry period is exceeded, the address is removed
  • Pool.addOutbound changed to start throwing errors. The string-based interface wasn't a good design, and I wanted to change it before, but now I had to, mostly for Pool.connectNode. This required me to refactor the p2p sanity tests. For gRPC access, the error is caught by GrpcService, so it doesn't crash.
  • continued to apply the tryX naming convention, for functions which swallow errors

@kilrau kilrau added this to the 1.0.0-prealpha.3 milestone Sep 5, 2018
@kilrau
Contributor

kilrau commented Sep 5, 2018

"Reconnect to last successful address of a peer first" in a separate issue, so taking it out of this one #442

Other stuff:

  • retrying only on bootstrap nodes, only on latest connected address
  • if the max retry period is exceeded, the address is removed
  • Pool.addOutbound changed to start throwing errors. The string-based interface wasn't a good design, and I wanted to change it before, but now I had to, mostly for Pool.connectNode. This required me to refactor the p2p sanity tests. For gRPC access, the error is caught by GrpcService, so it doesn't crash.
  • continued to apply the tryX naming convention, for functions which swallow errors

Sounds all right to me. Are these questions?

@moshababo
Collaborator Author

"Reconnect to last successful address of a peer first" in a separate issue, so taking it out of this one #442

I assumed you referred here to the initial connection sequence (iterating through all addresses), not to the connection retries.
We already had the retries working only on the last successful address, so not much was needed to support what you suggested, and I already added it here.

The other stuff I wrote describes what I already implemented.

@sangaman sangaman left a comment (Collaborator)

Sorry for not doing another review sooner, I didn't realize it was ready. Comments below, nothing major but a few things are probably worth changing.

node.addresses = addresses;
// avoid overriding the `lastConnected` field for existing matching addresses unless a new value was set
node.addresses = addresses.map((newAddress) => {
const oldAddress = node.addresses.find(address => address.host === newAddress.host && address.port === newAddress.port);
Collaborator

Might be nice to have an equality method in addressUtils for this logic since we're using it in a couple of places.
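
For illustration, the equality helper could look roughly like this; addressUtils exists in xud, but this particular function and its signature are assumptions:

```ts
// Sketch of an address-equality helper for reuse wherever host/port pairs are compared.
type Address = { host: string, port: number, lastConnected?: number };

const addressesAreEqual = (a: Address, b: Address): boolean => {
  return a.host === b.host && a.port === b.port;
};

// The snippet above could then read:
// const oldAddress = node.addresses.find(address => addressesAreEqual(address, newAddress));
```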

lib/p2p/Peer.ts Outdated
@@ -234,17 +242,22 @@ class Peer extends EventEmitter {
* Ensure we are connected (for inbound connections) or listen for the `connect` socket event (for outbound connections)
* and set the [[connectTime]] timestamp. If an outbound connection attempt errors or times out, throw an error.
*/
private initConnection = async () => {
private initConnection = (retry?: boolean): Promise<void> => {
Collaborator

Making this function async and not returning an explicit Promise.resolve() below was an intentional change that @michael1011 and I made in a separate PR. I think the code is a bit cleaner that way. If you agree, line 252 would just return and 255 would return a Promise<void>.

Another option is to keep the method signature as is, and move the if (this.connected) logic into the new Promise logic, and then resolve() and return in there.
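
A minimal standalone illustration of the two options described above; the connected flag and attempt callbacks are stand-ins for the real Peer internals, not the actual code:

```ts
// Option 1: keep the method async and simply return early when already connected;
// the Promise<void> is implicit.
const initConnectionAsync = async (connected: boolean, attempt: () => Promise<void>): Promise<void> => {
  if (connected) {
    return;
  }
  return attempt();
};

// Option 2: keep the explicit Promise signature and move the connected check
// inside the executor, resolving immediately when no attempt is needed.
const initConnectionExplicit = (
  connected: boolean,
  attempt: (resolve: () => void, reject: (err: Error) => void) => void,
): Promise<void> => {
  return new Promise<void>((resolve, reject) => {
    if (connected) {
      resolve();
      return;
    }
    attempt(resolve, reject);
  });
};
```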

Collaborator Author

I overrode this method signature when I resolved conflicts. Will return it to how it was on master.

lib/p2p/Peer.ts Outdated

this.connectTimeout = setTimeout(onTimeout, Peer.CONNECTION_TIMEOUT);
const bind = () => {
Collaborator

This would be a bit easier to read if bind were declared before onError which uses the bind method, so just move this block of code up a bit.

Collaborator Author

But then we'll have the same problem, because bind uses onError ;)

lib/p2p/Pool.ts Outdated
@@ -11,6 +11,7 @@ import { OutgoingOrder, OrderIdentifier, StampedPeerOrder } from '../types/order
import { Models } from '../db/DB';
import Logger from '../Logger';
import { HandshakeState, Address, NodeConnectionInfo } from '../types/p2p';
import { NodeInstance } from '../types/db';
Collaborator

I don't think NodeInstance is actually needed in this file. I pulled down the branch and removed all references and it compiled fine. Is there another reason for this, or can it just be removed?

Collaborator Author

I used it to make it clear that connectNodes can also get a db node entry. I ended up not doing anything with it, but still left this definition. I'll remove it.

if (retryConnecting && sortedAddresses.length && sortedAddresses[0].lastConnected) {
try {
await this.addOutbound(sortedAddresses[0], nodePubKey, true);
} catch (err) {}
Collaborator

Same here for logging.

lib/p2p/Pool.ts Outdated
@@ -331,7 +355,7 @@ class Pool extends EventEmitter {
}
case PacketType.NODES: {
const nodes = (packet as packets.NodesPacket).body!;
await this.connectNodes(nodes);
await this.connectNodes(nodes, false, false);
Collaborator

I'd prefer this line without explicitly setting the boolean parameters to false since that's their default, it's just less verbose.

lib/p2p/Pool.ts Outdated
// if connection is inbound, then the port is discarded. This might result in multiple addresses getting updated
const addresses = peer.addresses!.map((address) => {
if (peer.socketAddress.host === address.host &&
peer.inbound ? true : peer.socketAddress.port === address.port
Collaborator

I don't think we should be updating lastConnected on inbound connections. As you point out, it might update multiple addresses, but mainly the inbound connection doesn't actually verify that we're able to connect to their external address, which is what I'm thinking lastConnected should track.
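
A hypothetical sketch of the change being suggested, stamping lastConnected only for outbound connections; the types and field names here are illustrative, not the actual Pool code:

```ts
// Only mark an address as "last connected" when we actually dialed out to it,
// since an inbound connection doesn't prove the peer's advertised address is reachable.
type Address = { host: string, port: number, lastConnected?: number };
type PeerInfo = { inbound: boolean, socketAddress: Address, addresses: Address[] };

const stampLastConnected = (peer: PeerInfo): Address[] => {
  if (peer.inbound) {
    // Inbound: leave the advertised addresses untouched.
    return peer.addresses;
  }
  return peer.addresses.map((address) => {
    if (address.host === peer.socketAddress.host && address.port === peer.socketAddress.port) {
      return { ...address, lastConnected: Date.now() };
    }
    return address;
  });
};
```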

Collaborator Author

Makes sense, I'll change it.

@@ -29,6 +32,10 @@ const errors = {
message: `could not retrieve external IP: ${err.message}`,
code: errorCodes.EXTERNAL_IP_UNRETRIEVABLE,
}),
CONNECTING_RETRIES_MAX_PERIOD_EXCEEDED: {
message: `Connection retries max period exceeded`,
Collaborator

Maybe make this a bit more descriptive, like "Connection retry attempts to peer exceeded maximum time allotment"

});

after(async () => {
await Promise.all([nodeOne['db'].models.Node.truncate(), nodeTwo['db'].models.Node.truncate()]);
await Promise.all([nodeOne['shutdown'](), nodeTwo['shutdown']()]);
});

Collaborator

Can drop this extra blank line.

@moshababo moshababo requested a review from sangaman September 7, 2018 10:06
@sangaman
Collaborator

sangaman commented Sep 7, 2018

Looks like it just needs to resolve a small conflict and be squashed, then this is ready for merge. Great work.

@sangaman
Collaborator

sangaman commented Sep 8, 2018

To get this in for our release I squashed and resolved the conflict myself, merging.

@sangaman sangaman merged commit 1f61aae into master Sep 8, 2018
@ghost ghost removed the in progress label Sep 8, 2018
@sangaman sangaman deleted the conn_retries branch September 8, 2018 03:07
@sangaman sangaman mentioned this pull request Sep 8, 2018