Connection retry logic of a momentarily unavailable peer #311

Closed
kilrau opened this issue Aug 9, 2018 · 7 comments

@kilrau
Contributor

kilrau commented Aug 9, 2018

I don't think we got this yet: #185 (comment)

Do this now or as part of #178?

@sangaman
Collaborator

sangaman commented Aug 9, 2018

Yeah, we had discussed an exponential backoff on connection attempts, and I think that would be a decent and simple way to implement this functionality, at least at first. We'd keep trying to connect, with longer waits between attempts after each failure.

kilrau added this to the 1.0.0-alpha milestone Aug 14, 2018
kilrau modified the milestones: 1.0.0-alpha.2, 1.0.0-alpha.1 Aug 15, 2018
@kilrau
Contributor Author

kilrau commented Aug 15, 2018

Proposal:
Minimum Delay is the minimum amount of time, in seconds, that xud will wait before it tries to reconnect to a peer after a connection failure.

It works together with the Maximum Delay and Incremental Delay attributes (all in seconds). After the first failure to connect to a peer, xud will wait for the number of seconds defined by Minimum Delay.

If the second attempt also fails, xud will increase its waiting time as defined by Incremental Delay. The waiting time is capped at Maximum Delay: once it reaches that value, xud will not increase it any further.

If no successful connection can be established within DeleteNodes Delay, the node is deleted from the nodes db.
Once xud successfully connects to a peer, its waiting time is reset to the initial value defined by Minimum Delay.

Minimum Delay: 5
Incremental Delay: Current Delay*2
Maximum Delay: 3600
DeleteNodes Delay: 604800

Source: https://docs.oracle.com/cd/E13222_01/wls/docs81/ConsoleHelp/domain_messagingbridge_config_connectionretry.html
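To make the proposed values concrete, here is a minimal TypeScript sketch of the delay calculation. The names `RetryConfig`, `nextDelay` and `delaySequence` are hypothetical and not taken from the xud codebase; the sketch assumes that "Current Delay*2" means the previous delay is doubled and then capped at Maximum Delay.

```typescript
// Hypothetical names for illustration; the values are the ones proposed above (seconds).
interface RetryConfig {
  minDelay: number; // Minimum Delay
  maxDelay: number; // Maximum Delay
}

const proposedConfig: RetryConfig = { minDelay: 5, maxDelay: 3600 };

/**
 * Returns the delay to wait before the next connection attempt.
 * `currentDelay` is the delay used before the previous attempt,
 * or undefined if this is the first failure.
 */
function nextDelay(currentDelay: number | undefined, config: RetryConfig): number {
  if (currentDelay === undefined) {
    return config.minDelay; // first failure: wait Minimum Delay
  }
  // Incremental Delay: double the current delay, capped at Maximum Delay
  return Math.min(currentDelay * 2, config.maxDelay);
}

/** Lists the first `count` delays of the schedule. */
function delaySequence(count: number, config: RetryConfig): number[] {
  const delays: number[] = [];
  let delay: number | undefined;
  for (let i = 0; i < count; i += 1) {
    delay = nextDelay(delay, config);
    delays.push(delay);
  }
  return delays;
}

console.log(delaySequence(12, proposedConfig));
// [5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 3600, 3600]
```

On a successful connection, the caller would simply discard the stored delay so that the next failure starts again from Minimum Delay.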

@kilrau kilrau added the p2p Peer to peer networking label Aug 15, 2018
@sangaman
Collaborator

I think a week isn't that long if/when we target non-exchanges that don't expect to be online 24/7. It would be good to have this configurable, but something around a week may be fine for now. Mainly, I think a fixed 5 second backoff is too short and will result in way too many connection attempts. An exponential backoff (5, 10, 20, 40, 80, 160, etc.) makes a lot more sense.

Also, I don't think we should delete the node from the db per se. Maybe we just drop its listening addresses from the db, indicating that we no longer believe they are good, and stop trying to connect to them.
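The difference between deleting a node and dropping only its listening addresses could look roughly like the sketch below. The `NodeRecord` and `Address` shapes and both function names are assumptions made for illustration; xud's actual database schema may differ.

```typescript
// Hypothetical record shapes, not xud's actual schema.
interface Address {
  host: string;
  port: number;
}

interface NodeRecord {
  nodePubKey: string;
  addresses: Address[];
}

/**
 * Keeps the node entry but forgets its listening addresses,
 * signalling that they are no longer believed to be reachable.
 * Returns a new record and leaves the input untouched.
 */
function dropListeningAddresses(node: NodeRecord): NodeRecord {
  return { ...node, addresses: [] };
}

/** A node without known addresses is skipped by the retry logic. */
function hasConnectableAddress(node: NodeRecord): boolean {
  return node.addresses.length > 0;
}

const stale: NodeRecord = {
  nodePubKey: "02abc", // placeholder value
  addresses: [{ host: "192.0.2.1", port: 8885 }], // documentation-range IP, arbitrary port
};

const kept = dropListeningAddresses(stale);
console.log(kept.nodePubKey, hasConnectableAddress(kept)); // 02abc false
```

Keeping the record means anything else stored against the node's public key survives, while the empty address list is enough to stop outbound attempts.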

@kilrau
Contributor Author

kilrau commented Aug 16, 2018

adjusted

@moshababo
Collaborator

@kilrau

> Once the waiting time is increased to the maximum value, xud will not increase its waiting time anymore.

It will not increase, or reconnection attempts should stop?

Another thing: Maximum Delay is smaller than DeleteNodes Delay, so how would the latter ever be reached?
Instead, we can stop trying to connect to a node if its last successful connection was too long ago.
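The alternative suggested here, giving up based on the time since the last successful connection, can be sketched as a single check. All names below are hypothetical, and the fallback to the first attempt time for nodes that never connected is an assumption added for the sketch, not something settled in this thread.

```typescript
// One week in seconds, the DeleteNodes Delay value from the proposal.
const MAX_OFFLINE_SECONDS = 604800;

/**
 * Decides whether reconnection attempts should continue.
 * All timestamps are in seconds.
 *
 * @param lastConnectedAt time of the last successful connection, if any
 * @param firstAttemptAt  time of the first connection attempt (assumed fallback
 *                        for nodes we never managed to connect to)
 * @param now             current time
 */
function shouldKeepRetrying(
  lastConnectedAt: number | undefined,
  firstAttemptAt: number,
  now: number,
  maxOffline: number = MAX_OFFLINE_SECONDS,
): boolean {
  const reference = lastConnectedAt !== undefined ? lastConnectedAt : firstAttemptAt;
  return now - reference <= maxOffline;
}

// Last seen 3 days ago: keep trying.
console.log(shouldKeepRetrying(0, 0, 3 * 86400)); // true
// Last seen 8 days ago: give up.
console.log(shouldKeepRetrying(0, 0, 8 * 86400)); // false
```

With this check the retry interval and the give-up condition are independent, so it does not matter that Maximum Delay is much smaller than the give-up threshold.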

At the moment we don't list a node in the db before our first successful connection to it.
So when trying to connect to a new node, if reconnection attempts stop at some point, the node will be forgotten.

@kilrau
Contributor Author

kilrau commented Aug 22, 2018

> Once the waiting time is increased to the maximum value, xud will not increase its waiting time anymore.
> It will not increase, or reconnection attempts should stop?

It will not increase. Reconnection attempts go on until DeleteNodes Delay (counted as total elapsed time) is reached, then the node gets deleted. In plain language: the reconnection attempts start with a 5s delay, which then doubles after each failure until a 3600s interval is reached. This 3600s interval is kept until a total time of 604800s (1 week) has passed, then the node gets deleted. Maybe DeleteNodeTime is the better name.
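As a worked example of this schedule, the sketch below counts how many reconnection attempts fit into the one-week window. The function name is hypothetical, and the simulation assumes that the clock starts at the first failure and that each attempt itself takes no time.

```typescript
interface RetrySummary {
  attempts: number;      // reconnection attempts made before giving up
  lastAttemptAt: number; // seconds after the first failure
}

/**
 * Simulates the schedule: wait minDelay, double after every failed attempt,
 * cap at maxDelay, and stop once the next attempt would fall after deleteAfter.
 * All values are in seconds.
 */
function simulateRetrySchedule(
  minDelay: number,
  maxDelay: number,
  deleteAfter: number,
): RetrySummary {
  let elapsed = 0;
  let delay = minDelay;
  let attempts = 0;
  while (elapsed + delay <= deleteAfter) {
    elapsed += delay; // wait, then attempt (assumed to fail)
    attempts += 1;
    delay = Math.min(delay * 2, maxDelay);
  }
  return { attempts, lastAttemptAt: elapsed };
}

console.log(simulateRetrySchedule(5, 3600, 604800));
// { attempts: 176, lastAttemptAt: 602715 }
```

The first 10 attempts (delays of 5s up to 2560s) take 5115s in total; the remaining 166 attempts are spaced 3600s apart. A fixed 5s delay over the same week would instead produce 120960 attempts.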

> So when trying to connect to a new node, if reconnection attempts would stop at some point, the node will be forgotten.

With #386 we'll learn about the node again once it's online again at some point.

@kilrau
Contributor Author

kilrau commented Sep 5, 2018

add automatic test
