Backoff for websocket connection retry #338
Conversation
The exponential backoff wrapping looks good to me. Would feel better about merging if someone a little closer to the go-code these days also gave the thumbs up :)
About the max retries: for websocket connections to Infura, I think we really should retry indefinitely. We don't want to force user interaction if all they need to do is re-establish the connection; that is something our code can do on its own. The only thing I'm uncertain about at the moment is how indefinite retries would affect the code that's responsible for consuming the event monitor interface.
However, we certainly should still notify the user if we lose the connection for some period of time, just so they are apprised of the websocket status. But if we can continue to retry, we should.
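For illustration, here is a rough sketch of what indefinite retries plus a user-facing status notification could look like with the cenkalti/backoff package this PR uses. The `subscribeToInfura` function is a hypothetical stand-in for whatever re-establishes the websocket subscription; it is not part of this PR.

```go
package sketch

import (
	"time"

	"github.com/cenkalti/backoff"
	"github.com/golang/glog"
)

// subscribeToInfura is a hypothetical stand-in for whatever re-establishes
// the websocket subscription; it is not part of this PR.
func subscribeToInfura() error {
	// ... dial the websocket and resubscribe to events ...
	return nil
}

// retryForever keeps retrying with exponential backoff and logs the
// connection status before each wait, without ever giving up.
func retryForever() error {
	bo := backoff.NewExponentialBackOff()
	bo.MaxElapsedTime = 0 // 0 disables the elapsed-time cutoff, so retries never stop

	// RetryNotify invokes notify before each wait, which is a convenient
	// place to surface the websocket status to the user without stopping.
	notify := func(err error, next time.Duration) {
		glog.Errorf("Websocket connection lost: %v. Retrying in %v", err, next)
	}
	return backoff.RetryNotify(subscribeToInfura, bo, notify)
}
```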
circle.yml (Outdated)
@@ -15,7 +15,7 @@ dependencies:
      - "$HOME/ffmpeg"
      - "$HOME/compiled"
  override:
    - go get github.com/livepeer/go-livepeer/cmd/livepeer
    # - go get github.com/livepeer/go-livepeer/cmd/livepeer
Replace with git clone?
Good idea!
core/livepeernode.go (Outdated)
	}
	bo := backoff.NewExponentialBackOff()
	bo.MaxElapsedTime = time.Second * 15
	if err := backoff.Retry(getBlock, backoff.WithMaxRetries(bo, SubscribeRetry)); err != nil {
In the case of websocket connections to Infura, we really should retry indefinitely. We don't want to force user interaction if all they need to do is re-establish the connection.
Fair enough. My concern was around spamming the Infura network, but I guess they are there for a reason and we should always try to reconnect.
eth/claimmanager.go (Outdated)
	// segs := make([]int64, 0)
	// for k, _ := range c.unclaimedSegs {
	// 	segs = append(segs, k)
	// }
extraneous?
glog.Errorf("SubscribeNewRound error: %v. Retrying...", err) | ||
return err | ||
} else { | ||
glog.Infof("SubscribeNewRound successful.") |
Is this a one-time message or would it lead to additional logging after each round?
It's a one-time message per connection (it will re-print if there is a re-connection)
I like the use of backoff strategies, but I feel like it might be better if the backoff retry logic were placed in the RPC client (https://github.com/livepeer/go-livepeer/blob/master/vendor/github.com/ethereum/go-ethereum/rpc/client.go), since all connection-based failures will originate from some operation in that client. This would also avoid duplicating the retry setup code everywhere we need to make an RPC request. The current retry code in the RPC client (https://github.com/livepeer/go-livepeer/blob/master/vendor/github.com/ethereum/go-ethereum/rpc/client.go#L261) could be modified to use a backoff strategy, and similar backoff retry logic could be placed in the subscription-based functions in the RPC client.
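If the retry logic did live in the RPC client, a minimal sketch might look like the following. The `dial` callback is a hypothetical placeholder for however the real client (re)establishes its connection; the actual go-ethereum internals are more involved.

```go
package sketch

import (
	"time"

	"github.com/cenkalti/backoff"
	"github.com/golang/glog"
)

// reconnectWithBackoff wraps a hypothetical reconnect callback in an
// exponential backoff so every caller of the RPC client gets the same
// retry behaviour without duplicating the setup code.
func reconnectWithBackoff(dial func() error) error {
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = time.Second
	bo.MaxInterval = 30 * time.Second
	bo.MaxElapsedTime = 0 // keep retrying; bound this if callers should eventually give up

	return backoff.Retry(func() error {
		if err := dial(); err != nil {
			glog.Errorf("RPC reconnect failed: %v. Retrying...", err)
			return err
		}
		return nil
	}, bo)
}
```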
		return nil
	}
	if err := backoff.Retry(getBlock, backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), SubscribeRetry)); err != nil {
Why use a constant backoff here, but an exponential backoff strategy in CreateTranscodeJob in livepeernode.go?
I think I removed all the exponential backoffs.
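For reference, the two strategies from the backoff package differ only in how the wait between attempts grows. A minimal side-by-side sketch, where `getBlock` and `maxRetries` stand in for the real operation and the SubscribeRetry constant used in this PR:

```go
package sketch

import (
	"time"

	"github.com/cenkalti/backoff"
)

// compareStrategies shows a constant and an exponential retry strategy side
// by side; getBlock and maxRetries are placeholders for the real operation
// and the SubscribeRetry constant.
func compareStrategies(getBlock func() error, maxRetries uint64) {
	// Constant: the same fixed wait between every attempt.
	constBO := backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), maxRetries)
	_ = backoff.Retry(getBlock, constBO)

	// Exponential: the wait grows (with jitter) after each failure,
	// and MaxElapsedTime bounds the total time spent retrying.
	expBO := backoff.NewExponentialBackOff()
	expBO.MaxElapsedTime = 15 * time.Second
	_ = backoff.Retry(getBlock, backoff.WithMaxRetries(expBO, maxRetries))
}
```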
eth/eventmonitor.go (Outdated)
@@ -35,6 +40,7 @@ type eventMonitor struct {
	backend         *ethclient.Client
	contractAddrMap map[string]common.Address
	eventSubMap     map[string]*EventSubscription
	latestBlock     *big.Int
Is this used anywhere?
@yondonfu - the reason I implemented the backoff strategies in the user of the RPC client is that we may want different backoff strategies for different types of connections. For example, @j0sh brought up the fact that the transcoder should keep retrying forever. But I'm not sure if all RPC requests should have that behavior.
Ah, makes sense. I think we can change the appropriate retry strategy for a transcoder in a separate PR.
	var job *lpTypes.Job
	getJob := func() error {
		j, err := s.node.Eth.GetJob(jid)
		if j.StreamId == "" {
Hm, so when using Infura, an event can come back with empty fields? Weird. I think there is an edge case where someone actually creates a job with an empty streamID, but I guess in that scenario, after a number of retries, the transcoder will give up and just drop the new job event.
Yeah it's really strange... But I've seen that a few times and it crashed the transcoder. #342
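As an aside, if a legitimately empty streamID ever needs to be distinguished from an Infura glitch, the backoff package offers a Permanent error wrapper (assuming the vendored version includes it) that stops the retry loop immediately instead of retrying forever. A hedged sketch, with a hypothetical `getJob` lookup:

```go
package sketch

import (
	"errors"
	"time"

	"github.com/cenkalti/backoff"
)

// getJobGuarded retries transient lookup failures, but treats a job that
// legitimately has an empty streamID as a permanent error so the retry loop
// stops right away. getJob is a hypothetical stand-in for the real lookup.
func getJobGuarded(getJob func() (string, error)) error {
	op := func() error {
		streamID, err := getJob()
		if err != nil {
			return err // transient: retry after the next backoff interval
		}
		if streamID == "" {
			// Permanent wraps the error so backoff.Retry returns immediately.
			return backoff.Permanent(errors.New("job has empty streamID"))
		}
		return nil
	}
	return backoff.Retry(op, backoff.NewConstantBackOff(2*time.Second))
}
```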
		job = j
		return err
	}
	if err := backoff.Retry(getJob, backoff.NewConstantBackOff(time.Second*2)); err != nil {
		glog.Errorf("Error getting job info: %v", err)
		return false, err
One additional thing I thought of: in the edge case described in the above comment, the transcoder would retry a number of times, then give up and actually stop watching for new job events, since it returns false, err.
I think the current logic here is to retry indefinitely every 2 seconds. I'm actually not sure when we will get an error here, but if we do end up getting one, something unexpected has happened.
We've been using Infura, and they recommend treating the connection as if it could be broken at any time, which is not great.
I added some reconnection and exponential backoff logic here so we can re-establish the connections better, but I think we need a bigger refactor of the eventMonitor code soon to address all of the issues.
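Not part of this diff, but as a rough illustration of that reconnect-and-backoff pattern: watch the subscription's error channel and re-subscribe with exponential backoff when it fires. The `subscribe` callback is a hypothetical stand-in for the real eventMonitor subscription logic.

```go
package sketch

import (
	"github.com/cenkalti/backoff"
	ethereum "github.com/ethereum/go-ethereum"
	"github.com/golang/glog"
)

// resubscribeLoop keeps an event subscription alive: whenever the underlying
// websocket breaks and the subscription's error channel fires, it
// re-establishes the subscription with exponential backoff.
func resubscribeLoop(subscribe func() (ethereum.Subscription, error)) {
	for {
		var sub ethereum.Subscription
		connect := func() error {
			s, err := subscribe()
			if err != nil {
				glog.Errorf("Subscription failed: %v. Retrying...", err)
				return err
			}
			sub = s
			return nil
		}

		bo := backoff.NewExponentialBackOff()
		bo.MaxElapsedTime = 0 // retry indefinitely
		if err := backoff.Retry(connect, bo); err != nil {
			return // only reachable if the backoff were bounded
		}

		// Block until the connection breaks, then loop and re-subscribe.
		err := <-sub.Err()
		if err == nil {
			return // channel closed by Unsubscribe; stop for good
		}
		glog.Errorf("Subscription error: %v. Re-subscribing...", err)
	}
}
```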