-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subscription dance: backoff, logging, and test coverage #62
Conversation
The mechanism wraps event.Resubscribe mechanism from go-ethereum adding alterting if the time between resubscription attemps is shorter than the specific threshold. This mechanism will replace the one we currently have in the generated code, adding alerting and backoff.
The mechanism monitors the time elapsed between resubscription attempts and if it is shorter than the specificed alertThreshold, it writes the time elapsed between resubscription attempts to the thresholdViolated channel to alarm about potential problems with the stability of the subscription. Every error returned by the wrapped subscription function, is written to subscriptionFailed channel. If the calling code is interested in reading from thresholdViolated and/or subscriptionFailed channel, appropriate readers need to be set up *before* WithResubscription is called. Writes to thresholdViolated and subscriptionFailed channels are non-blocking and are not stopping resubscription attempts. Those two channels will let the client code log warning/errors or even kill the subscription in case alrt threshold is violated too many times.
So far, all the resubscription logic was placed in the generated contract bindings. The new ethutil.WithResubscription function takes care of resubscription, having backoff and proper alerting implemented.
This message is emitted when a subscription failed immediatelly afters it's been requested. It is not the case when some long-living connection has been dropped by a load balancer. In this case, it makes sense to log the problem on error level.
This value is still quite experimental but 15 minutes is a stronger requirement than 5 minutes and if the client can't keep WS connection alive for at least 15 minutes it is better to inspect it.
Instead of accepting channels for altert threshold violation and resubscribe function failure notifications, WithResubscription function accepts two callback functions now. Channels were used initially because the generated Watch* code had a for-select section routing events from a channel to handler passed as a Watch* function parameter. This behavior will be eliminated in the next commit and Watch* function will accept a channel as a parameter and will not be doing any forwarding from channel to handler. Given that, two function handlers accepted by WithResubscription will be much more convenient.
The previous version of WithResubscription accepted channels. The new version accepting function callbacks will be easier to use in contract bindings once the goroutine passing events from channel to callback will be eliminated from there. And that's the plan in one of the next commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First part of comments.
// subscription required before the subscription failure happens and | ||
// resubscription follows so that the resubscription does not emit an error | ||
// to the logs alerting about potential problems with Ethereum client. | ||
{{.ShortVar}}SubscriptionAlertThreshold = 15 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 should this be per-contract?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should not be but I do not have a good place to put it at. 😬 Maybe subscribe_opts.go
from ethutil
introduced in the next PR #63 is a good candidate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to merge this PR. Please let's handle this change in #63.
Define thresholdViolated and subscriptionFailed functions in contract bindings separately for a better clarity of the code.
This function was called resubscribeFn just because of event.ResubscribeFn type. However, this function is called for the first-ever subscription as well and we do have subscriptionFailedFn as another parameter, named this way and not resubscriptionFailedFn for exactly the same reason - it can be called for the first-ever subscription. Moreover, each resubscription is in fact another subscription attempt.
Just to make it clear what we are cancelling.
We can report all failed assertions instead of failing on the first one failed. The only exception is TestEmitOriginalError where we want to fail the test in case the first assertion fails so that we do not hang on <-subscriptionFailed next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more comments! Leaving the rest for @nkuba
%v works fine as well but doing it with %s for a nicer code.
Forgot to regenerate the template for the previous commit with a minor change about formatting string.
I updated keep-network/keep-ecdsa#663 to point to the version with all review comments addressed, which is c6adefd - the last commit. |
Subscription dance: backoff and logging Refs #680 Depends on keep-network/keep-common#62 The referenced `keep-common` PR solves three problems and two of them are important for ECDSA client. Currently, the subscription mechanism used by ECDSA client does not implement a backoff. Each failed subscription is retried 5 seconds after it failed. For many individual subscriptions ECDSA client is opening, a massive retry with no backoff can be interpreted as misbehavior by a third party Ethereum provider and further attempts can be completely blocked. Some operators already experienced it with Alchemy. The second problem is that all the logging regarding resubscriptions is done on the warning level. With the lack of backoff mentioned in the previous point, this can produce a deadly mixture - keep client could be trying to reconnect for a long time, Ethereum client could be rejecting those attempts interpreting them as DoS/misbehavior, and in this scenario, operator would receive only warnings with no single error. Both those points are addressed here by bumping up the dependency on `keep-common` and regenerating contract bindings. Also, the code in `ethereum.go` is getting cleaned up a little to reflect the fact all `Watch*` attempts are getting retried until they succeed or until unsubscribe happens. For example, let's say that the client was able to establish all subscriptions, they were working fine for two minutes, failed, and were established again successfully. Dropped WS connection is fine assuming it's not happening too often. Two minutes is certainly alarming, so the client will now alarm with an error: > 2021-01-07T18:29:20.522+0100 ERROR keep-contract-BondedECDSAKeepFactory subscription to event BondedECDSAKeepCreated had to be retried [2m0.006063728s] since the last attempt; please inspect Ethereum client connectivity Now, let's say that the retry will fail. Again, the client will alarm with an error: > 2021-01-07T18:31:21.532+0100 ERROR keep-contract-BondedECDSAKeep subscription to event KeepClosed failed: [dial tcp 127.0.0.1:8546: connect: connection refused]; resubscription attempt will be performed Last but not least, we now have a backoff. The delay is increased twice until it reaches the maximum backoff time which is set to 2 minutes.
New Ethereum event subscription API, background event pull loop # Overview There are two major changes to Ethereum subscription API proposed here: - new subscription API with `OnEvent` and `Pipe` functions, - background monitoring loop pulling past events from the chain. The first change should allow implementing some handler logic easier and to avoid complex logic leading to bugs such as keep-network/keep-core#1333 or keep-network/keep-core#2052. The second change should improve client responsiveness for operators running their nodes against Ethereum deployments that are not very reliable on the event delivery front. This code has been integrated with ECDSA keep client in `keep-ecdsa` repository and can be tested there on the branch `pipe-it` (keep-network/keep-ecdsa#671). # New API Event subscription API has been refactored to resemble the proposition from keep-network/keep-core#491. The new event subscription mechanism allows installing event callback handler function with `OnEvent` function as well as piping events from a subscription to a channel with `Pipe` function. Example usage of `OnEvent`: ``` handlerFn := func( submittingMember common.Address, conflictingPublicKey []byte, blockNumber uint64, ) { // (...) } subscription := keepContract.ConflictingPublicKeySubmitted( nil, // default SubscribeOpts nil, // no filtering on submitting member ).OnEvent(handlerFn) ``` The same subscription but with a `Pipe`: ``` sink := make(chan *abi.BondedECDSAKeepConflictingPublicKeySubmitted) subscription := keepContract.ConflictingPublicKeySubmitted( nil, // default SubscribeOpts nil, // no filtering on submitting member ).Pipe(sink) ``` Currently, all our event subscriptions use function handlers. While it is convenient in some cases, in some other cases it is the opposite. For example, `OnBondedECDSAKeepCreated` handler in ECDSA client works perfectly fine as a function. It triggers the protocol and does not have to constantly monitor the state of the chain. On the other hand, `OnDKGResultSubmitted` handler from the beacon client needs to monitor the chain and exit the process of event publication in case another node has published the result. In this case, the code could be better structured with a channel-based subscription that would allow listening for block counter events and events from DKG result submitted subscription in one for-select loop. # Background monitoring loop Some nodes in the network are running against Ethereum setups that are not particularly reliable in delivering events. Events are not delivered, nodes are not starting key-generation, or are not participating in redemption signing. Another problem is the stability of the event subscription mechanism (see #62). If the web socket connection is dropped too often, the resubscription mechanism is not enough to receive events emitted when the connection was in a weird, stale state. To address this problem, we introduce a background loop periodically pulling past events from the chain next to a regular `watchLogs` subscription. How often events are pulled and how many blocks are taken into account can be configured with `SubscribeOpts` parameters. This way, even if the event was lost by `watchLogs` subscription for whatever reason, it should be pulled by a background monitoring loop later. An extremely important implication of this change is that handlers should have a logic in place allowing them to de-duplicate received events even if a lot of time passed between receiving the original event and the duplicate. I have been experimenting with various options here, including de-duplication events in the chain implementation layer, but none of them proved to be successful as the correct de-duplication algorithm requires domain knowledge about a certain type of an event and in what circumstances identical event emitted later should or should not be identified as a duplicate. De-duplicator implementations should be added to `keep-core` and `keep-ecdsa` clients and are out of the scope of `keep-common` and this PR.
Update Go contract bindings: resubscription mechanism and subscription background monitoring loop Refs keep-network/keep-ecdsa#680 Depends on keep-network/keep-common#62 Depends on keep-network/keep-common#63 The referenced PRs from `keep-common` repository refactor Go contract bindings updating Ethereum event resubscription mechanism and adding subscription background monitoring loop in case an event was lost for whatever reason. The full reasoning behind those changes is explained in the referenced PRs. Here we update `keep-common` dependency and regenerate contract bindings to include the mentioned changes. Leaving as a Draft until the two referenced PRs are not merged. Please do not merge this PR until `keep-common` dependency is updated to the merge commit from `keep-common`.
Refs keep-network/keep-ecdsa#680
This PR is the first one in the series of PRs refactoring Ethereum event subscriptions mechanism I plan to open in the next few days.
You can see it in action here: keep-network/keep-ecdsa#663.
So far resubscription logging was implemented in contract binding templates. There were three problems with that mechanism:
Here we address all those problems.
Instead of implementing resubscriptions on our side, we wrap the code from
github.com/ethereum/go-ethereum/event
with some additional logging logic. This code is used from contract templates to keep the subscription alive. We lean ongo-ethereum
to do those resubscriptions right and we cover our logic addition in unit tests on our side.The code from
go-ethereum
implements backoffs, increasing the delay twice until it reaches the maximum backoff time which is set in our bindings to 2 minutes.The new code in
ethutil
wrappinggo-ethereum
's resubscriber, allows to log an error if the subscription is dropped too often which may indicate problems with Ethereum client. The threshold is set to 15 minutes.