Subscription dance: backoff, logging, and test coverage #62

pdyraga · 2021-01-07T17:45:00Z

This PR is the first one in the series of PRs refactoring Ethereum event subscriptions mechanism I plan to open in the next few days.

You can see it in action here: keep-network/keep-ecdsa#663.

So far resubscription logging was implemented in contract binding templates. There were three problems with that mechanism:

No resubscribe backoff. Each failed subscription was retried 5 seconds after it failed. For many individual subscriptions ECDSA client is opening, a massive retry with no backoff could be interpreted as misbehavior by a third party Ethereum provider and further attempts could be completely blocked. Some operators experienced it with Alchemy.
All the logging was on the warning level. With the lack of backoff mentioned in the previous point, this could produce a deadly mixture - keep client could be trying to reconnect for a long time, Ethereum client could be rejecting those attempts interpreting them as DoS/misbehavior, and operator received only warnings with no single error.
Lack of test coverage for the resubscription mechanism. All the subscription code was placed in templates used to generate Go contract bindings and was very hard - if possible at all - to test.

Here we address all those problems.

Instead of implementing resubscriptions on our side, we wrap the code from github.com/ethereum/go-ethereum/event with some additional logging logic. This code is used from contract templates to keep the subscription alive. We lean on go-ethereum to do those resubscriptions right and we cover our logic addition in unit tests on our side.

The code from go-ethereum implements backoffs, increasing the delay twice until it reaches the maximum backoff time which is set in our bindings to 2 minutes.

The new code in ethutil wrapping go-ethereum's resubscriber, allows to log an error if the subscription is dropped too often which may indicate problems with Ethereum client. The threshold is set to 15 minutes.

The mechanism wraps event.Resubscribe mechanism from go-ethereum adding alterting if the time between resubscription attemps is shorter than the specific threshold. This mechanism will replace the one we currently have in the generated code, adding alerting and backoff.

The mechanism monitors the time elapsed between resubscription attempts and if it is shorter than the specificed alertThreshold, it writes the time elapsed between resubscription attempts to the thresholdViolated channel to alarm about potential problems with the stability of the subscription. Every error returned by the wrapped subscription function, is written to subscriptionFailed channel. If the calling code is interested in reading from thresholdViolated and/or subscriptionFailed channel, appropriate readers need to be set up *before* WithResubscription is called. Writes to thresholdViolated and subscriptionFailed channels are non-blocking and are not stopping resubscription attempts. Those two channels will let the client code log warning/errors or even kill the subscription in case alrt threshold is violated too many times.

So far, all the resubscription logic was placed in the generated contract bindings. The new ethutil.WithResubscription function takes care of resubscription, having backoff and proper alerting implemented.

This message is emitted when a subscription failed immediatelly afters it's been requested. It is not the case when some long-living connection has been dropped by a load balancer. In this case, it makes sense to log the problem on error level.

This value is still quite experimental but 15 minutes is a stronger requirement than 5 minutes and if the client can't keep WS connection alive for at least 15 minutes it is better to inspect it.

Instead of accepting channels for altert threshold violation and resubscribe function failure notifications, WithResubscription function accepts two callback functions now. Channels were used initially because the generated Watch* code had a for-select section routing events from a channel to handler passed as a Watch* function parameter. This behavior will be eliminated in the next commit and Watch* function will accept a channel as a parameter and will not be doing any forwarding from channel to handler. Given that, two function handlers accepted by WithResubscription will be much more convenient.

The previous version of WithResubscription accepted channels. The new version accepting function callbacks will be easier to use in contract bindings once the goroutine passing events from channel to callback will be eliminated from there. And that's the plan in one of the next commits.

nkuba

First part of comments.

pkg/chain/ethereum/ethutil/resubscribe.go

tools/generators/ethereum/contract_events.go.tmpl

pkg/chain/ethereum/ethutil/resubscribe.go

pkg/chain/ethereum/ethutil/resubscribe_test.go

tools/generators/ethereum/contract_events.go.tmpl

pkg/chain/ethereum/ethutil/resubscribe_test.go

pkg/chain/ethereum/ethutil/resubscribe.go

pkg/chain/ethereum/ethutil/resubscribe_test.go

tools/generators/ethereum/contract_events.go.tmpl

pkg/chain/ethereum/ethutil/resubscribe_test.go

pkg/chain/ethereum/ethutil/resubscribe.go

Shadowfiend · 2021-01-15T13:57:41Z

tools/generators/ethereum/contract.go.tmpl

+    // subscription required before the subscription failure happens and
+    // resubscription follows so that the resubscription does not emit an error
+    // to the logs alerting about potential problems with Ethereum client.
+    {{.ShortVar}}SubscriptionAlertThreshold = 15 * time.Minute


🤔 should this be per-contract?

I think it should not be but I do not have a good place to put it at. 😬 Maybe subscribe_opts.go from ethutil introduced in the next PR #63 is a good candidate?

I'm going to merge this PR. Please let's handle this change in #63.

Define thresholdViolated and subscriptionFailed functions in contract bindings separately for a better clarity of the code.

This function was called resubscribeFn just because of event.ResubscribeFn type. However, this function is called for the first-ever subscription as well and we do have subscriptionFailedFn as another parameter, named this way and not resubscriptionFailedFn for exactly the same reason - it can be called for the first-ever subscription. Moreover, each resubscription is in fact another subscription attempt.

Just to make it clear what we are cancelling.

We can report all failed assertions instead of failing on the first one failed. The only exception is TestEmitOriginalError where we want to fail the test in case the first assertion fails so that we do not hang on <-subscriptionFailed next line.

lukasz-zimnoch

No more comments! Leaving the rest for @nkuba

%v works fine as well but doing it with %s for a nicer code.

Forgot to regenerate the template for the previous commit with a minor change about formatting string.

pdyraga · 2021-01-27T12:19:19Z

I updated keep-network/keep-ecdsa#663 to point to the version with all review comments addressed, which is c6adefd - the last commit.

Subscription dance: backoff and logging Refs #680 Depends on keep-network/keep-common#62 The referenced `keep-common` PR solves three problems and two of them are important for ECDSA client. Currently, the subscription mechanism used by ECDSA client does not implement a backoff. Each failed subscription is retried 5 seconds after it failed. For many individual subscriptions ECDSA client is opening, a massive retry with no backoff can be interpreted as misbehavior by a third party Ethereum provider and further attempts can be completely blocked. Some operators already experienced it with Alchemy. The second problem is that all the logging regarding resubscriptions is done on the warning level. With the lack of backoff mentioned in the previous point, this can produce a deadly mixture - keep client could be trying to reconnect for a long time, Ethereum client could be rejecting those attempts interpreting them as DoS/misbehavior, and in this scenario, operator would receive only warnings with no single error. Both those points are addressed here by bumping up the dependency on `keep-common` and regenerating contract bindings. Also, the code in `ethereum.go` is getting cleaned up a little to reflect the fact all `Watch*` attempts are getting retried until they succeed or until unsubscribe happens. For example, let's say that the client was able to establish all subscriptions, they were working fine for two minutes, failed, and were established again successfully. Dropped WS connection is fine assuming it's not happening too often. Two minutes is certainly alarming, so the client will now alarm with an error: > 2021-01-07T18:29:20.522+0100 ERROR keep-contract-BondedECDSAKeepFactory subscription to event BondedECDSAKeepCreated had to be retried [2m0.006063728s] since the last attempt; please inspect Ethereum client connectivity Now, let's say that the retry will fail. Again, the client will alarm with an error: > 2021-01-07T18:31:21.532+0100 ERROR keep-contract-BondedECDSAKeep subscription to event KeepClosed failed: [dial tcp 127.0.0.1:8546: connect: connection refused]; resubscription attempt will be performed Last but not least, we now have a backoff. The delay is increased twice until it reaches the maximum backoff time which is set to 2 minutes.

New Ethereum event subscription API, background event pull loop # Overview There are two major changes to Ethereum subscription API proposed here: - new subscription API with `OnEvent` and `Pipe` functions, - background monitoring loop pulling past events from the chain. The first change should allow implementing some handler logic easier and to avoid complex logic leading to bugs such as keep-network/keep-core#1333 or keep-network/keep-core#2052. The second change should improve client responsiveness for operators running their nodes against Ethereum deployments that are not very reliable on the event delivery front. This code has been integrated with ECDSA keep client in `keep-ecdsa` repository and can be tested there on the branch `pipe-it` (keep-network/keep-ecdsa#671). # New API Event subscription API has been refactored to resemble the proposition from keep-network/keep-core#491. The new event subscription mechanism allows installing event callback handler function with `OnEvent` function as well as piping events from a subscription to a channel with `Pipe` function. Example usage of `OnEvent`: ``` handlerFn := func( submittingMember common.Address, conflictingPublicKey []byte, blockNumber uint64, ) { // (...) } subscription := keepContract.ConflictingPublicKeySubmitted( nil, // default SubscribeOpts nil, // no filtering on submitting member ).OnEvent(handlerFn) ``` The same subscription but with a `Pipe`: ``` sink := make(chan *abi.BondedECDSAKeepConflictingPublicKeySubmitted) subscription := keepContract.ConflictingPublicKeySubmitted( nil, // default SubscribeOpts nil, // no filtering on submitting member ).Pipe(sink) ``` Currently, all our event subscriptions use function handlers. While it is convenient in some cases, in some other cases it is the opposite. For example, `OnBondedECDSAKeepCreated` handler in ECDSA client works perfectly fine as a function. It triggers the protocol and does not have to constantly monitor the state of the chain. On the other hand, `OnDKGResultSubmitted` handler from the beacon client needs to monitor the chain and exit the process of event publication in case another node has published the result. In this case, the code could be better structured with a channel-based subscription that would allow listening for block counter events and events from DKG result submitted subscription in one for-select loop. # Background monitoring loop Some nodes in the network are running against Ethereum setups that are not particularly reliable in delivering events. Events are not delivered, nodes are not starting key-generation, or are not participating in redemption signing. Another problem is the stability of the event subscription mechanism (see #62). If the web socket connection is dropped too often, the resubscription mechanism is not enough to receive events emitted when the connection was in a weird, stale state. To address this problem, we introduce a background loop periodically pulling past events from the chain next to a regular `watchLogs` subscription. How often events are pulled and how many blocks are taken into account can be configured with `SubscribeOpts` parameters. This way, even if the event was lost by `watchLogs` subscription for whatever reason, it should be pulled by a background monitoring loop later. An extremely important implication of this change is that handlers should have a logic in place allowing them to de-duplicate received events even if a lot of time passed between receiving the original event and the duplicate. I have been experimenting with various options here, including de-duplication events in the chain implementation layer, but none of them proved to be successful as the correct de-duplication algorithm requires domain knowledge about a certain type of an event and in what circumstances identical event emitted later should or should not be identified as a duplicate. De-duplicator implementations should be added to `keep-core` and `keep-ecdsa` clients and are out of the scope of `keep-common` and this PR.

Update Go contract bindings: resubscription mechanism and subscription background monitoring loop Refs keep-network/keep-ecdsa#680 Depends on keep-network/keep-common#62 Depends on keep-network/keep-common#63 The referenced PRs from `keep-common` repository refactor Go contract bindings updating Ethereum event resubscription mechanism and adding subscription background monitoring loop in case an event was lost for whatever reason. The full reasoning behind those changes is explained in the referenced PRs. Here we update `keep-common` dependency and regenerate contract bindings to include the mentioned changes. Leaving as a Draft until the two referenced PRs are not merged. Please do not merge this PR until `keep-common` dependency is updated to the merge commit from `keep-common`.

pdyraga added 5 commits January 4, 2021 19:05

Updated contract bindings to use ethutil.WithResubscription

e1a10bc

So far, all the resubscription logic was placed in the generated contract bindings. The new ethutil.WithResubscription function takes care of resubscription, having backoff and proper alerting implemented.

Increased subscription alert threshold to 15 minutes

2d19dff

This value is still quite experimental but 15 minutes is a stronger requirement than 5 minutes and if the client can't keep WS connection alive for at least 15 minutes it is better to inspect it.

pdyraga mentioned this pull request Jan 7, 2021

Subscription dance: backoff and logging keep-network/keep-ecdsa#663

Merged

pdyraga added 2 commits January 12, 2021 13:33

nkuba reviewed Jan 15, 2021

View reviewed changes

pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved

pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved

pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved

tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved

lukasz-zimnoch requested changes Jan 15, 2021

View reviewed changes

nkuba reviewed Jan 15, 2021

View reviewed changes

Shadowfiend reviewed Jan 15, 2021

View reviewed changes

pdyraga mentioned this pull request Jan 15, 2021

New Ethereum event subscription API, background event pull loop #63

Merged

This was referenced Jan 25, 2021

Update Go contract bindings: resubscription mechanism and subscription background monitoring loop keep-network/tbtc#750

Merged

Ethereum event subscription refactoring keep-network/keep-ecdsa#680

Closed

pdyraga added 6 commits January 27, 2021 10:38

Define thresholdViolated/subscriptionFailed functions separately

eb5ee3d

Define thresholdViolated and subscriptionFailed functions in contract bindings separately for a better clarity of the code.

Explained why channels in test use 10-elements buffer

44412b5

Renamed cancel to cancelCtx

7a5fcf1

Just to make it clear what we are cancelling.

Define expected error message in test in only one place

eb2cc4a

lukasz-zimnoch reviewed Jan 27, 2021

View reviewed changes

pdyraga added 3 commits January 27, 2021 13:04

Simplified error message in subscription duration threshold violated fn

c777eab

Use %s for printing duration in thresholdViolatedFn

64cab83

%v works fine as well but doing it with %s for a nicer code.

Regenerated template content for contract bindings

c6adefd

Forgot to regenerate the template for the previous commit with a minor change about formatting string.

lukasz-zimnoch approved these changes Jan 27, 2021

View reviewed changes

nkuba approved these changes Jan 27, 2021

View reviewed changes

nkuba merged commit bb217b9 into master Jan 27, 2021

nkuba deleted the resubscriptions-dance branch January 27, 2021 15:20

pdyraga added this to the v1.4.0 milestone Feb 16, 2021

pdyraga modified the milestones: v1.5.0, v1.4.0 Mar 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subscription dance: backoff, logging, and test coverage #62

Subscription dance: backoff, logging, and test coverage #62

pdyraga commented Jan 7, 2021 •

edited

Loading

nkuba left a comment

Shadowfiend Jan 15, 2021

pdyraga Jan 27, 2021

nkuba Jan 27, 2021

lukasz-zimnoch left a comment

pdyraga commented Jan 27, 2021

Subscription dance: backoff, logging, and test coverage #62

Subscription dance: backoff, logging, and test coverage #62

Conversation

pdyraga commented Jan 7, 2021 • edited Loading

nkuba left a comment

Choose a reason for hiding this comment

Shadowfiend Jan 15, 2021

Choose a reason for hiding this comment

pdyraga Jan 27, 2021

Choose a reason for hiding this comment

nkuba Jan 27, 2021

Choose a reason for hiding this comment

lukasz-zimnoch left a comment

Choose a reason for hiding this comment

pdyraga commented Jan 27, 2021

pdyraga commented Jan 7, 2021 •

edited

Loading