Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subscription dance: backoff, logging, and test coverage #62

Merged
merged 16 commits into from
Jan 27, 2021

Conversation

pdyraga
Copy link
Member

@pdyraga pdyraga commented Jan 7, 2021

Refs keep-network/keep-ecdsa#680

This PR is the first one in the series of PRs refactoring Ethereum event subscriptions mechanism I plan to open in the next few days.

You can see it in action here: keep-network/keep-ecdsa#663.

So far resubscription logging was implemented in contract binding templates. There were three problems with that mechanism:

  1. No resubscribe backoff. Each failed subscription was retried 5 seconds after it failed. For many individual subscriptions ECDSA client is opening, a massive retry with no backoff could be interpreted as misbehavior by a third party Ethereum provider and further attempts could be completely blocked. Some operators experienced it with Alchemy.
  2. All the logging was on the warning level. With the lack of backoff mentioned in the previous point, this could produce a deadly mixture - keep client could be trying to reconnect for a long time, Ethereum client could be rejecting those attempts interpreting them as DoS/misbehavior, and operator received only warnings with no single error.
  3. Lack of test coverage for the resubscription mechanism. All the subscription code was placed in templates used to generate Go contract bindings and was very hard - if possible at all - to test.

Here we address all those problems.

Instead of implementing resubscriptions on our side, we wrap the code from github.com/ethereum/go-ethereum/event with some additional logging logic. This code is used from contract templates to keep the subscription alive. We lean on go-ethereum to do those resubscriptions right and we cover our logic addition in unit tests on our side.

The code from go-ethereum implements backoffs, increasing the delay twice until it reaches the maximum backoff time which is set in our bindings to 2 minutes.

The new code in ethutil wrapping go-ethereum's resubscriber, allows to log an error if the subscription is dropped too often which may indicate problems with Ethereum client. The threshold is set to 15 minutes.

The mechanism wraps event.Resubscribe mechanism from go-ethereum adding
alterting if the time between resubscription attemps is shorter than the
specific threshold.

This mechanism will replace the one we currently have in the generated
code, adding alerting and backoff.
The mechanism monitors the time elapsed between resubscription attempts and
if it is shorter than the specificed alertThreshold, it writes the time
elapsed between resubscription attempts to the thresholdViolated channel to
alarm about potential problems with the stability of the subscription.

Every error returned by the wrapped subscription function, is written to
subscriptionFailed channel.

If the calling code is interested in reading from thresholdViolated and/or
subscriptionFailed channel, appropriate readers need to be set up *before*
WithResubscription is called.

Writes to thresholdViolated and subscriptionFailed channels are non-blocking
and are not stopping resubscription attempts.

Those two channels will let the client code log warning/errors or even
kill the subscription in case alrt threshold is violated too many times.
So far, all the resubscription logic was placed in the generated
contract bindings. The new ethutil.WithResubscription function takes
care of resubscription, having backoff and proper alerting implemented.
This message is emitted when a subscription failed immediatelly afters
it's been requested. It is not the case when some long-living connection
has been dropped by a load balancer. In this case, it makes sense to log
the problem on error level.
This value is still quite experimental but 15 minutes is a stronger
requirement than 5 minutes and if the client can't keep WS connection
alive for at least 15 minutes it is better to inspect it.
Instead of accepting channels for altert threshold violation and
resubscribe function failure notifications, WithResubscription function
accepts two callback functions now.

Channels were used initially because the generated Watch* code had a
for-select section routing events from a channel to handler passed as a
Watch* function parameter. This behavior will be eliminated in the next
commit and Watch* function will accept a channel as a parameter and will
not be doing any forwarding from channel to handler. Given that, two
function handlers accepted by WithResubscription will be much more
convenient.
The previous version of WithResubscription accepted channels. The new
version accepting function callbacks will be easier to use in contract
bindings once the goroutine passing events from channel to callback will
be eliminated from there. And that's the plan in one of the next
commits.
Copy link
Member

@nkuba nkuba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First part of comments.

pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved
tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe_test.go Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe_test.go Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe_test.go Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe_test.go Outdated Show resolved Hide resolved
tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe_test.go Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe.go Outdated Show resolved Hide resolved
tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved
tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved
tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved
tools/generators/ethereum/contract_events.go.tmpl Outdated Show resolved Hide resolved
pkg/chain/ethereum/ethutil/resubscribe.go Show resolved Hide resolved
// subscription required before the subscription failure happens and
// resubscription follows so that the resubscription does not emit an error
// to the logs alerting about potential problems with Ethereum client.
{{.ShortVar}}SubscriptionAlertThreshold = 15 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 should this be per-contract?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should not be but I do not have a good place to put it at. 😬 Maybe subscribe_opts.go from ethutil introduced in the next PR #63 is a good candidate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge this PR. Please let's handle this change in #63.

Define thresholdViolated and subscriptionFailed functions in contract
bindings separately for a better clarity of the code.
This function was called resubscribeFn just because of
event.ResubscribeFn type. However, this function is called for the
first-ever subscription as well and we do have subscriptionFailedFn as
another parameter, named this way and not resubscriptionFailedFn for
exactly the same reason - it can be called for the first-ever
subscription. Moreover, each resubscription is in fact another
subscription attempt.
Just to make it clear what we are cancelling.
We can report all failed assertions instead of failing on the first one
failed. The only exception is TestEmitOriginalError where we want to
fail the test in case the first assertion fails so that we do not hang
on <-subscriptionFailed next line.
Copy link
Member

@lukasz-zimnoch lukasz-zimnoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more comments! Leaving the rest for @nkuba

%v works fine as well but doing it with %s for a nicer code.
Forgot to regenerate the template for the previous commit with a minor
change about formatting string.
@pdyraga
Copy link
Member Author

pdyraga commented Jan 27, 2021

I updated keep-network/keep-ecdsa#663 to point to the version with all review comments addressed, which is c6adefd - the last commit.

@nkuba nkuba merged commit bb217b9 into master Jan 27, 2021
@nkuba nkuba deleted the resubscriptions-dance branch January 27, 2021 15:20
nkuba added a commit to keep-network/keep-ecdsa that referenced this pull request Jan 27, 2021
Subscription dance: backoff and logging

Refs #680

Depends on keep-network/keep-common#62

The referenced `keep-common` PR solves three problems and two of them are important for ECDSA client.

Currently, the subscription mechanism used by ECDSA client does not implement a backoff. Each failed subscription is retried 5 seconds after it failed. For many individual subscriptions ECDSA client is opening, a massive retry with no backoff can be interpreted as misbehavior by a third party Ethereum provider and further attempts can be completely blocked. Some operators already experienced it with Alchemy.

The second problem is that all the logging regarding resubscriptions is done on the warning level. With the lack of backoff mentioned in the previous point, this can produce a deadly mixture - keep client could be trying to reconnect for a long time, Ethereum client could be rejecting those attempts interpreting them as DoS/misbehavior, and in this scenario, operator would receive only warnings with no single error. 

Both those points are addressed here by bumping up the dependency on `keep-common` and regenerating contract bindings. Also, the code in `ethereum.go` is getting cleaned up a little to reflect the fact all `Watch*` attempts are getting retried until they succeed or until unsubscribe happens.

For example, let's say that the client was able to establish all subscriptions, they were working fine for two minutes, failed, and were established again successfully. Dropped WS connection is fine assuming it's not happening too often. Two minutes is certainly alarming, so the client will now alarm with an error:

> 2021-01-07T18:29:20.522+0100	ERROR	keep-contract-BondedECDSAKeepFactory	subscription to event BondedECDSAKeepCreated had to be retried [2m0.006063728s] since the last attempt; please inspect Ethereum client connectivity

Now, let's say that the retry will fail. Again, the client will alarm with an error:

> 2021-01-07T18:31:21.532+0100	ERROR	keep-contract-BondedECDSAKeep	subscription to event KeepClosed failed: [dial tcp 127.0.0.1:8546: connect: connection refused]; resubscription attempt will be performed

Last but not least, we now have a backoff. The delay is increased twice until it reaches the maximum backoff time which is set to 2 minutes.
nkuba added a commit that referenced this pull request Jan 28, 2021
New Ethereum event subscription API, background event pull loop

# Overview 

There are two major changes to Ethereum subscription API proposed here:
- new subscription API with `OnEvent` and `Pipe` functions,
- background monitoring loop pulling past events from the chain.

The first change should allow implementing some handler logic easier and to avoid complex logic leading to bugs such as keep-network/keep-core#1333 or keep-network/keep-core#2052.

The second change should improve client responsiveness for operators running their nodes against Ethereum deployments that are not very reliable on the event delivery front.
  
This code has been integrated with ECDSA keep client in `keep-ecdsa` repository and can be tested there on the branch `pipe-it` (keep-network/keep-ecdsa#671).

# New API

Event subscription API has been refactored to resemble the proposition from keep-network/keep-core#491. The new event subscription mechanism allows installing event callback handler function with `OnEvent` function as well as piping events from a subscription to a channel with `Pipe` function.

Example usage of `OnEvent`:
```
handlerFn := func(
    submittingMember common.Address,
    conflictingPublicKey []byte,
    blockNumber uint64,
) {
   // (...)
}
subscription := keepContract.ConflictingPublicKeySubmitted(
	nil, // default SubscribeOpts
	nil, // no filtering on submitting member
).OnEvent(handlerFn)
```

The same subscription but with a `Pipe`:
```
sink := make(chan *abi.BondedECDSAKeepConflictingPublicKeySubmitted)

subscription := keepContract.ConflictingPublicKeySubmitted(
	nil, // default SubscribeOpts
	nil, // no filtering on submitting member
).Pipe(sink)
```

Currently, all our event subscriptions use function handlers. While it is convenient in some cases, in some other cases it is the opposite. For example, `OnBondedECDSAKeepCreated` handler in ECDSA client works perfectly fine as a function. It triggers the protocol and does not have to constantly monitor the state of the chain. On the other hand, `OnDKGResultSubmitted` handler from the beacon client needs to monitor the chain and exit the process of event publication in case another node has published the result. In this case, the code could be better structured with a channel-based subscription that would allow listening for block counter events and events from DKG result submitted subscription in one for-select loop.

# Background monitoring loop

Some nodes in the network are running against Ethereum setups that are not particularly reliable in delivering events. Events are not delivered, nodes are not starting key-generation, or are not participating in redemption signing. Another problem is the stability of the event subscription mechanism (see #62). If the web socket connection is dropped too often, the resubscription mechanism is not enough to receive events emitted when the connection was in a weird, stale state.

To address this problem, we introduce a background loop periodically pulling past events from the chain next to a regular `watchLogs` subscription. How often events are pulled and how many blocks are taken into account can be configured with `SubscribeOpts` parameters. 

This way, even if the event was lost by `watchLogs` subscription for whatever reason, it should be pulled by a background monitoring loop later. An extremely important implication of this change is that handlers should have a logic in place allowing them to de-duplicate received events even if a lot of time passed between receiving the original event and the duplicate.

I have been experimenting with various options here, including de-duplication events in the chain implementation layer, but none of them proved to be successful as the correct de-duplication algorithm requires domain knowledge about a certain type of an event and in what circumstances identical event emitted later should or should not be identified as a duplicate.

De-duplicator implementations should be added to `keep-core` and `keep-ecdsa` clients and are out of the scope of `keep-common` and this PR.
nkuba added a commit to keep-network/tbtc that referenced this pull request Jan 28, 2021
Update Go contract bindings: resubscription mechanism and subscription background monitoring loop

Refs keep-network/keep-ecdsa#680

Depends on keep-network/keep-common#62
Depends on keep-network/keep-common#63

The referenced PRs from `keep-common` repository refactor Go contract bindings updating Ethereum event resubscription mechanism and adding subscription background monitoring loop in case an event was lost for whatever reason. The full reasoning behind those changes is explained in the referenced PRs.

Here we update `keep-common` dependency and regenerate contract bindings to include the mentioned changes.

Leaving as a Draft until the two referenced PRs are not merged. 

Please do not merge this PR until `keep-common` dependency is updated to the merge commit from `keep-common`.
@pdyraga pdyraga added this to the v1.4.0 milestone Feb 16, 2021
@pdyraga pdyraga modified the milestones: v1.5.0, v1.4.0 Mar 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants