WIP feat: bounded consensus events channel #8251
Conversation
Force-pushed from 23581c8 to 3c3bdb2
Here I would like to see the MeteredSender and MeteredReceiver types already.
Not convinced that dropping engine messages under congestion is sufficient error handling. Can the node recover if it drops any engine message? Otherwise this introduces an even bigger vulnerability than it aims to fix.

If we cannot guarantee recovery from dropping all engine message types, then I lean towards using UnboundedMeteredSender and UnboundedMeteredReceiver here for now, or a re-fetch system if that's possible to implement.
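A minimal sketch of the trade-off being discussed, using plain tokio channels rather than reth's actual MeteredSender/MeteredReceiver types (the message enum and capacity here are placeholders, not reth code): a bounded channel has to do something when it is full, and with try_send the overflowing engine message is simply dropped, which is exactly the error-handling concern raised above.

```rust
// Placeholder sketch, not reth's real types: what "dropping under congestion"
// looks like with a bounded tokio mpsc channel.
use tokio::sync::mpsc::{self, error::TrySendError};

#[derive(Debug)]
enum BeaconEngineMessage {
    ForkchoiceUpdated,
    NewPayload,
}

fn send_bounded(tx: &mpsc::Sender<BeaconEngineMessage>, msg: BeaconEngineMessage) {
    match tx.try_send(msg) {
        Ok(()) => {}
        // Channel full: the message is lost unless a re-fetch / retry path exists.
        Err(TrySendError::Full(dropped)) => eprintln!("congestion, dropped {dropped:?}"),
        Err(TrySendError::Closed(_)) => eprintln!("engine receiver gone"),
    }
}

fn main() {
    // Tiny capacity just to make the overflow visible.
    let (tx, _rx) = mpsc::channel(2);
    for _ in 0..4 {
        send_bounded(&tx, BeaconEngineMessage::NewPayload);
    }
}
```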
Indeed, dropping engine messages in congestion conflicts with this issue about not dropping engine messages: #8203
Hmm, this is a tough one; it seems it isn't great either way. To add some further pain to the issue, the Engine API specs do not prevent replay attacks, which would be a pain here. Note the timestamp needs to be within 1 minute of the current clock, so replays are only valid that long.
I don't imagine this is a common case, but someone with read access to the EL - CL connection could replay messages. An example of a malicious case would be two FCU updates that switch from chain A to chain B. If the attacker spammed the connection with FCU A - FCU B - FCU A - FCU B - ... it would cause quite a lot of processing. Now if we have a bounded channel, they'd be able to use this replay to cause genuine messages to be dropped, which is just as bad as a resource DoS.

In terms of dropping excess FCUs: if the most recent FCU has a valid canonical head then all of the older FCUs can be safely dropped. However, if an FCU does not contain the canonical head then older FCUs are relevant, as they could update the canonical head. So there's potential here to minimize the number of FCUs we process by filtering out some old ones.

In terms of dropping new payloads: it would not be good to drop any canonical blocks, as these would need to be fetched again and processed later. So I believe we'd need to process all of the new payloads unless some logic is implemented to determine which are not canonical.

tl;dr: maybe a tidier solution here is to not bound the channel and instead drain all of the messages into a queue where some management can occur (a rough sketch follows).
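A hedged sketch of that queue-management idea, with placeholder names rather than reth's real types: keep the channel unbounded, drain everything that is currently buffered, keep only the most recent FCU (simplified; the canonical-head caveat above still applies), and never drop new payloads.

```rust
// Placeholder sketch of "drain into a queue and manage it there".
use std::collections::VecDeque;
use tokio::sync::mpsc;

#[derive(Debug)]
enum EngineEvent {
    ForkchoiceUpdated { head: u64 },
    NewPayload { block: u64 },
}

fn drain_and_filter(rx: &mut mpsc::UnboundedReceiver<EngineEvent>) -> VecDeque<EngineEvent> {
    let mut queue = VecDeque::new();
    let mut latest_fcu: Option<EngineEvent> = None;

    // Drain everything currently buffered without awaiting.
    while let Ok(event) = rx.try_recv() {
        match event {
            // Simplification: keep only the newest FCU. In practice older FCUs
            // still matter when the newest one lacks a valid canonical head.
            fcu @ EngineEvent::ForkchoiceUpdated { .. } => latest_fcu = Some(fcu),
            // New payloads are never dropped: canonical blocks would otherwise
            // have to be fetched and processed again later.
            payload @ EngineEvent::NewPayload { .. } => queue.push_back(payload),
        }
    }
    if let Some(fcu) = latest_fcu {
        queue.push_back(fcu);
    }
    queue
}

fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    tx.send(EngineEvent::ForkchoiceUpdated { head: 1 }).unwrap();
    tx.send(EngineEvent::NewPayload { block: 2 }).unwrap();
    tx.send(EngineEvent::ForkchoiceUpdated { head: 2 }).unwrap();
    println!("{:?}", drain_and_filter(&mut rx));
}
```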
Force-pushed from 5da034a to 77df31b
How about just upgrading the connection to the CL to HTTPS?
Yep, HTTPS has replay protection so that would prevent re-using old requests.
Force-pushed from 4f7d39d to c400887
Would it require a fix on the CL side to support this as an extra security feature, or is it already supported? @kirk-baird
The engine API connection requires a valid JWT; how feasible is something like this? I assume an attacker on the same machine is out of scope. Would it need to feed the right messages to the CL s.t. it sends these messages?
Sorry, clicked the wrong button...
Oops, this was pointing to #8193's branch and was automatically closed when it was merged and the branch removed; reopening.
Right, the risk is very low. However, with this fix, the risk that the node can't recover under network congestion due to dropped engine messages comes into existence. @Rjected
Good question, so the issue here is that the valid JWT is replayable. That means if an attacker reads a message with a JWT, they can replay that message as many times as they desire until it expires (~1 minute). An attacker isn't able to create new messages since they can't forge a JWT, only re-use existing tokens. Note that each JWT is tied to a single message via HMAC, so an attacker can't take a valid one and use it on a different message.

So the likelihood of this issue is determined by how easy it is to read messages and send them again. By default these connections are not encrypted and use HTTP or WebSockets. If the CL + EL are on the same machine then reading the connection means you have access to the machine and so you can do much worse attacks; we're not worried about this case. The case to think about is when the EL + CL are on different machines and communicate over an untrusted network, e.g. regular HTTP can be eavesdropped. Upgrading this connection to HTTPS when they're not on the same machine would resolve this issue, as it prevents replay and also encrypts the messages.
Setting the channel to bounded, without doing much else, could cause similar trouble, with the messages sitting in waiting tokio tasks instead of the channel. In terms of correctness with the CL, the execution API spec already describes a notion of a timeout and retry.
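A small illustration of the first point, assuming nothing about reth's actual code (types and numbers are placeholders): if every RPC call simply spawns a task that awaits capacity on a bounded channel, congestion moves the backlog from the channel into parked tasks, so memory growth is still effectively unbounded.

```rust
// Hypothetical sketch: a bounded channel alone just relocates the backlog.
use tokio::sync::mpsc;

#[derive(Debug)]
struct EngineMessage(u64); // placeholder for a real engine message type

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<EngineMessage>(8);

    // Each "RPC call" spawns a task that awaits channel capacity. Under
    // congestion the messages now sit in these parked tasks instead of the
    // channel, so total memory use grows just the same.
    for i in 0..1_000 {
        let tx = tx.clone();
        tokio::spawn(async move {
            let _ = tx.send(EngineMessage(i)).await;
        });
    }

    // Slow consumer: drains a few messages, everything else keeps waiting.
    for _ in 0..8 {
        if let Some(msg) = rx.recv().await {
            println!("processed {msg:?}");
        }
    }
}
```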
thanks @gnattishness
How about keeping the unbounded channel in that case, but polling it not when the EL is ready to process the engine message, but instead at short intervals? We do this and queue the engine messages in a type that evicts entries on-op based on the timeout when inserting new entries, much like this type: https://github.com/sigp/discv5/blob/8eac0fee8da65c4021c6bba7c976d395151309ed/src/lru_time_cache.rs#L7-L18. I think that type can replace the …
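A rough sketch of that eviction idea, in the spirit of (but not a drop-in for) discv5's LruTimeCache; all names here are illustrative: entries older than a configured timeout are discarded when new ones are inserted, so stale engine messages never linger in the queue.

```rust
// Illustrative time-evicting queue: old entries are dropped on insert.
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct TimedQueue<T> {
    entries: VecDeque<(Instant, T)>,
    timeout: Duration,
}

impl<T> TimedQueue<T> {
    fn new(timeout: Duration) -> Self {
        Self { entries: VecDeque::new(), timeout }
    }

    /// Insert a new entry, evicting any that have outlived the timeout.
    fn insert(&mut self, value: T) {
        let now = Instant::now();
        while matches!(self.entries.front(), Some((t, _)) if now.duration_since(*t) > self.timeout) {
            self.entries.pop_front();
        }
        self.entries.push_back((now, value));
    }

    fn pop(&mut self) -> Option<T> {
        self.entries.pop_front().map(|(_, v)| v)
    }
}

fn main() {
    // Timeout chosen arbitrarily for the example, not taken from the spec.
    let mut queue = TimedQueue::new(Duration::from_secs(8));
    queue.insert("forkchoice_updated");
    queue.insert("new_payload");
    assert_eq!(queue.pop(), Some("forkchoice_updated"));
}
```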
Closing in favour of addressing #8203 w.r.t. the spec'd timeouts for messages.
Requires #8193
There is an unbounded channel for consensus events sent from the RPC. The channel to_engine is used by the BeaconConsensusEngineHandle to handle messages from the RPC. The channel is unbounded and therefore susceptible to DoS attacks or excessive resource consumption in times of congestion.

The RPC is only callable from the consensus client, which is considered a trusted actor, and therefore the risk of a DoS attack is low. However, during times of congestion a large number of events could build up. Processing events includes executing blocks and updating the blockchain tree. Thus, it may be possible that events are added to the channel faster than they can be processed, and the channel will keep growing.
This PR replaces the unbounded channel with a bounded one, for now setting a default channel size of 1000; this should be adapted after gathering usage metrics.
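For illustration, a simplified sketch of what the described change amounts to (not the exact reth code; the constant name and message enum are placeholders): the to_engine channel moves from tokio's unbounded mpsc to a bounded one with a default capacity of 1000.

```rust
// Simplified stand-in for the change: bounded instead of unbounded mpsc.
use tokio::sync::mpsc;

/// Assumed default; the PR notes it should be revisited once usage metrics exist.
const ENGINE_MESSAGE_CHANNEL_SIZE: usize = 1000;

enum BeaconEngineMessage {
    ForkchoiceUpdated,
    NewPayload,
}

fn engine_channel() -> (
    mpsc::Sender<BeaconEngineMessage>,
    mpsc::Receiver<BeaconEngineMessage>,
) {
    // Previously: mpsc::unbounded_channel()
    mpsc::channel(ENGINE_MESSAGE_CHANNEL_SIZE)
}

fn main() {
    let (_to_engine, _from_rpc) = engine_channel();
}
```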