WIP feat: bounded consensus events channel #8251
Conversation
Force-pushed from 23581c8 to 3c3bdb2
Here I would like to see the MeteredSender and MeteredReceiver types already.
Not convinced that dropping engine messages under congestion is sufficient error handling. Can the node recover if it drops any engine message? Otherwise this introduces an even bigger vulnerability than it aims to fix.

If we cannot guarantee recovery from dropping all engine message types, then I lean towards using UnboundedMeteredSender and UnboundedMeteredReceiver here for now, or a re-fetch system if that's possible to implement.
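A minimal sketch of the trade-off being discussed, using plain tokio channels rather than reth's actual MeteredSender/MeteredReceiver types (the message enum and capacity here are placeholders, not reth code): a bounded channel has to do something when it is full, and with try_send the overflowing engine message is simply dropped, which is exactly the error-handling concern raised above.

```rust
// Placeholder sketch, not reth's real types: what "dropping under congestion"
// looks like with a bounded tokio mpsc channel.
use tokio::sync::mpsc::{self, error::TrySendError};

#[derive(Debug)]
enum BeaconEngineMessage {
    ForkchoiceUpdated,
    NewPayload,
}

fn send_bounded(tx: &mpsc::Sender<BeaconEngineMessage>, msg: BeaconEngineMessage) {
    match tx.try_send(msg) {
        Ok(()) => {}
        // Channel full: the message is lost unless a re-fetch / retry path exists.
        Err(TrySendError::Full(dropped)) => eprintln!("congestion, dropped {dropped:?}"),
        Err(TrySendError::Closed(_)) => eprintln!("engine receiver gone"),
    }
}

fn main() {
    // Tiny capacity just to make the overflow visible.
    let (tx, _rx) = mpsc::channel(2);
    for _ in 0..4 {
        send_bounded(&tx, BeaconEngineMessage::NewPayload);
    }
}
```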
Indeed, dropping engine messages in congestion conflicts with this issue about not dropping engine messages: #8203
Hmm, this is a tough one; it seems it isn't great either way. To add some further pain to the issue, the Engine API specs do not prevent replay attacks, which would be a pain here. Note the timestamp needs to be within 1 minute of the current clock, so replays are only valid that long.
I don't imagine this is a common case, but someone with read access to the EL - CL connection could replay messages. An example of a malicious case would be two FCU updates that switch from chain A to chain B. If the attacker spammed the connection with FCU A - FCU B - FCU A - FCU B - ... it would cause quite a lot of processing. Now if we have a bounded channel, they'd be able to use this replay to cause genuine messages to be dropped, which is just as bad as a resource DoS.

In terms of dropping excess FCUs: if the most recent FCU has a valid canonical head then all of the older FCUs can be safely dropped. However, if an FCU does not contain the canonical head then older FCUs are relevant, as they could update the canonical head. So there's potential here to minimize the number of FCUs we process by filtering out some old ones.

In terms of dropping new payloads: it would not be good to drop any canonical blocks, as these would need to be fetched again and processed later. So I believe we'd need to process all of the new payloads unless some logic is implemented to determine which are not canonical.

tl;dr: maybe a tidier solution here is to not bound the channel and instead drain all of the messages into a queue where some management can occur (a rough sketch follows).
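A hedged sketch of that queue-management idea, with placeholder names rather than reth's real types: keep the channel unbounded, drain everything that is currently buffered, keep only the most recent FCU (simplified; the canonical-head caveat above still applies), and never drop new payloads.

```rust
// Placeholder sketch of "drain into a queue and manage it there".
use std::collections::VecDeque;
use tokio::sync::mpsc;

#[derive(Debug)]
enum EngineEvent {
    ForkchoiceUpdated { head: u64 },
    NewPayload { block: u64 },
}

fn drain_and_filter(rx: &mut mpsc::UnboundedReceiver<EngineEvent>) -> VecDeque<EngineEvent> {
    let mut queue = VecDeque::new();
    let mut latest_fcu: Option<EngineEvent> = None;

    // Drain everything currently buffered without awaiting.
    while let Ok(event) = rx.try_recv() {
        match event {
            // Simplification: keep only the newest FCU. In practice older FCUs
            // still matter when the newest one lacks a valid canonical head.
            fcu @ EngineEvent::ForkchoiceUpdated { .. } => latest_fcu = Some(fcu),
            // New payloads are never dropped: canonical blocks would otherwise
            // have to be fetched and processed again later.
            payload @ EngineEvent::NewPayload { .. } => queue.push_back(payload),
        }
    }
    if let Some(fcu) = latest_fcu {
        queue.push_back(fcu);
    }
    queue
}

fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    tx.send(EngineEvent::ForkchoiceUpdated { head: 1 }).unwrap();
    tx.send(EngineEvent::NewPayload { block: 2 }).unwrap();
    tx.send(EngineEvent::ForkchoiceUpdated { head: 2 }).unwrap();
    println!("{:?}", drain_and_filter(&mut rx));
}
```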
Force-pushed from 5da034a to 77df31b
How about just upgrading the connection to the CL to HTTPS?
Yep, HTTPS has replay protection so that would prevent re-using old requests.
Force-pushed from 4f7d39d to c400887
Would it require a fix on the CL side to support this as an extra security feature, or is it already supported? @kirk-baird
The engine API connection requires a valid JWT; how feasible is something like this? I assume an attacker on the same machine is out of scope. Would it need to feed the right messages to the CL s.t. it sends these messages?
Sorry, clicked the wrong button...
Oops, this was pointing to #8193's branch and was automatically closed when it was merged and the branch removed; reopening.
Right, the risk is very low. However, with this fix, the risk that the node can't recover under network congestion due to dropped engine messages comes into existence. @Rjected
Good question, so the issue here is that the valid JWT is replayable. That means if an attacker reads a message with a JWT, they can replay that message as many times as they desire until it expires (~1 minute). An attacker isn't able to create new messages since they can't forge a JWT, only re-use existing tokens. Note that each JWT is tied to a single message via HMAC, so an attacker can't take a valid one and use it on a different message.

So the likelihood of this issue is determined by how easy it is to read messages and send them again. By default these connections are not encrypted and use HTTP or WebSockets. If the CL + EL are on the same machine then reading the connection means you have access to the machine and so you can do much worse attacks; we're not worried about this case. The case to think about is when the EL + CL are on different machines and communicate over an untrusted network, e.g. regular HTTP can be eavesdropped. Upgrading this connection to HTTPS when they're not on the same machine would resolve this issue, as it prevents replay and also encrypts the messages.
Setting the channel to bounded, without doing much else, could cause similar trouble, with the messages sitting in waiting tokio tasks instead of the channel. In terms of correctness with the CL, the execution API spec already describes a notion of a timeout and retry.
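A small illustration of the first point, assuming nothing about reth's actual code (types and numbers are placeholders): if every RPC call simply spawns a task that awaits capacity on a bounded channel, congestion moves the backlog from the channel into parked tasks, so memory growth is still effectively unbounded.

```rust
// Hypothetical sketch: a bounded channel alone just relocates the backlog.
use tokio::sync::mpsc;

#[derive(Debug)]
struct EngineMessage(u64); // placeholder for a real engine message type

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<EngineMessage>(8);

    // Each "RPC call" spawns a task that awaits channel capacity. Under
    // congestion the messages now sit in these parked tasks instead of the
    // channel, so total memory use grows just the same.
    for i in 0..1_000 {
        let tx = tx.clone();
        tokio::spawn(async move {
            let _ = tx.send(EngineMessage(i)).await;
        });
    }

    // Slow consumer: drains a few messages, everything else keeps waiting.
    for _ in 0..8 {
        if let Some(msg) = rx.recv().await {
            println!("processed {msg:?}");
        }
    }
}
```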
thanks @gnattishness
How about keeping the unbounded channel in that case, but polling it not when the EL is ready to process the engine message, but instead at short intervals? We do this and queue the engine messages in a type that evicts entries on-op based on the timeout when inserting new entries, much like this type: https://github.com/sigp/discv5/blob/8eac0fee8da65c4021c6bba7c976d395151309ed/src/lru_time_cache.rs#L7-L18. I think that type can replace the …
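A rough sketch of that eviction idea, in the spirit of (but not a drop-in for) discv5's LruTimeCache; all names here are illustrative: entries older than a configured timeout are discarded when new ones are inserted, so stale engine messages never linger in the queue.

```rust
// Illustrative time-evicting queue: old entries are dropped on insert.
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct TimedQueue<T> {
    entries: VecDeque<(Instant, T)>,
    timeout: Duration,
}

impl<T> TimedQueue<T> {
    fn new(timeout: Duration) -> Self {
        Self { entries: VecDeque::new(), timeout }
    }

    /// Insert a new entry, evicting any that have outlived the timeout.
    fn insert(&mut self, value: T) {
        let now = Instant::now();
        while matches!(self.entries.front(), Some((t, _)) if now.duration_since(*t) > self.timeout) {
            self.entries.pop_front();
        }
        self.entries.push_back((now, value));
    }

    fn pop(&mut self) -> Option<T> {
        self.entries.pop_front().map(|(_, v)| v)
    }
}

fn main() {
    // Timeout chosen arbitrarily for the example, not taken from the spec.
    let mut queue = TimedQueue::new(Duration::from_secs(8));
    queue.insert("forkchoice_updated");
    queue.insert("new_payload");
    assert_eq!(queue.pop(), Some("forkchoice_updated"));
}
```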
Closing in favour of addressing #8203 w.r.t. the spec'd timeouts for messages.
Requires #8193
There is an unbounded channel for consensus events sent from the RPC. The channel to_engine is used by the BeaconConsensusEngineHandle to handle messages from the RPC. The channel is unbounded and therefore susceptible to DoS attacks or excessive resource consumption in times of congestion.

The RPC is only callable from the consensus client, which is considered a trusted actor, and therefore the risk of a DoS attack is low. However, during times of congestion a large number of events could build up. Processing events includes executing blocks and updating the blockchain tree. Thus, it may be possible that events are added to the channel faster than they can be processed, and the channel will keep growing.
This PR replaces the unbounded channel with a bounded one, for now setting a default channel size of 1000; this should be adapted after gathering usage metrics.
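For illustration, a simplified sketch of what the described change amounts to (not the exact reth code; the constant name and message enum are placeholders): the to_engine channel moves from tokio's unbounded mpsc to a bounded one with a default capacity of 1000.

```rust
// Simplified stand-in for the change: bounded instead of unbounded mpsc.
use tokio::sync::mpsc;

/// Assumed default; the PR notes it should be revisited once usage metrics exist.
const ENGINE_MESSAGE_CHANNEL_SIZE: usize = 1000;

enum BeaconEngineMessage {
    ForkchoiceUpdated,
    NewPayload,
}

fn engine_channel() -> (
    mpsc::Sender<BeaconEngineMessage>,
    mpsc::Receiver<BeaconEngineMessage>,
) {
    // Previously: mpsc::unbounded_channel()
    mpsc::channel(ENGINE_MESSAGE_CHANNEL_SIZE)
}

fn main() {
    let (_to_engine, _from_rpc) = engine_channel();
}
```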