Collators should only send their collation to one validator at a time #3230
Comments
So, I checked the code a little bit and I'm not sure about the back pressure. While I understand how it works and that it makes sense, we only have one channel for all PoV requests. This means we cannot differentiate per relay parent, but we should still try to serve one PoV per relay parent. I went a different route in my PR here: #3360. We cache the requests in the collator protocol and answer them one by one, but on a relay-parent level.
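For illustration, a minimal sketch of what such per-relay-parent caching could look like; the type and field names below are hypothetical and not taken from #3360 or the actual collator protocol:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Illustrative stand-ins for the real collator-protocol types.
type RelayParentHash = [u8; 32];

struct PovRequest {
    relay_parent: RelayParentHash,
    // ... requesting peer, response sender, etc.
}

/// Caches incoming PoV requests so that at most one request per relay
/// parent is being answered at any time; further requests wait in a queue.
#[derive(Default)]
struct PerRelayParentQueue {
    pending: HashMap<RelayParentHash, VecDeque<PovRequest>>,
    in_flight: HashSet<RelayParentHash>,
}

impl PerRelayParentQueue {
    /// Enqueue a request. Returns it back if it can be served right away,
    /// i.e. nothing is currently in flight for that relay parent.
    fn push(&mut self, req: PovRequest) -> Option<PovRequest> {
        let rp = req.relay_parent;
        if self.in_flight.contains(&rp) {
            self.pending.entry(rp).or_default().push_back(req);
            None
        } else {
            self.in_flight.insert(rp);
            Some(req)
        }
    }

    /// Call once the current request for `rp` has been answered. Returns
    /// the next queued request for that relay parent, if any.
    fn finish(&mut self, rp: RelayParentHash) -> Option<PovRequest> {
        if let Some(next) = self.pending.get_mut(&rp).and_then(VecDeque::pop_front) {
            Some(next)
        } else {
            self.in_flight.remove(&rp);
            None
        }
    }
}

fn main() {
    let mut queue = PerRelayParentQueue::default();
    let rp = [0u8; 32];
    assert!(queue.push(PovRequest { relay_parent: rp }).is_some()); // served immediately
    assert!(queue.push(PovRequest { relay_parent: rp }).is_none()); // queued behind the first
    assert!(queue.finish(rp).is_some()); // first answered, second is up next
    assert!(queue.finish(rp).is_none()); // all done for this relay parent
}
```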
As already discussed in person, this is not really supported currently and we should also not do it. The consideration is that the validator with the best connection is probably the first that will request the collation from a collator.
I will have a look at your PR asap. What we should make sure is that a single malicious validator cannot DoS a parachain by simply not requesting the collation - that would really not be great.
Avoiding parachain stalls is basically the only reason backing groups have a size larger than one validator.
Yeah, good point, it could stop us by artificially slowing down the transfer. This would be bad. We could stop if the transfer is too slow... However, with my latest change to the validator side (only requesting one collation at a time), we could have the same problem: a malicious collator slowing us down.
Right now a collator will announce a collation to all validators of the backing group and all of them will start fetching the collation. Therefore the collator needs to be able to push its payload 5 times within the request timeout (for a backing group size of 5), which easily fails (and has failed) for large PoVs.
Instead, a collator should serve those requests one at a time; this way the load is better distributed. The validator that received the collation will second it and distribute it to the other validators via PoV distribution.
Implementation: We can have a request queue size of one, so all other requests will be cancelled immediately. For this to work we need to be able to apply back pressure on the queue for the duration it takes to serve the request. Thanks to @tomaka this is possible and we already have an implementation of this in statement distribution.
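As a rough illustration of the queue-size-one idea, here is a sketch using a plain std bounded channel rather than the actual network bridge/statement distribution machinery, so all names are made up:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;
use std::time::Duration;

fn main() {
    // A bounded queue of size one between the network side and the
    // collator protocol. While the single slot is occupied, `try_send`
    // fails and surplus requests can be cancelled immediately.
    let (tx, rx) = sync_channel::<u32>(1);

    // Collator-protocol side: serve one request at a time. The slot only
    // frees up as the previous request is taken off the queue, which is
    // what creates the back pressure.
    let serving = thread::spawn(move || {
        for request_id in rx {
            println!("serving collation fetch request {request_id}");
            thread::sleep(Duration::from_millis(50)); // simulate sending the PoV
        }
    });

    // Network side: incoming requests from the backing-group validators.
    for request_id in 0..5 {
        match tx.try_send(request_id) {
            Ok(()) => println!("queued request {request_id}"),
            // Queue full: reject right away instead of letting requests pile up.
            Err(TrySendError::Full(_)) => println!("cancelled request {request_id}: busy"),
            Err(TrySendError::Disconnected(_)) => break,
        }
        thread::sleep(Duration::from_millis(10));
    }

    drop(tx);
    serving.join().unwrap();
}
```

With the single-slot queue, the requesting side learns immediately that the collator is busy, while the slot only frees up once the current request has been served, which is the back pressure described above.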
Another puzzle piece is being able to apply back pressure on one particular request channel. In the network bridge we have a request multiplexer, which makes back pressuring on a particular channel impossible. This should be fixed eventually. For now we special-cased statement distribution here and send the request channel directly to statement distribution. A similar hack could be done for collation fetching as well, or, even better, get rid of the multiplexer altogether.
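A sketch of the difference between a multiplexed channel and handing each subsystem its own request channel; the types and names here are invented for illustration and are not the network bridge API:

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

// Hypothetical request types, for illustration only.
struct StatementFetchRequest;
struct CollationFetchRequest;

// Multiplexed variant: one channel carries every protocol's requests, so
// back pressure on collation fetching would also stall statement fetching --
// it cannot be applied per protocol.
#[allow(dead_code)]
enum MuxedRequest {
    StatementFetch(StatementFetchRequest),
    CollationFetch(CollationFetchRequest),
}

// Demultiplexed variant: each subsystem gets its own receiver and can
// back-pressure on its own queue independently (as statement distribution
// already does today).
struct RequestChannels {
    statement_rx: Receiver<StatementFetchRequest>,
    collation_rx: Receiver<CollationFetchRequest>,
}

fn dedicated_channels() -> (
    SyncSender<StatementFetchRequest>,
    SyncSender<CollationFetchRequest>,
    RequestChannels,
) {
    let (statement_tx, statement_rx) = sync_channel(64);
    // Queue size of one for collation fetches, see the back-pressure sketch above.
    let (collation_tx, collation_rx) = sync_channel(1);
    (collation_tx_wrapper(statement_tx, collation_tx, statement_rx, collation_rx))
}

// Small helper to keep the tuple construction readable.
fn collation_tx_wrapper(
    statement_tx: SyncSender<StatementFetchRequest>,
    collation_tx: SyncSender<CollationFetchRequest>,
    statement_rx: Receiver<StatementFetchRequest>,
    collation_rx: Receiver<CollationFetchRequest>,
) -> (
    SyncSender<StatementFetchRequest>,
    SyncSender<CollationFetchRequest>,
    RequestChannels,
) {
    (statement_tx, collation_tx, RequestChannels { statement_rx, collation_rx })
}

fn main() {
    let (_stmt_tx, _coll_tx, _channels) = dedicated_channels();
}
```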
Considerations:
@bkchr @rphmeier