Collators should only send their collation to one validator at a time #3230

eskimor · 2021-06-13T12:37:43Z

Right now a collator will announce a collation to all validators of the backing group and all of them will start fetching the collation. Therefore the collator needs to be be able to push its payload 5 times within the request timeout (for a backing group size of 5), which easily fails (and has failed) for large PoVs.

Instead a collator should serve those requests one at a time, this way the load is better distributed. The validator that received the collation will second it and distribute it to other validators as well via POV distribution.

Implementation: We can have a request queue size of one, so all other requests will be cancelled immediately. For this to work we need to be able to back pressure on the queue for the duration it takes to serve the request. Thanks to @tomaka this is possible and we already have an implementation of this in statement distribution.

Another puzzle piece is being able to back pressure on one particular request channel. In network bridge we have a request multiplexer which makes back pressuring on a particular channel impossible. This should be fixed eventually. For now we special cased statement distribution here and send the request channel directly to statement distribution. A similar hack could be done for collation fetching as well or even better get rid of the multiplexer all together.

Considerations:

A collator should keep note of validators that were not able to fetch the collation within the timeout, so it will try some other validator first the next time.
A validator should not reduce the collators reputation on a cancelled request as this is expected behavior now. Instead we could only change its reputation if we keep failing and also don't get notified of a seconded statement about the advertised candidate, in that case we can decrease reputation a lot/disconnect immediately.

@bkchr @rphmeier

bkchr · 2021-06-24T13:49:08Z

Another puzzle piece is being able to back pressure on one particular request channel. In network bridge we have a request multiplexer which makes back pressuring on a particular channel impossible. This should be fixed eventually. For now we special cased statement distribution here and send the request channel directly to statement distribution. A similar hack could be done for collation fetching as well or even better get rid of the multiplexer all together.

So, I checked the code a little bit and I'm not sure about the back pressure. While I understand how it works and that it makes sense, we only have one channel for all PoV requests. This means we can not differentiate per relay parent, but we should still try to serve one PoV per relay parent.

I went a different route in my pr here: #3360 We cache the requests in the collator protocol and answer them one by one, but on a relay parent level.

bkchr · 2021-06-24T13:50:42Z

* A collator should keep note of validators that were not able to fetch the collation within the timeout, so it will try some other validator first the next time.

As already spoken in real life, this is not really supported currently and we should also not do this. The consideration is that the validator with the best connection is probably the first that will request the collation from a collator.

eskimor · 2021-06-24T13:56:39Z

As already spoken in real life, this is not really supported currently and we should also not do this. The consideration is that the validator with the best connection is probably the first that will request the collation from a collator.

I will have a look at your PR asap. What we should make sure is, that a single malicious validator cannot DOS a parachain by simply not requesting the collation - that would really not be great.

burdges · 2021-06-24T13:59:30Z

Avoiding parachain stalls is basically the only reason backing goups have size larger than one validator.

bkchr · 2021-06-24T14:02:21Z

I will have a look at your PR asap. What we should make sure is, that a single malicious validator cannot DOS a parachain by simply not requesting the collation - that would really not be great.

Yeah good point, it could stop us by artificially slowing down the transfer. This would be bad. We could stop if the transfer is too slow...

However, with my latest change to the validator side (only requesting on collation at a time), we could have the same problems. (a a malicious collator slowing us down)

eskimor added I10-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible. labels Jun 13, 2021

eskimor mentioned this issue Aug 31, 2021

Collation fetching times out too often #3748

Closed

eskimor closed this as completed Oct 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collators should only send their collation to one validator at a time #3230

Collators should only send their collation to one validator at a time #3230

eskimor commented Jun 13, 2021

bkchr commented Jun 24, 2021

bkchr commented Jun 24, 2021

eskimor commented Jun 24, 2021

burdges commented Jun 24, 2021

bkchr commented Jun 24, 2021

Collators should only send their collation to one validator at a time #3230

Collators should only send their collation to one validator at a time #3230

Comments

eskimor commented Jun 13, 2021

bkchr commented Jun 24, 2021

bkchr commented Jun 24, 2021

eskimor commented Jun 24, 2021

burdges commented Jun 24, 2021

bkchr commented Jun 24, 2021