Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Collators should only send their collation to one validator at a time #3230

Closed
eskimor opened this issue Jun 13, 2021 · 5 comments
Closed

Collators should only send their collation to one validator at a time #3230

eskimor opened this issue Jun 13, 2021 · 5 comments
Labels
I10-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible.

Comments

@eskimor
Copy link
Member

eskimor commented Jun 13, 2021

Right now a collator will announce a collation to all validators of the backing group and all of them will start fetching the collation. Therefore the collator needs to be be able to push its payload 5 times within the request timeout (for a backing group size of 5), which easily fails (and has failed) for large PoVs.

Instead a collator should serve those requests one at a time, this way the load is better distributed. The validator that received the collation will second it and distribute it to other validators as well via POV distribution.

Implementation: We can have a request queue size of one, so all other requests will be cancelled immediately. For this to work we need to be able to back pressure on the queue for the duration it takes to serve the request. Thanks to @tomaka this is possible and we already have an implementation of this in statement distribution.

Another puzzle piece is being able to back pressure on one particular request channel. In network bridge we have a request multiplexer which makes back pressuring on a particular channel impossible. This should be fixed eventually. For now we special cased statement distribution here and send the request channel directly to statement distribution. A similar hack could be done for collation fetching as well or even better get rid of the multiplexer all together.

Considerations:

  • A collator should keep note of validators that were not able to fetch the collation within the timeout, so it will try some other validator first the next time.
  • A validator should not reduce the collators reputation on a cancelled request as this is expected behavior now. Instead we could only change its reputation if we keep failing and also don't get notified of a seconded statement about the advertised candidate, in that case we can decrease reputation a lot/disconnect immediately.

@bkchr @rphmeier

@eskimor eskimor added I10-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible. labels Jun 13, 2021
@bkchr
Copy link
Member

bkchr commented Jun 24, 2021

Another puzzle piece is being able to back pressure on one particular request channel. In network bridge we have a request multiplexer which makes back pressuring on a particular channel impossible. This should be fixed eventually. For now we special cased statement distribution here and send the request channel directly to statement distribution. A similar hack could be done for collation fetching as well or even better get rid of the multiplexer all together.

So, I checked the code a little bit and I'm not sure about the back pressure. While I understand how it works and that it makes sense, we only have one channel for all PoV requests. This means we can not differentiate per relay parent, but we should still try to serve one PoV per relay parent.

I went a different route in my pr here: #3360 We cache the requests in the collator protocol and answer them one by one, but on a relay parent level.

@bkchr
Copy link
Member

bkchr commented Jun 24, 2021

* A collator should keep note of validators that were not able to fetch the collation within the timeout, so it will try some other validator first the next time.

As already spoken in real life, this is not really supported currently and we should also not do this. The consideration is that the validator with the best connection is probably the first that will request the collation from a collator.

@eskimor
Copy link
Member Author

eskimor commented Jun 24, 2021

As already spoken in real life, this is not really supported currently and we should also not do this. The consideration is that the validator with the best connection is probably the first that will request the collation from a collator.

I will have a look at your PR asap. What we should make sure is, that a single malicious validator cannot DOS a parachain by simply not requesting the collation - that would really not be great.

@burdges
Copy link
Contributor

burdges commented Jun 24, 2021

Avoiding parachain stalls is basically the only reason backing goups have size larger than one validator.

@bkchr
Copy link
Member

bkchr commented Jun 24, 2021

I will have a look at your PR asap. What we should make sure is, that a single malicious validator cannot DOS a parachain by simply not requesting the collation - that would really not be great.

Yeah good point, it could stop us by artificially slowing down the transfer. This would be bad. We could stop if the transfer is too slow...

However, with my latest change to the validator side (only requesting on collation at a time), we could have the same problems. (a a malicious collator slowing us down)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I10-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible.
Projects
None yet
Development

No branches or pull requests

3 participants