async_backing: Candidate timeouts on group rotation boundary in versi #3165

Closed
alexggh opened this issue Feb 1, 2024 · 5 comments · Fixed by #3170
alexggh commented Feb 1, 2024

When running the tick and glutton parachains on versi with async backing, every 2-3 minutes a candidate times out on backing, and that blocks the core for about 2 minutes.

Some logs regarding 1 candidate: https://grafana.teleport.parity.io/goto/Nke60BpSR?orgId=1.

Root cause

It seems that at the group rotation boundary there is a problem in the way group assignment works: the collator and the backing subsystem end up using the group assignment from before the rotation, but since the candidate is backed in a block after the rotation, availability uses a different group for fetching the chunk, which results in the candidate timing out.

The main culprit seems to be an assumption baked into the runtime function validator_groups

pub fn validator_groups<T: initializer::Config>(
namely that a candidate we are backing is going to be included on chain in the next block, which is not the case with async_backing.

	let now = <frame_system::Pallet<T>>::block_number() + One::one();

	let groups = <scheduler::Pallet<T>>::validator_groups();
	let rotation_info = <scheduler::Pallet<T>>::group_rotation_info(now);

	(groups, rotation_info)

Hence validator_groups, as used in the backing subsystem, and group_responsible

group_responsible: group_responsible_for(
from CoreState, as used in availability-distribution, give us different groups, so the candidate never passes the availability step.
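
To make the mismatch concrete, here is a minimal standalone sketch of a rotation computation. It is a simplified model, not the actual runtime code: the function name `group_for_core`, its parameters, and the numbers used (session start at block 0, rotation frequency 4, 3 cores) are all assumptions chosen for illustration. It only shows that evaluating the same core at two block numbers on either side of a rotation boundary yields different groups.

```rust
/// Hypothetical, simplified model of group rotation (not the runtime's code).
/// Returns the index of the group assigned to `core_index` at block `now`.
/// `n_cores` must be non-zero; a `frequency` of 0 is treated as "never rotate".
fn group_for_core(
    now: u32,
    session_start: u32,
    frequency: u32,
    core_index: u32,
    n_cores: u32,
) -> u32 {
    if frequency == 0 {
        return core_index % n_cores;
    }
    // Number of full rotations that have happened since the session started.
    let rotations = now.saturating_sub(session_start) / frequency;
    (core_index + rotations) % n_cores
}

fn main() {
    // Assumed illustration values: session starts at block 0,
    // groups rotate every 4 blocks, and there are 3 cores.
    let (session_start, frequency, n_cores) = (0, 4, 3);

    // One side evaluates the assignment just before the boundary...
    let before_rotation = group_for_core(3, session_start, frequency, 0, n_cores);
    // ...the other side evaluates it at a block after the boundary.
    let after_rotation = group_for_core(4, session_start, frequency, 0, n_cores);

    println!("core 0 group before rotation: {before_rotation}");
    println!("core 0 group after rotation:  {after_rotation}");
}
```

If the two subsystems plug different block numbers into a computation of this shape, they disagree about which group holds the chunks whenever the two numbers straddle a multiple of the rotation frequency.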

@rphmeier: Thoughts on how to fix this?

@alexggh alexggh self-assigned this Feb 1, 2024
@alexggh alexggh moved this from Backlog to In Progress in parachains team board Feb 1, 2024
@alexggh alexggh changed the title async_backing: Investigate intermitent candidate timeouts on versi async_backing: Candidate timeouts on group rotation boundary in versi Feb 1, 2024
eskimor commented Feb 1, 2024

The backing time should be irrelevant for this. As I described (I think on the parathreads ticket), for determining the backing group we have to use the relay parent of the candidate. If this is not the case, we need to fix that. I don't see how anything else can work.

eskimor commented Feb 1, 2024

Consequences in this comment.

alexggh commented Feb 1, 2024

> If this is not the case, we need to fix that

I confirmed that it is not the case. Steps to reproduce:

  1. Start any zombienet with a config where async backing is enabled and the group rotation frequency is lower than the session length, which is 10 blocks on zombienet. E.g.:
[relaychain.genesis.runtimeGenesis.patch.configuration.config]
  scheduling_lookahead = 2
  group_rotation_frequency = 4
  2. Run the test.
  3. Open polkadot.js and observe candidates timing out.
  4. Open the logs; for the candidates that timed out you will see many occurrences of:
Validator did not have our chunk
  5. The core will be stuck for the entire duration of this group rotation and nothing else gets executed on it.
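
As a rough illustration of why the rotation frequency has to be lower than the session length for this to reproduce, the sketch below lists the block offsets within a session at which a rotation would occur. The helper `rotation_boundaries` is hypothetical and assumes rotations are counted from the session start in whole multiples of the frequency; the values 10 and 4 come from the config above.

```rust
/// Hypothetical helper: block offsets (relative to the session start) at which
/// the rotation count increases within one session of `session_length` blocks.
/// Assumes rotations happen at whole multiples of `frequency`; a `frequency`
/// of 0 is treated as "never rotate".
fn rotation_boundaries(session_length: u32, frequency: u32) -> Vec<u32> {
    if frequency == 0 {
        return Vec::new();
    }
    (1..session_length)
        .filter(|offset| offset % frequency == 0)
        .collect()
}

fn main() {
    // Values from the reproduction config: 10-block sessions, rotate every 4 blocks.
    println!("boundaries: {:?}", rotation_boundaries(10, 4));
    // With a frequency equal to the session length there is no mid-session rotation.
    println!("boundaries: {:?}", rotation_boundaries(10, 10));
}
```

Under these assumptions a 10-block session with a frequency of 4 has two mid-session boundaries, so candidates whose relay parent and backing block fall on opposite sides of one come up often in a short test run.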

rphmeier commented Feb 1, 2024

Nice find, and thanks!

The linked PR fixes the off-by-one (OBO) correctly. The backing group is determined based on the relay-parent and not on the number of the block the candidate was backed in.

github-merge-queue bot pushed a commit that referenced this issue Feb 2, 2024
Fixes: #3165

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@github-project-automation github-project-automation bot moved this from In Progress to Completed in parachains team board Feb 2, 2024
alexggh added a commit that referenced this issue Feb 2, 2024
Fixes: #3165

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
(cherry picked from commit 5ba8921)
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
EgorPopelyaev pushed a commit that referenced this issue Feb 5, 2024
Fixes: #3165

---------


(cherry picked from commit 5ba8921)

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/async-backing-development-updates/6176/1
