Use higher priority for PVF preparation in dispute/approval context #4172

Merged — 9 commits merged into master from sandreim/prioritize_prep on Apr 19, 2024

Conversation

@sandreim (Contributor) commented Apr 17, 2024

Related to the #4126 discussion.

Currently all preparations have the same priority, which is not ideal in all cases. This change should improve finality time in the context of on-demand parachains, and when ExecutorParams are updated on-chain and all artifacts need to be rebuilt. The desired effect is to speed up approval and dispute PVF executions that require preparation, and to delay backing executions that require preparation.

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
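A minimal sketch of the intended behavior (the execution-context enum and function name are hypothetical, and this is not the exact diff; `Priority` mirrors the `polkadot_node_core_pvf::Priority` enum the PR touches):

```rust
// Illustrative sketch, not the exact PR diff. `Priority` mirrors the
// `polkadot_node_core_pvf::Priority` enum; the context enum and function
// names are hypothetical.
enum Priority {
    Normal,
    Critical,
}

enum PvfExecContext {
    Backing,
    Approval,
    Dispute,
}

fn preparation_priority(ctx: PvfExecContext) -> Priority {
    match ctx {
        // A delayed backing preparation affects a single parachain.
        PvfExecContext::Backing => Priority::Normal,
        // Approval and dispute preparation gates finality for all parachains.
        PvfExecContext::Approval | PvfExecContext::Dispute => Priority::Critical,
    }
}
```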
@sandreim added the R0-silent and I9-optimisation labels Apr 17, 2024
```rust
) -> Result<WasmValidationResult, ValidationError> {
    let priority = polkadot_node_core_pvf::Priority::Normal;
```
Contributor:

So, we never used the priority logic.

@sandreim (Contributor, Author):

yup

@alexggh (Contributor) left a comment:

The changes look sane to me. However, I don't think the impact is significant in the recompile-all situation, because at any point there are far fewer PVF backing jobs than PVF approval jobs: a validator is usually allocated to a single parachain for backing while doing approval work for at least an order of magnitude more PVFs.

@Overkillus (Contributor) left a comment:

Good find.

One thing we need to be conscious of is that this effectively introduces a form of backing back-off when nodes are overworked. Not an issue, but if we ever notice weird fluctuations in backing frequency, this should be a prime candidate for investigation.

Comment on lines 27 to 28:

```diff
-	/// For example, backing is on a critical path and requires execution as soon as possible.
+	/// Disputes and approval are on a critical path and requires execution as soon as
+	/// possible to not delay finality.
```
Contributor:

Backing was deemed more important than approvals and disputes? 😨

If I understand it correctly, that is an awesome find and fix. I was not aware of it. It also aligns much better with our backpressure plans: the first thing you should stop doing when overworked is backing, NOT approvals.

But despite the comments indicating that... I can't find anything in the original code that sets anything to Priority::Critical. Huh.

@sandreim (Contributor, Author):

I am looking at the execution path as well, but that seems to require a lot more changes.
However, for backing execution backpressure to activate at the network level, at least the backing group needs to be overloaded, and that also depends on validator hardware: some might be faster or provision more CPUs.

Contributor:

I've checked the original paritytech/polkadot#2710, and it seems it was like that from the very beginning 🤷
Maybe @pepyakin remembers why it was implemented that way.

Contributor:

I think there is a good argument for backing being first; bear in mind that pre-async-backing, backing a candidate happened under a really tight deadline. On approvals and disputes you must not fall too far behind on work, but the work does not need to happen under a tight deadline.

At a very high level, our scheduling policy needs to answer the following question: do we have a backlog of work (approvals & disputes) we can't handle, or is it just a temporary spike? If it is a temporary spike, the right call is to prioritise backing first and get back to approvals and disputes afterwards.

So you actually want to handle a backing request as soon as possible, fill the gaps with approvals and disputes, and throttle backing only if our backlog of approvals and disputes gets out of hand; otherwise we might accidentally throttle backing too often.
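A toy sketch of the backlog-aware policy described above (hypothetical names, not project code):

```rust
// Toy sketch: backing goes first while the approval/dispute backlog is
// manageable; once the backlog crosses a threshold, approvals and disputes
// preempt backing.
use std::collections::VecDeque;

#[derive(Debug)]
enum Job {
    Backing(u32),
    ApprovalOrDispute(u32),
}

fn next_job(
    backing: &mut VecDeque<Job>,
    approval: &mut VecDeque<Job>,
    backlog_limit: usize,
) -> Option<Job> {
    if approval.len() > backlog_limit {
        // Backlog out of hand: work it off before taking new backing jobs.
        approval.pop_front().or_else(|| backing.pop_front())
    } else {
        // Temporary spike (or none): serve backing first, fill the gaps
        // with approvals and disputes.
        backing.pop_front().or_else(|| approval.pop_front())
    }
}
```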

@sandreim (Contributor, Author) commented Apr 18, 2024:

Yeah, you might be right that this was far worse before async backing, but even now it is not great either. Depending on the actual PVF, preparation can take 10-20s, so the relay parent could go out of scope. Given that we have a single worker doing this right now, finality can be delayed for quite a long time, and that would affect finality on all parachains, versus just one parachain that won't be able to get its blocks included for a while.

@sandreim (Contributor, Author) commented Apr 18, 2024:

Actually, we could also bump prepare_workers_hard_max_num from 1 to 2. This would be used only for preparing PVFs tagged with Critical priority. Or maybe leave it to be changed in #4126. WDYT?
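For reference, a paraphrased sketch of how the soft/hard preparation-worker caps can interact with priority (my reading of the queue behavior, not the actual code):

```rust
// Paraphrased sketch, not the actual prepare-queue code: normal-priority
// (backing) jobs respect the soft cap, while critical (approval/dispute)
// jobs may spawn extra workers up to the hard cap.
#[derive(Clone, Copy)]
enum Priority {
    Normal,
    Critical,
}

fn can_spawn_worker(active: usize, priority: Priority, soft_max: usize, hard_max: usize) -> bool {
    match priority {
        Priority::Critical => active < hard_max, // e.g. hard_max = 2 after this PR
        Priority::Normal => active < soft_max,   // e.g. soft_max = 1
    }
}
```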

Contributor:

Sounds like a really good idea, but again, it would be good to have a number of CPU cores reserved in the hardware requirements for the maximum number of workers that may run in parallel.

@s0me0ne-unkn0wn (Contributor) left a comment:

How the hell did that happen? 🤔
I'll try to dig into the history to investigate whether we lost it at some point or indeed never had it.

@sandreim (Contributor, Author):

> One thing we need to be conscious of is that this effectively introduces a form of backing back-off when nodes are overworked. Not an issue, but if we ever notice weird fluctuations in backing frequency, this should be a prime candidate for investigation.

Yes, we need that back pressure, but this change is only relevant for PVF preparation. Execution remains unaffected, so I wouldn't say the node is overworked, more like the preparation thread is overworked :)

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
@Overkillus (Contributor):

> Yes, we need that back pressure, but this change is only relevant for PVF preparation. Execution remains unaffected, so I wouldn't say the node is overworked, more like the preparation thread is overworked :)

Very good point. I missed that this is only preparation-related. That removes most of the worry, then.

@sandreim marked this pull request as ready for review April 18, 2024 07:39
```diff
@@ -20,11 +20,12 @@ pub enum Priority {
 	/// Normal priority for things that do not require immediate response, but still need to be
 	/// done pretty quick.
 	///
-	/// Approvals and disputes fall into this category.
+	/// Backing falls into this category.
```
Member:

(commenting here because I can't comment higher)

> A priority assigned to execution of a PVF.

Please correct me if I'm wrong, but from a quick reading of this code, it seems to me that the priority is not for PVF execution, but for PVF compilation (prepare).

If that's the case, first, the comments here are misleading. And second, these changes would only help if the nodes restart and have no artifacts for old PVFs that are relevant for unapproved blocks, or maybe the PVF artifact pruning is too aggressive?
Normally, I would expect them to have the artifacts even for new PVFs due to pre-checking. What would probably help if preparation is slow is having more workers for PVF compilation (currently, it's 1).

@sandreim (Contributor, Author) commented Apr 18, 2024:

Yes, this is currently used for preparation only. You are right, with pre-checking the artifact should already be there. It helps mostly with artifacts missing due to pruning or executor parameter changes.

Contributor:

> maybe the PVF artifact pruning is too aggressive?

I wouldn't say it is too aggressive: we prune an artifact if it hasn't been used for 24 hours, which seems like a fair measure to prevent artifact cache bloat.

> I would expect them to have the artifacts even for new PVFs due to pre-checking

Generally, yes, but if a validator pre-checked a PVF and didn't use it for 24 hours, then again, it has to re-compile it.
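For illustration, the 24-hour rule amounts to a check like this (names made up, not the actual artifacts code):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical illustration of the 24-hour rule described above; the real
// artifacts code differs. An artifact is pruned only if it has been idle
// for a full day.
const ARTIFACT_TTL: Duration = Duration::from_secs(24 * 60 * 60);

fn should_prune(last_used: SystemTime, now: SystemTime) -> bool {
    now.duration_since(last_used)
        .map_or(false, |idle| idle > ARTIFACT_TTL)
}
```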

Member:

> artifacts missing due to pruning

AFAIU, pruning after 24 hours only applies if we haven't used the artifact, for example if we were out of the active (parachain) validator set. Otherwise, artifacts should not be pruned. Assuming minimal churn in validator set changes, this was not the case in the scenario we observed.

> or executor parameter changes

For old artifacts, we should have them compiled unless we just started the node (in which case, yes, we should prioritize compiling the PVFs of unapproved blocks first). If we have them compiled already, then we should actually prioritize compiling new PVFs as soon as we see them (backing). So again, this change would not have helped in this case. Bumping preparation workers would, though.

@sandreim (Contributor, Author):

> If we have them compiled already, then we should actually prioritize compiling new PVFs as soon as we see them (backing). So again, this change would not have helped in this case. Bumping preparation workers would, though.

For the executor-parameters-changed case: we are more often assigned to validate candidates than to back candidates whose PVF needs to be recompiled, so I don't understand why backing priority would help here. Assuming we prioritized backing, we'd basically no-show on more of our assignments for many blocks, leading to further tranches triggering and needlessly increasing the work of the approval subsystems as finality lag increases. Bumping the preparation workers hard cap might be a better choice for the time being, along with this change.

Contributor:

> AFAIU, pruning after 24 hours only applies if we haven't used the artifact.

Exactly. But it's still not the best behavior possible. I'd rather limit the artifact cache size and remove the most stale artifact when it overflows.
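A toy sketch of that suggestion (hypothetical types, not the actual artifacts code): evict the most stale artifact until the cache fits its size budget:

```rust
use std::time::SystemTime;

// Hypothetical size-bounded cache: when total size exceeds the budget,
// evict the artifact that has been idle the longest, repeating until
// the cache fits.
struct Artifact {
    size: u64,
    last_used: SystemTime,
}

fn evict_to_fit(cache: &mut Vec<Artifact>, max_total: u64) {
    let total = |c: &Vec<Artifact>| c.iter().map(|a| a.size).sum::<u64>();
    while total(cache) > max_total {
        // Find and remove the most stale artifact (oldest last-use time).
        match cache
            .iter()
            .enumerate()
            .min_by_key(|(_, a)| a.last_used)
            .map(|(i, _)| i)
        {
            Some(idx) => {
                cache.remove(idx);
            }
            None => break,
        }
    }
}
```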

Member:

> Assuming we prioritized backing, we'd basically no-show on more of our assignments for many blocks, leading to further tranches triggering and needlessly increasing the work of the approval subsystems as finality lag increases. Bumping the preparation workers hard cap might be a better choice for the time being, along with this change.

I was describing the case where we already have PVFs compiled for unapproved (old) blocks. In that case, we would not no-show. What I meant is that we should put a new PVF up for preparation as soon as possible.

@sandreim (Contributor, Author):

I see what you mean, but in this case the preparation backlog should be empty and we would prepare new PVFs as soon as we see them anyway.

sandreim and others added 4 commits April 18, 2024 18:07
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
@ordian (Member) left a comment:

Approving, but I think we should also bump the prep worker limit, at least the hard one, so it has an effect in the situation where we need to recompile all PVFs at the same time, in real time.

@AndreiEres (Contributor) left a comment:

Should help to keep finality. But I agree it would be good to consider separately the case where we need to recompile many PVFs.

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
@sandreim (Contributor, Author):

> Approving, but I think we should also bump the prep worker limit, at least the hard one, so it has an effect in the situation where we need to recompile all PVFs at the same time, in real time.

Bumped the hard limit to 2.

@sandreim added this pull request to the merge queue Apr 19, 2024
Merged via the queue into master with commit 04a9071 Apr 19, 2024
132 of 137 checks passed
@sandreim deleted the sandreim/prioritize_prep branch April 19, 2024 08:43