Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use higher priority for PVF preparation in dispute/approval context #4172

Merged
merged 9 commits into from
Apr 19, 2024
40 changes: 30 additions & 10 deletions polkadot/node/core/candidate-validation/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -657,7 +657,14 @@ async fn validate_candidate_exhaustive(
PrepareJobKind::Compilation,
);

validation_backend.validate_candidate(pvf, exec_timeout, params.encode()).await
validation_backend
.validate_candidate(
pvf,
exec_timeout,
params.encode(),
polkadot_node_core_pvf::Priority::Normal,
)
.await
},
PvfExecKind::Approval =>
validation_backend
Expand All @@ -667,6 +674,7 @@ async fn validate_candidate_exhaustive(
params,
executor_params,
PVF_APPROVAL_EXECUTION_RETRY_DELAY,
polkadot_node_core_pvf::Priority::Critical,
)
.await,
};
Expand Down Expand Up @@ -749,10 +757,15 @@ trait ValidationBackend {
pvf: PvfPrepData,
exec_timeout: Duration,
encoded_params: Vec<u8>,
// The priority for the preparation job.
prepare_priority: polkadot_node_core_pvf::Priority,
) -> Result<WasmValidationResult, ValidationError>;

/// Tries executing a PVF for the approval subsystem. Will retry once if an error is encountered
/// that may have been transient.
/// Tries executing a PVF. Will retry once if an error is encountered that may have
/// been transient.
///
/// The `prepare_priority` is relevant in the context of the caller. Currently we expect
/// that `approval` context has priority over `backing` context.
///
/// NOTE: Should retry only on errors that are a result of execution itself, and not of
/// preparation.
Expand All @@ -763,6 +776,8 @@ trait ValidationBackend {
params: ValidationParams,
executor_params: ExecutorParams,
retry_delay: Duration,
// The priority for the preparation job.
prepare_priority: polkadot_node_core_pvf::Priority,
) -> Result<WasmValidationResult, ValidationError> {
let prep_timeout = pvf_prep_timeout(&executor_params, PvfPrepKind::Prepare);
// Construct the PVF a single time, since it is an expensive operation. Cloning it is cheap.
Expand All @@ -776,8 +791,10 @@ trait ValidationBackend {
// long.
let total_time_start = Instant::now();

let mut validation_result =
self.validate_candidate(pvf.clone(), exec_timeout, params.encode()).await;
// Use `Priority::Critical` as finality trumps parachain liveliness.
let mut validation_result = self
.validate_candidate(pvf.clone(), exec_timeout, params.encode(), prepare_priority)
.await;
if validation_result.is_ok() {
return validation_result
}
Expand Down Expand Up @@ -851,8 +868,9 @@ trait ValidationBackend {

// Encode the params again when re-trying. We expect the retry case to be relatively
// rare, and we want to avoid unconditionally cloning data.
validation_result =
self.validate_candidate(pvf.clone(), new_timeout, params.encode()).await;
validation_result = self
.validate_candidate(pvf.clone(), new_timeout, params.encode(), prepare_priority)
.await;
}
}

Expand All @@ -870,11 +888,13 @@ impl ValidationBackend for ValidationHost {
pvf: PvfPrepData,
exec_timeout: Duration,
encoded_params: Vec<u8>,
// The priority for the preparation job.
prepare_priority: polkadot_node_core_pvf::Priority,
) -> Result<WasmValidationResult, ValidationError> {
let priority = polkadot_node_core_pvf::Priority::Normal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we never used the priority logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup


let (tx, rx) = oneshot::channel();
if let Err(err) = self.execute_pvf(pvf, exec_timeout, encoded_params, priority, tx).await {
if let Err(err) =
self.execute_pvf(pvf, exec_timeout, encoded_params, prepare_priority, tx).await
{
return Err(InternalValidationError::HostCommunication(format!(
"cannot send pvf to the validation host, it might have shut down: {:?}",
err
Expand Down
2 changes: 2 additions & 0 deletions polkadot/node/core/candidate-validation/src/tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -368,6 +368,7 @@ impl ValidationBackend for MockValidateCandidateBackend {
_pvf: PvfPrepData,
_timeout: Duration,
_encoded_params: Vec<u8>,
_prepare_priority: polkadot_node_core_pvf::Priority,
) -> Result<WasmValidationResult, ValidationError> {
// This is expected to panic if called more times than expected, indicating an error in the
// test.
Expand Down Expand Up @@ -1044,6 +1045,7 @@ impl ValidationBackend for MockPreCheckBackend {
_pvf: PvfPrepData,
_timeout: Duration,
_encoded_params: Vec<u8>,
_prepare_priority: polkadot_node_core_pvf::Priority,
) -> Result<WasmValidationResult, ValidationError> {
unreachable!()
}
Expand Down
2 changes: 1 addition & 1 deletion polkadot/node/core/pvf/src/host.rs
Original file line number Diff line number Diff line change
Expand Up @@ -197,7 +197,7 @@ impl Config {
prepare_worker_program_path,
prepare_worker_spawn_timeout: Duration::from_secs(3),
prepare_workers_soft_max_num: 1,
prepare_workers_hard_max_num: 1,
prepare_workers_hard_max_num: 2,

execute_worker_program_path,
execute_worker_spawn_timeout: Duration::from_secs(3),
Expand Down
7 changes: 4 additions & 3 deletions polkadot/node/core/pvf/src/priority.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,17 +14,18 @@
// You should have received a copy of the GNU General Public License
// along with Polkadot. If not, see <http://www.gnu.org/licenses/>.

/// A priority assigned to execution of a PVF.
/// A priority assigned to preparation of a PVF.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
/// Normal priority for things that do not require immediate response, but still need to be
/// done pretty quick.
///
/// Approvals and disputes fall into this category.
/// Backing falls into this category.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(commenting here because I can't comment higher)

A priority assigned to execution of a PVF.

Please correct me if I'm wrong, but from a quick reading of this code, it seems to me that the priority is not for PVF execution, but for PVF compilation (prepare).

If that's the case, first, the comments here are misleading. And second, these changes would only help if the nodes restart and have no artifacts for old PVFs that are relevant for unapproved blocks or maybe the PVF artifact pruning is too aggressive?
Normally, I would expect them to have the artifacts even for new PVFs due to pre-checking. What would probably help if preparation is slow is having more workers for PVF compilation (currently, it's 1).

Copy link
Contributor Author

@sandreim sandreim Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is currently used for preparation only. You are right, with pre-checking the artifact should be already there. It helps mostly with missing artifacts due to prunning or executor parameter changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the PVF artifact pruning is too aggressive?

I wouldn't say it is too aggressive, we prune an artifact if it hasn't been used for 24 hours, which seems like a fair measure to prevent artifact cache bloating.

I would expect them to have the artifacts even for new PVFs due to pre-checking

Generally, yes, but if a validator pre-checked a PVF and didn't use it for 24 hours, then again, it has to re-compile it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing artifacts due to prunning

AFAIU, pruning after 24 hours only applies if we haven't used the artifact. For example, if we were out of the active (parachain) validator set. Otherwise, they should not be pruned. Assuming minimal churn in validator set changes, this was not the case in the scenario we observed.

or executor parameter changes

For old artifacts, we should have them compiled unless we just started the node (in which case, yes, we should prioritize compiling PVFs of the unapproved blocks first). If we have them compiled already, then we should actually prioritize compiling new PVFs as soon as we see them (backing). So again, this change would not have helped in this case. Bumping preparation workers would though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have them compiled already, then we should actually prioritize compiling new PVFs as soon as we see them (backing). So again, this change would not have helped in this case. Bumping preparation workers would though.

For the executor parameter changed case . We are more often assigned to validate candidates rather than back candidates for which the PVF needs to be recompiled. I don't understand why would backing priority helps in this case. Assuming we prioritize backing, we'd basically no-show on more of the assignments we have for many blocks, leading to other tranches triggering increasing the work of approval subsystems needlessly as finality lag increases. Bumping prepration workers hard cap might be a better choice for the time being along with this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, pruning after 24 hours only applies if we haven't used the artifact.

Exactly. But it's still not the best behavior possible. I'd better limit the artifact cache size and remove the most stale artifact on its overflow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we prioritize backing, we'd basically no-show on more of the assignments we have for many blocks, leading to other tranches triggering increasing the work of approval subsystems needlessly as finality lag increases. Bumping prepration workers hard cap might be a better choice for the time being along with this change.

I was describing a case when we have PVFs already compiled for unapproved (old) blocks. In that case, we would not no-show. What I meant is that we should put new PVF for preparation as soon as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean, but in this case, the preparation backlog should be empty and we would prepare new PVFs as soon as we see them anyway.

Normal,
/// This priority is used for requests that are required to be processed as soon as possible.
///
/// For example, backing is on a critical path and requires execution as soon as possible.
/// Disputes and approvals are on a critical path and require execution as soon as
/// possible to not delay finality.
Critical,
}

Expand Down
Loading