This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Collator execution environment parametrization #6497

Closed
s0me0ne-unkn0wn opened this issue Jan 3, 2023 · 15 comments
Labels
I3-bug Fails to follow expected behavior. T5-parachains_protocol This PR/Issue is related to Parachains features and protocol changes.

Comments

@s0me0ne-unkn0wn
Contributor

s0me0ne-unkn0wn commented Jan 3, 2023

See: paritytech/polkadot-sdk#917, PR #6161

After adopting #6161, we should make collators aware of execution environment parameters, too. Otherwise, they could submit a PVF that they checked to be valid, but that would be invalid for any validator because of different execution environment parameters in the relay chain runtime.

It's not an urgent concern as, right now, execution environment parameters are set to default values, and collators share the same defaults, so the situation described cannot happen. It will only become problematic if we change any execution environment parameters in the relay chain runtime.
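As a toy illustration of this hazard, consider a heavily simplified sketch (the struct, field names, and default values below are invented for illustration; the real `ExecutorParams` in the Polkadot runtime is a versioned set of typed entries, not a plain struct):

```rust
// Hypothetical, simplified stand-in for the relay chain's execution
// environment parameters. All fields and values are invented.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ExecutorParams {
    stack_logical_max: u32,   // wasm logical stack depth limit
    max_memory_pages: u32,    // linear memory limit, in 64 KiB pages
    precheck_timeout_ms: u64, // preparation timeout used by pre-checkers
}

impl Default for ExecutorParams {
    fn default() -> Self {
        Self {
            stack_logical_max: 65_536,
            max_memory_pages: 2_048,
            precheck_timeout_ms: 60_000,
        }
    }
}

// A collator that only knows the defaults silently disagrees with the
// relay chain as soon as governance changes any parameter.
fn params_match(collator: &ExecutorParams, relay_chain: &ExecutorParams) -> bool {
    collator == relay_chain
}

fn main() {
    let collator = ExecutorParams::default();
    let mut relay_chain = ExecutorParams::default();
    // Today: both sides use the defaults, so nothing can go wrong.
    assert!(params_match(&collator, &relay_chain));

    // A runtime upgrade changes a limit on the relay chain only.
    relay_chain.max_memory_pages = 4_096;
    // The collator is now out of sync without noticing.
    assert!(!params_match(&collator, &relay_chain));
}
```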

@s0me0ne-unkn0wn s0me0ne-unkn0wn added the I3-bug Fails to follow expected behavior. label Jan 3, 2023
@eskimor eskimor added the T5-parachains_protocol This PR/Issue is related to Parachains features and protocol changes. label Jan 5, 2023
@eskimor eskimor moved this from To do to In progress in Parachains-core Jan 5, 2023
@s0me0ne-unkn0wn
Contributor Author

After some research (thanks @ordian for the consultations), I didn't find anything that has to be done here.

A PVF is never prepared nor executed on the collator side. A collator submits the PVF to the relay chain, where it is included in a block and picked up by the PVF pre-checker, which is already parametrized. Pre-checkers vote on the validation code hash, and if a supermajority confirms it can produce a valid artifact, it is considered valid. That is, all the checking is done on the relay chain side.
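The supermajority rule can be sketched as follows (the strict two-thirds threshold is an assumption for illustration, not a quote of the actual pre-checking rule):

```rust
// Sketch of a supermajority check: a PVF is accepted once more than
// two thirds of the pre-checking validators vote for it. Integer
// arithmetic avoids floating-point comparison issues.
fn supermajority_reached(ayes: usize, validators: usize) -> bool {
    ayes * 3 > validators * 2
}

fn main() {
    assert!(!supermajority_reached(6, 10)); // 6/10 is not more than 2/3
    assert!(supermajority_reached(7, 10));  // 7/10 is more than 2/3
    assert!(!supermajority_reached(2, 3));  // exactly 2/3 is not enough
}
```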

So I'm closing it for now, feel free to reopen if I'm missing something.

@eskimor
Member

eskimor commented Mar 3, 2023

They could submit a block/PoV that would be considered invalid. The PVF is less of a problem: all that could happen is that a runtime upgrade gets rejected.

@s0me0ne-unkn0wn
Contributor Author

They could, but how to avoid it? I mean, without bringing the whole executor infrastructure into the collator code?

@eskimor
Member

eskimor commented Mar 3, 2023

As long as we are only using them for limits which honest code should never exceed anyway, it does not matter too much, but once we also use them for upgrading/switching compilers, it starts to matter. I mean, if the risk of incompatibilities is low enough, we might still be able to punt on it for quite a while. The worst thing that can happen is that a parachain is stalled until they upgrade their nodes - bad enough for a production parachain, though.

Considering that we have Kusama, this might still be acceptable for quite a while: if things break, they break on Kusama first (where chaos is expected) and collators can upgrade their nodes before we hit Polkadot.

Also, I just found this comment: This is actually an interesting idea: if collators demand a particular execution environment in the candidate receipt, this would work as well - at the cost of complicating validators.

Anyhow, resolving this can easily wait until we have versioning - we might make creating these tickets a todo on that ticket, so we have less clutter until the time comes when we should start thinking about it.

@bkchr
Member

bkchr commented Mar 3, 2023

They could, but how to avoid it? I mean, without bringing the whole executor infrastructure into the collator code?

The whole executor (substrate) infrastructure is already in the collator code. At some point we will need to fetch these parameters as well for configuring the block building environment.

@s0me0ne-unkn0wn
Contributor Author

The whole executor (substrate) infrastructure is already in the collator code

That's true, but currently the collator doesn't feature any interface to the executor. We could create one if need be, but do we really have to? The collator always runs a full node nearby; is it possible to use its already available executor interface (probably via RPC calls?) to check what we want to check, rather than incorporating that logic into the collator itself?

My concern is that we'll have to support two different yet homogeneous executor interfaces.

@bkchr
Member

bkchr commented Mar 3, 2023

I have later seen that @eskimor already created issues for these things: paritytech/polkadot-sdk#2339

My concern is that we'll have to support two different yet homogeneous executor interfaces.

I don't know exactly what you mean by this.

@s0me0ne-unkn0wn
Contributor Author

I don't know exactly what you mean by this.

Well, we have quite a bit of infrastructure on the node side to run PVFs and validate candidates: the PVF host with preparation and execution pools/queues/workers, the executor interface itself, which implements the validation backend that wraps the Substrate executor and serves as an intermediary between execution requests and the execution itself, and so on. If we want similar functionality in collators, we should either abstract part of that infrastructure and reuse it in collators (which would be overkill unless we put some effort into extensively tearing the abstractions apart), or implement a similar but simpler infrastructure in collators (which would result in some functionality duplication).

We can go either way, but I believe we should first evaluate whether we can reuse something that already exists; that's why I'm asking if we can make use of the node running alongside the collator. All those things are already there in the node software. If we implement an RPC call that accepts a candidate, validates it off-chain, and provides the collator with an answer like "valid/not valid", that would be enough. Do you think that makes sense?

@bkchr
Member

bkchr commented Mar 4, 2023

If we want similar functionality in collators, we should either abstract part of that infrastructure and reuse it in collators (which would be overkill unless we put some effort into extensively tearing the abstractions apart), or implement a similar but simpler infrastructure in collators (which would result in some functionality duplication).

Ahh okay, so you really meant the entire executor infrastructure. So no, we don't need this. From the collator POV the runtime is trusted. However, we will need to run block building under certain constraints, as is done during validation on the relay chain. For sure, we need to run with less memory than we have available while validating a block, because validation stores more stuff in memory. We need to respect the maximum stack depth. We need to build a block while ensuring that it doesn't take more time than we give backers to validate it (#6793). And whatever other things come up over time. Stuff like the preparation time of the wasm binary isn't really important to check: wasm binaries that don't fit into the timeout will not be accepted as a runtime upgrade, and the collator doesn't need to check this.
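The time-budget constraint mentioned here could, in a very rough sketch, look like this (the `build_block` function, the budget value, and the `u32` stand-in for extrinsics are all hypothetical):

```rust
use std::time::{Duration, Instant};

// Sketch: stop pulling extrinsics into the block once the authoring
// deadline approaches, so the block stays within the time backers are
// given to validate it. Executing an extrinsic is only simulated here.
fn build_block(budget: Duration, extrinsics: &[u32]) -> Vec<u32> {
    let deadline = Instant::now() + budget;
    let mut included = Vec::new();
    for &xt in extrinsics {
        if Instant::now() >= deadline {
            break; // leave the rest for the next block
        }
        included.push(xt); // stand-in for actually executing the extrinsic
    }
    included
}

fn main() {
    // With a generous budget everything fits.
    let all = build_block(Duration::from_millis(50), &[1, 2, 3]);
    assert_eq!(all, vec![1, 2, 3]);
    // With a zero budget nothing is included.
    let none = build_block(Duration::ZERO, &[1, 2, 3]);
    assert!(none.is_empty());
}
```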

@s0me0ne-unkn0wn
Contributor Author

So no, we don't need this. From the collator POV the runtime is trusted. However, we will need to run block building under certain constraints, as is done during validation on the relay chain.

And to validate anything, we need at least the following:

  • A preparation worker running as a separate process, which prepares a runnable binary artifact out of the Wasm PVF and is time-limited so as not to cause a timeout during pre-checking on the relay chain;
  • An execution worker running as a separate process, because it may die during execution, and we want to be able to kill it on timeout;
  • An artifact storage, because we don't want to prepare the artifact anew every time we want to validate;
  • Probably some other things I forgot.

So it seems to me like we still need pretty much the same infrastructure, just vastly scaled down: it doesn't need to support parallel preparation and execution and other bells and whistles, but neither does it look like something trivial. Am I correct?
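For instance, the artifact-storage point could be sketched like this (the types, the `u64` code hash, and the "preparation" step are stand-ins for illustration, not the real PVF host API):

```rust
use std::collections::HashMap;

// Sketch of an artifact cache: compile (prepare) a PVF once per code
// hash and reuse the artifact on later validations.
struct ArtifactCache {
    artifacts: HashMap<u64, Vec<u8>>, // code hash -> prepared artifact
    prepares: usize,                  // how many times we actually compiled
}

impl ArtifactCache {
    fn new() -> Self {
        Self { artifacts: HashMap::new(), prepares: 0 }
    }

    fn get_or_prepare(&mut self, code_hash: u64, wasm: &[u8]) -> &[u8] {
        if !self.artifacts.contains_key(&code_hash) {
            self.prepares += 1;
            // Stand-in "preparation": in reality this is a sandboxed,
            // time-limited compilation in a separate worker process.
            self.artifacts.insert(code_hash, wasm.to_vec());
        }
        &self.artifacts[&code_hash]
    }
}

fn main() {
    let mut cache = ArtifactCache::new();
    cache.get_or_prepare(42, b"\0asm");
    cache.get_or_prepare(42, b"\0asm"); // cache hit, no recompilation
    assert_eq!(cache.prepares, 1);
}
```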

@bkchr
Member

bkchr commented Mar 6, 2023

So it seems to me like we still need pretty much the same infrastructure, just vastly scaled down: it doesn't need to support parallel preparation and execution and other bells and whistles, but neither does it look like something trivial. Am I correct?

We already share the same infrastructure by using the sc-executor as basis for the validation.

@s0me0ne-unkn0wn
Contributor Author

We already share the same infrastructure by using the sc-executor as basis for the validation.

Sure, but I'm talking about the part between sc-executor and a subsystem that has a PVF as a raw Wasm binary and a candidate, and that wants to validate the candidate using the PVF. In validator software, it consists of the PVF host at the higher level and the executor interface at the lower level. We still need all that stuff in the collator code if we want the collator to be able to use sc-executor to validate anything, right?

@rphmeier
Contributor

rphmeier commented Mar 8, 2023

In validator software, it consists of the PVF host at the higher level and the executor interface at the lower level. We still need all that stuff in the collator code if we want the collator to be able to use sc-executor to validate anything, right?

No, collators don't use the PVF mechanism. Their job is to make something that passes the PVF, and they do this with a Substrate runtime & client. It's also executed as WebAssembly, yes, so some things like stack depth and memory limits might need to be shared. But pre-checking or other execution worker logic is unnecessary. For a collator the runtime is trusted, for a validator the runtime is not.

@s0me0ne-unkn0wn
Contributor Author

I feel like I'm losing the thread 😅 Let's try from another end. I'll make a numbered list to address points easily.

  1. What are we trying to achieve? IIUC, we want the collator to be able to validate a candidate it's just produced, and we want it to be done by the collator itself before it submits the candidate to the relay chain, right?
  2. To validate a candidate, we must run the PVF on that candidate, right?
  3. To run the PVF, we have to prepare a binary artifact from the Wasm code, in other words, compile it, right?
  4. Then we have to run that binary artifact, providing it with the candidate as input, and if validation goes awry (timeout, crash, etc.), we don't want the whole collator process to go down, so we need to execute the binary in a separate process, right?
  5. And that separate process, that is, the execution worker, should provide some glue between the collator code and the Substrate executor, parametrizing the Substrate executor in such a way that it can execute the PVF properly, right?

Am I missing something? Because if I am not, it still sounds to me like "we need a somewhat scaled-down implementation of PVF host for collators" 🤷
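The kill-on-timeout part of this pipeline might be sketched as follows. Note that the design discussed above uses a separate process that can be killed outright; a thread plus `recv_timeout` is used here only to keep the example self-contained and runnable:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch: run a validation job with a time budget. `None` means the
// worker overran its budget and (in the real design) would be killed.
fn execute_with_timeout<F>(job: F, timeout: Duration) -> Option<bool>
where
    F: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(job()); // worker reports its verdict
    });
    rx.recv_timeout(timeout).ok()
}

fn main() {
    // A fast "validation" finishes within the budget.
    assert_eq!(execute_with_timeout(|| true, Duration::from_secs(1)), Some(true));
    // A hung worker hits the timeout and is reported as such.
    let hung = execute_with_timeout(
        || {
            thread::sleep(Duration::from_millis(200));
            true
        },
        Duration::from_millis(10),
    );
    assert_eq!(hung, None);
}
```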

@bkchr
Member

bkchr commented Mar 11, 2023

  1. IIUC, we want the collator to be able to validate a candidate it's just produced, and we want it to be done by the collator itself before it submits the candidate to the relay chain, right?

No, we don't want this. The normal mode is that you produce a PoV and send it to the relay chain. You are clearly trusting yourself to build a proper candidate ;)
