Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider patching memset/memcpy inside of WASM to call the native implementations instead #21

Open
koute opened this issue Feb 22, 2023 · 11 comments
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@koute
Copy link
Contributor

koute commented Feb 22, 2023

So I've been thinking about how to potentially speed up the contracts pallet, and I've remembered that just by enabling the bulk memory operations WASM feature we could potentially make the contracts pallet twice as fast (and also potentially speed up other things, as bulk memory operations are very widely used). Enabling this is unfortunately a little tricky, as outlined by @pepyakin here.

But then, I got a (maybe not so) crazy idea. What if we'd just.... monkeypatch this into the WASM bytecode before compiling it?

We already patch the WASM blobs to adjust e.g. the memory section or add stack depth metering. So why not do something like this?

  1. Find memset and memcpy inside of the WASM blob.
  2. Replace the body of those functions with a call to a dynamically injected host function which would call the native optimized memset/memcpy on the host.
  3. Things magically get faster.
  4. ???
  5. Profit.

So this has the following benefits:

  1. Relatively simple and easy to do.
  2. Works retroactively.
  3. Backwards and forwards compatible.
  4. Eliminates the wasmtime bug where depending on the phase of the moon whether the Cranelift-generated memset loop randomly ends up cache aligned in memory or not the contracts pallet's performance takes a sharp nosedive.
  5. It's just an optimization so doesn't affect the host <-> runtime interface in any way, and can be removed at any time.

Any downsides here that I'm not seeing as to why we wouldn't want to do this?

cc @paritytech/sdk-node @pepyakin @athei

@koute koute added the I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. label Feb 22, 2023
@athei
Copy link
Member

athei commented Feb 22, 2023

This is a difficult decision: This would fix our short term issues with wasmi and is very tempting because it would probably be implementable right away. However, it directly opposes our goal of moving away from the deprecated parity-wasm as it adds another user (the patching). And moving away from parity-wasm (by transitioning to wasmparser) is a prerequisite to enable further wasm proposals. That said, I think it is not a big issue to add another user user of parity-wasm: When we transition to wasmparser (i.e porting the stack height metering) and have #917 we can just enable bulk memory and remove the monkey patching. So no harm done by adding it.

I generally like the idea and I am not against it. But let me play devils advocate and see how you feel about those drawbacks:

  • The obvious one: It adds complexity. Probably not too much. But still. We have a custom non standard pass we apply.
  • Some of the memset / memcpy will probably be inlined and cannot be trivially patched. But I guess those we don't want to patch anyways?
  • This might make the build a bit brittle since I am not sure how stable the name of those functions are. This is probably a minor point but still something that we need to keep in the back of our mind. I guess this is the same as the first point.
  • It might decrease performance in some cases (where small blocks are copied). This assumes that host functions are more expensive than internal functions (I am not sure if this is still the case with wasmtime). Since the arguments are runtime values we can't really know if it is a big or small copy when patching.

@koute
Copy link
Contributor Author

koute commented Feb 23, 2023

The obvious one: It adds complexity. Probably not too much. But still. We have a custom non standard pass we apply.

Indeed. But due to the way rustc currently emits those functions monkeypatching them should be really trivial.

Some of the memset / memcpy will probably be inlined and cannot be trivially patched. But I guess those we don't want to patch anyways?

If they're inlined it means they were copying something small enough that most likely this optimization would be a net loss.

This might make the build a bit brittle since I am not sure how stable the name of those functions are.

Yeah, this is a good point. From what I can see rustc always emits them like this, even in the production profile with LTO enabled:

 (func $memcpy (param $0 i32) (param $1 i32) (param $2 i32) (result i32)
  (call $_ZN17compiler_builtins3mem6memcpy17ha1fcde75cc2961d5E
   (local.get $0)
   (local.get $1)
   (local.get $2)
  )
 )

We could always add a simple test in our wasm-builder crate that would check for this and emit a warning, maybe something like "WARNING: memcpy not found in the runtime; this will inhibit optimizations and potentially make things slower; please report this as a bug on substrate issue tracker" and if this ever changes someone would let us know.

It might decrease performance in some cases (where small blocks are copied). This assumes that host functions are more expensive than internal functions (I am not sure if this is still the case with wasmtime). Since the arguments are runtime values we can't really know if it is a big or small copy when patching.

We could probably just write a benchmark which would go through each possible size up to N and see where the cutoff is and then instead of emitting an unconditional call add an if which would either call the host function or the built-in one.


One problem I could see is that if we're going to use wasm-opt then that might interfere with this optimization. But maybe if we do this we wouldn't need wasm-opt? Or maybe we could tell wasm-opt to leave this function alone and never inline it?

@athei
Copy link
Member

athei commented Feb 24, 2023

Some of the memset / memcpy will probably be inlined and cannot be trivially patched. But I guess those we don't want to patch anyways?

If they're inlined it means they were copying something small enough that most likely this optimization would be a net loss.

The size of the object shouldn't affect the inlining decision. The functions are always of the same size. Something similar happens: I think for objects smaller than a certain thresholds no call to this function is injected but an unrolled loop. In this case we don't want to patch as you said. We are good here.

One problem I could see is that if we're going to use wasm-opt then that might interfere with this optimization. But maybe if we do this we wouldn't need wasm-opt? Or maybe we could tell wasm-opt to leave this function alone and never inline it?

We still want wasm-opt. It does a lot of optimizations. I think we should just run our pass before wasm-opt. Then there is nothing to inline for wasm-opt.

@koute
Copy link
Contributor Author

koute commented Feb 27, 2023

We still want wasm-opt. It does a lot of optimizations. I think we should just run our pass before wasm-opt. Then there is nothing to inline for wasm-opt.

We can't really do that because wasm-opt is going to run offline when the runtime is compiled from Rust into .wasm, and this is going to run online immediately before the runtime is compiled from .wasm into the native code and instantiated.

So if we'd want this and wasm-opt we'd have to instruct it to leave these functions alone (assuming it'll touch them).

@athei
Copy link
Member

athei commented Feb 28, 2023

We can't really do that because wasm-opt is going to run offline when the runtime is compiled from Rust into .wasm, and this is going to run online immediately before the runtime is compiled from .wasm into the native code and instantiated.

Why can't we just put it into the wasm-builder? I don't think this "last minute patching" is a good idea. It makes it harder to inspect the code that is actually running.

@koute
Copy link
Contributor Author

koute commented Mar 1, 2023

Why can't we just put it into the wasm-builder?

The whole point of this exercise is to make this optimization entirely transparent, with full backwards and forwards compatibility, and also to make it apply retroactively so that every runtime which already exists on chain would also automatically benefit (but that's just a bonus).

We can't do this in wasm-builder because then that'd be something which affects the runtime <-> host interface, which would make this not an entirely transparent optimization. So you'd need new enough client nodes, which means months of waiting until enough people upgrade, and then there's also the problem of PVFs where if not everyone is upgraded this could potentially be trivially exploited since our PVF execution environment is not versioned.

On other hand with this approach we could most likely enable it right now. (And we can still remove this and switch to real bulk memory ops in the future once all of the pieces are there, precisely because it doesn't affect the runtime <-> host interface at all.)

I don't think this "last minute patching" is a good idea. It makes it harder to inspect the code that is actually running.

Well, any JIT optimizer does this a lot, including wasmtime. (:

@athei
Copy link
Member

athei commented Mar 1, 2023

We can't do this in wasm-builder because then that'd be something which affects the runtime <-> host interface

Ahh yeah I forgot about this. Yes it needs to be done when compiling then.

Well, any JIT optimizer does this a lot, including wasmtime. (:

Yes but this is expected behavior. I don't expect this from my blockchain node. But its okay. It seems to be the only viable way.

On other hand with this approach we could most likely enable it right now. (And we can still remove this and switch to real bulk memory ops in the future once all of the pieces are there, precisely because it doesn't affect the runtime <-> host interface at all.)

Only grumble is that I don't see a way to tell wasm-opt which functions to leave alone for inlining.

@koute
Copy link
Contributor Author

koute commented Mar 1, 2023

Only grumble is that I don't see a way to tell wasm-opt which functions to leave alone for inlining.

Yeah, if we want both then this will have to be rectified. But assuming the worst case even if I have to add this to wasm-opt myself it shouldn't be too hard.

@athei
Copy link
Member

athei commented Mar 1, 2023

Since there is no time frame to move away from stack height metering from parity-wasm I am in favor of implementing this optimization.

Are you up for doing it?

@koute
Copy link
Contributor Author

koute commented Mar 1, 2023

Since there is no time frame to move away from stack height metering from parity-wasm I am in favor of implementing this optimization.

Are you up for doing it?

If I were to do this in a production-ready fashion I'd probably like to also do #20 (that is, get rid of parity-wasm completely in favor of wasmparser + wasm-encoder) and while I'm at it rewrite the stack height metering too. (Which might not even require much work, since Filecoin people already did this, so it'd be mostly integrating this back into substrate and making sure there are no bugs/it still works as we expect it should.)

So, tentatively yeah.

Right now I'm putting up a proof of concept implementation so that I can measure the performance benefit and can compare vanilla vs this optimization vs wasm-opt vs this optimization + wasm-opt.

@athei
Copy link
Member

athei commented Mar 1, 2023

If I were to do this in a production-ready fashion I'd probably like to also do #20 (that is, get rid of parity-wasm completely in favor of wasmparser + wasm-encoder) and while I'm at it rewrite the stack height metering too. (Which might not even require much work, since Filecoin people already did this, so it'd be mostly integrating this back into substrate and making sure there are no bugs/it still works as we expect it should.)

I didn't know. This is great. After that the only blocker for wasm-proposals is the versioning of the executor.

Right now I'm putting up a proof of concept implementation so that I can measure the performance benefit and can compare vanilla vs this optimization vs wasm-opt vs this optimization + wasm-opt.

Nice.

@the-right-joyce the-right-joyce transferred this issue from paritytech/substrate Aug 24, 2023
github-merge-queue bot pushed a commit that referenced this issue Apr 8, 2024
… import message (#4021)

Sometimes you need to debug some issues just by the logs and reconstruct
what happened.
In these scenarios it would be nice to know if a block was imported as
best block, and what it parent was.
So here I propose to change the output of the informant to this:

```
2024-04-05 20:38:22.004  INFO ⋮substrate: [Parachain] ✨ Imported #18 (0xe7b3…4555 -> 0xbd6f…ced7)    
2024-04-05 20:38:24.005  INFO ⋮substrate: [Parachain] ✨ Imported #19 (0xbd6f…ced7 -> 0x4dd0…d81f)    
2024-04-05 20:38:24.011  INFO ⋮substrate: [jobless-children-5352] 🌟 Imported #42 (0xed2e…27fc -> 0x718f…f30e)    
2024-04-05 20:38:26.005  INFO ⋮substrate: [Parachain] ✨ Imported #20 (0x4dd0…d81f -> 0x6e85…e2b8)    
2024-04-05 20:38:28.004  INFO ⋮substrate: [Parachain] 🌟 Imported #21 (0x6e85…e2b8 -> 0xad53…2a97)    
2024-04-05 20:38:30.004  INFO ⋮substrate: [Parachain] 🌟 Imported #22 (0xad53…2a97 -> 0xa874…890f)    
```

---------

Co-authored-by: Bastian Köcher <git@kchr.de>
Ank4n pushed a commit that referenced this issue Apr 9, 2024
… import message (#4021)

Sometimes you need to debug some issues just by the logs and reconstruct
what happened.
In these scenarios it would be nice to know if a block was imported as
best block, and what it parent was.
So here I propose to change the output of the informant to this:

```
2024-04-05 20:38:22.004  INFO ⋮substrate: [Parachain] ✨ Imported #18 (0xe7b3…4555 -> 0xbd6f…ced7)    
2024-04-05 20:38:24.005  INFO ⋮substrate: [Parachain] ✨ Imported #19 (0xbd6f…ced7 -> 0x4dd0…d81f)    
2024-04-05 20:38:24.011  INFO ⋮substrate: [jobless-children-5352] 🌟 Imported #42 (0xed2e…27fc -> 0x718f…f30e)    
2024-04-05 20:38:26.005  INFO ⋮substrate: [Parachain] ✨ Imported #20 (0x4dd0…d81f -> 0x6e85…e2b8)    
2024-04-05 20:38:28.004  INFO ⋮substrate: [Parachain] 🌟 Imported #21 (0x6e85…e2b8 -> 0xad53…2a97)    
2024-04-05 20:38:30.004  INFO ⋮substrate: [Parachain] 🌟 Imported #22 (0xad53…2a97 -> 0xa874…890f)    
```

---------

Co-authored-by: Bastian Köcher <git@kchr.de>
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this issue Apr 9, 2024
… import message (paritytech#4021)

Sometimes you need to debug some issues just by the logs and reconstruct
what happened.
In these scenarios it would be nice to know if a block was imported as
best block, and what it parent was.
So here I propose to change the output of the informant to this:

```
2024-04-05 20:38:22.004  INFO ⋮substrate: [Parachain] ✨ Imported paritytech#18 (0xe7b3…4555 -> 0xbd6f…ced7)    
2024-04-05 20:38:24.005  INFO ⋮substrate: [Parachain] ✨ Imported paritytech#19 (0xbd6f…ced7 -> 0x4dd0…d81f)    
2024-04-05 20:38:24.011  INFO ⋮substrate: [jobless-children-5352] 🌟 Imported paritytech#42 (0xed2e…27fc -> 0x718f…f30e)    
2024-04-05 20:38:26.005  INFO ⋮substrate: [Parachain] ✨ Imported paritytech#20 (0x4dd0…d81f -> 0x6e85…e2b8)    
2024-04-05 20:38:28.004  INFO ⋮substrate: [Parachain] 🌟 Imported paritytech#21 (0x6e85…e2b8 -> 0xad53…2a97)    
2024-04-05 20:38:30.004  INFO ⋮substrate: [Parachain] 🌟 Imported paritytech#22 (0xad53…2a97 -> 0xa874…890f)    
```

---------

Co-authored-by: Bastian Köcher <git@kchr.de>
jonathanudd pushed a commit to jonathanudd/polkadot-sdk that referenced this issue Apr 10, 2024
* Update dependencies

Upgrades Substrate based dependencies from v2.0.0 -> v2.0.0-alpha.1
and uses the `jsonrpsee`'s new feature flags. The actual code hasn't
been updated though, so this won't compile.

* Use `RawClient`s from `jsonrpsee`

* Update to use jsonrpsee's new API

* Hook up Ethereum Bridge Runtime, Relay, and Node Runtime

* Bump `parity-crypto` from v0.4 to v0.6

Fixes error when trying to compile tests. This was caused by
`parity-crypto` v0.4's use of `parity-secp256k1` over `secp256k1'.
Using the Parity fork meant multiple version of the same underlying
C library were being pulled in. `parity-crypto` v0.6 moved away from
this, only relying on `secp256k1` thus fixing the issue.
liuchengxu added a commit to subcoin-project/polkadot-sdk that referenced this issue Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.
Projects
Status: backlog
Development

No branches or pull requests

2 participants