Optimize `Func::call` and its C API #3319
Conversation
Force-pushed from 7414f0e to 5f37933
Label Action: cc @fitzgen, @kubkon, @peterhuene

This issue or pull request has been labeled: "fuzzing", "wasi", "wasmtime:api", "wasmtime:c-api". The users above have been cc'd because of those labels.
To show an idea of the numbers I'm looking at, here's some information. I've got a C program that shows the following timings on various wasmtime versions. What's being benchmarked:

* `nop`: host calls a wasm function that does nothing.
* `i64`: host calls a wasm function. The wasm function calls a host function returning `i64`. Wasm then returns that `i64`.
* `many`: host calls a wasm function. The wasm function calls a host function with 5 `i32` params and one `i32` return. Wasm discards the result and returns.
This is all a far cry from what Rust can do, though, so while this PR certainly improves the state of affairs, Rust far outstrips it. Rust has the option of using "typed" and "untyped" functions, both for calling wasm and for defining imports.

The basic summary here is that when we have "typed" versions of everything, basically any combination of arguments/returns is roughly 20ns of call-time overhead. That's the minimum threshold for calling a wasm function. Once "untyped" things are used, there starts to be per-argument and per-result overhead. This creeps up in both calling wasm functions and calling host functions, and can be seen in how "many" is slower than "i64", which is slower than "nop". Finally, the C API imposes about a 40ns blanket overhead on top of wasmtime's "untyped" Rust versions. Basically at this point I think that the "untyped" variants, which C currently is forced to use, are as optimal as they're gonna get. There's still inherent overhead in the translation between `Val` and `wasmtime_val_t`.
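To make the typed/untyped distinction concrete, here is a std-only toy sketch (illustrative names only, not the real wasmtime API) of why the "untyped" path carries per-argument and per-result overhead: each value must be wrapped in a dynamic enum and type-checked at call time, while the "typed" path is a plain direct call.

```rust
// Toy model of "typed" vs "untyped" call paths. All names are
// illustrative; this is not wasmtime's actual implementation.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Val {
    I32(i32),
    I64(i64),
}

// "Typed" path: the signature is known at compile time, so arguments
// are passed directly with no dynamic checks.
fn typed_add(a: i32, b: i32) -> i32 {
    a + b
}

// "Untyped" path: arguments arrive as a slice of dynamic values that
// must be type-checked and unwrapped before the real work happens,
// and the result must be re-wrapped. This is the per-argument and
// per-result overhead described above.
fn untyped_add(args: &[Val]) -> Result<Val, String> {
    match args {
        [Val::I32(a), Val::I32(b)] => Ok(Val::I32(a + b)),
        _ => Err("type mismatch".to_string()),
    }
}

fn main() {
    assert_eq!(typed_add(1, 2), 3);
    assert_eq!(untyped_add(&[Val::I32(1), Val::I32(2)]), Ok(Val::I32(3)));
    assert!(untyped_add(&[Val::I64(1)]).is_err());
    println!("ok");
}
```

The more arguments and results a signature has, the more wrap/check/unwrap work the untyped path does per call, which is consistent with "many" being slower than "i64" being slower than "nop".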
This commit adds a new API to the `wasmtime::Func` type for wrapping a C-ABI function with a well-defined signature derived from a wasm type signature. The purpose of this API is to add the most-optimized-we-can path for using the C API and having wasm call host functions. Previously when wasm called a host function it would perform these steps:

1. Using a trampoline, place all arguments into a `u128*` array on the stack.
2. Call `Func::invoke`, which uses the type of the function (dynamically) to read values from this `u128*`.
3. Values are placed within a `Vec<Val>` after being read.
4. The C API receives `&[Val]` and translates this to `&[wasmtime_val_t]`, iterating over each value and copying its contents into a new vector.
5. Then the host function is actually called.
6. The above argument-shuffling steps are all performed in reverse for the results, shipping everything through `wasmtime_val_t` and `Val`.

PRs such as bytecodealliance#3319 and related attempts have made this sequence faster, but the numbers on bytecodealliance#3319 show that even after we get all the allocation and such out of the way we're still spending quite a lot of time shuffling arguments back and forth relative to the `Func::wrap` API that Rust can use.

This commit fixes the issue by eliminating all steps except 1 and 5 above. Although we still place all arguments on the stack and read them out again to call the C-defined function with, it's much faster than pushing this all through the `Val` and `wasmtime_val_t` machinery. This overall gets the cost of a wasm->host call basically on par with `Func::wrap`, although it's still not as optimized.
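The surviving steps 1 and 5 can be sketched as a std-only toy model (illustrative names; not wasmtime's real trampoline code): arguments are spilled into a `u128` scratch array, then loaded straight back out to invoke a C-ABI function with a concrete signature, with no `Vec<Val>` or `wasmtime_val_t` round-trips in between.

```rust
// Toy sketch of the kept trampoline steps. `host_add` stands in for a
// hypothetical C-defined host function with a known signature.

extern "C" fn host_add(a: i32, b: i64) -> i64 {
    a as i64 + b
}

// Step 1: spill the wasm-side arguments into the u128 scratch array.
// Each slot holds one value, zero-extended from its actual width.
fn spill(a: i32, b: i64, storage: &mut [u128; 2]) {
    storage[0] = a as u32 as u128;
    storage[1] = b as u64 as u128;
}

// Step 5: load the arguments back out with their known types and call
// the C-ABI function directly; the result is written back into the
// same scratch space for the wasm side to pick up.
fn call_host(storage: &mut [u128; 2]) -> i64 {
    let a = storage[0] as u32 as i32;
    let b = storage[1] as u64 as i64;
    let ret = host_add(a, b);
    storage[0] = ret as u64 as u128;
    ret
}

fn main() {
    let mut storage = [0u128; 2];
    spill(7, 35, &mut storage);
    assert_eq!(call_host(&mut storage), 42);
    println!("ok");
}
```

Because the signature is fixed, the loads and stores above compile to a handful of moves, which is why the remaining cost is "almost entirely spent in trampolines storing the arguments to the stack and loading them later".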
Benchmarking the overhead of wasm->host calls, where `i64` returns one `i64` value and `many` takes 5 `i32` parameters and returns one `i64` value, the numbers I get are:

| Import | Rust | C before | C after |
|--------|------|----------|---------|
| `i64`  | 1ns  | 40ns     | 7ns     |
| `many` | 1ns  | 91ns     | 10ns    |

This commit is a clear win over the previous implementation, but it's still somewhat far from Rust. That being said, I think I'm out of ideas for how to make this better. Without open-coding much larger portions of `wasmtime` I'm not sure how much better we can get here. The time in C after this commit is almost entirely spent in trampolines storing the arguments to the stack and loading them later, and at this point I'm not sure how much more optimized we can get than that, since Rust needs to enter the picture here somehow to handle the Wasmtime fiddly-bits of calling back into C. I'm hopeful, though, that this is such a large improvement from before that the question of optimizing this further in C is best left for another day.

The new `Func::wrap_cabi` method is unlikely to ever get used from Rust, but a `wasmtime_func_wrap` API was added to the C API to mirror `Func::wrap`: if the host function pointer has a specific ABI this function can be called instead of `wasmtime_func_new`.
This commit is an alternative to bytecodealliance#3298 which achieves effectively the same goal of optimizing the `Func::call` API as well as its C API sibling, `wasmtime_func_call`. The strategy taken here is different than bytecodealliance#3298, though: a new API isn't created; rather, a small tweak is made to an existing API. Specifically this commit handles the major sources of slowness with `Func::call`:

* Looking up the type of a function, to typecheck the arguments and guide how the results should be loaded, no longer hits the rwlock in the `Engine`; instead each `Func` contains its own `FuncType`. This can be an unnecessary allocation for funcs not used with `Func::call`, so this is a downside of this implementation relative to bytecodealliance#3298. A mitigating factor, though, is that instance exports are loaded lazily into the `Store`, and in theory not too many funcs are active in the store as `Func` objects.
* Temporary storage is amortized with a long-lived `Vec` in the `Store` rather than allocating a new vector on each call. This is basically the same strategy as bytecodealliance#3294, only applied to different types in different places. Specifically `wasmtime::Store` now retains a `Vec<u128>` for `Func::call`, and the C API retains a `Vec<Val>` for calling `Func::call`.
* Finally, an API-breaking change is made to `Func::call` and its type signature (as well as `Func::call_async`). Instead of returning `Box<[Val]>` as it did before, this function now takes a `results: &mut [Val]` parameter. This allows the caller to manage the allocation, and we can amortize-remove it in `wasmtime_func_call` by using space after the parameters in the `Vec<Val>` we're passing in.

This change is naturally a breaking change and we'll want to consider it carefully, but mitigating factors are that most embeddings are likely using `TypedFunc::call` instead, and this signature taking a mutable slice better aligns with `Func::new`, which receives a mutable slice for the results. Overall this change, in the benchmark of "call a nop function from the C API", is not quite as good as bytecodealliance#3298. It's still a bit slower, on the order of 15ns, because there are lots of capacity checks around vectors and the type checks are slightly less optimized than before. Overall though this is still significantly better than today because allocations and the rwlock to acquire the type information are both avoided. I personally feel that this change is the best to do because it has less of an API impact than bytecodealliance#3298.
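The signature change in the last bullet can be sketched std-only (illustrative names; not the real `wasmtime` types or methods): instead of every call allocating a fresh `Box<[Val]>`, the caller owns a long-lived buffer and passes `&mut [Val]`, so the allocation is amortized across calls exactly as the C API does with its retained `Vec<Val>`.

```rust
// Toy model of the old vs new `Func::call`-style result handling.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Val {
    I64(i64),
}

// Old shape: allocates a fresh boxed slice on every single call.
fn call_old(arg: i64) -> Box<[Val]> {
    vec![Val::I64(arg + 1)].into_boxed_slice()
}

// New shape: the caller manages the allocation and passes a mutable
// slice for the results to be written into.
fn call_new(arg: i64, results: &mut [Val]) {
    results[0] = Val::I64(arg + 1);
}

fn main() {
    // One long-lived buffer reused across many calls: zero allocations
    // per call after the first.
    let mut results = vec![Val::I64(0); 1];
    for i in 0..3 {
        call_new(i, &mut results);
        assert_eq!(results[0], Val::I64(i + 1));
    }
    // The old shape allocates every time.
    assert_eq!(&*call_old(41), &[Val::I64(42)]);
    println!("ok");
}
```

The caller-provided slice is also what lets `wasmtime_func_call` carve result space out of the tail of its parameter `Vec<Val>` rather than allocating separately.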
Force-pushed from 5f37933 to 3ed9019
I'll note that I think we still want this even in the face of #3345. I personally like the idea of taking a mutable slice for the results.
Looks great, just the one question!
This commit fixes a "merge conflict" with bytecodealliance#3319 being merged into `main`, causing CI failures on merge.