Use relative call instructions between wasm functions #3254

Closed
alexcrichton wants to merge 17 commits

Conversation

alexcrichton
Member

@alexcrichton commented Aug 26, 2021

This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all calls between functions, even those defined
in the same module, were made indirectly through a register. To
implement this the backend would emit an absolute 8-byte relocation
near each function call, load that address into a register, and then
call it. While this technique is simple to implement and easy to get
right, it has two primary downsides:

  • Function calls are always indirect which means they are more difficult
    to predict, resulting in worse performance.

  • Generating a relocation-per-function call requires expensive
    relocation resolution at module-load time, which can be a large
    contributing factor to how long it takes to load a precompiled module.

To fix these issues, at the cost of the previously simple
implementation technique, this commit switches wasm calls within a
module to use what Cranelift calls the `colocated` flag, which means
that a relative call instruction is used with a relocation that's
resolved relative to the pc of the call instruction itself.
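
For concreteness, here's a minimal sketch of what opting into this looks like at the Cranelift IR layer; the helper name, the namespace `0`, and the `SigRef` plumbing are placeholders for illustration, not code from this PR:

```rust
use cranelift_codegen::ir::{self, ExtFuncData, ExternalName};

/// Import a callee that lives in the same compiled module, marking it
/// `colocated` so the backend can emit a pc-relative call (`bl` on
/// AArch64, `call rel32` on x86_64) instead of loading an absolute
/// address into a register first.
fn import_local_callee(
    func: &mut ir::Function,
    sig: ir::SigRef,
    func_index: u32,
) -> ir::FuncRef {
    func.import_function(ExtFuncData {
        name: ExternalName::user(0, func_index),
        signature: sig,
        // The crux of this PR: the callee is in the same image, so a
        // relative relocation suffices and no absolute 8-byte
        // relocation is emitted.
        colocated: true,
    })
}
```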

With the `colocated` flag set to `true`, this commit is also able to
move much of the relocation resolution from `wasmtime_jit::link` into
`wasmtime_cranelift::obj` at object-construction time. This frontloads
all relocation work, which means that there are actually no relocations
related to function calls in the final image, solving both of the
points above.

The main gotcha in implementing this technique is that relative
function calls have hardware limits which mean we can't simply use them
blindly. AArch64, for example, can only reach +/- 128 MB from the `bl`
instruction to its target, which means that if the function we're
calling is farther away than that, we would fail to resolve the
relocation. On x86_64 the limit is +/- 2 GB, which is much larger but
still theoretically feasible to hit. Consequently the main increase in
implementation complexity is handling this issue.
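
As a rough illustration of the range check involved (a sketch using the architectural limits quoted above, not the PR's actual code):

```rust
/// Whether a pc-relative call at `call_site` can reach `target`, where
/// both are byte offsets in the text section. AArch64 `bl` encodes a
/// signed 26-bit immediate scaled by 4 (+/- 128 MB); x86_64
/// `call rel32` encodes a signed 32-bit displacement (+/- 2 GB).
fn displacement_fits(call_site: u64, target: u64, aarch64: bool) -> bool {
    let delta = target.wrapping_sub(call_site) as i64;
    let limit: i64 = if aarch64 { 1 << 27 } else { 1 << 31 };
    -limit <= delta && delta < limit
}
```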

This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the `MachBuffer` type: when
generating a function, relative jumps between basic blocks have similar
restrictions. At this time, however, I decided not to try to reuse
`MachBuffer` for inter-function relocations. The main reason for this
is that `MachBuffer` doesn't actually handle any of the cases that
inter-function relocations need to handle, namely the 26-bit
relocations of AArch64 and the 32-bit relocations of x86_64. If a
26-bit relocation isn't resolvable because a function gets too large,
`MachBuffer` will simply panic. I also opted not to use `MachBuffer`
for now because it doesn't quite have the infrastructure for the case
where inserting an island for a smaller jump sequence should produce a
relocation to fill in later; today it simply promotes a 19-bit branch
on AArch64 to a 26-bit branch, assuming the 26 bits is enough. All in
all I felt that, at this time, `MachBuffer` wasn't reusable enough and
would need enough work that it was probably more worthwhile to do this
in `wasmtime-cranelift` first and look to unify the strategies in the
future.

For these reasons the `wasmtime_cranelift::obj` module has grown quite
a bit in complexity. It now entirely handles relative relocations
between functions, automatically inserting "jump veneers" to resolve
calls between functions that are too far apart. The general strategy is
similar to `MachBuffer`'s, but differs in the final tracking of
relocations. The current assumption is that each `TargetIsa` can
generate a fixed "jump veneer" which internally has an 8-byte immediate
value that is added to its own address to reach the destination of an
indirect call. Because even the veneer's jump is relative, using
veneers still requires no absolute relocations; all of the code remains
implemented as relative jumps. The veneers are, however, larger in code
size than the native call instructions they stand in for.
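
To make the mechanism concrete, here is a hypothetical sketch of how such a veneer's 8-byte immediate would be filled in; the names and the exact veneer layout are illustrative, not taken from this PR:

```rust
/// Patch the 8-byte immediate embedded in a jump veneer. The immediate
/// is the distance from the veneer to the call target, so even when a
/// veneer is used the image contains only pc-relative values and needs
/// no absolute relocations at load time.
fn patch_veneer_immediate(
    text: &mut [u8],
    veneer_offset: usize, // where the veneer starts in the text section
    imm_offset: usize,    // offset of the immediate within the veneer
    target_offset: u64,   // where the callee starts in the text section
) {
    let distance = target_offset.wrapping_sub(veneer_offset as u64);
    let slot = veneer_offset + imm_offset;
    text[slot..slot + 8].copy_from_slice(&distance.to_le_bytes());
}
```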

I've added some simple testing of this for now. A synthetic compiler
option was created to insert zero padding between functions, and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2 GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive, and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code change.

The final result of this commit is that for a large module I'm working
with, the number of relocations dropped to zero, meaning that nothing
actually needs to be done to the text section when it's loaded into
memory (yay!). I haven't run final benchmarks yet, but this is the last
remaining source of significant slowdown when loading modules, after I
land a number of other PRs, both active ones and ones I only have
locally for now.

cc #3230

@alexcrichton requested a review from cfallin August 26, 2021 22:12
@alexcrichton
Member Author

I'm not overly happy with the number of tests I have for this, but I don't know how to otherwise exercise it more. One test I'd like to add is a sort of binary search that tries to stress the logic around precise sizes and when veneers are inserted, but I couldn't figure out a good way to test the binary search, e.g. somehow read out whether a veneer was inserted or not. @cfallin if you've got ideas of how to more thoroughly test this I'd love to implement them.

@cfallin
Member

cfallin commented Aug 26, 2021

@alexcrichton Thanks -- this looks like a bit of a monster and I will dive in a bit later, but on reading the PR description only, one thing strikes me: it should be possible to test the veneer insertion without generating tons of padding by forcing an "always use veneers" mode, no? I'm not sure if the plumbing is there to take Cranelift options in wasmtime_cranelift::obj but if not, perhaps this would be an appropriate reason to carry some options into that code (I can imagine other linker options might appear in the future).
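
Something along these lines, perhaps (a sketch of the kind of knob being suggested; the names are hypothetical):

```rust
/// Debug-only options threaded into the object-assembly code in
/// `wasmtime_cranelift::obj`.
#[derive(Default)]
pub struct LinkOptions {
    /// When set, emit a jump veneer for every inter-function call even
    /// if the target is trivially in range, so tests can exercise the
    /// veneer path without padding objects out to gigabytes.
    pub force_jump_veneers: bool,
}
```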

Contributor

@akirilov-arm left a comment


I just had a quick look at the AArch64 bits. I have two general comments, which don't need to be addressed in this PR.

First of all, we could extend the calls to an 8 GB range (+/- 4 GB) by using an ADRP + ADD + BLR sequence.

My second comment is more of an open question - have we ever thought about using code memory pages that are executable, but not readable, as additional security hardening? Of course, that would preclude the usage of any kind of inline literals and/or pools.


@alexcrichton
Member Author

we could extend the calls to an 8 GB range

Do you know if this is common in normal AArch64 compilers? (e.g. gcc or llvm). Do you know if this is better than our current strategy of load-a-literal-and-call-it? (which we could presumably change to being a relative literal if we wanted)

have we ever thought about using code memory pages that are executable, but not readable, as additional security hardening?

ooh interesting! I always thought the execute permission implied the read permission, but if we could prevent reading that'd be pretty nifty! I'm not sure how feasible it would be at the Cranelift level; we could try to shove information like that into the VMContext, but that's pretty Wasmtime-specific.

(I don't know if others have pondered this myself)

@cfallin
Member

cfallin commented Aug 27, 2021

have we ever thought about using code memory pages that are executable, but not readable, as additional security hardening?

ooh interesting! I always thought the execute permission implied the read permission, but if we could prevent reading that'd be pretty nifty! I'm not sure how feasible it would be at the Cranelift level; we could try to shove information like that into the VMContext, but that's pretty Wasmtime-specific.

It's definitely possible, but it would take some additional work with regard to linking: specifically we would need relocations that refer to .rodata and would need to fix those up to point to all constants in a pool at the end of the object.

There's the additional complication that on RISC-ish architectures like aarch64 there are limits on distance to literal-constant loads (LDR supports +/- 1MiB if I'm not mistaken?) which would mean we'd need constant islands between functions, and to switch between execute-only and read-only pages.

That said, other platforms do W^X pretty successfully (OpenBSD by default, at least) and it'd be a cool mitigation to say that we can do it too :-)

@akirilov-arm
Contributor

Do you know if this is common in normal AArch64 compilers? (e.g. gcc or llvm).

I honestly don't know - I just wanted to mention the possibility.

Do you know if this is better than our current strategy of load-a-literal-and-call-it?

I believe so - while the veneers avoid the worst aspect of that approach (the need to have a short jump over the literal, as we do for SIMD & FP literals right now), ADRP and ADD are very fast arithmetic instructions, while a literal load is still a load. Also, unless several call sites manage to share the literal, the code size is worse - 12 bytes (literal load, i.e. 4 bytes + 8-byte literal) vs. 2 instructions, i.e. 8 bytes.

I always thought the execute permission implied the read permission...

If I am not mistaken, not in the 64-bit Arm architecture.

There's the additional complication that on RISC-ish architectures like aarch64 there are limits on distance to literal-constant loads (LDR supports +/- 1MiB if I'm not mistaken?)...

Yes, but in the AArch64 case we could apply a similar trick to get some extra breathing room, i.e. an ADRP + plain LDR combination - it is not an accident that ADRP produces 4 KiB-aligned addresses, while LDR supports unsigned 12-bit offsets 😉. In fact, that combination is treated in a special way by Cortex-A55, for example, effectively fusing it.

/// Helper function exclusively for tests to increase padding between
/// functions to test the veneer insertion logic in this file.
pub fn append_synthetic_padding(&mut self, amt: usize) {
    self.push_code(vec![0; amt], true);
Contributor


Maybe make this `ud2` on x86?
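
A hypothetical shape of that suggestion, generating the filler bytes only (the surrounding helper is unchanged):

```rust
/// Build `amt` bytes of padding filled with x86 `ud2` (0x0F, 0x0B) so a
/// stray jump into the padding traps instead of executing zero bytes.
/// An odd `amt` truncates the final instruction, which is fine for
/// padding that should never be reached anyway.
fn ud2_padding(amt: usize) -> Vec<u8> {
    [0x0f_u8, 0x0b].iter().copied().cycle().take(amt).collect()
}
```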

Only intended to be used during testing/debugging. This also hooks up
the option to fuzzers.
@alexcrichton
Member Author

Today the `X86PCRelRodata4` relocation type coming out of Cranelift is simply ignored, but I don't know why. I just added a commit to ignore it, which fixes tests on the old backend, but if someone knows of a better comment to add as to why it's ignored I'd be happy to write it in.

@bjorn3
Contributor

bjorn3 commented Aug 27, 2021

`X86PCRelRodata4` is a relocation pointing to read-only data directly after a function. It only exists so that it is possible to move this read-only data away from the function. If you don't do this, the emitted code is pre-relocated for the expected location of the trailing read-only data.

Reloc::X86PCRelRodata4 => {
    // Not necessary to record this unless we are going to split apart code and its
    // jumptbl/rodata.
}

By the way, it isn't emitted by any of the new-style backends.

@sunfishcode
Member

I don't know the current status of this, but at one point, the way Cranelift was embedded in SpiderMonkey, SpiderMonkey would prepend and append instructions to Cranelift's output, and we needed the ability to move the constant pool rodata around to make room.

Don't push raw data onto the text section; we have to account for this
separately. Add a debug assertion to catch bugs like this in the future.
Hopefully can remove all this with gimli-rs/object#367
Member

@cfallin left a comment


This is great work!

I've left a bunch of comments below, but the main thing that is still sticking out to me is the partial duplication between this and the intra-function code handling all the same problems in MachBuffer. It doesn't quite sit right with me that we duplicate a lot of the logic for island insertion, deadline computation, creating veneers and editing code, just because there is some more implementation (e.g. 26->32 bit case) needed in the existing code. I'm worried about the technical debt this creates: "we'll consolidate someday" means approximately never, given all the other work on our plates, and now we have one more place where we reason about branches and relocations.

I hate to say all that after you've already done the work down this path, but I think that if we do try to consolidate now instead, a lot of what you've done is reusable -- specifically the shims here are longer-distance kinds of LabelUses, in MachBuffer terminology.

I'm not sure I fully understand the other issue you describe with MachBuffer re: island insertion logic -- it should be possible to either promote incrementally or add some info as we emit the branch that we should assume worst-case right away.

There's possibly more context here that I've already evicted from L1 since our talk last week -- sorry, very scattered week for me! -- and I'm very happy to talk through all the above to understand more what you've found! At the least, IMHO, we should document better in a code-comment somewhere how this implementation differs from the intra-function case and why it needs its own same-but-slightly-different implementation.

@@ -425,6 +425,34 @@ pub trait MachBackend {
    fn map_reg_to_dwarf(&self, _: Reg) -> Result<u16, RegisterMappingError> {
        Err(RegisterMappingError::UnsupportedArchitecture)
    }

    /// Generates a "veneer" which is used when a relative call instruction
Member


Two thoughts on this new addition to the backend trait:

  • I think there might be an opportunity to merge this per-backend behavior into the LabelUse trait already defined by each backend. The generated trampoline code is playing almost the same role -- extending the range of a branch and allowing a shorter-range reloc to be converted to a longer-range one -- except that, unlike intra-function veneers, it also assumes the use of various registers.

    That last bit is a key difference. In the aarch64 case x16 and x17 are used internally to instruction-lowering sequences but never across basic blocks, so this is fine; but r10 and r11 on x86-64 will potentially be used by regalloc, so we wouldn't want to blindly insert this as another type of veneer in the existing trait. So we'd want to add some parameters to the supports_veneer, veneer_size and generate_veneer functions to indicate what kind of veneer ("next step up in range" or "absolute" maybe) and whether the use of regs as per ABI for inter-function transfers is allowed.

    Whatever we do, it strikes me that the duplication here ("veneers" in two places, with similar APIs) is likely to confuse others so we should somehow merge them or distinguish them better. Furthermore if we're going to have low-level understanding of branches (e.g. embedded machine-code bits in a sequence to emit) we should have that in as few places as possible.

  • I am wondering if there is a better name than "veneer", if we don't merge; maybe "trampoline" or "linkage stub"?

Contributor


This function could be useful for cranelift-jit too. It also needs a version that loads the address from memory at a specific location, like ELF PLTs. Currently cranelift-jit only has an x86_64 implementation of the latter.

/// A debug-only setting used to synthetically insert 0-byte padding between
/// compiled functions to simulate huge compiled artifacts and exercise
/// logic related to jump veneers.
pub padding_between_functions: usize,
Member


Do we still need this if we have the option below (force veneers) as well?

Member Author


I'm going to leave this in for now because I think it's useful to exercise the veneers in real-world situations where they're actually needed. Otherwise I'd be worried that the logic for actually inserting veneers was only correct when the forcing was turned on. This way there's some exercising of the actual "ok yes we need that veneer" logic as well. (it's also relatively easy to support).

Member


Makes sense!

Would it make sense to use it only in tests with the aarch64 backend (could be instantiated explicitly, doesn't need to run on aarch64 host), where the threshold for inserting an island is much lower, so we don't have the overhead of 2GiB object files in tests on x86?

pub struct ObjectBuilder<'a> {
    /// The target that we're compiling for, used to query target-specific
    /// information as necessary.
    isa: &'a dyn TargetIsa,
Member


I see here why you had to add methods to the TargetIsa to get at the veneer-generation implementation; above comments re: making use of LabelUse still apply and then maybe we could bounce through the TargetIsa to use those.

    ///
    /// The second element of the pair here is the desired alignment of the
    /// function body.
    text_contents: Vec<(Vec<u8>, u64)>,
Member


Readability request: a little struct in place of this tuple, so we can write e.g. TextSegmentFunction { body_bytes, alignment } or something like that?
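
Something like the following, perhaps (field names are only a suggestion):

```rust
/// A function body queued for placement in the text section.
struct TextSegmentFunction {
    /// The encoded machine code of the function.
    body_bytes: Vec<u8>,
    /// Required alignment of the body within the text section.
    alignment: u64,
}
```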


    /// Map from text section offset, `u64`, to the index in `text_contents`
    /// where those contents live.
    text_locations: HashMap<u64, usize>,
Member


For maps from one index space to another I like the X_to_Y naming idiom to make it clear what the index and value are; maybe text_offset_to_text_contents_index? Verbose I know but getting index spaces crossed is a potent footgun...

// every pending relocation pessimistically needs a
// veneer.
let offset_to_edit = r.offset + u64::from(r.reloc.offset);
r.max_jump_distance = offset_to_edit + r.limits().1;
Member


Maybe at least a checked_add here, if not some clamping? A hypothetical machine with a 64-bit branch that can do a relative jump anywhere would overflow otherwise.
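
For example, a sketch of the suggested hardening, clamping rather than wrapping:

```rust
/// Farthest text-section offset this relocation could reach, saturating
/// at `u64::MAX` so a hypothetical ISA with a full 64-bit relative
/// branch can't overflow the arithmetic.
fn max_jump_distance(offset_to_edit: u64, positive_limit: u64) -> u64 {
    offset_to_edit.saturating_add(positive_limit)
}
```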

// for when the constant rodata is separated from the code
// itself. We don't do that, though, so we ignore these
// relocations since the offsets already listed here are already
// correct.
Member


can you add a TODO here referencing one of the remove-old-backends issues/PRs (bytecodealliance/rfcs#12 maybe) so we can grep this when the time comes?

///
/// * Finally a relocation against an unknown symbol may be so far away
/// that if the next symbol is inserted it couldn't reach its
/// destination. In this situation a veneer is generated.
Member


Can you add a reference to the MachBuffer code that does this too? The cases are the same (known/in-bounds, known, out-of-bounds, unknown) and it'd be helpful to cross-reference where else this logic is used :-)

/// This method is the implementation detail of actually modifying code
/// emitted by Cranelift by patching in values to relocations. By doing
/// this at object-assembly time here we can avoid doing this at load-time
/// later, frontloading as much work as possible to make cache loads more
Member


"cache loads" is ambiguous here (my computer-architect mind thinks this means a normal memory load); maybe "to make loading compiled modules from disk more efficient"?

let code = self.text_locations[&r.offset];
let code = &mut self.text_contents[code].0;

match r.reloc.reloc {
Member


There is some code in the runtime that applies relocations as well; can we consolidate the two implementations?

@alexcrichton
Member Author

Ok, talked with @cfallin offline about merging all this into MachBuffer, and after some fiddling I think it's going to work (got the test suite passing locally on arm64 at least). I'm gonna close this and reopen with a cleaned-up history, a rebase, and a new commit message.
