Use relative call instructions between wasm functions #3254
Conversation
I'm not overly happy with the number of tests I have for this, but I don't know how to otherwise exercise it more. One test I'd like to add is a sort of binary search that tries to stress the logic around precise sizes and when veneers are inserted, but I couldn't figure out a good way to test the binary search, e.g. somehow read out whether a veneer was inserted or not. @cfallin if you've got ideas of how to more thoroughly test this I'd love to implement them.
@alexcrichton Thanks -- this looks like a bit of a monster and I will dive in a bit later, but on reading the PR description only, one thing strikes me: it should be possible to test the veneer insertion without generating tons of padding by forcing an "always use veneers" mode, no? I'm not sure if the plumbing is there to take Cranelift options in …
I just had a quick look at the AArch64 bits. I have two general comments, which don't need to be addressed in this PR. First of all, we could extend the calls to an 8 GB range (+/- 4 GB) by using an `ADRP` + `ADD` + `BLR` sequence. My second comment is more of an open question: have we ever thought about using code memory pages that are executable, but not readable, as additional security hardening? Of course, that would preclude the usage of any kind of inline literals and/or pools.
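For illustration, here is a minimal sketch of the immediate computation behind such a sequence; the function name and parameters are hypothetical, not Cranelift API:

```rust
/// Hypothetical helper: compute the immediates for an `ADRP` + `ADD` +
/// `BLR` call sequence from `from` (the address of the ADRP instruction)
/// to `to` (the call target).
fn adrp_add_blr_immediates(from: u64, to: u64) -> Option<(i64, u16)> {
    // ADRP adds a signed 21-bit page count (in units of 4 KiB pages) to
    // the PC with its low 12 bits cleared, giving a +/- 4 GB reach.
    let page_delta = (to >> 12) as i64 - (from >> 12) as i64;
    if page_delta < -(1 << 20) || page_delta >= (1 << 20) {
        return None; // target page out of ADRP range
    }
    // ADD then supplies the low 12 bits of the target address:
    //   adrp xN, <page_delta>  ; xN = 4 KiB page containing the target
    //   add  xN, xN, #<lo12>   ; xN = exact target address
    //   blr  xN                ; indirect call through xN
    let lo12 = (to & 0xfff) as u16;
    Some((page_delta, lo12))
}
```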
Do you know if this is common in normal AArch64 compilers? (e.g. gcc or llvm). Do you know if this is better than our current strategy of load-a-literal-and-call-it? (which we could presumably change to being a relative literal if we wanted)
Ooh interesting! I always thought the execute permission implied the read permission, but if we could prevent reading that'd be pretty nifty! I'm not sure how feasible it would be at the Cranelift level, but we could try to shove information like that into the VMContext; that's pretty Wasmtime-specific, though. (I don't know if others have pondered this myself.)
It's definitely possible, but it would take some additional work with regard to linking: specifically we would need relocations that refer to … There's the additional complication that on RISC-ish architectures like aarch64 there are limits on the distance to literal-constant loads (…). That said, other platforms do W^X pretty successfully (OpenBSD by default, at least) and it'd be a cool mitigation to say that we can do it too :-)
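As a rough illustration of the idea being discussed, here is a minimal sketch (assuming a Unix-like OS and the `libc` crate) of mapping code into an execute-only page; whether `PROT_EXEC` without `PROT_READ` is actually enforced depends on the kernel and hardware:

```rust
use std::ptr;

/// Sketch only: copy `code` into a fresh mapping, then make it
/// execute-only. Enforcement of exec-without-read is platform-dependent.
unsafe fn map_execute_only(code: &[u8]) -> *const u8 {
    let len = code.len().next_multiple_of(4096);
    let page = libc::mmap(
        ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(page, libc::MAP_FAILED);
    ptr::copy_nonoverlapping(code.as_ptr(), page.cast::<u8>(), code.len());
    // Drop both read and write permission; keep only execute.
    assert_eq!(libc::mprotect(page, len, libc::PROT_EXEC), 0);
    page.cast::<u8>()
}
```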
I honestly don't know - I just wanted to mention the possibility.
I believe so. While the veneers avoid the worst aspect of that approach (the need to have a short jump over the literal, as we do for SIMD & FP literals right now), …
If I am not mistaken, not in the 64-bit Arm architecture.
Yes, but in the AArch64 case we could apply a similar trick to get some extra breathing room, i.e. an …
crates/cranelift/src/obj.rs (outdated):

```rust
/// Helper function exclusively for tests to increase padding between
/// functions to test the veneer insertion logic in this file.
pub fn append_synthetic_padding(&mut self, amt: usize) {
    self.push_code(vec![0; amt], true);
}
```
Maybe make this `ud2` on x86?
Only intended to be used during testing/debugging. This also hooks up the option to fuzzers.
Force-pushed from 97d9699 to 63c0e77.
Today the … (wasmtime/cranelift/jit/src/backend.rs, lines 916 to 919 in e6f3994). By the way, it isn't emitted by any of the new-style backends.
I don't know the current status of this, but at one point, the way Cranelift was embedded in SpiderMonkey, SpiderMonkey would prepend and append instructions to Cranelift's output, and we needed the ability to move the constant pool rodata around to make room. |
Don't push raw data onto the text section; we have to account for it separately. Add a debug assertion to catch bugs like this in the future.
Hopefully can remove all this with gimli-rs/object#367
This reverts commit cb119a6.
This is great work!
I've left a bunch of comments below, but the main thing that is still sticking out to me is the partial duplication between this and the intra-function code handling all the same problems in `MachBuffer`. It doesn't quite sit right with me that we duplicate a lot of the logic for island insertion, deadline computation, creating veneers and editing code, just because there is some more implementation (e.g. the 26->32 bit case) needed in the existing code. I'm worried about the technical debt this creates: "we'll consolidate someday" means approximately never, given all the other work on our plates, and now we have one more place where we reason about branches and relocations.
I hate to say all that after you've already done the work down this path, but I think that if we do try to consolidate now instead, a lot of what you've done is reusable -- specifically the shims here are longer-distance kinds of `LabelUse`s, in `MachBuffer` terminology.
I'm not sure I fully understand the other issue you describe with `MachBuffer` re: island insertion logic -- it should be possible to either promote incrementally or add some info as we emit the branch that we should assume worst-case right away.
There's possibly more context here that I've already evicted from L1 since our talk last week -- sorry, very scattered week for me! -- and I'm very happy to talk through all the above to understand more what you've found! At the least, IMHO, we should document better in a code-comment somewhere how this implementation differs from the intra-function case and why it needs its own same-but-slightly-different implementation.
```diff
@@ -425,6 +425,34 @@ pub trait MachBackend {
     fn map_reg_to_dwarf(&self, _: Reg) -> Result<u16, RegisterMappingError> {
         Err(RegisterMappingError::UnsupportedArchitecture)
     }
+
+    /// Generates a "veneer" which is used when a relative call instruction
```
Two thoughts on this new addition to the backend trait:

- I think there might be an opportunity to merge this per-backend behavior into the `LabelUse` trait already defined by each backend. The generated trampoline code is playing almost the same role -- extending the range of a branch and allowing a shorter-range reloc to be converted to a longer-range one -- except that, unlike intra-function veneers, it also assumes the use of various registers. That last bit is a key difference. In the aarch64 case `x16` and `x17` are used internally to instruction-lowering sequences but never across basic blocks, so this is fine; but `r10` and `r11` on x86-64 will potentially be used by regalloc, so we wouldn't want to blindly insert this as another type of veneer in the existing trait. So we'd want to add some parameters to the `supports_veneer`, `veneer_size` and `generate_veneer` functions to indicate what kind of veneer ("next step up in range" or "absolute", maybe) and whether the use of regs as per the ABI for inter-function transfers is allowed (see the sketch after this list). Whatever we do, it strikes me that the duplication here ("veneers" in two places, with similar APIs) is likely to confuse others, so we should somehow merge them or distinguish them better. Furthermore, if we're going to have low-level understanding of branches (e.g. embedded machine-code bits in a sequence to emit), we should have that in as few places as possible.
- I am wondering if there is a better name than "veneer", if we don't merge; maybe "trampoline" or "linkage stub"?
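A rough sketch of the parameterization suggested above; the `VeneerKind` enum, trait name, and signatures are hypothetical, not existing Cranelift API:

```rust
/// Hypothetical: what kind of veneer is being requested.
enum VeneerKind {
    /// Promote to the next-larger branch range (e.g. 19-bit -> 26-bit).
    NextStepUp,
    /// A veneer that can reach anywhere in the address space.
    Absolute,
}

/// Hypothetical extension of the per-backend veneer methods discussed
/// above; `may_use_abi_scratch_regs` says whether registers that are
/// free at inter-function call boundaries (but not mid-function) may
/// be clobbered.
trait VeneerBackend {
    fn supports_veneer(&self, kind: VeneerKind, may_use_abi_scratch_regs: bool) -> bool;
    fn veneer_size(&self, kind: VeneerKind) -> u32;
    /// Emits the veneer into `buf`, returning the offset of the
    /// relocation to patch with the final target later.
    fn generate_veneer(&self, kind: VeneerKind, buf: &mut Vec<u8>) -> u32;
}
```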
This function could be useful for cranelift-jit too. It also needs a version that loads the address from memory at a specific location, like ELF PLTs. Currently there is only an x86_64 version of that latter function.
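For reference, a minimal sketch of such a PLT-style stub on x86_64; the function and its parameters are hypothetical, only the `ff 25` encoding for `jmp qword ptr [rip+rel32]` is standard:

```rust
/// Hypothetical helper: emit a PLT-style stub at `stub_addr` that jumps
/// indirectly through the 8-byte address slot at `slot_addr`.
fn emit_plt_style_stub(stub_addr: u64, slot_addr: u64, out: &mut Vec<u8>) {
    // ff 25 <rel32> == jmp qword ptr [rip + rel32]
    out.extend_from_slice(&[0xff, 0x25]);
    // rel32 is relative to the end of this 6-byte instruction.
    let rel = slot_addr.wrapping_sub(stub_addr + 6) as i64;
    assert!(i32::try_from(rel).is_ok(), "slot out of +/- 2GB range");
    out.extend_from_slice(&(rel as i32).to_le_bytes());
}
```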
```rust
/// A debug-only setting used to synthetically insert 0-byte padding between
/// compiled functions to simulate huge compiled artifacts and exercise
/// logic related to jump veneers.
pub padding_between_functions: usize,
```
Do we still need this if we have the option below (force veneers) as well?
I'm going to leave this in for now because I think it's useful to exercise the veneers in real-world situations where they're actually needed. Otherwise I'd be worried that the logic for actually inserting veneers was only correct when the forcing was turned on. This way there's some exercising of the actual "ok yes we need that veneer" logic as well. (it's also relatively easy to support).
Makes sense!
Would it make sense to use it only in tests with the aarch64 backend (could be instantiated explicitly, doesn't need to run on aarch64 host), where the threshold for inserting an island is much lower, so we don't have the overhead of 2GiB object files in tests on x86?
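For context, instantiating the aarch64 backend explicitly on any host is indeed possible; a minimal sketch using Cranelift's public API (note the exact `finish` signature varies between Cranelift versions):

```rust
use cranelift_codegen::isa;
use cranelift_codegen::settings::{self, Configurable};

// Build an aarch64 TargetIsa on any host, e.g. inside a test running on x86.
fn aarch64_isa_for_tests() -> Box<dyn isa::TargetIsa> {
    let mut flag_builder = settings::builder();
    flag_builder.set("opt_level", "none").unwrap();
    let triple = "aarch64-unknown-linux-gnu".parse().unwrap();
    let isa_builder = isa::lookup(triple).unwrap();
    isa_builder.finish(settings::Flags::new(flag_builder))
}
```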
```rust
pub struct ObjectBuilder<'a> {
    /// The target that we're compiling for, used to query target-specific
    /// information as necessary.
    isa: &'a dyn TargetIsa,
```
I see here why you had to add methods to the `TargetIsa` to get at the veneer-generation implementation; the above comments re: making use of `LabelUse` still apply, and then maybe we could bounce through the `TargetIsa` to use those.
```rust
    ///
    /// The second element of the pair here is the desired alignment of the
    /// function body.
    text_contents: Vec<(Vec<u8>, u64)>,
```
Readability request: a little struct in place of this tuple, so we can write e.g. `TextSegmentFunction { body_bytes, alignment }` or something like that?
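Spelling out that suggestion, a sketch of the proposed struct (name and fields taken from the comment above):

```rust
/// Replacement for the `(Vec<u8>, u64)` tuple in `text_contents`.
struct TextSegmentFunction {
    /// The compiled body of the function, placed in the text section.
    body_bytes: Vec<u8>,
    /// Desired alignment, in bytes, of the function body.
    alignment: u64,
}
```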
```rust
    /// Map from text section offset, `u64`, to the index in `text_contents`
    /// where those contents live.
    text_locations: HashMap<u64, usize>,
```
For maps from one index space to another I like the `X_to_Y` naming idiom to make it clear what the index and value are; maybe `text_offset_to_text_contents_index`? Verbose, I know, but getting index spaces crossed is a potent footgun...
```rust
// every pending relocation pessimistically needs a
// veneer.
let offset_to_edit = r.offset + u64::from(r.reloc.offset);
r.max_jump_distance = offset_to_edit + r.limits().1;
```
Maybe at least a `checked_add` here, if not some clamping? A hypothetical machine with a 64-bit branch that can do a relative jump anywhere would overflow otherwise.
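A minimal sketch of that suggestion, reusing the fields from the snippet above and clamping via `saturating_add`:

```rust
// Clamp instead of overflowing when a branch kind's range covers
// (nearly) the whole 64-bit address space.
r.max_jump_distance = offset_to_edit.saturating_add(r.limits().1);
```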
```rust
// for when the constant rodata is separated from the code
// itself. We don't do that, though, so we ignore these
// relocations since the offsets listed here are already
// correct.
```
Can you add a TODO here referencing one of the remove-old-backends issues/PRs (bytecodealliance/rfcs#12 maybe) so we can grep for this when the time comes?
```rust
///
/// * Finally a relocation against an unknown symbol may be so far away
///   that if the next symbol is inserted it couldn't reach its
///   destination. In this situation a veneer is generated.
```
Can you add a reference to the `MachBuffer` code that does this too? The cases are the same (known/in-bounds, known/out-of-bounds, unknown) and it'd be helpful to cross-reference where else this logic is used :-)
```rust
/// This method is the implementation detail of actually modifying code
/// emitted by Cranelift by patching in values to relocations. By doing
/// this at object-assembly time here we can avoid doing this at load-time
/// later, frontloading as much work as possible to make cache loads more
```
"cache loads" is ambiguous here (my computer-architect mind thinks this means a normal memory load); maybe "to make loading compiled modules from disk more efficient"?
```rust
let code = self.text_locations[&r.offset];
let code = &mut self.text_contents[code].0;

match r.reloc.reloc {
```
There is some code in the runtime that applies relocations as well; can we consolidate the two implementations?
Ok, talked with @cfallin offline about merging all this into …
This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:
* Function calls are always indirect, which means they are more difficult to predict, resulting in worse performance.
* Generating a relocation per function call requires expensive relocation resolution at module-load time, which can be a large contributing factor to how long it takes to load a precompiled module.
To fix these issues, while also compromising on the previously simple implementation technique, this commit switches wasm calls within a module to using the `colocated` flag enabled in Cranelift-speak, which basically means that a relative call instruction is used with a relocation that's resolved relative to the pc of the call instruction itself.

When switching the `colocated` flag to `true` this commit is also then able to move much of the relocation resolution from `wasmtime_jit::link` into `wasmtime_cranelift::obj` at object-construction time. This frontloads all relocation work, which means that there are actually no relocations related to function calls in the final image, solving both of our points above.
The main gotcha in implementing this technique is that there are hardware limitations on relative function calls which mean we can't simply use them blindly. AArch64, for example, can only go +/- 64 MB from the `bl` instruction to the target, which means that if the function we're calling is a greater distance away then we would fail to resolve that relocation. On x86_64 the limit is +/- 2GB, which is much larger, but theoretically still feasible to hit. Consequently the main increase in implementation complexity is fixing this issue.
This issue is actually already present in Cranelift itself, and is internally one of the invariants handled by the `MachBuffer` type: when generating a function, relative jumps between basic blocks have similar restrictions. At this time, however, I decided not to try to reuse `MachBuffer` for inter-function relocations. The main reason for this is that `MachBuffer` doesn't actually handle any of the cases that inter-function relocations need to handle, namely the 26-bit relocations of AArch64 and the 32-bit relocations of x86_64. If a 26-bit relocation isn't resolvable because a function gets too large then `MachBuffer` will simply panic. I also opted not to use `MachBuffer` for now because it doesn't quite have the infrastructure set up where, when inserting an island for a smaller jump sequence, the result should be a relocation to fill in later. Today it simply promotes a 19-bit branch on AArch64 to a 26-bit branch, assuming the 26 bits is enough. All in all I felt that at this time `MachBuffer` wasn't reusable enough, and would need enough work that it was probably more worthwhile to do this in `wasmtime-cranelift` first and look to unify the strategies in the future.
For these reasons the `wasmtime_cranelift::obj` module has grown in complexity quite a bit. It now entirely handles relative relocations between functions, automatically inserting "jump veneers" to resolve calls between functions that are too far away. The general strategy is similar to `MachBuffer`'s, but different in the final tracking of relocations. The current assumption is that each `TargetIsa` has the ability to generate a fixed "jump veneer" which internally has an 8-byte immediate value that is added to its own address to reach the destination of an indirect call. This relative jump, even in the veneer, means that when veneers are used we still don't need absolute relocations, since the code is still all implemented as relative jumps. The veneer jumps, however, are larger in code size because their targets don't fit into the native call instructions.
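To make the veneer scheme concrete, here is a minimal sketch (not Wasmtime's actual code; the layout and names are assumptions) of patching a veneer's 8-byte self-relative immediate at object-construction time:

```rust
/// Hypothetical layout: the veneer's machine code loads the 8 bytes at
/// `veneer_addr + imm_offset`, adds that value to the immediate's own
/// address, and jumps there. Patching it is pure arithmetic, so no
/// relocation survives into the final image.
fn patch_veneer_immediate(
    veneer: &mut [u8],
    imm_offset: usize, // where the 8-byte immediate sits inside the veneer
    veneer_addr: u64,  // text-section address of the veneer
    target_addr: u64,  // text-section address of the callee
) {
    let imm_addr = veneer_addr + imm_offset as u64;
    // Self-relative: stored value + its own address == target.
    let rel = target_addr.wrapping_sub(imm_addr);
    veneer[imm_offset..imm_offset + 8].copy_from_slice(&rel.to_le_bytes());
}
```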
I've added some simple testing of this for now. A synthetic compiler option was created to simply add zero-filled padding between functions, and test cases implement various forms of calls that at least need veneers. A test is also included for x86_64, but it is unfortunately pretty slow because it requires generating 2GB of output. I'm hoping for now it's not too bad, but we can disable the test if it's prohibitive, and otherwise just leave comments on the necessary portions to be sure to run the ignored test if these parts of the code have changed.
The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet, but this is the last remaining source of significant slowdown when loading modules, after I land a number of other PRs, both active ones and ones that I only have locally for now.
cc #3230