ARM64 Indirect call produces redundant address load for R2R #35108
Well, we certainly need … Given that …
@jkotas @fadimounir When looking at R2R codegen on arm64, we see the inefficient pattern described in this issue for calls through an indirection cell. On x64, the code would simply be:
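A sketch of the x64 form (the cell symbol is hypothetical):

```asm
call    qword ptr [rip + indirection_cell]  ; one RIP-relative indirect call
```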
I'm not sure what is actually going on here, but I presume this indirection cell initially points to some kind of stub, and after the actual target is determined, the indirection cell is fixed up with the actual target address. In the x64 case, then, the steady-state overhead is just an indirect call instead of a direct call. On arm64, however, the steady state also includes setting up the `x11` argument, which will not be used. (In my optimized example, it would still be used to load the indirection cell value, and we would always need something like this.) Is this the optimal way to generate this code? We could instead generate a PC-relative `ldr` (literal) of the indirection cell, for example:
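A sketch of that alternative:

```asm
ldr     x4, indirection_cell    ; ldr (literal): loads the cell contents,
                                ; PC-relative addressing
blr     x4
```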
but that only has a +/-1MB PC-relative range, and wouldn't allow setting `x11`. Possibly we could change the stub to include the stub address? (Or perhaps the stub is shared?)
The methods have to be called via indirection cells in R2R images. (We may be able to use direct calls in some cases as an optimization. We do not have that optimization today for R2R, on any platform; that would be a separate discussion.) The indirection cell points to a delay load stub initially. The delay load stub worker method needs the address of the indirection cell to figure out the target method and to update the indirection cell with the address of the target method. On x64, we get the address of the indirection cell by disassembling the callsite. It works fine because of the x64 encoding details (note that there are details to consider, like how breakpoints injected by the debugger are encoded, etc.). On arm/arm64, we require the address of the indirection cell to be passed in a fixed register. You are right that we could move the loading of the address from the code stream into the non-shared portion of the stub. I am not sure whether it would actually save anything interesting. It would likely just move the code from one place in the R2R image to a different place in the R2R image and/or regress the steady state performance (depends on the details). Your suggestion in #35108 (comment) sounds good as the first step.
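A small sketch of the life cycle just described (names hypothetical):

```asm
indirection_cell:
    .quad   DelayLoadStub       ; initial state; the stub worker patches this
                                ; to the resolved method address on first call

; the stub worker needs the cell's address to do that patching: on arm64 it
; arrives in the fixed register x11; on x64 it is recovered by disassembling
; the callsite
```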
I collected the number of such `adrp`/`add` sequences.
The current code sequence, using an `adrp`/`add` pair, has a 32-bit PC-relative range, which presumably means that all the indirection cells can be packed together in the PE file. Is that important? If we instead chose to use the `ldr` (literal) form, the cells would need to sit within the instruction's +/-1MB range, presumably interleaved with the code. We would presumably need an escape hatch if the call was within a function that itself was >1MB, so that it couldn't even reach outside of its own code section (such as hiding the cell somewhere in the code, or bailing on the compilation and retrying with the "short" form of the instruction pattern disallowed).
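Roughly, the layout the +/-1MB form would force (purely illustrative):

```asm
; [ ~1MB of code ][ indirection cells ][ ~1MB of code ][ indirection cells ] ...
; i.e., on the order of two sections per 1MB of binary, as noted below
```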
While we should start by removing the duplication of setting up the indirection cell address, this pattern is so common that it seems like we should consider how to optimize it even further.
For example, you could imagine a case where there are a lot of calls to different functions with similar indirection cell addresses, where the indirection cells are packed together. In that case, a single `adrp`/`add` could establish a "base pointer", and a single `add` off this base pointer could compute each indirection cell address (instead of a new `adrp`/`add` pair per call), as sketched below. This would require "relocs" for these adds, a decision about whether maintaining a "base pointer" was profitable, and an "escape hatch" if all the targets couldn't be reached from the one (or more) base pointer(s). So yes, it could be complex.
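A sketch of the base-pointer idea, with hypothetical registers and offsets:

```asm
; establish the base once (one adrp/add for the whole cluster of cells)
adrp    x9, cells_base
add     x9, x9, #:lo12:cells_base

; each callsite then needs a single add (carrying its own reloc)
add     x11, x9, #0x88           ; x11 = &cell_for_Foo
ldr     x4, [x11]
blr     x4

add     x11, x9, #0x90           ; x11 = &cell_for_Bar
ldr     x4, [x11]
blr     x4
```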
Definitely agree.
This packing is important for some types of cells (e.g. string literals, to avoid duplicates). It is not as important for other types of cells. Also, this packing keeps the PE file layout algorithm simple.
I expect that the number of sections in the PE image is capped at something relatively small. We may start hitting that limit if we start generating 2 sections per 1MB of binary. This is really only a problem on Windows, where we use the OS loader. It is not a problem on Linux, where we have our own loader today.
This idea sounds like the most scalable one to me. It does not have the corner cases where the others break down. I would look into these ideas on top of crossgen2 only; I do not think we would want to spend time teaching crossgen about this. cc @dotnet/crossgen-contrib
Another idea: are there different classes of indirection cells that could be handled differently? E.g., it looks like the write barrier helpers also use the same code pattern (without the `x11` setup).
We'd have to look at how frequent this is, but perhaps JIT helpers could get special treatment, like using the `ldr` (literal) form.
R2RDump already parses the indirection cells; it should be trivial to adapt it to produce summaries of cell counts per helper (or similar) to estimate the expected impact of the proposed changes on crossgen'ed framework assemblies or any other workloads.
@jkotas writes:

> It would likely just move the code from one place in the R2R image to a different place in the R2R image and/or regress the steady state performance (depends on the details).
I'd like to understand this better. Consider: the current best case for an arm64 indirect call using an indirection cell, at a call site, is:
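Reconstructed sketch (register and label spelling illustrative); this is the four-instruction sequence the counts below refer to:

```asm
adrp    x11, indirection_cell               ; cell address, page
add     x11, x11, #:lo12:indirection_cell   ; cell address, page offset
ldr     x4, [x11]                           ; load target from the cell
blr     x4                                  ; indirect call
```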
If we optimize this using a PC-relative `ldr` (literal), we get:
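A sketch (the same `ldr` (literal) form suggested earlier):

```asm
ldr     x4, indirection_cell    ; ldr (literal): PC-relative load of the cell
blr     x4
```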
But that has the significant downside of a short +/-1MB PC-relative range to the indirection cell. And if we still need to set up the `x11` argument, much of the benefit is lost. If we used a direct call, we would have a single instruction:
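That is, something like (stub name hypothetical):

```asm
bl      patchable_stub          ; direct call, +/-128MB PC-relative range
```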
This has a PC-relative range of +/-128MB. This would cover almost all (if not all) PE files, and of course jump stubs could be used to reach farther. The target address could be a patchable stub address that starts as a branch to fixup code and is fixed up to be a direct branch to the target address, either:
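a short form (sketch):

```asm
patchable_stub:
    b       actual_target       ; after fixup: a single direct branch
```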
with a +/-128MB range, with the possibility of jump stubs to reach farther, or
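a longer form along these lines (one possible spelling, assuming the full 64-bit target lives in a nearby literal):

```asm
patchable_stub:
    ldr     x16, target_addr    ; load the full 64-bit target
    br      x16
target_addr:
    .quad   0                   ; patched to the resolved target address
```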
Using this scheme, call sites go from 4 instructions to 1, and the "indirection cell" goes from 8 bytes to 4 (assuming the +/-128MB range is the common case; larger if using the second form or jump stubs). The "steady state" path changes from 4 instructions and a single indirect branch to 2 instructions: a branch to a branch, with presumably both branches perfectly predicted. Presumably the address of the "indirection cell" could be deduced by disassembling the `bl` at the call site, and the indirection cell address could then be stored directly in the stub. Comments?
We have stopped doing patchable code inside binaries and do not want to go back.
Yes, we can do a direct call to an indirect jump to trade throughput for code size. We should be doing that for the cold JIT helpers (e.g. IL_THROW) already. We can do it for regular calls too. I think we would need the JIT to tell us whether it wants the small-callsite or the fast-callsite form to do this well.
cc @AndyAyersMS |
What percent of the final binary size is that? How many different 4K pages are we using? I am wondering whether a simple heuristic, such as sorting indirection cells by popularity, might help to CSE some of the `adrp` instructions.
~25%. Fixing what Bruce mentioned in #35108 (comment) will give us back 10%.
The way arm64 R2R indirect calls are encoded is by far (according to @kunalspathak's latest data) the largest contributor to arm64 generated code being bigger than x64. A back-of-the-envelope calculation showed that if we could convert them all to direct calls (along with a number of much smaller, mostly peephole-style optimizations), arm64 code expansion compared to x64 would drop from 1.74x to 1.25x for R2R code.
From my recollection, the Windows PE loader should be able to tolerate high numbers of sections, so I don't think we should throw out the idea of using the `ldr` (literal)/`blr` instruction pair as unworkable for R2R fixups on Arm64. From a search, I found two interesting links. The first, https://docs.microsoft.com/en-us/windows/win32/debug/pe-format, documents that the OS loader will not support more than 96 sections; the second is a Stack Overflow discussion which describes that limit as having disappeared as of Windows Vista, when it became 65536. We may be able to rely on more sections being loadable.
To be clear, when I used the word "section" above, I wasn't referring to PE file sections, although perhaps we would choose (or be required) to use PE file sections to implement the solution. I can see that being necessary if we needed to interleave writable data and non-writable executable code (e.g., code, indirection cell data, code, indirection cell data, etc.).
Yes, if we need to interleave code with fixups, we're speaking of PE sections which would need interleaving, not anything internal to our compiler. It's unfortunate, but certainly possible.
While generating indirect calls for R2R, I have noticed we generate redundant indirection cell address loads into `x11` as well as `x4`, and end up using the one present in `x4`. In dotnet/coreclr#5020 we fixed `x11` as the register that holds the indirection cell address, but in dotnet/coreclr#16171 we moved this to the morph phase. With that, however, we end up creating a second redundant `adrp`/`add` pair inside lowering that uses the `x4` register. We need to investigate why `x4` is needed and, if it is, simplify it to `mov x4, x11`.
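A sketch of the redundant sequence being described (cell symbol hypothetical):

```asm
adrp    x11, indirection_cell               ; x11 = indirection cell address ...
add     x11, x11, #:lo12:indirection_cell
adrp    x4, indirection_cell                ; ... recomputed redundantly into x4
add     x4, x4, #:lo12:indirection_cell
ldr     x4, [x4]                            ; load target from the cell
blr     x4                                  ; only the x4 copy is used
```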
Labels: category:cq, theme:ready-to-run, skill-level:intermediate, cost:large