
ARM64 Indirect call produces redundant address load for R2R #35108

Open
kunalspathak opened this issue Apr 17, 2020 · 21 comments
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Member

kunalspathak commented Apr 17, 2020

While generating an indirect call for R2R, I noticed that we generate redundant indirection cell address loads into x11 as well as x4, and end up using the one in x4. In dotnet/coreclr#5020 we fixed x11 as the register used to store the indirection cell address, but in dotnet/coreclr#16171 we moved this to the morph phase. However, with that change we end up creating a second, redundant adrp/add pair inside lowering that uses the x4 register.

        9000000B          adrp    x11, [RELOC #0x20a0e212c50]
        9100016B          add     x11, x11, #0
        90000004          adrp    x4, [RELOC #0x20a0e212c50]
        91000084          add     x4, x4, #0
        F9400084          ldr     x4, [x4]
        D63F0080          blr     x4

We need to investigate why x4 is needed and, if it is, simplify the second load to mov x4, x11.
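
The redundancy above can be spotted mechanically. Below is a minimal sketch of such a duplicate-load scan, assuming a simplified decoded-instruction representation; `Insn` and `find_duplicate_cell_loads` are hypothetical illustration, not JIT code:

```python
from collections import namedtuple

# Hypothetical simplified representation of a decoded instruction.
Insn = namedtuple("Insn", ["mnemonic", "dest", "target"])

def find_duplicate_cell_loads(insns):
    """Return indices of adrp/add pairs that materialize a cell address
    already held in another register (candidates for a mov instead)."""
    seen = {}       # reloc target -> register already holding it
    redundant = []  # indices of each duplicate adrp instruction
    i = 0
    while i < len(insns) - 1:
        a, b = insns[i], insns[i + 1]
        if a.mnemonic == "adrp" and b.mnemonic == "add" and a.dest == b.dest:
            if a.target in seen:
                redundant.append(i)  # could become: mov a.dest, seen[a.target]
            else:
                seen[a.target] = a.dest
            i += 2
            continue
        i += 1
    return redundant

seq = [
    Insn("adrp", "x11", 0x20A0E212C50), Insn("add", "x11", 0x20A0E212C50),
    Insn("adrp", "x4",  0x20A0E212C50), Insn("add", "x4",  0x20A0E212C50),
    Insn("ldr",  "x4",  None),          Insn("blr", "x4",  None),
]
print(find_duplicate_cell_loads(seq))  # → [2]
```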

category:cq
theme:ready-to-run
skill-level:intermediate
cost:large

@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 17, 2020
@kunalspathak
Member Author

@kunalspathak kunalspathak added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 17, 2020
@BruceForstall
Member

Well, we certainly need x4 in this example because it is an available register that is used to compute the branch target address. It appears that the non-standard argument register x11 is used by R2R to identify the indirection cell address to the call stub.

Given that x11 and x4 here are loaded with the same address, we should be able to generate:

adrp    x11, [RELOC #0x20a0e212c50]
add     x11, x11, #0
ldr     x4, [x11] // re-use x11 here as stub address
blr     x4

@BruceForstall
Member

@jkotas @fadimounir When looking at R2R codegen on arm64, we see the inefficient pattern above for calls through an indirection cell. On x64, the code would be simply:

call [RELOC #0x20a0e212c50]

I'm not sure what is actually going on here, but I presume this indirection cell initially points to some kind of stub, and after the actual target is determined, the indirection cell is fixed up to the actual target address. In the x64 case, then, the steady state overhead is just an indirect call instead of a direct call. On arm64, however, the steady state also includes setting up the x11 argument, which will not be used. (In my optimized example, it would still be used to load the indirection cell value, and we would always need something like this.)

Is this the optimal way to generate this code?

We could instead generate the LDR (literal) instruction, e.g.:

ldr     x4, RELOC #0x20a0e212c50
blr     x4

but that only has a +/-1MB PC-relative range, and wouldn't allow setting x11.
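
The ±1MB figure comes from LDR (literal) encoding a signed 19-bit word offset. A sketch of the reachability test (the helper name is made up for illustration):

```python
def ldr_literal_reachable(pc, literal_addr):
    """LDR (literal) has a signed 19-bit immediate scaled by 4, giving a
    PC-relative range of roughly +/-1MB: offsets in [-2**20, 2**20 - 4]."""
    offset = literal_addr - pc
    return offset % 4 == 0 and -(1 << 20) <= offset <= (1 << 20) - 4

print(ldr_literal_reachable(0x10000, 0x10000 + (1 << 20) - 4))  # → True
print(ldr_literal_reachable(0x10000, 0x10000 + (1 << 20)))      # → False
```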

Possibly we could change the stub to include the stub address? (Or perhaps the stub is shared?)

@jkotas
Member

jkotas commented Apr 17, 2020

The methods have to be called via indirection cells in R2R images. (We may be able to make direct calls in some cases as an optimization. We do not have that optimization today for R2R, on any platform - that would be a separate discussion.)

The indirection cell points to a delay load stub initially. The delay load stub worker method needs the address of the indirection cell to figure out the target method and to update the indirection cell with the address of the target method.
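
The delay-load behavior described here can be modeled in a few lines. This is a sketch under simplifying assumptions (a cell is just a mutable slot; all names are illustrative, not runtime APIs):

```python
class IndirectionCell:
    """Models an R2R indirection cell: it initially points at a delay-load
    stub; the first call resolves the target and patches the cell."""
    def __init__(self, resolve):
        self._resolve = resolve      # maps cell -> real target (the runtime's job)
        self.slot = self._delay_load_stub

    def _delay_load_stub(self, *args):
        # The stub worker needs this cell's identity (passed in x11 on arm64)
        # both to find the target method and to patch this very cell.
        target = self._resolve(self)
        self.slot = target           # patch: later calls go straight through
        return target(*args)

    def call(self, *args):           # models: ldr x4, [cell]; blr x4
        return self.slot(*args)

def target_method(x):
    return x * 2

cell = IndirectionCell(lambda cell: target_method)
print(cell.call(21))                 # first call goes via the stub → 42
print(cell.slot is target_method)    # cell is now patched → True
```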

On x64, we get the address of the indirection cell by disassembling the callsite. It works fine because of the x64 encoding details (note that there are details to consider like how break points injected by debugger are encoded, etc.).

On arm/arm64, we require the address of the indirection cell to be passed in a fixed register. You are right that we could move the loading of the address from the code stream into the non-shared portion of the stub. I am not sure whether it would actually save anything interesting. It would likely just move the code from one place in the R2R image to a different place in the R2R image and/or regress the steady state performance (depends on the details).

Your suggestion in #35108 (comment) sounds good as the first step.

@BruceForstall BruceForstall added this to the Future milestone Apr 20, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 20, 2020
@kunalspathak
Member Author

I collected the number of such adrp/add groups in the framework libraries and found approximately 615,732 such groups across 126,823 methods.

@BruceForstall
Member

The current code sequence, using an adrp/add pair, has a 32-bit PC-relative range, which presumably means that all the indirection cells can be packed together in the PE file. Is that important?

If we instead chose to use the LDR (literal) instruction, with only a +/-1MB range, presumably we would have to move the cells into various sections "close to" their calling functions, or even make them per-calling-function (which means they wouldn't be shared; presumably that's a loss?). However, the VM could then use the same idea as x64: disassemble the LDR instruction using the return address in the lr register.

We would presumably need an escape hatch if the call was within a function that itself was >1MB, so it couldn't even reach outside of its own code section (such as hiding it somewhere in the code, or bailing the compilation and retrying but disallowing the "short" form of the instruction pattern).

@BruceForstall
Member

While we should start by removing the duplication of setting up the indirection cell address, this pattern is so common it seems like we should consider how to optimize it even further.

@BruceForstall
Member

For example, you could imagine a case where there are a lot of calls to different functions with similar indirection cell addresses, where the indirection cells are packed together. In that case, a single adrp/add could establish a "base pointer", and a single "add" to this base pointer could compute the indirection cell address (instead of a new adrp/add pair). This would require "relocs" for these adds, a decision about whether maintaining a "base pointer" was profitable, and an "escape hatch" if all the targets couldn't be reached from the one (or more) base pointer(s). So yes, it could be complex.
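
The profitability decision could be estimated offline. A hypothetical sketch that counts how many adrp instructions a per-4KB-page base register would eliminate (function name and model are assumptions for illustration):

```python
from collections import Counter

def adrp_savings(cell_addresses):
    """Each adrp materializes a 4KB page base. If the cells a method
    references are packed together, one base register per distinct page
    suffices, and every further reference needs only an add (which would
    require a new reloc kind, as noted above)."""
    pages = Counter(addr >> 12 for addr in cell_addresses)
    total = len(cell_addresses)   # today: one adrp per reference
    needed = len(pages)           # with a base pointer: one adrp per page
    return total - needed

cells = [0x212C50, 0x212C58, 0x212C60, 0x313000]
print(adrp_savings(cells))  # 4 references on 2 pages → 2 adrp instructions saved
```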

@jkotas
Member

jkotas commented Apr 22, 2020

this pattern is so common it seems like we should consider how to optimize it even further.

Definitely agree.

all the indirection cells can be packed together in the PE file. Is that important?

This packing is important for some types of cells (e.g. string literals - to avoid duplicates). It is not as important for other types of cells. Also, this packing makes the PE file algorithm simple.

move them into various sections

I expect that the number of sections in the PE image is capped to something relatively small. We may start hitting against this limit if we start generating 2 sections per 1MB of binary. This is really only a problem on Windows where we use the OS loader. It is not a problem on Linux where we have our own loader today.

a single adrp/add could establish a "base pointer", and a single "add" to this base pointer could compute the indirection cell address (instead of a new adrp/add pair)

This idea sounds like the most scalable one to me. It does not have corner cases where it breaks down.

I would look into these ideas on top of crossgen2 only. I do not think we would want to spend time to teach crossgen about this.

cc @dotnet/crossgen-contrib

@BruceForstall
Member

Another idea: are there different classes of indirection cells that could be handled differently? E.g., it looks like the write barrier helpers also use the same code pattern (without the x11 setup):

add     x14, x0, #48 // arg 0 to CORINFO_HELP_ASSIGN_REF
mov     x15, x1 // arg 2 to CORINFO_HELP_ASSIGN_REF
adrp    x12, [RELOC #0x1f40fb96500] // indirection cell address high for CORINFO_HELP_ASSIGN_REF
add     x12, x12, #0 // indirection cell address low for CORINFO_HELP_ASSIGN_REF
ldr     x12, [x12] // load actual address
blr     x12 // branch

We'd have to look how frequent this is, but perhaps JIT helpers could get special treatment, like using the LDR (literal) encoding, if possible.

@trylek
Member

trylek commented Apr 22, 2020

R2RDump already parses the indirection cells; it should be trivial to adapt for producing summaries of cell counts per helper or whatever to estimate expected impact of the proposed changes on crossgened framework assemblies or any other workloads.

@BruceForstall
Member

@jkotas writes:

The methods have to be called via indirection cells in R2R images. (We may be able to direct calls in some cases as an optimization. We do not have that optimization today for R2R, on any platform - that would be a separate discussion.)

I'd like to understand this better. Consider that the current best case for an arm64 indirect call through an indirection cell at a call site is:

adrp    x12, [HIGH RELOC #0x1f40fb96500] // indirection cell address high
add     x12, x12,  [LOW RELOC #0x1f40fb96500] // indirection cell address low
ldr     x12, [x12] // load actual address
blr     x12 // branch

If we optimize this using LDR (literal), we get:

ldr     x4, RELOC #0x20a0e212c50
blr     x4

But that has the significant downside of a short +/-1MB PC-relative range to the indirection cell. And if we still need to set up the x11 register it wouldn't work.

If we used direct call, we would have a single instruction:

bl RELOC #0x1f40fb96500

This has a PC-relative range of +/-128MB. This would cover almost all (if not all) PE files, and of course could use jump stubs to reach farther.

The target address could be a patch-able stub address that starts as a branch to a fixup code and is fixed up to be a direct branch to the target address, either:

b RELOC #0x1f40fb96500

with a +/-128MB range, with the possibility of jump stubs to reach farther, or

ldr xip0, label
br xip0
label: // 8 byte address of actual target.

Using this scheme, call sites go from 4 instructions to 1, and the "indirection cell" goes from 8 bytes to 4 (assuming +/- 128MB range is the common case; larger if using the second form or jump stubs). The "steady state" path changes from 4 instructions and a single indirect branch to 2 instructions: branch to branch, with presumably both branches perfectly predicted.
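
The per-call-site savings quoted above can be checked with simple arithmetic, assuming 4-byte instructions and the common +/-128MB case:

```python
INSN = 4  # bytes per arm64 instruction

# Today: adrp + add + ldr + blr at the call site, plus an 8-byte data cell.
current_site, current_cell = 4 * INSN, 8
# Proposed: a single bl, with the "cell" shrinking to one 4-byte b instruction.
proposed_site, proposed_cell = 1 * INSN, 1 * INSN

saved = (current_site + current_cell) - (proposed_site + proposed_cell)
print(saved)  # → 16 bytes saved per call site in the common case
```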

Presumably the address of the "indirection cell" could be deduced by disassembling the call site bl instruction using the lr return value, as is done on x64. Or, the indirection cell could be expanded to 8 bytes as:

label:
adr xip0, label
b RELOC #0x1f40fb96500

And then the indirection cell address stored directly in the xip0 register before branching to the stub.

Comments?

@jkotas
Member

jkotas commented Apr 22, 2020

The target address could be a patch-able stub

We have stopped doing patchable code inside binaries and do not want to go back.

  • Patchable code inside binaries is an SDL rules violation. We would need to get an exception for it that we are not going to get.
  • It is problematic on locked down environments (e.g. the latest macOS version started disallowing it - you cannot patch memory mapped code, you have to copy it out somewhere else first). This is only going to get worse over time.

If we used direct call, we would have a single instruction:

Yes, we can do a direct call to an indirect jump to trade throughput for code size. We should already be doing that for the cold JIT helpers (e.g. IL_THROW). We can do it for regular calls too. I think we would need the JIT to tell us whether it wants the target for a small call site or a fast call site to do this well.

@BruceForstall
Copy link
Member

cc @AndyAyersMS

@AntonLapounov
Member

I collected no. of such adrp/add groups in framework libraries and found approx. 615732 such groups in 126823 methods.

What percent of the final binary size is that? How many different 4K pages are we using? I am wondering whether a simple heuristic, such as sorting indirection cells by popularity, might help to CSE some of the ADRP instructions.

@kunalspathak
Member Author

What percent of the final binary size is that?

~25%. Fixing what Bruce mentioned in #35108 (comment) will give us back 10%.
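
For scale, the duplicated adrp/add pair costs 8 bytes per occurrence, so the 615,732 groups counted above put a rough upper bound on the pure duplication (assuming each group carries one removable pair):

```python
GROUPS = 615_732        # adrp/add groups counted in the framework libraries
DUP_PAIR_BYTES = 2 * 4  # one redundant adrp + add, 4 bytes each

total_bytes = GROUPS * DUP_PAIR_BYTES
print(total_bytes)           # → 4925856 bytes
print(total_bytes / 2**20)   # roughly 4.7 MiB of duplication, as an upper bound
```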

@BruceForstall
Member

The way arm64 R2R indirect calls are encoded is by far (according to @kunalspathak 's latest data) the largest contributor to arm64 generated code being bigger than x64. A back-of-the-envelope calculation showed that if we could convert them all to direct calls (along with a number of much smaller, mostly peephole-style optimizations), arm64 code expansion compared to x64 would drop from 1.74x to 1.25x for R2R code.

@davidwrighton
Member

@jkotas

From my recollection the Windows PE loader should be able to tolerate high numbers of sections, so I don't think we should throw out the idea of using the

ldr     x4, RELOC #0x20a0e212c50
blr     x4

instruction pair as unworkable for R2R fixups on Arm64. From a search, I found two interesting links.

https://docs.microsoft.com/en-us/windows/win32/debug/pe-format
and
https://stackoverflow.com/questions/17466916/whats-the-maximum-number-of-sections-a-pe-can-have

The first documents that the OS loader does not support more than 96 sections; the second is a Stack Overflow discussion describing that limit as having disappeared as of Windows Vista, when it became 65,536. We may be able to rely on more sections being loadable.

@BruceForstall
Member

To be clear, when I used the word "section" above, I wasn't referring to PE file sections, although perhaps we would choose (or be required) to use PE file sections to implement the solution, if necessary. I can see that being necessary if we needed to interleave writable data and non-writable executable code sections (e.g., code, indirection cell data, code, indirection cell data, etc.).

@davidwrighton
Member

Yes, if we need to interleave code with fixups, we're speaking of PE sections that would need interleaving, not anything internal to our compiler. It's unfortunate, but certainly possible.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020