Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Excessive copies when inlining #4308

Closed
omariom opened this issue Jun 12, 2015 · 12 comments
Closed

JIT: Excessive copies when inlining #4308

omariom opened this issue Jun 12, 2015 · 12 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@omariom
Copy link
Contributor

omariom commented Jun 12, 2015

For the following code:

    static Guid s_dt;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void TestValueTypesInInlinedMethods()
    {
        var dt = new Guid();

        Method1(dt);
    }

    private static void Method1(Guid dt)
    {
        Method2(dt);
    }

    private static void Method2(Guid dt)
    {
        Method3(dt);
    }

    private static void Method3(Guid dt)
    {
        s_dt = dt;
    }

JIT generates these machine instructions:

 sub         rsp,48h  
 xor         eax,eax  
 mov         qword ptr [rsp+38h],rax  
 mov         qword ptr [rsp+40h],rax  
 xor         eax,eax  

 lea         rdx,[rsp+38h]  
 xorpd       xmm0,xmm0  
 movdqu      xmmword ptr [rdx],xmm0  

 movdqu      xmm0,xmmword ptr [rsp+38h]  
 movdqu      xmmword ptr [rsp+28h],xmm0  
 movdqu      xmm0,xmmword ptr [rsp+28h]  
 movdqu      xmmword ptr [rsp+18h],xmm0  
 movdqu      xmm0,xmmword ptr [rsp+18h]  
 movdqu      xmmword ptr [rsp+8],xmm0  

 mov         rax,405B5B0040h  
 mov         rax,qword ptr [rax]  
 add         rax,8  
 movdqu      xmm0,xmmword ptr [rsp+8]  
 movdqu      xmmword ptr [rax],xmm0  
 add         rsp,48h  
 ret  

Though all the methods are inlined and don't change their params, JIT excessively allocates stack memory for each param and copies their values.
Double zeroing is a minor issue.

category:cq
theme:structs
skill-level:expert
cost:large
impact:large

@omariom omariom changed the title JIT: Excessive copies when inlinning JIT: Excessive copies when inlining Jun 12, 2015
@cmckinsey cmckinsey self-assigned this Jun 13, 2015
@cmckinsey
Copy link
Contributor

@omariom This is a known issue and on our list of future code-quality improvements.

The Guid type is a 16-byte struct. The JIT has limitations today with handling structs in some optimizations so this issue is really the JIT optimizers inability to copy-propagate away structs in general. In this case the inlining happens to make the copies but you can write cases that just involve copying structs to demonstrate the same problem. You can also rewrite the test case with a scalar (e.g. long) and see that all the copies are eliminated.

Thanks for reporting.

@ghost
Copy link

ghost commented Jun 13, 2015

@cmckinsey, thanks for the insights! Is it due to this line: src/jit/optimizer.cpp#L2855?

@mikedn
Copy link
Contributor

mikedn commented Jun 13, 2015

Is it due to this line: src/jit/optimizer.cpp#L2855?

Neah, that just affects certain loop optimizations. And in general, it's never because of a line like that, it's because of lines that do not exists after such a line 😄.

@ghost
Copy link

ghost commented Jun 13, 2015

:+1:

I think you can easily beat me to the pull-request @mikedn. Go for it! 😄

@mikedn
Copy link
Contributor

mikedn commented Jun 14, 2015

I think you can easily beat me to the pull-request @mikedn. Go for it!

@jasonwilliams200OK I'd love to but I doubt that I have enough spare time for it. Besides, it's a bit unlikely that it can be done in a single pull-request. Not to mention that the exact approach will probably have to be discussed with the rest of the JIT team. In short, it won't happen overnight.

@omariom
Copy link
Contributor Author

omariom commented Jun 14, 2015

Nice to see pull-requests are already being discussed 👍

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@AndyAyersMS
Copy link
Member

Here's the current codegen:

; Assembly listing for method X:TestValueTypesInInlinedMethods()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SB] ld-addr-op
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SB] "Inlining Arg"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SB] "Inlining Arg"
;* V04 tmp3         [V04    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SB] "Inlining Arg"
;
; Lcl frame size = 40

G_M32745_IG01:
       sub      rsp, 40
       vzeroupper 
						;; bbWeight=1    PerfScore 1.25
G_M32745_IG02:
       mov      rcx, 0xD1FFAB1E
       mov      edx, 1
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       mov      rax, 0xD1FFAB1E
       mov      rax, gword ptr [rax]
       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rax+8], xmm0
						;; bbWeight=1    PerfScore 5.08
G_M32745_IG03:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25

Methods 1 and 2 still have an extra copy though:

; Assembly listing for method X:Method1(System.Guid)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx        
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )  struct (16) [rsp+0x28]   do-not-enreg[SB] "Inlining Arg"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SB] "Inlining Arg"
;
; Lcl frame size = 56

G_M20035_IG01:
       sub      rsp, 56
       vzeroupper 
						;; bbWeight=1    PerfScore 1.25
G_M20035_IG02:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+28H], xmm0
       mov      rcx, 0xD1FFAB1E
       mov      edx, 1
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       mov      rax, 0xD1FFAB1E
       mov      rax, gword ptr [rax]
       vmovdqu  xmm0, xmmword ptr [rsp+28H]
       vmovdqu  xmmword ptr [rax+8], xmm0
						;; bbWeight=1    PerfScore 8.75
G_M20035_IG03:
       add      rsp, 56
       ret      
						;; bbWeight=1    PerfScore 1.25

@CarolEidt
Copy link
Contributor

This was originally issue 1133 in dotnet/coreclr, and there is a test, JIT\Regression\JitBlue\GitHub_1133, that captures this (to enable improvements or regressions to show up when we run jit diffs).
The propagation of the zero is handled in the TestValueTypesInInlinedMethods method via the constant assertion prop that happens during global morph. In Method1 and Method2, the value is not known to be zero, and value numbering is not assigning the same value number to the indirection of the byref incoming arg to the local that it's assigned to. I believe that this is an issue with value numbering of struct types.

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Oct 14, 2020
CarolEidt added a commit to CarolEidt/runtime that referenced this issue Dec 4, 2020
… to correctly reflect the code in the issue.

Add tests for other issues to enable easy diff-ing
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@AndyAyersMS
Copy link
Member

This is a struct copy-prop issue.

Assigning it over to @sandreenko

@AndyAyersMS AndyAyersMS removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 9, 2021
@AndyAyersMS AndyAyersMS assigned sandreenko and unassigned AndyAyersMS Apr 9, 2021
@sandreenko sandreenko modified the milestones: 6.0.0, 7.0.0 Jul 9, 2021
@sandreenko
Copy link
Contributor

There are 2 items left to close this issue:

  1. support any kind of sources for STORE_LCL_VAR, right now we support only LCL_VAR, CNST_INT, in this case it is IND;
  2. support STORE_BLK with a LCL_VAR src in a register.

both are doable after the recent changes, just don't fit into 6.0 window.

@kunalspathak
Copy link
Member

The current codegen is much better.
Note: I have removed static and replaced them with instance fields/methods to remove other noise related to static handling.

G_M1099_IG01:
       vzeroupper
                                                ;; bbWeight=1    PerfScore 1.00
G_M1099_IG02:
       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rcx+8], xmm0
                                                ;; bbWeight=1    PerfScore 2.33
G_M1099_IG03:
       ret
                                                ;; bbWeight=1    PerfScore 1.00

@kunalspathak
Copy link
Member

Arm64 code:

G_M1099_IG01:
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
                                                ;; bbWeight=1    PerfScore 1.50
G_M1099_IG02:
            stp     xzr, xzr, [x0,#8]
                                                ;; bbWeight=1    PerfScore 1.00
G_M1099_IG03:
            ldp     fp, lr, [sp],#16
            ret     lr
                                                ;; bbWeight=1    PerfScore 2.00

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

9 participants