RyuJIT generates poor code for a helper method which does return Method(value, value)
#9916
FYI. @CarolEidt, @AndyAyersMS, @dotnet/jit-contrib

Also FYI. @VSadov

Does this seem like something that is feasible for the JIT to do, or does it fall into the realm of changing observable side-effects?
If we inline and expose a full web of inter-related structs then we can safely eliminate copies. For small structs we do that today via promotion. The current heuristic caps the number of promoted fields. You might experimentally change this value to 4 to see if it fixes the issues in your sample above. Being more aggressive about promotion is probably not the right long-term fix for our struct issues -- though we can certainly reconsider the heuristic above and adjust if it makes sense. We need to add the ability to reason about structs as a whole. This is part of the first-class structs work.
@AndyAyersMS, thanks for the tips.
As an initial guess, I would think that 4 is a better baseline (w.r.t. element count) for "large vs. small" (<= 4 is small, > 4 is large). This would, at the very least, cover the common case of SIMD-like structs (both 128-bit for float/int and 256-bit for double/long).
I agree. The "first-class structs" issue looks to cover a good set of scenarios.
@AndyAyersMS, changing the value… However, …
Hmm, I'd have to look deeper, but here's what I saw based on a quick scan: there is a second promotion heuristic for implicit by-ref structs that might possibly be holding things up here. So if you're up for another experiment, you could try lowering the profitability threshold in that second heuristic.
@AndyAyersMS, it is definitely the second heuristic. I modified the method locally to print some additional information about the decision it makes here:
I'm not quite sure what the proper heuristic here is. However, I would speculate something like:

```cpp
bool undoPromotion = false;

if (lvaGetPromotionType(newVarDsc) == PROMOTION_TYPE_DEPENDENT)
{
    undoPromotion = true;
    JITDUMP("Undoing promotion for struct parameter V%02d, because promotion is dependent.\n", lclNum);
}
else if ((varDsc->lvRefCnt > 2) && (varDsc->lvRefCnt <= varDsc->lvFieldCnt))
{
    undoPromotion = true;
    JITDUMP("Undoing promotion for struct parameter V%02d: #refs = %d, #fields = %d.\n", lclNum, varDsc->lvRefCnt, varDsc->lvFieldCnt);
}
else
{
    JITDUMP("Not undoing promotion for struct parameter V%02d.\n", lclNum);
}
```

I would guess a "more accurate" heuristic would also take struct size and whether it is an HVA/HFA struct into account, but that is probably "future" work. Thoughts? (I am happy to try something out and get diffs/benches done)
The general promotion logic was last touched in dotnet/coreclr#9455 where we bumped the limit for fieldless promotions from 2 to 3. There are some notes on the byref promotion heuristic in #8653. Basically we are trying to weigh the always-paid cost of the initial copy from the byref to a local vs the potential benefits at the use sites.
Looks like increasing the fieldCount to 4 is definitely not profitable overall.

Both optimizations:
- Corelib
- Tests

Just increasing fieldCount to 4:
- Corelib
- Tests

Just undoPromotion logic change:
- Corelib
- Tests
Codegen today. We did eventually turn on promotion for up to 4 fields, so it looks much better than back then.

; Assembly listing for method Program:Test1(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 3, 6 ) byref -> rcx single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 5, 10 ) struct (16) [rsp+28H] do-not-enreg[SF] "Inlining Arg"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V04 tmp3 [V04 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V05 tmp4 [V05,T10] ( 0, 0 ) float -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V06 tmp5 [V06,T11] ( 0, 0 ) float -> zero-ref V03.Y(offs=0x04) P-INDEP "field V03.Y (fldOffset=0x4)"
;* V07 tmp6 [V07,T12] ( 0, 0 ) float -> zero-ref V03.Z(offs=0x08) P-INDEP "field V03.Z (fldOffset=0x8)"
;* V08 tmp7 [V08,T13] ( 0, 0 ) float -> zero-ref V03.W(offs=0x0c) P-INDEP "field V03.W (fldOffset=0xc)"
; V09 tmp8 [V09,T02] ( 3, 3 ) float -> mm0 V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
; V10 tmp9 [V10,T03] ( 3, 3 ) float -> mm1 V04.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
; V11 tmp10 [V11,T04] ( 3, 3 ) float -> mm2 V04.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
; V12 tmp11 [V12,T05] ( 3, 3 ) float -> mm3 V04.W(offs=0x0c) P-INDEP "field V04.W (fldOffset=0xc)"
; V13 cse0 [V13,T06] ( 2, 2 ) float -> mm0 "CSE - aggressive"
; V14 cse1 [V14,T07] ( 2, 2 ) float -> mm1 "CSE - aggressive"
; V15 cse2 [V15,T08] ( 2, 2 ) float -> mm2 "CSE - aggressive"
; V16 cse3 [V16,T09] ( 2, 2 ) float -> mm3 "CSE - aggressive"
;
; Lcl frame size = 56
G_M5536_IG01:
sub rsp, 56
vzeroupper
;; size=7 bbWeight=1 PerfScore 1.25
G_M5536_IG02:
vmovupd xmm0, xmmword ptr [rcx]
vmovupd xmmword ptr [rsp+28H], xmm0
vmovss xmm0, dword ptr [rsp+28H]
vmovss xmm1, dword ptr [rsp+2CH]
vmovss xmm2, dword ptr [rsp+30H]
vmovss xmm3, dword ptr [rsp+34H]
vmulss xmm0, xmm0, xmm0
vmulss xmm1, xmm1, xmm1
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm2, xmm2
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm3, xmm3
vaddss xmm0, xmm0, xmm1
call [System.Console:WriteLine(float)]
nop
;; size=69 bbWeight=1 PerfScore 41.25
G_M5536_IG03:
add rsp, 56
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 81, prolog size 7, PerfScore 53.15, instruction count 19, allocated bytes for code 94 (MethodHash=3951ea5f) for method Program:Test1(Program+Vector4)
; ============================================================

; Assembly listing for method Program:Test2(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 6, 12 ) byref -> rcx single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
; V04 tmp3 [V04,T05] ( 2, 2 ) float -> mm1 V02.X(offs=0x00) P-INDEP "field V02.X (fldOffset=0x0)"
; V05 tmp4 [V05,T06] ( 2, 2 ) float -> mm3 V02.Y(offs=0x04) P-INDEP "field V02.Y (fldOffset=0x4)"
; V06 tmp5 [V06,T07] ( 2, 2 ) float -> mm5 V02.Z(offs=0x08) P-INDEP "field V02.Z (fldOffset=0x8)"
; V07 tmp6 [V07,T08] ( 2, 2 ) float -> mm7 V02.W(offs=0x0c) P-INDEP "field V02.W (fldOffset=0xc)"
; V08 tmp7 [V08,T09] ( 2, 2 ) float -> mm0 V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
; V09 tmp8 [V09,T10] ( 2, 2 ) float -> mm2 V03.Y(offs=0x04) P-INDEP "field V03.Y (fldOffset=0x4)"
; V10 tmp9 [V10,T11] ( 2, 2 ) float -> mm4 V03.Z(offs=0x08) P-INDEP "field V03.Z (fldOffset=0x8)"
; V11 tmp10 [V11,T12] ( 2, 2 ) float -> mm6 V03.W(offs=0x0c) P-INDEP "field V03.W (fldOffset=0xc)"
; V12 cse0 [V12,T01] ( 3, 3 ) float -> mm2 "CSE - aggressive"
; V13 cse1 [V13,T02] ( 3, 3 ) float -> mm4 "CSE - aggressive"
; V14 cse2 [V14,T03] ( 3, 3 ) float -> mm6 "CSE - aggressive"
; V15 cse3 [V15,T04] ( 3, 3 ) float -> mm0 "CSE - aggressive"
;
; Lcl frame size = 72
G_M19203_IG01:
sub rsp, 72
vzeroupper
vmovaps xmmword ptr [rsp+30H], xmm6
vmovaps xmmword ptr [rsp+20H], xmm7
;; size=19 bbWeight=1 PerfScore 5.25
G_M19203_IG02:
vmovss xmm0, dword ptr [rcx]
vmovaps xmm1, xmm0
vmovss xmm2, dword ptr [rcx+04H]
vmovaps xmm3, xmm2
vmovss xmm4, dword ptr [rcx+08H]
vmovaps xmm5, xmm4
vmovss xmm6, dword ptr [rcx+0CH]
vmovaps xmm7, xmm6
vmulss xmm0, xmm1, xmm0
vmulss xmm1, xmm3, xmm2
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm5, xmm4
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm7, xmm6
vaddss xmm0, xmm0, xmm1
call [System.Console:WriteLine(float)]
nop
;; size=70 bbWeight=1 PerfScore 41.25
G_M19203_IG03:
vmovaps xmm6, xmmword ptr [rsp+30H]
vmovaps xmm7, xmmword ptr [rsp+20H]
add rsp, 72
ret
;; size=17 bbWeight=1 PerfScore 9.25
; Total bytes of code 106, prolog size 19, PerfScore 68.25, instruction count 25, allocated bytes for code 125 (MethodHash=09bab4fc) for method Program:Test2(Program+Vector4)
; ============================================================

; Assembly listing for method Program:Test3(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 6, 12 ) byref -> rcx single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 3, 3 ) float -> mm0 V06.X(offs=0x00) P-INDEP "field V00.X (fldOffset=0x0)"
; V03 tmp2 [V03,T02] ( 3, 3 ) float -> mm1 V06.Y(offs=0x04) P-INDEP "field V00.Y (fldOffset=0x4)"
; V04 tmp3 [V04,T03] ( 3, 3 ) float -> mm2 V06.Z(offs=0x08) P-INDEP "field V00.Z (fldOffset=0x8)"
; V05 tmp4 [V05,T04] ( 3, 3 ) float -> mm3 V06.W(offs=0x0c) P-INDEP "field V00.W (fldOffset=0xc)"
;* V06 tmp5 [V06 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
;
; Lcl frame size = 40
G_M17890_IG01:
sub rsp, 40
vzeroupper
;; size=7 bbWeight=1 PerfScore 1.25
G_M17890_IG02:
vmovss xmm0, dword ptr [rcx]
vmovss xmm1, dword ptr [rcx+04H]
vmovss xmm2, dword ptr [rcx+08H]
vmovss xmm3, dword ptr [rcx+0CH]
vmulss xmm0, xmm0, xmm0
vmulss xmm1, xmm1, xmm1
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm2, xmm2
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm3, xmm3
vaddss xmm0, xmm0, xmm1
call [System.Console:WriteLine(float)]
nop
;; size=54 bbWeight=1 PerfScore 40.25
G_M17890_IG03:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 66, prolog size 7, PerfScore 50.45, instruction count 17, allocated bytes for code 77 (MethodHash=edd5ba1d) for method Program:Test3(Program+Vector4)
; ============================================================
Codegen for all versions is the same today. This looks like it was fixed by #81636. |
Issue

For the code (where `ReturnType` and `OperandType` are value types):

RyuJIT currently has poor codegen when inlining a call to `MyMethod1` and ends up reading `value` from memory, twice.

Example

A simple example program: Test.cs.txt

The method `Test1` currently produces:

The method `Test2` currently produces:

The method `Test3` currently produces:

Additional Notes

The stack frames appear to be much larger than necessary (it looks like the stack frames of each inlined method are kept, even when the values are no longer used).

Directly calling `MyMethod2` (`Test2`) produces better code, but it fails to recognize that `left` and `right` are the same value.

Manually inlining `MyMethod2` (`Test3`) produces the best code. I would think that it is feasible for the JIT to produce this code when the code is inlined by the JIT (`Test1`).

category:cq
theme:structs
skill-level:expert
cost:medium
impact:small