RyuJIT generates poor code for a helper method which does return Method(value, value)
#9916
FYI. @CarolEidt, @AndyAyersMS, @dotnet/jit-contrib

Also FYI. @VSadov

Does this seem like something that is feasible for the JIT to do, or does it fall into the realm of changing observable side-effects?
If we inline and expose a full web of inter-related structs then we can safely eliminate copies. For small structs we do that today via promotion. The current heuristic caps the number of promoted fields. You might experimentally change this value to 4 to see if it fixes the issues in your sample above. Being more aggressive about promotion is probably not the right long-term fix for our struct issues -- though we can certainly reconsider the heuristic above and adjust if it makes sense. We need to add the ability to reason about structs as a whole. This is part of the first-class structs work.
@AndyAyersMS, thanks for the tips.
As an initial guess, I would think that 4 is a better baseline (w.r.t. element count) for "large vs. small" (<= 4 is small, > 4 is large). This would, at the very least, cover the common case of SIMD-like structs (both 128-bit for float/int and 256-bit for double/long).
I agree. The "first-class structs" issue looks to cover a good set of scenarios.
@AndyAyersMS, changing the value… However, …
Hmm, I'd have to look deeper, but here's what I saw based on a quick scan: there is a second promotion heuristic for implicit by-ref structs that might possibly be holding things up here. So if you're up for another experiment, you could try lowering the profitability threshold in that second heuristic.
@AndyAyersMS, it is definitely the second heuristic. I modified the method locally to print some additional information about the decision it makes here:
I'm not quite sure what the proper heuristic here is. However, I would speculate something like:

```cpp
bool undoPromotion = false;

if (lvaGetPromotionType(newVarDsc) == PROMOTION_TYPE_DEPENDENT)
{
    undoPromotion = true;
    JITDUMP("Undoing promotion for struct parameter V%02d, because promotion is dependent.\n", lclNum);
}
else if ((varDsc->lvRefCnt > 2) && (varDsc->lvRefCnt <= varDsc->lvFieldCnt))
{
    undoPromotion = true;
    JITDUMP("Undoing promotion for struct parameter V%02d: #refs = %d, #fields = %d.\n", lclNum, varDsc->lvRefCnt, varDsc->lvFieldCnt);
}
else
{
    JITDUMP("Not undoing promotion for struct parameter V%02d.\n", lclNum);
}
```

I would guess a "more accurate" heuristic would also take struct size and whether it is an HVA/HFA struct into account, but that is probably "future" work. Thoughts? (I am happy to try something out and get diffs/benches done)
The general promotion logic was last touched in dotnet/coreclr#9455 where we bumped the limit for fieldless promotions from 2 to 3. There are some notes on the byref promotion heuristic in #8653. Basically we are trying to weigh the always-paid cost of the initial copy from the byref to a local vs the potential benefits at the use sites.
Looks like increasing the fieldCount to 4 is definitely not profitable overall.

Both optimizations:
- Corelib
- Tests

Just increasing fieldCount to 4:
- Corelib
- Tests

Just undoPromotion logic change:
- Corelib
- Tests
Codegen today. We did eventually turn on promotion for up to 4 fields, so it looks much better than back then.

; Assembly listing for method Program:Test1(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 3, 6 ) byref -> rcx single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 5, 10 ) struct (16) [rsp+28H] do-not-enreg[SF] "Inlining Arg"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V04 tmp3 [V04 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V05 tmp4 [V05,T10] ( 0, 0 ) float -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V06 tmp5 [V06,T11] ( 0, 0 ) float -> zero-ref V03.Y(offs=0x04) P-INDEP "field V03.Y (fldOffset=0x4)"
;* V07 tmp6 [V07,T12] ( 0, 0 ) float -> zero-ref V03.Z(offs=0x08) P-INDEP "field V03.Z (fldOffset=0x8)"
;* V08 tmp7 [V08,T13] ( 0, 0 ) float -> zero-ref V03.W(offs=0x0c) P-INDEP "field V03.W (fldOffset=0xc)"
; V09 tmp8 [V09,T02] ( 3, 3 ) float -> mm0 V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
; V10 tmp9 [V10,T03] ( 3, 3 ) float -> mm1 V04.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
; V11 tmp10 [V11,T04] ( 3, 3 ) float -> mm2 V04.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
; V12 tmp11 [V12,T05] ( 3, 3 ) float -> mm3 V04.W(offs=0x0c) P-INDEP "field V04.W (fldOffset=0xc)"
; V13 cse0 [V13,T06] ( 2, 2 ) float -> mm0 "CSE - aggressive"
; V14 cse1 [V14,T07] ( 2, 2 ) float -> mm1 "CSE - aggressive"
; V15 cse2 [V15,T08] ( 2, 2 ) float -> mm2 "CSE - aggressive"
; V16 cse3 [V16,T09] ( 2, 2 ) float -> mm3 "CSE - aggressive"
;
; Lcl frame size = 56
G_M5536_IG01:
sub rsp, 56
vzeroupper
;; size=7 bbWeight=1 PerfScore 1.25
G_M5536_IG02:
vmovupd xmm0, xmmword ptr [rcx]
vmovupd xmmword ptr [rsp+28H], xmm0
vmovss xmm0, dword ptr [rsp+28H]
vmovss xmm1, dword ptr [rsp+2CH]
vmovss xmm2, dword ptr [rsp+30H]
vmovss xmm3, dword ptr [rsp+34H]
vmulss xmm0, xmm0, xmm0
vmulss xmm1, xmm1, xmm1
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm2, xmm2
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm3, xmm3
vaddss xmm0, xmm0, xmm1
call [System.Console:WriteLine(float)]
nop
;; size=69 bbWeight=1 PerfScore 41.25
G_M5536_IG03:
add rsp, 56
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 81, prolog size 7, PerfScore 53.15, instruction count 19, allocated bytes for code 94 (MethodHash=3951ea5f) for method Program:Test1(Program+Vector4)
; ============================================================

; Assembly listing for method Program:Test2(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 6, 12 ) byref -> rcx single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
; V04 tmp3 [V04,T05] ( 2, 2 ) float -> mm1 V02.X(offs=0x00) P-INDEP "field V02.X (fldOffset=0x0)"
; V05 tmp4 [V05,T06] ( 2, 2 ) float -> mm3 V02.Y(offs=0x04) P-INDEP "field V02.Y (fldOffset=0x4)"
; V06 tmp5 [V06,T07] ( 2, 2 ) float -> mm5 V02.Z(offs=0x08) P-INDEP "field V02.Z (fldOffset=0x8)"
; V07 tmp6 [V07,T08] ( 2, 2 ) float -> mm7 V02.W(offs=0x0c) P-INDEP "field V02.W (fldOffset=0xc)"
; V08 tmp7 [V08,T09] ( 2, 2 ) float -> mm0 V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
; V09 tmp8 [V09,T10] ( 2, 2 ) float -> mm2 V03.Y(offs=0x04) P-INDEP "field V03.Y (fldOffset=0x4)"
; V10 tmp9 [V10,T11] ( 2, 2 ) float -> mm4 V03.Z(offs=0x08) P-INDEP "field V03.Z (fldOffset=0x8)"
; V11 tmp10 [V11,T12] ( 2, 2 ) float -> mm6 V03.W(offs=0x0c) P-INDEP "field V03.W (fldOffset=0xc)"
; V12 cse0 [V12,T01] ( 3, 3 ) float -> mm2 "CSE - aggressive"
; V13 cse1 [V13,T02] ( 3, 3 ) float -> mm4 "CSE - aggressive"
; V14 cse2 [V14,T03] ( 3, 3 ) float -> mm6 "CSE - aggressive"
; V15 cse3 [V15,T04] ( 3, 3 ) float -> mm0 "CSE - aggressive"
;
; Lcl frame size = 72
G_M19203_IG01:
sub rsp, 72
vzeroupper
vmovaps xmmword ptr [rsp+30H], xmm6
vmovaps xmmword ptr [rsp+20H], xmm7
;; size=19 bbWeight=1 PerfScore 5.25
G_M19203_IG02:
vmovss xmm0, dword ptr [rcx]
vmovaps xmm1, xmm0
vmovss xmm2, dword ptr [rcx+04H]
vmovaps xmm3, xmm2
vmovss xmm4, dword ptr [rcx+08H]
vmovaps xmm5, xmm4
vmovss xmm6, dword ptr [rcx+0CH]
vmovaps xmm7, xmm6
vmulss xmm0, xmm1, xmm0
vmulss xmm1, xmm3, xmm2
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm5, xmm4
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm7, xmm6
vaddss xmm0, xmm0, xmm1
call [System.Console:WriteLine(float)]
nop
;; size=70 bbWeight=1 PerfScore 41.25
G_M19203_IG03:
vmovaps xmm6, xmmword ptr [rsp+30H]
vmovaps xmm7, xmmword ptr [rsp+20H]
add rsp, 72
ret
;; size=17 bbWeight=1 PerfScore 9.25
; Total bytes of code 106, prolog size 19, PerfScore 68.25, instruction count 25, allocated bytes for code 125 (MethodHash=09bab4fc) for method Program:Test2(Program+Vector4)
; ============================================================

; Assembly listing for method Program:Test3(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 6, 12 ) byref -> rcx single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 3, 3 ) float -> mm0 V06.X(offs=0x00) P-INDEP "field V00.X (fldOffset=0x0)"
; V03 tmp2 [V03,T02] ( 3, 3 ) float -> mm1 V06.Y(offs=0x04) P-INDEP "field V00.Y (fldOffset=0x4)"
; V04 tmp3 [V04,T03] ( 3, 3 ) float -> mm2 V06.Z(offs=0x08) P-INDEP "field V00.Z (fldOffset=0x8)"
; V05 tmp4 [V05,T04] ( 3, 3 ) float -> mm3 V06.W(offs=0x0c) P-INDEP "field V00.W (fldOffset=0xc)"
;* V06 tmp5 [V06 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
;
; Lcl frame size = 40
G_M17890_IG01:
sub rsp, 40
vzeroupper
;; size=7 bbWeight=1 PerfScore 1.25
G_M17890_IG02:
vmovss xmm0, dword ptr [rcx]
vmovss xmm1, dword ptr [rcx+04H]
vmovss xmm2, dword ptr [rcx+08H]
vmovss xmm3, dword ptr [rcx+0CH]
vmulss xmm0, xmm0, xmm0
vmulss xmm1, xmm1, xmm1
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm2, xmm2
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm3, xmm3
vaddss xmm0, xmm0, xmm1
call [System.Console:WriteLine(float)]
nop
;; size=54 bbWeight=1 PerfScore 40.25
G_M17890_IG03:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 66, prolog size 7, PerfScore 50.45, instruction count 17, allocated bytes for code 77 (MethodHash=edd5ba1d) for method Program:Test3(Program+Vector4)
; ============================================================
Codegen for all versions is the same today. This looks like it was fixed by #81636. |
Issue

For the code (where `ReturnType` and `OperandType` are value types):

RyuJIT currently has poor codegen when inlining a call to `MyMethod1` and ends up reading `value` from memory, twice.

Example

A simple example program: Test.cs.txt

The method `Test1` currently produces:

The method `Test2` currently produces:

The method `Test3` currently produces:

Additional Notes

The stack frames appear to be much larger than necessary (it looks like the stack frames of each inlined method are kept, even when the values are no longer used).

Directly calling `MyMethod2` (`Test2`) produces better code, but it fails to recognize that `left` and `right` are the same value.

Manually inlining `MyMethod2` (`Test3`) produces the best code. I would think that it is feasible for the JIT to produce this code when the code is inlined by the JIT (`Test1`).

category:cq
theme:structs
skill-level:expert
cost:medium
impact:small