RyuJIT generates poor code for a helper method which does return Method(value, value) #9916

Closed
tannergooding opened this issue Mar 11, 2018 · 12 comments
Labels
  • area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
  • enhancement: Product code improvement that does NOT require public API changes/additions
  • optimization
  • tenet-performance: Performance related issue

@tannergooding
Member

tannergooding commented Mar 11, 2018

Issue

For the code (where ReturnType and OperandType are value types):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ReturnType MyMethod1(OperandType value)
{
    return MyMethod2(value, value);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ReturnType MyMethod2(OperandType left, OperandType right)
{
    // Logic
}

RyuJIT currently generates poor code when it inlines a call to MyMethod1: it ends up reading value from memory twice.

Example

A simple example program: Test.cs.txt

The method Test1 currently produces:

The method Test2 currently produces:

The method Test3 currently produces:

Additional Notes

The stack frames appear to be much larger than necessary (it looks like the stack frames of each inlined method are kept, even when the values are no longer used).

Directly calling MyMethod2 (Test2) produces better code, but it fails to recognize that left and right are the same value.

Manually inlining MyMethod2 (Test3) produces the best code. I would think that it is feasible for the JIT to produce this code when the code is inlined by the JIT (Test1).

category:cq
theme:structs
skill-level:expert
cost:medium
impact:small

@tannergooding
Member Author

FYI. @CarolEidt, @AndyAyersMS, @dotnet/jit-contrib

@tannergooding
Member Author

Also FYI. @VSadov

@tannergooding
Member Author

@CarolEidt, @AndyAyersMS.

Does this seem like something that is feasible for the JIT to do, or does it fall into the realm of changing observable side-effects?

@AndyAyersMS
Member

If we inline and expose a full web of inter-related structs then we can safely eliminate copies.

For small structs we do that today via promotion. Current heuristic (see lvaShouldPromoteStructVar) is to aggressively promote structs with 3 or fewer fields. We'll promote larger structs only if we see particular fields being used. This rules out promotion of larger structs in cases where structs are simply being copied.

You might experimentally change this value to 4 to see if it fixes the issues in your sample above.
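
A minimal sketch of the threshold rule described above, written as standalone C++ rather than actual JIT source. The names StructInfo and shouldPromote are hypothetical; only the rule itself (promote aggressively at 3 or fewer fields, otherwise only when particular fields are used) comes from the comment.

```cpp
// Simplified model of the struct promotion heuristic. This is NOT the real
// lvaShouldPromoteStructVar; every name in this block is hypothetical.
struct StructInfo
{
    unsigned fieldCount;       // number of fields in the struct
    bool     hasFieldAccesses; // true if individual fields are read or written
};

// Structs at or below the threshold are promoted aggressively. Larger structs
// are promoted only when particular fields are used, so a struct that is
// merely copied around stays unpromoted.
bool shouldPromote(const StructInfo& info, unsigned aggressiveThreshold = 3)
{
    if (info.fieldCount <= aggressiveThreshold)
    {
        return true;
    }
    return info.hasFieldAccesses;
}
```

Under this model a four-field struct that is only copied is rejected with the default threshold, and passing 4 as the threshold models the suggested experiment.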

Being more aggressive about promotion is probably not the right long-term fix for our struct issues -- though we can certainly reconsider the heuristic above and adjust if it makes sense. We need to add the ability to reason about structs as a whole. This is part of the first-class structs work.

@tannergooding
Member Author

@AndyAyersMS, thanks for the tips.

Current heuristic (see lvaShouldPromoteStructVar) is to aggressively promote structs with 3 or fewer fields.

Just initially, I would think that 4 is a better baseline (w.r.t. element count) for "large vs small" (<= 4 is small, > 4 is large). This would, at the very least, cover the common case of SIMD-like structs (both 128-bit for float/int and 256-bit for double/long).

Being more aggressive about promotion is probably not the right long-term fix for our struct issues

I agree. The "first-class structs" issue looks to cover a good set of scenarios.

@tannergooding
Member Author

@AndyAyersMS, changing lvaShouldPromoteStructVar to have a threshold of 4 does indeed fix the issue for Test1 (and it ultimately generates identical code to Test3).

However, Test2 actually produces the same codegen as before (the JIT Dump does indicate promotion is happening). This was somewhat surprising to me, since Test1 is a bit more complicated and ultimately does the same thing as Test2.

  • Test1 calls LengthSquared(value), which calls DotProduct(value, value)
  • Test2 directly calls DotProduct(value, value)
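
For readers without the attached Test.cs.txt, the three call shapes can be sketched roughly as follows. This is a C++ analog, not the original C# program: the Vector4 layout (four floats X, Y, Z, W) and the multiply-add body are inferred from the assembly listings later in this thread, and LengthSquaredManual is a hypothetical name for the hand-inlined Test3 shape.

```cpp
// C++ analog of the C# test shapes. The real test program is not reproduced
// in this thread, so the bodies below are assumptions.
struct Vector4
{
    float X, Y, Z, W;
};

// Test2 shape: callers invoke this directly as DotProduct(value, value).
inline float DotProduct(Vector4 left, Vector4 right)
{
    return left.X * right.X + left.Y * right.Y + left.Z * right.Z + left.W * right.W;
}

// Test1 shape: a helper that forwards the same value twice.
inline float LengthSquared(Vector4 value)
{
    return DotProduct(value, value);
}

// Test3 shape: DotProduct inlined by hand, so each field is named once per use.
inline float LengthSquaredManual(Vector4 value)
{
    return value.X * value.X + value.Y * value.Y + value.Z * value.Z + value.W * value.W;
}
```

All three shapes compute the same result; the issue is only about how much copying the JIT leaves behind for each one.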

@AndyAyersMS
Member

Hmm, I'd have to look deeper, but here's what I saw based on quick scan:

There is a second promotion heuristic for implicit by-ref structs that might possibly be holding up Test2. In fgRetypeImplicitByRefArgs we can "undo" an implicit by-ref arg promotion for structs whose ref count is less than their field count. This seems to be blocking or undoing promotion of V00. So we end up with two copies of the arg.

In Test1 we copy the arg to a local and then make two copies of the local instead of making two copies of the arg. So more aggressive promotion probably turns this into one copy of the arg.

So if you're up for another experiment, you could try lowering the profitability threshold in fgRetypeImplicitByRefArgs. If that's indeed what is happening it would be nice for that method to be more vocal about its decision making.

@tannergooding
Member Author

@AndyAyersMS, it is definitely the undoPromotion check.

I modified the method locally to print some additional information about the decision it makes here:

*************** In fgRetypeImplicitByRefArgs()
  
  lvaGrabTemp returning 15 (V15 tmp14) (a long lifetime temp) called for Promoted implicit byref.
  Undoing promotion for for struct parameter V00: #refs = 2, #fields = 4.
  Changing the lvType for struct parameter V00 to TYP_BYREF.

I'm not quite sure what the proper heuristic here is. However, I would speculate that refs = 2 is the most common case, and that it is used to support binary operations, so perhaps changing the heuristic to the following would work:

bool undoPromotion = false;

if (lvaGetPromotionType(newVarDsc) == PROMOTION_TYPE_DEPENDENT)
{
    // Dependent promotion: always undo.
    undoPromotion = true;
    JITDUMP("Undoing promotion for struct parameter V%02d, because promotion is dependent.\n", lclNum);
}
else if ((varDsc->lvRefCnt > 2) && (varDsc->lvRefCnt <= varDsc->lvFieldCnt))
{
    // Proposed change: undo only when there are more than 2 references
    // but still no more references than fields.
    undoPromotion = true;
    JITDUMP("Undoing promotion for struct parameter V%02d: #refs = %d, #fields = %d.\n", lclNum, varDsc->lvRefCnt, varDsc->lvFieldCnt);
}
else
{
    // Otherwise keep the promotion; this now includes refCnt <= 2.
    JITDUMP("Not undoing promotion for struct parameter V%02d.\n", lclNum);
}

I would guess a "more accurate" heuristic would also take struct size and whether it is a HVA/HFA struct into account, but that is probably "future" work.
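
The difference between the two rules can be sketched as a pair of standalone predicates. This is a simplified model, not JIT source: ArgInfo and both function names are hypothetical, and the "current" rule (undo when refCnt <= fieldCnt) is an assumption inferred from the proposed condition and the dump output above.

```cpp
// Simplified model of the undo-promotion decision. This is NOT the real
// fgRetypeImplicitByRefArgs; ArgInfo and both functions are hypothetical.
struct ArgInfo
{
    bool     dependentPromotion; // models PROMOTION_TYPE_DEPENDENT
    unsigned refCnt;             // models varDsc->lvRefCnt
    unsigned fieldCnt;           // models varDsc->lvFieldCnt
};

// Assumed existing rule: undo when there are no more references than fields.
bool undoPromotionCurrent(const ArgInfo& arg)
{
    return arg.dependentPromotion || (arg.refCnt <= arg.fieldCnt);
}

// Proposed rule: same as above, but keep the promotion when refCnt is 2 or less.
bool undoPromotionProposed(const ArgInfo& arg)
{
    return arg.dependentPromotion || ((arg.refCnt > 2) && (arg.refCnt <= arg.fieldCnt));
}
```

For the V00 case in the dump (#refs = 2, #fields = 4), the assumed current rule undoes the promotion and the proposed rule keeps it.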

Thoughts? (I am happy to try something out and get diffs/benches done)

@AndyAyersMS
Member

The general promotion logic was last touched in dotnet/coreclr#9455 where we bumped the limit for fieldless promotions from 2 to 3.

There are some notes on the byref promotion heuristic in #8653. Basically we are trying to weigh the always-paid cost of the initial copy from the byref to a local vs the potential benefits at the use sites.

@tannergooding
Member Author

Looks like increasing the fieldCount to 4 is definitely not profitable overall.
Changing the undoPromotion logic to exclude refCnt <= 2 might be profitable.

Both Optimizations

Corelib

Summary:
(Note: Lower is better)
Total bytes of diff: 21575 (0.58% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
       21575 : System.Private.CoreLib.dasm (0.58% of base)
1 total files with size differences (0 improved, 1 regressed), 0 unchanged.
Top method regessions by size (bytes):
         383 : System.Private.CoreLib.dasm - System.Variant:MarshalHelperCastVariant(ref,int,byref)
         215 : System.Private.CoreLib.dasm - System.Array:IndexOf(ref,struct,int,int):int (12 methods)
         212 : System.Private.CoreLib.dasm - System.Variant:MarshalHelperConvertObjectToVariant(ref,byref)
         181 : System.Private.CoreLib.dasm - System.Array:BinarySearch(ref,int,int,struct,ref):int (9 methods)
         177 : System.Private.CoreLib.dasm - System.Collections.Generic.ArraySortHelper`1[Decimal][System.Decimal]:DownHeap(ref,int,int,int,ref)
Top method improvements by size (bytes):
         -72 : System.Private.CoreLib.dasm - System.Math:Ceiling(struct):struct
         -54 : System.Private.CoreLib.dasm - System.Decimal:Ceiling(struct):struct
         -24 : System.Private.CoreLib.dasm - System.Collections.Generic.NullableEqualityComparer`1[__Canon][System.__Canon]:IndexOf(ref,struct,int,int):int:this
         -24 : System.Private.CoreLib.dasm - System.Collections.Generic.NullableEqualityComparer`1[__Canon][System.__Canon]:LastIndexOf(ref,struct,int,int):int:this
         -23 : System.Private.CoreLib.dasm - System.Guid:TryWriteBytes(struct):bool:this
833 total methods with size differences (78 improved, 755 regressed), 25196 unchanged.

Tests

Summary:
(Note: Lower is better)
Total bytes of diff: 585179 (1.11% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
      238694 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm (26.10% of base)
      225169 : JIT\Methodical\fp\exgen\10w5d_cs_ro\10w5d_cs_ro.dasm (25.66% of base)
       11326 : JIT\Directed\perffix\primitivevt\mixed1_cs_do\mixed1_cs_do.dasm (21.66% of base)
       11075 : JIT\Directed\perffix\primitivevt\mixed1_cs_ro\mixed1_cs_ro.dasm (21.74% of base)
        5609 : JIT\HardwareIntrinsics\X86\Sse2\UnpackHigh_ro\UnpackHigh_ro.dasm (1.61% of base)
Top file improvements by size (bytes):
         -13 : JIT\opt\Tailcall\TailcallVerifyWithPrefix\TailcallVerifyWithPrefix.dasm (-0.03% of base)
         -10 : Interop\StructMarshalling\PInvoke\MarshalStructAsLayoutSeq\MarshalStructAsLayoutSeq.dasm (-0.03% of base)
          -5 : JIT\Performance\CodeQuality\Span\Indexer\Indexer.dasm (-0.05% of base)
          -1 : JIT\Directed\coverage\oldtests\cse2_cs_do\cse2_cs_do.dasm (-0.01% of base)
          -1 : JIT\Directed\coverage\oldtests\cse2_cs_ro\cse2_cs_ro.dasm (-0.01% of base)
203 total files with size differences (5 improved, 198 regressed), 2190 unchanged.
Top method regessions by size (bytes):
        4157 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm - testout1:Func_0_4_5_3():struct
        4157 : JIT\Methodical\fp\exgen\10w5d_cs_ro\10w5d_cs_ro.dasm - testout1:Func_0_4_5_3():struct
        3741 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm - testout1:Func_0_4_5_1():struct
        3741 : JIT\Methodical\fp\exgen\10w5d_cs_ro\10w5d_cs_ro.dasm - testout1:Func_0_4_5_1():struct
        3629 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm - testout1:Func_0_5_5_1():struct
Top method improvements by size (bytes):
         -13 : JIT\opt\Tailcall\TailcallVerifyWithPrefix\TailcallVerifyWithPrefix.dasm - TailcallVerify.Program:PrintOutRunTestsFile()
          -3 : Interop\StructMarshalling\PInvoke\MarshalStructAsLayoutSeq\MarshalStructAsLayoutSeq.dasm - Managed:testMethod(struct)
          -3 : Interop\StructMarshalling\PInvoke\MarshalStructAsLayoutSeq\MarshalStructAsLayoutSeq.dasm - Helper:PrintCharSetAnsiSequential(struct,ref)
          -3 : Interop\StructMarshalling\PInvoke\MarshalStructAsLayoutSeq\MarshalStructAsLayoutSeq.dasm - Helper:PrintCharSetUnicodeSequential(struct,ref)
          -3 : JIT\Methodical\explicit\coverage\seq_byte_1_d\seq_byte_1_d.dasm - TestApp:test_0_1(ubyte,struct,struct):ubyte
2910 total methods with size differences (19 improved, 2891 regressed), 163078 unchanged.

Just increasing fieldCount to 4

Corelib

Summary:
(Note: Lower is better)
Total bytes of diff: 4229 (0.11% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
        4229 : System.Private.CoreLib.dasm (0.11% of base)
1 total files with size differences (0 improved, 1 regressed), 0 unchanged.
Top method regessions by size (bytes):
         383 : System.Private.CoreLib.dasm - System.Variant:MarshalHelperCastVariant(ref,int,byref)
         212 : System.Private.CoreLib.dasm - System.Variant:MarshalHelperConvertObjectToVariant(ref,byref)
         177 : System.Private.CoreLib.dasm - System.Collections.Generic.ArraySortHelper`1[Decimal][System.Decimal]:DownHeap(ref,int,int,int,ref)
         162 : System.Private.CoreLib.dasm - System.Collections.Generic.ArraySortHelper`1[Decimal][System.Decimal]:PickPivotAndPartition(ref,int,int,ref):int
         145 : System.Private.CoreLib.dasm - System.Decimal:Remainder(struct,struct):struct
Top method improvements by size (bytes):
         -72 : System.Private.CoreLib.dasm - System.Math:Ceiling(struct):struct
         -54 : System.Private.CoreLib.dasm - System.Decimal:Ceiling(struct):struct
         -14 : System.Private.CoreLib.dasm - System.Threading.Tasks.ValueTask`1[Int32][System.Int32]:ConfigureAwait(bool):struct:this
          -6 : System.Private.CoreLib.dasm - MemberInfoCache`1[__Canon][System.__Canon]:PopulateFields(struct):ref:this
          -2 : System.Private.CoreLib.dasm - System.Decimal:System.IConvertible.ToSingle(ref):float:this
144 total methods with size differences (7 improved, 137 regressed), 25885 unchanged.

Tests

Summary:
(Note: Lower is better)
Total bytes of diff: 572051 (1.09% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
      238694 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm (26.10% of base)
      225169 : JIT\Methodical\fp\exgen\10w5d_cs_ro\10w5d_cs_ro.dasm (25.66% of base)
       11326 : JIT\Directed\perffix\primitivevt\mixed1_cs_do\mixed1_cs_do.dasm (21.66% of base)
       11075 : JIT\Directed\perffix\primitivevt\mixed1_cs_ro\mixed1_cs_ro.dasm (21.74% of base)
        5605 : JIT\HardwareIntrinsics\X86\Sse2\UnpackHigh_ro\UnpackHigh_ro.dasm (1.61% of base)
Top file improvements by size (bytes):
         -13 : JIT\opt\Tailcall\TailcallVerifyWithPrefix\TailcallVerifyWithPrefix.dasm (-0.03% of base)
167 total files with size differences (1 improved, 166 regressed), 2226 unchanged.
Top method regessions by size (bytes):
        4157 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm - testout1:Func_0_4_5_3():struct
        4157 : JIT\Methodical\fp\exgen\10w5d_cs_ro\10w5d_cs_ro.dasm - testout1:Func_0_4_5_3():struct
        3741 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm - testout1:Func_0_4_5_1():struct
        3741 : JIT\Methodical\fp\exgen\10w5d_cs_ro\10w5d_cs_ro.dasm - testout1:Func_0_4_5_1():struct
        3629 : JIT\Methodical\fp\exgen\10w5d_cs_do\10w5d_cs_do.dasm - testout1:Func_0_5_5_1():struct
Top method improvements by size (bytes):
         -13 : JIT\opt\Tailcall\TailcallVerifyWithPrefix\TailcallVerifyWithPrefix.dasm - TailcallVerify.Program:PrintOutRunTestsFile()
         -10 : JIT\Directed\nullabletypes\castclassvaluetype_do\castclassvaluetype_do.dasm - NullableTest13:BoxUnboxToQGen(struct):bool (2 methods)
         -10 : JIT\Directed\nullabletypes\castclassvaluetype_ro\castclassvaluetype_ro.dasm - NullableTest13:BoxUnboxToQGen(struct):bool (2 methods)
          -1 : JIT\Directed\nullabletypes\castclassvaluetype_do\castclassvaluetype_do.dasm - NullableTest13:BoxUnboxToQGenC(struct):bool (2 methods)
          -1 : JIT\Directed\nullabletypes\castclassvaluetype_ro\castclassvaluetype_ro.dasm - NullableTest13:BoxUnboxToQGenC(struct):bool (2 methods)
2555 total methods with size differences (5 improved, 2550 regressed), 163433 unchanged.

Just undoPromotion logic change

Corelib

Summary:
(Note: Lower is better)
Total bytes of diff: 14050 (0.38% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
       14050 : System.Private.CoreLib.dasm (0.38% of base)
1 total files with size differences (0 improved, 1 regressed), 0 unchanged.
Top method regessions by size (bytes):
         170 : System.Private.CoreLib.dasm - System.Array:IndexOf(ref,struct,int,int):int (12 methods)
         132 : System.Private.CoreLib.dasm - System.Array:BinarySearch(ref,int,int,struct,ref):int (9 methods)
         118 : System.Private.CoreLib.dasm - System.MemoryExtensions:BinarySearch(struct,ref):int (4 methods)
         110 : System.Private.CoreLib.dasm - System.TupleExtensions:CreateLong(ref,ref,ref,ref,ref,ref,ref,struct):struct (14 methods)
          87 : System.Private.CoreLib.dasm - System.Collections.Generic.ComparisonComparer`1[ValueTuple`3][System.ValueTuple`3[System.__Canon,System.__Canon,System.__Canon]]:Compare(struct,struct):int:this
Top method improvements by size (bytes):
         -24 : System.Private.CoreLib.dasm - System.Collections.Generic.NullableEqualityComparer`1[__Canon][System.__Canon]:IndexOf(ref,struct,int,int):int:this
         -24 : System.Private.CoreLib.dasm - System.Collections.Generic.NullableEqualityComparer`1[__Canon][System.__Canon]:LastIndexOf(ref,struct,int,int):int:this
         -23 : System.Private.CoreLib.dasm - System.Guid:TryWriteBytes(struct):bool:this
         -19 : System.Private.CoreLib.dasm - StringParser:TryParse(struct,byref):bool:this
         -16 : System.Private.CoreLib.dasm - System.Globalization.CompareInfo:FindStringOrdinal(int,struct,struct,bool):int
627 total methods with size differences (71 improved, 556 regressed), 25402 unchanged.

Tests

Summary:
(Note: Lower is better)
Total bytes of diff: 2601 (0.00% of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
         644 : JIT\Directed\nullabletypes\Desktop\boxunboxvaluetype_do\boxunboxvaluetype_do.dasm (0.42% of base)
         644 : JIT\Directed\nullabletypes\Desktop\boxunboxvaluetype_ro\boxunboxvaluetype_ro.dasm (0.42% of base)
         131 : JIT\opt\FastTailCall\FastTailCallCandidates\FastTailCallCandidates.dasm (2.47% of base)
         109 : JIT\Regression\JitBlue\GitHub_8220\GitHub_8220\GitHub_8220.dasm (3.01% of base)
          67 : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm (0.71% of base)
Top file improvements by size (bytes):
         -10 : Interop\StructMarshalling\PInvoke\MarshalStructAsLayoutSeq\MarshalStructAsLayoutSeq.dasm (-0.03% of base)
          -6 : JIT\Performance\CodeQuality\Span\Indexer\Indexer.dasm (-0.06% of base)
          -2 : JIT\HardwareIntrinsics\X86\Sse2\MaxScalar_ro\MaxScalar_ro.dasm (0.00% of base)
          -2 : JIT\HardwareIntrinsics\X86\Sse2\MinScalar_ro\MinScalar_ro.dasm (0.00% of base)
          -1 : JIT\Directed\coverage\oldtests\cse2_cs_do\cse2_cs_do.dasm (-0.01% of base)
78 total files with size differences (6 improved, 72 regressed), 2315 unchanged.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@jakobbotsch
Member

Codegen today. We did eventually turn on promotion for up to 4 fields, so it looks much better than back then.

; Assembly listing for method Program:Test1(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  3,  6   )   byref  ->  rcx         single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  5, 10   )  struct (16) [rsp+28H]   do-not-enreg[SF] "Inlining Arg"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V04 tmp3         [V04    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V05 tmp4         [V05,T10] (  0,  0   )   float  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;* V06 tmp5         [V06,T11] (  0,  0   )   float  ->  zero-ref    V03.Y(offs=0x04) P-INDEP "field V03.Y (fldOffset=0x4)"
;* V07 tmp6         [V07,T12] (  0,  0   )   float  ->  zero-ref    V03.Z(offs=0x08) P-INDEP "field V03.Z (fldOffset=0x8)"
;* V08 tmp7         [V08,T13] (  0,  0   )   float  ->  zero-ref    V03.W(offs=0x0c) P-INDEP "field V03.W (fldOffset=0xc)"
;  V09 tmp8         [V09,T02] (  3,  3   )   float  ->  mm0         V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;  V10 tmp9         [V10,T03] (  3,  3   )   float  ->  mm1         V04.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
;  V11 tmp10        [V11,T04] (  3,  3   )   float  ->  mm2         V04.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
;  V12 tmp11        [V12,T05] (  3,  3   )   float  ->  mm3         V04.W(offs=0x0c) P-INDEP "field V04.W (fldOffset=0xc)"
;  V13 cse0         [V13,T06] (  2,  2   )   float  ->  mm0         "CSE - aggressive"
;  V14 cse1         [V14,T07] (  2,  2   )   float  ->  mm1         "CSE - aggressive"
;  V15 cse2         [V15,T08] (  2,  2   )   float  ->  mm2         "CSE - aggressive"
;  V16 cse3         [V16,T09] (  2,  2   )   float  ->  mm3         "CSE - aggressive"
;
; Lcl frame size = 56

G_M5536_IG01:
       sub      rsp, 56
       vzeroupper 
						;; size=7 bbWeight=1    PerfScore 1.25
G_M5536_IG02:
       vmovupd  xmm0, xmmword ptr [rcx]
       vmovupd  xmmword ptr [rsp+28H], xmm0
       vmovss   xmm0, dword ptr [rsp+28H]
       vmovss   xmm1, dword ptr [rsp+2CH]
       vmovss   xmm2, dword ptr [rsp+30H]
       vmovss   xmm3, dword ptr [rsp+34H]
       vmulss   xmm0, xmm0, xmm0
       vmulss   xmm1, xmm1, xmm1
       vaddss   xmm0, xmm0, xmm1
       vmulss   xmm1, xmm2, xmm2
       vaddss   xmm0, xmm0, xmm1
       vmulss   xmm1, xmm3, xmm3
       vaddss   xmm0, xmm0, xmm1
       call     [System.Console:WriteLine(float)]
       nop      
						;; size=69 bbWeight=1    PerfScore 41.25
G_M5536_IG03:
       add      rsp, 56
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25

; Total bytes of code 81, prolog size 7, PerfScore 53.15, instruction count 19, allocated bytes for code 94 (MethodHash=3951ea5f) for method Program:Test1(Program+Vector4)
; ============================================================
; Assembly listing for method Program:Test2(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  6, 12   )   byref  ->  rcx         single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;  V04 tmp3         [V04,T05] (  2,  2   )   float  ->  mm1         V02.X(offs=0x00) P-INDEP "field V02.X (fldOffset=0x0)"
;  V05 tmp4         [V05,T06] (  2,  2   )   float  ->  mm3         V02.Y(offs=0x04) P-INDEP "field V02.Y (fldOffset=0x4)"
;  V06 tmp5         [V06,T07] (  2,  2   )   float  ->  mm5         V02.Z(offs=0x08) P-INDEP "field V02.Z (fldOffset=0x8)"
;  V07 tmp6         [V07,T08] (  2,  2   )   float  ->  mm7         V02.W(offs=0x0c) P-INDEP "field V02.W (fldOffset=0xc)"
;  V08 tmp7         [V08,T09] (  2,  2   )   float  ->  mm0         V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
;  V09 tmp8         [V09,T10] (  2,  2   )   float  ->  mm2         V03.Y(offs=0x04) P-INDEP "field V03.Y (fldOffset=0x4)"
;  V10 tmp9         [V10,T11] (  2,  2   )   float  ->  mm4         V03.Z(offs=0x08) P-INDEP "field V03.Z (fldOffset=0x8)"
;  V11 tmp10        [V11,T12] (  2,  2   )   float  ->  mm6         V03.W(offs=0x0c) P-INDEP "field V03.W (fldOffset=0xc)"
;  V12 cse0         [V12,T01] (  3,  3   )   float  ->  mm2         "CSE - aggressive"
;  V13 cse1         [V13,T02] (  3,  3   )   float  ->  mm4         "CSE - aggressive"
;  V14 cse2         [V14,T03] (  3,  3   )   float  ->  mm6         "CSE - aggressive"
;  V15 cse3         [V15,T04] (  3,  3   )   float  ->  mm0         "CSE - aggressive"
;
; Lcl frame size = 72

G_M19203_IG01:
       sub      rsp, 72
       vzeroupper 
       vmovaps  xmmword ptr [rsp+30H], xmm6
       vmovaps  xmmword ptr [rsp+20H], xmm7
						;; size=19 bbWeight=1    PerfScore 5.25
G_M19203_IG02:
       vmovss   xmm0, dword ptr [rcx]
       vmovaps  xmm1, xmm0
       vmovss   xmm2, dword ptr [rcx+04H]
       vmovaps  xmm3, xmm2
       vmovss   xmm4, dword ptr [rcx+08H]
       vmovaps  xmm5, xmm4
       vmovss   xmm6, dword ptr [rcx+0CH]
       vmovaps  xmm7, xmm6
       vmulss   xmm0, xmm1, xmm0
       vmulss   xmm1, xmm3, xmm2
       vaddss   xmm0, xmm0, xmm1
       vmulss   xmm1, xmm5, xmm4
       vaddss   xmm0, xmm0, xmm1
       vmulss   xmm1, xmm7, xmm6
       vaddss   xmm0, xmm0, xmm1
       call     [System.Console:WriteLine(float)]
       nop      
						;; size=70 bbWeight=1    PerfScore 41.25
G_M19203_IG03:
       vmovaps  xmm6, xmmword ptr [rsp+30H]
       vmovaps  xmm7, xmmword ptr [rsp+20H]
       add      rsp, 72
       ret      
						;; size=17 bbWeight=1    PerfScore 9.25

; Total bytes of code 106, prolog size 19, PerfScore 68.25, instruction count 25, allocated bytes for code 125 (MethodHash=09bab4fc) for method Program:Test2(Program+Vector4)
; ============================================================
; Assembly listing for method Program:Test3(Program+Vector4)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  6, 12   )   byref  ->  rcx         single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  3,  3   )   float  ->  mm0         V06.X(offs=0x00) P-INDEP "field V00.X (fldOffset=0x0)"
;  V03 tmp2         [V03,T02] (  3,  3   )   float  ->  mm1         V06.Y(offs=0x04) P-INDEP "field V00.Y (fldOffset=0x4)"
;  V04 tmp3         [V04,T03] (  3,  3   )   float  ->  mm2         V06.Z(offs=0x08) P-INDEP "field V00.Z (fldOffset=0x8)"
;  V05 tmp4         [V05,T04] (  3,  3   )   float  ->  mm3         V06.W(offs=0x0c) P-INDEP "field V00.W (fldOffset=0xc)"
;* V06 tmp5         [V06    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;
; Lcl frame size = 40

G_M17890_IG01:
       sub      rsp, 40
       vzeroupper 
						;; size=7 bbWeight=1    PerfScore 1.25
G_M17890_IG02:
       vmovss   xmm0, dword ptr [rcx]
       vmovss   xmm1, dword ptr [rcx+04H]
       vmovss   xmm2, dword ptr [rcx+08H]
       vmovss   xmm3, dword ptr [rcx+0CH]
       vmulss   xmm0, xmm0, xmm0
       vmulss   xmm1, xmm1, xmm1
       vaddss   xmm0, xmm0, xmm1
       vmulss   xmm1, xmm2, xmm2
       vaddss   xmm0, xmm0, xmm1
       vmulss   xmm1, xmm3, xmm3
       vaddss   xmm0, xmm0, xmm1
       call     [System.Console:WriteLine(float)]
       nop      
						;; size=54 bbWeight=1    PerfScore 40.25
G_M17890_IG03:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25

; Total bytes of code 66, prolog size 7, PerfScore 50.45, instruction count 17, allocated bytes for code 77 (MethodHash=edd5ba1d) for method Program:Test3(Program+Vector4)
; ============================================================

In Test2 we still end up with separate locals that refer to the same field (which is read from an implicit byref), but we don't CSE locals, so the duplicate copies are not eliminated.

@jakobbotsch
Member

jakobbotsch commented Jul 24, 2023

Codegen for all versions is the same today. This looks like it was fixed by #81636.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 23, 2023