Accessing a field of a Vector4 causes later codegen to be inefficient if inlined #10045

tannergooding · 2018-03-27T16:42:21Z

As a simple example, look at:

static void Main(string[] args)
{
    var value = new Vector4(0, 2, 3, 4);
    var result = Test(value, 1);
    Console.WriteLine(result);
}

public static Vector4 Test(Vector4 left, float right)
{
    left.X += right;
    return left;
}

This is currently producing:

; xmm0 = { 1, 2, 3, 4 }
C4E17929442420       vmovapd  xmmword ptr [rsp+20H], xmm0
488D4C2420           lea      rcx, bword ptr [rsp+20H]
C4E17A1001           vmovss   xmm0, dword ptr [rcx]
C4E17A58056B000000   vaddss   xmm0, dword ptr [reloc @RWD12]
C4E17A1101           vmovss   dword ptr [rcx], xmm0
48B9A06E7D00F87F0000 mov      rcx, 0x7FF8007D6EA0
E8DB1B805F           call     CORINFO_HELP_NEWSFAST
488BC8               mov      rcx, rax
488D4108             lea      rax, bword ptr [rcx+8]
C4E17A10442420       vmovss   xmm0, dword ptr [rsp+20H]
C4E17A1100           vmovss   dword ptr [rax], xmm0
C4E17A10442424       vmovss   xmm0, dword ptr [rsp+24H]
C4E17A114004         vmovss   dword ptr [rax+4], xmm0
C4E17A10442428       vmovss   xmm0, dword ptr [rsp+28H]
C4E17A114008         vmovss   dword ptr [rax+8], xmm0
C4E17A1044242C       vmovss   xmm0, dword ptr [rsp+2CH]
C4E17A11400C         vmovss   dword ptr [rax+12], xmm0
E844FBFFFF           call     System.Console:WriteLine(ref)

Manually inlining this code to be:

static void Main(string[] args)
{
    var value = new Vector4(0, 2, 3, 4);
    value.X += 1;
    Console.WriteLine(value);
}

Produces the much more efficient code:

; xmm0 = { 1, 2, 3, 4 }
C4E17929442420       vmovapd  xmmword ptr [rsp+20H], xmm0
488D4C2420           lea      rcx, bword ptr [rsp+20H]
C4E17A1001           vmovss   xmm0, dword ptr [rcx]
C4E17A580547000000   vaddss   xmm0, dword ptr [reloc @RWD12]
C4E17A1101           vmovss   dword ptr [rcx], xmm0
48B9A06E7C00F87F0000 mov      rcx, 0x7FF8007C6EA0
E8CF1B815F           call     CORINFO_HELP_NEWSFAST
488BC8               mov      rcx, rax
C4E17928442420       vmovapd  xmm0, xmmword ptr [rsp+20H]
C4E179114108         vmovupd  xmmword ptr [rcx+8], xmm0
E862FBFFFF           call     System.Console:WriteLine(ref)
90                   nop

We should recognize when the accessed value is 16-bytes and just do a block-copy (as was done with the manually inlined code).

It may also be beneficial to recognize that [rsp+20H] contains the "last use" of value and that we can just reference the memory directly (without needing to copy).

category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small

The text was updated successfully, but these errors were encountered:

tannergooding · 2018-03-27T16:42:37Z

FYI. @CarolEidt, @eerhardt

CarolEidt · 2018-03-27T17:07:10Z

This is related to work item 7 in https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md, and also to #8016, though this is a bit different because it deals with a SIMD type, for which fgMorphCombineSIMDFieldAssignments() is supposed to merge these kinds of copies - though it may behave differently for a copy to an argument.

tannergooding · 2023-03-15T23:32:50Z

I fixed this a while back as part of the general refactorings to use HWIntrinsics behind the scenes for all of System.Numerics.

Current codegen is:

       48B9D0067B6FF97F0000 mov      rcx, 0x7FF96F7B06D0
       E8DA38B25F           call     CORINFO_HELP_NEWSFAST
       C5F8100522000000     vmovups  xmm0, xmmword ptr [reloc @RWD00]
       C4E37921052800000000 vinsertps xmm0, xmm0, dword ptr [reloc @RWD16], 0
       C5F8114008           vmovups  xmmword ptr [rax+08H], xmm0
       488BC8               mov      rcx, rax
       FF15DA9D2E00         call     [System.Console:WriteLine(System.Object)]
       90                   nop

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022

tannergooding closed this as completed Mar 15, 2023

ghost locked as resolved and limited conversation to collaborators Apr 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accessing a field of a Vector4 causes later codegen to be inefficient if inlined #10045

Accessing a field of a Vector4 causes later codegen to be inefficient if inlined #10045

tannergooding commented Mar 27, 2018 •

edited by BruceForstall

Loading

tannergooding commented Mar 27, 2018

CarolEidt commented Mar 27, 2018

tannergooding commented Mar 15, 2023

Accessing a field of a Vector4 causes later codegen to be inefficient if inlined #10045

Accessing a field of a Vector4 causes later codegen to be inefficient if inlined #10045

Comments

tannergooding commented Mar 27, 2018 • edited by BruceForstall Loading

tannergooding commented Mar 27, 2018

CarolEidt commented Mar 27, 2018

tannergooding commented Mar 15, 2023

tannergooding commented Mar 27, 2018 •

edited by BruceForstall

Loading