Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub-optimal codegen for incrementing (some) references #12880

Closed
kindermannhubert opened this issue Jun 13, 2019 · 7 comments
Closed

Sub-optimal codegen for incrementing (some) references #12880

kindermannhubert opened this issue Jun 13, 2019 · 7 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization
Milestone

Comments

@kindermannhubert
Copy link

kindermannhubert commented Jun 13, 2019

Am I correct if I think the codegen for Get2 method is not ideal? Mainly I don't understand the presence of cmp instruction. And the add instruction could be reduced to movzx offset, no?

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using SharpLab.Runtime;

public static class Test
{
    public static byte Get1<T>(this T[] array)
    {
        return Unsafe.As<Exposer>(array).Byte0;
    }
    
    public static byte Get2<T>(this T[] array)
    {
        return Unsafe.AddByteOffset(ref Unsafe.As<Exposer>(array).Byte0, (IntPtr)4); //constant is arbitrary
    }
}

[StructLayout(LayoutKind.Explicit)]
internal class Exposer
{
    [FieldOffset(0)]
    public byte Byte0;
}

For x64 Core / Release:

Test.Get1[[System.Int32, System.Private.CoreLib]](Int32[])
    L0000: movzx eax, byte [rcx+0x8]
    L0004: ret

Test.Get2[[System.Int32, System.Private.CoreLib]](Int32[])
    L0000: cmp [rcx], ecx
    L0002: add rcx, 0x8
    L0006: movzx eax, byte [rcx+0x4]
    L000a: ret

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4a9Jq049+ggJI8ZEMfKhKV6zY1kALbFDEAZbMHbc+AjdU0BmbSQmUgYAFVUMGgBvGgY4hgBtACleDABxGC4ZZQAKDABPMRgIADMc3h4ASkqAXVj44gDmIOB8jBgGDIxyAB5QgD48u15cMISahkcobHzK+riY6njlpgB2BgBVLlxsEsEAQVwegFEEMQhcGUGpmcqWACE2mDofZYBfeYZP5NSMrKhcgUiqVylVap9GoEGK12p0YBhSH1BhhhqNQuNJlBprNPosVg11lsdnsWPsACZkx7tADyJRKlwwOVgJU2212ByOp3Olyg1yxtweTzoaAYOSMGAAChgoJUUJVXvEPtQlTQErJpRwwBhXPkIBxGTq9RgANIVMksLmSZSpcHUCrtKBcbCSYIMLkXGTRb4AMV4MEkZNp9PhOTotuWkJhHSpzx8byAA==

What I'm trying to achieve is all checks free array access. I have extension method for Span and T[]. I'm satisfied with codegen for Span variant but for array I think it could be improved.

public static class Ext
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T GetDangerous<T>(this T[] array, int i)
    {
        return Unsafe.Add(
            ref Unsafe.As<byte, T>(
                ref Unsafe.AddByteOffset(
                    ref Unsafe.As<Exposer>(array).Byte0,
                    TypeData<T>.ArrayFirstItemByteOffset)),
            i);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T GetDangerous<T>(this Span<T> span, int i)
    {
        return Unsafe.Add(ref MemoryMarshal.GetReference(span), i);
    }
}

internal class TypeData<T>
{
    public static readonly IntPtr ArrayFirstItemByteOffset = GetArrayFirstItemByteOffset();

    private static unsafe IntPtr GetArrayFirstItemByteOffset()
    {
        var array = new T[1];
        return Unsafe.ByteOffset(ref Unsafe.As<Exposer>(array).Byte0, ref Unsafe.As<T, byte>(ref array[0]));
    }
}

[StructLayout(LayoutKind.Explicit)]
internal class Exposer
{
    [FieldOffset(0)]
    public byte Byte0;
}
Ext.GetDangerous[[System.Int32, System.Private.CoreLib]](Int32[], Int32)
    L0000: cmp [rcx], ecx
    L0002: add rcx, 0x8
    L0006: movsxd rax, edx
    L0009: mov eax, [rcx+rax*4+0x8]
    L000d: ret

Ext.GetDangerous[[System.Int32, System.Private.CoreLib]](System.Span`1<Int32>, Int32)
    L0000: mov rax, [rcx]
    L0003: movsxd rdx, edx
    L0006: mov eax, [rax+rdx*4]
    L0009: ret

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4a9Jq049+ggJI8ZEMfKhKV6zY1kALbFDEAZbMHbc+AjdU0BmbSQmUgYAUQQMGgBvGgY4hgBtACleDABxGC4ZZQAKDABPMRgIADMc3h4ASkqAXVj4hIBZGAw7CAATA3FJHObWjq6xSQB5MT4ILlwWAEEAc1nYXFxeBRgjSQqK2dr6uOIA5iCAFQYMjAARbC5Zkw5cAB4jgD48u15cBiOEmoZHKGx8mgGBUMMDKrsGDFqPEYUwAOwMACqk2wJUE03a7RyENh8VgJSRKLRMwewHyGBgQOe2OhuLpcXxhNwqPRmIAQuSYMMSiVcC0afTBXiYATkczidMHuExBA+VAXn8AZUWByKXQ0DihbCckYMAAFDBQSpHQowS4YbAzKD/fIAMV4UFwGAMFPwqq5PL5GGqGtpQt4lR8MIAvhCIclUhkslBcgUiqVylUdn7En02p1ur0WunBiMxrwJlM5gtVMtVutNtdkzD9oFPqcWpdrrcHtTWu8GLIxFdHk8GLhu1wgSCwRCofTiAixSyZpicozmvhoPlGo5cA5JCwzmwRTJMiocgOrpVh4GIaHqBeaCCZFxsJJgp9TebsNEIQB6d/2RwuNwMdoQKoDBcBAoK4BwYgylAoKHMCXDKLw968AAXtg4xcBCtaQJMoIjtM1oAvajrOq67rcryLQMAAvAwAAcPhXtQCSyIaHBgBgrj5BAHAYDknHcRgADSFTtCw0obGAqTJjeUB3g+ZBhAgMpym+KYJPaMCSO05FejkdDVvEtZkhSDDunQDFAA=

Or is there any simpler way to achieve this?

Thanks!

category:cq
theme:basic-cq
skill-level:expert
cost:medium
impact:small

@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

@AndyAyersMS
Copy link
Member

Yes, codegen for Get2 could be better... at first blush it looks like the conversion of the constant offset to IntPtr is blocking the jit's ability to optimize. Let me look deeper.

@AndyAyersMS
Copy link
Member

Actually the jit copes with the constant IntPtr ok. The issue is that in Get2 we break the expression tree up to enable inlining and don't ever patch it back together. So we end up (after a bit of optimization) with two trees, one that forms &array[0] and null checks, and the other that adds 4 and indirects:

***** BB01, stmt 1
     (  5,  5) [000015] ------------              *  STMT      void  (IL   ???...  ???)
N008 (  5,  5) [000014] -A-XG---R---              \--*  ASG       byref 
N007 (  1,  1) [000013] D------N----                 +--*  LCL_VAR   byref  V03 tmp1         d:1
N006 (  5,  5) [000054] ---XG--N----                 \--*  COMMA     byref 
N002 (  2,  2) [000050] ---X---N----                    +--*  NULLCHECK byte  
N001 (  1,  1) [000049] ------------                    |  \--*  LCL_VAR   ref    V01 arg0         u:1
N005 (  3,  3) [000053] ------------                    \--*  ADD       byref 
N003 (  1,  1) [000051] ------------                       +--*  LCL_VAR   ref    V01 arg0         u:1 (last use)
N004 (  1,  1) [000052] ------------                       \--*  CNS_INT   long   8 field offset Fseq[Byte0]

***** BB01, stmt 2
     (  6,  6) [000025] ------------              *  STMT      void  (IL   ???...  ???)
N005 (  6,  6) [000024] ---XG-------              \--*  RETURN    int   
N004 (  5,  5) [000023] *--XG-------                 \--*  IND       ubyte 
N003 (  2,  2) [000047] -------N----                    \--*  ADD       byref 
N001 (  1,  1) [000016] ------------                       +--*  LCL_VAR   byref  V03 tmp1         u:1 (last use)
N002 (  1,  1) [000037] ------------                       \--*  CNS_INT   long   4

and that's what we end up generating code for.

Ideally the jit would notice that tmp1 has exactly one use and between def and use there is nothing that blocks recombining the trees. Doing that would at least fold the arithmetic onto the indir. Realizing that the nullcheck can then be done implicitly by the indir is perhaps a bit trickier...

We don't see this for the span case because there's no byref created (so no need for an early null check) and no extra offset to add in to get to the elements.

Since fixing this seems somewhat involved, the "future" milestone seems appropriate.

@erozenfeld
Copy link
Member

Doing that would at least fold the arithmetic onto the indir. Realizing that the nullcheck can then be done implicitly by the indir is perhaps a bit trickier...

This is something optFoldNullCheck should be able to handle. #12472 tracks possible improvements to optFoldNullCheck.

@mikedn
Copy link
Contributor

mikedn commented Jun 25, 2019

And FWIW the trivial forward substition I have in mikedn/coreclr@a19eaa7 does handle this particular case of broken trees:

G_M19078_IG01:
       0F1F440000           nop
G_M19078_IG02:
       3909                 cmp      dword ptr [rcx], ecx
       0FB6410C             movzx    rax, byte  ptr [rcx+12]
G_M19078_IG03:
       C3                   ret

But that needs more work and I'm far from convinced that such trivial cases need to be handled so late during compilation. We should try not to break trees to begin with...

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@Sergio0694
Copy link
Contributor

Possibly related to this issue, I've noticed another weird detail in the codegen for a very similar extension method to the one in the first message in this issue. Consider this code (I'm assuming to be running on .NET Core 2.1/3.1 or .NET Native, this could fail on eg. Mono due to different memory layout):

public static class ArrayExtensions
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ref float DangerousGetReferenceAt(this float[] array, int i)
    {
        var arrayData = Unsafe.As<RawArrayData>(array);
        ref float r0 = ref Unsafe.As<byte, float>(ref arrayData.Data);
        ref float ri = ref Unsafe.Add(ref r0, i);

        return ref ri;
    }

    [StructLayout(LayoutKind.Sequential)]
    private sealed class RawArrayData
    {
        public IntPtr Length;
        public byte Data;
    }
}

According to sharplab.io (here), this yields:

ArrayExtensions.DangerousGetReferenceAt(Single[], Int32)
    L0000: cmp [rcx], ecx
    L0002: add rcx, 0x10
    L0006: movsxd rax, edx
    L0009: lea rax, [rcx+rax*4]
    L000d: ret

That first cmp has already been mentioned here and it's a known issue, but why is that add rcx, 0x10 instruction there, instead of just using lea rax, [rcx+rax*4+0x10] afterwards? As in:

ArrayExtensions.DangerousGetReferenceAt(Single[], Int32)
    L0000: movsxd rax, edx
    L0003: lea rax, [rcx+rax*4+0x10]
    L0009: ret

Same thing happens when the extension is generic and it's being used to actually load a value:

image

I would've expected that vmovss to just be a single vmovss xmm0, dword [rcx+rax*4+0x10].

Hope this helps, let me know if this is a known issue as well or if I should open a separate issue.

Thanks! 😄

cc. @tannergooding in case you're interested in following up on this.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@TIHan TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022
@jakobbotsch
Copy link
Member

All the cases here generate ideal codegen today:

; Assembly listing for method Test2:Get1[int](int[]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd single-def <int[]>
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M46567_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M46567_IG02:  ;; offset=0x0000
       movzx    rax, byte  ptr [rcx+0x08]
						;; size=4 bbWeight=1 PerfScore 2.00
G_M46567_IG03:  ;; offset=0x0004
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 5, prolog size 0, PerfScore 3.00, instruction count 2, allocated bytes for code 5 (MethodHash=3dc24a18) for method Test2:Get1[int](int[]):ubyte (FullOpts)
; ============================================================

; Assembly listing for method Test2:Get2[int](int[]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd single-def <int[]>
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M50340_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M50340_IG02:  ;; offset=0x0000
       movzx    rax, byte  ptr [rcx+0x0C]
						;; size=4 bbWeight=1 PerfScore 2.00
G_M50340_IG03:  ;; offset=0x0004
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 5, prolog size 0, PerfScore 3.00, instruction count 2, allocated bytes for code 5 (MethodHash=02df3b5b) for method Test2:Get2[int](int[]):ubyte (FullOpts)
; ============================================================

; Assembly listing for method Test:GetDangerous[int](int[],int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )     ref  ->  rcx         class-hnd single-def <int[]>
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M29253_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M29253_IG02:  ;; offset=0x0000
       cmp      byte  ptr [rcx], cl
       movsxd   rax, edx
       mov      eax, dword ptr [rcx+4*rax+0x10]
						;; size=9 bbWeight=1 PerfScore 5.25
G_M29253_IG03:  ;; offset=0x0009
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 10, prolog size 0, PerfScore 6.25, instruction count 4, allocated bytes for code 10 (MethodHash=6f148dba) for method Test:GetDangerous[int](int[],int):int (FullOpts)
; ============================================================

; Assembly listing for method Test:GetDangerous[int](System.Span`1[int],int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx         single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg" <System.Span`1[int]>
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "field V00._reference (fldOffset=0x0)" P-INDEP
;* V05 tmp3         [V05    ] (  0,  0   )     int  ->  zero-ref    "field V00._length (fldOffset=0x8)" P-INDEP
;  V06 tmp4         [V06,T02] (  2,  2   )   byref  ->  rax         single-def "field V03._reference (fldOffset=0x0)" P-INDEP
;* V07 tmp5         [V07    ] (  0,  0   )     int  ->  zero-ref    "field V03._length (fldOffset=0x8)" P-INDEP
;* V08 tmp6         [V08    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref" <System.Span`1[int]>
;
; Lcl frame size = 0

G_M39219_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M39219_IG02:  ;; offset=0x0000
       mov      rax, bword ptr [rcx]
       movsxd   rcx, edx
       mov      eax, dword ptr [rax+4*rcx]
						;; size=9 bbWeight=1 PerfScore 4.25
G_M39219_IG03:  ;; offset=0x0009
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 10, prolog size 0, PerfScore 5.25, instruction count 4, allocated bytes for code 10 (MethodHash=92f066cc) for method Test:GetDangerous[int](System.Span`1[int],int):int (FullOpts)
; ============================================================

; Assembly listing for method ArrayExtensions:DangerousGetReferenceAt(float[],int):byref (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )     ref  ->  rcx         class-hnd single-def <float[]>
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M22385_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M22385_IG02:  ;; offset=0x0000
       cmp      byte  ptr [rcx], cl
       movsxd   rax, edx
       lea      rax, bword ptr [rcx+4*rax+0x10]
						;; size=10 bbWeight=1 PerfScore 4.25
G_M22385_IG03:  ;; offset=0x000A
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 11, prolog size 0, PerfScore 5.25, instruction count 4, allocated bytes for code 11 (MethodHash=cbc7a88e) for method ArrayExtensions:DangerousGetReferenceAt(float[],int):byref (FullOpts)
; ============================================================

@github-actions github-actions bot locked and limited conversation to collaborators May 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization
Projects
None yet
Development

No branches or pull requests

10 participants