Add support for optimized AVX2/SSE2 multiply for VectorXXX<ulong/long> #87113
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Fixes #87107 by adding support for optimized AVX2/SSE2 versions of:

Vector256<ulong> * Vector256<ulong> and long with AVX2
Vector128<ulong> * Vector128<ulong> and long with SSE2

Before:

; Assembly listing for method System.IO.Hashing.XxHashShared:Multiply(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
56 push rsi
4881EC90000000 sub rsp, 144
C5F877 vzeroupper
488BF1 mov rsi, rcx
G_M000_IG02: ;; offset=000EH
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5FD11442440 vmovupd ymmword ptr[rsp+40H], ymm0
C4C17D1000 vmovupd ymm0, ymmword ptr[r8]
C5FD11442420 vmovupd ymmword ptr[rsp+20H], ymm0
488B4C2440 mov rcx, qword ptr [rsp+40H]
488B542420 mov rdx, qword ptr [rsp+20H]
FF15FDFE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442460 mov qword ptr [rsp+60H], rax
488B4C2448 mov rcx, qword ptr [rsp+48H]
488B542428 mov rdx, qword ptr [rsp+28H]
FF15E8FE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442468 mov qword ptr [rsp+68H], rax
488B4C2450 mov rcx, qword ptr [rsp+50H]
488B542430 mov rdx, qword ptr [rsp+30H]
FF15D3FE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442470 mov qword ptr [rsp+70H], rax
488B4C2458 mov rcx, qword ptr [rsp+58H]
488B542438 mov rdx, qword ptr [rsp+38H]
FF15BEFE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442478 mov qword ptr [rsp+78H], rax
C5FD10442460 vmovupd ymm0, ymmword ptr[rsp+60H]
C5FD1106 vmovupd ymmword ptr[rsi], ymm0
488BC6 mov rax, rsi
G_M000_IG03: ;; offset=0084H
C5F877 vzeroupper
4881C490000000 add rsp, 144
5E pop rsi
C3 ret
; Total bytes of code 144

After:

; Assembly listing for method System.IO.Hashing.XxHashShared:Multiply(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5F573D020 vpsrlq ymm1, ymm0, 32
C4C17D1010 vmovupd ymm2, ymmword ptr[r8]
C4E27540CA vpmulld ymm1, ymm1, ymm2
C5E573D220 vpsrlq ymm3, ymm2, 32
C4E27D40DB vpmulld ymm3, ymm0, ymm3
C5F5D4CB vpaddq ymm1, ymm1, ymm3
C5F573F120 vpsllq ymm1, ymm1, 32
C4E27D40C2 vpmulld ymm0, ymm0, ymm2
C5F5D4C0 vpaddq ymm0, ymm1, ymm0
C5FD1101 vmovupd ymmword ptr[rcx], ymm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0039H
C5F877 vzeroupper
C3 ret
; Total bytes of code 61

Not sure where to add tests. Guidance welcome.
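For illustration only (not guidance from the thread), a lane-by-lane check against scalar multiplication could serve as a starting point for such a test; the placement and values below are arbitrary assumptions:

// Sketch of a possible test: compares the vector multiply against scalar
// ulong multiplication for each lane. Values are illustrative only.
using System;
using System.Runtime.Intrinsics;

Vector256<ulong> left = Vector256.Create(0x9E3779B185EBCA87ul, 0xC2B2AE3D27D4EB4Ful, 2ul, ulong.MaxValue);
Vector256<ulong> right = Vector256.Create(0x165667B19E3779F9ul, 3ul, ulong.MaxValue, ulong.MaxValue);

Vector256<ulong> product = left * right;

for (int i = 0; i < Vector256<ulong>.Count; i++)
{
    if (product.GetElement(i) != left.GetElement(i) * right.GetElement(i))
    {
        throw new Exception($"Mismatch at lane {i}");
    }
}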
left._lower * right._lower,
left._upper * right._upper
);
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
Nice!
Perhaps it's worth leaving a comment that this path is not taken when AVX-512 is present?
// xh * yl
Vector128<ulong> t1 = Sse2.Multiply(Sse2.ShiftRightLogical(left.AsUInt64(), 32).AsUInt32(), right.AsUInt32());
// xl * yh
Vector128<ulong> t2 = Sse2.Multiply(left.AsUInt32(), Sse2.ShiftRightLogical(right.AsUInt64(), 32).AsUInt32());
// (xh * yl + xl * yh) * 2^32
Vector128<ulong> t3 = Sse2.ShiftLeftLogical((t1.AsUInt64() + t2.AsUInt64()), 32);
// xl * yl
Vector128<ulong> t4 = Sse2.Multiply(left.AsUInt32(), right.AsUInt32());
// (xh * yl + xl * yh) * 2^32 + xl * yl
return Unsafe.BitCast<Vector128<ulong>, Vector128<T>>(t3 + t4);
This is likely going to have a negative impact on some compilers (such as Mono, particularly for the interpreter).

Given we already have a functionally correct fallback implementation for int/uint that does the same thing, it's very unclear why we aren't doing the same thing and just sharing the logic for the long/ulong path: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19843-L19878

-- If we were going to have the managed implementation and could show that it doesn't hurt the Mono scenario, then it'd notably also be better to use (t3 + t4).As<ulong, T>() rather than BitCast, to use left.AsUInt64() >>> 32 over ShiftRightLogical, and to use (t1 + t2) << 32 over ShiftLeftLogical (the .AsUInt64() calls are unnecessary here).
(such as Mono, particularly for the interpreter).
Why would the Mono interpreter execute Vector__.op_Multiply in the first place - it doesn't support SIMD, does it?
But even if this is a problem for mono - it can be solved with just #if !MONO
Mono in general has been explicitly adding some SIMD support to ensure that code is not pessimized. This includes support in the interpreter: #86738
But even if this is a problem for mono - it can be solved with just #if !MONO
That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.
While it'd be great to someday be able to move a lot of this into managed code, I don't think we're at that point yet. We still have quite a few places where the inliner can just give up, particularly in large/complex methods. Having a known hot-path function have the potential to not inline isn't great.
We also already have the relevant logic in the JIT to handle this, it just needs to be shared between the int/uint and long/ulong paths.
That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.
I don't think it makes any sense here - you either take this fast path, or waste the whole inliner's budget on whatever scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet, the Scalar`1[ulong]:Multiply calls are already not inlined due to the budget. So my point is that this PR has no cons compared to the current impl; in fact, it eats less of the inliner's budget than the code in main.
We also already have the relevant logic in the JIT to handle this, it just needs to be shared between the int/uint and long/ulong paths.
I hadn't noticed the existing logic for GT_MUL that looks similar; if it can be easily shared, then sure. If not, the current change as-is is better than what we have today.
I don't think it makes any sense here - you either take this fast path, or waste the whole inliner's budget on whatever the scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet
I meant, in particular, the inliner vs having this use the intrinsic path in the JIT. I agree that it's better than what we have now, just that it's less ideal than what it could/should be for something we know we're using in a hot path.
@tannergooding Fair enough. Closing this PR then; I don't have enough Sunday time to figure out how to properly implement this with AVX2 and SSE2 in the cpp version. 😅 You will be more efficient than me at modifying the cpp codepath.
Will have a PR up shortly. Going to eat breakfast and finish up my current PR first.
@tannergooding any news about the PR? Otherwise it'd be better to just land this simple change for .NET 8.0. Someone has just complained about XXHASH perf again in #90090 and I believe it's the same issue.

Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b)
    => a * b;

Current codegen without AVX-512:

; Method ConsoleApp1.Program:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this (FullOpts)
G_M14587_IG01: ;; offset=0000H
push rdi
push rsi
push rbp
push rbx
sub rsp, 248
vzeroupper
vmovaps xmmword ptr [rsp+E0H], xmm6
mov rbx, rdx
mov rsi, r8
mov rdi, r9
;; size=32 bbWeight=4 PerfScore 32.00
G_M14587_IG02: ;; offset=0020H
vmovups xmm0, xmmword ptr [rsi]
vmovaps xmmword ptr [rsp+D0H], xmm0
vmovups xmm0, xmmword ptr [rdi]
vmovaps xmmword ptr [rsp+C0H], xmm0
mov rdx, qword ptr [rsp+D0H]
mov qword ptr [rsp+B0H], rdx
mov rdx, qword ptr [rsp+C0H]
mov qword ptr [rsp+A8H], rdx
mov rdx, qword ptr [rsp+B0H]
imul rdx, qword ptr [rsp+A8H]
mov qword ptr [rsp+B8H], rdx
mov rbp, qword ptr [rsp+B8H]
mov rdx, qword ptr [rsp+D8H]
mov qword ptr [rsp+98H], rdx
mov rdx, qword ptr [rsp+C8H]
mov qword ptr [rsp+90H], rdx
mov rcx, qword ptr [rsp+98H]
mov rdx, qword ptr [rsp+90H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+A0H], rax
mov rdx, qword ptr [rsp+A0H]
mov qword ptr [rsp+80H], rbp
mov qword ptr [rsp+88H], rdx
vmovaps xmm6, xmmword ptr [rsp+80H]
vmovups xmm0, xmmword ptr [rsi+10H]
vmovaps xmmword ptr [rsp+70H], xmm0
vmovups xmm0, xmmword ptr [rdi+10H]
vmovaps xmmword ptr [rsp+60H], xmm0
mov rdx, qword ptr [rsp+70H]
mov qword ptr [rsp+50H], rdx
mov rdx, qword ptr [rsp+60H]
mov qword ptr [rsp+48H], rdx
mov rcx, qword ptr [rsp+50H]
mov rdx, qword ptr [rsp+48H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+58H], rax
mov rsi, qword ptr [rsp+58H]
mov rdx, qword ptr [rsp+78H]
mov qword ptr [rsp+38H], rdx
mov rdx, qword ptr [rsp+68H]
mov qword ptr [rsp+30H], rdx
mov rcx, qword ptr [rsp+38H]
mov rdx, qword ptr [rsp+30H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+40H], rax
mov rax, qword ptr [rsp+40H]
mov qword ptr [rsp+20H], rsi
mov qword ptr [rsp+28H], rax
vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
vmovups ymmword ptr [rbx], ymm0
mov rax, rbx
;; size=325 bbWeight=4 PerfScore 309.00
G_M14587_IG03: ;; offset=0165H
vmovaps xmm6, xmmword ptr [rsp+E0H]
vzeroupper
add rsp, 248
pop rbx
pop rbp
pop rsi
pop rdi
ret
;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381
@EgorBo, it's still marked as draft and hasn't been looked at since it wasn't viewed as "critical" compared to other fixes that had to go in for correctness, etc.

I believe the code in this PR isn't quite correct either; there were failures seen at the time and when I took the code "as is" into the new PR. Which means there is additional risk that may be undesirable this late...

The standard algorithm we use elsewhere is:

uint al = (uint)a;
uint ah = (uint)(a >>> 32);
uint bl = (uint)b;
uint bh = (uint)(b >>> 32);
ulong mull = ((ulong)al) * bl;
ulong t = ((ulong)ah) * bl + (mull >>> 32);
ulong tl = ((ulong)al) * bh + (uint)t;
return (tl << 32) | (uint)mull;

We have:

result[0] = (ulong)a[0] * b[0];
result[1] = (ulong)a[2] * b[2];

and so a vectorized translation of this would end up being something like:

Vector128<uint> al = a.AsUInt32();
Vector128<uint> ah = (a >>> 32).AsUInt32();
Vector128<uint> bl = b.AsUInt32();
Vector128<uint> bh = (b >>> 32).AsUInt32();
Vector128<ulong> mull = Sse2.Multiply(al, bl);
Vector128<ulong> t = Sse2.Multiply(ah, bl) + (mull >>> 32);
Vector128<ulong> tl = Sse2.Multiply(al, bh) + (t & Vector128.Create(0xFFFF_FFFFul));
return (tl << 32) | (mull & Vector128.Create(0xFFFF_FFFFul));
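Presumably the Vector256 path follows the same shape with Avx2.Multiply; the sketch below is an assumption mirroring the Vector128 snippet above, not code from this PR or the follow-up, and the class/method names are hypothetical:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class Vector256MultiplySketch
{
    // Sketch only: low 64 bits of each lane-wise 64x64 multiply via AVX2,
    // mirroring the Vector128/SSE2 translation above.
    public static Vector256<ulong> MultiplyLow(Vector256<ulong> a, Vector256<ulong> b)
    {
        Vector256<uint> al = a.AsUInt32();
        Vector256<uint> ah = (a >>> 32).AsUInt32();
        Vector256<uint> bl = b.AsUInt32();
        Vector256<uint> bh = (b >>> 32).AsUInt32();

        Vector256<ulong> mull = Avx2.Multiply(al, bl);
        Vector256<ulong> t = Avx2.Multiply(ah, bl) + (mull >>> 32);
        Vector256<ulong> tl = Avx2.Multiply(al, bh) + (t & Vector256.Create(0xFFFF_FFFFul));

        return (tl << 32) | (mull & Vector256.Create(0xFFFF_FFFFul));
    }
}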