Add support for optimized AVX2/SSE2 multiply for VectorXXX<ulong/long> #87113
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Fixes #87107 by adding support for optimized AVX2/SSE2 versions of:

Vector256<ulong> * Vector256<ulong> and long with AVX2
Vector128<ulong> * Vector128<ulong> and long with SSE2

Before:

; Assembly listing for method System.IO.Hashing.XxHashShared:Multiply(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
56 push rsi
4881EC90000000 sub rsp, 144
C5F877 vzeroupper
488BF1 mov rsi, rcx
G_M000_IG02: ;; offset=000EH
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5FD11442440 vmovupd ymmword ptr[rsp+40H], ymm0
C4C17D1000 vmovupd ymm0, ymmword ptr[r8]
C5FD11442420 vmovupd ymmword ptr[rsp+20H], ymm0
488B4C2440 mov rcx, qword ptr [rsp+40H]
488B542420 mov rdx, qword ptr [rsp+20H]
FF15FDFE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442460 mov qword ptr [rsp+60H], rax
488B4C2448 mov rcx, qword ptr [rsp+48H]
488B542428 mov rdx, qword ptr [rsp+28H]
FF15E8FE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442468 mov qword ptr [rsp+68H], rax
488B4C2450 mov rcx, qword ptr [rsp+50H]
488B542430 mov rdx, qword ptr [rsp+30H]
FF15D3FE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442470 mov qword ptr [rsp+70H], rax
488B4C2458 mov rcx, qword ptr [rsp+58H]
488B542438 mov rdx, qword ptr [rsp+38H]
FF15BEFE2C00 call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
4889442478 mov qword ptr [rsp+78H], rax
C5FD10442460 vmovupd ymm0, ymmword ptr[rsp+60H]
C5FD1106 vmovupd ymmword ptr[rsi], ymm0
488BC6 mov rax, rsi
G_M000_IG03: ;; offset=0084H
C5F877 vzeroupper
4881C490000000 add rsp, 144
5E pop rsi
C3 ret
; Total bytes of code 144

After:

; Assembly listing for method System.IO.Hashing.XxHashShared:Multiply(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5F573D020 vpsrlq ymm1, ymm0, 32
C4C17D1010 vmovupd ymm2, ymmword ptr[r8]
C4E27540CA vpmulld ymm1, ymm1, ymm2
C5E573D220 vpsrlq ymm3, ymm2, 32
C4E27D40DB vpmulld ymm3, ymm0, ymm3
C5F5D4CB vpaddq ymm1, ymm1, ymm3
C5F573F120 vpsllq ymm1, ymm1, 32
C4E27D40C2 vpmulld ymm0, ymm0, ymm2
C5F5D4C0 vpaddq ymm0, ymm1, ymm0
C5FD1101 vmovupd ymmword ptr[rcx], ymm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0039H
C5F877 vzeroupper
C3 ret
; Total bytes of code 61

Not sure where to add tests. Guidance welcome.
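For illustration only (not guidance from the thread), a lane-by-lane check against scalar multiplication could serve as a starting point for such a test; the placement and values below are arbitrary assumptions:

// Sketch of a possible test: compares the vector multiply against scalar
// ulong multiplication for each lane. Values are illustrative only.
using System;
using System.Runtime.Intrinsics;

Vector256<ulong> left = Vector256.Create(0x9E3779B185EBCA87ul, 0xC2B2AE3D27D4EB4Ful, 2ul, ulong.MaxValue);
Vector256<ulong> right = Vector256.Create(0x165667B19E3779F9ul, 3ul, ulong.MaxValue, ulong.MaxValue);

Vector256<ulong> product = left * right;

for (int i = 0; i < Vector256<ulong>.Count; i++)
{
    if (product.GetElement(i) != left.GetElement(i) * right.GetElement(i))
    {
        throw new Exception($"Mismatch at lane {i}");
    }
}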
left._lower * right._lower,
left._upper * right._upper
);
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
Nice!
Perhaps it's worth leaving a comment that this path is not taken when AVX-512 is present?
// xh * yl
Vector128<ulong> t1 = Sse2.Multiply(Sse2.ShiftRightLogical(left.AsUInt64(), 32).AsUInt32(), right.AsUInt32());
// xl * yh
Vector128<ulong> t2 = Sse2.Multiply(left.AsUInt32(), Sse2.ShiftRightLogical(right.AsUInt64(), 32).AsUInt32());
// (xh * yl + xl * yh) * 2^32
Vector128<ulong> t3 = Sse2.ShiftLeftLogical((t1.AsUInt64() + t2.AsUInt64()), 32);
// xl * yl
Vector128<ulong> t4 = Sse2.Multiply(left.AsUInt32(), right.AsUInt32());
// (xh * yl + xl * yh) * 2^32 + xl * yl
return Unsafe.BitCast<Vector128<ulong>, Vector128<T>>(t3 + t4);
This is likely going to have a negative impact on some compilers (such as Mono, particularly for the interpreter).

Given we already have a functionally correct fallback implementation for int/uint that does the same thing, it's very unclear why we aren't doing the same thing and just sharing the logic for the long/ulong path: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19843-L19878

-- If we were going to have the managed implementation and could show that it doesn't hurt the Mono scenario, then it'd notably also be better to use (t3 + t4).As<ulong, T>() rather than BitCast, to use left.AsUInt64() >>> 32 over ShiftRightLogical, and to use (t1 + t2) << 32 over ShiftLeftLogical (the .AsUInt64() calls are unnecessary here).
(such as Mono, particularly for the interpreter).
Why would the Mono interpreter execute Vector__.op_Multiply in the first place - it doesn't support SIMD, does it?
But even if this is a problem for mono - it can be solved with just #if !MONO
Mono in general has been explicitly adding some SIMD support to ensure that code is not pessimized. This includes support in the interpreter: #86738
But even if this is a problem for mono - it can be solved with just #if !MONO
That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.
While it'd be great to someday be able to move a lot of this into managed code, I don't think we're at that point yet. We still have quite a few places where the inliner can just give up, particularly in large/complex methods. Having a known hot-path function have the potential to not inline isn't great.
We also already have the relevant logic in the JIT to handle this, it just needs to be shared between the int/uint and long/ulong paths.
That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.
I don't think it makes any sense here - you either take this fast path, or waste the whole inliner's budget on whatever scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet, the Scalar`1[ulong]:Multiply calls are already not inlined due to the budget. So my point is that this PR has no cons compared to the current impl; in fact, it eats less of the inliner's budget than the code in main.
We also already have the relevant logic in the JIT to handle this, it just needs to be shared between the int/uint and long/ulong paths.
I hadn't noticed the existing logic for GT_MUL that looks similar; if it can be easily shared, then sure. If not, the current change as-is is better than what we have today.
I don't think it makes any sense here - you either take this fast path, or waste the whole inliner's budget on whatever the scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet
I meant, in particular, the inliner vs having this use the intrinsic path in the JIT. I agree that it's better than what we have now, just that it's less ideal than what it could/should be for something we know we're using in a hot path.
@tannergooding Fair enough. Closing this PR then; I don't have enough Sunday time to figure out how to properly implement this with AVX2 and SSE2 in the cpp version. 😅 You will be more efficient than me at modifying the cpp codepath.
Will have a PR up shortly. Going to eat breakfast and finish up my current PR first.
@tannergooding any news about the PR? Otherwise it'd be better to just land this simple change for .NET 8.0. Someone has just complained about XXHASH perf again in #90090 and I believe it's the same issue.

Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b)
    => a * b;

Current codegen without AVX-512:

; Method ConsoleApp1.Program:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this (FullOpts)
G_M14587_IG01: ;; offset=0000H
push rdi
push rsi
push rbp
push rbx
sub rsp, 248
vzeroupper
vmovaps xmmword ptr [rsp+E0H], xmm6
mov rbx, rdx
mov rsi, r8
mov rdi, r9
;; size=32 bbWeight=4 PerfScore 32.00
G_M14587_IG02: ;; offset=0020H
vmovups xmm0, xmmword ptr [rsi]
vmovaps xmmword ptr [rsp+D0H], xmm0
vmovups xmm0, xmmword ptr [rdi]
vmovaps xmmword ptr [rsp+C0H], xmm0
mov rdx, qword ptr [rsp+D0H]
mov qword ptr [rsp+B0H], rdx
mov rdx, qword ptr [rsp+C0H]
mov qword ptr [rsp+A8H], rdx
mov rdx, qword ptr [rsp+B0H]
imul rdx, qword ptr [rsp+A8H]
mov qword ptr [rsp+B8H], rdx
mov rbp, qword ptr [rsp+B8H]
mov rdx, qword ptr [rsp+D8H]
mov qword ptr [rsp+98H], rdx
mov rdx, qword ptr [rsp+C8H]
mov qword ptr [rsp+90H], rdx
mov rcx, qword ptr [rsp+98H]
mov rdx, qword ptr [rsp+90H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+A0H], rax
mov rdx, qword ptr [rsp+A0H]
mov qword ptr [rsp+80H], rbp
mov qword ptr [rsp+88H], rdx
vmovaps xmm6, xmmword ptr [rsp+80H]
vmovups xmm0, xmmword ptr [rsi+10H]
vmovaps xmmword ptr [rsp+70H], xmm0
vmovups xmm0, xmmword ptr [rdi+10H]
vmovaps xmmword ptr [rsp+60H], xmm0
mov rdx, qword ptr [rsp+70H]
mov qword ptr [rsp+50H], rdx
mov rdx, qword ptr [rsp+60H]
mov qword ptr [rsp+48H], rdx
mov rcx, qword ptr [rsp+50H]
mov rdx, qword ptr [rsp+48H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+58H], rax
mov rsi, qword ptr [rsp+58H]
mov rdx, qword ptr [rsp+78H]
mov qword ptr [rsp+38H], rdx
mov rdx, qword ptr [rsp+68H]
mov qword ptr [rsp+30H], rdx
mov rcx, qword ptr [rsp+38H]
mov rdx, qword ptr [rsp+30H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+40H], rax
mov rax, qword ptr [rsp+40H]
mov qword ptr [rsp+20H], rsi
mov qword ptr [rsp+28H], rax
vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
vmovups ymmword ptr [rbx], ymm0
mov rax, rbx
;; size=325 bbWeight=4 PerfScore 309.00
G_M14587_IG03: ;; offset=0165H
vmovaps xmm6, xmmword ptr [rsp+E0H]
vzeroupper
add rsp, 248
pop rbx
pop rbp
pop rsi
pop rdi
ret
;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381
@EgorBo, it's still marked as draft and hasn't been looked at since it wasn't viewed as "critical" compared to other fixes that had to go in for correctness, etc.

I believe the code in this PR isn't quite correct either; there were failures seen at the time and when I took the code "as is" into the new PR. Which means there is additional risk that may be undesirable this late...

The standard algorithm we use elsewhere is:

uint al = (uint)a;
uint ah = (uint)(a >>> 32);
uint bl = (uint)b;
uint bh = (uint)(b >>> 32);
ulong mull = ((ulong)al) * bl;
ulong t = ((ulong)ah) * bl + (mull >>> 32);
ulong tl = ((ulong)al) * bh + (uint)t;
return (tl << 32) | (uint)mull;

We have:

result[0] = (ulong)a[0] * b[0];
result[1] = (ulong)a[2] * b[2];

and so a vectorized translation of this would end up being something like:

Vector128<uint> al = a.AsUInt32();
Vector128<uint> ah = (a >>> 32).AsUInt32();
Vector128<uint> bl = b.AsUInt32();
Vector128<uint> bh = (b >>> 32).AsUInt32();
Vector128<ulong> mull = Sse2.Multiply(al, bl);
Vector128<ulong> t = Sse2.Multiply(ah, bl) + (mull >>> 32);
Vector128<ulong> tl = Sse2.Multiply(al, bh) + (t & Vector128.Create(0xFFFF_FFFFul));
return (tl << 32) | (mull & Vector128.Create(0xFFFF_FFFFul));
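Presumably the Vector256 path follows the same shape with Avx2.Multiply; the sketch below is an assumption mirroring the Vector128 snippet above, not code from this PR or the follow-up, and the class/method names are hypothetical:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class Vector256MultiplySketch
{
    // Sketch only: low 64 bits of each lane-wise 64x64 multiply via AVX2,
    // mirroring the Vector128/SSE2 translation above.
    public static Vector256<ulong> MultiplyLow(Vector256<ulong> a, Vector256<ulong> b)
    {
        Vector256<uint> al = a.AsUInt32();
        Vector256<uint> ah = (a >>> 32).AsUInt32();
        Vector256<uint> bl = b.AsUInt32();
        Vector256<uint> bh = (b >>> 32).AsUInt32();

        Vector256<ulong> mull = Avx2.Multiply(al, bl);
        Vector256<ulong> t = Avx2.Multiply(ah, bl) + (mull >>> 32);
        Vector256<ulong> tl = Avx2.Multiply(al, bh) + (t & Vector256.Create(0xFFFF_FFFFul));

        return (tl << 32) | (mull & Vector256.Create(0xFFFF_FFFFul));
    }
}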