Add support for optimized AVX2/SSE2 multiply for VectorXXX<ulong/long> #87113
Closed · +64 −16
Changes from all commits:

@@ -6,6 +6,7 @@
 using System.Globalization;
 using System.Runtime.CompilerServices;
 using System.Runtime.InteropServices;
+using System.Runtime.Intrinsics.X86;
 using System.Text;

 namespace System.Runtime.Intrinsics

@@ -267,10 +268,26 @@ public T this[int index]
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static Vector256<T> operator *(Vector256<T> left, Vector256<T> right)
 {
-    return Vector256.Create(
-        left._lower * right._lower,
-        left._upper * right._upper
-    );
+    if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
+    {
+        // xh * yl
+        Vector256<uint> t1 = Avx2.MultiplyLow(Avx2.ShiftRightLogical(left.AsUInt64(), 32).AsUInt32(), right.AsUInt32());
+        // xl * yh
+        Vector256<uint> t2 = Avx2.MultiplyLow(left.AsUInt32(), Avx2.ShiftRightLogical(right.AsUInt64(), 32).AsUInt32());
+        // (xh * yl + xl * yh) * 2^32
+        Vector256<ulong> t3 = Avx2.ShiftLeftLogical((t1.AsUInt64() + t2.AsUInt64()), 32);
+        // xl * yl
+        Vector256<uint> t4 = Avx2.MultiplyLow(left.AsUInt32(), right.AsUInt32());
+        // (xh * yl + xl * yh) * 2^32 + xl * yl
+        return Unsafe.BitCast<Vector256<ulong>, Vector256<T>>(t3 + t4.AsUInt64());
+    }
+    else
+    {
+        return Vector256.Create(
+            left._lower * right._lower,
+            left._upper * right._upper
+        );
+    }
 }

 /// <summary>Multiplies a vector by a scalar to compute their product.</summary>

(Inline review comment on the new `Avx2.IsSupported` branch: Nice!)
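For readers following the `t1`..`t4` comments above: per 64-bit lane, the goal is to build the low 64 bits of `x * y` from 32-bit halves, using the identity `(xh*2^32 + xl) * (yh*2^32 + yl) ≡ xl*yl + ((xh*yl + xl*yh) << 32) (mod 2^64)`. Below is a minimal scalar sketch of that identity (`MulLow64` is an illustrative name, not part of the PR). Note that the `xl * yl` term needs the full 64-bit product of the two low halves, which is what the widening `Avx2.Multiply` (`vpmuludq`) provides on AVX2, while the cross terms only need their low 32 bits because they are shifted left by 32.

```csharp
// Scalar model of the per-lane computation (illustrative, not code from the PR).
static ulong MulLow64(ulong x, ulong y)
{
    ulong xl = (uint)x, xh = x >> 32;
    ulong yl = (uint)y, yh = y >> 32;

    ulong cross = ((xh * yl) + (xl * yh)) << 32; // only the low 32 bits of the sum survive the shift
    ulong low   = xl * yl;                       // full 64-bit product of the low halves
    return low + cross;                          // == x * y (mod 2^64)
}
```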
@@ -282,10 +299,17 @@ public T this[int index]
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static Vector256<T> operator *(Vector256<T> left, T right)
 {
-    return Vector256.Create(
-        left._lower * right,
-        left._upper * right
-    );
+    if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
+    {
+        return left * Vector256.Create(right);
+    }
+    else
+    {
+        return Vector256.Create(
+            left._lower * right,
+            left._upper * right
+        );
+    }
 }

 /// <summary>Multiplies a vector by a scalar to compute their product.</summary>
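A quick usage sketch of the two overloads touched above; the values are illustrative and the results assume ordinary wrap-around (mod 2^64) multiplication:

```csharp
using System.Runtime.Intrinsics;

Vector256<ulong> a = Vector256.Create(1UL, 2UL, 3UL, 0x1_0000_0000UL);
Vector256<ulong> b = Vector256.Create(10UL, 20UL, 30UL, 2UL);

Vector256<ulong> elementwise = a * b;   // (10, 40, 90, 0x2_0000_0000)
Vector256<ulong> broadcast   = a * 5UL; // (5, 10, 15, 0x5_0000_0000); the scalar is splatted first
```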
This is likely going to have a negative impact on some compilers (such as Mono, particularly for the interpreter).

Given we already have a functionally correct fallback implementation for `int`/`uint` that does the same thing, it's very unclear why we aren't doing the same thing and just sharing the logic for the `long`/`ulong` path: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19843-L19878

If we were going to have the managed implementation and could show that it doesn't hurt the Mono scenario, then it'd notably also be better to use `(t3 + t4).As<ulong, T>()` rather than `BitCast`, to use `left.AsUInt64() >>> 32` over `ShiftRightLogical`, and to use `(t1 + t2) << 32` over `ShiftLeftLogical` (the `.AsUInt64()` are unnecessary here).
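For concreteness, here is one reading of those suggestions applied to the AVX2 branch (an interpretation, not code from the PR): the shift helpers become operators, the final `BitCast` becomes `As<ulong, T>()`, and, in this sketch, the `xl * yl` term uses the widening `Avx2.Multiply` (`vpmuludq`) so its full 64-bit product is kept. One `AsUInt64()` is still needed before the 64-bit left shift because `t1`/`t2` are 32-bit vectors.

```csharp
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
{
    // xh * yl (only the low 32 bits matter; they get shifted left by 32 below)
    Vector256<uint> t1 = Avx2.MultiplyLow((left.AsUInt64() >>> 32).AsUInt32(), right.AsUInt32());
    // xl * yh (same remark)
    Vector256<uint> t2 = Avx2.MultiplyLow(left.AsUInt32(), (right.AsUInt64() >>> 32).AsUInt32());
    // (xh * yl + xl * yh) * 2^32
    Vector256<ulong> t3 = (t1 + t2).AsUInt64() << 32;
    // xl * yl, full 64-bit product of the low 32-bit halves
    Vector256<ulong> t4 = Avx2.Multiply(left.AsUInt32(), right.AsUInt32());
    // (xh * yl + xl * yh) * 2^32 + xl * yl
    return (t3 + t4).As<ulong, T>();
}
```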
Why would the Mono interpreter execute `Vector__.op_Multiply` in the first place? It doesn't support SIMD, does it?

But even if this is a problem for Mono, it can be solved with just `#if !MONO`.
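If that route were taken, the guard might look roughly like the following (a sketch only: it assumes a `MONO` compilation constant is defined for the Mono build, and `MultiplyAvx2` is a hypothetical helper standing in for the AVX2 path from the diff):

```csharp
#if !MONO
    if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
    {
        return MultiplyAvx2(left, right); // hypothetical helper wrapping the AVX2 path
    }
#endif
    return Vector256.Create(
        left._lower * right._lower,
        left._upper * right._upper
    );
```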
Mono in general has been explicitly adding some SIMD support to ensure that code is not pessimized. This includes the interpreter: #86738

That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.

While it'd be great to someday be able to move a lot of this into managed code, I don't think we're at that point yet. We still have quite a few places where the inliner can just give up, particularly in large/complex methods. Having a known hot-path function that might not get inlined isn't great.

We also already have the relevant logic in the JIT to handle this; it just needs to be shared between the `int`/`uint` and `long`/`ulong` paths.
I don't think it makes any sense here: you either take this fast path, or waste the whole inliner's budget on the scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet, the `Scalar'1[ulong]:Multiply` calls are already not inlined due to the budget. So my point is that this PR has no cons compared to the current implementation; in fact, it eats less of the inliner's budget than the code in main does.

I hadn't noticed the existing logic for `GT_MUL` that looks similar; if it can be easily shared, then sure. If not, the current change as-is is better than what we have today.
I meant, in particular, the inliner vs. having this use the intrinsic path in the JIT. I agree that it's better than what we have now, just that it's less ideal than what it could/should be for something we know we're using in a hot path.