Add support for optimized AVX2/SSE2 multiply for VectorXXX<ulong/long> #87113

Closed · wants to merge 1 commit
@@ -7,6 +7,7 @@
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Text;

namespace System.Runtime.Intrinsics
@@ -269,10 +270,26 @@ public T this[int index]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<T> operator *(Vector128<T> left, Vector128<T> right)
{
return Vector128.Create(
left._lower * right._lower,
left._upper * right._upper
);
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Sse2.IsSupported)
{
// xh * yl
Vector128<ulong> t1 = Sse2.Multiply(Sse2.ShiftRightLogical(left.AsUInt64(), 32).AsUInt32(), right.AsUInt32());
// xl * yh
Vector128<ulong> t2 = Sse2.Multiply(left.AsUInt32(), Sse2.ShiftRightLogical(right.AsUInt64(), 32).AsUInt32());
// (xh * yl + xl * yh) * 2^32
Vector128<ulong> t3 = Sse2.ShiftLeftLogical((t1.AsUInt64() + t2.AsUInt64()), 32);
// xl * yl
Vector128<ulong> t4 = Sse2.Multiply(left.AsUInt32(), right.AsUInt32());
// (xh * yl + xl * yh) * 2^32 + xl * yl
return Unsafe.BitCast<Vector128<ulong>, Vector128<T>>(t3 + t4);
Member commented on lines +275 to +284:
This is likely going to have negative impact on some compilers (such as Mono, particularly for the interpreter).

Given we already have a functionally correct fallback implementation for int/uint that does the same thing, it's very unclear why we aren't doing the same thing and just sharing the logic for the long/ulong path: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19843-L19878

-- If we were going to have the managed implementation and could show that it doesn't hurt the Mono scenario; then it'd notably also be better to use (t3 + t4).As<ulong, T>() rather than BitCast, to use left.AsUInt64() >>> 32 over ShiftRightLogical, and to use (t1 + t2) << 32 over ShiftLeftLogical (the .AsUInt64() are unnecessary here).

EgorBo (Member) commented Jun 4, 2023:
> (such as Mono, particularly for the interpreter).

Why would the Mono interpreter execute Vector__.op_Multiply in the first place? It doesn't support SIMD, does it?
But even if this is a problem for Mono, it can be solved with just #if !MONO

Member replied:
Mono in general has been explicitly adding some SIMD support to ensure that code is not pessimized. This includes in the interpreter: #86738

> But even if this is a problem for mono - it can be solved with just #if !MONO

That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.

While it'd be great to someday be able to move a lot of this into managed, I don't think we're at that point yet. We still have quite a bit of places where the inliner can just give up, particularly in large/complex methods. Having a known hot path function have the potential to not inline isn't great.

We also already have the relevant logic in the JIT to handle this, it just needs to be shared between the int/uint and long/ulong paths.

EgorBo (Member) replied Jun 4, 2023:
> That complicates things, puts more burden on the JIT/inliner, and can pessimize codegen in some cases.

I don't think it makes any sense here - you either take this fast path, or waste the whole inliner's budget on whatever scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet, Scalar`1[ulong]:Multiply is already not inlined due to the budget. So my point is that this PR has no cons compared to the current impl. In fact, this PR eats less of the inliner's budget than the code in main.

> We also already have the relevant logic in the JIT to handle this, it just needs to be shared between the int/uint and long/ulong paths.

I didn't notice the existing logic for GT_MUL that looks similar; if it can be easily shared, then sure. If not, the current change as-is is better than what we have today.

Member replied:

> I don't think it makes any sense here - you either take this fast path, or waste the whole inliner's budget on whatever scalar logic we have now as a fallback (which is way bigger). Moreover, as you can see from the snippet

I meant, in particular, the inliner vs having this use the intrinsic path in the JIT. I agree that it's better than what we have now, just that it's less ideal than what it could/should be for something we know we're using in a hot path.

}
else
{
return Vector128.Create(
left._lower * right._lower,
left._upper * right._upper
);
}
}
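Not part of the PR: the SSE2 sequence above relies on the identity x·y mod 2^64 = ((xh·yl + xl·yh) << 32) + xl·yl, where xl/xh are the low/high 32-bit halves. A small Python model (hypothetical helper name, not .NET code) makes that easy to sanity-check:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul64_via_32bit(x: int, y: int) -> int:
    """Model one lane of the SSE2 path: 64x64 wrapping multiply from 32x32->64 pieces."""
    xl, xh = x & MASK32, x >> 32
    yl, yh = y & MASK32, y >> 32
    t1 = xh * yl                        # pmuludq(x >> 32, y)
    t2 = xl * yh                        # pmuludq(x, y >> 32)
    t3 = ((t1 + t2) << 32) & MASK64     # cross terms shifted into the high half
    t4 = xl * yl                        # pmuludq(x, y): full 64-bit low product
    return (t3 + t4) & MASK64

x, y = 0xDEADBEEFCAFEBABE, 0x123456789ABCDEF0
assert mul64_via_32bit(x, y) == (x * y) & MASK64
```

The xh·yh term vanishes because it is multiplied by 2^64, which wraps to zero in 64-bit arithmetic; that is why only three 32-bit multiplies per lane are needed.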

/// <summary>Multiplies a vector by a scalar to compute their product.</summary>
@@ -284,10 +301,17 @@ public T this[int index]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<T> operator *(Vector128<T> left, T right)
{
return Vector128.Create(
left._lower * right,
left._upper * right
);
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Sse2.IsSupported)
{
return left * Vector128.Create(right);
}
else
{
return Vector128.Create(
left._lower * right,
left._upper * right
);
}
}
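The vector-by-scalar overload above just broadcasts the scalar into every lane and reuses the vector-by-vector path. A minimal Python sketch (hypothetical function name) of that splat-then-multiply shape:

```python
MASK64 = (1 << 64) - 1

def splat_multiply(lanes, scalar):
    """Model `left * Vector128.Create(right)`: broadcast, then lane-wise wrapping multiply."""
    splat = [scalar] * len(lanes)       # Vector128.Create(right)
    return [(a * b) & MASK64 for a, b in zip(lanes, splat)]

print(splat_multiply([3, 1 << 63], 2))  # [6, 0] -- the high-bit lane wraps mod 2^64
```

Deferring to the vector-vector operator keeps the optimized long/ulong path in one place instead of duplicating the intrinsic sequence.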

/// <summary>Multiplies a vector by a scalar to compute their product.</summary>
@@ -6,6 +6,7 @@
using System.Globalization;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Text;

namespace System.Runtime.Intrinsics
@@ -267,10 +268,26 @@ public T this[int index]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<T> operator *(Vector256<T> left, Vector256<T> right)
{
return Vector256.Create(
left._lower * right._lower,
left._upper * right._upper
);
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
Member commented:

Nice!
Perhaps worth leaving a comment that this path is not taken when AVX-512 is present?
{
// xh * yl
Vector256<ulong> t1 = Avx2.Multiply(Avx2.ShiftRightLogical(left.AsUInt64(), 32).AsUInt32(), right.AsUInt32());
// xl * yh
Vector256<ulong> t2 = Avx2.Multiply(left.AsUInt32(), Avx2.ShiftRightLogical(right.AsUInt64(), 32).AsUInt32());
// (xh * yl + xl * yh) * 2^32
Vector256<ulong> t3 = Avx2.ShiftLeftLogical(t1 + t2, 32);
// xl * yl (full 64-bit product via vpmuludq; a 32-bit MultiplyLow would drop the high half)
Vector256<ulong> t4 = Avx2.Multiply(left.AsUInt32(), right.AsUInt32());
// (xh * yl + xl * yh) * 2^32 + xl * yl
return Unsafe.BitCast<Vector256<ulong>, Vector256<T>>(t3 + t4);
}
else
{
return Vector256.Create(
left._lower * right._lower,
left._upper * right._upper
);
}
}
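One subtlety in this decomposition is that the xl·yl term must be the full 64-bit product (what pmuludq/vpmuludq produces); only the cross terms may be truncated to 32 bits before shifting. A Python check (hypothetical helper, not .NET code) showing the identity break if the low product is truncated:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul64(x, y, full_low_product):
    xl, xh = x & MASK32, x >> 32
    yl, yh = y & MASK32, y >> 32
    cross = ((xh * yl + xl * yh) << 32) & MASK64
    # Widening 32x32->64 multiply vs. a mullo-style low-32-bit multiply.
    low = xl * yl if full_low_product else (xl * yl) & MASK32
    return (cross + low) & MASK64

x = y = 0xFFFFFFFF                      # xl * yl overflows 32 bits
assert mul64(x, y, True) == (x * y) & MASK64
assert mul64(x, y, False) != (x * y) & MASK64
```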

/// <summary>Multiplies a vector by a scalar to compute their product.</summary>
@@ -282,10 +299,17 @@ public T this[int index]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<T> operator *(Vector256<T> left, T right)
{
return Vector256.Create(
left._lower * right,
left._upper * right
);
if ((typeof(T) == typeof(ulong) || typeof(T) == typeof(long)) && Avx2.IsSupported)
{
return left * Vector256.Create(right);
}
else
{
return Vector256.Create(
left._lower * right,
left._upper * right
);
}
}

/// <summary>Multiplies a vector by a scalar to compute their product.</summary>