[API Proposal]: Add AVX10v2 API to add Avx10.2 support #109083

Closed
DeepakRajendrakumaran opened this issue Oct 21, 2024 · 30 comments
Labels: api-approved (API was approved in API review, it can be implemented), area-System.Runtime.Intrinsics, avx10 (Related to the AVX10 architecture)

@DeepakRajendrakumaran
Contributor

DeepakRajendrakumaran commented Oct 21, 2024

Background and motivation

Intel has announced the features available in the next version of Avx10 (10.2). In order to support this, .NET needs to expand the Avx10 library to include the new APIs.

See the Avx10.2 spec: sections 7 to 14 cover the newly added instructions. A couple of interesting features here are MinMax and saturating conversions.

As part of the original API Proposal, the proposed design was for each future Avx10 version to have its own class which inherits from Avx10v1.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    /// <summary>Provides access to X86 AVX10.2 hardware instructions via intrinsics</summary>
    [Intrinsic]
    [CLSCompliant(false)]
    public abstract class Avx10v2 : Avx10v1
    {
        internal Avx10v2() { }

        public static new bool IsSupported { get => IsSupported; }

        // VMINMAXPD xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8
        public static Vector128<double> MinMax(Vector128<double> left, Vector128<double> right, [ConstantExpected] byte control) => MinMax(left, right, control);

        // VMINMAXPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {sae}, imm8
        public static Vector256<double> MinMax(Vector256<double> left, Vector256<double> right, [ConstantExpected] byte control) => MinMax(left, right, control);

        // VMINMAXPS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8
        public static Vector128<float> MinMax(Vector128<float> left, Vector128<float> right, [ConstantExpected] byte control) => MinMax(left, right, control);

        // VMINMAXPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {sae}, imm8
        public static Vector256<float> MinMax(Vector256<float> left, Vector256<float> right, [ConstantExpected] byte control) => MinMax(left, right, control);

        // VMINMAXSD xmm1{k1}{z}, xmm2, xmm3/m64 {sae}, imm8
        public static double MinMaxScalar(Vector128<double> left, Vector128<double> right, [ConstantExpected] byte control) => MinMaxScalar(left, right, control);

        // VMINMAXSS xmm1{k1}{z}, xmm2, xmm3/m32 {sae}, imm8
        public static float MinMaxScalar(Vector128<float> left, Vector128<float> right, [ConstantExpected] byte control) => MinMaxScalar(left, right, control);

        // VADDPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Add(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Add(left, right, mode);

        // VADDPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Add(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Add(left, right, mode);
                
        // VDIVPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Divide(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Divide(left, right, mode);
        
        // VDIVPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Divide(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Divide(left, right, mode);
        
        // VCVTPS2IBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<int> ConvertToByteWithSaturationAndWidenToInt32(Vector128<float> value) => ConvertToByteWithSaturationAndWidenToInt32(value);
                
        // VCVTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<int> ConvertToByteWithSaturationAndWidenToInt32(Vector256<float> value) => ConvertToByteWithSaturationAndWidenToInt32(value);
        
        // VCVTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<int> ConvertToByteWithSaturationAndWidenToInt32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToInt32(value, mode);
        
        // VCVTPS2IUBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector128<float> value) => ConvertToByteWithSaturationAndWidenToUInt32(value);
                
        // VCVTPS2IUBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector256<float> value) => ConvertToByteWithSaturationAndWidenToUInt32(value);
        
        // VCVTPS2IUBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToUInt32(value, mode);

        // VCVTTPS2IBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<int> ConvertToByteWithTruncationSaturationAndWidenToInt32(Vector128<float> value) => ConvertToByteWithTruncationSaturationAndWidenToInt32(value);
                
        // VCVTTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {sae}
        public static Vector256<int> ConvertToByteWithTruncationSaturationAndWidenToInt32(Vector256<float> value) => ConvertToByteWithTruncationSaturationAndWidenToInt32(value);
                
        // VCVTTPS2IUBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<uint> ConvertToByteWithTruncationSaturationAndWidenToUInt32(Vector128<float> value) => ConvertToByteWithTruncationSaturationAndWidenToUInt32(value);
                
        // VCVTTPS2IUBS ymm1{k1}{z}, ymm2/m256/m32bcst {sae}
        public static Vector256<uint> ConvertToByteWithTruncationSaturationAndWidenToUInt32(Vector256<float> value) => ConvertToByteWithTruncationSaturationAndWidenToUInt32(value);
        
        // VMOVD xmm1, xmm2/m32
        public static Vector128<uint> ConvertToVector128UInt32(Vector128<uint> value) => ConvertToVector128UInt32(value);
        
        // VMOVW xmm1, xmm2/m16
        public static Vector128<ushort> ConvertToVector128UInt16(Vector128<ushort> value) => ConvertToVector128UInt16(value);
        
        // The instructions below are those where
        // embedded rounding support has been added
        // to the existing API

        // VCVTDQ2PS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<float> ConvertToVector256Single(Vector256<int> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Single(value, mode);
        
        // VCVTPD2DQ xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<int> ConvertToVector128Int32(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Int32(value, mode);

        // VCVTPD2PS xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<float> ConvertToVector128Single(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Single(value, mode);

        // VCVTPD2QQ ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<long> ConvertToVector256Int64(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Int64(value, mode);

        // VCVTPD2UDQ xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<uint> ConvertToVector128UInt32(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128UInt32(value, mode);

        // VCVTPD2UQQ ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<ulong> ConvertToVector256UInt64(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256UInt64(value, mode);

        // VCVTPS2DQ ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<int> ConvertToVector256Int32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Int32(value, mode);

        // VCVTPS2QQ ymm1{k1}{z}, xmm2/m128/m32bcst {er}
        public static Vector256<long> ConvertToVector256Int64(Vector128<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Int64(value, mode);

        // VCVTPS2UDQ ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<uint> ConvertToVector256UInt32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256UInt32(value, mode);

        // VCVTPS2UQQ ymm1{k1}{z}, xmm2/m128/m32bcst {er}
        public static Vector256<ulong> ConvertToVector256UInt64(Vector128<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256UInt64(value, mode);

        // VCVTQQ2PS xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<float> ConvertToVector128Single(Vector256<long> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Single(value, mode);

        // VCVTQQ2PD ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<double> ConvertToVector256Double(Vector256<long> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Double(value, mode);

        // VCVTUDQ2PS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<float> ConvertToVector256Single(Vector256<uint> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Single(value, mode);

        // VCVTUQQ2PS xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<float> ConvertToVector128Single(Vector256<ulong> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Single(value, mode);

        // VCVTUQQ2PD ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<double> ConvertToVector256Double(Vector256<ulong> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Double(value, mode);
        
        // VMULPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Multiply(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Multiply(left, right, mode);
        
        // VMULPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Multiply(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Multiply(left, right, mode);
        
        // VSCALEFPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Scale(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Scale(left, right, mode);
        
        // VSCALEFPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Scale(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Scale(left, right, mode);
        
        // VSQRTPD ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<double> Sqrt(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Sqrt(value, mode);
        
        // VSQRTPS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<float> Sqrt(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Sqrt(value, mode);
        
        // VSUBPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Subtract(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Subtract(left, right, mode);
        
        // VSUBPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Subtract(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Subtract(left, right, mode);
        
        [Intrinsic]
        public new abstract class X64 : Avx10v1.X64
        {
            internal X64() { }

            public static new bool IsSupported { get => IsSupported; }
        }

        [Intrinsic]
        public new abstract class V512 : Avx10v1.V512
        {
            internal V512() { }

            public static new bool IsSupported { get => IsSupported; }
    
            // VMINMAXPD zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst {sae}, imm8
            public static Vector512<double> MinMax(Vector512<double> left, Vector512<double> right, [ConstantExpected] byte control) => MinMax(left, right, control);

            // VMINMAXPS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst {sae}, imm8
            public static Vector512<float> MinMax(Vector512<float> left, Vector512<float> right, [ConstantExpected] byte control) => MinMax(left, right, control);
            
            // VCVTPS2IBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<int> ConvertToByteWithSaturationAndWidenToInt32(Vector512<float> value) => ConvertToByteWithSaturationAndWidenToInt32(value);
            
            // VCVTPS2IBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<int> ConvertToByteWithSaturationAndWidenToInt32(Vector512<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToInt32(value, mode);
            
            // VCVTPS2IUBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector512<float> value) => ConvertToByteWithSaturationAndWidenToUInt32(value);
            
            // VCVTPS2IUBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector512<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToUInt32(value, mode);

            // VCVTTPS2IBS zmm1{k1}{z}, zmm2/m512/m32bcst {sae}
            public static Vector512<int> ConvertToByteWithTruncationSaturationAndWidenToInt32(Vector512<float> value) => ConvertToByteWithTruncationSaturationAndWidenToInt32(value);
                        
            // VCVTTPS2IUBS zmm1{k1}{z}, zmm2/m512/m32bcst {sae}
            public static Vector512<uint> ConvertToByteWithTruncationSaturationAndWidenToUInt32(Vector512<float> value) => ConvertToByteWithTruncationSaturationAndWidenToUInt32(value);
            
            // This is a 512-bit extension of the previously existing 128/256-bit intrinsic
            // VMPSADBW zmm1{k1}{z}, zmm2, zmm3/m512, imm8
            public static Vector512<ushort> MultipleSumAbsoluteDifferences(Vector512<byte> left, Vector512<byte> right, [ConstantExpected] byte mask) => MultipleSumAbsoluteDifferences(left, right, mask);

            [Intrinsic]
            public new abstract class X64 : Avx10v1.V512.X64
            {
                internal X64() { }

                public static new bool IsSupported { get => IsSupported; }
            }
        }
    }
}

API Usage

Vector128<float> v1 = Vector128.Create((float)someParam1);
Vector128<float> v2 = Vector128.Create((float)someParam2);
if (Avx10v2.IsSupported) {
  Vector128<float> v3 = Avx10v2.MinMax(v1, v2, 0b00000000);
  // etc
}

Alternative Designs

No response

Risks

No response

@DeepakRajendrakumaran DeepakRajendrakumaran added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 21, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 21, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@DeepakRajendrakumaran
Contributor Author

The following instructions, which are part of Avx10.2, are not covered above. They fall mostly into two groups: 16-bit floating-point and FMA instructions.

Instructions skipped:
Entire section 7 in AVX10.2 manual

Parts of Section 8 in AVX10.2 manual
- VCOMXSH
- VUCOMXSH

Entire Section 9 in AVX10.2 manual

Parts of Section 10 in AVX10.2 manual
- VDPPHPS

Parts of Section 11 in AVX10.2 manual
- VMINMAXNEPBF16

Parts of Section 12 in AVX10.2 manual
- VADDPH
- VCMPPH
- VCVTDQ2PH
- VCVTPD2PH
- VCVTPH2DQ
- VCVTPH2PD
- VCVTPH2PS
- VCVTPH2PSX
- VCVTPH2QQ
- VCVTPH2UDQ
- VCVTPH2UQQ
- VCVTPH2UW
- VCVTPH2W
- VCVTPS2PH
- VCVTPS2PHX
- VCVTQQ2PH
- VCVTTPH2DQ
- VCVTTPH2QQ
- VCVTTPH2UDQ
- VCVTTPH2UQQ
- VCVTTPH2UW
- VCVTTPH2W
- VCVTUDQ2PH
- VCVTUQQ2PH
- VCVTUW2PH
- VCVTW2PH
- VDIVPH
- VFCMADDCPH
- VFCMULCPH
- VFMADD132PD - Prior instructions don't exist
- VFMADD132PH
- VFMADD132PS - Prior instructions don't exist
- VFMADD213PD - Prior instructions don't exist
- VFMADD213PH
- VFMADD213PS - Prior instructions don't exist
- VFMADD231PD - Prior instructions don't exist
- VFMADD231PH
- VFMADD231PS - Prior instructions don't exist
- VFMADDCPH
- VFMADDSUB132PD - Prior instructions don't exist
- VFMADDSUB132PH
- VFMADDSUB132PS - Prior instructions don't exist
- VFMADDSUB213PD - Prior instructions don't exist
- VFMADDSUB213PH
- VFMADDSUB213PS - Prior instructions don't exist
- VFMADDSUB231PD - Prior instructions don't exist
- VFMADDSUB231PH
- VFMADDSUB231PS - Prior instructions don't exist
- VFMSUB132PD - Prior instructions don't exist
- VFMSUB132PH
- VFMSUB132PS - Prior instructions don't exist
- VFMSUB213PD - Prior instructions don't exist
- VFMSUB213PH
- VFMSUB213PS - Prior instructions don't exist
- VFMSUB231PD - Prior instructions don't exist
- VFMSUB231PH
- VFMSUB231PS - Prior instructions don't exist
- VFMSUBADD132PD - Prior instructions don't exist
- VFMSUBADD132PH
- VFMSUBADD132PS - Prior instructions don't exist
- VFMSUBADD213PD - Prior instructions don't exist
- VFMSUBADD213PH
- VFMSUBADD213PS - Prior instructions don't exist
- VFMSUBADD231PD - Prior instructions don't exist
- VFMSUBADD231PH
- VFMSUBADD231PS - Prior instructions don't exist
- VFMULCPH
- VFNMADD132PD - Prior instructions don't exist
- VFNMADD132PH
- VFNMADD132PS - Prior instructions don't exist
- VFNMADD213PD - Prior instructions don't exist
- VFNMADD213PH
- VFNMADD213PS - Prior instructions don't exist
- VFNMADD231PD - Prior instructions don't exist
- VFNMADD231PH
- VFNMADD231PS - Prior instructions don't exist
- VFNMSUB132PD - Prior instructions don't exist
- VFNMSUB132PH
- VFNMSUB132PS - Prior instructions don't exist
- VFNMSUB213PD - Prior instructions don't exist
- VFNMSUB213PH
- VFNMSUB213PS - Prior instructions don't exist
- VFNMSUB231PD - Prior instructions don't exist
- VFNMSUB231PH
- VFNMSUB231PS - Prior instructions don't exist
- VGETEXPPH
- VGETMANTPH
- VMAXPH
- VMINPH
- VMULPH
- VREDUCEPH
- VRNDSCALEPH
- VSQRTPH
- VSUBPH

Parts of Section 13 in AVX10.2 manual
- VCVT[,T]NEBF162I[,U]BS
- VCVT[,T]PH2I[,U]BS

@BruceForstall BruceForstall added the avx10 Related to the AVX10 architecture label Oct 23, 2024
@tannergooding
Member

Haven't finished going through the list, but as initial feedback:

  • MinMaxVector should be named just MinMax
  • MinMax should instead be named MinMaxScalar
  • The various Compare*Enhanced APIs are unnecessary; we can implicitly use these instructions for the existing Compare* APIs, since it's simply setting different flags, allowing more optimal codegen for subsequent branches or conditional moves
  • It'd be helpful to separate out (such as via a separate code block or proposal) the "new instruction forms" where they aren't new concepts, but rather just new overloads of existing APIs (typically taking V256<T> and FloatRoundingMode)
  • For APIs like ConvertWithSaturationPackedFloatToSignedByteInteger, we want to use the .NET type names, so Single is preferred over Float, SByte over SignedByte, etc
    • for signed integers we have SByte, Int16, Int32, Int64
    • for unsigned integers we have Byte, UInt16, UInt32, UInt64
    • for floating-point we have Half, Single, Double
  • For APIs like ConvertWithSaturationPackedFloatToSignedByteInteger, we probably want to more closely match the existing names like ConvertToVector128Int32WithTruncation and so would call it ConvertToVector128ByteWithSaturation

@DeepakRajendrakumaran
Contributor Author

Thank you. I will leave you a comment when I have made all required changes.

@khushal1996
Contributor

@tannergooding Thanks for the review. About the nomenclature for the Convert APIs: for something like // VCVTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}, should we use ConvertToVector128ByteWithSaturation or ConvertToVector128SByteInt16WithSaturation? I ask because the instruction description reads:

These instructions convert four, eight or sixteen packed single-precision floating-point values in the
source operand to four, eight or sixteen signed or unsigned byte integers in the destination operand.
The downconverted 8-bit result is written inplace at the lower 8-bit of the corresponding 32-bit element.
The upper 3 bytes are zeroed. VCVTPS2IBS converts single-precision floating point elements into signed byte integer elements.

Let me know what you think.
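
To make the naming question concrete, here is a minimal software model of the behavior described in the quoted text for the signed form (VCVTPS2IBS). It is only a sketch, not the hardware intrinsic or any proposed API: the class and method names (SaturatingConvertModel, ConvertLane, ConvertVector) are hypothetical, round-to-nearest-even is assumed as the rounding mode, and NaN mapping to 0 is an assumption rather than something stated in the quoted description.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical software model; names are illustrative only.
public static class SaturatingConvertModel
{
    // Models a single 32-bit lane: float -> signed byte with saturation,
    // result stored in the low 8 bits, upper 3 bytes zeroed.
    public static int ConvertLane(float value)
    {
        // Assumption: NaN converts to 0.
        if (float.IsNaN(value))
        {
            return 0;
        }

        // Assumption: round to nearest, ties to even.
        float rounded = MathF.Round(value, MidpointRounding.ToEven);

        // Saturate to the signed byte range [-128, 127].
        float clamped = Math.Clamp(rounded, (float)sbyte.MinValue, (float)sbyte.MaxValue);
        sbyte narrow = (sbyte)clamped;

        // Keep only the low 8 bits; the implicit byte -> int conversion
        // leaves the upper 3 bytes of the lane as zero.
        return unchecked((byte)narrow);
    }

    // Applies the per-lane model to all four lanes of a Vector128<float>.
    public static Vector128<int> ConvertVector(Vector128<float> value)
    {
        return Vector128.Create(
            ConvertLane(value.GetElement(0)),
            ConvertLane(value.GetElement(1)),
            ConvertLane(value.GetElement(2)),
            ConvertLane(value.GetElement(3)));
    }
}
```

Under this model an input of -1.0f yields the lane value 0x000000FF rather than -1, which is why the element type of the result alone does not tell the user that a byte conversion happened.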

@tannergooding
Member

I'll need to think about it more.

It is important we document the behavior which is conversion to byte so that users understand what the API is doing.
It is then important we document the return type of Vector128<int> so that it doesn't cause issues with overload resolution, since you cannot overload by return type.

It's functionally doing a ConvertToVector128ByteWithSaturationAndWidenToVector128Int32, which is a very verbose name.

@khushal1996
Contributor

True. I was thinking along similar lines (ConvertToVector128ByteWithSaturationAndWidenToVector128Int32) but wanted to keep it a little shorter while still describing that it widens to Int32. How about ConvertToVector128SByteWithSaturationWidenToInt32? At least we can remove the Vector128 after Widen.

@DeepakRajendrakumaran
Contributor Author

DeepakRajendrakumaran commented Oct 24, 2024

I have updated the names and made the other changes. For the Widen ones, let me know how you want us to update those. The ones this might apply to are Accumulated*DotProduct* and the convert intrinsics where widening happens.

@DeepakRajendrakumaran
Contributor Author

Hi Tanner, have you decided how you want the 'widen' APIs to be named?

@tannergooding
Member

I think we should default to the verbose name, which is the most consistent with our other APIs and the least problematic.

We'll likely discuss some of the alternatives in API review and it wouldn't hurt to have them listed.

Notably we have Vector128<int> ConvertToInt32(Vector128<float> value) in SSE2 (and similar for Int64/UInt32/UInt64 in other ISAs), so something like ConvertToByteWithSaturationAndWidenToInt32 might be a feasible shorter name that won't conflict, but I don't think we could get much shorter otherwise.

@DeepakRajendrakumaran
Contributor Author

I've updated the names for these. Do you think it makes sense to have 'Widen' in the name for the accumulated dot product ones as well?

@jeffhandley jeffhandley added this to the 10.0.0 milestone Nov 3, 2024
@jeffhandley jeffhandley removed the untriaged New issue has not been triaged by the area owner label Nov 3, 2024
@DeepakRajendrakumaran
Contributor Author

@tannergooding What are the next steps for this?

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 18, 2024
@tannergooding
Member

tannergooding commented Nov 19, 2024

I've filtered out the VPDPB[SU,UU,SS]D[,S] instructions from the initial review as the names aren't correct

namespace System.Runtime.Intrinsics.X86
{
    /// <summary>Provides access to X86 AVX10.2 hardware instructions via intrinsics</summary>
    [Intrinsic]
    [CLSCompliant(false)]
    public abstract class Avx10v2 : Avx10v1
    {
        // VPDPBSSD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedByteDotProduct(Vector128<sbyte> left, Vector128<sbyte> right) => AccumulatedByteDotProduct(left, right, acc);

        // VPDPBSUD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedByteDotProduct(Vector128<sbyte> left, Vector128<byte> right) => AccumulatedByteDotProduct(left, right, acc);

        // VPDPBUUD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedByteDotProduct(Vector128<byte> left, Vector128<byte> right) => AccumulatedByteDotProduct(left, right, acc);

        // VPDPBSSD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedByteDotProduct(Vector256<sbyte> left, Vector256<sbyte> right) => AccumulatedByteDotProduct(left, right, acc);

        // VPDPBSUD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedByteDotProduct(Vector256<sbyte> left, Vector256<byte> right) => AccumulatedByteDotProduct(left, right, acc);

        // VPDPBUUD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedByteDotProduct(Vector256<byte> left, Vector256<byte> right) => AccumulatedByteDotProduct(left, right, acc);

        // VPDPBSSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedByteDotProductWithSaturation(Vector128<sbyte> left, Vector128<sbyte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

        // VPDPBSUDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedByteDotProductWithSaturation(Vector128<sbyte> left, Vector128<byte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

        // VPDPBUUDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedByteDotProductWithSaturation(Vector128<byte> left, Vector128<byte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

        // VPDPBSSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedByteDotProductWithSaturation(Vector256<sbyte> left, Vector256<sbyte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

        // VPDPBSUDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedByteDotProductWithSaturation(Vector256<sbyte> left, Vector256<byte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

        // VPDPBUUDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedByteDotProductWithSaturation(Vector256<byte> left, Vector256<byte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

        // VPDPWSUD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedInt16DotProduct(Vector128<short> left, Vector128<ushort> right) => AccumulatedInt16DotProduct(left, right, acc);

        // VPDPWUSD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedInt16DotProduct(Vector128<ushort> left, Vector128<short> right) => AccumulatedInt16DotProduct(left, right, acc);

        // VPDPWUUD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedInt16DotProduct(Vector128<ushort> left, Vector128<ushort> right) => AccumulatedInt16DotProduct(left, right, acc);

        // VPDPWSUD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedInt16DotProduct(Vector256<short> left, Vector256<ushort> right) => AccumulatedInt16DotProduct(left, right, acc);

        // VPDPWUSD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedInt16DotProduct(Vector256<ushort> left, Vector256<short> right) => AccumulatedInt16DotProduct(left, right, acc);

        // VPDPWUUD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedInt16DotProduct(Vector256<ushort> left, Vector256<ushort> right) => AccumulatedInt16DotProduct(left, right, acc);

        // VPDPWSUDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedInt16DotProductWithSaturation(Vector128<short> left, Vector128<ushort> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

        // VPDPWUSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedInt16DotProductWithSaturation(Vector128<ushort> left, Vector128<short> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

        // VPDPWUUDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
        public static Vector128<int> AccumulatedInt16DotProductWithSaturation(Vector128<ushort> left, Vector128<ushort> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

        // VPDPWSUDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedInt16DotProductWithSaturation(Vector256<short> left, Vector256<ushort> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

        // VPDPWUSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedInt16DotProductWithSaturation(Vector256<ushort> left, Vector256<short> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

        // VPDPWUUDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
        public static Vector256<int> AccumulatedInt16DotProductWithSaturation(Vector256<ushort> left, Vector256<ushort> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

        [Intrinsic]
        public abstract class V512 : Avx10v1.V512
        {   
            // VPDPWSUD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedInt16DotProduct(Vector512<short> left, Vector512<ushort> right) => AccumulatedInt16DotProduct(left, right, acc);

            // VPDPWUSD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedInt16DotProduct(Vector512<ushort> left, Vector512<short> right) => AccumulatedInt16DotProduct(left, right, acc);

            // VPDPWUUD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedInt16DotProduct(Vector512<ushort> left, Vector512<ushort> right) => AccumulatedInt16DotProduct(left, right, acc);

            // VPDPWSUDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedInt16DotProductWithSaturation(Vector512<short> left, Vector512<ushort> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

            // VPDPWUSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedInt16DotProductWithSaturation(Vector512<ushort> left, Vector512<short> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

            // VPDPWUUDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedInt16DotProductWithSaturation(Vector512<ushort> left, Vector512<ushort> right) => AccumulatedInt16DotProductWithSaturation(left, right, acc);

            // VPDPBSSD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedByteDotProduct(Vector512<sbyte> left, Vector512<sbyte> right) => AccumulatedByteDotProduct(left, right, acc);

            // VPDPBSUD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedByteDotProduct(Vector512<sbyte> left, Vector512<byte> right) => AccumulatedByteDotProduct(left, right, acc);

            // VPDPBUUD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedByteDotProduct(Vector512<byte> left, Vector512<byte> right) => AccumulatedByteDotProduct(left, right, acc);

            // VPDPBSSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedByteDotProductWithSaturation(Vector512<sbyte> left, Vector512<sbyte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

            // VPDPBSUDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedByteDotProductWithSaturation(Vector512<sbyte> left, Vector512<byte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);

            // VPDPBUUDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
            public static Vector512<int> AccumulatedByteDotProductWithSaturation(Vector512<byte> left, Vector512<byte> right) => AccumulatedByteDotProductWithSaturation(left, right, acc);
        }
    }
}

@bartonjs
Member

bartonjs commented Nov 19, 2024

Video

  • ConvertToByteWithTruncationSaturationAndWidenToUInt32 => ConvertToByteWithTruncatedSaturationAndWidenToUInt32
    • Truncation => Truncated
  • Vector128<uint> ConvertToVector128UInt32(Vector128<uint> value) and similar, change to "ConvertScalarTo..."
namespace System.Runtime.Intrinsics.X86
{
    /// <summary>Provides access to X86 AVX10.2 hardware instructions via intrinsics</summary>
    [Intrinsic]
    [CLSCompliant(false)]
    public abstract class Avx10v2 : Avx10v1
    {
        internal Avx10v2() { }

        public static new bool IsSupported { get => IsSupported; }

        // VMINMAXPD xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8
        public static Vector128<double> MinMax(Vector128<double> left, Vector128<double> right, [ConstantExpected] byte control) => MinMax(left, right, control);
        
        // VMINMAXPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {sae}, imm8
        public static Vector256<double> MinMax(Vector256<double> left, Vector256<double> right, [ConstantExpected] byte control) => MinMax(left, right, control);
        
        // VMINMAXPS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8
        public static Vector128<float> MinMax(Vector128<float> left, Vector128<float> right, [ConstantExpected] byte control) => MinMax(left, right, control);
        
        // VMINMAXPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {sae}, imm8
        public static Vector256<float> MinMax(Vector256<float> left, Vector256<float> right, [ConstantExpected] byte control) => MinMax(left, right, control);
        
        // VMINMAXSD xmm1{k1}{z}, xmm2, xmm3/m64 {sae}, imm8
        public static Vector128<double> MinMaxScalar(Vector128<double> left, Vector128<double> right, [ConstantExpected] byte control) => MinMaxScalar(left, right, control);

        // VMINMAXSS xmm1{k1}{z}, xmm2, xmm3/m32 {sae}, imm8
        public static Vector128<float> MinMaxScalar(Vector128<float> left, Vector128<float> right, [ConstantExpected] byte control) => MinMaxScalar(left, right, control);

        // VADDPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Add(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Add(left, right, mode);

        // VADDPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Add(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Add(left, right, mode);
                
        // VDIVPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Divide(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Divide(left, right, mode);
        
        // VDIVPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Divide(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Divide(left, right, mode);
        
        // VCVTPS2IBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<int> ConvertToByteWithSaturationAndWidenToInt32(Vector128<float> value) => ConvertToByteWithSaturationAndWidenToInt32(value);
                
        // VCVTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<int> ConvertToByteWithSaturationAndWidenToInt32(Vector256<float> value) => ConvertToByteWithSaturationAndWidenToInt32(value);
        
        // VCVTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<int> ConvertToByteWithSaturationAndWidenToInt32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToInt32(value, mode);
        
        // VCVTPS2IUBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector128<float> value) => ConvertToByteWithSaturationAndWidenToUInt32(value);
                
        // VCVTPS2IUBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector256<float> value) => ConvertToByteWithSaturationAndWidenToUInt32(value);
        
        // VCVTPS2IUBS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToUInt32(value, mode);

        // VCVTTPS2IBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<int> ConvertToByteWithTruncatedSaturationAndWidenToInt32(Vector128<float> value) => ConvertToByteWithTruncatedSaturationAndWidenToInt32(value);
                
        // VCVTTPS2IBS ymm1{k1}{z}, ymm2/m256/m32bcst {sae}
        public static Vector256<int> ConvertToByteWithTruncatedSaturationAndWidenToInt32(Vector256<float> value) => ConvertToByteWithTruncatedSaturationAndWidenToInt32(value);
                
        // VCVTTPS2IUBS xmm1{k1}{z}, xmm2/m128/m32bcst
        public static Vector128<uint> ConvertToByteWithTruncatedSaturationAndWidenToUInt32(Vector128<float> value) => ConvertToByteWithTruncatedSaturationAndWidenToUInt32(value);
                
        // VCVTTPS2IUBS ymm1{k1}{z}, ymm2/m256/m32bcst {sae}
        public static Vector256<uint> ConvertToByteWithTruncatedSaturationAndWidenToUInt32(Vector256<float> value) => ConvertToByteWithTruncatedSaturationAndWidenToUInt32(value);
        
        // VMOVD xmm1, xmm2/m32
        public static Vector128<uint> ConvertScalarToVector128UInt32(Vector128<uint> value) => ConvertScalarToVector128UInt32(value);
        
        // VMOVW xmm1, xmm2/m16
        public static Vector128<ushort> ConvertScalarToVector128UInt16(Vector128<ushort> value) => ConvertScalarToVector128UInt16(value);
        
        // The instructions below are existing APIs
        // to which embedded rounding support
        // has been added

        // VCVTDQ2PS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<float> ConvertToVector256Single(Vector256<int> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Single(value, mode);
        
        // VCVTPD2DQ xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<int> ConvertToVector128Int32(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Int32(value, mode);

        // VCVTPD2PS xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<float> ConvertToVector128Single(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Single(value, mode);

        // VCVTPD2QQ ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<long> ConvertToVector256Int64(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Int64(value, mode);

        // VCVTPD2UDQ xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<uint> ConvertToVector128UInt32(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128UInt32(value, mode);

        // VCVTPD2UQQ ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<ulong> ConvertToVector256UInt64(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256UInt64(value, mode);

        // VCVTPS2DQ ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<int> ConvertToVector256Int32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Int32(value, mode);

        // VCVTPS2QQ ymm1{k1}{z}, xmm2/m128/m32bcst {er}
        public static Vector256<long> ConvertToVector256Int64(Vector128<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Int64(value, mode);

        // VCVTPS2UDQ ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<uint> ConvertToVector256UInt32(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256UInt32(value, mode);

        // VCVTPS2UQQ ymm1{k1}{z}, xmm2/m128/m32bcst {er}
        public static Vector256<ulong> ConvertToVector256UInt64(Vector128<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256UInt64(value, mode);

        // VCVTUQQ2PS xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<float> ConvertToVector128Single(Vector256<ulong> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Single(value, mode);
        
        // VCVTUQQ2PD ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<double> ConvertToVector256Double(Vector256<ulong> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Double(value, mode);
        
        // VCVTUDQ2PS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<float> ConvertToVector256Single(Vector256<uint> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Single(value, mode);
        
        // VCVTQQ2PS xmm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector128<float> ConvertToVector128Single(Vector256<long> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector128Single(value, mode);
        
        // VCVTQQ2PD ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<double> ConvertToVector256Double(Vector256<long> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToVector256Double(value, mode);
        
        // VMULPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Multiply(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Multiply(left, right, mode);
        
        // VMULPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Multiply(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Multiply(left, right, mode);
        
        // VSCALEFPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Scale(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Scale(left, right, mode);
        
        // VSCALEFPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Scale(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Scale(left, right, mode);
        
        // VSQRTPD ymm1{k1}{z}, ymm2/m256/m64bcst {er}
        public static Vector256<double> Sqrt(Vector256<double> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Sqrt(value, mode);
        
        // VSQRTPS ymm1{k1}{z}, ymm2/m256/m32bcst {er}
        public static Vector256<float> Sqrt(Vector256<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Sqrt(value, mode);
        
        // VSUBPD ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst {er}
        public static Vector256<double> Subtract(Vector256<double> left, Vector256<double> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Subtract(left, right, mode);
        
        // VSUBPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst {er}
        public static Vector256<float> Subtract(Vector256<float> left, Vector256<float> right, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => Subtract(left, right, mode);
        
        [Intrinsic]
        public new abstract class X64 : Avx10v1.X64
        {
            internal X64() { }

            public static new bool IsSupported { get => IsSupported; }
        }

        [Intrinsic]
        public abstract class V512 : Avx10v1.V512
        {
            internal V512() { }

            public static new bool IsSupported { get => IsSupported; }
    
            // VMINMAXPD zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst {sae}, imm8
            public static Vector512<double> MinMax(Vector512<double> left, Vector512<double> right, [ConstantExpected] byte control) => MinMax(left, right, control);
            
            // VMINMAXPS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst {sae}, imm8
            public static Vector512<float> MinMax(Vector512<float> left, Vector512<float> right, [ConstantExpected] byte control) => MinMax(left, right, control);
            
            // VCVTPS2IBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<int> ConvertToByteWithSaturationAndWidenToInt32(Vector512<float> value) => ConvertToByteWithSaturationAndWidenToInt32(value);
            
            // VCVTPS2IBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<int> ConvertToByteWithSaturationAndWidenToInt32(Vector512<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToInt32(value, mode);
            
            // VCVTPS2IUBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector512<float> value) => ConvertToByteWithSaturationAndWidenToUInt32(value);
            
            // VCVTPS2IUBS zmm1{k1}{z}, zmm2/m512/m32bcst {er}
            public static Vector512<uint> ConvertToByteWithSaturationAndWidenToUInt32(Vector512<float> value, [ConstantExpected(Max = FloatRoundingMode.ToZero)] FloatRoundingMode mode) => ConvertToByteWithSaturationAndWidenToUInt32(value, mode);

            // VCVTTPS2IBS zmm1{k1}{z}, zmm2/m512/m32bcst {sae}
            public static Vector512<int> ConvertToByteWithTruncatedSaturationAndWidenToInt32(Vector512<float> value) => ConvertToByteWithTruncatedSaturationAndWidenToInt32(value);
                        
            // VCVTTPS2IUBS zmm1{k1}{z}, zmm2/m512/m32bcst {sae}
            public static Vector512<uint> ConvertToByteWithTruncatedSaturationAndWidenToUInt32(Vector512<float> value) => ConvertToByteWithTruncatedSaturationAndWidenToUInt32(value);
            
            // This is a 512-bit extension of the previously existing 128/256-bit intrinsic
            // VMPSADBW zmm1{k1}{z}, zmm2, zmm3/m512, imm8
            public static Vector512<ushort> MultipleSumAbsoluteDifferences(Vector512<byte> left, Vector512<byte> right, [ConstantExpected] byte mask) => MultipleSumAbsoluteDifferences(left, right, mask);

            [Intrinsic]
            public new abstract class X64 : Avx10v1.V512.X64
            {
                internal X64() { }

                public static new bool IsSupported { get => IsSupported; }
            }
        }
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Nov 19, 2024
@khushal1996
Contributor

I would also like to discuss the following APIs

        // VMOVD xmm1, xmm2/m32
        public static Vector128<uint> ConvertToVector128UInt32(Vector128<uint> value) => ConvertToVector128UInt32(value);

        // VMOVW xmm1, xmm2/m16
        public static Vector128<ushort> ConvertToVector128UInt16(Vector128<ushort> value) => ConvertToVector128UInt16(value);

and would like to change them to

        public static unsafe void StoreLowDWord(byte* address, Vector128<uint> source) => StoreLowDWord(address, source);
        public static unsafe void StoreLowWord(byte* address, Vector128<ushort> source) => StoreLowWord(address, source);
        public static unsafe Vector128<uint> RetrieveLowDWord(byte* address) => RetrieveLowDWord(address);
        public static unsafe Vector128<ushort> RetrieveLowWord(byte* address) => RetrieveLowWord(address);


@tannergooding
Member

@khushal1996 the proposed signatures don't match the .NET naming conventions (we'd still use UInt32), and they also don't cover all the functionality the underlying API supports.

In particular, we already expose the existing movd/movq variants that move between general-purpose and SIMD registers, and those can already load from or store to memory. These notably have a signature similar to static Vector128<int> ConvertScalarToVector128Int32(int value)

Likewise, while we expose some instructions like movss, where the managed signature is static Vector128<float> MoveScalar(Vector128<float> upper, Vector128<float> value), these preserve the upper bits.

The new movd/movw variants are most similar to the existing movq variant, which deals with SIMD to SIMD and zeroes the upper bits. The latter is exposed in two ways today: MoveScalar for SIMD to/from SIMD and ConvertToVector128UInt64 for general-purpose to/from SIMD. So it might be good for these to similarly be MoveScalar rather than ConvertToVector128*
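To make the distinction concrete, here is a rough portable model of the two behaviors using only the cross-platform Vector128 helpers. It is a sketch, not the proposed intrinsics: Vector128<uint> stands in for both cases, and the variable names `zeroed` and `merged` are made up for illustration.

```csharp
using System;
using System.Runtime.Intrinsics;

Vector128<uint> value = Vector128.Create(1u, 2u, 3u, 4u);
Vector128<uint> upper = Vector128.Create(10u, 20u, 30u, 40u);

// movq-style (and, per the discussion, the new movd/movw forms):
// keep the lowest element of value and zero every other element.
Vector128<uint> zeroed = Vector128.CreateScalar(value.ToScalar());

// movss-style: take the lowest element from value and preserve
// the remaining elements of upper.
Vector128<uint> merged = upper.WithElement(0, value.ToScalar());

Console.WriteLine($"zeroed: {zeroed.GetElement(0)}, {zeroed.GetElement(1)}");
Console.WriteLine($"merged: {merged.GetElement(0)}, {merged.GetElement(1)}");
```

The single-argument shape corresponds to the zeroing behavior, while the two-argument shape is what lets the upper bits be preserved.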

@khushal1996
Contributor

Thanks @tannergooding
To conclude, this is what you are proposing:

public static Vector128<uint> MoveScalarUInt32(Vector128<uint>)

@tannergooding
Member

In this case, since it's a move and no conversions are possible, it'd just be 2-4 MoveScalar overloads (depending on whether we want only unsigned or also signed overloads).

@khushal1996
Contributor

Thanks @tannergooding

Concluding this discussion with the addition of the following APIs:

public static Vector128<uint> MoveScalar(Vector128<uint>)
public static Vector128<int> MoveScalar(Vector128<int>)
public static Vector128<ushort> MoveScalar(Vector128<ushort>)
public static Vector128<short> MoveScalar(Vector128<short>)

@DeepakRajendrakumaran
Contributor Author

Thanks @tannergooding

Concluding this discussion with the addition of the following APIs:

public static Vector128<uint> MoveScalar(Vector128<uint>)
public static Vector128<int> MoveScalar(Vector128<int>)
public static Vector128<ushort> MoveScalar(Vector128<ushort>)
public static Vector128<short> MoveScalar(Vector128<short>)

@tannergooding Assuming we are planning on this, do we add it to the proposal somewhere, now that the API name has changed from the original approved proposal?

@tannergooding
Member

We'd need to extract them to their own proposal. API review is done for the year so we should get to them sometime in January. We shouldn't be blocked on doing any work around the other approved APIs in the meantime, however.

@khushal1996
Contributor

khushal1996 commented Dec 12, 2024

I think we should default to the verbose name, which is the most consistent with our other APIs and the least problematic.

We'll likely discuss some of the alternatives in API review and it wouldn't hurt to have them listed.

Notably we have Vector128<int> ConvertToInt32(Vector128<float> value) in SSE2 (and similar for Int64/UInt32/UInt64 in other ISAs), so something like ConvertToByteWithSaturationAndWidenToInt32 might be a feasible shorter name that won't conflict, but I don't think we could get much shorter otherwise.

Since the APIs listed in the description and in #109083 (comment) differ from what we agreed upon, I just want to confirm: for vcvtps2ibs, vcvtps2iubs, vcvttps2ibs, and vcvttps2iubs, we decided to go with the following names, respectively:

ConvertToSByteWithSaturationAndWidenToInt32
ConvertToByteWithSaturationAndWidenToInt32
ConvertToSByteWithTruncationSaturationAndWidenToInt32
ConvertToByteWithTruncationSaturationAndWidenToInt32

Let me know if this is correct.

@tannergooding
Member

They should be matching the naming, particularly the first two.

The latter two match the general format but were changed from WithTruncationSaturation to instead be WithTruncatedSaturation which API review felt relayed the concept better.

This should match what is under the approving comment here: #109083 (comment)

@khushal1996
Contributor

khushal1996 commented Dec 12, 2024

They should be matching the naming, particularly the first two.

The latter two match the general format but were changed from WithTruncationSaturation to instead be WithTruncatedSaturation which API review felt relayed the concept better.

This should match what is under the approving comment here: #109083 (comment)

I think there is a mismatch between what the approving comment says and what we were actually discussing here: #109083 (comment). Our discussion reached no conclusion, and the approving comment went in the wrong direction because of the original proposal.

There is no such thing as ConvertToByteWithSaturationAndWidenToUInt32. The result is always Int32 no matter what. The conversion is done to signed and unsigned byte, and the result is always a zero-extended integer.

@tannergooding
Member

tannergooding commented Dec 13, 2024

The result is always Int32 no matter what

This is a bit of a nomenclature thing. The result of a byte being widened to Int32 and UInt32 is identical both bitwise and in the value represented after the widening.

It is only the result of an sbyte being widened where they differ as -1 becomes -1 as an Int32 and 4294967295 as a UInt32.
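As a quick illustration of that difference in plain C# (ordinary scalar conversions, nothing AVX10.2-specific; the variable names are only for this sketch):

```csharp
using System;

// Widening a byte zero-extends, so Int32 and UInt32 hold the same value.
byte b = 255;
int byteAsInt32 = b;
uint byteAsUInt32 = b;

// Widening an sbyte sign-extends, so the two interpretations differ.
sbyte s = -1;
int sbyteAsInt32 = s;
uint sbyteAsUInt32 = unchecked((uint)s);

Console.WriteLine($"{byteAsInt32} {byteAsUInt32}");   // 255 255
Console.WriteLine($"{sbyteAsInt32} {sbyteAsUInt32}"); // -1 4294967295
```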

The conversion is done to signed and unsigned byte and result is always zero extended integer.

The spec gives the following:

The downconverted 8-bit result is written inplace at the lower 8-bit of the corresponding 32-bit element.
The upper 3 bytes are zeroed.

VCVTPS2IBS converts single-precision floating point elements into signed byte integer elements. When a
conversion is inexact, floating-point precision exception is raised and the value returned is rounded
according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a
converted result cannot be represented in the destination format, the floating-point invalid exception is
raised, and if this exception is masked then: If value is too big, the INT_MAX value (2^(w-1)-1, where w
represents the number of bits in the destination format) is returned. If value is too small, the INT_MIN
value (2^(w-1)) is returned. For NaN, (0) is returned.

VCVTPS2IUBS converts single-precision floating point elements into un-signed byte integer elements.
When a conversion is inexact, floating-point precision exception is raised and the value returned is
rounded according to the rounding control bits in the MXCSR register or the embedded rounding control
bits. If a converted result cannot be represented in the destination format, the floating-point invalid
exception is raised, and if this exception is masked then: If value is too big, the UINT_MAX value (2^w-1,
where w represents the number of bits in the destination format) is returned. If value is too small, the
UINT_MIN value (0) is returned. For NaN, (0) is returned.

The nuance is then the "upper 3 bytes are zeroed" portion which means that ConvertToSByteWithSaturationAndWidenToInt32 is "incorrect" because we aren't widening. Instead it should be ConvertToSByteWithSaturationAndZeroExtendToInt32 which means that -1 becomes 255 as an Int32 or UInt32, it doesn't become -1 as some might expect. You therefore have to extract every 4th byte and still manually upcast to int to get a correct result, rather than simply extracting the whole 32-bit integer like you can with ConvertToByteWithSaturationAndWidenTo...
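A scalar model of a single 32-bit lane may make this easier to see. This is only a sketch of the behavior described in the spec excerpt above, not a real .NET API: the helper name `ConvertToSByteSaturateZeroExtend` is hypothetical, exceptions are ignored, and round-to-nearest-even is assumed as the rounding mode.

```csharp
using System;

// Hypothetical scalar model of one lane of the signed conversion:
// saturate to the sbyte range, then place the 8-bit result in the
// low byte of the 32-bit lane with the upper 3 bytes zeroed.
static int ConvertToSByteSaturateZeroExtend(float value)
{
    if (float.IsNaN(value))
    {
        return 0; // the spec excerpt says NaN yields 0
    }

    // Assumption: default rounding mode is round-to-nearest-even.
    double rounded = Math.Round((double)value, MidpointRounding.ToEven);
    sbyte narrowed = (sbyte)Math.Clamp(rounded, sbyte.MinValue, sbyte.MaxValue);

    // Reinterpret as an unsigned byte so widening to int zero-extends.
    return unchecked((byte)narrowed);
}

int lane = ConvertToSByteSaturateZeroExtend(-1.0f);

// Reading the whole 32-bit lane gives 255, not -1. Recovering the
// signed value needs an explicit narrowing back to sbyte.
int recovered = unchecked((sbyte)lane);

Console.WriteLine($"lane = {lane}, recovered = {recovered}");
```

Under these assumptions the lane holds 255 for an input of -1.0f, and only the explicit cast back to sbyte brings back -1, which is the extra step described above.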

@khushal1996
Contributor

The nuance is then the "upper 3 bytes are zeroed" portion which means that ConvertToSByteWithSaturationAndWidenToInt32 is "incorrect" because we aren't widening. Instead it should be ConvertToSByteWithSaturationAndZeroExtendToInt32

True. I will change the APIs to the following

ConvertToSByteWithSaturationAndZeroExtendToInt32
ConvertToByteWithSaturationAndZeroExtendToInt32
ConvertToSByteWithTruncatedSaturationAndZeroExtendToInt32
ConvertToByteWithTruncatedSaturationAndZeroExtendToInt32

@tannergooding
Member

tannergooding commented Jan 31, 2025

@khushal1996 this proposal was fully handled in #111209 and so can be closed now, correct?

@khushal1996
Contributor

@tannergooding yes, this proposal can be considered fully handled and we will have 2 proposals for the remaining APIs

  1. [API Proposal]: Add AVX10v2 API to add Avx10.2 support #109083 (comment)
  2. [API Proposal]: Add dot product intrinsics to AVX10v2 API #110032

@khushal1996
Contributor

khushal1996 commented Feb 19, 2025

We'd need to extract them to their own proposal. API review is done for the year so we should get to them sometime in January. We shouldn't be blocked on doing any work around the other approved APIs in the meantime, however.

Hi @tannergooding. I was going through the pending APIs, and we have yet to add these 4 APIs for AVX10.2. As discussed, I will open a new issue to discuss them and we can take it forward. Let me know if there are any concerns.

@tannergooding
Member

No concerns, thanks for following up!
