-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for AVX10.2, Add AVX10.2 API surface and template tests #111209
Add support for AVX10.2, Add AVX10.2 API surface and template tests #111209
Conversation
Note regarding the
|
1 similar comment
Note regarding the
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
ec7732d
to
216999c
Compare
7f2b5ae
to
83868ab
Compare
src/coreclr/jit/emitxarch.cpp
Outdated
// Reserved for isas Avx10.1 and below | ||
// Needs to be set to 0 for AVX10.2 adn above to indicate YMM embedded rounding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
// Reserved for isas Avx10.1 and below | |
// Needs to be set to 0 for AVX10.2 adn above to indicate YMM embedded rounding | |
// Reserved for isas Avx10.1 and below | |
// Set to 0 on AVX10.2 and above for YMM embedded rounding support |
Or something along those lines. The current wording makes it sound like it must be always set to 0, rather than only conditionally set if we're using embedded rounding for YMM sizes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe on Avx10.1 and below it is required to be set to 1
as well, so this should likely indicate that. Simply reserved doesn't imply a necessary state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
src/coreclr/jit/emitxarch.cpp
Outdated
// Scalar Double Scalar Single Packed Double | ||
return ((b == 0xF2) || (b == 0xF3) || (b == 0x66)); | ||
// Scalar Double Scalar Single Packed Double No prefix | ||
return ((b == 0xF2) || (b == 0xF3) || (b == 0x66) || (b == 0x00)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't necessary given the assert(b != 0
above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
@@ -2314,6 +2346,13 @@ emitter::code_t emitter::emitExtractEvexPrefix(instruction ins, code_t& code) co | |||
break; | |||
} | |||
|
|||
case 0x05: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to handle leadingBytes
being [0x00, 0x04]
and [0x06, 0x07]
?
The higher assert seems to indicate it can be any of those, but you've only added 0x05
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now we dont have those instructions. But yes, I think it is better to have to handle those as invalid as of now since we dont have those instructions yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
HARDWARE_INTRINSIC(AVX10v2_V512, ConvertToVectorUInt32WithTruncationSaturation, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttps2udqs, INS_vcvttpd2udqs}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbBroadcastCompatible|HW_Flag_EmbMaskingCompatible) | ||
HARDWARE_INTRINSIC(AVX10v2_V512, ConvertToVectorUInt64WithTruncationSaturation, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttps2uqqs, INS_vcvttpd2uqqs}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbBroadcastCompatible|HW_Flag_EmbMaskingCompatible) | ||
HARDWARE_INTRINSIC(AVX10v2_V512, MinMax, 64, 3, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vminmaxps, INS_vminmaxpd}, HW_Category_IMM, HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbBroadcastCompatible|HW_Flag_EmbMaskingCompatible) | ||
HARDWARE_INTRINSIC(AVX10v2_V512, MultipleSumAbsoluteDifferences, 64, 3, {INS_vmpsadbw, INS_invalid, INS_invalid, INS_vmpsadbw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_EmbMaskingCompatible) // TBD where should we put the instruction typ_byte or typ_ushort? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the TBD
, this has to match the tracked simdBaseType
that the node has.
The node typically defaults to the base type of the SIMD return as that is generally unambiguous and matches the types used for the inputs. However, if there are conflicts there then we switch to the base type of the first or second argument, depending on which is "unique" and allows disambiguation (in the same way it would for overload resolution).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I missed the TBD comment here. So what you are saying is the simdBaseType
of the node needs to be matched here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding there seems to be a bug here and the template test for MultipleSumAbsoluteDifferences seems to be failing. It hits an assert because the simdBaseType was TYP_USHORT. Is that correct for this API? should the simdBaseType be TYP_BYTE or TYP_USHORT? from your comment above, I concluded that it needs to be TYP_BYTE since the API is
public static Vector512<ushort> MultipleSumAbsoluteDifferences(Vector512<byte> left, Vector512<byte> right, [ConstantExpected] byte mask) => MultipleSumAbsoluteDifferences(left, right, mask);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khushal1996 We start with it undefined
here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.cpp#L1694C28-L1694C43
If the return type is a struct, we then initialize simdBaseJitType
and sizeBytes
based on the information extracted from it: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.cpp#L1700; thus we get CORINFO_TYPE_USHORT
and 64
We have a path that may adjust this here, by looking at the type of the input arguments: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.cpp#L1765
However, such a path only triggers if HW_Flag_BaseTypeFromFirstArg
or HW_Flag_BaseTypeFromSecondArg
is set: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.cpp#L732, otherwise it preserves the original value.
It can also be updated if HW_Flag_NormalizeSmallTypeToInt
is set: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.cpp#L1847-L1851
Since the category and flags on AVX10v2_V512_MultipleSumAbsoluteDifferences
are: HW_Category_IMM
and HW_Flag_FullRangeIMM | HW_Flag_EmbMaskingCompatible
, debugging should show that the base type is computed to be CORINFO_TYP_USHORT
.
Is that not the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from what you are saying, we can control the simdBaseJitType and we have to define an intrinsic based on that? There is no fixed way but multiple ways to define a HardwareIntrinsic?
For this, in a way,
HARDWARE_INTRINSIC(AVX10v2_V512, MultipleSumAbsoluteDifferences, 64, 3, {INS_invalid, INS_invalid, INS_invalid, INS_vmpsadbw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_EmbMaskingCompatible)
HARDWARE_INTRINSIC(AVX10v2_V512, MultipleSumAbsoluteDifferences, 64, 3, {INS_invalid, INS_vmpsadbw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_BaseTypeFromFirstArg|HW_Flag_FullRangeIMM|HW_Flag_EmbMaskingCompatible)
So both of them will be handled well and there can be more than one way to handle them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from what you are saying, we can control the simdBaseJitType and we have to define an intrinsic based on tha
Right, with the intent being that the defaults are good enough in most cases and we only deviate from that default when something requires it; such as needing a different base type to determine the correct instruction to emit.
So both of them will be handled well and there can be more than one way to handle them?
Yes, assuming of course that the difference between TYP_UBYTE
and TYP_USHORT
is otherwise irrelevant to the other phases. If you needed to key off the base type of the parameter such as to determine if embedded masking/containment was valid instead, that's a different story and you'd want the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it thanks for the detailed explanation. I have identified the actual bug and will raise an external PR soon for the failing case. But before that I would also want to evaluate the case of embeddded masking since it is supported here. let me investigate and then raise an external PR.
The changes overall LGTM. Just a couple small cleanup asks. It should generally be good for secondary review, CC. @dotnet/jit-contrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me except for one nit and one thing in lookupInstructionSet
that looks like a bug.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx10v2.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Bruce Forstall <brucefo@microsoft.com>
Thanks for all the work here @khushal1996! |
+1 Thanks @khushal1996! |
Thanks @tannergooding @BruceForstall |
* main: (22 commits) Clean up Stopwatch a bit (dotnet#111834) JIT: Fix embedded broadcast simd size (dotnet#111638) Revert potential UB due to aliasing + more WB removals (dotnet#111733) re-enable acceleration of Vector512<long>.op_Multiply (dotnet#111832) Handle OSSL 3.4 change to SAN:othername formatting JIT: Fix stack allocated arrays for NativeAOT (dotnet#111827) JIT: enhance RBO inference for similar compares to constants (dotnet#111766) JIT: Don't run optSetBlockWeights when we have PGO data (dotnet#111764) [Android] Make sure RuntimeFlavor=CoreCLR when clr subset is specified (dotnet#111821) Change empty subject test certificate to include a critical SAN. Fix reversed code offsets in GcInfo (dotnet#111792) Swap some libraries areas between leads (dotnet#111816) Add left-handed spherical and cylindrical billboards (dotnet#109605) JIT: revise `optRelopImpliesRelop` to always set `reverseSense` (dotnet#111803) Fix Zip64ExtraField handling (dotnet#111802) Add build support for Android+CoreCLR (dotnet#110471) arm64: Add bic(s) compact encoding (dotnet#111452) JIT: Ensure `BBF_PROF_WEIGHT` flag is set when we have PGO data (dotnet#111780) Add support for AVX10.2, Add AVX10.2 API surface and template tests (dotnet#111209) JIT: Preliminary for enabling inlining late devirted calls (dotnet#111782) ...
This PR covers implementation of approved AVX10.2 APIs along with addition of corresponding template tests and lowering support in JIT.
It enables ymm embedded rounding which comes with AVX10.2
data:image/s3,"s3://crabby-images/afc13/afc1331553df2b787d0037ed2739c5daadb0cc56" alt="image"
Addresses #109083
Testing overview
We follow a multi-step testing plan to verify the encoding correctness and the semantic correctness.
Testing results will be presented below.
In codgenxarch.cpp, similar to genAmd64EmitterUnitTestsSse2, we used the JitLateDisasm feature to insert instructions to encode as unit tests for emitter, and LateDisasm will invoke LLVM to disasm the code stream, this gave us the chance to cross validate the disassembly from JIT and LLVM. The output of this step is to verify the emit paths are generating "correct" code that would not trigger #UD or have wrong semantics.
Note that we are using a custom coredistools.dll which uses a recent LLVM that supports AVX10.2 decoding.
In this step, we would run the SuperPMI pipeline to get the asmdiffs,; the inputs are all the MCH files. This step will give us the chance to check if there is any assertion failure or internal error within JIT and since the pipeline will invoke coredistools.dll as well, so we can verify the encoding correctness in a larger scope.
To ensure the new changes will not hit the existing code path in terms of throughput, we ran asmdiffs with base JIT to be the main branch where changes are based on, and diff JIT to be the one with all the changes.
The 2 steps mentioned above are mainly verifying the encoding correctness of the generated binary code. In this step we have used the existing CoreCLR unit test set: JIT and run it in the Intel SDE emulator with
AVX10.2
on and off.Testing results
Run Emitter tests
Result of emitter tests using LLVM disassembler (left - JIT emitted code, right - LLVM disassembler output)
data:image/s3,"s3://crabby-images/2cf6f/2cf6ffa804fa1e01f00349d01295fb584d85e139" alt="image"
data:image/s3,"s3://crabby-images/3e68d/3e68d6be660525bb31081d7feba07122085925f1" alt="image"
data:image/s3,"s3://crabby-images/ae564/ae5640b2b2c444c15106c027c4590715aa679b18" alt="image"
Run superpmi using JITLateDisasm to check for errors
No Decode failures observed in superpmi log.
Running superpmi without JITLateDisasm to check for assert errors
No assertion failures or asm diffs observed
Run JIT subtree with AVX10.2 enabled / disabled
AVX10.2 enabled
AVX10.2 disabled