[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above #22713

tianleiwu · 2024-11-04T16:29:47Z

Description

Based on #9700, extended to cover ArgMin as well.

This pull request extends the ArgMax and ArgMin operators in the CUDA execution provider to opset 12 and above, with support conditional on the select_last_index attribute being 0. It also reworks kernel registration across opset versions and adds a fallback to CPU for the unsupported case.

Key changes include:

Enhancements to `ArgMax` and `ArgMin` Operators:

Added new kernel class registrations for `ArgMax` and `ArgMin`, covering multiple data types and opset versions, in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
Introduced the `ArgMaxOrArgMinNeedFallbackToCPU` function, which falls back to CPU when the `select_last_index` attribute is set to 1, because the CUDA kernels do not support that setting.
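The fallback decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `IntAttributes` and `NeedsCpuFallback` are hypothetical names, and a plain map stands in for the node's attributes. The one fact it relies on is that ONNX defines `select_last_index` with a default of 0.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a graph node's integer attributes; this is not
// the onnxruntime Node type.
using IntAttributes = std::map<std::string, std::int64_t>;

// Returns true when the node should be left to the CPU provider.
// A missing attribute means the default (0), which the accelerator kernel
// can handle, so only an explicit non-zero value triggers the fallback.
bool NeedsCpuFallback(const IntAttributes& attrs) {
  const auto it = attrs.find("select_last_index");
  return it != attrs.end() && it->second != 0;
}
```

Making the decision at node-assignment time keeps an unsupported node from ever reaching the CUDA kernel, rather than failing at run time.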

Macro and Kernel Registration Improvements:

Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with the `REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and `REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version handling.
Updated kernel registration for `ArgMax` and `ArgMin` to use the new macros, ensuring proper version handling and support for different data types.
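To illustrate the difference between a bounded version range and an open-ended "since" registration, here is a simplified model. `VersionSpan`, `kOpenEnded`, `Covers`, and `FindRegistration` are hypothetical names for this sketch only. It does not reproduce the macros or onnxruntime's actual kernel lookup, and the ranges in the usage note are illustrative rather than the PR's registrations.

```cpp
#include <cstddef>
#include <vector>

// Sentinel meaning "no upper bound" in this sketch.
constexpr int kOpenEnded = -1;

// Hypothetical model of the opset versions one registration covers.
struct VersionSpan {
  int since;  // first opset version covered
  int until;  // last opset version covered, or kOpenEnded
};

// A bounded registration covers [since, until]; an open-ended one covers
// every version from `since` onward.
bool Covers(const VersionSpan& span, int opset) {
  if (opset < span.since) return false;
  return span.until == kOpenEnded || opset <= span.until;
}

// Returns the index of the first registration covering `opset`, or -1.
int FindRegistration(const std::vector<VersionSpan>& spans, int opset) {
  for (std::size_t i = 0; i < spans.size(); ++i) {
    if (Covers(spans[i], opset)) return static_cast<int>(i);
  }
  return -1;
}
```

With the spans {1, 11}, {12, 12}, and {13, kOpenEnded}, opset 12 resolves to the second entry and any opset from 13 upward resolves to the third.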

Safety Checks:

Added safety checks in the `ArgMax` and `ArgMin` classes to ensure `select_last_index` is not set to 1, as that value is not supported on CUDA.
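A constructor-time guard of this kind might look like the sketch below. `ArgReduceKernel` is a hypothetical class, and a standard exception is used here as an assumption in place of whatever error-handling mechanism the provider code actually uses.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical kernel wrapper, for illustration only.
class ArgReduceKernel {
 public:
  explicit ArgReduceKernel(std::int64_t select_last_index) {
    // Second line of defense: reject a node that slipped past the
    // CPU-fallback decision instead of silently computing a wrong index.
    if (select_last_index != 0) {
      throw std::invalid_argument(
          "select_last_index=1 is not supported by this kernel");
    }
  }
};
```

The check is redundant when the fallback works as intended, but it turns a silent wrong answer into an immediate, explainable failure.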

Testing Enhancements:

Added new tests for the `ArgMax` and `ArgMin` operators to verify behavior when `select_last_index` is set to 0, ensuring compatibility with both the CPU and CUDA execution providers.
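For readers unfamiliar with the attribute, the following reference sketch shows what `select_last_index` changes: it only matters when the extreme value occurs more than once. `ArgMax1D` is a hypothetical helper over a non-empty 1-D vector, not the test code from the PR; ArgMin mirrors it with the comparisons reversed.

```cpp
#include <cstddef>
#include <vector>

// Reference ArgMax over a non-empty 1-D vector. On ties, the first maximal
// index wins unless select_last_index is true, in which case the last wins.
std::size_t ArgMax1D(const std::vector<float>& values, bool select_last_index) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < values.size(); ++i) {
    // Using >= lets a later equal value replace the current best.
    const bool better = select_last_index ? values[i] >= values[best]
                                          : values[i] > values[best];
    if (better) best = i;
  }
  return best;
}
```

For the input {1, 3, 3, 2}, the function returns index 1 with `select_last_index` false and index 2 with it true.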

Motivation and Context

Improve CUDA kernel coverage for Stable Diffusion models, and hence their performance on CUDA.

tianleiwu added 2 commits November 4, 2024 06:31

cuda ArgMin-12, ArgMin-13, ArgMax-12, ArgMax-13

7a4afce

update doc

bb5335a

tianleiwu requested a review from hariharans29 November 4, 2024 16:33

update comments

36da528

tianleiwu requested review from pranavsharma and yuslepukhin November 4, 2024 16:46

hariharans29 reviewed Nov 4, 2024

onnxruntime/core/providers/cuda/cuda_execution_provider.cc (outdated)

hariharans29 reviewed Nov 4, 2024

onnxruntime/core/providers/cuda/cuda_execution_provider.cc

yuslepukhin reviewed Nov 4, 2024

onnxruntime/core/providers/cuda/cuda_execution_provider.cc

tianleiwu added 2 commits November 4, 2024 23:19

test random

1b7311a

ArgMax / ArgMin in ROCm

cbaeebb

tianleiwu changed the title ~~[CUDA] Conditionally support ArgMax and ArgMin for opset 12 and above~~ [CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above Nov 4, 2024

hariharans29 previously approved these changes Nov 4, 2024

fix rocm

8b7c924

tianleiwu dismissed hariharans29’s stale review via 8b7c924 November 5, 2024 00:23

yuslepukhin previously approved these changes Nov 5, 2024

Fix openvino CI

6da3dea

tianleiwu dismissed yuslepukhin’s stale review via 6da3dea November 5, 2024 17:59

yuslepukhin previously approved these changes Nov 5, 2024

Exclude OpenVino in the ArgMax test

193f70f

tianleiwu dismissed yuslepukhin’s stale review via 193f70f November 5, 2024 23:05

tianleiwu requested review from yuslepukhin and hariharans29 November 6, 2024 00:22

yuslepukhin approved these changes Nov 6, 2024

tianleiwu merged commit ba22d78 into main Nov 6, 2024
90 of 91 checks passed

tianleiwu deleted the tlwu/cuda_argmin_argmax_12_and_13 branch November 6, 2024 17:54
