Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above #22713

Merged
merged 8 commits into from
Nov 6, 2024

Conversation

tianleiwu
Copy link
Contributor

Description

Based on #9700, and extend it to ArgMin as well.

This pull request introduces several enhancements and fixes related to the ArgMax and ArgMin operators in the CUDA execution provider. The changes ensure proper handling of these operators across different versions and improve kernel registration and fallback mechanisms.

Key changes include:

Enhancements to ArgMax and ArgMin Operators:

  • Added new kernel class registrations for ArgMax and ArgMin for different data types and versions in onnxruntime/core/providers/cuda/cuda_execution_provider.cc. [1] [2] [3] [4] [5] [6]

  • Introduced ArgMaxOrArgMinNeedFallbackToCPU function to handle fallback to CPU when the select_last_index attribute is set to 1, as CUDA does not support this attribute. [1] [2]

Macro and Kernel Registration Improvements:

  • Replaced REGISTER_KERNEL_UNTIL_VERSIONED_TYPED with REGISTER_KERNEL_VERSIONED_RANGE_TYPED and REGISTER_KERNEL_VERSIONED_SINCE_TYPED macros for better version handling. [1] [2]

  • Updated kernel registration for ArgMax and ArgMin to use the new macros, ensuring proper version handling and support for different data types.

Safety Checks:

  • Added safety checks in the ArgMax and ArgMin classes to ensure select_last_index is not set to 1, as it is not supported on CUDA. [1] [2]

Testing Enhancements:

  • Added new tests for ArgMax and ArgMin operators to verify behavior when select_last_index is set to 0, ensuring compatibility with both CPU and CUDA execution providers. [1] [2]

Motivation and Context

Improve CUDA kernel coverage for stable diffusion model and hence improve its performance on CUDA

@tianleiwu tianleiwu changed the title [CUDA] Conditionally support ArgMax and ArgMin for opset 12 and above [CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above Nov 4, 2024
hariharans29
hariharans29 previously approved these changes Nov 4, 2024
yuslepukhin
yuslepukhin previously approved these changes Nov 5, 2024
Copy link
Member

@yuslepukhin yuslepukhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

yuslepukhin
yuslepukhin previously approved these changes Nov 5, 2024
Copy link
Member

@yuslepukhin yuslepukhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@yuslepukhin yuslepukhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@tianleiwu tianleiwu merged commit ba22d78 into main Nov 6, 2024
90 of 91 checks passed
@tianleiwu tianleiwu deleted the tlwu/cuda_argmin_argmax_12_and_13 branch November 6, 2024 17:54
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request Nov 19, 2024
…above (microsoft#22713)

### Description
Based on microsoft#9700, and extend
it to ArgMin as well.

This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.

Key changes include:

#### Enhancements to `ArgMax` and `ArgMin` Operators:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)

* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)

#### Macro and Kernel Registration Improvements:

* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)

* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.

#### Safety Checks:

* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)

#### Testing Enhancements:

* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)

### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
guschmue pushed a commit that referenced this pull request Dec 2, 2024
…above (#22713)

### Description
Based on #9700, and extend
it to ArgMin as well.

This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.

Key changes include:

#### Enhancements to `ArgMax` and `ArgMin` Operators:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)

* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)

#### Macro and Kernel Registration Improvements:

* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)

* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.

#### Safety Checks:

* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)

#### Testing Enhancements:

* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)

### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
…above (microsoft#22713)

### Description
Based on microsoft#9700, and extend
it to ArgMin as well.

This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.

Key changes include:

#### Enhancements to `ArgMax` and `ArgMin` Operators:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)

* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)

#### Macro and Kernel Registration Improvements:

* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)

* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.

#### Safety Checks:

* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)

#### Testing Enhancements:

* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)

### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Dec 11, 2024
…above (microsoft#22713)

### Description
Based on microsoft#9700, and extend
it to ArgMin as well.

This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.

Key changes include:

#### Enhancements to `ArgMax` and `ArgMin` Operators:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)

* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)

#### Macro and Kernel Registration Improvements:

* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)

* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.

#### Safety Checks:

* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)

#### Testing Enhancements:

* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)

### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants