
Unblock migraphx and linux GPU training ci pipelines #21662

Merged: 6 commits merged into main from tlwu/fix_migraphx_and_linux_training_ci_pipelines on Aug 9, 2024

Conversation

tianleiwu (Contributor) commented on Aug 7, 2024

### Description

Fix the migraphx build error caused by #21598 by adding a conditional compile around the code block that depends on ROCm >= 6.2 (the pipeline uses ROCm 6.0).

Unblock the orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, and orttraining-amd-gpu-ci-pipeline pipelines:

  • Disable a model test in the Linux GPU training CI pipelines that was broken by #19470 (Adding CUDNN Frontend and use for CUDA NN Convolution):
    Sometimes the cuDNN frontend throws an exception saying the cuDNN graph does not support a Conv node of the keras_lotus_resnet3D model on V100 GPUs.
    Note that the same test does not throw in other GPU pipelines. The failure is likely related to the cuDNN 8.9 / V100 combination used in this pipeline (Ampere GPUs and cuDNN 9.x do not have the issue).
    The actual fix requires fallback logic, which will take time to implement, so we temporarily disable the test in the training pipelines.
  • Force-install torch built for CUDA 11.8. (The docker image has torch 2.4.0 for CUDA 12.1 to build torch extensions, which is not compatible with CUDA 11.8.) Note that this is a temporary workaround; a more elegant fix is to ensure the right torch version in the docker build step, which might require updating install_python_deps.sh and the corresponding requirements.txt.
  • Skip test_gradient_correctness_conv1d since it causes a segmentation fault. The root cause needs more investigation (possibly the cuDNN frontend as well). A sketch of such a skip is shown after this list.
  • Skip test_aten_attention since it causes an assertion failure. The root cause needs more investigation (possibly the torch version).
  • Skip orttraining_ortmodule_distributed_tests.py since it fails with an error that the compiler for the torch extension does not support C++17. One possible fix is to set the following compile argument inside the setup.py of the fused_adam extension: extra_compile_args['cxx'] = ['-std=c++17'] (see the setup.py sketch after this list). However, due to the urgency of unblocking the pipelines, we just disable the test for now.
  • Skip test_softmax_bf16_large. torch.cuda.is_bf16_supported() returns True on V100 for torch >= 2.2 (but False for torch <= 2.1), so the test started running in CI, but V100 does not support bf16 natively (see the probe after this list).
  • Fix a typo of "deterministic".
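
For reference, a minimal sketch of what such a temporary skip can look like, assuming the tests are pytest-based; the decorator placement and reason strings below are illustrative, not the exact change:

```python
import pytest

# Hypothetical sketch: the test names come from this PR, but the decorators and
# reason strings are illustrative rather than the actual diff.
@pytest.mark.skip(reason="Segfault on V100; suspected cuDNN frontend issue, re-enable once fallback logic lands")
def test_gradient_correctness_conv1d():
    ...

@pytest.mark.skip(reason="Assertion failure; possibly related to the torch version in the image")
def test_aten_attention():
    ...
```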
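
The setup.py change suggested for fused_adam could look roughly like the sketch below; the extension and source file names are placeholders, and only the extra_compile_args line reflects the suggestion in the description:

```python
# Hypothetical sketch of the suggested fused_adam setup.py change (not the actual file).
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="fused_adam",
    ext_modules=[
        CUDAExtension(
            name="fused_adam",
            sources=["fused_adam.cpp", "fused_adam_kernel.cu"],  # placeholder source names
            # The compile argument suggested above, so the host compiler builds with C++17:
            extra_compile_args={"cxx": ["-std=c++17"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```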
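
The bf16 behavior behind the last skip can be reproduced with a small probe like the one below; the compute-capability check is an assumption about how one could distinguish reported from native support, not part of this PR:

```python
import torch

# torch >= 2.2 reports bf16 as supported on V100, while torch <= 2.1 returned False,
# which is why test_softmax_bf16_large started running on this pipeline.
bf16_reported = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

# V100 is compute capability 7.0; native bf16 needs Ampere (8.0) or newer, so an
# extra capability check distinguishes "reported" from "native" support.
bf16_native = bf16_reported and torch.cuda.get_device_capability() >= (8, 0)

print(f"bf16 reported: {bf16_reported}, native on this GPU: {bf16_native}")
```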

### Motivation and Context

prathikr previously approved these changes on Aug 8, 2024
baijumeswani previously approved these changes on Aug 8, 2024
@tianleiwu force-pushed the tlwu/fix_migraphx_and_linux_training_ci_pipelines branch from 449c28a to ee65ccc on August 8, 2024 14:49
@jingyanwangms self-assigned this on Aug 8, 2024
@jingyanwangms self-requested a review on August 8, 2024 17:40
jingyanwangms previously approved these changes on Aug 8, 2024
@prathikr added the release:1.19.0 (Cherry pick to ORT 1.19) label on Aug 8, 2024
@tianleiwu requested a review from baijumeswani on August 8, 2024 22:57
@tianleiwu merged commit a46e49b into main on Aug 9, 2024
95 of 98 checks passed
@tianleiwu deleted the tlwu/fix_migraphx_and_linux_training_ci_pipelines branch on August 9, 2024 02:44
prathikr pushed a commit that referenced this pull request Aug 9, 2024
@prathikr added the cherry-picked (Cherry-picked for a cherrypicks branch) label on Aug 9, 2024
sumitsays pushed a commit that referenced this pull request Aug 9, 2024
sumitsays added a commit that referenced this pull request Aug 12, 2024
…1670)

### Description
This change cherry-picks 2 Pad fusion optimizations:
#21640 and #21556.

It also has to cherry-pick 2 extra changes to unblock pipeline and dependency failures:
#21300 and #21662 (without the tests, which are part of the 1.18.1 payload).

Also uploaded a new version of
[onnxruntime_build_dependencies:1.0.177](https://dev.azure.com/onnxruntime/onnxruntime/_artifacts/feed/onnxruntime/UPack/onnxruntime_build_dependencies/overview/1.0.177)
and updated the same in `download-deps.yml`.

Additionally, it updates the DML binary to 1.15.1.



### Motivation and Context

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Labels: cherry-picked (Cherry-picked for a cherrypicks branch), release:1.19.0 (Cherry pick to ORT 1.19)
5 participants