
[compiler] FLUX.1 transformer compilation for gfx942 hangs with pad-to-intrinsics #19249

Open
monorimet opened this issue Nov 21, 2024 · 3 comments
Labels
bug 🐞 Something isn't working

Comments

@monorimet
Collaborator

monorimet commented Nov 21, 2024

What happened?

The IREE compiler hangs when I use iree-preprocessing-pad-to-intrinsics on this IR:

```
iree-compile --iree-hal-target-device=amdgpu --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-execution-model=async-external --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-global-opt-propagate-transposes=1 --iree-opt-const-eval=0 --iree-opt-outer-dim-concat=1 --iree-opt-aggressively-propagate-transposes=1 --iree-dispatch-creation-enable-aggressive-fusion --iree-hal-force-indirect-command-buffers --iree-codegen-llvmgpu-use-vector-distribution=1 --iree-llvmgpu-enable-prefetch=1 --iree-codegen-gpu-native-math-precision=1 --iree-hip-legacy-sync=0 --iree-opt-data-tiling=0 --iree-vm-target-truncate-unsupported-floats --iree-dispatch-creation-enable-fuse-horizontal-contractions=1 flux_1_dev.torch_onnx.mlir -o flux-dev_transformer_bs1_512_1024x1024_fp32_amdgpu-gfx942.vmfb
```

The above hangs for longer than 30 minutes. If I take out iree-preprocessing-pad-to-intrinsics, i.e., use:

```
iree-compile --iree-hal-target-device=amdgpu --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-execution-model=async-external --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-global-opt-propagate-transposes=1 --iree-opt-const-eval=0 --iree-opt-outer-dim-concat=1 --iree-opt-aggressively-propagate-transposes=1 --iree-dispatch-creation-enable-aggressive-fusion --iree-hal-force-indirect-command-buffers --iree-codegen-llvmgpu-use-vector-distribution=1 --iree-llvmgpu-enable-prefetch=1 --iree-codegen-gpu-native-math-precision=1 --iree-hip-legacy-sync=0 --iree-opt-data-tiling=0 --iree-vm-target-truncate-unsupported-floats --iree-dispatch-creation-enable-fuse-horizontal-contractions=1 flux_1_dev.torch_onnx.mlir -o flux-dev_transformer_bs1_512_1024x1024_fp32_amdgpu-gfx942.vmfb
```

it compiles in under 30 seconds.

I'm not sure if this pass is still required for matching amdgpu intrinsics.
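One way to localize the hang is to stop compilation at the end of the preprocessing phase. This is only a sketch: it assumes the --compile-to phase names of this IREE version, and reuses the same target/preprocessing flags as the repro command.

```
# Sketch: stop right after preprocessing to localize the hang.
# If this finishes quickly, pad-to-intrinsics itself completes and a later
# phase is choking on the padded IR; if it hangs here, the pass is at fault.
iree-compile --compile-to=preprocessing \
  <same target and preprocessing flags as the repro command> \
  flux_1_dev.torch_onnx.mlir -o preprocessed.mlir
```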

Steps to reproduce your issue

  1. Install the IREE compiler.
  2. wget https://gist.githubusercontent.com/zjgarvey/91c733825018b077565f668e6bda96d8/raw/de72daddf6ab27b7b07ea5e71ca4fc11504edcc8/flux_1_dev.torch_onnx.mlir
  3. Run:

```
iree-compile --iree-hal-target-device=amdgpu --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-execution-model=async-external --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-global-opt-propagate-transposes=1 --iree-opt-const-eval=0 --iree-opt-outer-dim-concat=1 --iree-opt-aggressively-propagate-transposes=1 --iree-dispatch-creation-enable-aggressive-fusion --iree-hal-force-indirect-command-buffers --iree-codegen-llvmgpu-use-vector-distribution=1 --iree-llvmgpu-enable-prefetch=1 --iree-codegen-gpu-native-math-precision=1 --iree-hip-legacy-sync=0 --iree-opt-data-tiling=0 --iree-vm-target-truncate-unsupported-floats --iree-dispatch-creation-enable-fuse-horizontal-contractions=1 flux_1_dev.torch_onnx.mlir -o flux-dev_transformer_bs1_512_1024x1024_fp32_amdgpu-gfx942.vmfb
```

What component(s) does this issue relate to?

Compiler

Version information

IREE compiler version 3.0.0rc20241118 @ 29c451b

Additional context

The MLIR is an ONNX export with its parameters externalized; the model precision is fp32.

@monorimet monorimet added the bug 🐞 Something isn't working label Nov 21, 2024
@monorimet
Collaborator Author

It just occurred to me that this is probably because the model was exported with several dynamic input dims. Could this be why padding to intrinsics is getting stuck?
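A quick way to confirm the module really carries dynamic shapes (a rough sketch; the grep pattern just counts lines containing a tensor type with a "?" dimension, which is an imperfect proxy):

```
# Count lines mentioning a tensor type with at least one dynamic ("?") dim.
# A nonzero count confirms the export kept dynamic shapes.
grep -c 'tensor<[^>]*?' flux_1_dev.torch_onnx.mlir
```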

@IanWood1
Contributor

I'm not sure why iree-preprocessing-pad-to-intrinsics changes anything, but OptimizeIntArithmetic seems to be the problem. It's spending a ton of time making calls to solver.eraseState(). I previously tried to fix it with #19130.
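For anyone reproducing this, MLIR's IR-dump instrumentation can name the pass the compiler is stuck in. A sketch: --mlir-print-ir-before-all is the standard MLIR pass-manager flag, and it produces a very large log for a model of this size.

```
# Dump IR before every pass; the last "IR Dump Before <pass>" banner written
# to the log names the pass that never returned.
iree-compile --mlir-print-ir-before-all \
  <same flags as the repro command> \
  flux_1_dev.torch_onnx.mlir -o /dev/null 2> pass_trace.log
grep 'IR Dump Before' pass_trace.log | tail -n 1
```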

@monorimet
Collaborator Author

I also encountered this issue compiling https://sharkpublic.blob.core.windows.net/sharkpublic/flux.1/mlir/flux_dev_t5xxl_bf16.mlir with the pad-to-intrinsics preprocessing pass. Like the original IR in this issue, it is exported with dynamic batch and sequence-length dims.
