
[compiler] FLUX.1 transformer compilation for gfx942 hangs with pad-to-intrinsics #19249

Open
monorimet opened this issue Nov 21, 2024 · 3 comments
Labels
bug 🐞 Something isn't working

Comments

@monorimet
Collaborator

monorimet commented Nov 21, 2024

What happened?

The IREE compiler hangs when I use iree-preprocessing-pad-to-intrinsics on this IR:

```
iree-compile --iree-hal-target-device=amdgpu --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-execution-model=async-external --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-global-opt-propagate-transposes=1 --iree-opt-const-eval=0 --iree-opt-outer-dim-concat=1 --iree-opt-aggressively-propagate-transposes=1 --iree-dispatch-creation-enable-aggressive-fusion --iree-hal-force-indirect-command-buffers --iree-codegen-llvmgpu-use-vector-distribution=1 --iree-llvmgpu-enable-prefetch=1 --iree-codegen-gpu-native-math-precision=1 --iree-hip-legacy-sync=0 --iree-opt-data-tiling=0 --iree-vm-target-truncate-unsupported-floats --iree-dispatch-creation-enable-fuse-horizontal-contractions=1 flux_1_dev.torch_onnx.mlir -o flux-dev_transformer_bs1_512_1024x1024_fp32_amdgpu-gfx942.vmfb
```

The above hangs for longer than 30 minutes. If I take out iree-preprocessing-pad-to-intrinsics, i.e., use:

```
iree-compile --iree-hal-target-device=amdgpu --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-execution-model=async-external --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-global-opt-propagate-transposes=1 --iree-opt-const-eval=0 --iree-opt-outer-dim-concat=1 --iree-opt-aggressively-propagate-transposes=1 --iree-dispatch-creation-enable-aggressive-fusion --iree-hal-force-indirect-command-buffers --iree-codegen-llvmgpu-use-vector-distribution=1 --iree-llvmgpu-enable-prefetch=1 --iree-codegen-gpu-native-math-precision=1 --iree-hip-legacy-sync=0 --iree-opt-data-tiling=0 --iree-vm-target-truncate-unsupported-floats --iree-dispatch-creation-enable-fuse-horizontal-contractions=1 flux_1_dev.torch_onnx.mlir -o flux-dev_transformer_bs1_512_1024x1024_fp32_amdgpu-gfx942.vmfb
```

it compiles in under 30 seconds.

I'm not sure if this pass is still required for matching amdgpu intrinsics.
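One way to localize the hang is to stop compilation at the end of the preprocessing phase. This is only a sketch: it assumes the --compile-to phase names of this IREE version, and reuses the same target/preprocessing flags as the repro command.

```
# Sketch: stop right after preprocessing to localize the hang.
# If this finishes quickly, pad-to-intrinsics itself completes and a later
# phase is choking on the padded IR; if it hangs here, the pass is at fault.
iree-compile --compile-to=preprocessing \
  <same target and preprocessing flags as the repro command> \
  flux_1_dev.torch_onnx.mlir -o preprocessed.mlir
```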

Steps to reproduce your issue

  1. Install the IREE compiler.
  2. wget https://gist.githubusercontent.com/zjgarvey/91c733825018b077565f668e6bda96d8/raw/de72daddf6ab27b7b07ea5e71ca4fc11504edcc8/flux_1_dev.torch_onnx.mlir
  3. Run:

```
iree-compile --iree-hal-target-device=amdgpu --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --iree-hip-target=gfx942 --iree-execution-model=async-external --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-global-opt-propagate-transposes=1 --iree-opt-const-eval=0 --iree-opt-outer-dim-concat=1 --iree-opt-aggressively-propagate-transposes=1 --iree-dispatch-creation-enable-aggressive-fusion --iree-hal-force-indirect-command-buffers --iree-codegen-llvmgpu-use-vector-distribution=1 --iree-llvmgpu-enable-prefetch=1 --iree-codegen-gpu-native-math-precision=1 --iree-hip-legacy-sync=0 --iree-opt-data-tiling=0 --iree-vm-target-truncate-unsupported-floats --iree-dispatch-creation-enable-fuse-horizontal-contractions=1 flux_1_dev.torch_onnx.mlir -o flux-dev_transformer_bs1_512_1024x1024_fp32_amdgpu-gfx942.vmfb
```

What component(s) does this issue relate to?

Compiler

Version information

IREE compiler version 3.0.0rc20241118 @ 29c451b

Additional context

The MLIR is an ONNX export with its parameters externalized; the model precision is fp32.

@monorimet monorimet added the bug 🐞 Something isn't working label Nov 21, 2024
@monorimet
Collaborator Author

It just occurred to me that this is probably because the model was exported with several dynamic input dims. Could this be why padding to intrinsics is getting stuck?
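A quick way to confirm the module really carries dynamic shapes (a rough sketch; the grep pattern just counts lines containing a tensor type with a "?" dimension, which is an imperfect proxy):

```
# Count lines mentioning a tensor type with at least one dynamic ("?") dim.
# A nonzero count confirms the export kept dynamic shapes.
grep -c 'tensor<[^>]*?' flux_1_dev.torch_onnx.mlir
```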

@IanWood1
Contributor

I'm not sure why iree-preprocessing-pad-to-intrinsics changes anything, but OptimizeIntArithmetic seems to be the problem. It's spending a ton of time making calls to solver.eraseState(). I previously tried to fix it with #19130.
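For anyone reproducing this, MLIR's IR-dump instrumentation can name the pass the compiler is stuck in. A sketch: --mlir-print-ir-before-all is the standard MLIR pass-manager flag, and it produces a very large log for a model of this size.

```
# Dump IR before every pass; the last "IR Dump Before <pass>" banner written
# to the log names the pass that never returned.
iree-compile --mlir-print-ir-before-all \
  <same flags as the repro command> \
  flux_1_dev.torch_onnx.mlir -o /dev/null 2> pass_trace.log
grep 'IR Dump Before' pass_trace.log | tail -n 1
```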

@monorimet
Collaborator Author

I also encountered this issue compiling https://sharkpublic.blob.core.windows.net/sharkpublic/flux.1/mlir/flux_dev_t5xxl_bf16.mlir with the pad-to-intrinsics preprocessing pass. Like the original IR in this issue, it is exported with dynamic batch and sequence-length dims.
