
Slow compilation of DINO ViT model using native_vector_size = 64 (cpu_features) #16993

Closed
ScottTodd opened this issue Apr 8, 2024 · 7 comments · Fixed by #17227
Assignees: hanhanW
Labels: bug 🐞 Something isn't working · codegen/llvm LLVM code generation compiler backend · performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@ScottTodd
Member

What happened?

Splitting off from nod-ai/SHARK-ModelDev#523 (comment): this model is very slow to compile (20 minutes to 1 hour) when using --iree-llvmcpu-target-cpu-features=host.

The specific features on the user machine are --iree-llvmcpu-target-cpu-features=+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-avx512er,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,+gfni,-avxvnniint16,-amx-fp16,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a

I've narrowed this down to the change from native_vector_size = 16 : i64 (fast) to native_vector_size = 64 : i64 (slow); setting the CPU features is what changes this value.

After the LLVMCPUVirtualVectorLowering pass, the MLIR module size grows to 200k+ ops: https://gist.github.com/ScottTodd/2aab78a54dd8b9399900102f8782d5e2
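A back-of-envelope sketch of why the module balloons (my own simplified model of the lowering, not the actual pass logic): virtual vector lowering unrolls each n-D vector op into 1-D native-width ops, so op count scales with the product of the vector dimensions divided by the lane count.

```python
# Hypothetical estimate of the unrolling blow-up; the shapes match the
# repro, but this is a simplification of what the pass actually does.
from math import prod

def unrolled_ops(vector_shape, lanes):
    """1-D ops produced when one op on `vector_shape` is unrolled into
    vectors of `lanes` elements along the innermost dimension."""
    return prod(vector_shape) // lanes

# One op on vector<16x16x16xi64> becomes 256 ops on vector<16xi64>, so a
# dispatch with on the order of a thousand such ops lands in the 200k+ range.
print(unrolled_ops((16, 16, 16), 16))  # -> 256
```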

Steps to reproduce your issue

Full model

  1. Download DINO ViT model through Turbine nod-ai/SHARK-ModelDev#523 (comment)
  2. Compile: iree-compile dino_model.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-avx512er,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,+gfni,-avxvnniint16,-amx-fp16,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a --iree-llvmcpu-enable-ukernels=all --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-flow-inline-constants-max-byte-length=1 --compile-mode=hal-executable -o /dev/null
  3. Observe slow compile

Minimal reproducer (bottleneck dispatch):

  1. Download dispatch_2_size16.mlir and dispatch_2_size64.mlir from https://gist.github.com/ScottTodd/607ca024de581ae3a5ba7265eabfd158. The two files differ only in native_vector_size (16 : i64 vs 64 : i64)
  2. Compile the dispatches: iree-compile --compile-mode=hal-executable dispatch_2_size64.mlir -o /dev/null
  3. Observe slow compile, compare with iree-compile --compile-mode=hal-executable dispatch_2_size16.mlir -o /dev/null

What component(s) does this issue relate to?

Compiler

Version information

Tip of tree from source, dcc8e19

Additional context

No response

@ScottTodd ScottTodd added bug 🐞 Something isn't working performance ⚡ Performance/optimization related work across the compiler and runtime codegen/llvm LLVM code generation compiler backend labels Apr 8, 2024
@ScottTodd
Member Author

Trace with native_vector_size = 16:
[trace screenshot]

Trace with native_vector_size = 64 (didn't let it finish):
[trace screenshot]

@ScottTodd
Member Author

cc @hanhanW, @MaheshRavishankar (based on git blame for LLVMCPUVirtualVectorLowering)

@hanhanW
Contributor

hanhanW commented Apr 8, 2024

It looks like a 16x16x16 vector shape is chosen in the configuration, which is not good. I can take a look when I get some cycles. What would the priority of this be?

%27 = vector.broadcast %26 : vector<16xi64> to vector<16x16x16xi64>
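For intuition on why that shape is problematic, a rough register-pressure estimate (my own arithmetic, not compiler output): a single live vector<16x16x16xi64> value is an order of magnitude larger than the entire AVX-512 register file.

```python
# Rough register-pressure sketch; the register-file figures are for
# AVX-512 (32 ZMM registers of 64 bytes each).
from math import prod

def vector_bytes(shape, elem_bytes=8):
    """Bytes held by a dense vector of `shape` (i64 elements = 8 bytes)."""
    return prod(shape) * elem_bytes

tile = vector_bytes((16, 16, 16))        # one live value: 32768 bytes
zmm_file = 32 * 64                       # whole ZMM register file: 2048 bytes
print(tile, zmm_file, tile // zmm_file)  # -> 32768 2048 16
```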

@ScottTodd
Member Author

Relatively low priority IMO, but I did want to debug far enough to see why compilation time (and likely runtime too) was falling off a cliff.

@hanhanW
Contributor

hanhanW commented Apr 8, 2024

> Relatively low priority IMO, but I did want to debug far enough to see why compilation time (and likely runtime too) was falling off a cliff.

I see, I'll try to get more details some time this week. It should not be a hard fix.

@hanhanW hanhanW self-assigned this Apr 8, 2024
@hanhanW
Contributor

hanhanW commented Apr 30, 2024

We have very bad tile size selection for element-wise ops. We used to get reasonable sizes because the ops were flattened. They are not flattened in this case because the output indexing map is not an identity permutation. I have a WIP PR that makes the compilation time much more reasonable: it takes 14 seconds to compile the whole model on my VM.
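A minimal sketch of the precondition being described (the predicate name is my own, not IREE's): an elementwise op can only be collapsed to 1-D when its output indexing map is the identity permutation, since only then does the output read and write memory in plain row-major order.

```python
# Hypothetical predicate modeling the flattening precondition; an
# indexing map is represented as the tuple of loop dims each result
# dimension reads, e.g. (d0, d1, d2) -> (0, 1, 2).
def is_identity_map(result_dims, rank):
    """True when the map is the identity permutation (d0, d1, ..., d{n-1})."""
    return tuple(result_dims) == tuple(range(rank))

print(is_identity_map((0, 1, 2), 3))  # identity output map: flattenable
print(is_identity_map((0, 2, 1), 3))  # permuted output map: not flattenable
```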

I'm working on updating tests, and will send it out for review soon. #17227

@hanhanW
Contributor

hanhanW commented May 1, 2024

The other issue I noticed is that the output indexing_map is not an identity. I talked to @MaheshRavishankar, and I will prepare the fix.

hanhanW added a commit that referenced this issue May 2, 2024
The revision also deprecates an outdated lit test that is impacted by it, and adds another lit test simplified from #16993.

Fixes #16993
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024