
Slow compilation of DINO ViT model using native_vector_size = 64 (cpu_features) #16993

Closed
ScottTodd opened this issue Apr 8, 2024 · 7 comments · Fixed by #17227
Assignees: hanhanW
Labels: bug 🐞 Something isn't working · codegen/llvm LLVM code generation compiler backend · performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@ScottTodd
Member

What happened?

Splitting off from nod-ai/SHARK-ModelDev#523 (comment): this model is very slow to compile (20 minutes to 1 hour) when using --iree-llvmcpu-target-cpu-features=host.

The specific features on the user machine are --iree-llvmcpu-target-cpu-features=+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-avx512er,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,+gfni,-avxvnniint16,-amx-fp16,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a

I've narrowed this down to the change from native_vector_size = 16 : i64 (fast) to native_vector_size = 64 : i64 (slow); setting the CPU features is what changes this value.

After the LLVMCPUVirtualVectorLowering pass, the MLIR module size grows to 200k+ ops: https://gist.github.com/ScottTodd/2aab78a54dd8b9399900102f8782d5e2
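A back-of-envelope sketch of why the module balloons (my own simplified model of the lowering, not the actual pass logic): virtual vector lowering unrolls each n-D vector op into 1-D native-width ops, so op count scales with the product of the vector dimensions divided by the lane count.

```python
# Hypothetical estimate of the unrolling blow-up; the shapes match the
# repro, but this is a simplification of what the pass actually does.
from math import prod

def unrolled_ops(vector_shape, lanes):
    """1-D ops produced when one op on `vector_shape` is unrolled into
    vectors of `lanes` elements along the innermost dimension."""
    return prod(vector_shape) // lanes

# One op on vector<16x16x16xi64> becomes 256 ops on vector<16xi64>, so a
# dispatch with on the order of a thousand such ops lands in the 200k+ range.
print(unrolled_ops((16, 16, 16), 16))  # -> 256
```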

Steps to reproduce your issue

Full model

  1. Download DINO ViT model through Turbine nod-ai/SHARK-ModelDev#523 (comment)
  2. Compile: iree-compile dino_model.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-avx512er,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,+gfni,-avxvnniint16,-amx-fp16,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a --iree-llvmcpu-enable-ukernels=all --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-flow-inline-constants-max-byte-length=1 --compile-mode=hal-executable -o /dev/null
  3. Observe slow compile

Minimal reproducer (bottleneck dispatch):

  1. Download dispatch_2_size16.mlir and dispatch_2_size64.mlir from https://gist.github.com/ScottTodd/607ca024de581ae3a5ba7265eabfd158. The two files differ only in native_vector_size (16 : i64 vs 64 : i64)
  2. Compile the dispatches: iree-compile --compile-mode=hal-executable dispatch_2_size64.mlir -o /dev/null
  3. Observe slow compile, compare with iree-compile --compile-mode=hal-executable dispatch_2_size16.mlir -o /dev/null

What component(s) does this issue relate to?

Compiler

Version information

Tip of tree from source, dcc8e19

Additional context

No response

@ScottTodd ScottTodd added bug 🐞 Something isn't working performance ⚡ Performance/optimization related work across the compiler and runtime codegen/llvm LLVM code generation compiler backend labels Apr 8, 2024
@ScottTodd
Member Author

Trace with native_vector_size = 16:
[trace screenshot]

Trace with native_vector_size = 64 (didn't let it finish):
[trace screenshot]

@ScottTodd
Member Author

cc @hanhanW, @MaheshRavishankar (based on git blame for LLVMCPUVirtualVectorLowering)

@hanhanW
Contributor

hanhanW commented Apr 8, 2024

It looks like a 16x16x16 vector shape is chosen in the configuration, which is not good. I can take a look when I get some cycles. What would the priority of this be?

%27 = vector.broadcast %26 : vector<16xi64> to vector<16x16x16xi64>
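For intuition on why that shape is problematic, a rough register-pressure estimate (my own arithmetic, not compiler output): a single live vector<16x16x16xi64> value is an order of magnitude larger than the entire AVX-512 register file.

```python
# Rough register-pressure sketch; the register-file figures are for
# AVX-512 (32 ZMM registers of 64 bytes each).
from math import prod

def vector_bytes(shape, elem_bytes=8):
    """Bytes held by a dense vector of `shape` (i64 elements = 8 bytes)."""
    return prod(shape) * elem_bytes

tile = vector_bytes((16, 16, 16))        # one live value: 32768 bytes
zmm_file = 32 * 64                       # whole ZMM register file: 2048 bytes
print(tile, zmm_file, tile // zmm_file)  # -> 32768 2048 16
```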

@ScottTodd
Member Author

Relatively low priority IMO, but I did want to debug far enough to see why compilation time (and likely runtime too) was falling off a cliff.

@hanhanW
Contributor

hanhanW commented Apr 8, 2024

> Relatively low priority IMO, but I did want to debug far enough to see why compilation time (and likely runtime too) was falling off a cliff.

I see, I'll try to get more details some time this week. It should not be a hard fix.

@hanhanW hanhanW self-assigned this Apr 8, 2024
@hanhanW
Contributor

hanhanW commented Apr 30, 2024

We have very bad tile size selection for element-wise ops. We used to get reasonable sizes because the ops were flattened. They are not flattened in this case because the output indexing map is not an identity permutation. I have a WIP PR that makes the compilation time much more reasonable: it takes 14 seconds to compile the whole model on my VM.
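A minimal sketch of the precondition being described (the predicate name is my own, not IREE's): an elementwise op can only be collapsed to 1-D when its output indexing map is the identity permutation, since only then does the output read and write memory in plain row-major order.

```python
# Hypothetical predicate modeling the flattening precondition; an
# indexing map is represented as the tuple of loop dims each result
# dimension reads, e.g. (d0, d1, d2) -> (0, 1, 2).
def is_identity_map(result_dims, rank):
    """True when the map is the identity permutation (d0, d1, ..., d{n-1})."""
    return tuple(result_dims) == tuple(range(rank))

print(is_identity_map((0, 1, 2), 3))  # identity output map: flattenable
print(is_identity_map((0, 2, 1), 3))  # permuted output map: not flattenable
```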

I'm working on updating tests, and will send it out for review soon. #17227

@hanhanW
Contributor

hanhanW commented May 1, 2024

The other issue I noticed is that the output indexing_map is not an identity. I talked to @MaheshRavishankar, and I will prepare the fix.

hanhanW added a commit that referenced this issue May 2, 2024
The revision also deprecates an outdated lit test that is impacted by it, and adds another lit test simplified from #16993.

Fixes #16993
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024