Slow compilation of DINO ViT model using native_vector_size = 64 (cpu_features) #16993
Comments
cc @hanhanW, @MaheshRavishankar (based on blame layer for
It looks like a %27 = vector.broadcast %26 : vector<16xi64> to vector<16x16x16xi64>
Relatively low priority IMO, but I did want to debug far enough to see why compilation time (and likely runtime too) was falling off a cliff.
I see, I'll try to get more details some time this week. It should not be a hard fix.
We have very bad tile size selection for element-wise ops. We used to have reasonable sizes because they were flattened. It is not flattened in this case because the output indexing map is not an identity permutation. I have a WIP PR that makes the compilation time much more reasonable: it takes 14 seconds to compile the whole model on my VM. I'm working on updating tests and will send it out for review soon. #17227
The other issue that I noticed is that the output indexing_map is not identity. I talked to @MaheshRavishankar, and I will prepare the fix.
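For a rough sense of scale, here is a back-of-envelope sketch (not IREE's actual unrolling logic) that counts how many native-width vectors a single tile value spans. It uses the vector<16x16x16xi64> type from the broadcast quoted earlier in this thread; the function name and the assumption that every value is split evenly into native-width pieces are mine.

```python
from math import prod


def native_vector_count(shape, element_bytes, native_vector_bytes):
    """Estimate how many native-width vectors are needed to hold one value.

    Assumption: the value is split evenly into pieces of
    native_vector_bytes each; real lowering may differ.
    """
    lanes = native_vector_bytes // element_bytes  # elements per native vector
    return -(-prod(shape) // lanes)  # ceiling division


# vector<16x16x16xi64> with 64-byte native vectors: 4096 elements, 8 lanes each.
print(native_vector_count((16, 16, 16), 8, 64))  # 512
# vector<16xi64> with the same native size.
print(native_vector_count((16,), 8, 64))  # 2
```

Every element-wise op on a tile like this would be repeated once per native vector under that assumption, which is in line with the op count blowing up when the tile is not flattened.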
The revision also deprecates an outdated lit test that is impacted by it. It adds the other lit test, simplified from iree-org#16993.
Fixes iree-org#16993
Signed-off-by: Lubo Litchev <lubol@google.com>
What happened?
Splitting off from nod-ai/SHARK-ModelDev#523 (comment), this model is very slow to compile (20 minutes - 1 hour) when using --iree-llvmcpu-target-cpu-features=host. The specific features on the user machine are
--iree-llvmcpu-target-cpu-features=+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-avx512er,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,+gfni,-avxvnniint16,-amx-fp16,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a
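As a quick way to read a feature list like this, the sketch below parses the +/- entries and reports the widest SIMD register width they imply: 64 bytes for AVX-512, 32 for AVX/AVX2, 16 otherwise. That this width is what ends up as native_vector_size is my assumption, not something confirmed here, and both helper names are hypothetical.

```python
def enabled_features(flag_value):
    """Return the set of features prefixed with '+' in a comma-separated list."""
    entries = (entry.strip() for entry in flag_value.split(","))
    return {entry[1:] for entry in entries if entry.startswith("+")}


def widest_simd_bytes(features):
    """Widest x86 SIMD register width, in bytes, implied by the features.

    Assumption: this is used only as a stand-in for native_vector_size.
    """
    if "avx512f" in features:
        return 64  # 512-bit zmm registers
    if "avx" in features or "avx2" in features:
        return 32  # 256-bit ymm registers
    return 16  # 128-bit xmm registers


# A short excerpt of the list above, not the full set.
excerpt = "+avx,+avx2,+avx512f,-avx512fp16,+sse4.2"
print(widest_simd_bytes(enabled_features(excerpt)))  # 64
```

Dropping +avx512f from the excerpt gives 32, and dropping the AVX entries as well gives 16, which matches the fast value reported below.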
I've narrowed this down to just native_vector_size = 16 : i64 (fast) -> native_vector_size = 64 : i64 (slow), which setting the CPU features affects.
After the LLVMCPUVirtualVectorLowering pass, the MLIR module size grows to 200k+ ops: https://gist.github.com/ScottTodd/2aab78a54dd8b9399900102f8782d5e2
Steps to reproduce your issue
Full model
iree-compile dino_model.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-avx512er,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,+gfni,-avxvnniint16,-amx-fp16,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a --iree-llvmcpu-enable-ukernels=all --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-flow-inline-constants-max-byte-length=1 --compile-mode=hal-executable -o /dev/null
Minimal reproducer (bottleneck dispatch): dispatch_2_size16.mlir and dispatch_2_size64.mlir from https://gist.github.com/ScottTodd/607ca024de581ae3a5ba7265eabfd158. These only differ in native_vector_size = 16 : i64 -> native_vector_size = 64 : i64
iree-compile --compile-mode=hal-executable dispatch_2_size64.mlir -o /dev/null
iree-compile --compile-mode=hal-executable dispatch_2_size16.mlir -o /dev/null
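To compare the two side by side, a small timing wrapper along these lines may help. It is only a sketch: it assumes iree-compile is on PATH and that both .mlir files from the gist are in the working directory. The helper names are hypothetical; the command line is the one shown above.

```python
import subprocess
import time


def build_command(mlir_path):
    """Build the iree-compile invocation used for the minimal reproducer."""
    return ["iree-compile", "--compile-mode=hal-executable", mlir_path, "-o", "/dev/null"]


def time_compile(mlir_path, run=subprocess.run):
    """Run one compile and return the wall-clock time in seconds.

    `run` is injectable so the function can be exercised without
    iree-compile installed.
    """
    start = time.perf_counter()
    run(build_command(mlir_path), check=True)
    return time.perf_counter() - start


def compare(paths=("dispatch_2_size16.mlir", "dispatch_2_size64.mlir")):
    """Print the compile time of each reproducer file."""
    for path in paths:
        print(f"{path}: {time_compile(path):.1f} s")
```

Calling compare() runs both compiles in turn. The size64 case may take a long time, so a timeout passed through to subprocess.run could be worth adding.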
What component(s) does this issue relate to?
Compiler
Version information
Tip of tree from source, dcc8e19
Additional context
No response