Metal (iOS): Compute function exceeds available temporary registers #7261
Yes, for head size = 256 the Metal kernels are very slow. I suspected it has something to do with running out of registers, and this error confirms it. Btw, how do you make the error show up? It never does on my Mac.
I have an Intel Mac, so I can't check there. I get the error when running on an iPhone 12 Pro. It does not depend on the model, because the error occurs at the resource-allocation stage.
I encounter the same issue when I try to deploy llama.cpp on an iPhone 14. The issue can be bypassed by commenting out all flash_attn-related kernels in

I'd like to say the issue may be problematic, since it's hard to catch on CI. The code passes compilation and only throws the error at runtime. Running on the simulator does not help, since Apple explicitly says the simulator does not match real hardware in GPU capability. In fact, running llama.cpp on the iPhone simulator throws a different error saying "more than 14 constant buffers is not supported". It seems the only way to expose the bug is to run on a real iPhone.
The issue does not exist on my M2 Mac mini either. It's specific to iPhone, not Mac.
I've disabled the HS=256 kernel from the build
I confirm the newest master (commit

- llama.cpp b2864
- iPhone 12 Pro Max
With

```c
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_FLASH_ATTN_EXT_F16_H256, flash_attn_ext_f16_h256, ctx->support_simdgroup_mm);
```

I get the error. With

```c
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_FLASH_ATTN_EXT_F16_H256, flash_attn_ext_f16_h256, false);
```

it works fine.