[WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions".
#22987
Comments
This also happens for https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct and https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct, which are easier models to test with.
Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions".
looking
Thanks! I'd recommend testing with https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct (or other models with GQA and num_key_value_heads != num_attention_heads).
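For anyone looking to reproduce this, a minimal sketch using the Transformers.js v3 pipeline API (the prompt and generation parameters are illustrative assumptions, not from this thread):

```js
import { pipeline } from "@huggingface/transformers";

// Any model with num_key_value_heads != num_attention_heads exercises the GQA path.
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-135M-Instruct",
  { device: "webgpu" } // runs fine on the default (WASM/CPU) backend; WebGPU triggers the error
);

const messages = [{ role: "user", content: "Write a haiku about the sea." }];
const output = await generator(messages, { max_new_tokens: 64 });
console.log(output[0].generated_text);
```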
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Bump since this is pretty important
For me, https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct does not crash and instead produces garbage.
Just to be sure, I tried MHA:
With MHA, if I do something like summarize, the model does a little better initially but then doesn't know how to stop and repeats the same sentence over and over. looking
Thanks @guschmue! Let me know if there's anything I can do to assist in debugging 🫡
using transformers.js-examples/smollm-webgpu with SmolLM2-135M-Instruct:
That's the same behaviour as I was seeing too 👍 Model output is fine in Node.js, so the garbage output is not due to severe quantisation.
To summarize, for SmolVLM-Instruct: the sample code above has: but the dump of the inputs shows Uint16 (aka float16) for past_kv ... the q4 model would use float32 for past_kv. I tried a little sample app that feeds the inputs shown above directly to onnxruntime (tried both q4 and q4fp16):
and the model is happy; I don't see errors (using onnxruntime/main).
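For anyone wanting to replicate that harness, a rough sketch with onnxruntime-web is below. The file name, layer count, head count, and head_dim are illustrative assumptions (the values follow SmolLM2-135M's published config, and input names follow the transformers.js ONNX export convention); the key point is that the past_key_values dtype has to match the model variant: float32 for q4, float16 for q4f16.

```js
import * as ort from "onnxruntime-web/webgpu";

const session = await ort.InferenceSession.create("model_q4.onnx", {
  executionProviders: ["webgpu"],
});

// Prompt of 3 tokens; the token ids here are placeholders.
const feeds = {
  input_ids: new ort.Tensor("int64", BigInt64Array.from([1n, 2n, 3n]), [1, 3]),
  attention_mask: new ort.Tensor("int64", BigInt64Array.from([1n, 1n, 1n]), [1, 3]),
  position_ids: new ort.Tensor("int64", BigInt64Array.from([0n, 1n, 2n]), [1, 3]),
};

// Zero-length KV cache for the first generation step (the "zero-dimension
// tensor" mentioned in the issue). Assumes 30 layers, 3 KV heads, head_dim 64.
for (let i = 0; i < 30; ++i) {
  feeds[`past_key_values.${i}.key`] = new ort.Tensor("float32", new Float32Array(0), [1, 3, 0, 64]);
  feeds[`past_key_values.${i}.value`] = new ort.Tensor("float32", new Float32Array(0), [1, 3, 0, 64]);
}

const results = await session.run(feeds);
console.log(Object.keys(results));
```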
Both SmolLM2-135M-Instruct and SmolLM2-360M-Instruct fail because they set the do_rotary attribute on GQA nodes. This attribute is not currently implemented.
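If you want to check whether a model is affected, one option is to scan the graph for GQA nodes that set do_rotary. A rough sketch in Node.js, assuming the onnx-proto npm package for decoding the model protobuf (the package choice and field handling are my assumptions):

```js
import { readFileSync } from "node:fs";
import { onnx } from "onnx-proto";

const model = onnx.ModelProto.decode(new Uint8Array(readFileSync("model.onnx")));
for (const node of model.graph.node) {
  if (node.opType !== "GroupQueryAttention") continue;
  const doRotary = node.attribute.find((a) => a.name === "do_rotary");
  // do_rotary is an int attribute; any non-zero value means it is set.
  if (doRotary && String(doRotary.i) !== "0") {
    console.log(`GQA node "${node.name}" sets do_rotary`);
  }
}
```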
… attribute. (#23287) Description: Added a fatal error message for the unsupported GroupQueryAttention do_rotary attribute. Motivation and Context: #22987. Helps users understand that this attribute is not supported.
we'll add support for do_rotary in the near future
Not sure if this is the same error, but when using https://huggingface.co/onnx-community/Qwen2.5-Coder-0.5B-ONNX I'm running into this:
(happy to open a new issue if this is not related; I just figured there is a relation because of the comment from @guschmue)
Indeed, that error message was added with #23287.
The following PR adds support for the do_rotary attribute on GQA.
For now we recommend using plain GQA, because we plan to switch to a new WebGPU EP that supports things like FlashAttention-2 and is going to be an order of magnitude faster.
that sounds super amazing @guschmue, thx!
@guschmue Have the correctness errors for GQA been resolved too?
@xenova is there an open issue about the correctness errors for GQA? If there are steps to reproduce, I can verify.
Above in this thread :) ^ #22987 (comment)
I am running into the issue reported here.
When I use Chrome instead of Chrome Canary to run SmolLM2-135M-Instruct, I get the following output printed repeatedly, because float16 is supported by default in Chrome.
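As a sanity check, you can query whether the browser actually exposes float16 in shaders via the standard WebGPU feature flag:

```js
const adapter = await navigator.gpu.requestAdapter();
if (adapter?.features.has("shader-f16")) {
  console.log("shader-f16 available: fp16 model variants can run on WebGPU");
} else {
  console.log("no shader-f16: prefer a float32 (e.g. q4) variant");
}
```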
Describe the issue
The following error occurs when trying to run https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct on WebGPU.
Note that the CPU implementation operates correctly, so this is indeed a bug with the WebGPU EP. Moreover, the zero-dimension tensor is by design, and is used for the first generation step.
To reproduce
Urgency
This blocks SmolVLM usage in Transformers.js.
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.20.1
Execution Provider
'webgpu' (WebGPU)