
[WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions". #22987

xenova opened this issue Dec 3, 2024 · 22 comments

@xenova
Contributor

xenova commented Dec 3, 2024

Describe the issue

The following error occurs when trying to run https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct on WebGPU.

[Screenshot: the GroupQueryAttention error shown in the browser console]

Note that the CPU implementation works correctly, so this is indeed a bug in the WebGPU EP. Moreover, the zero-sized dimension in the past key/value tensors is by design and is used for the first generation step.

To reproduce

  1. Install and build Transformers.js from source (https://github.com/huggingface/transformers.js)
  2. Run the following code in-browser:
import {
  AutoProcessor,
  AutoModelForVision2Seq,
  load_image,
} from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  },
  device: 'webgpu',
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'

Urgency

This blocks SmolVLM usage in Transformers.js.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.1

Execution Provider

'webgpu' (WebGPU)

@xenova xenova added the platform:web issues related to ONNX Runtime web; typically submitted using template label Dec 3, 2024
@github-actions github-actions bot added ep:WebGPU ort-web webgpu provider model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. labels Dec 3, 2024
@xenova
Contributor Author

xenova commented Dec 3, 2024

This also happens for https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct and https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct, which are easier models to test with.
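
For anyone picking this up, a minimal text-only repro along these lines should hit the same GQA path (untested sketch; the dtype/device options and the prompt are assumptions, not from the original report):

import { pipeline } from "@huggingface/transformers";

// Sketch: plain text-generation pipeline on WebGPU with a GQA model.
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-135M-Instruct",
  { device: "webgpu", dtype: "q4" }, // assumed options
);

const messages = [{ role: "user", content: "Tell me about Constantinople." }];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text);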

@xenova xenova changed the title [WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions". [WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions". Dec 3, 2024
@guschmue
Contributor

guschmue commented Dec 5, 2024

looking

@xenova
Contributor Author

xenova commented Dec 5, 2024

Thanks! I'd recommend testing with https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct (or other models with GQA and num_key_value_heads != num_attention_heads).
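
As a quick way to spot such models, comparing the two head counts in the model's config.json is enough (sketch; fetches the config straight from the Hub):

// A model uses GQA when num_key_value_heads is smaller than num_attention_heads.
const url = "https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/resolve/main/config.json";
const config = await (await fetch(url)).json();
console.log(config.num_attention_heads, config.num_key_value_heads);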

Contributor

github-actions bot commented Jan 5, 2025

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Jan 5, 2025
@xenova
Contributor Author

xenova commented Jan 5, 2025

Bump since this is pretty important

@github-actions github-actions bot removed the stale issues that have not been addressed in a while; categorized by a bot label Jan 6, 2025
@guschmue
Contributor

guschmue commented Jan 6, 2025

For me, https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct does not crash and instead produces garbage.
llama3.2-1b with GQA is working fine.
I created my own GQA model; the output is:

user: Tell me about Constantinople.
assistant: You are a helpful AI assistant. You are trained by a user named Hugging.

Just to be sure I tried MHA:

system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Tell me about Constantinople.
assistant
The city of Constantinople, the greatest metropolis in the world. It was the capital of the Byzantine Empire, the largest empire in history, and the largest city in the world.
As a historian, I can tell you that the city was the center of the Byzantine Empire, the largest empire in history, and the largest city in the world. The city was the seat of the emperor, the capital, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest...

With MHA, if I ask for something like a summary, the model does a little better initially but then doesn't know how to stop and repeats the same sentence over and over.

looking

@xenova
Contributor Author

xenova commented Jan 6, 2025

Thanks @guschmue! Let me know if there's anything I can do to assist in debugging 🫡

@guschmue
Contributor

guschmue commented Jan 7, 2025

using transformers.js-examples/smollm-webgpu with SmolLM2-135M-Instruct:
my MHA model is fine
my GQA model does not crash but has wrong output or hangs depending on the query

@xenova
Contributor Author

xenova commented Jan 7, 2025

That's the same behaviour I was seeing too 👍 Model output is fine in Node.js, so the garbage output is not due to severe quantisation.

@guschmue
Contributor

guschmue commented Jan 7, 2025

To summarize:
SmolLM2-135M-Instruct with GQA does have a correctness issue; MHA is fine.

For SmolVLM-Instruct:
it seems to be a different issue, but we cannot reproduce it.
We assume the models were generated with the model builder, e.g.:
q4: -e cpu
q4f16: -e cuda
In theory that should work; both would use packedQKV. I'm a little worried about this one because I've never used it myself, though we have unit tests for it and they pass. For comparison, -e web would generate a model with MHA, and -e dml would generate one with GQA but without packedQKV.

The sample code above has:
decoder_model_merged: "q4"

but the dump of the inputs shows Uint16 (aka float16) for past_kv ... the q4 model would use float32 for past_kv.

I tried a little sample app that feeds inputs like the ones shown above directly to onnxruntime (tried both q4 and q4f16):

        // fillTensor: helper (defined elsewhere in the sample app) that creates an
        // ort.Tensor of the given shape and type, filled with a constant value.
        const tokens = [101n];
        feed['attention_mask'] = fillTensor([1, tokens.length], "int64", 1n);
        feed['inputs_embeds'] = fillTensor([1, 1, 2048], "float32", 1.);
        // Zero-sized sequence dimension in the KV cache for the first generation step.
        const decoder_shape = [1, 32, 0, 64];
        for (const name of inputNames) {
            if (name.startsWith("past_key_values.")) {
                feed[name] = fillTensor(decoder_shape, "float32", 0);
            }
        }

and the model is happy, I don't see errors (using onnxruntime/main).
Maybe SmolVLM-Instruct is no longer an issue?
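
For completeness, a sketch of the harness the snippet above assumes (the fillTensor helper is reconstructed from its usage; the model path and session options are assumptions):

import * as ort from "onnxruntime-web/webgpu";

// Helper reconstructed from the usage above: an ort.Tensor of the given
// shape and type, filled with a constant value.
function fillTensor(dims, type, value) {
  const size = dims.reduce((a, b) => a * b, 1);
  const data =
    type === "int64" ? new BigInt64Array(size).fill(value) :
    type === "float16" ? new Uint16Array(size).fill(value) :
    new Float32Array(size).fill(value);
  return new ort.Tensor(type, data, dims);
}

// Assumed model file; point this at the actual decoder_model_merged export.
const session = await ort.InferenceSession.create("decoder_model_merged_q4.onnx", {
  executionProviders: ["webgpu"],
});
const inputNames = session.inputNames;
const feed = {};
// ... populate `feed` as in the snippet above ...
const results = await session.run(feed);
console.log(results);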

@satyajandhyala
Contributor

satyajandhyala commented Jan 8, 2025

Both SmolLM2-135M-Instruct and SmolLM2-360M-Instruct fail because the do_rotary attribute is set on the GQA nodes; this attribute is not currently implemented.

guschmue pushed a commit that referenced this issue Jan 9, 2025
… attribute. (#23287)

### Description

Added a fatal error message for the unsupported GroupQueryAttention do_rotary attribute.

### Motivation and Context

#22987
Help the user understand that this attribute is not supported.
guschmue pushed a commit that referenced this issue Jan 12, 2025
… attribute. (#23287)
@guschmue
Contributor

we'll add support for do_rotary in the near future

@TimPietrusky

Not sure if this is the same error, but when using https://huggingface.co/onnx-community/Qwen2.5-Coder-0.5B-ONNX I'm running into this:

[WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: GroupQuerryAttention do_rotary attribute is not supported"

(Happy to open a new issue if this is not related; I just figured there is a relation because of the comment from @guschmue.)

@xenova
Contributor Author

xenova commented Feb 17, 2025

Indeed, that error message was added with #23287

@satyajandhyala
Contributor

The following PR adds support for the do_rotary attribute on GQA:
#23524

@guschmue
Contributor

For now we recommend using plain GQA (without do_rotary), because we plan to switch to a new WebGPU EP that supports things like FlashAttention-2 and is going to be an order of magnitude faster.
That new EP does not support do_rotary for FA2 yet, and we want to avoid models working on one EP but not the other.
Once the functionality is the same everywhere, we'll switch the model builder to use do_rotary.

@TimPietrusky

that sounds super amazing @guschmue, thx!

@xenova
Contributor Author

xenova commented Feb 25, 2025

@guschmue Have the correctness errors for GQA been resolved too?

@satyajandhyala
Contributor

satyajandhyala commented Feb 25, 2025

@xenova is there an open issue about the correctness errors for GQA? If there are steps to reproduce, I can verify.

@xenova
Contributor Author

xenova commented Feb 25, 2025

Above in this thread :) ^ #22987 (comment)

@satyajandhyala
Contributor

I am running into the issue reported here

@satyajandhyala
Contributor

satyajandhyala commented Feb 26, 2025

When I use Chrome instead of Chrome Canary to run SmolLM2-135M-Instruct, I get the following output printed repeatedly, because float16 is supported by default in Chrome:
I am a user.
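
As a quick check, whether the browser exposes fp16 can be verified with the standard WebGPU API (sketch):

// Logs whether the WebGPU adapter advertises fp16 shader support.
const adapter = await navigator.gpu?.requestAdapter();
console.log("shader-f16 supported:", adapter?.features.has("shader-f16"));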
