
[WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions". #22987

xenova opened this issue Dec 3, 2024 · 22 comments

@xenova
Contributor

xenova commented Dec 3, 2024

Describe the issue

The following error occurs when trying to run https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct on WebGPU.

[Screenshot: the GroupQueryAttention error shown in the browser console]

Note that the CPU implementation works correctly, so this is indeed a bug in the WebGPU EP. Moreover, the zero-sized dimension in the past key/value tensors is by design and is used for the first generation step.

To reproduce

  1. Install and build Transformers.js from source (https://github.com/huggingface/transformers.js)
  2. Run the following code in-browser:
import {
  AutoProcessor,
  AutoModelForVision2Seq,
  load_image,
} from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  },
  device: 'webgpu',
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'

Urgency

This blocks SmolVLM usage in Transformers.js.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.1

Execution Provider

'webgpu' (WebGPU)

@xenova xenova added the platform:web issues related to ONNX Runtime web; typically submitted using template label Dec 3, 2024
@github-actions github-actions bot added ep:WebGPU ort-web webgpu provider model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. labels Dec 3, 2024
@xenova
Contributor Author

xenova commented Dec 3, 2024

This also happens for https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct and https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct, which are easier models to test with.
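
For anyone picking this up, a minimal text-only repro along these lines should hit the same GQA path (untested sketch; the dtype/device options and the prompt are assumptions, not from the original report):

import { pipeline } from "@huggingface/transformers";

// Sketch: plain text-generation pipeline on WebGPU with a GQA model.
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-135M-Instruct",
  { device: "webgpu", dtype: "q4" }, // assumed options
);

const messages = [{ role: "user", content: "Tell me about Constantinople." }];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text);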

@xenova xenova changed the title [WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions". [WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: Input "key" is expected to have 3, 4, or 5 dimensions". Dec 3, 2024
@guschmue
Contributor

guschmue commented Dec 5, 2024

looking

@xenova
Contributor Author

xenova commented Dec 5, 2024

Thanks! I'd recommend testing with https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct (or other models with GQA and num_key_value_heads != num_attention_heads).
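
As a quick way to spot such models, comparing the two head counts in the model's config.json is enough (sketch; fetches the config straight from the Hub):

// A model uses GQA when num_key_value_heads is smaller than num_attention_heads.
const url = "https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/resolve/main/config.json";
const config = await (await fetch(url)).json();
console.log(config.num_attention_heads, config.num_key_value_heads);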

Contributor

github-actions bot commented Jan 5, 2025

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Jan 5, 2025
@xenova
Contributor Author

xenova commented Jan 5, 2025

Bump since this is pretty important

@github-actions github-actions bot removed the stale issues that have not been addressed in a while; categorized by a bot label Jan 6, 2025
@guschmue
Contributor

guschmue commented Jan 6, 2025

For me, https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct does not crash and instead produces garbage.
llama3.2-1b with GQA is working fine.
I created my own GQA model; the output is:

user: Tell me about Constantinople.
assistant: You are a helpful AI assistant. You are trained by a user named Hugging.

Just to be sure I tried MHA:

system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
user
Tell me about Constantinople.
assistant
The city of Constantinople, the greatest metropolis in the world. It was the capital of the Byzantine Empire, the largest empire in history, and the largest city in the world.
As a historian, I can tell you that the city was the center of the Byzantine Empire, the largest empire in history, and the largest city in the world. The city was the seat of the emperor, the capital, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest city, the largest...

With MHA, if I ask for something like a summary, the model does a little better initially but then doesn't know how to stop and repeats the same sentence over and over.

looking

@xenova
Contributor Author

xenova commented Jan 6, 2025

Thanks @guschmue! Let me know if there's anything I can do to assist in debugging 🫡

@guschmue
Contributor

guschmue commented Jan 7, 2025

using transformers.js-examples/smollm-webgpu with SmolLM2-135M-Instruct:
my MHA model is fine
my GQA model does not crash but has wrong output or hangs depending on the query

@xenova
Contributor Author

xenova commented Jan 7, 2025

That's the same behaviour I was seeing too 👍 Model output is fine in Node.js, so the garbage output is not due to severe quantisation.

@guschmue
Contributor

guschmue commented Jan 7, 2025

To summarize:
SmolLM2-135M-Instruct with GQA does have a correctness issue; MHA is fine.

For SmolVLM-Instruct:
it seems to be a different issue, but we cannot reproduce it.
We assume the models were generated with the model builder, e.g.:
q4: -e cpu
q4f16: -e cuda
In theory that should work; both would use packedQKV. I'm a little worried about this one because I've never used it myself, though we have unit tests for it and they pass. For comparison, -e web would generate a model with MHA, and -e dml would generate one with GQA but without packedQKV.

The sample code above has:
decoder_model_merged: "q4"

but the dump of the inputs shows Uint16 (aka float16) for past_kv ... the q4 model would use float32 for past_kv.

I tried a little sample app that feeds inputs like the ones shown above directly to onnxruntime (tried both q4 and q4f16):

        // fillTensor: helper (defined elsewhere in the sample app) that creates an
        // ort.Tensor of the given shape and type, filled with a constant value.
        const tokens = [101n];
        feed['attention_mask'] = fillTensor([1, tokens.length], "int64", 1n);
        feed['inputs_embeds'] = fillTensor([1, 1, 2048], "float32", 1.);
        // Zero-sized sequence dimension in the KV cache for the first generation step.
        const decoder_shape = [1, 32, 0, 64];
        for (const name of inputNames) {
            if (name.startsWith("past_key_values.")) {
                feed[name] = fillTensor(decoder_shape, "float32", 0);
            }
        }

and the model is happy, I don't see errors (using onnxruntime/main).
Maybe SmolVLM-Instruct is no longer an issue?
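
For completeness, a sketch of the harness the snippet above assumes (the fillTensor helper is reconstructed from its usage; the model path and session options are assumptions):

import * as ort from "onnxruntime-web/webgpu";

// Helper reconstructed from the usage above: an ort.Tensor of the given
// shape and type, filled with a constant value.
function fillTensor(dims, type, value) {
  const size = dims.reduce((a, b) => a * b, 1);
  const data =
    type === "int64" ? new BigInt64Array(size).fill(value) :
    type === "float16" ? new Uint16Array(size).fill(value) :
    new Float32Array(size).fill(value);
  return new ort.Tensor(type, data, dims);
}

// Assumed model file; point this at the actual decoder_model_merged export.
const session = await ort.InferenceSession.create("decoder_model_merged_q4.onnx", {
  executionProviders: ["webgpu"],
});
const inputNames = session.inputNames;
const feed = {};
// ... populate `feed` as in the snippet above ...
const results = await session.run(feed);
console.log(results);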

@satyajandhyala
Contributor

satyajandhyala commented Jan 8, 2025

Both SmolLM2-135M-Instruct and SmolLM2-360M-Instruct fail because the do_rotary attribute is set on the GQA nodes; this attribute is not currently implemented.

guschmue pushed a commit that referenced this issue Jan 9, 2025
… attribute. (#23287)

### Description

Added a fatal error message for the unsupported GroupQueryAttention do_rotary attribute.

### Motivation and Context

#22987
Help the user understand that this attribute is not supported.
guschmue pushed a commit that referenced this issue Jan 12, 2025
… attribute. (#23287)
@guschmue
Contributor

we'll add support for do_rotary in the near future

@TimPietrusky

Not sure if this is the same error, but when using https://huggingface.co/onnx-community/Qwen2.5-Coder-0.5B-ONNX I'm running into this:

[WebGPU] Kernel "[GroupQueryAttention] /model/layers.0/attn/GroupQueryAttention" failed. Error: GroupQuerryAttention do_rotary attribute is not supported"

(Happy to open a new issue if this is not related; I just figured there is a relation because of the comment from @guschmue.)

@xenova
Contributor Author

xenova commented Feb 17, 2025

Indeed, that error message was added with #23287

@satyajandhyala
Contributor

The following PR adds support for the do_rotary attribute on GQA:
#23524

@guschmue
Contributor

For now we recommend using plain GQA (without do_rotary), because we plan to switch to a new WebGPU EP that supports things like FlashAttention-2 and is going to be an order of magnitude faster.
That new EP does not support do_rotary for FA2 yet, and we want to avoid models working on one EP but not the other.
Once the functionality is the same everywhere, we'll switch the model builder to use do_rotary.

@TimPietrusky

that sounds super amazing @guschmue, thx!

@xenova
Contributor Author

xenova commented Feb 25, 2025

@guschmue Have the correctness errors for GQA been resolved too?

@satyajandhyala
Contributor

satyajandhyala commented Feb 25, 2025

@xenova is there an open issue about the correctness errors for GQA? If there are steps to reproduce, I can verify.

@xenova
Contributor Author

xenova commented Feb 25, 2025

Above in this thread :) ^ #22987 (comment)

@satyajandhyala
Contributor

I am running into the issue reported here

@satyajandhyala
Contributor

satyajandhyala commented Feb 26, 2025

When I use Chrome instead of Chrome Canary to run SmolLM2-135M-Instruct, I get the following output printed repeatedly, because float16 is supported by default in Chrome:
I am a user.
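
As a quick check, whether the browser exposes fp16 can be verified with the standard WebGPU API (sketch):

// Logs whether the WebGPU adapter advertises fp16 shader support.
const adapter = await navigator.gpu?.requestAdapter();
console.log("shader-f16 supported:", adapter?.features.has("shader-f16"));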
