[webgpu] Optimize MatMulNBits f16 prefill shader for subgroup size 32 · microsoft/onnxruntime@1a4fc42


This commit optimizes the MatMulNBits f16 prefill shader for devices
with a subgroup size of 32.

Testing on Lunar Lake shows a ~5x improvement in prompt processing
throughput for a 1001-token prompt, from 14.02 to 69.40 tokens/s
(average time to first token drops from about 71.4 s to 14.4 s).

Before:
model_benchmark.exe -l 1000 -i Phi-3.5-mini-instruct-onnx-web

Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       7.13811e+07
        avg (tokens/s): 14.0233
        p50 (us):       7.13158e+07
        stddev (us):    120674
        n:              5 * 1001 token(s)

After:
model_benchmark.exe -l 1000 -i Phi-3.5-mini-instruct-onnx-web

Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.44234e+07
        avg (tokens/s): 69.4009
        p50 (us):       1.44293e+07
        stddev (us):    60263.9
        n:              5 * 1001 token(s)
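
As a sanity check on the ~5x figure, the throughput and speedup can be recomputed from the averages reported above. This is a minimal sketch, not part of the commit: the helper name `tokens_per_second` is hypothetical, and it assumes that tokens/s is simply the prompt token count divided by the average time to first token.

```python
# Recompute throughput and speedup from the benchmark averages quoted above.
# Assumption: tokens/s = prompt tokens / average time to first token.

def tokens_per_second(prompt_tokens: int, avg_us: float) -> float:
    """Convert an average latency in microseconds into tokens per second."""
    return prompt_tokens / (avg_us / 1e6)

PROMPT_TOKENS = 1001          # "prompt tokens: 1001" in both runs
BEFORE_AVG_US = 7.13811e7     # avg (us) before the change
AFTER_AVG_US = 1.44234e7      # avg (us) after the change

before_tps = tokens_per_second(PROMPT_TOKENS, BEFORE_AVG_US)
after_tps = tokens_per_second(PROMPT_TOKENS, AFTER_AVG_US)
speedup = BEFORE_AVG_US / AFTER_AVG_US  # ratio of average latencies

print(f"before:  {before_tps:.2f} tokens/s")
print(f"after:   {after_tps:.2f} tokens/s")
print(f"speedup: {speedup:.2f}x")
```

Under that assumption the recomputed values agree with the reported 14.02 and 69.40 tokens/s to two decimal places, and the latency ratio comes out at roughly 4.95x.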



daijh committed Feb 21, 2025

1 parent 47a0077 commit 1a4fc42
