[webgpu] Optimize MatMulNBits f16 prefill shader for subgroup size 32
This commit optimizes the MatMulNBits f16 prefill shader for devices
with a subgroup size of 32.

Testing on Lunar Lake with a 1001-token prompt shows a ~5x improvement
in prompt processing throughput, from 14.02 tokens/s to 69.40 tokens/s
(average time to first token drops from about 71.4 s to about 14.4 s).

Before:
model_benchmark.exe -l 1000 -i Phi-3.5-mini-instruct-onnx-web

Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       7.13811e+07
        avg (tokens/s): 14.0233
        p50 (us):       7.13158e+07
        stddev (us):    120674
        n:              5 * 1001 token(s)

After:
model_benchmark.exe -l 1000 -i Phi-3.5-mini-instruct-onnx-web

Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.44234e+07
        avg (tokens/s): 69.4009
        p50 (us):       1.44293e+07
        stddev (us):    60263.9
        n:              5 * 1001 token(s)
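
The tokens/s figures follow directly from the reported average latencies. As a quick sanity check (a minimal sketch, not part of the benchmark tool; the helper name `tokens_per_second` is hypothetical, and the inputs are copied from the logs above):

```python
def tokens_per_second(prompt_tokens: int, avg_us: float) -> float:
    """Convert an average prompt-processing latency in microseconds to throughput."""
    return prompt_tokens / (avg_us / 1e6)


# Values copied from the "Before" and "After" benchmark logs.
PROMPT_TOKENS = 1001
before = tokens_per_second(PROMPT_TOKENS, 7.13811e07)
after = tokens_per_second(PROMPT_TOKENS, 1.44234e07)
speedup = after / before

print(f"before:  {before:.2f} tokens/s")
print(f"after:   {after:.2f} tokens/s")
print(f"speedup: {speedup:.2f}x")
```

The ratio comes out just under 5, which is where the "~5x" in the summary comes from.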

See above.
daijh committed Feb 21, 2025
1 parent 47a0077 commit 1a4fc42
Showing 1 changed file with 117 additions and 76 deletions.