[webgpu] Optimize MatMulNBits f16 prefill shader for subgroup size 32
This commit optimizes the MatMulNBits f16 prefill shader for devices with a subgroup size of 32. Testing on Lunar Lake shows about a 5x improvement in prompt processing throughput, from 14.02 tokens/s to 69.40 tokens/s.

Before:
model_benchmark.exe -l 1000 -i Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
    avg (us): 7.13811e+07
    avg (tokens/s): 14.0233
    p50 (us): 7.13158e+07
    stddev (us): 120674
    n: 5 * 1001 token(s)

After:
model_benchmark.exe -l 1000 -i Phi-3.5-mini-instruct-onnx-web
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
    avg (us): 1.44234e+07
    avg (tokens/s): 69.4009
    p50 (us): 1.44293e+07
    stddev (us): 60263.9
    n: 5 * 1001 token(s)
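As a sanity check on the reported numbers, the throughput figures can be recomputed from the average time to first token. The sketch below is not part of the benchmark tool; the helper name `tokens_per_second` is hypothetical, and it assumes throughput is simply the prompt token count divided by the average latency in seconds.

```python
def tokens_per_second(prompt_tokens: int, avg_latency_us: float) -> float:
    """Convert an average prompt-processing latency (microseconds) to tokens/s.

    Hypothetical helper: assumes throughput = prompt tokens / latency in seconds.
    """
    return prompt_tokens / (avg_latency_us / 1e6)


# Values copied from the benchmark output in the commit message.
PROMPT_TOKENS = 1001
before = tokens_per_second(PROMPT_TOKENS, 7.13811e7)
after = tokens_per_second(PROMPT_TOKENS, 1.44234e7)
speedup = after / before

print(f"before: {before:.2f} tok/s, after: {after:.2f} tok/s, speedup: {speedup:.2f}x")
```

Under that assumption the recomputed values agree with the reported 14.02 and 69.40 tokens/s to within rounding, and the ratio comes out just under 5x.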