
metal : multi-simd softmax kernel #3710

Merged
merged 1 commit into master on Nov 1, 2023

Conversation

ggerganov
Member

Slight improvement in the softmax kernel for very long sequence lengths and batched decoding (N_KV > 1024).
Instead of using a single SIMD group, we use multiple groups.

```sh
# 1B model
./bin/batched-bench ../models/tinyllama-1b/ggml-model-f16.gguf 8192 1 99 0 512 512 1,2,3,4,5,6,7,8,9
```

**PR**

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 512 | 512 | 1 | 1024 | 0.084 | 6059.67 | 3.853 | 132.88 | 3.938 | 260.06 |
| 512 | 512 | 2 | 1536 | 0.080 | 6368.16 | 10.320 | 99.23 | 10.400 | 147.69 |
| 512 | 512 | 3 | 2048 | 0.079 | 6521.96 | 10.630 | 144.50 | 10.708 | 191.26 |
| 512 | 512 | 4 | 2560 | 0.078 | 6537.28 | 10.906 | 187.78 | 10.985 | 233.05 |
| 512 | 512 | 5 | 3072 | 0.078 | 6535.20 | 11.151 | 229.58 | 11.229 | 273.57 |
| 512 | 512 | 6 | 3584 | 0.078 | 6536.62 | 11.394 | 269.61 | 11.472 | 312.40 |
| 512 | 512 | 7 | 4096 | 0.079 | 6514.16 | 11.710 | 306.06 | 11.789 | 347.45 |
| 512 | 512 | 8 | 4608 | 0.079 | 6520.55 | 12.097 | 338.60 | 12.175 | 378.47 |
| 512 | 512 | 9 | 5120 | 0.078 | 6543.55 | 12.258 | 375.93 | 12.336 | 415.05 |

**master**

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 512 | 512 | 1 | 1024 | 0.084 | 6128.51 | 3.933 | 130.19 | 4.016 | 254.95 |
| 512 | 512 | 2 | 1536 | 0.078 | 6541.54 | 10.358 | 98.86 | 10.436 | 147.18 |
| 512 | 512 | 3 | 2048 | 0.078 | 6546.39 | 10.781 | 142.47 | 10.859 | 188.59 |
| 512 | 512 | 4 | 2560 | 0.078 | 6557.04 | 10.924 | 187.48 | 11.002 | 232.69 |
| 512 | 512 | 5 | 3072 | 0.078 | 6543.63 | 11.444 | 223.70 | 11.522 | 266.62 |
| 512 | 512 | 6 | 3584 | 0.078 | 6541.38 | 11.570 | 265.51 | 11.648 | 307.68 |
| 512 | 512 | 7 | 4096 | 0.078 | 6590.63 | 11.947 | 299.99 | 12.025 | 340.63 |
| 512 | 512 | 8 | 4608 | 0.078 | 6543.30 | 12.049 | 339.96 | 12.127 | 379.98 |
| 512 | 512 | 9 | 5120 | 0.078 | 6538.70 | 12.557 | 366.98 | 12.635 | 405.22 |

There is no difference for short single-batch cases:

| model | size | backend | ngl | th | test | master t/s | PR t/s | speedup |
| --- | ---: | --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B F16 | 12.55 GiB | Metal | 1 | 4 | pp 512 | 1402.10 ± 1.73 | 1401.19 ± 1.34 | 0.999 |
| llama 7B F16 | 12.55 GiB | Metal | 1 | 4 | tg 128 | 41.67 ± 0.02 | 41.61 ± 0.03 | 0.999 |
| llama 7B Q8_0 | 6.67 GiB | Metal | 1 | 4 | pp 512 | 1247.43 ± 1.26 | 1246.13 ± 0.58 | 0.999 |
| llama 7B Q8_0 | 6.67 GiB | Metal | 1 | 4 | tg 128 | 68.67 ± 0.03 | 68.44 ± 0.05 | 0.997 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 512 | 1237.83 ± 0.72 | 1236.28 ± 0.63 | 0.999 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | tg 128 | 97.55 ± 0.06 | 96.99 ± 0.07 | 0.994 |
| llama 7B Q4_1 | 3.95 GiB | Metal | 1 | 4 | pp 512 | 1239.30 ± 0.87 | 1238.69 ± 0.93 | 1.000 |
| llama 7B Q4_1 | 3.95 GiB | Metal | 1 | 4 | tg 128 | 90.27 ± 0.04 | 89.96 ± 0.03 | 0.997 |

@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Oct 21, 2023
@ggerganov ggerganov merged commit e16b9fa into master Nov 1, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 17, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 18, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 30, 2023