
metal : multi-simd softmax kernel #3710

Merged
merged 1 commit into master on Nov 1, 2023

Conversation

ggerganov
Member

Slight improvement in the softmax kernel for very long sequence lengths and batched decoding (N_KV > 1024).
Instead of using a single SIMD group, we use multiple groups.

```sh
# 1B model
./bin/batched-bench ../models/tinyllama-1b/ggml-model-f16.gguf 8192 1 99 0 512 512 1,2,3,4,5,6,7,8,9
```

**PR**

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 512 | 512 | 1 | 1024 | 0.084 | 6059.67 | 3.853 | 132.88 | 3.938 | 260.06 |
| 512 | 512 | 2 | 1536 | 0.080 | 6368.16 | 10.320 | 99.23 | 10.400 | 147.69 |
| 512 | 512 | 3 | 2048 | 0.079 | 6521.96 | 10.630 | 144.50 | 10.708 | 191.26 |
| 512 | 512 | 4 | 2560 | 0.078 | 6537.28 | 10.906 | 187.78 | 10.985 | 233.05 |
| 512 | 512 | 5 | 3072 | 0.078 | 6535.20 | 11.151 | 229.58 | 11.229 | 273.57 |
| 512 | 512 | 6 | 3584 | 0.078 | 6536.62 | 11.394 | 269.61 | 11.472 | 312.40 |
| 512 | 512 | 7 | 4096 | 0.079 | 6514.16 | 11.710 | 306.06 | 11.789 | 347.45 |
| 512 | 512 | 8 | 4608 | 0.079 | 6520.55 | 12.097 | 338.60 | 12.175 | 378.47 |
| 512 | 512 | 9 | 5120 | 0.078 | 6543.55 | 12.258 | 375.93 | 12.336 | 415.05 |

**master**

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 512 | 512 | 1 | 1024 | 0.084 | 6128.51 | 3.933 | 130.19 | 4.016 | 254.95 |
| 512 | 512 | 2 | 1536 | 0.078 | 6541.54 | 10.358 | 98.86 | 10.436 | 147.18 |
| 512 | 512 | 3 | 2048 | 0.078 | 6546.39 | 10.781 | 142.47 | 10.859 | 188.59 |
| 512 | 512 | 4 | 2560 | 0.078 | 6557.04 | 10.924 | 187.48 | 11.002 | 232.69 |
| 512 | 512 | 5 | 3072 | 0.078 | 6543.63 | 11.444 | 223.70 | 11.522 | 266.62 |
| 512 | 512 | 6 | 3584 | 0.078 | 6541.38 | 11.570 | 265.51 | 11.648 | 307.68 |
| 512 | 512 | 7 | 4096 | 0.078 | 6590.63 | 11.947 | 299.99 | 12.025 | 340.63 |
| 512 | 512 | 8 | 4608 | 0.078 | 6543.30 | 12.049 | 339.96 | 12.127 | 379.98 |
| 512 | 512 | 9 | 5120 | 0.078 | 6538.70 | 12.557 | 366.98 | 12.635 | 405.22 |

There is no difference for short single-batch cases:

| model | size | backend | ngl | th | test | master t/s | PR t/s | speedup |
| --- | ---: | --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B F16 | 12.55 GiB | Metal | 1 | 4 | pp 512 | 1402.10 ± 1.73 | 1401.19 ± 1.34 | 0.999 |
| llama 7B F16 | 12.55 GiB | Metal | 1 | 4 | tg 128 | 41.67 ± 0.02 | 41.61 ± 0.03 | 0.999 |
| llama 7B Q8_0 | 6.67 GiB | Metal | 1 | 4 | pp 512 | 1247.43 ± 1.26 | 1246.13 ± 0.58 | 0.999 |
| llama 7B Q8_0 | 6.67 GiB | Metal | 1 | 4 | tg 128 | 68.67 ± 0.03 | 68.44 ± 0.05 | 0.997 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | pp 512 | 1237.83 ± 0.72 | 1236.28 ± 0.63 | 0.999 |
| llama 7B Q4_0 | 3.56 GiB | Metal | 1 | 4 | tg 128 | 97.55 ± 0.06 | 96.99 ± 0.07 | 0.994 |
| llama 7B Q4_1 | 3.95 GiB | Metal | 1 | 4 | pp 512 | 1239.30 ± 0.87 | 1238.69 ± 0.93 | 1.000 |
| llama 7B Q4_1 | 3.95 GiB | Metal | 1 | 4 | tg 128 | 90.27 ± 0.04 | 89.96 ± 0.03 | 0.997 |

@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Oct 21, 2023
@ggerganov ggerganov merged commit e16b9fa into master Nov 1, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 17, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 18, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 30, 2023