metal : optimize FA kernels #10171

Merged: 9 commits merged from gg/metal-fa-f16 into master on Nov 8, 2024
Conversation

ggerganov (Owner) commented Nov 4, 2024

Target: #10149
Related: #8439

Various optimizations for the FA kernels:

  • F16 math
  • reduce register pressure
  • store mask in shared mem
  • earlier -INF block checks

The performance should be noticeably better at larger contexts. The kernels continue to use F32 accumulators for the Q*K*scale product, so I hope there are no floating-point range issues, though some extra testing won't hurt.
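For illustration, here is a minimal scalar C++ sketch of the mixed-precision idea above: Q and K values are F16, but the Q*K*scale result is accumulated in F32. This is not the actual Metal kernel code; the function name is made up, and it assumes a compiler with `_Float16` support (e.g. clang on Apple Silicon):

```cpp
// Illustrative sketch only, not the ggml Metal kernel: F16 inputs with an
// F32 accumulator for the Q*K*scale product, so intermediate sums are not
// limited by the ~65504 range of F16.
#include <cstdio>

using half_t = _Float16; // assumption: a compiler providing _Float16

// Hypothetical helper: one attention score s = scale * dot(q, k) for head size D.
float attn_score_f16_in_f32_acc(const half_t * q, const half_t * k, int D, float scale) {
    float acc = 0.0f;                        // F32 accumulator
    for (int d = 0; d < D; ++d) {
        acc += (float) q[d] * (float) k[d];  // F16 operands, F32 accumulation
    }
    return acc * scale;
}

int main() {
    const int D = 4;
    half_t q[D] = { (half_t)1.0f, (half_t)2.0f, (half_t)3.0f, (half_t)4.0f };
    half_t k[D] = { (half_t)0.5f, (half_t)0.5f, (half_t)0.5f, (half_t)0.5f };
    printf("score = %f\n", attn_score_f16_in_f32_acc(q, k, D, 0.125f));
    return 0;
}
```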

The original idea of using full BF16 math in the FA kernels did not produce satisfactory results. I think that bfloat performance is not great on Metal yet.

Here are some benches:

./scripts/compare-commits.sh master gg/metal-fa-f16 -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -m models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf -m models/gemma-2b/ggml-model-q4_0.gguf -m models/gemma-2b/ggml-model-f16.gguf -m models/gemma-7b/ggml-model-q4_0.gguf -m models/gemma-7b/ggml-model-f16.gguf -fa 1 -p 4096 -ub 4096 -n 0
| CPU | Model | Test | t/s master | t/s gg/metal-fa-f16 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | gemma 2B F16 (guessed) | pp4096 | 3729.96 | 3855.89 | 1.03 |
| M2 Ultra | gemma 2B F16 (guessed) | tg512 | 90.04 | 94.24 | 1.05 |
| M2 Ultra | gemma 2B Q4_0 | pp4096 | 3390.40 | 3492.34 | 1.03 |
| M2 Ultra | gemma 2B Q4_0 | tg512 | 165.62 | 179.59 | 1.08 |
| M2 Ultra | gemma 7B F16 (guessed) | pp4096 | 1055.38 | 1092.29 | 1.03 |
| M2 Ultra | gemma 7B F16 (guessed) | tg512 | 32.37 | 33.21 | 1.03 |
| M2 Ultra | gemma 7B Q4_0 | pp4096 | 947.17 | 976.45 | 1.03 |
| M2 Ultra | gemma 7B Q4_0 | tg512 | 79.61 | 84.98 | 1.07 |
| M2 Ultra | llama 1B F16 | pp4096 | 7259.47 | 7710.98 | 1.06 |
| M2 Ultra | llama 1B F16 | tg512 | 156.15 | 159.51 | 1.02 |
| M2 Ultra | llama 1B Q8_0 | pp4096 | 6743.79 | 7141.41 | 1.06 |
| M2 Ultra | llama 1B Q8_0 | tg512 | 221.67 | 228.48 | 1.03 |
| M2 Ultra | llama 3B F16 | pp4096 | 2789.54 | 3055.63 | 1.10 |
| M2 Ultra | llama 3B F16 | tg512 | 71.40 | 72.28 | 1.01 |
| M2 Ultra | llama 3B Q8_0 | pp4096 | 2572.74 | 2786.11 | 1.08 |
| M2 Ultra | llama 3B Q8_0 | tg512 | 114.93 | 116.99 | 1.02 |
| M2 Ultra | llama 7B Q8_0 | pp4096 | 1156.61 | 1223.59 | 1.06 |
| M2 Ultra | llama 7B Q8_0 | tg512 | 65.19 | 66.09 | 1.01 |
| M2 Ultra | qwen2 ?B Q8_0 | pp4096 | 3058.36 | 3348.89 | 1.09 |
| M2 Ultra | qwen2 ?B Q8_0 | tg512 | 109.80 | 111.79 | 1.02 |

Using llama-batched-bench to show TG speed after large prompts (S_TG column):

./llama-batched-bench -m models/qwen2.5-7b-coder/ggml-model-q4_0.gguf -c 65768 -b 1024 -npp 4096,8192,16384,32768 -ntg 32 -npl 1 -fa
  • master

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 3.553 | 1152.93 | 0.405 | 78.98 | 3.958 | 1042.99 |
| 8192 | 32 | 1 | 8224 | 7.614 | 1075.97 | 0.476 | 67.29 | 8.089 | 1016.67 |
| 16384 | 32 | 1 | 16416 | 18.634 | 879.25 | 0.620 | 51.63 | 19.254 | 852.60 |
| 32768 | 32 | 1 | 32800 | 49.343 | 664.08 | 0.907 | 35.28 | 50.250 | 652.73 |

  • gg/metal-fa-f16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 3.601 | 1137.49 | 0.396 | 80.86 | 3.997 | 1032.86 |
| 8192 | 32 | 1 | 8224 | 7.306 | 1121.27 | 0.433 | 73.90 | 7.739 | 1062.67 |
| 16384 | 32 | 1 | 16416 | 17.269 | 948.76 | 0.542 | 59.07 | 17.811 | 921.70 |
| 32768 | 32 | 1 | 32800 | 45.875 | 714.29 | 0.759 | 42.17 | 46.634 | 703.35 |

M1 Pro

| CPU | Model | Test | t/s master | t/s gg/metal-fa-f16 | Speedup |
| --- | --- | --- | --- | --- | --- |
| Accelerate, Apple M1 Pro | llama 3B F16 | pp4096 | 471.11 | 546.22 | 1.16 |
| Accelerate, Apple M1 Pro | llama 3B F16 | tg128 | 25.13 | 25.26 | 1.01 |
| Accelerate, Apple M1 Pro | qwen2 ?B F16 | pp4096 | 980.11 | 1145.10 | 1.17 |
| Accelerate, Apple M1 Pro | qwen2 ?B F16 | tg128 | 47.84 | 48.42 | 1.01 |

TODO:

  • Run some tests on M1 Pro
  • Test some models

Base automatically changed from gg/metal-fa-q to master November 6, 2024 08:24
ggerganov marked this pull request as ready for review November 6, 2024 14:30
ggerganov changed the title from "metal : switch to F16 FA" to "metal : optimize FA kernels" on Nov 7, 2024
ggerganov (Owner, Author):

This PR should be gucci now.

slaren (Collaborator) commented Nov 8, 2024

The performance increase looks about the same with M3 Max:

| CPU | Model | Test | t/s master | t/s gg/metal-fa-f16 | Speedup |
| --- | --- | --- | --- | --- | --- |
| Accelerate, Apple M3 Max | llama 1B Q8_0 | pp4096 | 3857.10 | 4108.20 | 1.07 |
| Accelerate, Apple M3 Max | llama 1B Q8_0 | tg128 | 169.28 | 170.57 | 1.01 |

Note that to test batch sizes larger than the default 2048 with llama-bench it is necessary to increase both the batch size and ubatch size, since llama-bench always uses a batch of size n_batch, regardless of the value of n_ubatch.

ggerganov (Owner, Author) commented Nov 8, 2024

> Note that to test batch sizes larger than the default 2048 with llama-bench it is necessary to increase both the batch size and ubatch size, since llama-bench always uses a batch of size n_batch, regardless of the value of n_ubatch.

Thanks, I forgot about that.

As a data point, running some tests as a function of the ubatch size shows that the Metal backend now benefits from ubatch sizes of up to 4096 (up from 2048 on master) when FA is enabled:

./llama-bench -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -fa 1 -p 1024,2048,4096,8192,16384 -b 16384 -ub 512,1024,2048,4096,8192 -n 0
| model | size | backend | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp1024 | 3180.10 ± 4.02 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp2048 | 3026.70 ± 1.89 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp4096 | 2745.50 ± 2.56 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp8192 | 2300.55 ± 0.57 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp16384 | 1730.21 ± 0.67 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp1024 | 3437.65 ± 4.19 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp2048 | 3260.28 ± 2.36 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp4096 | 2962.40 ± 3.42 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp8192 | 2486.86 ± 1.58 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp16384 | 1870.89 ± 0.76 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp1024 | 3437.07 ± 2.61 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp2048 | 3375.57 ± 5.92 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp4096 | 3066.98 ± 3.09 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp8192 | 2581.90 ± 1.83 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp16384 | 1941.22 ± 0.66 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp1024 | 3435.34 ± 4.51 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp2048 | 3373.15 ± 5.77 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp4096 | 3104.95 ± 1.71 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp8192 | 2613.96 ± 2.99 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp16384 | 1959.44 ± 1.24 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp1024 | 3433.74 ± 3.59 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp2048 | 3374.11 ± 4.87 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp4096 | 3104.40 ± 1.08 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp8192 | 2597.08 ± 0.44 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp16384 | 1939.36 ± 1.76 |

build: 59792ff (4057)

./llama-bench -m ./models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -fa 1 -p 1024,2048,4096,8192,16384 -b 16384 -ub 512,1024,2048,4096,8192 -n 0
| model | size | backend | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp1024 | 1345.30 ± 0.30 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp2048 | 1313.49 ± 0.40 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp4096 | 1249.78 ± 0.36 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp8192 | 1132.37 ± 0.22 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp16384 | 950.45 ± 0.09 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp1024 | 1416.53 ± 0.84 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp2048 | 1380.82 ± 0.49 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp4096 | 1313.26 ± 0.61 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp8192 | 1194.45 ± 0.51 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp16384 | 1008.75 ± 0.11 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp1024 | 1416.76 ± 0.40 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp2048 | 1410.50 ± 0.80 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp4096 | 1343.40 ± 1.09 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp8192 | 1225.15 ± 0.29 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp16384 | 1036.47 ± 0.28 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp1024 | 1416.40 ± 1.01 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp2048 | 1409.80 ± 1.17 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp4096 | 1358.02 ± 0.75 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp8192 | 1236.98 ± 0.61 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp16384 | 1045.57 ± 0.34 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp1024 | 1415.87 ± 0.76 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp2048 | 1411.00 ± 0.49 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp4096 | 1357.65 ± 0.41 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp8192 | 1235.60 ± 0.33 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp16384 | 1043.91 ± 0.56 |

build: 1888c1f (4057)

My guess is that the logic for skipping the computation of attention blocks when the mask is full of -INF in that block is now more efficient. I'm wondering if this optimization could be viable for the CUDA FA as well.
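For illustration, a minimal C++ sketch of the kind of check described above: before a KV block is processed, its slice of the mask is scanned, and if every entry is -INF the block's QK^T, softmax and V accumulation are skipped entirely. The function name and data layout here are hypothetical, not the actual Metal (or CUDA) implementation:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Returns true if the mask slice for one KV block is entirely -INF,
// i.e. the whole block is masked out and its QK^T / softmax / V
// accumulation can be skipped. Layout and names are hypothetical.
bool block_fully_masked(const float * mask_row, int block_start, int block_size) {
    for (int j = 0; j < block_size; ++j) {
        if (mask_row[block_start + j] != -INFINITY) {
            return false; // at least one visible position -> must compute the block
        }
    }
    return true;
}

int main() {
    // toy causal-style mask row: first 4 positions visible, rest masked
    std::vector<float> mask(8, -INFINITY);
    for (int j = 0; j < 4; ++j) mask[j] = 0.0f;

    const int block_size = 4;
    for (int b = 0; b < 2; ++b) {
        if (block_fully_masked(mask.data(), b*block_size, block_size)) {
            printf("block %d: skipped\n", b);
        } else {
            printf("block %d: computed\n", b);
        }
    }
    return 0;
}
```

With causal masks at large contexts, many KV blocks are fully masked for a given set of queries, so the earlier this check happens in the kernel, the more work is avoided.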

ggerganov merged commit 841f27a into master Nov 8, 2024 (1 check passed)
ggerganov added a commit that referenced this pull request Nov 8, 2024.
arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15 and Nov 18, 2024, carrying the following commit message:

* ggml : add ggml_flash_attn_ext_get_prec
* metal : use F16 precision in FA kernels
* metal : minor clean-up
* metal : compile-guard bf16 FA kernels
* build : remove obsolete compile flag [no ci]
* metal : prevent int overflows [no ci]
* cuda : disable BF16 FA
* metal : fix BF16 requirement for FA kernels
* make : clean-up [no ci]
Labels: examples · ggml (changes relating to the ggml tensor library for machine learning) · Nvidia GPU (issues specific to Nvidia GPUs) · testing (everything test related)