metal : optimize FA kernels #10171

Merged: 9 commits merged from gg/metal-fa-f16 into master on Nov 8, 2024
Conversation

ggerganov (Owner) commented Nov 4, 2024

Target: #10149
Related: #8439

Various optimizations for the FA kernels:

  • F16 math
  • reduce register pressure
  • store mask in shared mem
  • earlier -INF block checks

The performance should be noticeably better at larger contexts. The kernels continue to use F32 accumulators for the Q*K*scale product, so I hope there are no floating-point range issues, though some extra testing won't hurt.
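For illustration, here is a minimal scalar C++ sketch of the mixed-precision idea above: Q and K values are F16, but the Q*K*scale result is accumulated in F32. This is not the actual Metal kernel code; the function name is made up, and it assumes a compiler with `_Float16` support (e.g. clang on Apple Silicon):

```cpp
// Illustrative sketch only, not the ggml Metal kernel: F16 inputs with an
// F32 accumulator for the Q*K*scale product, so intermediate sums are not
// limited by the ~65504 range of F16.
#include <cstdio>

using half_t = _Float16; // assumption: a compiler providing _Float16

// Hypothetical helper: one attention score s = scale * dot(q, k) for head size D.
float attn_score_f16_in_f32_acc(const half_t * q, const half_t * k, int D, float scale) {
    float acc = 0.0f;                        // F32 accumulator
    for (int d = 0; d < D; ++d) {
        acc += (float) q[d] * (float) k[d];  // F16 operands, F32 accumulation
    }
    return acc * scale;
}

int main() {
    const int D = 4;
    half_t q[D] = { (half_t)1.0f, (half_t)2.0f, (half_t)3.0f, (half_t)4.0f };
    half_t k[D] = { (half_t)0.5f, (half_t)0.5f, (half_t)0.5f, (half_t)0.5f };
    printf("score = %f\n", attn_score_f16_in_f32_acc(q, k, D, 0.125f));
    return 0;
}
```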

The original idea of using full BF16 math in the FA kernels did not produce satisfactory results. I think that bfloat performance is not great on Metal yet.

Here are some benches:

./scripts/compare-commits.sh master gg/metal-fa-f16 -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -m models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf -m models/gemma-2b/ggml-model-q4_0.gguf -m models/gemma-2b/ggml-model-f16.gguf -m models/gemma-7b/ggml-model-q4_0.gguf -m models/gemma-7b/ggml-model-f16.gguf -fa 1 -p 4096 -ub 4096 -n 0
| CPU | Model | Test | t/s master | t/s gg/metal-fa-f16 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | gemma 2B F16 (guessed) | pp4096 | 3729.96 | 3855.89 | 1.03 |
| M2 Ultra | gemma 2B F16 (guessed) | tg512 | 90.04 | 94.24 | 1.05 |
| M2 Ultra | gemma 2B Q4_0 | pp4096 | 3390.40 | 3492.34 | 1.03 |
| M2 Ultra | gemma 2B Q4_0 | tg512 | 165.62 | 179.59 | 1.08 |
| M2 Ultra | gemma 7B F16 (guessed) | pp4096 | 1055.38 | 1092.29 | 1.03 |
| M2 Ultra | gemma 7B F16 (guessed) | tg512 | 32.37 | 33.21 | 1.03 |
| M2 Ultra | gemma 7B Q4_0 | pp4096 | 947.17 | 976.45 | 1.03 |
| M2 Ultra | gemma 7B Q4_0 | tg512 | 79.61 | 84.98 | 1.07 |
| M2 Ultra | llama 1B F16 | pp4096 | 7259.47 | 7710.98 | 1.06 |
| M2 Ultra | llama 1B F16 | tg512 | 156.15 | 159.51 | 1.02 |
| M2 Ultra | llama 1B Q8_0 | pp4096 | 6743.79 | 7141.41 | 1.06 |
| M2 Ultra | llama 1B Q8_0 | tg512 | 221.67 | 228.48 | 1.03 |
| M2 Ultra | llama 3B F16 | pp4096 | 2789.54 | 3055.63 | 1.10 |
| M2 Ultra | llama 3B F16 | tg512 | 71.40 | 72.28 | 1.01 |
| M2 Ultra | llama 3B Q8_0 | pp4096 | 2572.74 | 2786.11 | 1.08 |
| M2 Ultra | llama 3B Q8_0 | tg512 | 114.93 | 116.99 | 1.02 |
| M2 Ultra | llama 7B Q8_0 | pp4096 | 1156.61 | 1223.59 | 1.06 |
| M2 Ultra | llama 7B Q8_0 | tg512 | 65.19 | 66.09 | 1.01 |
| M2 Ultra | qwen2 ?B Q8_0 | pp4096 | 3058.36 | 3348.89 | 1.09 |
| M2 Ultra | qwen2 ?B Q8_0 | tg512 | 109.80 | 111.79 | 1.02 |

Using llama-batched-bench to show TG speed after large prompts (S_TG column):

./llama-batched-bench -m models/qwen2.5-7b-coder/ggml-model-q4_0.gguf -c 65768 -b 1024 -npp 4096,8192,16384,32768 -ntg 32 -npl 1 -fa
  • master

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 3.553 | 1152.93 | 0.405 | 78.98 | 3.958 | 1042.99 |
| 8192 | 32 | 1 | 8224 | 7.614 | 1075.97 | 0.476 | 67.29 | 8.089 | 1016.67 |
| 16384 | 32 | 1 | 16416 | 18.634 | 879.25 | 0.620 | 51.63 | 19.254 | 852.60 |
| 32768 | 32 | 1 | 32800 | 49.343 | 664.08 | 0.907 | 35.28 | 50.250 | 652.73 |

  • gg/metal-fa-f16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 3.601 | 1137.49 | 0.396 | 80.86 | 3.997 | 1032.86 |
| 8192 | 32 | 1 | 8224 | 7.306 | 1121.27 | 0.433 | 73.90 | 7.739 | 1062.67 |
| 16384 | 32 | 1 | 16416 | 17.269 | 948.76 | 0.542 | 59.07 | 17.811 | 921.70 |
| 32768 | 32 | 1 | 32800 | 45.875 | 714.29 | 0.759 | 42.17 | 46.634 | 703.35 |

M1 Pro

| CPU | Model | Test | t/s master | t/s gg/metal-fa-f16 | Speedup |
| --- | --- | --- | --- | --- | --- |
| Accelerate, Apple M1 Pro | llama 3B F16 | pp4096 | 471.11 | 546.22 | 1.16 |
| Accelerate, Apple M1 Pro | llama 3B F16 | tg128 | 25.13 | 25.26 | 1.01 |
| Accelerate, Apple M1 Pro | qwen2 ?B F16 | pp4096 | 980.11 | 1145.10 | 1.17 |
| Accelerate, Apple M1 Pro | qwen2 ?B F16 | tg128 | 47.84 | 48.42 | 1.01 |

TODO:

  • Run some tests on M1 Pro
  • Test some models

Base automatically changed from gg/metal-fa-q to master November 6, 2024 08:24
ggerganov marked this pull request as ready for review November 6, 2024 14:30
ggerganov changed the title from "metal : switch to F16 FA" to "metal : optimize FA kernels" on Nov 7, 2024
ggerganov (Owner, Author):

This PR should be gucci now.

slaren (Collaborator) commented Nov 8, 2024

The performance increase looks about the same with M3 Max:

| CPU | Model | Test | t/s master | t/s gg/metal-fa-f16 | Speedup |
| --- | --- | --- | --- | --- | --- |
| Accelerate, Apple M3 Max | llama 1B Q8_0 | pp4096 | 3857.10 | 4108.20 | 1.07 |
| Accelerate, Apple M3 Max | llama 1B Q8_0 | tg128 | 169.28 | 170.57 | 1.01 |

Note that to test batch sizes larger than the default 2048 with llama-bench it is necessary to increase both the batch size and ubatch size, since llama-bench always uses a batch of size n_batch, regardless of the value of n_ubatch.

ggerganov (Owner, Author) commented Nov 8, 2024

> Note that to test batch sizes larger than the default 2048 with llama-bench it is necessary to increase both the batch size and ubatch size, since llama-bench always uses a batch of size n_batch, regardless of the value of n_ubatch.

Thanks, I forgot about that.

As a data point, running some tests as a function of the ubatch size shows that the Metal backend now benefits from ubatch sizes of up to 4096 (up from 2048 on master) when FA is enabled:

./llama-bench -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -fa 1 -p 1024,2048,4096,8192,16384 -b 16384 -ub 512,1024,2048,4096,8192 -n 0
| model | size | backend | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp1024 | 3180.10 ± 4.02 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp2048 | 3026.70 ± 1.89 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp4096 | 2745.50 ± 2.56 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp8192 | 2300.55 ± 0.57 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 512 | 1 | pp16384 | 1730.21 ± 0.67 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp1024 | 3437.65 ± 4.19 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp2048 | 3260.28 ± 2.36 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp4096 | 2962.40 ± 3.42 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp8192 | 2486.86 ± 1.58 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp16384 | 1870.89 ± 0.76 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp1024 | 3437.07 ± 2.61 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp2048 | 3375.57 ± 5.92 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp4096 | 3066.98 ± 3.09 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp8192 | 2581.90 ± 1.83 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp16384 | 1941.22 ± 0.66 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp1024 | 3435.34 ± 4.51 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp2048 | 3373.15 ± 5.77 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp4096 | 3104.95 ± 1.71 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp8192 | 2613.96 ± 2.99 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp16384 | 1959.44 ± 1.24 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp1024 | 3433.74 ± 3.59 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp2048 | 3374.11 ± 4.87 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp4096 | 3104.40 ± 1.08 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp8192 | 2597.08 ± 0.44 |
| llama 3B F16 | 6.72 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp16384 | 1939.36 ± 1.76 |

build: 59792ff (4057)

./llama-bench -m ./models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -fa 1 -p 1024,2048,4096,8192,16384 -b 16384 -ub 512,1024,2048,4096,8192 -n 0
| model | size | backend | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp1024 | 1345.30 ± 0.30 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp2048 | 1313.49 ± 0.40 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp4096 | 1249.78 ± 0.36 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp8192 | 1132.37 ± 0.22 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 512 | 1 | pp16384 | 950.45 ± 0.09 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp1024 | 1416.53 ± 0.84 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp2048 | 1380.82 ± 0.49 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp4096 | 1313.26 ± 0.61 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp8192 | 1194.45 ± 0.51 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 1024 | 1 | pp16384 | 1008.75 ± 0.11 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp1024 | 1416.76 ± 0.40 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp2048 | 1410.50 ± 0.80 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp4096 | 1343.40 ± 1.09 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp8192 | 1225.15 ± 0.29 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 2048 | 1 | pp16384 | 1036.47 ± 0.28 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp1024 | 1416.40 ± 1.01 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp2048 | 1409.80 ± 1.17 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp4096 | 1358.02 ± 0.75 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp8192 | 1236.98 ± 0.61 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 4096 | 1 | pp16384 | 1045.57 ± 0.34 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp1024 | 1415.87 ± 0.76 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp2048 | 1411.00 ± 0.49 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp4096 | 1357.65 ± 0.41 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp8192 | 1235.60 ± 0.33 |
| qwen2 ?B Q8_0 | 7.54 GiB | Metal,BLAS | 16384 | 8192 | 1 | pp16384 | 1043.91 ± 0.56 |

build: 1888c1f (4057)

My guess is that the logic for skipping the computation of attention blocks when the mask is full of -INF in that block is now more efficient. I'm wondering if this optimization could be viable for the CUDA FA as well.
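For illustration, a minimal C++ sketch of the kind of check described above: before a KV block is processed, its slice of the mask is scanned, and if every entry is -INF the block's QK^T, softmax and V accumulation are skipped entirely. The function name and data layout here are hypothetical, not the actual Metal (or CUDA) implementation:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Returns true if the mask slice for one KV block is entirely -INF,
// i.e. the whole block is masked out and its QK^T / softmax / V
// accumulation can be skipped. Layout and names are hypothetical.
bool block_fully_masked(const float * mask_row, int block_start, int block_size) {
    for (int j = 0; j < block_size; ++j) {
        if (mask_row[block_start + j] != -INFINITY) {
            return false; // at least one visible position -> must compute the block
        }
    }
    return true;
}

int main() {
    // toy causal-style mask row: first 4 positions visible, rest masked
    std::vector<float> mask(8, -INFINITY);
    for (int j = 0; j < 4; ++j) mask[j] = 0.0f;

    const int block_size = 4;
    for (int b = 0; b < 2; ++b) {
        if (block_fully_masked(mask.data(), b*block_size, block_size)) {
            printf("block %d: skipped\n", b);
        } else {
            printf("block %d: computed\n", b);
        }
    }
    return 0;
}
```

With causal masks at large contexts, many KV blocks are fully masked for a given set of queries, so the earlier this check happens in the kernel, the more work is avoided.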

ggerganov merged commit 841f27a into master Nov 8, 2024 (1 check passed)
ggerganov added a commit that referenced this pull request Nov 8, 2024.
arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15 and Nov 18, 2024, carrying the following commit message:

* ggml : add ggml_flash_attn_ext_get_prec
* metal : use F16 precision in FA kernels
* metal : minor clean-up
* metal : compile-guard bf16 FA kernels
* build : remove obsolete compile flag [no ci]
* metal : prevent int overflows [no ci]
* cuda : disable BF16 FA
* metal : fix BF16 requirement for FA kernels
* make : clean-up [no ci]
Labels: examples · ggml (changes relating to the ggml tensor library for machine learning) · Nvidia GPU (issues specific to Nvidia GPUs) · testing (everything test related)