metal : enable mat-vec kernels for bs <= 4 #10491

Merged
ggerganov merged 1 commit into master from gg/metal-enable-mv on Nov 25, 2024

ggerganov
Owner

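This should improve parallel performance in most cases for batch sizes (BS) up to 4. In the M1 Pro results below, the exceptions are Q8_0 at pp4 on the 7B and 13B models, which regress to 0.83x and 0.77x of master. For BS in [4, 32), we should try to implement a dedicated small-batch mat-mat multiplication kernel.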
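
As a rough illustration of the dispatch idea described above, here is a minimal conceptual sketch in Python. It is not the Metal backend's code: the names `MV_MAX_BATCH`, `mat_vec`, `mat_mat`, and `mul_mat` are hypothetical, and the threshold of 4 is assumed from the PR title. Both paths compute the same result; the sketch only shows where the batch-size decision would sit.

```python
# Conceptual sketch only: hypothetical names, not the actual Metal backend code.
MV_MAX_BATCH = 4  # assumption: threshold taken from the PR title ("bs <= 4")


def mat_vec(weights, x):
    """Multiply a (rows x cols) matrix by a single vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mat_mat(weights, batch):
    """Stand-in for a tiled mat-mat kernel that handles all vectors together."""
    return [[sum(w * v for w, v in zip(row, x)) for row in weights] for x in batch]


def mul_mat(weights, batch):
    """Choose a kernel by batch size and return (kernel_name, outputs)."""
    if len(batch) <= MV_MAX_BATCH:
        # Small batch: run the mat-vec path once per input vector.
        return "mat-vec", [mat_vec(weights, x) for x in batch]
    # Larger batch: fall back to the mat-mat path.
    return "mat-mat", mat_mat(weights, batch)


kernel, out = mul_mat([[1, 2], [3, 4]], [[1, 0], [0, 1]])
print(kernel, out)  # mat-vec [[1, 3], [2, 4]]
```

The benchmark numbers below are consistent with this picture: on master, pp2 is slower than pp1 for every model, which suggests that the path used for BS >= 2 was a poor fit for very small batches.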

M1 Pro

| Model | Test | t/s master | t/s gg/metal-enable-mv | Speedup |
| --- | --- | ---: | ---: | ---: |
| llama 13B Q4_0 | pp1 | 21.34 | 21.24 | 1.00 |
| llama 13B Q4_0 | pp2 | 9.11 | 23.90 | 2.62 |
| llama 13B Q4_0 | pp3 | 13.63 | 25.05 | 1.84 |
| llama 13B Q4_0 | pp4 | 18.08 | 25.55 | 1.41 |
| llama 13B Q8_0 | pp1 | 12.26 | 12.28 | 1.00 |
| llama 13B Q8_0 | pp2 | 8.99 | 13.22 | 1.47 |
| llama 13B Q8_0 | pp3 | 13.46 | 13.54 | 1.01 |
| llama 13B Q8_0 | pp4 | 17.81 | 13.76 | 0.77 |
| llama 1B Q4_0 | pp1 | 140.73 | 140.13 | 1.00 |
| llama 1B Q4_0 | pp2 | 83.45 | 205.97 | 2.47 |
| llama 1B Q4_0 | pp3 | 126.71 | 243.25 | 1.92 |
| llama 1B Q4_0 | pp4 | 165.67 | 268.95 | 1.62 |
| llama 1B Q4_K_M | pp1 | 124.23 | 123.22 | 0.99 |
| llama 1B Q4_K_M | pp2 | 71.16 | 169.52 | 2.38 |
| llama 1B Q4_K_M | pp3 | 106.28 | 194.58 | 1.83 |
| llama 1B Q4_K_M | pp4 | 141.72 | 211.09 | 1.49 |
| llama 1B Q8_0 | pp1 | 103.76 | 104.65 | 1.01 |
| llama 1B Q8_0 | pp2 | 81.48 | 151.71 | 1.86 |
| llama 1B Q8_0 | pp3 | 121.93 | 178.10 | 1.46 |
| llama 1B Q8_0 | pp4 | 163.86 | 196.14 | 1.20 |
| llama 3B Q4_0 | pp1 | 68.73 | 68.55 | 1.00 |
| llama 3B Q4_0 | pp2 | 33.73 | 90.74 | 2.69 |
| llama 3B Q4_0 | pp3 | 50.20 | 100.87 | 2.01 |
| llama 3B Q4_0 | pp4 | 66.39 | 107.97 | 1.63 |
| llama 3B Q4_K_M | pp1 | 58.96 | 59.72 | 1.01 |
| llama 3B Q4_K_M | pp2 | 28.75 | 72.73 | 2.53 |
| llama 3B Q4_K_M | pp3 | 42.71 | 78.18 | 1.83 |
| llama 3B Q4_K_M | pp4 | 56.78 | 82.13 | 1.45 |
| llama 3B Q8_0 | pp1 | 44.68 | 44.61 | 1.00 |
| llama 3B Q8_0 | pp2 | 33.16 | 57.12 | 1.72 |
| llama 3B Q8_0 | pp3 | 49.16 | 62.74 | 1.28 |
| llama 3B Q8_0 | pp4 | 65.37 | 65.98 | 1.01 |
| llama 7B Q4_0 | pp1 | 39.10 | 39.14 | 1.00 |
| llama 7B Q4_0 | pp2 | 17.03 | 46.67 | 2.74 |
| llama 7B Q4_0 | pp3 | 25.33 | 48.92 | 1.93 |
| llama 7B Q4_0 | pp4 | 33.75 | 51.09 | 1.51 |
| llama 7B Q4_K_M | pp1 | 32.36 | 32.50 | 1.00 |
| llama 7B Q4_K_M | pp2 | 14.68 | 36.26 | 2.47 |
| llama 7B Q4_K_M | pp3 | 21.85 | 37.72 | 1.73 |
| llama 7B Q4_K_M | pp4 | 29.12 | 38.74 | 1.33 |
| llama 7B Q8_0 | pp1 | 23.20 | 23.18 | 1.00 |
| llama 7B Q8_0 | pp2 | 16.66 | 25.76 | 1.55 |
| llama 7B Q8_0 | pp3 | 24.80 | 27.10 | 1.09 |
| llama 7B Q8_0 | pp4 | 33.06 | 27.59 | 0.83 |
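
The Speedup column in the table above is the PR throughput divided by the master throughput. The following is a small sketch of that arithmetic on three rows copied from the table, plus a helper that flags regressions. The helper names and the 2% `tolerance` are illustrative assumptions, not part of the PR.

```python
# Three (model, test, master t/s, PR t/s) rows copied from the M1 Pro table.
ROWS = [
    ("llama 13B Q4_0", "pp2", 9.11, 23.90),
    ("llama 13B Q8_0", "pp4", 17.81, 13.76),
    ("llama 7B Q8_0", "pp4", 33.06, 27.59),
]


def speedup(master_tps, pr_tps):
    """Speedup as reported in the table: PR t/s divided by master t/s."""
    return round(pr_tps / master_tps, 2)


def regressions(rows, tolerance=0.02):
    """Return (model, test, speedup) for rows slower than master.

    `tolerance` is a hypothetical noise margin: speedups within 2% of 1.00
    are treated as unchanged rather than as regressions.
    """
    flagged = []
    for model, test, master_tps, pr_tps in rows:
        s = speedup(master_tps, pr_tps)
        if s < 1.0 - tolerance:
            flagged.append((model, test, s))
    return flagged


for model, test, master_tps, pr_tps in ROWS:
    print(model, test, speedup(master_tps, pr_tps))
# llama 13B Q4_0 pp2 2.62
# llama 13B Q8_0 pp4 0.77
# llama 7B Q8_0 pp4 0.83

print(regressions(ROWS))
# [('llama 13B Q8_0', 'pp4', 0.77), ('llama 7B Q8_0', 'pp4', 0.83)]
```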

M2 Ultra

| model | size | backend | fa | test | master t/s | PR t/s | speedup |
| --- | ---: | --- | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | Metal,BLAS | 1 | pp1 | 296.76 ± 9.24 | 293.65 ± 13.14 | 0.99 |
| llama 1B Q4_0 | 727.75 MiB | Metal,BLAS | 1 | pp1 | 302.04 ± 2.50 | 300.45 ± 3.23 | 0.99 |
| llama 1B Q4_0 | 727.75 MiB | Metal,BLAS | 1 | pp2 | 141.30 ± 0.27 | 514.87 ± 2.20 | 3.64 |
| llama 1B Q4_0 | 727.75 MiB | Metal,BLAS | 1 | pp3 | 211.22 ± 0.29 | 666.60 ± 3.07 | 3.16 |
| llama 1B Q4_0 | 727.75 MiB | Metal,BLAS | 1 | pp4 | 280.21 ± 0.38 | 814.19 ± 3.36 | 2.91 |
| llama 1B Q4_K | 762.81 MiB | Metal,BLAS | 1 | pp1 | 279.30 ± 1.25 | 278.45 ± 2.42 | 1.00 |
| llama 1B Q4_K | 762.81 MiB | Metal,BLAS | 1 | pp1 | 280.32 ± 1.11 | 279.78 ± 1.07 | 1.00 |
| llama 1B Q4_K | 762.81 MiB | Metal,BLAS | 1 | pp2 | 118.66 ± 0.15 | 456.58 ± 4.22 | 3.85 |
| llama 1B Q4_K | 762.81 MiB | Metal,BLAS | 1 | pp3 | 176.87 ± 0.33 | 585.32 ± 2.96 | 3.31 |
| llama 1B Q4_K | 762.81 MiB | Metal,BLAS | 1 | pp4 | 235.74 ± 0.45 | 698.84 ± 2.94 | 2.96 |
| llama 1B Q8_0 | 1.22 GiB | Metal,BLAS | 1 | pp1 | 249.40 ± 1.29 | 248.02 ± 5.09 | 0.99 |
| llama 1B Q8_0 | 1.22 GiB | Metal,BLAS | 1 | pp1 | 249.63 ± 1.13 | 248.38 ± 1.34 | 0.99 |
| llama 1B Q8_0 | 1.22 GiB | Metal,BLAS | 1 | pp2 | 136.21 ± 0.16 | 433.44 ± 1.77 | 3.18 |
| llama 1B Q8_0 | 1.22 GiB | Metal,BLAS | 1 | pp3 | 203.39 ± 0.48 | 575.19 ± 2.44 | 2.83 |
| llama 1B Q8_0 | 1.22 GiB | Metal,BLAS | 1 | pp4 | 271.85 ± 0.31 | 698.79 ± 2.12 | 2.57 |
| llama 3B Q4_0 | 1.78 GiB | Metal,BLAS | 1 | pp1 | 168.07 ± 4.00 | 170.88 ± 0.98 | 1.02 |
| llama 3B Q4_0 | 1.78 GiB | Metal,BLAS | 1 | pp1 | 170.78 ± 0.52 | 171.23 ± 0.93 | 1.00 |
| llama 3B Q4_0 | 1.78 GiB | Metal,BLAS | 1 | pp2 | 65.40 ± 0.05 | 267.68 ± 7.75 | 4.09 |
| llama 3B Q4_0 | 1.78 GiB | Metal,BLAS | 1 | pp3 | 97.25 ± 0.07 | 344.88 ± 1.20 | 3.55 |
| llama 3B Q4_0 | 1.78 GiB | Metal,BLAS | 1 | pp4 | 127.10 ± 0.12 | 365.76 ± 5.55 | 2.88 |
| llama 3B Q4_K | 1.87 GiB | Metal,BLAS | 1 | pp1 | 154.97 ± 1.63 | 154.93 ± 2.47 | 1.00 |
| llama 3B Q4_K | 1.87 GiB | Metal,BLAS | 1 | pp1 | 155.25 ± 0.81 | 154.21 ± 0.81 | 0.99 |
| llama 3B Q4_K | 1.87 GiB | Metal,BLAS | 1 | pp2 | 54.41 ± 0.18 | 234.32 ± 1.12 | 4.31 |
| llama 3B Q4_K | 1.87 GiB | Metal,BLAS | 1 | pp3 | 81.35 ± 0.05 | 283.35 ± 0.86 | 3.48 |
| llama 3B Q4_K | 1.87 GiB | Metal,BLAS | 1 | pp4 | 105.25 ± 0.08 | 305.98 ± 0.61 | 2.91 |
| llama 3B Q8_0 | 3.18 GiB | Metal,BLAS | 1 | pp1 | 126.75 ± 0.50 | 126.33 ± 0.81 | 1.00 |
| llama 3B Q8_0 | 3.18 GiB | Metal,BLAS | 1 | pp1 | 126.74 ± 0.79 | 126.21 ± 0.70 | 1.00 |
| llama 3B Q8_0 | 3.18 GiB | Metal,BLAS | 1 | pp2 | 63.52 ± 0.10 | 203.61 ± 0.43 | 3.21 |
| llama 3B Q8_0 | 3.18 GiB | Metal,BLAS | 1 | pp3 | 94.01 ± 0.11 | 259.47 ± 0.40 | 2.76 |
| llama 3B Q8_0 | 3.18 GiB | Metal,BLAS | 1 | pp4 | 122.46 ± 0.32 | 276.02 ± 0.46 | 2.25 |
| qwen2 7B Q4_0 | 4.12 GiB | Metal,BLAS | 1 | pp1 | 104.26 ± 1.01 | 106.03 ± 0.81 | 1.02 |
| qwen2 7B Q4_0 | 4.12 GiB | Metal,BLAS | 1 | pp1 | 104.08 ± 0.49 | 105.94 ± 0.28 | 1.02 |
| qwen2 7B Q4_0 | 4.12 GiB | Metal,BLAS | 1 | pp2 | 37.58 ± 0.97 | 155.54 ± 0.37 | 4.14 |
| qwen2 7B Q4_0 | 4.12 GiB | Metal,BLAS | 1 | pp3 | 55.55 ± 1.29 | 189.16 ± 0.50 | 3.41 |
| qwen2 7B Q4_0 | 4.12 GiB | Metal,BLAS | 1 | pp4 | 73.82 ± 1.15 | 205.45 ± 0.28 | 2.78 |
| qwen2 7B Q4_K | 4.36 GiB | Metal,BLAS | 1 | pp1 | 90.01 ± 1.22 | 92.15 ± 0.27 | 1.02 |
| qwen2 7B Q4_K | 4.36 GiB | Metal,BLAS | 1 | pp1 | 90.54 ± 0.16 | 92.30 ± 0.17 | 1.02 |
| qwen2 7B Q4_K | 4.36 GiB | Metal,BLAS | 1 | pp2 | 31.93 ± 0.21 | 122.85 ± 0.13 | 3.85 |
| qwen2 7B Q4_K | 4.36 GiB | Metal,BLAS | 1 | pp3 | 47.51 ± 0.27 | 142.57 ± 0.19 | 3.00 |
| qwen2 7B Q4_K | 4.36 GiB | Metal,BLAS | 1 | pp4 | 62.41 ± 0.39 | 151.44 ± 0.14 | 2.43 |
| qwen2 7B Q8_0 | 7.54 GiB | Metal,BLAS | 1 | pp1 | 70.87 ± 0.16 | 70.69 ± 0.10 | 1.00 |
| qwen2 7B Q8_0 | 7.54 GiB | Metal,BLAS | 1 | pp1 | 70.86 ± 0.11 | 70.67 ± 0.09 | 1.00 |
| qwen2 7B Q8_0 | 7.54 GiB | Metal,BLAS | 1 | pp2 | 37.29 ± 0.02 | 98.53 ± 0.10 | 2.64 |
| qwen2 7B Q8_0 | 7.54 GiB | Metal,BLAS | 1 | pp3 | 55.94 ± 0.04 | 114.95 ± 0.23 | 2.05 |
| qwen2 7B Q8_0 | 7.54 GiB | Metal,BLAS | 1 | pp4 | 73.46 ± 0.08 | 120.04 ± 0.16 | 1.63 |
| qwen2 ?B Q4_0 | 7.93 GiB | Metal,BLAS | 1 | pp1 | 56.01 ± 0.07 | 55.88 ± 0.09 | 1.00 |
| qwen2 ?B Q4_0 | 7.93 GiB | Metal,BLAS | 1 | pp1 | 56.00 ± 0.10 | 55.81 ± 0.07 | 1.00 |
| qwen2 ?B Q4_0 | 7.93 GiB | Metal,BLAS | 1 | pp2 | 19.89 ± 0.01 | 81.40 ± 0.17 | 4.09 |
| qwen2 ?B Q4_0 | 7.93 GiB | Metal,BLAS | 1 | pp3 | 29.62 ± 0.04 | 99.16 ± 0.08 | 3.35 |
| qwen2 ?B Q4_0 | 7.93 GiB | Metal,BLAS | 1 | pp4 | 38.93 ± 0.02 | 106.33 ± 0.06 | 2.73 |
| qwen2 ?B Q4_K | 8.37 GiB | Metal,BLAS | 1 | pp1 | 50.33 ± 0.14 | 50.27 ± 0.15 | 1.00 |
| qwen2 ?B Q4_K | 8.37 GiB | Metal,BLAS | 1 | pp1 | 50.35 ± 0.15 | 50.18 ± 0.10 | 1.00 |
| qwen2 ?B Q4_K | 8.37 GiB | Metal,BLAS | 1 | pp2 | 16.32 ± 0.02 | 67.97 ± 0.07 | 4.16 |
| qwen2 ?B Q4_K | 8.37 GiB | Metal,BLAS | 1 | pp3 | 24.42 ± 0.02 | 78.52 ± 0.05 | 3.22 |
| qwen2 ?B Q4_K | 8.37 GiB | Metal,BLAS | 1 | pp4 | 32.14 ± 0.04 | 82.93 ± 0.06 | 2.58 |
| qwen2 ?B Q8_0 | 14.62 GiB | Metal,BLAS | 1 | pp1 | 35.89 ± 0.30 | 35.91 ± 0.05 | 1.00 |
| qwen2 ?B Q8_0 | 14.62 GiB | Metal,BLAS | 1 | pp1 | 35.71 ± 0.43 | 35.95 ± 0.06 | 1.01 |
| qwen2 ?B Q8_0 | 14.62 GiB | Metal,BLAS | 1 | pp2 | 18.56 ± 0.07 | 49.72 ± 0.07 | 2.68 |
| qwen2 ?B Q8_0 | 14.62 GiB | Metal,BLAS | 1 | pp3 | 27.63 ± 0.10 | 56.11 ± 0.04 | 2.03 |
| qwen2 ?B Q8_0 | 14.62 GiB | Metal,BLAS | 1 | pp4 | 36.27 ± 0.09 | 58.68 ± 0.04 | 1.62 |
| qwen2 ?B Q4_K | 18.48 GiB | Metal,BLAS | 1 | pp1 | 25.99 ± 0.08 | 25.97 ± 0.10 | 1.00 |
| qwen2 ?B Q4_K | 18.48 GiB | Metal,BLAS | 1 | pp1 | 26.03 ± 0.12 | 25.93 ± 0.07 | 1.00 |
| qwen2 ?B Q4_K | 18.48 GiB | Metal,BLAS | 1 | pp2 | 8.41 ± 0.01 | 32.07 ± 0.02 | 3.81 |
| qwen2 ?B Q4_K | 18.48 GiB | Metal,BLAS | 1 | pp3 | 12.56 ± 0.01 | 35.40 ± 0.03 | 2.82 |
| qwen2 ?B Q4_K | 18.48 GiB | Metal,BLAS | 1 | pp4 | 16.59 ± 0.02 | 36.84 ± 0.04 | 2.22 |
| qwen2 ?B Q8_0 | 32.42 GiB | Metal,BLAS | 1 | pp1 | 17.77 ± 0.02 | 17.75 ± 0.03 | 1.00 |
| qwen2 ?B Q8_0 | 32.42 GiB | Metal,BLAS | 1 | pp1 | 17.76 ± 0.03 | 17.76 ± 0.02 | 1.00 |
| qwen2 ?B Q8_0 | 32.42 GiB | Metal,BLAS | 1 | pp2 | 9.75 ± 0.01 | 21.05 ± 0.03 | 2.16 |
| qwen2 ?B Q8_0 | 32.42 GiB | Metal,BLAS | 1 | pp3 | 14.56 ± 0.02 | 22.32 ± 0.04 | 1.53 |
| qwen2 ?B Q8_0 | 32.42 GiB | Metal,BLAS | 1 | pp4 | 19.22 ± 0.02 | 22.77 ± 0.01 | 1.18 |
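
Because the M2 Ultra results carry a ± spread, the speedup can also be given an approximate uncertainty. The sketch below parses two cells from the table and applies standard first-order error propagation for a ratio. It assumes the two measurements are independent and that the ± value is a standard deviation; the function names are hypothetical.

```python
import math


def parse_cell(cell):
    """Split a 'mean ± spread' table cell into two floats."""
    mean, spread = cell.split("±")
    return float(mean), float(spread)


def speedup_with_error(master_cell, pr_cell):
    """Return (speedup, approximate uncertainty) for PR t/s over master t/s.

    Assumes independent measurements, so relative errors add in quadrature.
    """
    master, master_err = parse_cell(master_cell)
    pr, pr_err = parse_cell(pr_cell)
    ratio = pr / master
    rel_err = math.sqrt((master_err / master) ** 2 + (pr_err / pr) ** 2)
    return ratio, ratio * rel_err


# llama 1B Q4_0, pp2 row from the M2 Ultra table.
ratio, err = speedup_with_error("141.30 ± 0.27", "514.87 ± 2.20")
print(f"{ratio:.2f} ± {err:.2f}")
```

For the pp1 rows, where the spread is several percent of the mean in some runs, this kind of estimate helps show that speedups of 0.99 to 1.02 are within measurement noise.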

ggerganov merged commit 106964e into master on Nov 25, 2024
55 checks passed
ggerganov deleted the gg/metal-enable-mv branch on November 25, 2024 19:49
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024