metal : refactor kernel args into structs #10238

Merged Nov 17, 2024 (20 commits)

@ggerganov (Owner) commented Nov 9, 2024

ref #3229

This PR reduces the number of encoder arguments, improves type safety, reduces register pressure, vectorizes some loops, and makes the code style more uniform. It also adds a new ggml-metal-impl.h header, which currently contains the new kernel argument structs.
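
To illustrate the idea behind the kernel argument structs, here is a minimal sketch in plain C. It assumes that scalars which would otherwise be bound to the encoder one by one are packed into a single struct and bound once. `example_kargs_concat` and `example_set_args` are hypothetical names used only for illustration; they are not the actual contents of ggml-metal-impl.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical argument struct for a concat-style kernel. The ne*/nb* names
// follow ggml's convention (element counts / byte strides); the real layout
// in ggml-metal-impl.h may differ.
typedef struct {
    int32_t  ne00, ne01, ne02, ne03;  // source dimensions, in elements
    uint64_t nb00, nb01, nb02, nb03;  // source strides, in bytes
    int32_t  dim;                     // axis to concatenate along
} example_kargs_concat;

// Stand-in for one "set bytes" call on a compute encoder: copies the whole
// struct into a staging buffer. Returns the number of bytes written, or 0
// if the buffer is too small.
static size_t example_set_args(void * dst, size_t cap, const example_kargs_concat * args) {
    assert(dst != NULL && args != NULL);
    if (cap < sizeof(*args)) {
        return 0;
    }
    memcpy(dst, args, sizeof(*args));
    return sizeof(*args);
}
```

With this layout the host fills one typed struct and uploads it in a single call, and the kernel side can receive one constant struct instead of a long list of scalar parameters, so a mismatch in argument order or type is caught by the compiler rather than at run time.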

./scripts/compare-commits.sh master gg/metal-refactor-args -m ./models/llama-3.2-3b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -m ./models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -m ./models/llama-3.1-8b/ggml-model-q8_0.gguf -m ./models/llama-3.1-8b/ggml-model-q4_k.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-f16.gguf -m ./models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -m models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-7b-coder/ggml-model-q4_k.gguf -m models/qwen2.5-1.5b-coder/ggml-model-f16.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/gemma-2-2b/ggml-model-q8_0.gguf -m models/gemma-2-2b/ggml-model-f16.gguf -m models/gemma-2-9b/ggml-model-q5_k.gguf -m models/gemma-2-9b/ggml-model-q8_0.gguf -m models/falcon-7b/ggml-model-q8_0.gguf -m models/falcon-7b/ggml-model-q4_0.gguf -m models/mixtral-instruct-8x7b-fast/ggml-model-q8_0.gguf -m models/mixtral-instruct-8x7b-fast/ggml-model-q4_k.gguf -fa 1 -p 1,1,2,4,8,511,512

| CPU | Model | Test | t/s master | t/s gg/metal-refactor-args | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| M2 Ultra | falcon 7B Q4_0 | pp1 | 88.09 | 96.71 | 1.10 |
| M2 Ultra | falcon 7B Q4_0 | pp2 | 43.40 | 44.39 | 1.02 |
| M2 Ultra | falcon 7B Q4_0 | pp4 | 86.45 | 89.63 | 1.04 |
| M2 Ultra | falcon 7B Q4_0 | pp8 | 171.50 | 178.94 | 1.04 |
| M2 Ultra | falcon 7B Q4_0 | pp511 | 1277.79 | 1378.13 | 1.08 |
| M2 Ultra | falcon 7B Q4_0 | pp512 | 1309.29 | 1382.21 | 1.06 |
| M2 Ultra | falcon 7B Q4_0 | tg128 | 89.03 | 95.97 | 1.08 |
| M2 Ultra | falcon 7B Q8_0 | pp1 | 63.35 | 66.77 | 1.05 |
| M2 Ultra | falcon 7B Q8_0 | pp2 | 42.06 | 42.99 | 1.02 |
| M2 Ultra | falcon 7B Q8_0 | pp4 | 83.82 | 85.96 | 1.03 |
| M2 Ultra | falcon 7B Q8_0 | pp8 | 167.02 | 171.28 | 1.03 |
| M2 Ultra | falcon 7B Q8_0 | pp511 | 1317.34 | 1362.11 | 1.03 |
| M2 Ultra | falcon 7B Q8_0 | pp512 | 1322.26 | 1364.39 | 1.03 |
| M2 Ultra | falcon 7B Q8_0 | tg128 | 63.06 | 66.34 | 1.05 |
| M2 Ultra | gemma2 2B F16 | pp1 | 80.72 | 82.66 | 1.02 |
| M2 Ultra | gemma2 2B F16 | pp2 | 66.60 | 68.37 | 1.03 |
| M2 Ultra | gemma2 2B F16 | pp4 | 120.14 | 123.02 | 1.02 |
| M2 Ultra | gemma2 2B F16 | pp8 | 238.58 | 244.16 | 1.02 |
| M2 Ultra | gemma2 2B F16 | pp511 | 3813.89 | 3892.16 | 1.02 |
| M2 Ultra | gemma2 2B F16 | pp512 | 3858.75 | 3918.02 | 1.02 |
| M2 Ultra | gemma2 2B F16 | tg128 | 80.03 | 82.40 | 1.03 |
| M2 Ultra | gemma2 2B Q8_0 | pp1 | 116.57 | 124.96 | 1.07 |
| M2 Ultra | gemma2 2B Q8_0 | pp2 | 64.99 | 66.54 | 1.02 |
| M2 Ultra | gemma2 2B Q8_0 | pp4 | 117.47 | 119.98 | 1.02 |
| M2 Ultra | gemma2 2B Q8_0 | pp8 | 233.29 | 238.34 | 1.02 |
| M2 Ultra | gemma2 2B Q8_0 | pp511 | 3478.38 | 3606.00 | 1.04 |
| M2 Ultra | gemma2 2B Q8_0 | pp512 | 3504.97 | 3616.49 | 1.03 |
| M2 Ultra | gemma2 2B Q8_0 | tg128 | 115.11 | 123.91 | 1.08 |
| M2 Ultra | gemma2 9B Q5_K_M | pp1 | 48.06 | 50.39 | 1.05 |
| M2 Ultra | gemma2 9B Q5_K_M | pp2 | 21.91 | 22.27 | 1.02 |
| M2 Ultra | gemma2 9B Q5_K_M | pp4 | 41.12 | 41.83 | 1.02 |
| M2 Ultra | gemma2 9B Q5_K_M | pp8 | 81.57 | 82.76 | 1.01 |
| M2 Ultra | gemma2 9B Q5_K_M | pp511 | 801.77 | 814.83 | 1.02 |
| M2 Ultra | gemma2 9B Q5_K_M | pp512 | 804.63 | 817.78 | 1.02 |
| M2 Ultra | gemma2 9B Q5_K_M | tg128 | 48.04 | 50.44 | 1.05 |
| M2 Ultra | gemma2 9B Q8_0 | pp1 | 47.31 | 49.65 | 1.05 |
| M2 Ultra | gemma2 9B Q8_0 | pp2 | 27.04 | 27.72 | 1.03 |
| M2 Ultra | gemma2 9B Q8_0 | pp4 | 50.20 | 51.52 | 1.03 |
| M2 Ultra | gemma2 9B Q8_0 | pp8 | 99.83 | 102.45 | 1.03 |
| M2 Ultra | gemma2 9B Q8_0 | pp511 | 954.23 | 999.97 | 1.05 |
| M2 Ultra | gemma2 9B Q8_0 | pp512 | 958.42 | 1003.39 | 1.05 |
| M2 Ultra | gemma2 9B Q8_0 | tg128 | 47.39 | 49.64 | 1.05 |
| M2 Ultra | llama 1B F16 | pp1 | 161.58 | 163.40 | 1.01 |
| M2 Ultra | llama 1B F16 | pp2 | 135.36 | 136.35 | 1.01 |
| M2 Ultra | llama 1B F16 | pp4 | 270.49 | 274.20 | 1.01 |
| M2 Ultra | llama 1B F16 | pp8 | 537.53 | 545.40 | 1.01 |
| M2 Ultra | llama 1B F16 | pp511 | 8534.40 | 8655.90 | 1.01 |
| M2 Ultra | llama 1B F16 | pp512 | 8606.93 | 8702.97 | 1.01 |
| M2 Ultra | llama 1B F16 | tg128 | 160.39 | 162.14 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | pp1 | 232.00 | 239.52 | 1.03 |
| M2 Ultra | llama 1B Q8_0 | pp2 | 132.53 | 135.50 | 1.02 |
| M2 Ultra | llama 1B Q8_0 | pp4 | 265.20 | 268.84 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | pp8 | 524.21 | 533.90 | 1.02 |
| M2 Ultra | llama 1B Q8_0 | pp511 | 7738.12 | 7833.41 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | pp512 | 7789.26 | 7873.00 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | tg128 | 230.68 | 244.33 | 1.06 |
| M2 Ultra | llama 3B F16 | pp1 | 71.53 | 73.12 | 1.02 |
| M2 Ultra | llama 3B F16 | pp2 | 62.41 | 63.28 | 1.01 |
| M2 Ultra | llama 3B F16 | pp4 | 121.90 | 123.50 | 1.01 |
| M2 Ultra | llama 3B F16 | pp8 | 243.16 | 245.71 | 1.01 |
| M2 Ultra | llama 3B F16 | pp511 | 3218.18 | 3253.24 | 1.01 |
| M2 Ultra | llama 3B F16 | pp512 | 3237.78 | 3282.55 | 1.01 |
| M2 Ultra | llama 3B F16 | tg128 | 72.29 | 73.29 | 1.01 |
| M2 Ultra | llama 3B Q4_0 | pp1 | 156.33 | 166.87 | 1.07 |
| M2 Ultra | llama 3B Q4_0 | pp2 | 63.89 | 65.31 | 1.02 |
| M2 Ultra | llama 3B Q4_0 | pp4 | 124.42 | 125.98 | 1.01 |
| M2 Ultra | llama 3B Q4_0 | pp8 | 246.92 | 250.69 | 1.02 |
| M2 Ultra | llama 3B Q4_0 | pp511 | 2961.31 | 2988.81 | 1.01 |
| M2 Ultra | llama 3B Q4_0 | pp512 | 2986.43 | 3003.08 | 1.01 |
| M2 Ultra | llama 3B Q4_0 | tg128 | 155.86 | 166.60 | 1.07 |
| M2 Ultra | llama 3B Q8_0 | pp1 | 117.06 | 124.45 | 1.06 |
| M2 Ultra | llama 3B Q8_0 | pp2 | 61.76 | 63.02 | 1.02 |
| M2 Ultra | llama 3B Q8_0 | pp4 | 120.03 | 122.31 | 1.02 |
| M2 Ultra | llama 3B Q8_0 | pp8 | 238.00 | 242.28 | 1.02 |
| M2 Ultra | llama 3B Q8_0 | pp511 | 2912.79 | 2953.61 | 1.01 |
| M2 Ultra | llama 3B Q8_0 | pp512 | 2932.91 | 2959.40 | 1.01 |
| M2 Ultra | llama 3B Q8_0 | tg128 | 117.53 | 124.10 | 1.06 |
| M2 Ultra | llama 8B Q4_K_M | pp1 | 85.79 | 90.22 | 1.05 |
| M2 Ultra | llama 8B Q4_K_M | pp2 | 29.44 | 30.17 | 1.02 |
| M2 Ultra | llama 8B Q4_K_M | pp4 | 57.06 | 58.90 | 1.03 |
| M2 Ultra | llama 8B Q4_K_M | pp8 | 112.79 | 117.47 | 1.04 |
| M2 Ultra | llama 8B Q4_K_M | pp511 | 1021.82 | 1131.99 | 1.11 |
| M2 Ultra | llama 8B Q4_K_M | pp512 | 1053.28 | 1135.20 | 1.08 |
| M2 Ultra | llama 8B Q4_K_M | tg128 | 84.64 | 89.79 | 1.06 |
| M2 Ultra | llama 8B Q8_0 | pp1 | 63.94 | 66.25 | 1.04 |
| M2 Ultra | llama 8B Q8_0 | pp2 | 34.15 | 34.77 | 1.02 |
| M2 Ultra | llama 8B Q8_0 | pp4 | 67.41 | 67.92 | 1.01 |
| M2 Ultra | llama 8B Q8_0 | pp8 | 133.95 | 135.85 | 1.01 |
| M2 Ultra | llama 8B Q8_0 | pp511 | 1281.93 | 1295.58 | 1.01 |
| M2 Ultra | llama 8B Q8_0 | pp512 | 1286.93 | 1301.14 | 1.01 |
| M2 Ultra | llama 8B Q8_0 | tg128 | 63.84 | 66.18 | 1.04 |
| M2 Ultra | llama 8x7B Q4_K_M | pp1 | 48.95 | 52.57 | 1.07 |
| M2 Ultra | llama 8x7B Q4_K_M | pp2 | 38.06 | 39.29 | 1.03 |
| M2 Ultra | llama 8x7B Q4_K_M | pp4 | 55.82 | 57.55 | 1.03 |
| M2 Ultra | llama 8x7B Q4_K_M | pp8 | 41.64 | 44.12 | 1.06 |
| M2 Ultra | llama 8x7B Q4_K_M | pp511 | 262.19 | 272.51 | 1.04 |
| M2 Ultra | llama 8x7B Q4_K_M | pp512 | 263.05 | 273.16 | 1.04 |
| M2 Ultra | llama 8x7B Q4_K_M | tg128 | 48.98 | 52.51 | 1.07 |
| M2 Ultra | llama 8x7B Q8_0 | pp1 | 38.34 | 40.02 | 1.04 |
| M2 Ultra | llama 8x7B Q8_0 | pp2 | 31.92 | 33.04 | 1.04 |
| M2 Ultra | llama 8x7B Q8_0 | pp4 | 42.61 | 43.81 | 1.03 |
| M2 Ultra | llama 8x7B Q8_0 | pp8 | 46.21 | 49.96 | 1.08 |
| M2 Ultra | llama 8x7B Q8_0 | pp511 | 273.87 | 289.44 | 1.06 |
| M2 Ultra | llama 8x7B Q8_0 | pp512 | 275.15 | 289.45 | 1.05 |
| M2 Ultra | llama 8x7B Q8_0 | tg128 | 38.13 | 40.00 | 1.05 |
| M2 Ultra | qwen2 1.5B F16 | pp1 | 110.95 | 115.47 | 1.04 |
| M2 Ultra | qwen2 1.5B F16 | pp2 | 87.56 | 90.33 | 1.03 |
| M2 Ultra | qwen2 1.5B F16 | pp4 | 169.73 | 174.13 | 1.03 |
| M2 Ultra | qwen2 1.5B F16 | pp8 | 341.05 | 346.79 | 1.02 |
| M2 Ultra | qwen2 1.5B F16 | pp511 | 6168.27 | 6363.98 | 1.03 |
| M2 Ultra | qwen2 1.5B F16 | pp512 | 6254.78 | 6421.24 | 1.03 |
| M2 Ultra | qwen2 1.5B F16 | tg128 | 110.34 | 114.55 | 1.04 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp1 | 158.59 | 172.74 | 1.09 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp2 | 85.57 | 87.66 | 1.02 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp4 | 166.18 | 169.70 | 1.02 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp8 | 333.19 | 339.16 | 1.02 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp511 | 5566.97 | 5846.50 | 1.05 |
| M2 Ultra | qwen2 1.5B Q8_0 | pp512 | 5622.69 | 5883.03 | 1.05 |
| M2 Ultra | qwen2 1.5B Q8_0 | tg128 | 156.96 | 171.44 | 1.09 |
| M2 Ultra | qwen2 7B Q4_K_M | pp1 | 85.00 | 90.37 | 1.06 |
| M2 Ultra | qwen2 7B Q4_K_M | pp2 | 31.22 | 32.39 | 1.04 |
| M2 Ultra | qwen2 7B Q4_K_M | pp4 | 61.77 | 63.80 | 1.03 |
| M2 Ultra | qwen2 7B Q4_K_M | pp8 | 121.17 | 126.58 | 1.04 |
| M2 Ultra | qwen2 7B Q4_K_M | pp511 | 1085.06 | 1223.95 | 1.13 |
| M2 Ultra | qwen2 7B Q4_K_M | pp512 | 1113.67 | 1227.04 | 1.10 |
| M2 Ultra | qwen2 7B Q4_K_M | tg128 | 85.34 | 90.79 | 1.06 |
| M2 Ultra | qwen2 7B Q8_0 | pp1 | 67.31 | 70.27 | 1.04 |
| M2 Ultra | qwen2 7B Q8_0 | pp2 | 36.24 | 37.13 | 1.02 |
| M2 Ultra | qwen2 7B Q8_0 | pp4 | 69.89 | 73.25 | 1.05 |
| M2 Ultra | qwen2 7B Q8_0 | pp8 | 140.26 | 146.21 | 1.04 |
| M2 Ultra | qwen2 7B Q8_0 | pp511 | 1280.45 | 1407.70 | 1.10 |
| M2 Ultra | qwen2 7B Q8_0 | pp512 | 1300.28 | 1413.38 | 1.09 |
| M2 Ultra | qwen2 7B Q8_0 | tg128 | 67.22 | 69.92 | 1.04 |

./scripts/compare-commits.sh master gg/metal-refactor-args -m models/qwen2.5-1.5b-coder/ggml-model-f16.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q4_k.gguf -m models/llama-3.2-3b-instruct/ggml-model-f16.gguf -m models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -m models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -fa 1 -p 1,1,2,4,8,511,512

| CPU | Model | Test | t/s master | t/s gg/metal-refactor-args | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| M1 Pro | llama 3B F16 | pp1 | 25.27 | 25.40 | 1.00 |
| M1 Pro | llama 3B F16 | pp2 | 34.22 | 34.74 | 1.02 |
| M1 Pro | llama 3B F16 | pp4 | 67.66 | 68.53 | 1.01 |
| M1 Pro | llama 3B F16 | pp8 | 134.09 | 135.72 | 1.01 |
| M1 Pro | llama 3B F16 | pp511 | 731.83 | 741.24 | 1.01 |
| M1 Pro | llama 3B F16 | pp512 | 737.02 | 744.88 | 1.01 |
| M1 Pro | llama 3B F16 | tg128 | 25.37 | 25.48 | 1.00 |
| M1 Pro | llama 3B Q4_0 | pp1 | 65.97 | 68.54 | 1.04 |
| M1 Pro | llama 3B Q4_0 | pp2 | 33.31 | 33.80 | 1.01 |
| M1 Pro | llama 3B Q4_0 | pp4 | 65.85 | 66.33 | 1.01 |
| M1 Pro | llama 3B Q4_0 | pp8 | 130.19 | 131.41 | 1.01 |
| M1 Pro | llama 3B Q4_0 | pp511 | 667.66 | 672.42 | 1.01 |
| M1 Pro | llama 3B Q4_0 | pp512 | 672.81 | 677.29 | 1.01 |
| M1 Pro | llama 3B Q4_0 | tg128 | 66.73 | 68.96 | 1.03 |
| M1 Pro | llama 3B Q8_0 | pp1 | 43.82 | 44.87 | 1.02 |
| M1 Pro | llama 3B Q8_0 | pp2 | 32.09 | 33.07 | 1.03 |
| M1 Pro | llama 3B Q8_0 | pp4 | 64.26 | 65.19 | 1.01 |
| M1 Pro | llama 3B Q8_0 | pp8 | 127.64 | 129.10 | 1.01 |
| M1 Pro | llama 3B Q8_0 | pp511 | 658.20 | 667.15 | 1.01 |
| M1 Pro | llama 3B Q8_0 | pp512 | 662.64 | 670.39 | 1.01 |
| M1 Pro | llama 3B Q8_0 | tg128 | 44.18 | 45.06 | 1.02 |
| M1 Pro | qwen2 1.5B F16 | pp1 | 48.25 | 48.52 | 1.01 |
| M1 Pro | qwen2 1.5B F16 | pp2 | 55.38 | 56.76 | 1.02 |
| M1 Pro | qwen2 1.5B F16 | pp4 | 108.02 | 110.75 | 1.03 |
| M1 Pro | qwen2 1.5B F16 | pp8 | 213.75 | 218.31 | 1.02 |
| M1 Pro | qwen2 1.5B F16 | pp511 | 1484.57 | 1526.90 | 1.03 |
| M1 Pro | qwen2 1.5B F16 | pp512 | 1497.15 | 1534.10 | 1.02 |
| M1 Pro | qwen2 1.5B F16 | tg128 | 48.48 | 48.81 | 1.01 |
| M1 Pro | qwen2 1.5B Q4_K_M | pp1 | 80.56 | 85.78 | 1.06 |
| M1 Pro | qwen2 1.5B Q4_K_M | pp2 | 44.78 | 45.57 | 1.02 |
| M1 Pro | qwen2 1.5B Q4_K_M | pp4 | 88.09 | 89.46 | 1.02 |
| M1 Pro | qwen2 1.5B Q4_K_M | pp8 | 175.09 | 177.96 | 1.02 |
| M1 Pro | qwen2 1.5B Q4_K_M | pp511 | 1164.96 | 1193.32 | 1.02 |
| M1 Pro | qwen2 1.5B Q4_K_M | pp512 | 1171.93 | 1197.36 | 1.02 |
| M1 Pro | qwen2 1.5B Q4_K_M | tg128 | 81.29 | 85.89 | 1.06 |
| M1 Pro | qwen2 1.5B Q8_0 | pp1 | 76.47 | 80.04 | 1.05 |
| M1 Pro | qwen2 1.5B Q8_0 | pp2 | 51.00 | 53.27 | 1.04 |
| M1 Pro | qwen2 1.5B Q8_0 | pp4 | 100.55 | 103.76 | 1.03 |
| M1 Pro | qwen2 1.5B Q8_0 | pp8 | 198.72 | 205.87 | 1.04 |
| M1 Pro | qwen2 1.5B Q8_0 | pp511 | 1316.49 | 1379.49 | 1.05 |
| M1 Pro | qwen2 1.5B Q8_0 | pp512 | 1326.27 | 1389.91 | 1.05 |
| M1 Pro | qwen2 1.5B Q8_0 | tg128 | 77.04 | 80.16 | 1.04 |
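
The Speedup column in the results above appears to be the branch throughput divided by the master throughput, rounded to two decimals. The following C sketch reproduces that arithmetic; `example_speedup_hundredths` is a hypothetical helper, not part of `compare-commits.sh`, and the script may compute the ratio from unrounded measurements, so a recomputed value can differ in the last digit.

```c
#include <assert.h>

// Ratio of new to baseline throughput, rounded to the nearest hundredth and
// returned as an integer (110 means a 1.10x speedup). Positive inputs only.
static long example_speedup_hundredths(double base_tps, double new_tps) {
    assert(base_tps > 0.0 && new_tps > 0.0);
    return (long) (new_tps / base_tps * 100.0 + 0.5);
}
```

For the M2 Ultra falcon 7B Q4_0 pp1 row, `example_speedup_hundredths(88.09, 96.71)` evaluates 96.71 / 88.09 ≈ 1.098, which rounds to the 1.10 shown in the results.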

TODO:

  • move structs to new header
  • GGML_OP_CONCAT
  • GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV
  • GGML_OP_REPEAT
  • GGML_OP_ACC, GGML_OP_CPY, GGML_OP_CONT, GGML_OP_DUP
  • GGML_OP_ROPE
  • GGML_OP_FLASH_ATTN_EXT
  • GGML_OP_MUL_MAT
  • GGML_OP_MUL_MAT_ID
  • GGML_OP_RMS_NORM
  • GGML_OP_NORM
  • the remaining ops are less important and will be done in a follow-up PR

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 9, 2024
@ggerganov marked this pull request as ready for review on November 10, 2024 16:31
@ggerganov force-pushed the gg/metal-refactor-args branch from ab6a3b7 to 86ed72d on November 12, 2024 09:46
@ggerganov (Owner, Author) commented

@slaren We can merge this after #10256 to avoid resolving conflicts. I can finish refactoring the remaining ops in the meantime anyway.

@slaren (Collaborator) commented Nov 14, 2024

#10256 should be ready to merge already.

@ggerganov requested a review from Copilot on November 15, 2024 19:55

Copilot reviewed 2 out of 5 changed files in this pull request and generated no suggestions.

Files not reviewed (3)
  • Makefile: Language not supported
  • ggml/src/CMakeLists.txt: Language not supported
  • ggml/src/ggml-metal-impl.h: Language not supported
@ggerganov force-pushed the gg/metal-refactor-args branch from 86ed72d to a112eb4 on November 17, 2024 08:21
@ggerganov merged commit cf32a9b into master on Nov 17, 2024
61 checks passed
@ggerganov deleted the gg/metal-refactor-args branch on November 17, 2024 09:23
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* metal : add kernel arg structs (wip)

* metal : fattn args

ggml-ci

* metal : cont + avoid potential int overflow [no ci]

* metal : mul mat struct (wip)

* cont : mul mat vec

* cont : pass by reference

* cont : args is first argument

* cont : use char ptr

* cont : shmem style

* cont : thread counters style

* cont : mul mm id

ggml-ci

* cont : int safety + register optimizations

ggml-ci

* metal : GGML_OP_CONCAT

ggml-ci

* metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV

* metal : GGML_OP_REPEAT

* metal : GGML_OP_CPY

* metal : GGML_OP_RMS_NORM

* metal : GGML_OP_NORM

* metal : add TODOs for rest of ops

* ggml : add ggml-metal-impl.h

ggml-ci
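
The commit log above mentions "avoid potential int overflow" and "int safety" without saying where the overflow could occur. As a hedged illustration of the general problem, the C sketch below assumes it concerns byte offsets computed from an index and a stride: the product can exceed 32 bits even when each operand fits. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Narrow version: the multiplication is done in 32 bits and silently wraps
// modulo 2^32 for large tensors (unsigned arithmetic is used here so the
// wrap is well-defined in C).
static uint32_t example_offset_narrow(int32_t i3, uint32_t nb3) {
    assert(i3 >= 0);
    return (uint32_t) i3 * nb3;
}

// Wide version: the index is widened before multiplying, so the full
// 64-bit byte offset is preserved.
static uint64_t example_offset_wide(int32_t i3, uint64_t nb3) {
    assert(i3 >= 0);
    return (uint64_t) i3 * nb3;
}
```

With a stride of 1 GiB (1073741824 bytes) and index 5, the wide version yields 5368709120, while the narrow version wraps around to 1073741824 and would address the wrong memory.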