Introduce ggml_syncthreads() #7455

Open · jart wants to merge 5 commits into master from thread
Conversation

@jart (Contributor) commented May 22, 2024

This change is an alternative proposal to #1507 and #6915. This pull request contains three commits, each detailing one of the three main synchronization tricks I used to make inference go ~6% faster in llamafile, while also speeding up prompt processing.

commit ebbc728e37d5d01e835603b1780ce8cc93bc50d0 (HEAD -> thread, jart/thread)
Author: Justine Tunney <jtunney@mozilla.com>
Date:   Wed May 22 00:35:13 2024 -0700

    Make atomic operations explicit

commit 8435ab0ae8adb6c7f61bffac690314b9bbb4cfdd
Author: Justine Tunney <jtunney@mozilla.com>
Date:   Wed May 22 00:31:28 2024 -0700

    Avoid INIT synchronization barrier when possible

    This change makes inference go ~5% faster for me.

commit 7deec14bd994d05d8b7fd6bbcb510c900dd56f36
Author: Justine Tunney <jtunney@mozilla.com>
Date:   Wed May 22 00:20:24 2024 -0700

    Make MUL_MAT initialization go fast

    Using an atomic to delegate slices of the matrix to separate threads is
    slow, because all the threads have to contend for the same memory spot.
    The right thing to do here is use a `chore` variable, where all threads
    perform the same computation independently.

    This change introduces the ggml_once() and ggml_syncthreads() functions,
    which work the exact same way as in CUDA. This is nice, since it means if
    BLAS or LLAMAFILE doesn't need `B` requantized, then it can skip paying
    for the synchronization barrier between the INIT and COMPUTE phases. We
    can refactor further along this path, to remove all the INIT/COMP/FINAL
    code too. All ops should be in charge of their own synchronization.

This change only dips a toe in the water, so to speak, by introducing CUDA-style primitives for CPU mode. You can actually go much further in this technical direction. If you do, I'll be very grateful, since it will mean GGML is more aligned with llamafile and therefore much easier for me to import changes from.
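
To make the CUDA analogy concrete, here is a minimal sketch of what a once-guard, a reusable spin barrier, and "chore"-style static partitioning by thread index can look like with C11 atomics. This is illustrative only, not the code in this PR: `ggml_barrier`, `mul_mat_slice`, and the row-partitioning scheme are assumptions made for the example.

```c
#include <stdatomic.h>

struct ggml_barrier {
    atomic_int count;   // how many threads have arrived in the current round
    atomic_int phase;   // incremented each time the barrier opens
    int        nth;     // number of threads in the pool
};

// Reusable barrier: every thread blocks until all nth threads have arrived.
static void ggml_syncthreads(struct ggml_barrier *b) {
    int phase = atomic_load_explicit(&b->phase, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&b->count, 1, memory_order_acq_rel) + 1 == b->nth) {
        // last thread to arrive: reset the counter and release the others
        atomic_store_explicit(&b->count, 0, memory_order_relaxed);
        atomic_store_explicit(&b->phase, phase + 1, memory_order_release);
    } else {
        while (atomic_load_explicit(&b->phase, memory_order_acquire) == phase) {
            // spin (a real implementation would yield or back off here)
        }
    }
}

// Run an initializer exactly once across the pool; callers that depend on its
// result are expected to call ggml_syncthreads() afterwards, which makes the
// initialized data visible to every thread.
static void ggml_once(atomic_flag *flag, void (*init)(void)) {
    if (!atomic_flag_test_and_set_explicit(flag, memory_order_acq_rel)) {
        init();
    }
}

// "Chore"-style work split for C = A * B: thread ith owns a fixed block of
// rows, so no shared atomic counter is needed to hand out slices.
static void mul_mat_slice(float *dst, const float *a, const float *b,
                          int nrows, int ncols, int k, int ith, int nth) {
    int per = (nrows + nth - 1) / nth;
    int ir0 = ith * per;
    int ir1 = ir0 + per < nrows ? ir0 + per : nrows;
    for (int i = ir0; i < ir1; i++) {
        for (int j = 0; j < ncols; j++) {
            float sum = 0.0f;
            for (int kk = 0; kk < k; kk++) {
                sum += a[i*k + kk] * b[kk*ncols + j];
            }
            dst[i*ncols + j] = sum;
        }
    }
}
```

With primitives like these, each worker runs any INIT work through the once-guard, pays for the barrier only when the op actually needs the initialized data, and then computes its own fixed slice without contending on a shared counter.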

jart added 5 commits May 21, 2024 23:51
@mofosyne added the performance (Speed related topics) and Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level) labels on May 22, 2024
@ggerganov (Owner)

On M2 Ultra I get these results:

./scripts/compare-commits.sh master pr/7455 \
  -m models/mistral-7b-v0.2/ggml-model-fp16.gguf \
  -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf \
  -m models/mistral-7b-v0.2/ggml-model-q4_0.gguf -t 16 -ngl 0
| CPU      | Model         | Model Size [GiB] | Test         | t/s master | t/s pr/7455 | Speedup |
| -------- | ------------- | ---------------- | ------------ | ---------- | ----------- | ------- |
| M2 Ultra | llama 7B F16  | 13.49            | pp512        | 164.05     | 161.90      | 0.99    |
| M2 Ultra | llama 7B F16  | 13.49            | tg128        | 15.22      | 15.55       | 1.02    |
| M2 Ultra | llama 7B F16  | 13.49            | pp512+tg128  | 54.77      | 55.58       | 1.01    |
| M2 Ultra | llama 7B Q4_0 | 3.83             | pp512        | 160.76     | 132.11      | 0.82    |
| M2 Ultra | llama 7B Q4_0 | 3.83             | tg128        | 38.34      | 38.83       | 1.01    |
| M2 Ultra | llama 7B Q4_0 | 3.83             | pp512+tg128  | 96.01      | 90.40       | 0.94    |
| M2 Ultra | llama 7B Q8_0 | 7.17             | pp512        | 157.36     | 157.93      | 1.00    |
| M2 Ultra | llama 7B Q8_0 | 7.17             | tg128        | 26.06      | 26.07       | 1.00    |
| M2 Ultra | llama 7B Q8_0 | 7.17             | pp512+tg128  | 76.68      | 75.47       | 0.98    |

Which model did you use to benchmark the performance?

@jart (Contributor, Author) commented May 22, 2024

This change doesn't move the needle on ARM; I'm only seeing speedups on x86. I notice the biggest gains with really tiny models, where synchronization actually is a noticeable bottleneck in matrix multiplication. For example, prompt processing with a 265 MB embedding model:

* 572 -> 662 w/ -t  8 on Intel i9-14900K     w/ mxbai-embed-large-v1.Q6_K.gguf
* 484 -> 482 w/ -t 16 on Apple M2            w/ mxbai-embed-large-v1.Q6_K.gguf
* 740 -> 988 w/ -t 20 on Threadripper 7995WX w/ mxbai-embed-large-v1.Q6_K.gguf
* 733 -> 933 w/ -t 16 on Threadripper 7995WX w/ mxbai-embed-large-v1.Q6_K.gguf
* 644 -> 742 w/ -t 96 on Threadripper 7995WX w/ mxbai-embed-large-v1.Q6_K.gguf

In this case the gain can be up to 35%.

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 535 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8733.46ms p(95)=20701.48ms fails=, finish reason: stop=474 truncated=61
  • Prompt processing (pp): avg=96.91tk/s p(95)=423.75tk/s
  • Token generation (tg): avg=71.99tk/s p(95)=46.03tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=thread commit=3cb42757ea4198062b38d4c94d4c60aea03d00e9

[Benchmark charts omitted: time-series plots of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 535 iterations".]

@kunnis (Contributor) commented May 26, 2024

The point of #6915 is that not all tasks will finish at the same speed, especially on larger models. If there's a situation that it doesn't handle well, find the comment that starts with "If the chunking is poor for the number of threads on this setup, scrap the whole plan. Re-chunk it by thread." If you can detect the situation there, that should avoid using the atomic for syncing. Would a higher chunk_size resolve the issue? Perhaps there needs to be better selection logic for chunk_size.

I do like what you did with ggml_syncthreads. It needs to support exponential back-off.
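
Exponential back-off here means that a spinning waiter pauses for progressively longer intervals between checks of the shared flag, so heavily contended barriers waste less cache traffic. A minimal sketch of that idea follows; it is illustrative only (assuming x86 and the `_mm_pause()` intrinsic), not code from this PR.

```c
#include <immintrin.h>
#include <stdatomic.h>

// Spin until *phase changes from the value we saw on arrival, pausing for
// exponentially longer stretches each time around the loop.
static void wait_for_phase_change(atomic_int *phase, int seen) {
    int pause = 1;
    while (atomic_load_explicit(phase, memory_order_acquire) == seen) {
        for (int i = 0; i < pause; i++) {
            _mm_pause();     // hint to the CPU that this is a spin-wait loop
        }
        if (pause < 1024) {
            pause *= 2;      // back off exponentially, capped at 1024 pauses
        }
    }
}
```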
