Introduce ggml_syncthreads() #7455

Open · jart wants to merge 5 commits into master from thread
Conversation

@jart (Contributor) commented May 22, 2024

This change is an alternative proposal to #1507 and #6915. This pull request contains three commits, each detailing one of the three main synchronization tricks I used to make inference go ~6% faster in llamafile, while also speeding up prompt processing.

commit ebbc728e37d5d01e835603b1780ce8cc93bc50d0 (HEAD -> thread, jart/thread)
Author: Justine Tunney <jtunney@mozilla.com>
Date:   Wed May 22 00:35:13 2024 -0700

    Make atomic operations explicit

commit 8435ab0ae8adb6c7f61bffac690314b9bbb4cfdd
Author: Justine Tunney <jtunney@mozilla.com>
Date:   Wed May 22 00:31:28 2024 -0700

    Avoid INIT synchronization barrier when possible

    This change makes inference go ~5% faster for me.

commit 7deec14bd994d05d8b7fd6bbcb510c900dd56f36
Author: Justine Tunney <jtunney@mozilla.com>
Date:   Wed May 22 00:20:24 2024 -0700

    Make MUL_MAT initialization go fast

    Using an atomic to delegate slices of the matrix to separate threads is
    slow, because all the threads have to contend for the same memory spot.
    The right thing to do here is use a `chore` variable, where all threads
    perform the same computation independently.

    This change introduces the ggml_once() and ggml_syncthreads() functions,
    which work the exact same way as in CUDA. This is nice, since it means if
    BLAS or LLAMAFILE doesn't need `B` requantized, then it can skip paying
    for the synchronization barrier between the INIT and COMPUTE phases. We
    can refactor further along this path, to remove all the INIT/COMP/FINAL
    code too. All ops should be in charge of their own synchronization.

This change only dips a toe in the water, so to speak, by introducing CUDA-style primitives for CPU mode. You can actually go much further in this technical direction. If you do, I'll be very grateful, since it will mean GGML is more aligned with llamafile and therefore much easier for me to import changes from.
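
To make the CUDA analogy concrete, here is a minimal sketch of what a once-guard, a reusable spin barrier, and "chore"-style static partitioning by thread index can look like with C11 atomics. This is illustrative only, not the code in this PR: `ggml_barrier`, `mul_mat_slice`, and the row-partitioning scheme are assumptions made for the example.

```c
#include <stdatomic.h>

struct ggml_barrier {
    atomic_int count;   // how many threads have arrived in the current round
    atomic_int phase;   // incremented each time the barrier opens
    int        nth;     // number of threads in the pool
};

// Reusable barrier: every thread blocks until all nth threads have arrived.
static void ggml_syncthreads(struct ggml_barrier *b) {
    int phase = atomic_load_explicit(&b->phase, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&b->count, 1, memory_order_acq_rel) + 1 == b->nth) {
        // last thread to arrive: reset the counter and release the others
        atomic_store_explicit(&b->count, 0, memory_order_relaxed);
        atomic_store_explicit(&b->phase, phase + 1, memory_order_release);
    } else {
        while (atomic_load_explicit(&b->phase, memory_order_acquire) == phase) {
            // spin (a real implementation would yield or back off here)
        }
    }
}

// Run an initializer exactly once across the pool; callers that depend on its
// result are expected to call ggml_syncthreads() afterwards, which makes the
// initialized data visible to every thread.
static void ggml_once(atomic_flag *flag, void (*init)(void)) {
    if (!atomic_flag_test_and_set_explicit(flag, memory_order_acq_rel)) {
        init();
    }
}

// "Chore"-style work split for C = A * B: thread ith owns a fixed block of
// rows, so no shared atomic counter is needed to hand out slices.
static void mul_mat_slice(float *dst, const float *a, const float *b,
                          int nrows, int ncols, int k, int ith, int nth) {
    int per = (nrows + nth - 1) / nth;
    int ir0 = ith * per;
    int ir1 = ir0 + per < nrows ? ir0 + per : nrows;
    for (int i = ir0; i < ir1; i++) {
        for (int j = 0; j < ncols; j++) {
            float sum = 0.0f;
            for (int kk = 0; kk < k; kk++) {
                sum += a[i*k + kk] * b[kk*ncols + j];
            }
            dst[i*ncols + j] = sum;
        }
    }
}
```

With primitives like these, each worker runs any INIT work through the once-guard, pays for the barrier only when the op actually needs the initialized data, and then computes its own fixed slice without contending on a shared counter.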

jart added 5 commits May 21, 2024 23:51
@mofosyne added the performance (Speed related topics) and Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level) labels on May 22, 2024
@ggerganov (Owner)

On M2 Ultra I get these results:

./scripts/compare-commits.sh master pr/7455 \
  -m models/mistral-7b-v0.2/ggml-model-fp16.gguf \
  -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf \
  -m models/mistral-7b-v0.2/ggml-model-q4_0.gguf -t 16 -ngl 0
| CPU      | Model         | Model Size [GiB] | Test         | t/s master | t/s pr/7455 | Speedup |
| -------- | ------------- | ---------------- | ------------ | ---------- | ----------- | ------- |
| M2 Ultra | llama 7B F16  | 13.49            | pp512        | 164.05     | 161.90      | 0.99    |
| M2 Ultra | llama 7B F16  | 13.49            | tg128        | 15.22      | 15.55       | 1.02    |
| M2 Ultra | llama 7B F16  | 13.49            | pp512+tg128  | 54.77      | 55.58       | 1.01    |
| M2 Ultra | llama 7B Q4_0 | 3.83             | pp512        | 160.76     | 132.11      | 0.82    |
| M2 Ultra | llama 7B Q4_0 | 3.83             | tg128        | 38.34      | 38.83       | 1.01    |
| M2 Ultra | llama 7B Q4_0 | 3.83             | pp512+tg128  | 96.01      | 90.40       | 0.94    |
| M2 Ultra | llama 7B Q8_0 | 7.17             | pp512        | 157.36     | 157.93      | 1.00    |
| M2 Ultra | llama 7B Q8_0 | 7.17             | tg128        | 26.06      | 26.07       | 1.00    |
| M2 Ultra | llama 7B Q8_0 | 7.17             | pp512+tg128  | 76.68      | 75.47       | 0.98    |

Which model did you use to benchmark the performance?

@jart (Contributor, Author) commented May 22, 2024

This change doesn't move the needle on ARM; I'm only seeing speedups on x86. I notice the biggest gains with really tiny models, where synchronization actually is a noticeable bottleneck in matrix multiplication. For example, prompt processing with a 265 MB embedding model:

* 572 -> 662 w/ -t  8 on Intel i9-14900K     w/ mxbai-embed-large-v1.Q6_K.gguf
* 484 -> 482 w/ -t 16 on Apple M2            w/ mxbai-embed-large-v1.Q6_K.gguf
* 740 -> 988 w/ -t 20 on Threadripper 7995WX w/ mxbai-embed-large-v1.Q6_K.gguf
* 733 -> 933 w/ -t 16 on Threadripper 7995WX w/ mxbai-embed-large-v1.Q6_K.gguf
* 644 -> 742 w/ -t 96 on Threadripper 7995WX w/ mxbai-embed-large-v1.Q6_K.gguf

In this case the gain can be up to 35%.

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 535 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8733.46ms p(95)=20701.48ms fails=, finish reason: stop=474 truncated=61
  • Prompt processing (pp): avg=96.91tk/s p(95)=423.75tk/s
  • Token generation (tg): avg=71.99tk/s p(95)=46.03tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=thread commit=3cb42757ea4198062b38d4c94d4c60aea03d00e9

[Benchmark charts omitted: time-series plots of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 535 iterations".]

@kunnis (Contributor) commented May 26, 2024

The point of #6915 is that not all tasks will finish at the same speed, especially on larger models. If there's a situation that it doesn't handle well, find the comment that starts with "If the chunking is poor for the number of threads on this setup, scrap the whole plan. Re-chunk it by thread." If you can detect the situation there, that should avoid using the atomic for syncing. Would a higher chunk_size resolve the issue? Perhaps there needs to be better selection logic for chunk_size.

I do like what you did with ggml_syncthreads. It needs to support exponential back-off.
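
Exponential back-off here means that a spinning waiter pauses for progressively longer intervals between checks of the shared flag, so heavily contended barriers waste less cache traffic. A minimal sketch of that idea follows; it is illustrative only (assuming x86 and the `_mm_pause()` intrinsic), not code from this PR.

```c
#include <immintrin.h>
#include <stdatomic.h>

// Spin until *phase changes from the value we saw on arrival, pausing for
// exponentially longer stretches each time around the loop.
static void wait_for_phase_change(atomic_int *phase, int seen) {
    int pause = 1;
    while (atomic_load_explicit(phase, memory_order_acquire) == seen) {
        for (int i = 0; i < pause; i++) {
            _mm_pause();     // hint to the CPU that this is a spin-wait loop
        }
        if (pause < 1024) {
            pause *= 2;      // back off exponentially, capped at 1024 pauses
        }
    }
}
```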
