llama-bench : add support for the RPC backend #7435

Merged: 1 commit merged into ggerganov:master on May 29, 2024

Conversation

rgerganov
Collaborator

No description provided.

@rgerganov
Collaborator Author

The --rpc command line arg is treated a bit differently from other args of llama-bench. When you do:

$ bin/llama-bench -m ../../models/tinyllama-1b/ggml-model-f16.gguf --rpc localhost:50052 -rpc localhost:50053

it won't run two separate tests with localhost:50052 and localhost:50053 but a single one with localhost:50052,localhost:50053. So it is equivalent to:

$ bin/llama-bench -m ../../models/tinyllama-1b/ggml-model-f16.gguf --rpc localhost:50052,localhost:50053

Another thing that should be noted is that we re-load the model on every single run when using RPC. This is because we cannot free the RPC backend if we still have allocated RPC buffers.
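For illustration, a minimal sketch of the joining behavior described above, using a hypothetical helper (this is not the actual llama-bench parsing code):

```cpp
// Hypothetical sketch: repeated --rpc values are concatenated into one
// comma-separated endpoint list, so they define a single test configuration.
#include <string>
#include <vector>

static std::string join_rpc_servers(const std::vector<std::string> & values) {
    std::string joined;
    for (const auto & v : values) {
        if (!joined.empty()) {
            joined += ",";
        }
        joined += v;
    }
    // e.g. {"localhost:50052", "localhost:50053"} -> "localhost:50052,localhost:50053"
    return joined;
}
```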

@slaren
Collaborator

slaren commented May 21, 2024

> The --rpc command line arg is treated a bit differently from other args of llama-bench.

I don't think this is ok; it is preferable to take each --rpc parameter as a full string and simply not allow passing multiple test values separated by commas. So --rpc a:123,b:456 is parsed as a full string. There is also no reason not to support using different sets of RPC servers, as with any other parameter, so --rpc a:123,b:456 --rpc a:123,c:789 would test the two different combinations of servers.
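A rough sketch of that parsing scheme, with assumed names (bench_params and parse_rpc_arg are illustrative, not the code in this PR):

```cpp
// Each --rpc occurrence is kept verbatim as one test value; commas inside a
// value just list multiple servers for that configuration.
#include <string>
#include <vector>

struct bench_params {
    std::vector<std::string> rpc_servers; // one entry per --rpc flag
};

static void parse_rpc_arg(bench_params & params, const std::string & value) {
    params.rpc_servers.push_back(value); // "a:123,b:456" stays a full string
}
```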

> Another thing that should be noted is that we re-load the model on every single run when using RPC. This is because we cannot free the RPC backend if we still have allocated RPC buffers.

This really should be fixed in the RPC backend rather than requiring specific workarounds in applications that are just using the llama.cpp API.

Contributor

github-actions bot commented May 21, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 544 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8616.73ms p(95)=21030.19ms fails=, finish reason: stop=491 truncated=53
  • Prompt processing (pp): avg=91.44tk/s p(95)=376.59tk/s
  • Token generation (tg): avg=31.76tk/s p(95)=45.74tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bench-rpc commit=3e886929889f237b74a5ea14a6b4f8be278e06f8

prompt_tokens_seconds — xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 544 iterations", y-axis llamacpp:prompt_tokens_seconds (time-series data omitted)

predicted_tokens_seconds — xychart, same run, y-axis llamacpp:predicted_tokens_seconds (time-series data omitted)

kv_cache_usage_ratio — xychart, same run, y-axis llamacpp:kv_cache_usage_ratio (time-series data omitted)

requests_processing — xychart, same run, y-axis llamacpp:requests_processing (time-series data omitted)

@rgerganov
Copy link
Collaborator Author

> I don't think this is ok; it is preferable to take each --rpc parameter as a full string and simply not allow passing multiple test values separated by commas. So --rpc a:123,b:456 is parsed as a full string. There is also no reason not to support using different sets of RPC servers, as with any other parameter, so --rpc a:123,b:456 --rpc a:123,c:789 would test the two different combinations of servers.

Fair enough, will fix this.

> This really should be fixed in the RPC backend rather than requiring specific workarounds in applications that are just using the llama.cpp API.

Resource management becomes tricky if we allow buffer objects to outlive the backend that created them. One would expect all resources allocated by the backend (like sockets) to be freed/closed when ggml_backend_XXX_free() is called; if buffers outlive their corresponding backend, this won't be the case. In that case, we would need to take special care to deallocate resources when the last RPC buffer is freed, or something along those lines?

@slaren
Collaborator

slaren commented May 21, 2024

In ggml-backend, buffers are not tied to a backend instance, so it cannot be said that a backend created these objects; that's not what is happening. My suggestion would be to keep an internal pool of connections held in shared_ptrs, and reference them from each object tied to that connection, i.e. buffers and backend instances. Once they are all freed, the connection can be closed.
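For illustration, a rough sketch of that ownership scheme (names like rpc_connection and get_connection are assumptions, not the actual ggml RPC backend code):

```cpp
// Buffers and backend instances each hold a shared_ptr to the connection;
// the socket is closed when the last owner is freed.
#include <map>
#include <memory>
#include <string>

struct rpc_connection {
    int sockfd = -1;
    ~rpc_connection() { /* close(sockfd); runs once the last owner goes away */ }
};

// endpoint -> weak reference into the pool, so dead connections are re-created
static std::map<std::string, std::weak_ptr<rpc_connection>> g_connection_pool;

static std::shared_ptr<rpc_connection> get_connection(const std::string & endpoint) {
    if (auto existing = g_connection_pool[endpoint].lock()) {
        return existing;                              // reuse the live connection
    }
    auto conn = std::make_shared<rpc_connection>();   // connect(endpoint) elided
    g_connection_pool[endpoint] = conn;
    return conn;
}

struct rpc_buffer  { std::shared_ptr<rpc_connection> conn; /* remote buffer handle, size, ... */ };
struct rpc_backend { std::shared_ptr<rpc_connection> conn; };
```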

@rgerganov
Collaborator Author

> My suggestion would be to keep an internal pool of connections held in shared_ptrs, and reference them from each object tied to that connection, i.e. buffers and backend instances. Once they are all freed, the connection can be closed.

Ok, I will implement this in a separate PR and update this one when it is ready.

@mofosyne added the `Review Complexity : Low` label on May 21, 2024
@rgerganov
Collaborator Author

this one is ready for review

github-actions bot added the `ggml` label on May 29, 2024
@rgerganov merged commit 210d991 into ggerganov:master on May 29, 2024
71 checks passed
Labels: examples, ggml, Review Complexity : Low
3 participants