llama-bench : add support for the RPC backend #7435

Merged: 1 commit merged into ggerganov:master on May 29, 2024

Conversation

rgerganov
Collaborator

No description provided.

@rgerganov
Collaborator Author

The --rpc command line arg is treated a bit differently from other args of llama-bench. When you do:

$ bin/llama-bench -m ../../models/tinyllama-1b/ggml-model-f16.gguf --rpc localhost:50052 -rpc localhost:50053

it won't run two separate tests with localhost:50052 and localhost:50053 but a single one with localhost:50052,localhost:50053. So it is equivalent to:

$ bin/llama-bench -m ../../models/tinyllama-1b/ggml-model-f16.gguf --rpc localhost:50052,localhost:50053

Another thing that should be noted is that we re-load the model on every single run when using RPC. This is because we cannot free the RPC backend if we still have allocated RPC buffers.
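For illustration, a minimal sketch of the joining behavior described above, using a hypothetical helper (this is not the actual llama-bench parsing code):

```cpp
// Hypothetical sketch: repeated --rpc values are concatenated into one
// comma-separated endpoint list, so they define a single test configuration.
#include <string>
#include <vector>

static std::string join_rpc_servers(const std::vector<std::string> & values) {
    std::string joined;
    for (const auto & v : values) {
        if (!joined.empty()) {
            joined += ",";
        }
        joined += v;
    }
    // e.g. {"localhost:50052", "localhost:50053"} -> "localhost:50052,localhost:50053"
    return joined;
}
```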

@slaren
Collaborator

slaren commented May 21, 2024

> The --rpc command line arg is treated a bit differently from other args of llama-bench.

I don't think this is ok; it is preferable to take each --rpc parameter as a full string and simply not allow passing multiple test values separated by commas. So --rpc a:123,b:456 is parsed as a full string. There is also no reason not to support using different sets of RPC servers, as with any other parameter, so --rpc a:123,b:456 --rpc a:123,c:789 would test the two different combinations of servers.
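A rough sketch of that parsing scheme, with assumed names (bench_params and parse_rpc_arg are illustrative, not the code in this PR):

```cpp
// Each --rpc occurrence is kept verbatim as one test value; commas inside a
// value just list multiple servers for that configuration.
#include <string>
#include <vector>

struct bench_params {
    std::vector<std::string> rpc_servers; // one entry per --rpc flag
};

static void parse_rpc_arg(bench_params & params, const std::string & value) {
    params.rpc_servers.push_back(value); // "a:123,b:456" stays a full string
}
```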

> Another thing that should be noted is that we re-load the model on every single run when using RPC. This is because we cannot free the RPC backend if we still have allocated RPC buffers.

This really should be fixed in the RPC backend rather than requiring specific workarounds in applications that are just using the llama.cpp API.

Contributor

github-actions bot commented May 21, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 544 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8616.73ms p(95)=21030.19ms fails=, finish reason: stop=491 truncated=53
  • Prompt processing (pp): avg=91.44tk/s p(95)=376.59tk/s
  • Token generation (tg): avg=31.76tk/s p(95)=45.74tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bench-rpc commit=3e886929889f237b74a5ea14a6b4f8be278e06f8

prompt_tokens_seconds — xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 544 iterations", y-axis llamacpp:prompt_tokens_seconds (time-series data omitted)

predicted_tokens_seconds — xychart, same run, y-axis llamacpp:predicted_tokens_seconds (time-series data omitted)

kv_cache_usage_ratio — xychart, same run, y-axis llamacpp:kv_cache_usage_ratio (time-series data omitted)

requests_processing — xychart, same run, y-axis llamacpp:requests_processing (time-series data omitted)

@rgerganov
Copy link
Collaborator Author

> I don't think this is ok; it is preferable to take each --rpc parameter as a full string and simply not allow passing multiple test values separated by commas. So --rpc a:123,b:456 is parsed as a full string. There is also no reason not to support using different sets of RPC servers, as with any other parameter, so --rpc a:123,b:456 --rpc a:123,c:789 would test the two different combinations of servers.

Fair enough, will fix this.

> This really should be fixed in the RPC backend rather than requiring specific workarounds in applications that are just using the llama.cpp API.

Resource management becomes tricky if we allow buffer objects to outlive the backend that created them. One would expect all resources allocated by the backend (like sockets) to be freed/closed when ggml_backend_XXX_free() is called; if buffers outlive their corresponding backend, this won't be the case. In that case, we would need to take special care to deallocate resources when the last RPC buffer is freed, or something along those lines?

@slaren
Collaborator

slaren commented May 21, 2024

In ggml-backend, buffers are not tied to a backend instance, so it cannot be said that a backend created these objects; that's not what is happening. My suggestion would be to keep an internal pool of connections held in shared_ptrs, and reference them from each object tied to that connection, i.e. buffers and backend instances. Once they are all freed, the connection can be closed.
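For illustration, a rough sketch of that ownership scheme (names like rpc_connection and get_connection are assumptions, not the actual ggml RPC backend code):

```cpp
// Buffers and backend instances each hold a shared_ptr to the connection;
// the socket is closed when the last owner is freed.
#include <map>
#include <memory>
#include <string>

struct rpc_connection {
    int sockfd = -1;
    ~rpc_connection() { /* close(sockfd); runs once the last owner goes away */ }
};

// endpoint -> weak reference into the pool, so dead connections are re-created
static std::map<std::string, std::weak_ptr<rpc_connection>> g_connection_pool;

static std::shared_ptr<rpc_connection> get_connection(const std::string & endpoint) {
    if (auto existing = g_connection_pool[endpoint].lock()) {
        return existing;                              // reuse the live connection
    }
    auto conn = std::make_shared<rpc_connection>();   // connect(endpoint) elided
    g_connection_pool[endpoint] = conn;
    return conn;
}

struct rpc_buffer  { std::shared_ptr<rpc_connection> conn; /* remote buffer handle, size, ... */ };
struct rpc_backend { std::shared_ptr<rpc_connection> conn; };
```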

@rgerganov
Collaborator Author

> My suggestion would be to keep an internal pool of connections held in shared_ptrs, and reference them from each object tied to that connection, i.e. buffers and backend instances. Once they are all freed, the connection can be closed.

Ok, I will implement this in a separate PR and update this one when it is ready.

@mofosyne added the `Review Complexity : Low` label on May 21, 2024
@rgerganov
Collaborator Author

this one is ready for review

github-actions bot added the `ggml` label on May 29, 2024
@rgerganov merged commit 210d991 into ggerganov:master on May 29, 2024
71 checks passed
Labels: examples, ggml, Review Complexity : Low
3 participants