llama : offload to RPC in addition to other backends #7640

Merged
merged 4 commits into from
Jun 3, 2024


rgerganov
Collaborator

This patch adds support for offloading layers to RPC servers in addition to other non-RPC backends. For example, if you build with -DLLAMA_CUDA=ON -DLLAMA_RPC=ON, you can offload to a local GPU and remote server(s):

$ bin/main -m ../models/ggml-model-f16.gguf -p "Hello, my name is" -n 64 -ngl 99 -s 1236 --rpc localhost:50052
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0,31 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:        CPU buffer size =   125,00 MiB
llm_load_tensors:      CUDA0 buffer size =  1008,19 MiB
llm_load_tensors:        RPC buffer size =   965,16 MiB
..........................................................................................

I have tried to follow the existing patterns in llama.cpp and introduced device numbers for RPC servers which always come last.
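
The device numbering described above can be pictured with a minimal sketch: local GPU devices take the first indices and RPC servers take the indices after them. The names `DeviceRef`, `device_count`, and `resolve_device` are hypothetical and are not taken from llama.cpp; this only illustrates the "RPC servers come last" convention.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical description of where a flat device index points.
struct DeviceRef {
    bool        is_rpc;    // true if the index refers to an RPC server
    std::size_t local_id;  // local GPU index (meaningful when !is_rpc)
    std::string endpoint;  // RPC endpoint such as "localhost:50052" (meaningful when is_rpc)
};

// Total device count: local GPUs first, then one device per RPC server.
std::size_t device_count(std::size_t n_local, const std::vector<std::string> & rpc_servers) {
    return n_local + rpc_servers.size();
}

// Map a flat device index to either a local GPU or an RPC server.
// RPC servers always come last, so indices >= n_local refer to them.
DeviceRef resolve_device(std::size_t device, std::size_t n_local,
                         const std::vector<std::string> & rpc_servers) {
    if (device >= device_count(n_local, rpc_servers)) {
        throw std::out_of_range("device index out of range");
    }
    if (device < n_local) {
        return DeviceRef{false, device, ""};
    }
    return DeviceRef{true, 0, rpc_servers[device - n_local]};
}
```

Under this scheme, a build with one CUDA device started with `--rpc localhost:50052` would see index 0 as the local GPU and index 1 as the RPC server.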

When copying tensors, we need to handle the case when src and dst are not on the same backend. For CUDA I had to build with -DLLAMA_CUDA_NO_PEER_COPY=ON to make it work.

@mofosyne added the "Review Complexity : Medium" label May 30, 2024
@github-actions bot added the "ggml" label May 30, 2024
Contributor

github-actions bot commented May 30, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8675.69ms p(95)=21051.8ms fails=, finish reason: stop=494 truncated=46
  • Prompt processing (pp): avg=108.21tk/s p(95)=502.93tk/s
  • Token generation (tg): avg=32.13tk/s p(95)=47.06tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=rpc-offload commit=243a3e4bb2ffb04248104fb375e61c55e5e42028

Charts (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations; x-axis 1717453421 --> 1717454047):
  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

@slaren
Collaborator

slaren commented May 30, 2024

> When copying tensors, we need to handle the case when src and dst are not on the same backend. For CUDA I had to build with -DLLAMA_CUDA_NO_PEER_COPY=ON to make it work.

The way this is supposed to work is that backends need to check the buffer type of the tensors to determine if they can perform the copy, and return false otherwise. The CUDA backend should already do this. What happens exactly if you don't use LLAMA_CUDA_NO_PEER_COPY, does it crash somewhere or does it produce wrong results?

@rgerganov
Collaborator Author

> The way this is supposed to work is that backends need to check the buffer type of the tensors to determine if they can perform the copy, and return false otherwise. The CUDA backend should already do this. What happens exactly if you don't use LLAMA_CUDA_NO_PEER_COPY, does it crash somewhere or does it produce wrong results?

It crashes in ggml_backend_cuda_buffer_cpy_tensor() because it assumes that dst is allocated by the CUDA backend, which is not the case. I think we should add a check in ggml_backend_tensor_copy() that both tensors are on the same backend, and perform the slow copy if they are not.

@rgerganov rgerganov marked this pull request as ready for review May 31, 2024 08:46
@slaren
Collaborator

slaren commented May 31, 2024

The tensor_copy function is meant to allow efficient copies between different backends. For example, it is used to copy tensors between different CUDA devices (each of which is a different backend). It could also be used in the RPC backend to copy tensors between different servers directly, without passing first through the host. This is a bug in ggml-backend: dst should always be allocated in the buffer of the called tensor_copy function. I will look into it.
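
The copy protocol discussed in this thread (a backend inspects the buffer types and returns false when it cannot copy directly, after which a slower copy through host memory is used) can be sketched roughly as follows. This is a simplified model under stated assumptions, not ggml code: `BufferType`, `Tensor`, `backend_cpy_tensor`, and `tensor_copy` are hypothetical names, and the direct path here only accepts identical buffer types, whereas a real backend may also support direct copies between different devices.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model; none of these types are ggml's.
enum class BufferType { CPU, CUDA, RPC };

struct Tensor {
    BufferType         buft;  // type of the buffer that holds the data
    std::vector<float> data;  // stands in for device memory
};

// Direct copy implemented by the backend that owns dst. It inspects the
// buffer type of src and returns false when it cannot copy directly.
bool backend_cpy_tensor(const Tensor & src, Tensor & dst) {
    if (src.buft != dst.buft) {
        return false;  // src lives in a buffer this backend cannot read
    }
    dst.data = src.data;
    return true;
}

// Generic copy: try the direct path first, otherwise stage through host memory.
// Returns true if the direct path was used.
bool tensor_copy(const Tensor & src, Tensor & dst) {
    assert(src.data.size() == dst.data.size());
    if (backend_cpy_tensor(src, dst)) {
        return true;
    }
    std::vector<float> host(src.data.begin(), src.data.end());  // read src into host memory
    dst.data.assign(host.begin(), host.end());                  // write host memory into dst
    return false;
}
```

The point of the boolean return is that the caller never has to know which backends can talk to each other; it only needs a fallback that always works.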

@rgerganov rgerganov self-assigned this May 31, 2024
@slaren
Collaborator

slaren commented May 31, 2024

rgerganov#1 should fix it.

@rgerganov
Collaborator Author

@slaren thanks, I have verified your fix and merged it

rgerganov and others added 3 commits June 3, 2024 10:19
…uffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names
Co-authored-by: slaren <slarengh@gmail.com>
@slaren slaren linked an issue Jun 3, 2024 that may be closed by this pull request
@rgerganov rgerganov merged commit bde7cd3 into ggerganov:master Jun 3, 2024
70 checks passed
@zhouwg
Contributor

zhouwg commented Jun 4, 2024

@rgerganov, hello, I'm sorry to disturb you. Could you check more carefully before your PR is merged to the master branch? I know you are the brother of the original author of the great ggml (which I personally think of as another FFmpeg, focused on the AI industry), llama.cpp, and whisper.cpp, and that you have some privileges; I think this is the key reason why your PR was approved so quickly, although its quality might need more checking.

I'm not sure whether you made incorrect deletions in llama.cpp that cause trouble when another community backend developer rebases (#6869):

weiguo:$ git diff 
diff --cc llama.cpp
index fce8e749,a10c3e1f..00000000
--- a/llama.cpp
+++ b/llama.cpp
@@@ -1748,6 -1664,36 +1748,39 @@@ static ggml_backend_buffer_type_t llama
      GGML_UNUSED(host_buffer);
  }
  
++<<<<<<< HEAD
++=======
+ static ggml_backend_buffer_type_t llama_default_buffer_type_offload(int gpu) {
+     ggml_backend_buffer_type_t buft = nullptr;
+ 
+ #ifdef GGML_USE_METAL
+     buft = ggml_backend_metal_buffer_type();
+ #elif defined(GGML_USE_CUDA)
+     buft = ggml_backend_cuda_buffer_type(gpu);
+ #elif defined(GGML_USE_VULKAN)
+     buft = ggml_backend_vk_buffer_type(gpu);
+ #elif defined(GGML_USE_SYCL)
+     buft = ggml_backend_sycl_buffer_type(gpu);
+ #elif defined(GGML_USE_CLBLAST)
+     buft = ggml_backend_opencl_buffer_type();
+ #elif defined(GGML_USE_KOMPUTE)
+     buft = ggml_backend_kompute_buffer_type(gpu);
+     if (buft == nullptr) {
+         LLAMA_LOG_WARN("%s: cannot use GPU %d, check `vulkaninfo --summary`\n", __func__, gpu);
+     }
+ #elif defined(GGML_USE_QNN)
+     buft = ggml_backend_qnn_buffer_type(gpu);
+ #endif
+ 
+     if (buft == nullptr) {
+         buft = llama_default_buffer_type_cpu(true);
+     }
+     return buft;
+ 
+     GGML_UNUSED(gpu);
+ }
+ 
++>>>>>>> b0c3013f2ea2c82a43248e43a0abfaebd5bb105a
  static ggml_backend_buffer_type_t llama_default_buffer_type_split(int fallback_gpu, const float * tensor_split) {
      ggml_backend_buffer_type_t buft = nullptr;
  
@@@ -1771,6 -1717,42 +1804,45 @@@
      GGML_UNUSED(tensor_split);
  }
  
++<<<<<<< HEAD
++=======
+ static size_t llama_get_device_count() {
+ #if defined(GGML_USE_CUDA)
+     return ggml_backend_cuda_get_device_count();
+ #elif defined(GGML_USE_SYCL)
+     return ggml_backend_sycl_get_device_count();
+ #elif defined(GGML_USE_VULKAN)
+     return ggml_backend_vk_get_device_count();
+ #elif defined(GGML_USE_QNN)
+     return ggml_backend_qnn_get_device_count();
+ #else
+     return 1;
+ #endif
+ }
+ 
+ static size_t llama_get_device_memory(int device) {
+ #if defined(GGML_USE_CUDA)
+     size_t total;
+     size_t free;
+     ggml_backend_cuda_get_device_memory(device, &free, &total);
+     return free;
+ #elif defined(GGML_USE_SYCL)
+     size_t total;
+     size_t free;
+     ggml_backend_sycl_get_device_memory(device, &free, &total);
+     return free;
+ #elif defined(GGML_USE_VULKAN)
+     size_t total;
+     size_t free;
+     ggml_backend_vk_get_device_memory(device, &free, &total);
+     return free;
+ #else
+     return 1;
+     GGML_UNUSED(device);
+ #endif
+ }
+ 
++>>>>>>> b0c3013f2ea2c82a43248e43a0abfaebd5bb105a

GGML/whisper.cpp/llama.cpp is now a project with 60k+ stars, and it is referenced directly or indirectly by many developers, companies, and research institutions; it is no longer a simple personal project. I hope you can check more carefully before your PR is merged to the master branch.

I sincerely apologize if this is my misunderstanding.

Thanks for your understanding.

@rgerganov
Collaborator Author

> could you check more carefully before your PR is merged to the master branch?

My changes have been reviewed and the CI run was successful.

> and that you have some privileges; I think this is the key reason why your PR was approved so quickly, although its quality might need more checking.

I don't have any privileges in this project and my code is reviewed like everybody else's. You'd better have some arguments when saying something about the quality of my changes.

> I'm not sure whether you made incorrect deletions in llama.cpp that cause trouble when another community backend developer rebases

When someone is proposing a change it is their responsibility to address review comments and make sure that the change can be applied on current master. The changes I have done in llama.cpp are trivial and rebasing other changes on top of them is also trivial.

> I hope you can check more carefully before your PR is merged to the master branch.

My change has been reviewed and approved by a core maintainer and the CI run was successful.

@zhouwg
Contributor

zhouwg commented Jun 4, 2024

Yes, you are right: "When someone is proposing a change it is their responsibility to address review comments and make sure that the change can be applied on current master." My rebase hit an unnecessary conflict with your latest approved ggml-rpc PR. You have many ggml-rpc patches, and Intel's SYCL backend also has many patches; both make sense. But Intel's PRs have no side effects on other backends and focus only on internal refinements in ggml-sycl.cpp. This is why I said "the quality of your PR might need more checking".

You should have done all the "preparation work" (I don't know how to describe it correctly in English at the moment) for ggml-rpc in your very first PR, and then focused on internal changes and refinements in ggml-rpc.cpp. But since your first ggml-rpc PR was approved, you have kept submitting many PRs that change llama.cpp for the sake of your ggml-rpc backend. This is what I want to say. It is also the responsibility of the so-called core maintainer, who I know is the maintainer of the ggml backend subsystem. That is why I said you have privileges.

You can see that my ggml-qnn backend has been stuck in review since 04/24/2024, although I would like it to be approved and merged to the master branch so that other community developers can contribute code and ideas to ggml-qnn.cpp. This is the second reason why I said you have privileges.

I know the maintainer of the ggml backend subsystem is also a real AI expert and a C/C++ master, but I don't think he has handled your many ggml-rpc PRs professionally (if I were him, your first ggml-rpc PR would not have been approved).

@slaren
Collaborator

slaren commented Jun 4, 2024

@zhouwg I am having a very hard time understanding what you are complaining about here. It is the responsibility of PR authors to deal with merge conflicts caused by changes to the code merged to the repository. How could it be otherwise? The RPC backend has been subject to the same quality requirements as any other PR merged to llama.cpp.

The QNN backend cannot be merged in its current state because it is not a functional backend. It is not correct to say that it has been waiting for review since 4/24/2024 because there is nothing to review until it is a functional backend that people can try. Generally, we will not merge non-functional code in the hope that other people will complete the missing parts.

You have two ways to make it functional: implement the missing operations, or wait until ggml-backend implements support for using the CPU backend for ops not supported by a backend. I understand that you are frustrated, but please try to be patient.
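
The second option mentioned above, using the CPU backend for ops that an accelerator backend does not support, can be illustrated with a small sketch. It is a hypothetical model, not the ggml-backend scheduler: `Backend`, `assign_backends`, and the op names used as strings are assumptions made only for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical accelerator backend that implements only some operations.
struct Backend {
    std::string           name;
    std::set<std::string> supported_ops;

    bool supports(const std::string & op) const {
        return supported_ops.count(op) > 0;
    }
};

// For each op in the graph, pick the accelerator when it supports the op
// and fall back to "CPU" otherwise. Returns the chosen backend name per op.
std::vector<std::string> assign_backends(const std::vector<std::string> & graph_ops,
                                         const Backend & accel) {
    std::vector<std::string> placement;
    placement.reserve(graph_ops.size());
    for (const std::string & op : graph_ops) {
        placement.push_back(accel.supports(op) ? accel.name : "CPU");
    }
    return placement;
}
```

With such a fallback, a backend that implements only a subset of operations could still run a whole graph, at the cost of extra copies wherever placement switches between devices.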

@zhouwg
Contributor

zhouwg commented Jun 4, 2024

I'm sorry for that; please disregard what I said in this PR (it was just my personal feeling, and I still think you are a real AI expert and C/C++ master).

1. The "preparation work", or HLD (high-level design), should be done in the first ggml-rpc PR, so that after it is approved the author can focus on internal refinements in ggml-rpc.cpp, just like the Intel SYCL backend did. I can handle the merge/rebase conflicts properly, but that is exactly the point I want to make. BTW, I really can't understand why "offload to the RPC backend" was not done and verified in the very first ggml-rpc PR. By your own words, it might have been a non-functional backend.

2. I provided a general approach, based on the existing ggml backend subsystem, that makes CPU&GPU / CPU&NPU mixed inference very easy, in the PR "refine backend subsystem for CPU&GPU / CPU&NPU mixed inference more easily for a specified GGML backend", but you said it is not correct and will not work, although it works very well in my personal ggml learning project.

3. Compared with the ggml-rpc backend, the QNN backend is more mature (or ggml-rpc is less functional); this is also a personal feeling.

4. I participate in this great project to learn, for fun, and to make a small contribution, although I know little or nothing about real, hardcore AI tech: I submit the PRs that I can do and can handle.

5. The QNN backend works as expected with whisper.cpp and llama.cpp. I provide UTs in an Android APK and in an Android command-line program, and all the UT cases also work as expected, but you still think it is not a functional backend.

6. You are the author and maintainer of the ggml backend subsystem; you have the right to decide whether a given ggml backend PR can be accepted, and I respect your decision.

Comment on lines +1890 to +1898
  void ggml_backend_view_init(struct ggml_tensor * tensor) {
      GGML_ASSERT(tensor->buffer == NULL);
      GGML_ASSERT(tensor->view_src != NULL);
      GGML_ASSERT(tensor->view_src->buffer != NULL);
      GGML_ASSERT(tensor->view_src->data != NULL);

-     tensor->buffer = buffer;
+     tensor->buffer = tensor->view_src->buffer;
      tensor->data = (char *)tensor->view_src->data + tensor->view_offs;
-     ggml_backend_buffer_init_tensor(buffer, tensor);
+     ggml_backend_buffer_init_tensor(tensor->buffer, tensor);
Collaborator

@airMeng airMeng Jun 12, 2024

This breaks the latest SYCL support. Do all the other backends assume the buffer is already bound to the right tensor? Of course, I know it is an issue in SYCL itself; we are maintaining SYCL in our spare time and are still begging for official support from the company :) I will look into it and try to fix it soon.

@slaren @rgerganov for awareness

Collaborator

It's the same issue that was fixed for the Vulkan backend in #7806; there are more details there about why this happens and why the change was necessary. The best way to fix this for the SYCL backend would be to remove the extras entirely, in the same way they were removed from the CUDA backend.
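
For readers following the review snippet in this thread, the rule it enforces (a view is always initialized in the buffer of its source tensor, with its data pointer offset from the source's data) can be modeled in a few lines. This is a simplified sketch, not the ggml implementation: `Buffer`, `Tensor`, and `view_init` are hypothetical stand-ins, and the backend-specific tensor initialization hook is omitted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-ins for a backend buffer and a tensor.
struct Buffer {
    int id;
};

struct Tensor {
    Buffer *    buffer    = nullptr;  // buffer that owns the tensor's memory
    char *      data      = nullptr;  // address of the tensor's first byte
    Tensor *    view_src  = nullptr;  // tensor this one is a view of, if any
    std::size_t view_offs = 0;        // byte offset into view_src's data
};

// Initialize a view: it always lives in the buffer of its source tensor,
// and its data pointer is the source pointer plus the view offset.
void view_init(Tensor & t) {
    assert(t.buffer == nullptr);            // the view must not be initialized yet
    assert(t.view_src != nullptr);          // it must actually be a view
    assert(t.view_src->buffer != nullptr);  // the source must already be allocated
    assert(t.view_src->data != nullptr);

    t.buffer = t.view_src->buffer;
    t.data   = t.view_src->data + t.view_offs;
}
```

Because the caller no longer passes a buffer, a backend cannot rely on the view being bound to a buffer of the caller's choosing; it has to work with whatever buffer the source tensor lives in.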

Collaborator Author

> we are maintaining SYCL in our spare time and are still begging for official support from the company

Having a dedicated machine which runs the CI with the SYCL backend would be very helpful.

@metal3d
Contributor

metal3d commented Jul 13, 2024

I see that you reverted this merge. Does it mean that it will not work with other backends? I'm using Vulkan instead of CUDA (for many reasons), and I can see that the rpc-server binary is linked to libvulkan, but the binary says "create_backend: using CPU backend".

Development

Successfully merging this pull request may close these issues.

Bug: test run on stories15M-q4_0.gguf result in Segmentation fault.
6 participants