Bug: llama.cpp server reports inaccurate n_ctx_per_seq? #10186

Closed
horenbergerb opened this issue Nov 5, 2024 · 6 comments · Fixed by #10187
Labels: bug (Something isn't working), low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

Comments

@horenbergerb

What happened?

Running a model and specifying 8192 context like so:

/llama-server --model Mistral-Large-Instruct-2407-IQ3_XXS.gguf -c 8192 -ngl 35

Causes the following to print during initialization:

llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

This freaked me out because, based on this discussion, the message implies that I'm actually only getting 4096 context due to parallelization. On the other hand, I also see:

srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

which is what I would expect.
This discrepancy seems to be due to the llama.cpp server temporarily incrementing n_parallel when loading the model (for a reason related to Mamba? I'm not sure why we do this).
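For what it's worth, here is a minimal sketch of where the two numbers could come from, assuming the warning divides the total context by the number of sequences present at context-creation time. The toy_* names are illustrative stand-ins, not actual llama.cpp identifiers:

```cpp
#include <cstdint>
#include <cstdio>

// Toy stand-in for the relevant fields; not the real llama.cpp structs.
struct toy_params {
    uint32_t n_ctx      = 8192; // requested with -c 8192
    uint32_t n_parallel = 1;    // one slot requested
};

// Toy version of the per-sequence figure in the warning
// (assumed to be the total context split across sequences).
static uint32_t per_seq_ctx(const toy_params & p) {
    return p.n_ctx / p.n_parallel;
}

int main() {
    toy_params p;

    p.n_parallel += 1;                            // the server hack: extra sequence during load
    printf("warning shows %u\n", per_seq_ctx(p)); // 8192 / 2 = 4096

    p.n_parallel -= 1;                            // restored before the slots are created
    printf("slot gets    %u\n", per_seq_ctx(p));  // 8192 / 1 = 8192 (n_ctx_slot)

    return 0;
}
```

Under that assumption, the warning reflects the temporarily doubled sequence count while the slot is sized from the restored value, which would explain why both 4096 and 8192 show up.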
My concerns are:

  • What context is actually being used here? 8192 or 4096?
  • Should this be considered a bug, since the messages essentially contradict each other?

Please let me know if any other information is needed, but this should be easy to replicate. Thanks!

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4033 (a9e8a9a0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

horenbergerb added the bug-unconfirmed and low severity labels on Nov 5, 2024
horenbergerb changed the title from "Bug: llama.cpp server reports innaccurate n_ctx_per_seq?" to "Bug: llama.cpp server reports inaccurate n_ctx_per_seq?" on Nov 5, 2024
@ggerganov
Owner

It's using the correct context - 8192. The message is incorrect due to the hack that you noticed. Currently, this hack is not necessary - it was used for the old system prompt functionality, which was removed (#9811). I thought about keeping this extra sequence, as I had some ideas to utilize it. We should just remove the hack now.

@ggerganov
Owner

Could you check if #10187 works as expected?

@horenbergerb
Author

Before:

llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:  CUDA_Host KV buffer size =  1696.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1120.00 MiB
llama_new_context_with_model: KV self size  = 2816.00 MiB, K (f16): 1408.00 MiB, V (f16): 1408.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.25 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1654.13 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 2822
llama_new_context_with_model: graph splits = 587 (with bs=512), 3 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

After:

llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:  CUDA_Host KV buffer size =  1696.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1120.00 MiB
llama_new_context_with_model: KV self size  = 2816.00 MiB, K (f16): 1408.00 MiB, V (f16): 1408.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1654.13 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 2822
llama_new_context_with_model: graph splits = 587 (with bs=512), 3 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

Interesting that the CPU output buffer is slightly smaller. Is that expected?
Otherwise, inference seems to work fine on my machine. I haven't tried parallel processing, only the same command as in the original message.

@Foreist commented Nov 14, 2024

I get the following results when I load llama.cpp with the command below. In fact, it appears that the model's context length is limited to 2951 per slot.

LLAMA_ARG_N_GPU_LAYERS=100 /workspace/llama.cpp/llama-server -np 17 -m '/workspace/Gemma2-9B-qlora-1epoch-Q4_K_M.gguf' -v -t 12 -b 256 --ctx-size 50000 -fa

llama_new_context_with_model: n_ctx         = 50176
llama_new_context_with_model: n_ctx_per_seq = 2951
llama_new_context_with_model: n_batch       = 256
llama_new_context_with_model: n_ubatch      = 256
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2951) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init:      CUDA0 KV buffer size = 16464.00 MiB
llama_new_context_with_model: KV self size  = 16464.00 MiB, K (f16): 8232.00 MiB, V (f16): 8232.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    16.60 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   253.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   101.50 MiB
llama_new_context_with_model: graph nodes  = 1398
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 17
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  0 | task -1 |
slot         init: id  1 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  1 | task -1 |
slot         init: id  2 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  2 | task -1 |
slot         init: id  3 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  3 | task -1 |
slot         init: id  4 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  4 | task -1 |
slot         init: id  5 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  5 | task -1 |
slot         init: id  6 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  6 | task -1 |
slot         init: id  7 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  7 | task -1 |
slot         init: id  8 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  8 | task -1 |
slot         init: id  9 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  9 | task -1 |
slot         init: id 10 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 10 | task -1 |
slot         init: id 11 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 11 | task -1 |
slot         init: id 12 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 12 | task -1 |
slot         init: id 13 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 13 | task -1 |
slot         init: id 14 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 14 | task -1 |
slot         init: id 15 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 15 | task -1 |
slot         init: id 16 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 16 | task -1 |
main: model loaded
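The 2951 per-slot value above looks like the expected even split of the total context across the 17 slots requested with -np 17, rather than the mislabeled warning from the original report. A quick sanity check of the arithmetic, using the values from the log and assuming integer division of n_ctx by the slot count:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t n_ctx   = 50176; // n_ctx as reported in the log (user passed --ctx-size 50000)
    const uint32_t n_slots = 17;    // -np 17

    // Integer division, matching the n_ctx_per_seq and n_ctx_slot values above.
    printf("per-slot context = %u\n", n_ctx / n_slots); // 50176 / 17 = 2951

    return 0;
}
```

If each request needs more context, raising --ctx-size or lowering -np is what changes this per-slot figure.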

@matt-vonhippel

It looks like this shouldn't have been closed: @horenbergerb's post above gives the same error message both before and after the change. I'm getting the same issue as well.

@dnck commented Jan 23, 2025

Same issue over here.
