Bug: llama.cpp server reports inaccurate n_ctx_per_seq? #10186

Closed
horenbergerb opened this issue Nov 5, 2024 · 6 comments · Fixed by #10187
Labels: bug (Something isn't working), low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

Comments

@horenbergerb

What happened?

Running a model and specifying 8192 context like so:

/llama-server --model Mistral-Large-Instruct-2407-IQ3_XXS.gguf -c 8192 -ngl 35

Causes the following to print during initialization:

llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

This freaked me out because, based on this discussion, the message implies that I'm actually only getting 4096 context due to parallelization. On the other hand, I also see:

srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

which is what I would expect.
This discrepancy seems to be due to the llama.cpp server temporarily incrementing n_parallel when loading the model (for a reason related to Mamba? I'm not sure why we do this).
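For what it's worth, here is a minimal sketch of where the two numbers could come from, assuming the warning divides the total context by the number of sequences present at context-creation time. The toy_* names are illustrative stand-ins, not actual llama.cpp identifiers:

```cpp
#include <cstdint>
#include <cstdio>

// Toy stand-in for the relevant fields; not the real llama.cpp structs.
struct toy_params {
    uint32_t n_ctx      = 8192; // requested with -c 8192
    uint32_t n_parallel = 1;    // one slot requested
};

// Toy version of the per-sequence figure in the warning
// (assumed to be the total context split across sequences).
static uint32_t per_seq_ctx(const toy_params & p) {
    return p.n_ctx / p.n_parallel;
}

int main() {
    toy_params p;

    p.n_parallel += 1;                            // the server hack: extra sequence during load
    printf("warning shows %u\n", per_seq_ctx(p)); // 8192 / 2 = 4096

    p.n_parallel -= 1;                            // restored before the slots are created
    printf("slot gets    %u\n", per_seq_ctx(p));  // 8192 / 1 = 8192 (n_ctx_slot)

    return 0;
}
```

Under that assumption, the warning reflects the temporarily doubled sequence count while the slot is sized from the restored value, which would explain why both 4096 and 8192 show up.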
My concerns are:

  • What context is actually being used here? 8192 or 4096?
  • Should this be considered a bug, since the messages essentially contradict each other?

Please let me know if any other information is needed, but this should be easy to replicate. Thanks!

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4033 (a9e8a9a0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

horenbergerb added the bug-unconfirmed and low severity labels on Nov 5, 2024
horenbergerb changed the title from "Bug: llama.cpp server reports innaccurate n_ctx_per_seq?" to "Bug: llama.cpp server reports inaccurate n_ctx_per_seq?" on Nov 5, 2024
@ggerganov
Owner

It's using the correct context - 8192. The message is incorrect due to the hack that you noticed. Currently, this hack is not necessary - it was used for the old system prompt functionality, which was removed (#9811). I thought about keeping this extra sequence, as I had some ideas to utilize it. We should just remove the hack now.

@ggerganov
Owner

Could you check if #10187 works as expected?

@horenbergerb
Author

Before:

llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:  CUDA_Host KV buffer size =  1696.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1120.00 MiB
llama_new_context_with_model: KV self size  = 2816.00 MiB, K (f16): 1408.00 MiB, V (f16): 1408.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.25 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1654.13 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 2822
llama_new_context_with_model: graph splits = 587 (with bs=512), 3 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

After:

llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:  CUDA_Host KV buffer size =  1696.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1120.00 MiB
llama_new_context_with_model: KV self size  = 2816.00 MiB, K (f16): 1408.00 MiB, V (f16): 1408.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1654.13 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 2822
llama_new_context_with_model: graph splits = 587 (with bs=512), 3 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

Interesting that the CPU output buffer is slightly smaller. Is that expected?
Otherwise, inference seems to work fine on my machine. I haven't tried parallel processing, only the same command as in the original message.

@Foreist commented Nov 14, 2024

I get the following results when I load llama.cpp with the command below. In fact, it appears that the model's context length is limited to 2951 per slot.

LLAMA_ARG_N_GPU_LAYERS=100 /workspace/llama.cpp/llama-server -np 17 -m '/workspace/Gemma2-9B-qlora-1epoch-Q4_K_M.gguf' -v -t 12 -b 256 --ctx-size 50000 -fa

llama_new_context_with_model: n_ctx         = 50176
llama_new_context_with_model: n_ctx_per_seq = 2951
llama_new_context_with_model: n_batch       = 256
llama_new_context_with_model: n_ubatch      = 256
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2951) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init:      CUDA0 KV buffer size = 16464.00 MiB
llama_new_context_with_model: KV self size  = 16464.00 MiB, K (f16): 8232.00 MiB, V (f16): 8232.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    16.60 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   253.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   101.50 MiB
llama_new_context_with_model: graph nodes  = 1398
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 17
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  0 | task -1 |
slot         init: id  1 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  1 | task -1 |
slot         init: id  2 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  2 | task -1 |
slot         init: id  3 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  3 | task -1 |
slot         init: id  4 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  4 | task -1 |
slot         init: id  5 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  5 | task -1 |
slot         init: id  6 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  6 | task -1 |
slot         init: id  7 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  7 | task -1 |
slot         init: id  8 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  8 | task -1 |
slot         init: id  9 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id  9 | task -1 |
slot         init: id 10 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 10 | task -1 |
slot         init: id 11 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 11 | task -1 |
slot         init: id 12 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 12 | task -1 |
slot         init: id 13 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 13 | task -1 |
slot         init: id 14 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 14 | task -1 |
slot         init: id 15 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 15 | task -1 |
slot         init: id 16 | task -1 | new slot n_ctx_slot = 2951
slot        reset: id 16 | task -1 |
main: model loaded
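The 2951 per-slot value above looks like the expected even split of the total context across the 17 slots requested with -np 17, rather than the mislabeled warning from the original report. A quick sanity check of the arithmetic, using the values from the log and assuming integer division of n_ctx by the slot count:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t n_ctx   = 50176; // n_ctx as reported in the log (user passed --ctx-size 50000)
    const uint32_t n_slots = 17;    // -np 17

    // Integer division, matching the n_ctx_per_seq and n_ctx_slot values above.
    printf("per-slot context = %u\n", n_ctx / n_slots); // 50176 / 17 = 2951

    return 0;
}
```

If each request needs more context, raising --ctx-size or lowering -np is what changes this per-slot figure.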

@matt-vonhippel

It looks like this shouldn't have been closed: @horenbergerb's post above gives the same error message both before and after the change. I'm getting the same issue as well.

@dnck commented Jan 23, 2025

Same issue over here.
