server : remove system prompt support #9811

Closed
ggerganov opened this issue Oct 9, 2024 · 13 comments · Fixed by #9857

@ggerganov
Owner

The "system_prompt" related functionality is quite outdated and is introducing unnecessary complexity. It only sort of makes sense for non-finetuned models in order to save the computation of a common prefix when there are multiple parallel slots. But in practice, only finetuned models are utilized for this use case and they always require a chat template, which is incompatible with the current implementation of the system prompt. So in order to simplify the code a bit, we should remove the system prompt related functionality from the server.

@ggerganov ggerganov self-assigned this Oct 9, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 9, 2024
@GuillaumeBruand

I am using system_prompt along with parallel slots and the built-in chat template of each model. What will be the preferred way to keep the current behaviour, i.e. a conditioned chatbot for all my users?

@ggerganov
Owner Author

What does your context currently look like? Is it:

[system prompt tokens]
<|im_start|>system
[another system prompt here?]<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

@GuillaumeBruand

Kind of; for Llama 3.1 it looks like:

<|start_header_id|>system<|end_header_id|>
{system prompt here}<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

My understanding is that {system prompt} is updated at launch or through the API endpoint, and that to replicate the current behaviour I would need to update the template myself.

@ggerganov
Owner Author

There are two types of system prompts:

  • The one implemented in llama-server that I would like to remove. The way it works is that it is prefixed to all other tokens: regardless of whether there is a chat template or not, system prompt tokens of this kind end up at the start of the context (see my message earlier).
  • The second type of system prompt is the one that can be passed from the client through the chat template, for example by adding a message with the system role. This continues to work as always - the client is responsible for sending the appropriate system prompt (see the example at the end of this comment).

@GuillaumeBruand I'm afraid your context likely looks like this:

{system prompt here}
<|start_header_id|>system<|end_header_id|>
<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

This is technically incorrect: the system prompt configured at launch or passed through the API is not applied inside the chat template. That is stated in the llama-server README, but it's easy to get confused - one more reason to remove this functionality.
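
For reference, this is roughly what a correct client request looks like when the system prompt is sent as a system-role message (the second type above). This is only a minimal sketch, assuming the Python requests package and a llama-server instance on the default http://localhost:8080:

```python
# Sketch: pass the system prompt as a regular "system" message so the server's
# chat template formats it correctly. The prompt text, endpoint address, and
# use of the `requests` package are assumptions for the example.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},  # system prompt lives here
        {"role": "user", "content": "Hello"},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```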

@GuillaumeBruand

@ggerganov Thanks for the clarification. I will update my workflow according to your recommendations so that I no longer rely on this (deprecated?) feature.

@fquirin

fquirin commented Oct 15, 2024

@ggerganov I just noticed that the system_prompt feature was removed when I downloaded the new version and now I'm a bit confused.

What I understand from the discussion above is that 'system_prompt' never really worked well together with the chat template and actually put the prompt in front of the context formatted with the template. I did not know that and I'm surprised that it worked so well in my setup 😅.

Now that this has been removed, does this mean I have to submit the system prompt with every request to my server?

I really liked the 'system_prompt' option, because I'm always spinning up the server to use it with the same, rather long, system prompt 😞.

@ggerganov
Owner Author

ggerganov commented Oct 15, 2024

Yes, your understanding is correct. Although it wasn't used properly, I can imagine that it still helped in the intended way by prefixing the system prompt at the very beginning. But the main reason to remove this was to simplify the logic in llama-server a bit, because the system prompt had to be taken care of in many different places.

Now that this has been removed, does this mean I have to submit the system prompt with every request to my server?

Yes. We can probably think about reintroducing the option and using the CLI system prompt as the default value for the chat-template system prompt when the client does not pass one. But somebody with more experience with chat templates would have to implement this.
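
In the meantime, that behaviour can be approximated on the client side by prepending a fixed system message to every request. A minimal sketch - the DEFAULT_SYSTEM_PROMPT constant, the chat() helper, and the endpoint address are hypothetical, and the Python requests package is assumed:

```python
import requests

# Hypothetical client-side stand-in for the removed server option: a fixed
# system prompt prepended to every conversation unless one is already supplied.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."
SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default llama-server address

def chat(messages):
    # Prepend the default system message only when the caller didn't provide one.
    if not messages or messages[0].get("role") != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}] + messages
    resp = requests.post(SERVER_URL, json={"messages": messages})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat([{"role": "user", "content": "Hello"}]))
```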

@fquirin

fquirin commented Oct 15, 2024

Thanks for clarifying!

I think it would be really great to reintroduce the system prompt option in some way to "pre-condition" the server.

One more question. Do I have to use 'n_keep' in combination with the system prompt if I send it with every request (and the whole chat history)?

@ggerganov
Owner Author

ggerganov commented Oct 15, 2024

One more question. Do I have to use 'n_keep' in combination with the system prompt if I send it with every request (and the whole chat history)?

No, just send the requests with "cache_prompt": true and it will reuse the biggest common prefix (i.e. chat template tokens + system tokens + anything else) from the previous request.

Generally, n_keep is obsolete. Instead of the old context shifting where n_keep was needed, your client can keep a ring buffer of chat messages and start the server with the --cache-reuse option. This way, for very long conversations in which you discard old messages from the ring buffer and insert new ones, the server will compute only the new tokens and reuse the previous cache via KV shifting (this requires a model with RoPE-based attention). See #9866 for more details.
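
As a rough illustration of the client-side ring buffer idea (not actual llama-server or web ui code; the buffer size, prompt text, and endpoint address are made up, the requests package is assumed, and the server is assumed to have been started with the --cache-reuse option):

```python
from collections import deque

import requests

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default address
SYSTEM_MSG = {"role": "system", "content": "You are a helpful assistant."}

# Ring buffer of recent chat messages: once full, the oldest messages are dropped.
history = deque(maxlen=32)

def ask(user_text):
    history.append({"role": "user", "content": user_text})
    payload = {
        "messages": [SYSTEM_MSG] + list(history),
        "cache_prompt": True,  # reuse the largest common prefix from the previous request
    }
    resp = requests.post(SERVER_URL, json=payload)
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer
```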

@fquirin

fquirin commented Oct 15, 2024

Awesome, thanks a lot!

@vmajor

vmajor commented Jan 12, 2025

Sorry for the slight necromancy, but I would love some clarity on how to handle the system prompt correctly.

  1. Setting it at startup of llama-server is no longer possible / it never worked as intended.
  2. We can have it reused if we send requests with "cache_prompt": true and start the server with --cache-reuse.

What is the built-in llama-server GUI doing when it comes to the system prompt? It handles it perfectly: it responds quickly after the initial response (the prompt is not being sent every turn), and it adheres to the prompt with truly zero deviation, which is very important for my application. So I actually want to know what the llama-server GUI is doing, and I want to replicate exactly that with my own calls to the API.

EDIT: never mind, I inspected the network traffic for the payload, but if there is anything else someone can recommend, input is welcome.

@ggerganov
Owner Author

The web ui simply fills in the correct chat template. For example, using Qwen:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "id": 1736669686387,
            "role": "user",
            "content": "Hello"
        },
        {
            "id": 1736669686390,
            "role": "assistant",
            "content": "Hello! How can I assist you today?",
            "timings": {
                "prompt_n": 20,
                "prompt_ms": 696.321,
                "predicted_n": 13,
                "predicted_ms": 670.689
            }
        },
        {
            "id": 1736669772617,
            "role": "user",
            "content": "Just showing an example of system prompt usage."
        }
    ],
    "stream": true,
    "cache_prompt": true,
    ...
}

The system role is the message type that contains the system prompt. Just make sure to add it at the start of your request and it should work as expected. Add -lv 1 to the llama-server command to inspect the requests received from the web ui and better understand what data is being sent.

P.S. @ngxson While writing this answer, I noticed that the web ui sends back "timings" information for previous messages. I think we should remove these from the requests.

@atozj

atozj commented Jan 19, 2025 via email
