server : remove system prompt support #9811

Closed
ggerganov opened this issue Oct 9, 2024 · 13 comments · Fixed by #9857

@ggerganov
Owner

The "system_prompt" related functionality is quite outdated and is introducing unnecessary complexity. It only sort of makes sense for non-finetuned models in order to save the computation of a common prefix when there are multiple parallel slots. But in practice, only finetuned models are utilized for this use case and they always require a chat template, which is incompatible with the current implementation of the system prompt. So in order to simplify the code a bit, we should remove the system prompt related functionality from the server.

@ggerganov ggerganov self-assigned this Oct 9, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 9, 2024
@GuillaumeBruand

I am using system_prompt along with parallel slots and the built-in chat template of each model. What will be the preferred way to keep the current behaviour, i.e. a conditioned chatbot for all my users?

@ggerganov
Owner Author

What does your context currently look like? Is it:

[system prompt tokens]
<|im_start|>system
[another system prompt here?]<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

@GuillaumeBruand

Kind of; for Llama 3.1 it looks like:

<|start_header_id|>system<|end_header_id|>
{system prompt here}<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

My understanding is that {system prompt} is updated at launch or through the API endpoint, and that to replicate the current behaviour I would need to update the template myself.

@ggerganov
Owner Author

There are two types of system prompts:

  • The one implemented in llama-server that I would like to remove. The way it works is that it is prefixed to all other tokens: regardless of whether there is a chat template or not, system prompt tokens of this kind end up at the start of the context (see my message earlier).
  • The second type of system prompt is the one that can be passed from the client through the chat template, for example by adding a message with the system role. This continues to work as always - the client is responsible for sending the appropriate system prompt (see the example at the end of this comment).

@GuillaumeBruand I'm afraid your context likely looks like this:

{system prompt here}
<|start_header_id|>system<|end_header_id|>
<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

This is technically incorrect: the system prompt configured at launch or passed through the API is not applied inside the chat template. That is stated in the llama-server README, but it's easy to get confused - one more reason to remove this functionality.
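
For reference, this is roughly what a correct client request looks like when the system prompt is sent as a system-role message (the second type above). This is only a minimal sketch, assuming the Python requests package and a llama-server instance on the default http://localhost:8080:

```python
# Sketch: pass the system prompt as a regular "system" message so the server's
# chat template formats it correctly. The prompt text, endpoint address, and
# use of the `requests` package are assumptions for the example.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},  # system prompt lives here
        {"role": "user", "content": "Hello"},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```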

@GuillaumeBruand

@ggerganov Thanks for the clarification. I will update my workflow according to your recommendations so that I no longer rely on this (deprecated?) feature.

@fquirin

fquirin commented Oct 15, 2024

@ggerganov I just noticed that the system_prompt feature was removed when I downloaded the new version and now I'm a bit confused.

What I understand from the discussion above is that 'system_prompt' never really worked well together with the chat template and actually put the prompt in front of the context formatted with the template. I did not know that and I'm surprised that it worked so well in my setup 😅.

Now that this has been removed, does this mean I have to submit the system prompt with every request to my server?

I really liked the 'system_prompt' option, because I'm always spinning up the server to use it with the same, rather long, system prompt 😞.

@ggerganov
Owner Author

ggerganov commented Oct 15, 2024

Yes, your understanding is correct. Although it wasn't used properly, I can imagine that it still helped in the intended way by prefixing the system prompt at the very beginning. But the main reason to remove this was to simplify the logic in llama-server a bit, because the system prompt had to be taken care of in many different places.

Now that this has been removed, does this mean I have to submit the system prompt with every request to my server?

Yes. We can probably think about reintroducing the option and using the CLI system prompt as the default value for the chat-template system prompt when the client does not pass one. But somebody with more experience with chat templates would have to implement this.
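
In the meantime, that behaviour can be approximated on the client side by prepending a fixed system message to every request. A minimal sketch - the DEFAULT_SYSTEM_PROMPT constant, the chat() helper, and the endpoint address are hypothetical, and the Python requests package is assumed:

```python
import requests

# Hypothetical client-side stand-in for the removed server option: a fixed
# system prompt prepended to every conversation unless one is already supplied.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."
SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default llama-server address

def chat(messages):
    # Prepend the default system message only when the caller didn't provide one.
    if not messages or messages[0].get("role") != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}] + messages
    resp = requests.post(SERVER_URL, json={"messages": messages})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat([{"role": "user", "content": "Hello"}]))
```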

@fquirin

fquirin commented Oct 15, 2024

Thanks for clarifying!

I think it would be really great to reintroduce the system prompt option in some way to "pre-condition" the server.

One more question. Do I have to use 'n_keep' in combination with the system prompt if I send it with every request (and the whole chat history)?

@ggerganov
Owner Author

ggerganov commented Oct 15, 2024

One more question. Do I have to use 'n_keep' in combination with the system prompt if I send it with every request (and the whole chat history)?

No, just send the requests with "cache_prompt": true and it will reuse the biggest common prefix (i.e. chat template tokens + system tokens + anything else) from the previous request.

Generally, n_keep is obsolete. Instead of the old context shifting where n_keep was needed, your client can keep a ring buffer of chat messages and start the server with the --cache-reuse option. This way, for very long conversations in which you discard old messages from the ring buffer and insert new ones, the server will compute only the new tokens and reuse the previous cache via KV shifting (this requires a model with RoPE-based attention). See #9866 for more details.
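
As a rough illustration of the client-side ring buffer idea (not actual llama-server or web ui code; the buffer size, prompt text, and endpoint address are made up, the requests package is assumed, and the server is assumed to have been started with the --cache-reuse option):

```python
from collections import deque

import requests

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default address
SYSTEM_MSG = {"role": "system", "content": "You are a helpful assistant."}

# Ring buffer of recent chat messages: once full, the oldest messages are dropped.
history = deque(maxlen=32)

def ask(user_text):
    history.append({"role": "user", "content": user_text})
    payload = {
        "messages": [SYSTEM_MSG] + list(history),
        "cache_prompt": True,  # reuse the largest common prefix from the previous request
    }
    resp = requests.post(SERVER_URL, json=payload)
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer
```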

@fquirin

fquirin commented Oct 15, 2024

Awesome, thanks a lot!

@vmajor

vmajor commented Jan 12, 2025

Sorry for the slight necromancy, but I would love some clarity on how to handle the system prompt correctly.

  1. Setting it at startup of llama-server is no longer possible / it never worked as intended.
  2. We can have it reused if we send requests with "cache_prompt": true and start the server with --cache-reuse.

What is the built-in llama-server GUI doing when it comes to the system prompt? It handles it perfectly: it responds quickly after the initial response (the prompt is not being sent every turn), and it adheres to the prompt with truly zero deviation, which is very important for my application. So I actually want to know what the llama-server GUI is doing, and I want to replicate exactly that with my own calls to the API.

EDIT: never mind, I inspected the network traffic for the payload, but if there is anything else someone can recommend, input is welcome.

@ggerganov
Owner Author

The web ui simply fills in the correct chat template. For example, using Qwen:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "id": 1736669686387,
            "role": "user",
            "content": "Hello"
        },
        {
            "id": 1736669686390,
            "role": "assistant",
            "content": "Hello! How can I assist you today?",
            "timings": {
                "prompt_n": 20,
                "prompt_ms": 696.321,
                "predicted_n": 13,
                "predicted_ms": 670.689
            }
        },
        {
            "id": 1736669772617,
            "role": "user",
            "content": "Just showing an example of system prompt usage."
        }
    ],
    "stream": true,
    "cache_prompt": true,
    ...
}

The system role is the message type that contains the system prompt. Just make sure to add it at the start of your request and it should work as expected. Add -lv 1 to the llama-server command to inspect the requests received from the web ui and better understand what data is being sent.

P.S. @ngxson While writing this answer, I noticed that the web ui sends back "timings" information for previous messages. I think we should remove these from the requests.

@atozj

atozj commented Jan 19, 2025 via email
