
server: fix system_tokens being erased in kv_cache; #6312

Open · wants to merge 2 commits into master

Conversation

@MasterYi1024 (Contributor)

Hi llama.cpp developers :)

I have been reading the code these days, and I think there is a chance that system tokens may get erased from the KV cache in the server example by this line:

llama_kv_cache_seq_rm (ctx, slot.id + 1, n_keep            , n_keep + n_discard);

From my limited knowledge of the server code (I may be wrong, there is a lot of code to read, sorry), I think that in the KV cache the system_tokens occupy positions 0 up to their length, and the n_keep prompt_tokens come right after the system_tokens. So if the line above removes the tokens between n_keep and n_keep + n_discard, it will remove some of the system_tokens, which makes the generation stop working or produce something meaningless.
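
To make the layout I have in mind concrete, below is a rough sketch of the kind of adjustment I am proposing. It is not the exact diff of this PR: the helper function, its name, and its parameters are only for illustration, and it assumes the per-slot layout described above, plus the llama_kv_cache_seq_rm / llama_kv_cache_seq_add API as declared in llama.h.

#include "llama.h"

// Sketch only: discard a window of old tokens for one slot without ever
// touching the system tokens at the front of the sequence.
// Assumed layout of a slot's sequence:
//   [0, n_system)                          -> system_tokens (must stay)
//   [n_system, n_system + n_keep)          -> kept prompt tokens (must stay)
//   [n_system + n_keep, n_system + n_past) -> shiftable conversation tokens
static void shift_context_keeping_system(
        llama_context * ctx,
        int32_t seq_id,    // slot.id + 1 in the server example
        int32_t n_system,  // (int) system_tokens.size()
        int32_t n_keep,    // slot.params.n_keep
        int32_t n_past) {  // number of tokens after the system prompt
    const int32_t n_left    = n_past - n_keep;
    const int32_t n_discard = n_left / 2;
    if (n_discard <= 0) {
        return;
    }

    // Remove a window that starts after the system tokens and the kept
    // prompt, so positions [0, n_system + n_keep) are never erased.
    llama_kv_cache_seq_rm (ctx, seq_id,
                           n_system + n_keep,
                           n_system + n_keep + n_discard);

    // Slide the remaining tokens back over the freed window so that the
    // positions in the sequence stay contiguous.
    llama_kv_cache_seq_add(ctx, seq_id,
                           n_system + n_keep + n_discard,
                           n_system + n_past,
                           -n_discard);
}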

Below is my test. This problem can only be reproduced with a specific number of tokens; I just ran into it in my daily tests, which is why everything is in Chinese. Sorry again ;p

The system prompt asks the assistant to summarize some text. I wrote it into a translator.json file and use the -spf parameter to load it:

{
    "prompt": "Assistant's name is John. # CONTEXT # 我需要你总结概括一段文字。 # OBJECTIVE # 阅读用户发给你的文本,总结概括文本内容,为了用户阅读方便,请始终使用与原文本相同的语言进行总结概括。 # STYLE # 不需要有什么风格。 # TONE # 总结概括。 # AUDIENCE # 任何想要了解一段文字大意的人。 # RESPONSE # 回答应该明确易懂,简洁明了。使用与用户输入相同的语言。",
    "anti_prompt": "User",
    "assistant_name": "Assistant"
}

Then I start the server with these parameters. Notice that -c is commented out, so it takes its default value of 512. You can also see that I'm using a Qwen model on an RTX 4090 card:

"-m",
// Qwen1.5-7B-Chat
"/home/kylin/ai/models/my-converted/qwen1_5-7b-chat-q5_k_m.gguf",
// "-c",
// "2048",
"-ngl",
"256",
"-n",
"256",
"-t",
"32",
"-tb",
"32",
"-np",
"1",
"-cb",
"-spf",
"/home/kylin/ai/llama.cpp/prompts/translator.json",
"--verbose",
"--metrics",

With the server running, I call curl with five questions:

1. What is your name?
2. What is your system prompt?
3. A long text to summarize (the exact text is included at the end of this description).
4. What is your name?
5. What is your system prompt?
  • before this PR

And these are the generations:
[screenshot 001: generations before the long summarization request]

As you can see, the first time I ask about its system prompt and name, it answers correctly. But after I give it a long text to summarize, it forgets its name and system prompt (in this screenshot I only ask about the name).
[screenshot 002: generations after the long summarization request, before this PR]

  • after applying this PR

These are the generations after applying this PR:
[screenshots 003, 004, 005: generations after applying this PR]

As you can see, now it remembers who it is and what it should do!

Summary

I made this change just because I find it works a little better than before. I don't really understand the logic of the two parts of the token shift in the server example; I tried hard to read through them, but they are still not really clear to me. If there is any documentation on the kv_cache and these two parts of the shift code in the server example, that would be great. Thanks a lot :)

If this change is wrong, feel free to close the PR.

Below is the exact text I send to the server:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "user",
    "content": "小羊着急地呼唤着羊妈妈,羊妈妈似乎听到了什么,就在小羊刚刚抱过她的地方, 羊妈妈仿佛感觉到了小羊的存在。 羊妈妈回想起了和小羊一起度过的那些美好的瞬间,她想起自己曾经答应过小羊,要给他织一件毛衣的。于是,羊妈妈拿出床下已经放了很久的半件毛衣,拾回毛线针,认真地把它织完。此时,小羊 已经被抓回了队伍里。小羊不知道妈妈现在是否还在伤心,难过地坐在了 地上。而羊妈妈还在努力地织着毛衣。过了好久,羊妈妈终于织好了。她举起毛衣,像是想拿给天上的小羊看。小羊收到了妈妈传递来的情感,天上也出现了一 件毛衣。光秃秃的他小心地穿上了妈妈织的毛衣,开心地跳了起来,心满意足地去了河对岸,进入了天堂。进入 天堂后,小羊回过头看了一眼,发现狼守卫的背后秃了一大片。原来,如果光是用小羊的羊毛的话,绳子的长度是不够到 达地面的。狼 守卫乘小羊睡着的时候,偷偷剃下了自己背上的毛,把自己的毛也编进了绳子,让它更长了。狼守卫也回头看了一眼小羊,不好意思地笑了一下。小羊向他挥了挥手,到天堂里面去了。此时,羊妈妈看向了天空。夜晚到 了,天堂的门前已经没有在等候的人了"
}
], "cache_tokens": true
}'

I know the new version of the server would say "context is too long for kv_cache, ...", so you have to use the exact text above to reproduce this issue.

Thanks in advance:)

@phymbert (Collaborator) left a comment


I do not use the global system prompt, and considering the chat completions route, I do not think it is widely used. But if your theory is true, please show us a test first, ideally using the server test framework.

EDIT: I read too quickly; running CI workflows.

@phymbert (Collaborator)

And thanks for looking into this. Contributions are really welcome. You can have a look at the good first issue label if you plan to help further.

@MasterYi1024 (Contributor, author)

Thanks for your reply : )

I haven't read the server test framework code yet. I'll work on it, and I will take a look at the good first issues, too.

@FSSRepo (Collaborator) commented Apr 6, 2024

@phymbert Originally, I implemented that feature so that the server can function as a chatbot with multiple clients, with the system prompt remaining immutable and not needing to be reprocessed for each client. Thus, a client only needs to send the conversation context, using the user and assistant names provided by the /props endpoint.
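
Conceptually, the sharing works roughly like the sketch below. This is a simplified illustration rather than the exact server code; the helper name, the use of sequence 0 for the system prompt, and the slot-to-sequence mapping (slot.id + 1) are assumptions based on the discussion above.

#include "llama.h"

#include <vector>

// Sketch only: evaluate the system prompt once into sequence 0, then share
// its KV cells with every slot sequence so it never has to be reprocessed.
// Assumes the system prompt fits in a single batch.
static bool share_system_prompt(
        llama_context * ctx,
        std::vector<llama_token> & system_tokens,
        int32_t n_slots) {
    // Decode the system tokens once, into sequence 0, starting at position 0.
    llama_batch batch = llama_batch_get_one(
            system_tokens.data(), (int32_t) system_tokens.size(), 0, 0);
    if (llama_decode(ctx, batch) != 0) {
        return false;
    }

    // Copy (share) the system prompt KV cells into each slot's sequence,
    // so every slot sees the same immutable prefix at positions [0, size).
    for (int32_t i = 0; i < n_slots; ++i) {
        llama_kv_cache_seq_cp(ctx, 0, i + 1, 0, (llama_pos) system_tokens.size());
    }
    return true;
}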

@mofosyne added the labels bugfix (fixes an issue or bug) and Review Complexity: Medium (generally requires more time to grok, but manageable by beginner to medium expertise level) on May 10, 2024