server: fix system_tokens being erased in kv_cache; #6312
Hi llama.cpp developers :)
I have been reading the code these days, and I think there is a chance that the system tokens can get erased from the KV cache in the server example by this line:

From my limited knowledge of the server code (I may be wrong, there is a lot of code to read, sorry), in the `kv_cache` the `system_tokens` occupy positions 0 up to their length, and after the `system_tokens` come the `prompt_tokens`, so `n_keep` is an offset into the prompt, not into the whole cache. If the code referenced above removes the tokens between `n_keep` and `n_keep + n_discard`, it therefore removes some of the `system_tokens`, which makes the generation stop working or produce something meaningless.
Below is my test. The problem can only be reproduced with certain specific token counts; I just ran into it during my daily tests, which is why everything is in Chinese. Sorry again ;p

The system prompt asks the assistant to summarize some text. I wrote it into a translater.json file and loaded it with the `-spf` parameter:
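(My real prompt was in the screenshot; for reference, the `-spf` file is JSON, and if I read the server README correctly it expects roughly this shape — the values below are placeholders, not my actual prompt:)

```json
{
    "system_prompt": {
        "prompt": "You are a summarizer. Condense the text the user sends.",
        "anti_prompt": "User:",
        "assistant_name": "Assistant:"
    }
}
```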
Then I start the server with these parameters. Notice that `-c` is commented out, so it keeps its default value of 512. You can also see I am using a Qwen model on an RTX 4090 card. With the server running, I call curl with five questions:
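My exact command line and requests were in the screenshots; roughly, with the model path and prompt text as placeholders (the `-m`, `-ngl`, and `-spf` flags and the `/completion` request shape are the server's real ones), they looked like this:

```sh
# launch: -c is left out, so the context size falls back to the default 512
./server -m qwen-chat.gguf -ngl 99 -spf translater.json

# one of the five questions (the real ones were in Chinese)
curl http://localhost:8080/completion \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is your name?", "n_predict": 128}'
```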
And these are the generations:
As you can see, the first time I ask about its system prompt and its name, it answers correctly. But after I give it a long text to summarize, it forgets its name and system prompt (in this picture I only ask about the name).
Here is the generation after applying this PR:
Now you can see that it remembers who it is and what it should do!
Summary
I made this change because I found it works a little better than before. I do not really understand the logic of the two parts of the token-shift code in the server example; I tried hard to read through them, but they are still not really clear to me. If there were some documentation on the `kv_cache` and these two token-shift code paths in the server example, that would be great. Thanks a lot :)
If this change is wrong, feel free to close the PR.
Below is the exact text I sent to the server:
I know the new version of the server would say "context is too long for kv_cache, ...", so you have to use this exact text to reproduce the issue.
Thanks in advance :)