
Fix func call tokens for internlm2 #8506

Closed
RunningLeon wants to merge 1 commit

Conversation

RunningLeon (Contributor)

Fixes the issue where function call tokens are not shown when calling llama-server for internlm2.
Related issue: #8405

github-actions bot added the python (python script changes) label on Jul 16, 2024
Comment on lines +2233 to 2234
if foken_data.get("special") and not foken_data["content"] in func_call_tokens:
toktypes[token_id] = SentencePieceTokenTypes.CONTROL
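
For context, the check under review roughly amounts to the following (a hedged sketch, not the converter's actual code: the surrounding loop, the added_tokens dict, and the exact contents of func_call_tokens are assumptions based on this diff and the discussion below):

# Hedged sketch of the reviewed check; only the two-line condition above comes
# from the actual diff, everything else here is illustrative scaffolding.
from enum import IntEnum

class SentencePieceTokenTypes(IntEnum):  # stand-in for the converter's enum
    NORMAL = 1
    CONTROL = 3
    USER_DEFINED = 4

# Function-call markers mentioned in this thread (assumed list).
func_call_tokens = ("<|plugin|>", "<|action_start|>", "<|action_end|>")

# token_id -> entry from tokenizer_config.json (illustrative ids and values).
added_tokens = {
    92538: {"content": "<|plugin|>", "special": True},
    92543: {"content": "<|im_start|>", "special": True},
}

toktypes: dict[int, SentencePieceTokenTypes] = {}
for token_id, foken_data in added_tokens.items():
    if foken_data.get("special") and foken_data["content"] not in func_call_tokens:
        # Ordinary special tokens (e.g. <|im_start|>) are marked CONTROL,
        # so they are hidden when the output is detokenized ...
        toktypes[token_id] = SentencePieceTokenTypes.CONTROL
    else:
        # ... while the function-call markers keep a renderable type
        # (USER_DEFINED), which is what the review below pushes back on.
        toktypes[token_id] = SentencePieceTokenTypes.USER_DEFINED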
compilade (Collaborator)

I'm not sure these tokens should be marked as USER_DEFINED; this would mean they would always be specially pre-tokenized even if parse_special is false, making it impossible to avoid injections from text containing these tokens when not desired. This is related to #8228.

If this is simply a display issue, then it might be more appropriate to revisit whether to detokenize the control tokens output by the model.

These may be relevant:

assistant_ss << llama_token_to_piece(ctx, id, false);

const std::string token_str = llama_token_to_piece(ctx, result.tok, false);

RunningLeon (Contributor, Author)

@compilade hi, these special tokens are used to parse function content in the output string, meaning they should be pre-tokenized even if parse_special is false. The following is an example of how the function call is parsed out of the generated text. Now llama-cli works with --special, but llama-server does not work with --special. Ideally, it's desired only to show these function-call related special tokens.

    tool_calls = None
    if request.tool_choice != 'none' and '<|plugin|>' in text:
        if final_res.finish_reason == 'stop':
            final_res.finish_reason = 'tool_calls'
        # TODO may move to generate function
        text, action = text.split('<|action_start|><|plugin|>')
        action = action.split('<|action_end|>'.strip())[0]
        action = action[action.find('{'):]

compilade (Collaborator)

@RunningLeon Thanks for giving an example.

> these special tokens are used to parse function content in the output string, meaning they should be pre-tokenized even if parse_special is false.

If I understand correctly, what you want is to get the function call tokens to render in the output, right? Pre-tokenization is about the input. If these tokens were pre-tokenized even when parse_special is false, this means it would be impossible to include <|plugin|> in some non-special text without the model seeing it as the <|plugin|> token.

For example:

$ ./bin/llama-tokenize --log-disable -m ../models/internlm2_5-vocab.gguf -p "What is a <|plugin|>?"
     1 -> '<s>'
  3993 -> 'What'
   505 -> ' is'
   395 -> ' a'
   262 -> ' '
 92538 -> '<|plugin|>'
   345 -> '?'

$ ./bin/llama-tokenize --log-disable -m ../models/internlm2_5-vocab.gguf -p "What is a <|plugin|>?" --no-parse-special
     1 -> '<s>'
  3993 -> 'What'
   505 -> ' is'
   395 -> ' a'
   497 -> ' <'
   352 -> '|'
  9267 -> 'plugin'
   352 -> '|'
 46973 -> '>?'

If the problem is about the output of llama-server, this should be fixable by changing how it calls the llama_token_to_piece function.

> Ideally, it's desired only to show these function-call related special tokens.

If you want to hide control tokens, while still showing these ones, then... hmm. This seems complicated to do with the current token attributes (USER_DEFINED and CONTROL), given that USER_DEFINED is intended for always pre-tokenized tokens like the multi-space tokens in GPT-NeoX, while CONTROL is intended for tokens with special meaning, like <|im_start|>, and in my opinion the function call tokens fit the intention for CONTROL tokens.
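
To make the distinction concrete, here is a toy sketch of the rule described above (illustrative only, not llama.cpp's actual tokenizer code; the attribute assigned to <|plugin|> mirrors what this PR would do):

# Toy model of the two token attributes as described in this thread:
# USER_DEFINED tokens are always matched in the input and always rendered in
# the output, while CONTROL tokens are only matched when parse_special is on
# and only rendered when special is on.
import re

TOKEN_ATTRS = {
    "<|plugin|>":   "USER_DEFINED",  # the effect of this PR (assumed)
    "<|im_start|>": "CONTROL",
    "<|im_end|>":   "CONTROL",
}

def split_input(text: str, parse_special: bool) -> list[str]:
    """Split the input on the special tokens that are active for this call."""
    active = [t for t, attr in TOKEN_ATTRS.items()
              if attr == "USER_DEFINED" or parse_special]
    pattern = "(" + "|".join(re.escape(t) for t in active) + ")"
    return [p for p in re.split(pattern, text) if p]

def render_output(pieces: list[str], special: bool) -> str:
    """Detokenize: hide CONTROL tokens unless the caller asked for them."""
    return "".join(p for p in pieces
                   if special or TOKEN_ATTRS.get(p) != "CONTROL")

# The injection concern: with <|plugin|> marked USER_DEFINED it is split out
# of user text even with parse_special=False.
print(split_input("What is a <|plugin|>?", parse_special=False))
# -> ['What is a ', '<|plugin|>', '?']

# The display question: a CONTROL token only shows up with special=True,
# which is what llama-cli --special (and the server patch below) toggles.
print(render_output(["<|im_start|>", "assistant", "<|plugin|>"], special=False))
# -> 'assistant<|plugin|>'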

Why not show all special tokens, like llama-cli --special does?

Would this work?

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index badeb912..7813a295 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -1182,7 +1182,7 @@ struct server_context {
 
     bool process_token(completion_token_output & result, server_slot & slot) {
         // remember which tokens were sampled - used for repetition penalties during sampling
-        const std::string token_str = llama_token_to_piece(ctx, result.tok, false);
+        const std::string token_str = llama_token_to_piece(ctx, result.tok, params.special);
         slot.sampled = result.tok;
 
         // search stop word and delete it

> Now llama-cli works with --special, but llama-server does not work with --special.

This is because llama-cli handles it here:

const std::string token_str = llama_token_to_piece(ctx, id, params.special);

RunningLeon (Contributor, Author)

@compilade thanks for your quick response. Yes, ideally, we want to hide control tokens and show the function-call related tokens. But since <|plugin|> can also appear in the input, that could be a problem, so llama-server with --special works for me. @apresence, hi, as the actual user, what do you think?

apresence

Yes, I believe that works! I'm happy to test it once a fix is available.

RunningLeon (Contributor, Author), Jul 18, 2024

@compilade @ggerganov hi, guys. What is a good way to include special tokens in the input when using llama-cli and llama-server? I found something interesting.
The sys prompt is '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant.<|im_end|>\n<|im_start|>system name=<|plugin|>[{"name": "get_current_weather", "parameters": {"required": ["location"], "type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "unit": {"type": "string"}}}, "description": "Get the current weather in a given location"}]<|im_end|>\n'
which has special tokens.
It does not work when the sys prompt is passed with --prompt when starting llama-server, but it does work with --system-prompt-file when the prompt is put in a local file. Besides, it also does not work when special tokens are put in the messages, like this:

from openai import OpenAI
client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url='http://localhost:8080/v1'
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
  model=model_name,
  messages=[
    {"role": "system", "content": 'You are InternLM2-Chat, a harmless AI assistant.<|im_end|>\n<|im_start|>system name=<|plugin|>[{"name": "get_current_weather", "parameters": {"required": ["location"], "type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "unit": {"type": "string"}}}, "description": "Get the current weather in a given location"}]'},
    {"role": "user", "content": "I want to know today's weather in Shanghai"},
  ],
  temperature=0.8,
  top_p=0.8
)
print(response)
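
One way to check what the server actually receives is its /tokenize endpoint (a hedged example; the endpoint and its content field exist in llama-server, though whether special tokens are parsed there depends on the server version and flags):

# Hedged example: ask a running llama-server (default http://localhost:8080)
# how it tokenizes a string that contains special tokens.
# Requires: pip install requests
import requests

text = "<|im_start|>system name=<|plugin|>[...]<|im_end|>"
resp = requests.post("http://localhost:8080/tokenize", json={"content": text})
resp.raise_for_status()
tokens = resp.json()["tokens"]
print(tokens)
# If <|plugin|> is treated as a special token it appears as a single id
# (92538 in the llama-tokenize output earlier in this thread); otherwise it
# is split into several ordinary text tokens.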

ggerganov (Owner)

Fixed in #8553

ggerganov closed this on Jul 18, 2024