[Feature]: Chat Prefix Completion #13005

Closed
liuyanyi opened this issue Feb 10, 2025 · 3 comments

@liuyanyi Contributor

🚀 The feature, motivation and pitch

Chat prefix completion follows the Chat Completion API: the user supplies the beginning of the assistant's message, and the model completes the rest. This lets the user manually specify how the assistant's response starts, which is very helpful for current reasoning models (DeepSeek R1's response should start with <think>) or for code generation (the response starts with ```python). Another very useful capability is letting the model continue generating after a previous response was cut off by the length limit (see the sketch after the DeepSeek example below).

Alternatives

Here are the providers I know of that offer this functionality:

  1. DeepSeek Chat Prefix Completion
  2. Aliyun Dashscope Partial mode
  3. siliconflow prefix
  4. Mistral AI

Different providers use different parameter formats; I prefer the DeepSeek and Mistral style. Here's an example from DeepSeek:

from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com/beta",  # prefix completion lives on the beta endpoint
)

messages = [
    {"role": "user", "content": "Please write quick sort code"},
    {"role": "assistant", "content": "```python\n", "prefix": True},  # <- set prefix=True
]
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stop=["```"],  # stop once the code block is closed
)
print(response.choices[0].message.content)
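The same mechanism also covers the length-continuation case mentioned above. Here is a minimal sketch, assuming the same DeepSeek beta endpoint and prefix parameter (the prompt and max_tokens value are only placeholders): if the first response stops with finish_reason == "length", the truncated text is fed back as an assistant prefix so the model resumes where it stopped.

from openai import OpenAI

client = OpenAI(api_key="<your api key>", base_url="https://api.deepseek.com/beta")

messages = [{"role": "user", "content": "Write a detailed explanation of quick sort"}]

# First pass: the response may be truncated by max_tokens.
first = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    max_tokens=64,
)
partial = first.choices[0].message.content

if first.choices[0].finish_reason == "length":
    # Feed the truncated text back as an assistant prefix so the model
    # continues from where it stopped instead of starting over.
    messages.append({"role": "assistant", "content": partial, "prefix": True})
    second = client.chat.completions.create(model="deepseek-chat", messages=messages)
    partial += second.choices[0].message.content

print(partial)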

Additional context

Beyond the feature itself, I think this is also relevant to structured output for reasoning models (#12619).

The output of a reasoning model can be divided into two parts, the reasoning_content and the content, and as discussed previously in #12619, structured output should only be applied to the final content.

If chat prefix completion is available, we can implement structured output for reasoning models externally. Example:

  • Step 1: Generate the message normally, with the stop token set to </think>
  • Step 2: Place the output of Step 1 in the assistant message, set prefix=True, and apply structured generation, etc. (see the sketch below)
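
To make the two steps concrete, here is a rough sketch. It assumes a hypothetical server that accepts both a DeepSeek-style prefix flag and a vLLM-style guided_choice extra parameter, purely to illustrate the flow; the base_url, model name and prompt are placeholders:

from openai import OpenAI

# Purely illustrative: assumes a server that understands a DeepSeek-style
# "prefix" flag and a vLLM-style "guided_choice" extra parameter.
client = OpenAI(api_key="<your api key>", base_url="<server with prefix support>/v1")
model = "<reasoning model>"

messages = [{"role": "user", "content": "Is this review positive or negative? 'Great phone, terrible battery.'"}]

# Step 1: let the model think freely, stopping at the end of the reasoning block.
step1 = client.chat.completions.create(model=model, messages=messages, stop=["</think>"])
thinking = step1.choices[0].message.content + "</think>"  # the stop string is not returned, so add it back

# Step 2: replay the reasoning as an assistant prefix; constrain only the final answer.
messages.append({"role": "assistant", "content": thinking, "prefix": True})
step2 = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(step2.choices[0].message.content)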

The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output engine (xgrammar). If we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to how beam search was removed from the engine earlier?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

@liuyanyi Contributor Author

Oh, I missed the continue_final_message arg in the chat completion API; this feature is already supported by setting two args: add_generation_prompt and continue_final_message.

{
  "messages": [
    {
      "content": "Please generate a python hello world in markdown block",
      "role": "user"
    },
    {
      "content": "```python",
      "role": "assistant"
    }
  ],
  "model": "qwen",
  "add_generation_prompt": false,
  "continue_final_message": true
}
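
For reference, the same request can be sent from the Python OpenAI client by passing the vLLM-specific fields through extra_body (a minimal sketch; the base_url, API key and served model name are assumptions about the local deployment):

from openai import OpenAI

# Assumes a local vLLM OpenAI-compatible server serving the model as "qwen".
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "user", "content": "Please generate a python hello world in markdown block"},
        {"role": "assistant", "content": "```python"},  # the prefix to continue from
    ],
    # vLLM-specific fields go through extra_body.
    extra_body={
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)
print(response.choices[0].message.content)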

@gaocegege Contributor

The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output engine (xgrammar). If we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to how beam search was removed from the engine earlier?

I think it would be awesome if it could be implemented only in the frontend, but I'm not sure how to do that. Are you suggesting that we first send a request with the stop token </think> and then follow up with another request using continue_final_message along with the generated ...</think>?

@liuyanyi Contributor Author

The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output engine (xgrammar). If we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to how beam search was removed from the engine earlier?

I think it would be awesome if it could be implemented only in the frontend, but I'm not sure how to do that. Are you suggesting that we first send a request with the stop token </think> and then follow up with another request using continue_final_message along with the generated ...</think>?

Your understanding is right. I am running some tests outside of vLLM, but some errors are still being reported and I am still looking into the cause. The test code is as follows:

from openai import OpenAI


client = OpenAI(api_key="sk-*", base_url="http://10.100.129.193:30005/v1")


# Step 1: generate normally, stopping at </think> (kept in the output via include_stop_str_in_output)
languages = [
    "English",
    "French",
    "Chinese",
    "Japanese",
]
msgs = [
    {
        "content": "Please Check the language of following text, The choice of the language can be {}\nThe text is: '你好吗,你是谁?'".format(
            ", ".join(languages)
        ),
        "role": "user",
    }
]

result = client.chat.completions.create(
    messages=msgs,
    model="DeepSeek-R1-Distill-Qwen-32B",
    stop=["</think>"],
    extra_body={"include_stop_str_in_output": True},
)

print(result.choices[0].message.content)

extra_think_msg = {
    "content": result.choices[0].message.content,
    "role": "assistant",
}

# Step 2: continue the assistant message and apply structured output
new_messages = msgs + [extra_think_msg]
# Currently, an error occurs when `continue_final_message` is set to True
result = client.chat.completions.create(
    messages=new_messages,
    model="DeepSeek-R1-Distill-Qwen-32B",
    extra_body={
        "guided_choice": languages,
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)

final_msg = result.choices[0].message.content
print(final_msg)

I'm reading through the beam search part of vLLM; I think that part of the code could serve as a useful reference.
