[Feature]: Chat Prefix Completion #13005

Closed
liuyanyi opened this issue Feb 10, 2025 · 3 comments

@liuyanyi Contributor

🚀 The feature, motivation and pitch

Chat prefix completion follows the Chat Completion API: the user supplies the beginning of the assistant's message, and the model completes the rest. This lets the user manually specify how the assistant's response starts, which is very helpful for current reasoning models (DeepSeek R1's response should start with <think>) or for code generation (the response starts with ```python). Another very useful capability is letting the model continue generating after a previous response was cut off by the length limit (see the sketch after the DeepSeek example below).

Alternatives

Here are the providers I know of that offer this functionality:

  1. DeepSeek Chat Prefix Completion
  2. Aliyun Dashscope Partial mode
  3. siliconflow prefix
  4. Mistral AI

Different providers use different parameter formats; I prefer the DeepSeek and Mistral style. Here's an example from DeepSeek:

from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com/beta",  # prefix completion lives on the beta endpoint
)

messages = [
    {"role": "user", "content": "Please write quick sort code"},
    {"role": "assistant", "content": "```python\n", "prefix": True},  # <- set prefix=True
]
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stop=["```"],  # stop once the code block is closed
)
print(response.choices[0].message.content)
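The same mechanism also covers the length-continuation case mentioned above. Here is a minimal sketch, assuming the same DeepSeek beta endpoint and prefix parameter (the prompt and max_tokens value are only placeholders): if the first response stops with finish_reason == "length", the truncated text is fed back as an assistant prefix so the model resumes where it stopped.

from openai import OpenAI

client = OpenAI(api_key="<your api key>", base_url="https://api.deepseek.com/beta")

messages = [{"role": "user", "content": "Write a detailed explanation of quick sort"}]

# First pass: the response may be truncated by max_tokens.
first = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    max_tokens=64,
)
partial = first.choices[0].message.content

if first.choices[0].finish_reason == "length":
    # Feed the truncated text back as an assistant prefix so the model
    # continues from where it stopped instead of starting over.
    messages.append({"role": "assistant", "content": partial, "prefix": True})
    second = client.chat.completions.create(model="deepseek-chat", messages=messages)
    partial += second.choices[0].message.content

print(partial)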

Additional context

Beyond the feature itself, I think this is also relevant to structured output for reasoning models (#12619).

The output of a reasoning model can be divided into two parts, the reasoning_content and the content, and as discussed previously in #12619, structured output should only be applied to the final content.

If chat prefix completion is available, we can implement structured output for reasoning models externally. Example:

  • Step 1: Generate the message normally, with the stop token set to </think>
  • Step 2: Place the output of Step 1 in the assistant message, set prefix=True, and apply structured generation, etc. (see the sketch below)
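
To make the two steps concrete, here is a rough sketch. It assumes a hypothetical server that accepts both a DeepSeek-style prefix flag and a vLLM-style guided_choice extra parameter, purely to illustrate the flow; the base_url, model name and prompt are placeholders:

from openai import OpenAI

# Purely illustrative: assumes a server that understands a DeepSeek-style
# "prefix" flag and a vLLM-style "guided_choice" extra parameter.
client = OpenAI(api_key="<your api key>", base_url="<server with prefix support>/v1")
model = "<reasoning model>"

messages = [{"role": "user", "content": "Is this review positive or negative? 'Great phone, terrible battery.'"}]

# Step 1: let the model think freely, stopping at the end of the reasoning block.
step1 = client.chat.completions.create(model=model, messages=messages, stop=["</think>"])
thinking = step1.choices[0].message.content + "</think>"  # the stop string is not returned, so add it back

# Step 2: replay the reasoning as an assistant prefix; constrain only the final answer.
messages.append({"role": "assistant", "content": thinking, "prefix": True})
step2 = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(step2.choices[0].message.content)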

The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output engine (xgrammar). If we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to how beam search was removed from the engine earlier?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

@liuyanyi Contributor Author

Oh, I missed the continue_final_message arg in the chat completion API; this feature is already supported by setting two args: add_generation_prompt and continue_final_message.

{
  "messages": [
    {
      "content": "Please generate a python hello world in markdown block",
      "role": "user"
    },
    {
      "content": "```python",
      "role": "assistant"
    }
  ],
  "model": "qwen",
  "add_generation_prompt": false,
  "continue_final_message": true
}
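
For reference, the same request can be sent from the Python OpenAI client by passing the vLLM-specific fields through extra_body (a minimal sketch; the base_url, API key and served model name are assumptions about the local deployment):

from openai import OpenAI

# Assumes a local vLLM OpenAI-compatible server serving the model as "qwen".
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "user", "content": "Please generate a python hello world in markdown block"},
        {"role": "assistant", "content": "```python"},  # the prefix to continue from
    ],
    # vLLM-specific fields go through extra_body.
    extra_body={
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)
print(response.choices[0].message.content)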

@gaocegege Contributor

The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output engine (xgrammar). If we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to how beam search was removed from the engine earlier?

I think it would be awesome if it could be implemented only in the frontend, but I'm not sure how to do that. Are you suggesting that we first send a request with the stop token </think> and then follow up with another request using continue_final_message along with the generated ...</think>?

@liuyanyi Contributor Author

The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output engine (xgrammar). If we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to how beam search was removed from the engine earlier?

I think it would be awesome if it could be implemented only in the frontend, but I'm not sure how to do that. Are you suggesting that we first send a request with the stop token </think> and then follow up with another request using continue_final_message along with the generated ...</think>?

Your understanding is right. I am running some tests outside of vLLM, but some errors are still being reported and I am still looking into the cause. The test code is as follows:

from openai import OpenAI


client = OpenAI(api_key="sk-*", base_url="http://10.100.129.193:30005/v1")


# Step 1: generate normally, stopping at </think> (kept in the output via include_stop_str_in_output)
languages = [
    "English",
    "French",
    "Chinese",
    "Japanese",
]
msgs = [
    {
        "content": "Please Check the language of following text, The choice of the language can be {}\nThe text is: '你好吗,你是谁?'".format(
            ", ".join(languages)
        ),
        "role": "user",
    }
]

result = client.chat.completions.create(
    messages=msgs,
    model="DeepSeek-R1-Distill-Qwen-32B",
    stop=["</think>"],
    extra_body={"include_stop_str_in_output": True},
)

print(result.choices[0].message.content)

extra_think_msg = {
    "content": result.choices[0].message.content,
    "role": "assistant",
}

# Step 2: continue the assistant message and apply structured output
new_messages = msgs + [extra_think_msg]
# Currently, an error occurs when `continue_final_message` is set to True
result = client.chat.completions.create(
    messages=new_messages,
    model="DeepSeek-R1-Distill-Qwen-32B",
    extra_body={
        "guided_choice": languages,
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)

final_msg = result.choices[0].message.content
print(final_msg)

I'm reading through the beam search part of vLLM; I think that part of the code could serve as a useful reference.
