[Feature]: Chat Prefix Completion #13005
Comments
Oh, I miss the …
I think it would be awesome if it could be implemented only in the frontend, but I'm not sure how to do that. Are you suggesting that we first send a request with the stop token `</think>`?
Your understanding is right. I am doing some tests outside of vLLM, but there are still some errors reported and I am still looking for the reason. The test code is as follows:
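A minimal sketch of such a two-step test, assuming an OpenAI-compatible endpoint that already supports the DeepSeek-style `prefix` flag on the final assistant message; the model name, stop token, and prompt are illustrative, not the original snippet:

```python
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint that supports the DeepSeek-style
# "prefix" flag on the last assistant message (e.g. DeepSeek's beta API).
client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com/beta")

messages = [{"role": "user", "content": "Is 13 a prime number?"}]

# Step 1: generate freely, stopping once the thinking block closes.
# The stop string itself is not returned, so we re-append it below.
first = client.chat.completions.create(
    model="deepseek-chat",  # illustrative model name
    messages=messages,
    stop=["</think>"],
)
reasoning = (first.choices[0].message.content or "") + "</think>\n"

# Step 2: resend the reasoning as an assistant prefix so the model only has
# to complete the final answer after the thinking block.
messages.append({"role": "assistant", "content": reasoning, "prefix": True})
second = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
)
print(second.choices[0].message.content)
```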
I'm reading up on the beam search part of vLLM; I think that part of the code could serve as a useful reference.
🚀 The feature, motivation and pitch
Chat prefix completion follows the Chat Completion API, but the user provides an assistant prefix message for the model to complete. This lets the user manually specify the beginning of the assistant's response, which is very helpful for current reasoning models (DeepSeek R1's response should start with `<think>`) and for code generation (a response starting with ```` ```python ````). Another very useful application is letting the model continue generating after its output has been cut off by the length limit.
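For the length-cutoff case, the continuation loop could look like the sketch below, assuming the DeepSeek-style `prefix` flag; this is not an API vLLM exposes today:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com/beta")

messages = [{"role": "user", "content": "Explain how KV-cache paging works."}]
resp = client.chat.completions.create(
    model="deepseek-chat", messages=messages, max_tokens=64
)
full = resp.choices[0].message.content

# Whenever generation stops because of the token limit, feed everything
# produced so far back as an assistant prefix and let the model continue.
while resp.choices[0].finish_reason == "length":
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages + [{"role": "assistant", "content": full, "prefix": True}],
        max_tokens=64,
    )
    full += resp.choices[0].message.content
print(full)
```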
Alternatives
Here are the providers I know of that offer this functionality:
Different providers use different parameter formats. Personally, I prefer the DeepSeek and Mistral formats; here's an example from DeepSeek.
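Reconstructed along the lines of DeepSeek's documented usage (the `prefix` flag is only available on their beta base URL; the prompt is illustrative):

```python
from openai import OpenAI

# DeepSeek exposes chat prefix completion on its beta base URL; the last
# message must be an assistant message with "prefix": True.
client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com/beta")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": "Please write quick sort in Python."},
        # The model completes the message from this prefix onward.
        {"role": "assistant", "content": "```python\n", "prefix": True},
    ],
    stop=["```"],  # stop once the code block is closed
)
print(response.choices[0].message.content)
```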
Additional context
In addition, I think this feature is relevant to structured output for reasoning models (#12619).
I think the output of a reasoning model can be divided into two parts, the `reasoning_content` and the `content`, and as in some of the previous discussions (#12619), structured output should be applied to the final `content`. We could implement structured output for reasoning models externally if chat prefix completion were available. Example: supply an assistant prefix that closes the thinking block with `</think>`, so the constraint applies only to what follows it.
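A sketch of that external flow, assuming a server that accepted a DeepSeek-style prefix message; `guided_json` is vLLM's existing guided-decoding parameter, applied here only to the second request:

```python
import json
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
schema = {
    "type": "object",
    "properties": {"answer": {"type": "boolean"}},
    "required": ["answer"],
}
messages = [{"role": "user", "content": "Is 13 prime? Answer in JSON."}]

# Step 1: let the model reason without any grammar constraint, stopping
# once the thinking block closes.
first = client.chat.completions.create(
    model="deepseek-r1",  # illustrative model name
    messages=messages,
    stop=["</think>"],
)
prefix = (first.choices[0].message.content or "") + "</think>\n"

# Step 2 (hypothetical): constrain only the content that follows the
# reasoning by sending it back as an assistant prefix.
second = client.chat.completions.create(
    model="deepseek-r1",
    messages=messages + [{"role": "assistant", "content": prefix, "prefix": True}],
    extra_body={"guided_json": schema},
)
print(json.loads(second.choices[0].message.content))
```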
The current PRs #12955 and #12995 seem to require changes on the engine side or in the structured output backend (xgrammar), so if we decouple the thinking process from the actual output, could we implement this feature only in the frontend, similar to the beam search implementation that was moved out of the engine earlier?
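One way a frontend-only version could work without touching the engine: render the chat template yourself, append the assistant prefix, and go through the plain completions endpoint. A sketch, with the model name and prefix purely illustrative:

```python
from openai import OpenAI
from transformers import AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # illustrative
tok = AutoTokenizer.from_pretrained(MODEL)
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [{"role": "user", "content": "Is 13 prime?"}]

# Render the conversation with the generation prompt, then append the
# assistant prefix the caller wants the model to continue from.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"  # force the response to open a thinking block

out = client.completions.create(model=MODEL, prompt=prompt, max_tokens=256)
print(out.choices[0].text)
```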