🚀 The feature, motivation and pitch
vLLM currently supports speculative decoding, in which a smaller draft model proposes speculative tokens that the main model then verifies. Since both models are already loaded in VRAM, it would be helpful to be able to access the draft model directly and request inference from it, bypassing the larger model entirely (for cases where speed matters more than quality).
If both models were exposed, an incoming request could specify which model to use, and vLLM could route it to the correct one, as sketched below.
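A rough sketch of how this could look from the client side against vLLM's OpenAI-compatible server. The served model names are hypothetical, and routing on the request's `model` field to the draft model is exactly the feature being proposed here, not something vLLM does today:

```python
# Illustrative sketch only: the draft-model routing shown here does not exist yet.
# Assumes a vLLM server started with speculative decoding, where both the target
# and the (hypothetical) draft model are exposed under their own served names.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def complete(prompt: str, fast: bool) -> str:
    # The request's "model" field would select which already-loaded model serves it:
    # the large target model (quality) or the small draft model (speed).
    model = "draft-model" if fast else "target-model"  # hypothetical served names
    resp = client.completions.create(model=model, prompt=prompt, max_tokens=64)
    return resp.choices[0].text

# Low-latency path: answer comes straight from the draft model.
print(complete("Summarize speculative decoding in one sentence.", fast=True))
```

Since both models are resident in VRAM anyway, this would add a fast, lower-quality serving path with no extra memory cost.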
Alternatives
No response
Additional context
No response