Add vLLM integration for the generation API #163
Comments
Agree, it would be nice to see an integration of vLLM into Outlines.
Is there any progress on this? I'm using vLLM with Outlines - I'd be willing to help out / try to put out a PR if there's interest in a community contribution here.
There's always an interest in community contributions :) However, this requires substantial changes in Outlines' codebase that have far-reaching design implications. We need a bit more time. After that design change the integration should be very straightforward.
Oh absolutely. Not trying to nag. OSS work is voluntary, unpaid and time-consuming. Just showing interest, and willing to help out if it'd accelerate things. Thanks for the update and the work you do!
Hi @rlouf, Outlines is really awesome! I'm from the vLLM team and I'm quite excited about Outlines' guided generation approach. We recently added support for logits processors in vLLM's `SamplingParams`, and the following already works:

```python
import outlines.models as models
from outlines.text.generate.regex import Regex
from vllm import LLM, SamplingParams
import torch

prompts = [
    "What is the IP Address of Google",
]

# We are not using this model for actual inference, but it seems to be required by the Regex class.
model = models.transformers("gpt2-medium", device="cuda")
regex_processor = Regex(
    model,
    regex_string=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    max_tokens=16,
)


def create_proposal(token_ids, logits):
    token_ids = torch.Tensor([token_ids]).long().to("cuda")
    return regex_processor.create_proposal(token_ids, logits)


sampling_params = SamplingParams(
    logits_processors=[create_proposal],
    max_tokens=16,
)

# Create an LLM in vLLM.
llm = LLM(model="gpt2-medium")

for _ in range(10):
    outputs = llm.generate(prompts=prompts, sampling_params=sampling_params, use_tqdm=False)
    regex_processor.last_fsm_states.clear()  # We have to do this because we are sharing a FSM.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

However, there are a few issues:
Thank you, so is vLLM!

Compilation might be an issue when calling the model as a one-off. When deployed in a service this hardly matters since it only needs to happen once. We'll update the examples to show that compilation and inference can be separated. In the following, compilation only happens when the generator is defined:

```python
import outlines

model = outlines.models.transformers("gpt2")
generator = outlines.text.generate.regex(model, r"[a-zA-Z]")  # compilation happens here
result = generator("prompt")  # inference happens here
```
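For illustration only, the same compiled generator can then be reused across many prompts without recompiling the regex index (the prompts below are placeholders):

```python
# The regex index was already compiled above; these calls only run inference.
prompts = ["first prompt", "second prompt", "third prompt"]
results = [generator(prompt) for prompt in prompts]
```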
I agree, this is partly what motivated #366. Everything is coupled and it's not ideal.
Yes, the refactor in #366 addresses this and removes the statefulness. Here's the design that is being implemented:
It should then be much easier to integrate Outlines into vLLM by only having to pass the FSM part to the logits processor.
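As a rough illustration of that direction, here is a minimal sketch of an FSM-driven logits processor. It assumes vLLM's logits processor signature (generated token ids plus a 1-D logits tensor) and a hypothetical FSM interface with `initial_state`, `next_state`, and `allowed_token_ids`; the exact interface exposed after #366 may differ:

```python
import torch


class FSMLogitsProcessor:
    """Sketch: mask logits so only tokens accepted by the FSM can be sampled."""

    def __init__(self, fsm):
        # `fsm` is a hypothetical object exposing initial_state, next_state
        # and allowed_token_ids; it stands in for whatever #366 ends up providing.
        self.fsm = fsm
        # Keep one FSM state per sequence, keyed by the generated token ids,
        # instead of sharing mutable state across requests.
        self.states = {(): fsm.initial_state}

    def __call__(self, token_ids, logits):
        key = tuple(token_ids)
        if key not in self.states:
            # Advance the FSM with the last generated token.
            prev_state = self.states[tuple(token_ids[:-1])]
            self.states[key] = self.fsm.next_state(prev_state, token_ids[-1])
        state = self.states[key]

        # Forbid every token the FSM does not accept in the current state.
        allowed = self.fsm.allowed_token_ids(state)
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed] = 0
        return logits + mask
```

Because the per-sequence state lives inside the processor rather than in a shared `last_fsm_states` attribute, nothing needs to be cleared between generation calls.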
vLLM allows for substantially faster inference via smart management of the KV cache. The library offers seamless integration with some of the most popular HuggingFace models. We suggested in #150 to re-use some of the ideas in this paper/library for Outlines.
In a first iteration we can dispatch the sequence generation functions introduced in #139 to use vLLM's user-facing APIs, although the use would be limited to the simplest generation methods. Longer term, we should look into vLLM's internals and see if we can make Outlines compatible.
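For that first iteration, dispatching could look roughly like wrapping vLLM's user-facing API behind a simple prompt-to-text callable. The `VLLMCompletion` wrapper below is hypothetical and only illustrates the idea:

```python
from vllm import LLM, SamplingParams


class VLLMCompletion:
    """Hypothetical wrapper exposing vLLM behind a prompt -> text callable."""

    def __init__(self, model_name: str):
        self.llm = LLM(model=model_name)

    def __call__(self, prompt: str, max_tokens: int = 16) -> str:
        params = SamplingParams(max_tokens=max_tokens)
        outputs = self.llm.generate([prompt], params, use_tqdm=False)
        return outputs[0].outputs[0].text


# Usage: plain text completion routed through vLLM instead of transformers.
completion = VLLMCompletion("gpt2-medium")
print(completion("What is the IP Address of Google"))
```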