Add vLLM integration for the generation API #163
Comments
Agree, it would be nice to see an integration of vLLM into Outlines.
Is there any progress on this? I'm using vLLM with Outlines - I'd be willing to help out / try to put out a PR if there's interest in a community contribution here.
There's always an interest in community contributions :) However, this requires substantial changes in Outlines' codebase that have far-reaching design implications. We need a bit more time. After that design change the integration should be very straightforward.
Oh absolutely. Not trying to nag. OSS work is voluntary, unpaid and time-consuming. Just showing interest, and willing to help out if it'd accelerate things. Thanks for the update and the work you do!
Hi @rlouf, Outlines is really awesome! I'm from the vLLM team and I'm quite excited about Outlines' guided generation approach. We recently added support for logits processors in vLLM's `SamplingParams`, and the following already works:

```python
import outlines.models as models
from outlines.text.generate.regex import Regex
from vllm import LLM, SamplingParams
import torch

prompts = [
    "What is the IP Address of Google",
]

# We are not using this model for actual inference, but it seems to be required by the Regex class.
model = models.transformers("gpt2-medium", device="cuda")
regex_processor = Regex(
    model,
    regex_string=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    max_tokens=16,
)


def create_proposal(token_ids, logits):
    token_ids = torch.Tensor([token_ids]).long().to("cuda")
    return regex_processor.create_proposal(token_ids, logits)


sampling_params = SamplingParams(
    logits_processors=[create_proposal],
    max_tokens=16,
)

# Create an LLM in vLLM.
llm = LLM(model="gpt2-medium")

for _ in range(10):
    outputs = llm.generate(prompts=prompts, sampling_params=sampling_params, use_tqdm=False)
    regex_processor.last_fsm_states.clear()  # We have to do this because we are sharing a FSM.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

However, there are a few issues:
Thank you, so is vLLM!

Compilation might be an issue when calling the model as a one-off. When deployed in a service this hardly matters since it only needs to happen once. We'll update the examples to show that compilation and inference can be separated. In the following, compilation only happens when the generator is defined:

```python
import outlines

model = outlines.models.transformers("gpt2")
generator = outlines.text.generate.regex(model, r"[a-zA-Z]")  # compilation happens here
result = generator("prompt")  # inference happens here
```
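For illustration only, the same compiled generator can then be reused across many prompts without recompiling the regex index (the prompts below are placeholders):

```python
# The regex index was already compiled above; these calls only run inference.
prompts = ["first prompt", "second prompt", "third prompt"]
results = [generator(prompt) for prompt in prompts]
```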
I agree, this is partly what motivated #366. Everything is coupled and it's not ideal.
Yes, the refactor in #366 addresses this and removes the statefulness. Here's the design that is being implemented:
It should then be much easier to integrate Outlines into vLLM by only having to pass the FSM part to the logits processor.
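As a rough illustration of that direction, here is a minimal sketch of an FSM-driven logits processor. It assumes vLLM's logits processor signature (generated token ids plus a 1-D logits tensor) and a hypothetical FSM interface with `initial_state`, `next_state`, and `allowed_token_ids`; the exact interface exposed after #366 may differ:

```python
import torch


class FSMLogitsProcessor:
    """Sketch: mask logits so only tokens accepted by the FSM can be sampled."""

    def __init__(self, fsm):
        # `fsm` is a hypothetical object exposing initial_state, next_state
        # and allowed_token_ids; it stands in for whatever #366 ends up providing.
        self.fsm = fsm
        # Keep one FSM state per sequence, keyed by the generated token ids,
        # instead of sharing mutable state across requests.
        self.states = {(): fsm.initial_state}

    def __call__(self, token_ids, logits):
        key = tuple(token_ids)
        if key not in self.states:
            # Advance the FSM with the last generated token.
            prev_state = self.states[tuple(token_ids[:-1])]
            self.states[key] = self.fsm.next_state(prev_state, token_ids[-1])
        state = self.states[key]

        # Forbid every token the FSM does not accept in the current state.
        allowed = self.fsm.allowed_token_ids(state)
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed] = 0
        return logits + mask
```

Because the per-sequence state lives inside the processor rather than in a shared `last_fsm_states` attribute, nothing needs to be cleared between generation calls.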
vLLM allows for substantially faster inference via smart management of the KV cache. The library offers seamless integration with some of the most popular HuggingFace models. We suggested in #150 to re-use some of the ideas in this paper/library for Outlines.
In a first iteration we can dispatch the sequence generation functions introduced in #139 to use vLLM's user-facing APIs, although the use would be limited to the simplest generation methods. Longer term, we should look into vLLM's internals and see if we can make Outlines compatible.
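For that first iteration, dispatching could look roughly like wrapping vLLM's user-facing API behind a simple prompt-to-text callable. The `VLLMCompletion` wrapper below is hypothetical and only illustrates the idea:

```python
from vllm import LLM, SamplingParams


class VLLMCompletion:
    """Hypothetical wrapper exposing vLLM behind a prompt -> text callable."""

    def __init__(self, model_name: str):
        self.llm = LLM(model=model_name)

    def __call__(self, prompt: str, max_tokens: int = 16) -> str:
        params = SamplingParams(max_tokens=max_tokens)
        outputs = self.llm.generate([prompt], params, use_tqdm=False)
        return outputs[0].outputs[0].text


# Usage: plain text completion routed through vLLM instead of transformers.
completion = VLLMCompletion("gpt2-medium")
print(completion("What is the IP Address of Google"))
```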