Pipeline: simple API for assisted generation #34504
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@@ -416,16 +416,6 @@ Assisted decoding assumes the main and assistant models have the same tokenizer,
Currently, only greedy search and sampling are supported with assisted decoding, and assisted decoding doesn't support batched inputs.
To learn more about assisted decoding, check [this blog post](https://huggingface.co/blog/assisted-generation).

#### Universal Assisted Decoding
The current version of the docs had the normal assisted generation docs under Universal Assisted Decoding [a modification of the original technique to support different tokenizers]. Most of the diff is to isolate Universal Assisted Decoding into its own section.
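For readers of this thread, a minimal sketch of the technique these docs describe. The checkpoints are illustrative, and the Universal Assisted Decoding call is an assumption about the `assistant_tokenizer` argument in recent `transformers` versions, so double-check it against the current docs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Plain assisted decoding: the main and assistant models share the same tokenizer.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
assistant = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

inputs = tokenizer("Assisted decoding lets a small model draft tokens that", return_tensors="pt")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Universal Assisted Decoding (assistant with a different tokenizer) additionally
# takes both tokenizers -- assumed API, shown here only as a comment:
# outputs = model.generate(
#     **inputs,
#     assistant_model=other_family_assistant,
#     tokenizer=tokenizer,
#     assistant_tokenizer=other_family_tokenizer,
# )
```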
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
```
<Tip>
new docs: an example of how to use assisted generation with pipelines
@@ -347,7 +347,6 @@ def generate(
    eos_token_id = generation_config.eos_token_id
    if isinstance(eos_token_id, list):
        eos_token_id = eos_token_id[0]
    logger.warning(f"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation.")
This is a legacy warning (before I joined 👀) that:
a) is noisy
b) doesn't affect generation, beyond the attempt to infer the attention mask when it is not passed
In the specific case of assisted generation, it was emitted once per assistant model call, so many times 😅 More harmful than useful.
Strong agree on removing this one!
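For reference, a minimal sketch (with an illustrative checkpoint) of the behaviour the warning referred to: when `pad_token_id` is passed explicitly, `generate` never needs to fall back to `eos_token_id` in the first place.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# The tokenizer output already contains an attention mask, and passing
# `pad_token_id` explicitly avoids the eos-as-pad fallback the warning announced.
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=10)
```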
if len(stop_sequence_ids) > 1:
    warnings.warn(
        "Stopping on a multiple token sequence is not yet supported on transformers. The first token of"
        " the stop sequence will be used as the stop sequence string in the interim."
    )
generate_kwargs["eos_token_id"] = stop_sequence_ids[0]
(no longer true)
Do we support this for sequences of IDs as well as stop strings?
Yes, `generate` supports using multiple token IDs as stopping criteria! In this particular case, the warning predates the introduction of multiple token IDs as stopping criteria in `generate` :) So the warning is no longer true (and hasn't been for a while)
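A minimal sketch of both options (illustrative checkpoint; the `stop_strings` argument assumes a reasonably recent `transformers` release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
inputs = tokenizer("Count with me: 1, 2, 3", return_tensors="pt")

# Multi-token stop sequences: `stop_strings` also needs the tokenizer passed to `generate`.
out = model.generate(**inputs, stop_strings=["10"], tokenizer=tokenizer, max_new_tokens=40)

# Several token IDs can also act as end-of-sequence markers
# (198 is a newline in the GPT-2 vocabulary).
out = model.generate(**inputs, eos_token_id=[tokenizer.eos_token_id, 198], max_new_tokens=40)
```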
Thanks for the nice PR @gante! @Rocketknight1, can you give it a first look as the pipeline owner? Thanks!
This seems very clean overall! The changes to pipeline code are relatively small, because all the actual action happens in `generate()`, so this PR really just takes care of forwarding assistant models/tokenizers to that method.
@require_torch
def test_pipeline_assisted_generation(self):
    """Tests that we can run assisted generation in the pipeline"""
    model = "distilbert/distilgpt2"
Distilgpt2 is still relatively large for a non-slow test when we're just checking for errors, rather than comparing outputs! Maybe there's a tiny-random model we can use?
done 👍 (and confirmed that it works)
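A sketch of what the updated test could look like; the tiny checkpoint name (`hf-internal-testing/tiny-random-gpt2`) and the surrounding test-class plumbing are assumptions, not the PR's final code:

```python
import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch


class AssistedGenerationPipelineTest(unittest.TestCase):
    @require_torch
    def test_pipeline_assisted_generation(self):
        """Tests that we can run assisted generation in the pipeline"""
        # A tiny random checkpoint keeps the non-slow test fast; we only check
        # that assisted generation runs, not what it generates.
        model = "hf-internal-testing/tiny-random-gpt2"
        pipe = pipeline("text-generation", model=model, assistant_model=model)
        output = pipe("Hello", max_new_tokens=5)
        self.assertIsInstance(output[0]["generated_text"], str)
```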
_, loaded_assistant_model = infer_framework_load_model(assistant_model, config=assistant_config)
loaded_assistant_model = loaded_assistant_model.to(device=model.device, dtype=model.dtype)
`infer_framework_load_model` is framework-agnostic code, but `.to()` is PyTorch-only. Maybe we should add a warning if a user tries to use this with TF, since it's not supported at all?
Added the following after checking that `assistant_model` is not `None`, and before the commented lines. In a nutshell, the cases where `.to` is not supported are caught in advance:
if not isinstance(model, PreTrainedModel):
    raise ValueError(
        "Assisted generation, triggered by the `assistant_model` argument, is only available for "
        "`PreTrainedModel` model instances. For instance, TF or JAX models are not supported."
    )
Yes, looks good!
@gante will this PR make it to the next release? 🙏
Hi @LysandreJik 👋
hi @gante, could you please merge when free? 🙏
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
@Rocketknight1 PR comments addressed 🫡 Let me know if you are happy with the PR! Assuming CI is green: do I need to ping more folks for review?
@gante yes, you can go ahead and merge this whenever you're happy!
What does this PR do?
Assisted generation + pipelines currently has a bad UX: the user must manually load the assistant model (and the assistant tokenizer, if applicable), which defeats the purpose of pipelines being simple to use.
This PR adds the ability to specify an assistant checkpoint at pipeline definition time -- the pipeline will take care of the rest 🤗
Although the new argument is forwarded to `.generate()` by the pipelines that call it, I haven't added a test on all of them. Many pipelines don't forward kwargs properly to `.generate()`, which makes testing this transparent [same output, similar runtime] feature hard -- the best way to confirm assisted generation is running is by passing incompatible flags to `.generate()` to make it crash.

Example usage
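A sketch of the new API described above, with illustrative checkpoints that share a tokenizer (a requirement for plain assisted generation):

```python
from transformers import pipeline

# The assistant checkpoint is given at pipeline definition time; the pipeline
# loads it and forwards it to `generate()` on every call.
pipe = pipeline(
    "text-generation",
    model="openai-community/gpt2-large",
    assistant_model="distilbert/distilgpt2",
)
result = pipe("Assisted generation with pipelines is now", max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```

The same `assistant_model` argument should apply to the other generate-based pipelines touched by this PR.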