Generate: remove deprecated code due to `Cache` and `cache_position` being default #31898

gante · 2024-07-10T18:48:34Z

What does this PR do?

- Simplifies `prepare_inputs_for_generation` on models using `Cache`
- Removes the unused `_reorder_cache` function on models using `Cache`

Slow tests run (and passing / same failures as in main):

- (Cache integration tests) `RUN_SLOW=1 py.test -vv tests/utils/test_cache_utils.py`
- (generate integration tests) `RUN_SLOW=1 py.test -vv tests/generation/test_utils.py`
- (reference LLM model, llama) `RUN_SLOW=1 py.test -vv tests/models/llama/test_modeling_llama.py`
- (reference MoE model, mixtral) `RUN_SLOW=1 py.test -vv tests/models/mixtral/test_modeling_mixtral.py`
- Slow tests for ALL other models in the diff

gante · 2024-07-10T18:49:24Z

src/transformers/generation/utils.py

@@ -689,13 +689,16 @@ def _update_model_kwargs_for_generation(
                    dim=-1,
                )

-        if (


TL;DR: `cache_position` now always exists, regardless of `use_cache`.
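
To illustrate the idea (this is not the code from the diff): a minimal sketch, assuming `cache_position` holds the positions of the tokens fed to the next forward pass and that one new token is generated per step. The helper name `update_cache_position` is hypothetical, and plain lists stand in for tensors.

```python
def update_cache_position(cache_position, num_new_tokens=1, use_cache=True):
    """Return the positions for the next forward pass (hypothetical helper)."""
    last = cache_position[-1]
    new_positions = list(range(last + 1, last + 1 + num_new_tokens))
    if use_cache:
        # Cached keys/values cover the past, so only the new positions are needed.
        return new_positions
    # No cache: the full sequence is re-processed, so positions accumulate.
    return list(cache_position) + new_positions


prompt_positions = [0, 1, 2, 3]  # a 4-token prompt
with_cache = update_cache_position(prompt_positions, use_cache=True)
without_cache = update_cache_position(prompt_positions, use_cache=False)
```

In both branches the positions exist after the update, which is the property the TL;DR describes; only their length differs.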

gante · 2024-07-10T18:52:14Z

src/transformers/models/llama/modeling_llama.py

-            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
-            if (
-                max_cache_length is not None
-                and attention_mask is not None
-                and cache_length + input_ids.shape[1] > max_cache_length
-            ):
-                attention_mask = attention_mask[:, -max_cache_length:]


_update_causal_mask handles the corner case this was originally meant to cover

gante · 2024-07-10T18:52:59Z

src/transformers/models/llama/modeling_llama.py

-    @staticmethod
-    def _reorder_cache(past_key_values, beam_idx):
-        reordered_past = ()
-        for layer_past in past_key_values:
-            reordered_past += (
-                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
-            )
-        return reordered_past


This was used with legacy caches only
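
For readers unfamiliar with the removed helper, here is a rough sketch of what reordering a legacy tuple cache does during beam search, next to a toy object that owns its own reordering. It assumes that cache objects carry a reorder method of their own; `ToyCache` and `reorder_legacy_cache` are hypothetical names for illustration, and nested lists stand in for tensors.

```python
def reorder_legacy_cache(past_key_values, beam_idx):
    """Reorder a legacy tuple cache along the batch (beam) dimension.

    Each state is a list indexed by beam, standing in for a tensor.
    """
    return tuple(
        tuple([state[i] for i in beam_idx] for state in layer_past)
        for layer_past in past_key_values
    )


class ToyCache:
    """Hypothetical stand-in for a cache object that reorders itself."""

    def __init__(self, layers):
        self.layers = layers  # same (key, value) structure as the legacy tuple

    def reorder_cache(self, beam_idx):
        self.layers = reorder_legacy_cache(self.layers, beam_idx)


# One layer, three beams; keys and values are labelled per beam.
legacy = ((["k0", "k1", "k2"], ["v0", "v1", "v2"]),)
beam_idx = [2, 0, 0]  # beam 2 is kept once, beam 0 is duplicated

reordered = reorder_legacy_cache(legacy, beam_idx)
cache = ToyCache(legacy)
cache.reorder_cache(beam_idx)
```

Once the reordering lives on the cache object, a per-model static method like the one deleted above has nothing left to do.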

gante · 2024-07-10T18:55:49Z

src/transformers/models/llama/modeling_llama.py

+            if inputs_embeds is not None:  # Exception 1
+                input_ids = input_ids[:, -cache_position.shape[0] :]
+            elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
+                input_ids = input_ids[:, cache_position]


This line holds the logic from end-to-end `generate` compilation. The other two lines are exceptions that ensure we don't lose backward compatibility (BC). The comment at the top should be MUCH clearer now.
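
The three cases in the hunk above can be sketched as follows. This is an illustrative re-expression with nested lists instead of tensors, not the model code; the function name `slice_input_ids` is hypothetical.

```python
def slice_input_ids(input_ids, cache_position, inputs_embeds=None):
    """Keep only the tokens that still need processing (illustrative).

    input_ids is a batch of token-id lists; cache_position lists the
    positions to process in this step.
    """
    n = len(cache_position)
    if inputs_embeds is not None:  # Exception 1: keep the last n tokens
        return [row[-n:] for row in input_ids]
    if len(input_ids[0]) != n:  # Default case: index by position
        return [[row[p] for p in cache_position] for row in input_ids]
    return input_ids  # Exception 2: lengths already match, no-op


full = [[11, 12, 13, 14, 15]]  # batch of one sequence, five tokens

prefill = slice_input_ids(full, cache_position=[0, 1, 2, 3, 4])
decode_step = slice_input_ids(full, cache_position=[4])
with_embeds = slice_input_ids(full, cache_position=[4], inputs_embeds=object())
```

In the default case, indexing by `cache_position` selects exactly the unprocessed tokens, so a decode step with one new position keeps only the last token.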

gante · 2024-07-10T18:57:10Z

src/transformers/models/llama/modeling_llama.py

-        cache_position=None,
-        use_cache=True,
-        **kwargs,
+        self, input_ids, cache_position, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs


Note: `cache_position` is now a mandatory input

ArthurZucker

SOOOOO much cleaner

HuggingFaceDocBuilderDev · 2024-07-10T19:12:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante · 2024-07-11T19:38:51Z

@ArthurZucker ready for a final check :) (tons of slow tests ran on my end, should be safe to merge)

ArthurZucker

LGTM then! I trust our tests for this; it would be nice to see the results of the full suite! cc @ydshieh

ArthurZucker · 2024-07-12T05:13:23Z

src/transformers/models/gemma2/diff_gemma2.py

ydshieh · 2024-07-12T08:59:38Z

> LGTM then! I trust our tests for this; it would be nice to see the results of the full suite! cc @ydshieh

Do you want me to trigger a full (GitHub Action) CI for this PR during this weekend (before merge)?

ArthurZucker · 2024-07-12T09:04:28Z

yeah would be nice !

ydshieh · 2024-07-12T09:05:21Z

OK will do (once today's CI run is over)

gante · 2024-07-12T09:52:02Z

@ydshieh please ping me when the run is over 🤗

btw, there are MANY broken tests in the list of models changed in this PR (on main), mostly SDPA, FA2, and integration tests :o I should work on it 🤔

ydshieh · 2024-07-12T11:58:51Z

@gante

The run (triggered for this PR) is likely to be over tomorrow morning (if I trigger it this evening). I will let you know in any case.

ydshieh · 2024-07-12T15:11:23Z

CI is running here

gante · 2024-07-14T14:16:50Z

slow CI looks good (same issues as in main), merging 🤗

gante requested a review from ArthurZucker July 10, 2024 18:58

ArthurZucker reviewed Jul 10, 2024

gante changed the title ~~Generate: remove deprecated code due to DynamicCache and cache_position being default~~ Generate: remove deprecated code due to Cache and cache_position being default Jul 11, 2024

gante mentioned this pull request Jul 11, 2024

Generate: end-to-end compilation #30788

Merged

3 tasks

gante marked this pull request as ready for review July 11, 2024 19:38

gante requested a review from ArthurZucker July 11, 2024 19:38

ArthurZucker approved these changes Jul 12, 2024

gante and others added 9 commits July 12, 2024 12:10

tmp commit

7d638f7

shorter

32e7aa9

nit

ab64f9f

explicit kwargs

5c0f6af

propagate changes

e8e085b

mass propagation with a few manual touches (let's see how CI behaves)

50c8260

fix cacheless case

d0ff984

Update src/transformers/generation/utils.py

8223ec4

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

make fixup

02f7417

gante force-pushed the short_prep_inputs branch from 8cd4e6c to 02f7417 Compare July 12, 2024 12:10

gante mentioned this pull request Jul 12, 2024

RecurrentGemma: add generative tests #31498

Closed

gante merged commit 739a631 into huggingface:main Jul 14, 2024
23 checks passed

gante deleted the short_prep_inputs branch July 14, 2024 14:17

vasqu mentioned this pull request Jul 15, 2024

Refactor: Causal Mask Update and Prepare for Generate Vasqu-Adibvafa/Mamba2#3

Merged

snorfyang mentioned this pull request Jul 26, 2024

LLaVA cannot use beam search after 4.43.0 #32234

Closed

4 tasks

zucchini-nlp mentioned this pull request Jul 26, 2024

SinkCache with Qwen1.5 broken in 4.43.0+ #32233

Closed

4 tasks

jiminha mentioned this pull request Aug 8, 2024

Add _reorder_cache back to Llama for HPU huggingface/optimum-habana#1233

Merged
