
torch.compile compatibility with generate + static cache #29114

Merged: 16 commits, Feb 21, 2024

Conversation

fxmarty (Contributor) commented Feb 19, 2024

This PR adds support for torch.compile when using generate with a static cache.
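Roughly, this is the usage pattern the PR targets (a condensed sketch based on the benchmark script further down; the checkpoint and generation settings are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf", torch_dtype=torch.float16)

# Compile the forward pass once; generate then drives it with a static KV cache.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("I would", return_tensors="pt").to(model.device)
gen_out = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))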

@fxmarty fxmarty marked this pull request as draft February 19, 2024 17:02
Comment on lines 1261 to 1266
# The `contiguous()` here is necessary to have a static stride during (non-speculative) decoding. torchdynamo otherwise
# recompiles graphs as the stride of the inputs is a guard.
# TODO: We don't really need to handle the input_ids here, and this contiguous() call could be removed if we were
# simply using GenerationMixin.greedy_search `next_tokens` variable directly (which is already contiguous), instead of
# doing a torch.cat + then slice.
model_inputs = {"input_ids": input_ids.contiguous()}
fxmarty (Contributor, Author):

As per the comment. If input_ids is not contiguous, its stride is different at each forward call in generate, which triggers a recompilation at every step in the loop. This is very slow.

Ideally we would just use next_tokens instead of this contiguous() call, but let's do a proper refactor in another PR.
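A minimal standalone illustration of the stride behavior (not the actual generate code):

import torch

full = torch.arange(20).reshape(2, 10)    # stands in for the growing sequence of token ids
last = full[:, -1:]                       # a view: shape (2, 1), but its stride tracks the parent's width
print(last.shape, last.stride())          # torch.Size([2, 1]) (10, 1)
print(last.contiguous().stride())         # (1, 1): the same at every step, so the stride guard keeps holding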

Collaborator:

Yep, next_tokens would be less "costly", but I think we need all of them for speculative decoding? Anyway, I noticed that as well when I was compiling and just let it be at the time!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker (Collaborator) left a comment:

The int vs tensor "bug" is expected; that is precisely why we use cache_positions for generation.
I think generate needs to handle that, because I'm not sure compile will like the in-place tensor modification.

fxmarty (Contributor, Author) commented Feb 20, 2024

@ArthurZucker @gante @LysandreJik This PR fixes many issues with the current torch.compile + static cache + generate implementation, as follows.

1. Always keep the same stride for inputs in the decode phase

generate apparently does not directly use the next_tokens variable as the next input_ids. Instead, next_tokens is concatenated with the previous tokens and then sliced, which results in input tensors that have different strides despite having the same shape:

------------- loop forward in generate 0
model_inputs input ids shape torch.Size([2, 7])
model_inputs input ids stride (7, 1)
model_inputs position_ids shape torch.Size([2, 7])
model_inputs position_ids stride (7, 1)
------------- loop forward in generate 1
model_inputs input ids shape torch.Size([2, 1])
model_inputs input ids stride (8, 1)
model_inputs position_ids shape torch.Size([2, 1])
model_inputs position_ids stride (8, 1)
------------- loop forward in generate 2
model_inputs input ids shape torch.Size([2, 1])
model_inputs input ids stride (9, 1)
model_inputs position_ids shape torch.Size([2, 1])
model_inputs position_ids stride (9, 1)
------------- loop forward in generate 3
model_inputs input ids shape torch.Size([2, 1])
model_inputs input ids stride (10, 1)
model_inputs position_ids shape torch.Size([2, 1])
model_inputs position_ids stride (10, 1)
------------- loop forward in generate 4
etc.

This is bad because torch.compile places guards on the strides of the inputs, so recompilation is triggered in the decode phase even though it is really not necessary.

V0220 17:27:08.283680 140705660285312 torch/_dynamo/guards.py:1381] Recompiling function forward in /home/felix/transformers/src/transformers/models/llama/modeling_llama.py:1127
V0220 17:27:08.283680 140705660285312 torch/_dynamo/guards.py:1381]     triggered by the following guard failure(s):
V0220 17:27:08.283680 140705660285312 torch/_dynamo/guards.py:1381]     - tensor 'L['input_ids']' stride mismatch at index 0. expected 7, actual 8

&

V0220 17:27:12.636874 140705660285312 torch/_dynamo/guards.py:1381] Recompiling function torch_dynamo_resume_in_forward_at_989 in /home/felix/transformers/src/transformers/models/llama/modeling_llama.py:989
V0220 17:27:12.636874 140705660285312 torch/_dynamo/guards.py:1381]     triggered by the following guard failure(s):
V0220 17:27:12.636874 140705660285312 torch/_dynamo/guards.py:1381]     - tensor 'L['position_ids']' stride mismatch at index 0. expected 7, actual 8
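The stride drift itself can be reproduced outside of generate with a small sketch:

import torch

input_ids = torch.ones((2, 7), dtype=torch.long)              # prompt of length 7
for step in range(3):
    next_tokens = torch.ones((2, 1), dtype=torch.long)
    input_ids = torch.cat([input_ids, next_tokens], dim=-1)   # the kept sequence grows by one token
    model_input = input_ids[:, -1:]                           # same shape every step...
    print(model_input.shape, model_input.stride())            # ...but strides (8, 1), (9, 1), (10, 1): a fresh guard failure each time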

2. Do not compile _update_causal_mask

_update_causal_mask uses the input attention_mask length in its code. I believe this results in an FX placeholder being a SymInt,

V0220 11:30:30.341809 140023118176640 torch/_dynamo/output_graph.py:1084] [2/1]  ===== __compiled_fn_12 =====
V0220 11:30:30.341809 140023118176640 torch/_dynamo/output_graph.py:1084] [2/1]  /home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0220 11:30:30.341809 140023118176640 torch/_dynamo/output_graph.py:1084] [2/1]     def forward(self, s0 : torch.SymInt, L_attention_mask_ : torch.Tensor):
V0220 11:30:30.341809 140023118176640 torch/_dynamo/output_graph.py:1084] [2/1]         l_attention_mask_ = L_attention_mask_

which retriggers CUDAGraph capture for every decode step:

I0220 11:30:57.881897 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 13
I0220 11:30:57.902060 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 14
I0220 11:30:57.922157 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 15
I0220 11:30:57.942382 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 16
I0220 11:30:57.962995 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 17
I0220 11:30:57.983414 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 18
I0220 11:30:58.004108 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 19
I0220 11:30:58.024602 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 20
I0220 11:30:58.045034 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 21
I0220 11:30:58.065743 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 22
I0220 11:30:58.086160 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 23
I0220 11:30:58.106957 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 24
I0220 11:30:58.127666 140023118176640 torch/_inductor/cudagraph_trees.py:375] recording cudagraph tree for symint key 25

This is very slow. We avoid capturing _update_causal_mask with @torch.compiler.disable, which fixes the issue (no more CUDA graph capture after the very first decode step).
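A sketch of the decorator approach on a toy module (illustrative only, not the actual Llama modeling code):

import torch

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    @torch.compiler.disable  # keep the mask construction out of the captured graph
    def _update_causal_mask(self, attention_mask):
        # Depends on the attention_mask length, which would otherwise become a SymInt placeholder.
        length = attention_mask.shape[-1]
        mask = torch.full((length, length), float("-inf"), device=attention_mask.device)
        return torch.triu(mask, diagonal=1)

    def forward(self, hidden_states, attention_mask):
        mask = self._update_causal_mask(attention_mask)  # computed eagerly, outside the compiled graph
        return self.linear(hidden_states), mask

model = ToyModel()
model.forward = torch.compile(model.forward, mode="reduce-overhead")

Note that the decorator introduces a graph break, which is why it conflicts with fullgraph=True (the torch 2.2 error reported below) and why it was ultimately removed before merging.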

3. Avoid using a stateful int seen_tokens (PyTorch bug)

On main, StaticCache's seen_tokens is bugged, but only after torch.compile has been called. Convince yourself with:

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
from transformers.cache_utils import StaticCache

tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf", padding_side="left", pad_token="<s>"
)

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
        attn_implementation="sdpa",
    )

inputs = tokenizer(
    ["I would", "Today I am in Paris and"], padding=True, return_tensors="pt"
).to(model.device)

new_tokens = 10
gen_config = GenerationConfig(
    max_new_tokens=new_tokens,
    min_new_tokens=new_tokens,
    use_cache=True,
    pad_token_id=tokenizer.pad_token_id,
    num_beams=1,
    do_sample=False,
    eos_token_id=None,  # This is required for min_new_tokens to actually have an effect.
)
model.generation_config.eos_token_id = None  # greedy_search falls back on this eos_token_id, which we also need to set to None for min_new_tokens to have an effect.

gen_out = model.generate(**inputs, generation_config=gen_config)

decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)

print("decoded", decoded)

print("compiling...")

model.forward = torch.compile(model.forward, mode="reduce-overhead")
print("Finished compile call")

# warmup
gen_out = model.generate(**inputs, generation_config=gen_config, cache_implementation="static")

print("\n\n\n\n\n\n----- second call")
gen_out = model.generate(**inputs, generation_config=gen_config, cache_implementation="static")

decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)

print("decoded static", decoded)

which yields

[screenshot: decoded output showing the incorrect generations with torch.compile + static cache]

Using a torch.Tensor updated in-place instead of an int fixes the bug; however, we then hit what I believe to be a torch.compile bug when subclasses are added after the torch.compile call (in _setup_cache): even with the tensor fix, seen_tokens is still not properly updated. Making sure _setup_cache is called BEFORE torch.compile makes this issue disappear, but that would require an API change, so I am disregarding this approach.

Instead, remove seen_tokens altogether from StaticCache.
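For reference, a sketch of the replacement bookkeeping: the position lives in a cache_position tensor handed to the model, rather than in a Python int stored on the cache (simplified, assumed semantics):

import torch

prompt_length = 7
cache_position = torch.arange(prompt_length)    # prefill writes slots 0..6 of the static cache
for _ in range(3):                              # decode steps
    cache_position = cache_position[-1:] + 1    # plain tensor arithmetic, traceable by torch.compile: [7], [8], [9]
    # model(..., cache_position=cache_position) then tells StaticCache.update() which slot to write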

Results

On main (ee3af60):

-------------- STATIC CACHE
compiling...
torch.compile call: 703.207 ms
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/torch/_inductor/compile_fx.py:148: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(

- 0-th `generate` call latency per token (new_tokens=20): 10591.283 ms

- 1-th `generate` call latency per token (new_tokens=20): 3284.225 ms

- 2-th `generate` call latency per token (new_tokens=20): 132.769 ms

- 3-th `generate` call latency per token (new_tokens=20): 11.211 ms

- 4-th `generate` call latency per token (new_tokens=20): 11.160 ms
decoded static ['I would like to know how to get a copy of my medical records from my primary care physician.\n', 'Today I am in Paris and I am feeling very grateful for this opportunity to explore this beautiful city. I have always wanted to visit']

On this branch (0c03b7d):

-------------- STATIC CACHE
compiling...
torch.compile call: 729.943 ms
/home/felix/miniconda3/envs/fx/lib/python3.9/site-packages/torch/_inductor/compile_fx.py:148: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(

- 0-th `generate` call latency per token (new_tokens=20): 4121.241 ms

- 1-th `generate` call latency per token (new_tokens=20): 239.070 ms

- 2-th `generate` call latency per token (new_tokens=20): 11.592 ms

- 3-th `generate` call latency per token (new_tokens=20): 11.602 ms

- 4-th `generate` call latency per token (new_tokens=20): 11.618 ms
decoded static ['I would like to know how to get a copy of my medical records from my primary care physician.\n', 'Today I am in Paris and I am feeling very grateful for this opportunity to explore this beautiful city. I have always wanted to visit']

@fxmarty fxmarty changed the title WIP: fix generate compatibility with torch.compile Make torch.compile compilation >2x faster when using static cache + generate Feb 20, 2024
@@ -1975,6 +1990,7 @@ def contrastive_search(
                model_kwargs,
                is_encoder_decoder=self.config.is_encoder_decoder,
                standardize_cache_format=True,
+               model_inputs=model_inputs,
fxmarty (Contributor, Author):

We need to update model_kwargs with model_inputs["cache_position"], hence the additional argument.
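Roughly what the extra argument enables in the model_kwargs update helper (a sketch, not the exact implementation):

# model_inputs carries the cache_position computed in prepare_inputs_for_generation;
# copying it into model_kwargs makes it available again at the next decoding step.
model_kwargs["cache_position"] = model_inputs.get("cache_position")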

ArthurZucker (Collaborator) left a comment:

Thanks for the PR!
Good to have faster compile, and to remove the seen_tokens API and rely on cache positions, which is IMO less brittle!
⚡ compile is good. I just have to check my benchmarks for FA2 and compiled static cache. Your snippet goes from 11.2 to 11.8 ms per token, acceptable IMO (as mentioned, this is probably the causal mask update not being compiled; it would be nice to compile it, or just compile the part that post-processes it after the slicing!)

        return k_out, v_out

    def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
        """Returns the sequence length of the cached states that were seen by the model. `layer_idx` kept for BC"""
        return self.seen_tokens
        # TODO: Fix once the stateful `int` bug in PyTorch is fixed.
Collaborator:

I am not sure it will ever be fixed!
If it is fixed, this will only be used in prepare_inputs_for_generation, but my plan going forward is to rely entirely on cache_positions, not on the state of the cache, to know which iteration we are at in the generate function! cc @gante

Collaborator:

Using model_inputs is indeed cleaner to me, as we only keep track of the cache_positions and increment them, so there is no need for seen_tokens!
I'll let @gante validate it all, but LGTM.

Comment on lines 977 to +980
        if use_cache:  # kept for BC (cache positions)
            if not isinstance(past_key_values, StaticCache):
                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
                past_seen_tokens = past_key_values.get_seq_length()
            past_seen_tokens = past_key_values.get_seq_length()
Collaborator:

If we assume use_cache is always used together with cache_positions (in generate), then we probably don't need that anymore, do we?
A bit breaking, but still.

fxmarty (Contributor, Author):

@ArthurZucker Overall we should not assume cache_positions is an input to the model, e.g. for ONNX this would break things

Collaborator:

Why not? If we decide this is to be the new format, is it not better for ONNX as well?

        cache_position = torch.arange(
            past_length, past_length + position_ids.shape[-1], device=position_ids.device
        )
        cache_position = torch.arange(past_length, past_length + position_ids.shape[-1], device=position_ids.device)
Collaborator:

If the cache positions are given as an input in the kwargs, and the length is 1, we can just increment it, no? (no arange that way)
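A sketch of what that suggestion could look like (hypothetical, not what this PR implements):

if cache_position is not None and cache_position.shape[-1] == 1:
    # single-token decode step: just advance the previous position instead of rebuilding it
    cache_position = cache_position + 1
else:
    cache_position = torch.arange(past_length, past_length + position_ids.shape[-1], device=position_ids.device)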

fxmarty (Contributor, Author):

@ArthurZucker I left it that way because I don't properly understand the relationship between position_ids and cache_position, especially for speculative decoding. Maybe it can be improved later.

Collaborator:

position_ids != cache_position when there is padding, basically.
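A small illustration of the distinction (assumed semantics) with a left-padded batch:

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1]])                 # two left-padding tokens
position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)      # tensor([[0, 0, 0, 1, 2]]): restarts after the pads
cache_position = torch.arange(attention_mask.shape[-1])          # tensor([0, 1, 2, 3, 4]): physical slots in the cache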


fxmarty and others added 2 commits February 21, 2024 09:34
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
ArthurZucker (Collaborator) commented Feb 21, 2024

torch._dynamo.exc.Unsupported: 'inline in skipfiles: LlamaModel._update_causal_mask | _fn /home/arthur/miniconda3/envs/py39/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py, skipped according skipfiles.SKIP_DIRS'

Failing on torch 2.2, let's wait a tad bit (that was with fullgraph=True!)

fxmarty (Contributor, Author) commented Feb 21, 2024

Leaving this open for now, as we would like to avoid @torch.compiler.disable and keep compatibility with fullgraph=True.

There is likely a bug in PyTorch where CUDA graphs are re-recorded when they should not be, so we can't simply remove @torch.compiler.disable.

@fxmarty fxmarty changed the title Make torch.compile compilation >2x faster when using static cache + generate torch.compile compatibility with generate + static cache Feb 21, 2024
gante (Member) left a comment:

LGTM

Looking forward, I think we can get rid of cache_position; there are many places where this information is present, and one of them has to be compatible with torch.compile 😅

gante (Member) commented Feb 21, 2024

@fxmarty btw, I am working on a PR that has some of the changes you added here (such as resetting the cache after generate); we might get merge conflicts :)

fxmarty (Contributor, Author) commented Feb 21, 2024

Thank you @gante, awesome! Yes, I think there needs to be an alignment at some point between all the different archs; it's getting a bit complex with all the different approaches.

At the end of the day, after discussing with @ArthurZucker, we are merging but not cherry-picking into the release. I removed the @torch.compiler.disable decorator for the reason above.

I believe there is a bug in PyTorch where CUDA graphs are somehow re-recorded in the second pass.

- 0-th `generate` call latency per token (new_tokens=100): 744.209 ms

- 1-th `generate` call latency per token (new_tokens=100): 1773.975 ms

- 2-th `generate` call latency per token (new_tokens=100): 11.069 ms

- 3-th `generate` call latency per token (new_tokens=100): 11.042 ms

- 4-th `generate` call latency per token (new_tokens=100): 11.035 ms
decoded static ["I would like to know how to get a copy of my medical records from my primary care physician.\nI would like to know how to get a copy of my medical records from my primary care physician.\nGetting a copy of your medical records from your primary care physician can be a straightforward process, but it's important to follow the proper steps to ensure you receive a complete and accurate copy of your records. Here are the general steps you can take:\n\n1. Contact your", "Today I am in Paris and I am feeling very grateful for this opportunity to explore this beautiful city. I have always wanted to visit Paris and now I am finally here, and it is even more beautiful than I imagined. The Eiffel Tower is stunning, the Louvre is incredible, and the food is delicious. I am soaking up every moment and making the most of my time here. I can't wait to see what the rest of the trip has in store for me. #grateful"]

fxmarty (Contributor, Author) commented Feb 21, 2024

^ Reference for this: pytorch/pytorch#120309
