
[generate + static cache + torch.compile] ability to pass statically shaped 4D attention_mask to the model forward #29165

Closed
fxmarty opened this issue Feb 21, 2024 · 5 comments · Fixed by #32227
Labels: Cache, Compilation (Issues related to torchdynamo and torchinductor), Generation

Comments

@fxmarty
Contributor

fxmarty commented Feb 21, 2024

Feature request

Currently, the attention_mask passed to the model forward is 2D and has a dynamic shape: it grows by one column at every decoding step.

This causes issues when using a compiled model with model.forward = torch.compile(model.forward, mode="reduce-overhead"), see pytorch/pytorch#120309 & #29114.
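
For reference, a minimal sketch of the kind of setup that produces the trace below, assuming a Llama checkpoint and the static cache; the checkpoint name and generation settings are illustrative and not taken from the issue:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any decoder-only model supporting the static cache would do.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Compile the forward with CUDA graphs ("reduce-overhead"); any change in an input
# shape (here the 2D attention_mask growing by one column per step) forces the
# graph to be re-recorded.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer(["Hello", "Hi there"], return_tensors="pt", padding=True).to("cuda")
# cache_implementation="static" selects the static cache; the exact flag may vary by version.
out = model.generate(**inputs, max_new_tokens=4, cache_implementation="static")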

----- in forward 0
name=input_ids, shape=torch.Size([2, 7]), stride=(7, 1), dtype=torch.int64, device=cuda:0
name=position_ids, shape=torch.Size([2, 7]), stride=(7, 1), dtype=torch.int64, device=cuda:0
name=cache_position, shape=torch.Size([7]), stride=(1,), dtype=torch.int64, device=cuda:0
name=past_key_values, value=None
name=use_cache, value=True
name=attention_mask, shape=torch.Size([2, 7]), stride=(7, 1), dtype=torch.int64, device=cuda:0
forward call latency: 1784.737 ms      <---------------------------- EXTREMELY SLOW.
----- in forward 1
name=input_ids, shape=torch.Size([2, 1]), stride=(1, 1), dtype=torch.int64, device=cuda:0
name=position_ids, shape=torch.Size([2, 1]), stride=(1, 1), dtype=torch.int64, device=cuda:0
name=cache_position, shape=torch.Size([1]), stride=(1,), dtype=torch.int64, device=cuda:0
name=past_key_values, value=None
name=use_cache, value=True
name=attention_mask, shape=torch.Size([2, 8]), stride=(8, 1), dtype=torch.int64, device=cuda:0
forward call latency: 1851.579 ms      <---------------------------- EXTREMELY SLOW.
----- in forward 2
name=input_ids, shape=torch.Size([2, 1]), stride=(1, 1), dtype=torch.int64, device=cuda:0
name=position_ids, shape=torch.Size([2, 1]), stride=(1, 1), dtype=torch.int64, device=cuda:0
name=cache_position, shape=torch.Size([1]), stride=(1,), dtype=torch.int64, device=cuda:0
name=past_key_values, value=None
name=use_cache, value=True
name=attention_mask, shape=torch.Size([2, 9]), stride=(9, 1), dtype=torch.int64, device=cuda:0
forward call latency: 1421.504 ms      <---------------------------- EXTREMELY SLOW.
----- in forward 3
name=input_ids, shape=torch.Size([2, 1]), stride=(1, 1), dtype=torch.int64, device=cuda:0
name=position_ids, shape=torch.Size([2, 1]), stride=(1, 1), dtype=torch.int64, device=cuda:0
name=cache_position, shape=torch.Size([1]), stride=(1,), dtype=torch.int64, device=cuda:0
name=past_key_values, value=None
name=use_cache, value=True
name=attention_mask, shape=torch.Size([2, 10]), stride=(10, 1), dtype=torch.int64, device=cuda:0
forward call latency: 1740.283 ms      <---------------------------- EXTREMELY SLOW.

Instead, we may want to pass 4D masks of static shape directly to the model, avoiding CUDA graph re-recording at every step. This could bring the compile time down from more than 3 minutes to roughly 50 seconds.
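
As a hedged sketch of what this could look like: build an additive 4D mask with a fixed (batch, 1, query_len, max_cache_len) shape, so every decoding step sees the same tensor shape. The helper name and signature below are illustrative, not an existing transformers API, and the causal structure needed during prefill is omitted for brevity.

import torch

def build_static_4d_mask(attention_mask_2d, query_len, max_cache_len, dtype=torch.float16):
    # attention_mask_2d: (batch, key_len) with 1 for valid tokens and 0 for padding.
    batch_size = attention_mask_2d.shape[0]
    min_value = torch.finfo(dtype).min
    # Start fully masked, then unmask the positions marked valid in the 2D mask.
    mask = torch.full(
        (batch_size, 1, query_len, max_cache_len),
        min_value,
        dtype=dtype,
        device=attention_mask_2d.device,
    )
    key_len = attention_mask_2d.shape[-1]
    valid = attention_mask_2d[:, None, None, :].to(torch.bool)  # (batch, 1, 1, key_len)
    mask[..., :key_len] = mask[..., :key_len].masked_fill(valid, 0.0)
    return mask  # shape stays (batch, 1, query_len, max_cache_len) at every step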

Motivation

/

Your contribution

/

@amyeroberts
Collaborator

cc @gante

@fxmarty added the Compilation label Feb 28, 2024
@huggingface deleted a comment from github-actions bot Mar 25, 2024
@gante
Member

gante commented Mar 27, 2024

@fxmarty I had a quick look at this -- we still have models (like gpt2) that exclusively accept 2D masks. We would have to rework that before making generate prepare a 4D mask.

We may want to work with padded tensors instead? In our TF/PT XLA implementation we have static shapes everywhere.

@fxmarty
Contributor Author

fxmarty commented Mar 27, 2024

@gante thank you! What do you mean by work with padded tensors?

@gante
Member

gante commented Mar 27, 2024

@fxmarty If we want to generate with max_length=512, the attention mask is always kept with sequence length = 512. Data is moved around inside the tensor as needed.
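
For illustration, a minimal sketch of that bookkeeping (purely illustrative, not the actual generate internals):

import torch

# Keep the 2D attention_mask at a fixed length (max_length) for the whole
# generation, flipping entries in place as tokens are produced, so its shape
# never changes across decoding steps.
max_length = 512
batch_size, prompt_len = 2, 7

attention_mask = torch.zeros(batch_size, max_length, dtype=torch.long)
attention_mask[:, :prompt_len] = 1  # positions occupied by the prompt

for step in range(3):  # each new token marks one more position as attended-to
    attention_mask[:, prompt_len + step] = 1
# attention_mask.shape stays (2, 512) throughout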

@fxmarty
Contributor Author

fxmarty commented Mar 27, 2024

Oh yes, I think that's what I am suggesting here!
