Falcon: batched generation #26137
Conversation
total_length = seq_len + past_key_values_length
if total_length > self.seq_len_cached:
    self._set_cos_sin_cache(total_length, device, dtype)
return (
    self.cos_cached[:, past_key_values_length : seq_len + past_key_values_length],
the slicing here is equivalent to building position ids from the sequence length, without taking into account any potential left-padding
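For context, a minimal sketch (not from this PR's diff) of how position ids can instead be derived from the attention mask so that left-padding is respected -- this mirrors the pattern used for other decoder-only models in the library:

```python
import torch

# Two sequences: the first is left-padded (real length 3), the second is full length
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# Count positions only over real tokens; padded slots get a dummy value (they are masked out anyway)
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```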
The documentation is not available anymore as the PR was closed or merged.
@ArthurZucker woops, sorry, there are still tests to fix, I will ping you again when they are fixed!
@ArthurZucker ready now
@@ -415,7 +437,11 @@ def forward(
        else:
            present = None

        attention_mask_float = (attention_mask * 1.0).masked_fill(attention_mask, float("-1e9")).to(query_layer.dtype)
cc @Rocketknight1 this `-1e9` was causing problems in some numerical precisions (it would be converted to `-inf`) :p
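For illustration (a small check, not part of the PR): float16 cannot represent magnitudes above roughly 65504, so a literal `-1e9` overflows to `-inf` when cast, whereas the dtype's own minimum stays finite:

```python
import torch

# -1e9 is outside the float16 range, so the cast overflows to -inf
print(torch.tensor(-1e9).to(torch.float16))  # tensor(-inf, dtype=torch.float16)

# torch.finfo exposes a dtype-aware minimum, a safer choice for an additive attention mask
print(torch.finfo(torch.float16).min)        # -65504.0
```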
Ah, my bad!
Thanks for adding batch support! Let's be careful with padding, as we have been getting a lot of issues regarding this! If we have a reference implementation (the one used in Llama), it would be great to re-use it!
Otherwise, LGTM
@@ -99,19 +99,40 @@ def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.cos_cached = self.cos_cached.type(dtype)
        self.sin_cached = self.sin_cached.type(dtype)

    def cos_sin(self, seq_len: int, past_key_values_length: int, device="cpu", dtype=torch.bfloat16) -> torch.Tensor:
    def cos_sin(
I'm gonna be a bit noisy here, but this looks a LOT like the rotary embedding we have in Llama, no?
The query expansion is also supported there, not sure how much of an overhead it is to first apply rotary then expand:

```python
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

if past_key_value is not None:
    # reuse k, v, self_attention
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
    value_states = torch.cat([past_key_value[1], value_states], dim=2)

past_key_value = (key_states, value_states) if use_cache else None

# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
```

and also storing the full-size keys and values is less memory efficient, no? (unrelated to the PR)
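For reference, the `repeat_kv` helper mentioned above looks roughly like this in the Llama modeling code (reproduced here as a sketch): the cache holds only `num_key_value_heads` key/value heads and they are expanded to the full head count right before attention, which is why caching the un-expanded tensors is the more memory-efficient option:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_key_value_heads, seq_len, head_dim) -> (batch, num_key_value_heads * n_rep, seq_len, head_dim)
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
```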
I think they are the same. So, we would benefit from copying the structure (at least in terms of complexity for us, the maintainers) 👍
I would like to push it to the future, though, as I'm about to go on long holidays and I'd like to enable batched generation on Falcon :D
Let's just add a TODO then 😉
@ArthurZucker 100% agreed! If you come across a new model, plz make sure there is a test for this 🙏
@ArthurZucker suggestions applied 💪
Thanks a lot! Let's add a TODO so that we don't ever forget we have to refactor this in the future! 😉
What does this PR do?
This PR does three things:
1. Fixes the attention mask fill value: in some numerical precisions, the numerical attention mask was getting `-inf`, which wrecked downstream computations.
2. Adds the `position_ids` input to Falcon, which is needed for proper batched generation. When it is not passed, the forward pass builds the position ids from the sequence length, which does not account for the left-padding in batched generation -- the model could still generate, but the results should be slightly better after the fix.
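As a usage sketch (illustrative, not from the PR -- the checkpoint and prompts are placeholders), batched generation with a decoder-only model like Falcon relies on left-padding, and passing the attention mask lets the model build correct position ids for the padded rows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default
tokenizer.padding_side = "left"            # decoder-only batching requires left-padding

model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["The best thing about open source is", "Hello, my name is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# the attention mask marks the padded positions, so position ids can be built correctly
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```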