[Wav2Vec2 Conformer] Fix inference float16 #25985

Merged

sanchit-gandhi merged 4 commits into huggingface:main from w2v2-conformer on Sep 5, 2023

Conversation

sanchit-gandhi (Contributor) commented on Sep 5, 2023

What does this PR do?

Fixes #25964 - the Wav2Vec2 Conformer model with rotary embeddings now works when loaded with from_pretrained in float16. The issue originated in the rotary embedding layer, which always returned the positional embeddings in float32, regardless of the dtype of the hidden states.
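
For context, a minimal repro of the kind described in the issue looks roughly like this (the checkpoint name is taken from #25964; the dummy audio and the decode step are purely illustrative, not the exact snippet from the issue):

import torch
from transformers import AutoProcessor, Wav2Vec2ConformerForCTC

# Rotary-embedding checkpoint from the linked issue (#25964)
model_id = "facebook/wav2vec2-conformer-rope-large-960h-ft"

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ConformerForCTC.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# One second of dummy audio at 16 kHz, just to exercise the forward pass
dummy_speech = torch.randn(16000).numpy()
inputs = processor(dummy_speech, sampling_rate=16000, return_tensors="pt")
input_values = inputs.input_values.to("cuda", dtype=torch.float16)

# Previously the rotary embeddings came back in float32 and clashed with the
# float16 hidden states; with the fix the whole forward pass stays in fp16
with torch.no_grad():
    logits = model(input_values).logits

print(processor.batch_decode(torch.argmax(logits, dim=-1)))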

@@ -901,6 +901,26 @@ def test_speech_to_text_leveraged(self):
output = speech_recognizer(filename)
self.assertEqual(output, {"text": "a man said to the universe sir i exist"})

@slow
@require_torch_gpu
def test_wav2vec2_conformer_float16(self):
sanchit-gandhi (Contributor, Author) commented:

This is the error repro that was failing before, @Vaibhavs10 - I've added a slow integration test to make sure this works after the fix.
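
The hunk above only shows the test signature; a rough sketch of what such a slow fp16 pipeline test can look like is below (the placeholder test class, the dummy LibriSpeech split, and the weak assertion are illustrative - the committed test lives in the existing pipeline test class and may differ in its inputs and assertions):

import unittest

import torch
from datasets import load_dataset

from transformers import pipeline
from transformers.testing_utils import require_torch_gpu, slow


class Wav2Vec2ConformerFloat16Sketch(unittest.TestCase):
    # Placeholder class so the sketch is self-contained; in the PR the test is
    # added alongside test_speech_to_text_leveraged in the pipeline tests
    @slow
    @require_torch_gpu
    def test_wav2vec2_conformer_float16(self):
        speech_recognizer = pipeline(
            task="automatic-speech-recognition",
            model="facebook/wav2vec2-conformer-rope-large-960h-ft",
            torch_dtype=torch.float16,
            device="cuda:0",
        )

        # Dummy LibriSpeech split commonly used in transformers tests
        dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
        sample = dataset[0]["audio"]

        output = speech_recognizer(sample)
        # The expected transcription is not reproduced here; the point is that
        # the fp16 forward pass no longer raises a dtype mismatch
        self.assertIsInstance(output["text"], str)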

A Member replied:

Perfect! Thanks <3

@@ -406,13 +406,15 @@ def forward(self, hidden_states):
return self.cached_rotary_positional_embedding

self.cached_sequence_length = sequence_length
# Embeddings are computed in the dtype of the inv_freq constant
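
For readers without the full diff: the hunk above only shows the new comment line. Roughly, the change amounts to the pattern sketched below (a paraphrase rather than the literal patch, with an illustrative stand-in class and ad-hoc constructor arguments): compute the embeddings in the dtype of the inv_freq buffer, then cast the cached cos/sin back to the dtype of hidden_states before returning.

import torch
from torch import nn


class RotaryEmbeddingSketch(nn.Module):
    # Illustrative stand-in for Wav2Vec2ConformerRotaryPositionalEmbedding,
    # not the actual class (which takes the model config instead)
    def __init__(self, dim, base=10000):
        super().__init__()
        # inv_freq is built in float32, so the embeddings are computed in float32
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        self.cached_sequence_length = None
        self.cached_rotary_positional_embedding = None

    def forward(self, hidden_states):
        sequence_length = hidden_states.shape[1]

        if sequence_length == self.cached_sequence_length and self.cached_rotary_positional_embedding is not None:
            return self.cached_rotary_positional_embedding

        self.cached_sequence_length = sequence_length
        # Embeddings are computed in the dtype of the inv_freq constant (float32) ...
        time_stamps = torch.arange(sequence_length).type_as(self.inv_freq)
        freqs = torch.einsum("i,j->ij", time_stamps, self.inv_freq)
        embeddings = torch.cat((freqs, freqs), dim=-1)

        # ... and the gist of the fix: cast the cached cos/sin to the dtype of
        # the hidden states (e.g. float16) instead of always returning float32
        self.cached_rotary_positional_embedding = torch.stack(
            [embeddings.cos(), embeddings.sin()]
        ).type_as(hidden_states)
        return self.cached_rotary_positional_embedding

With float16 hidden states the returned stack is float16 as well, so the attention layers that consume it no longer mix dtypes.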

ArthurZucker (Collaborator) commented:

This now looks a lot like:

class LlamaRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )

Wondering if we can add # Copied from and use this, and whether the dynamic scaling could also work for audio models?

sanchit-gandhi (Contributor, Author) replied:

Can't use # Copied from on the whole module, since Wav2Vec2ConformerRotaryPositionalEmbedding accepts the config as an argument whereas LlamaRotaryEmbedding takes various ad-hoc arguments. But we could do a similar dynamic slicing - I'll add this in a follow-up PR so as not to block @Vaibhavs10.

HuggingFaceDocBuilderDev commented Sep 5, 2023

The documentation is not available anymore as the PR was closed or merged.

ylacombe (Contributor) left a comment:

LGTM! Thanks for taking care of this!

ArthurZucker (Collaborator) left a comment:

Looks good to me! Left a nit, but I think we can use the LlamaRotary class now 😄

@sanchit-gandhi merged commit 8d51801 into huggingface:main on Sep 5, 2023
@sanchit-gandhi deleted the w2v2-conformer branch on September 5, 2023 at 17:26
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* [Wav2Vec2 Conformer] Fix inference float16

* fix test

* fix test more

* clean pipe test
Development

Successfully merging this pull request may close these issues.

[bug] facebook/wav2vec2-conformer-rope-large-960h-ft refuses to work in fp16
6 participants