
Reducing memory usage: removing useless logits computation in generate() #31292

Merged: 35 commits merged into huggingface:main from the logits-dtype branch on Aug 23, 2024

Conversation

@Cyrilvallez (Member) commented Jun 6, 2024

What does this PR do?

This is the PR related to the discussion in #30860.
I followed what has been done in Jamba and added support for the num_logits_to_keep argument in forward(). However, even if this argument is None, the logits are only upcast to float if labels are passed (in order to accurately compute the loss). Otherwise, the upcasting only happens in the generate() functions.

For now, I have only modified Llama and Mistral, but if you agree with the changes I will add support for more models.
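
To illustrate the idea, here is a minimal, hypothetical sketch of the slicing and the conditional upcast (this is not the actual transformers modeling code, whose forward() handles many more arguments; names and sizes are made up for the example):

import torch
import torch.nn as nn

class TinyCausalLMHead(nn.Module):
    # Toy LM head showing the num_logits_to_keep slicing (illustrative only).

    def __init__(self, hidden_dim=64, vocab_size=128256):
        super().__init__()
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden_states, labels=None, num_logits_to_keep=0):
        # hidden_states: (batch, seq_len, hidden_dim) coming out of the decoder.
        # With num_logits_to_keep=1 (the generate() case), only a
        # (batch, 1, vocab_size) logits tensor is materialized instead of
        # (batch, seq_len, vocab_size); 0 keeps every position.
        logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
        if labels is not None:
            # Upcast to float32 only when a loss is actually computed.
            logits = logits.float()
        return logits

head = TinyCausalLMHead()
hidden = torch.randn(2, 256, 64)
print(head(hidden, num_logits_to_keep=1).shape)  # torch.Size([2, 1, 128256])
print(head(hidden).shape)                        # torch.Size([2, 256, 128256])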

Benchmarks

Here I provide some benchmarks of the peak memory usage. For each input size, I generated 10 additional tokens.
Of course, since for a small number of additional tokens the memory peak is dominated by the first forward pass (at least when the whole logits matrix is computed), and since that first forward scales linearly with input size and batch size (with the new attention algorithms), the gain factor is effectively constant across input sizes and generation methods (except for contrastive search, which artificially increases the batch size after the first forward, so its memory usage is slightly different). However, I still provide results for all generation methods for completeness.

Basically, we get:
Llama3 8B -> a MIND-BLOWING 3.62x reduction factor in peak memory usage (due to the large vocabulary)
Llama2 7B -> 1.17x reduction factor
Mistral 7B -> 1.32x reduction factor

Note that the memory reduction shown here comes on top of whatever gains #30536 already provides, as I am comparing against the main transformers branch after that PR was merged. The two integrate very nicely: that PR provides most of its benefits when generating many tokens, while this one provides gains when generating only a small number of new tokens.

Attached benchmark plots (one per generation method):
- greedy.pdf
- sample.pdf
- beam sample.pdf
- beam search.pdf
- group beam search.pdf
- contrastive search.pdf

Here is a link to the benchmark script: https://gist.github.com/Cyrilvallez/92f48e402aa2968c854a8128796f50c3
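
Roughly, the measurement boils down to the sketch below (the model name, prompt length, and generation settings are placeholders; see the linked gist for the actual script):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A long prompt so that the first forward pass dominates the memory peak
inputs = tokenizer("Hello " * 2000, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")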

Who can review?

@ArthurZucker @gante Let me know what you think about the proposed changes!

@amyeroberts (Collaborator)

cc @ArthurZucker @gante

@ArthurZucker (Collaborator) left a comment

Thanks! Looks quite alright IMO!

Comment on lines 1184 to 1191
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
if num_logits_to_keep is None:
    logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
else:
    logits = [
        F.linear(hidden_states[:, -num_logits_to_keep:, :], lm_head_slices[i])
        for i in range(self.config.pretraining_tp)
    ]

Let's not update this; pretraining TP is really never used in practice. I'll deprecate it.


@gante (Member) left a comment

LGTM, thank you for further reducing the memory needs @Cyrilvallez 💛

num_logits_to_keep is not the prettiest interface, but I can't think of a better one (as discussed in the PR that introduced it).

I'm happy with the PR, with the exception of the backward-compatibility (BC) handling.

@gante (Member) commented Jun 18, 2024

btw, a ratio of 3x lower peak memory consumption is 🔥 🔥 🔥

@gante (Member) left a comment

LGTM 👍

@Cyrilvallez (Member, Author) commented Jun 21, 2024

I just added the change to more models and rebased to avoid conflicts with new commits in main!
For Cohere-based models, most notably, I measured a memory reduction factor of 6.68 due to the very large 256k vocabulary size 🚀🔥

The last thing to take into account is your comment about the signature, @ArthurZucker, but I am not sure I understood correctly what you wanted to do 🤓

@ArthurZucker (Collaborator) left a comment

Overall LGTM

Comment on lines 1141 to 1144
if num_logits_to_keep is None:
    logits = self.lm_head(hidden_states).float()
else:
    logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()

We can default num_logits_to_keep to 0 to always slice (no separate code paths).

A Member replied:

Uhmmm, does self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float() work with the default value 0? 🤔
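
For reference, -0 is just 0 in Python, so a slice starting at -0 keeps the full sequence dimension; a quick illustration (the tensor shapes are arbitrary):

import torch

hidden_states = torch.randn(2, 5, 16)  # (batch, seq_len, hidden_dim)

num_logits_to_keep = 0
# -0 == 0, so [:, -0:, :] selects every position
assert hidden_states[:, -num_logits_to_keep:, :].shape == (2, 5, 16)

num_logits_to_keep = 1
# keeps only the last position, which is all generate() needs per step
assert hidden_states[:, -num_logits_to_keep:, :].shape == (2, 1, 16)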

@ArthurZucker (Collaborator)

Make sure to rebase, as the state of the main branch has changed quite a bit!

@Cyrilvallez (Member, Author)

Will do! However, when playing with torch.compile, I noticed that adding a logger.warning_once() in the forward breaks the graph with the following error: Unsupported: call_method UserDefinedObjectVariable(Logger) warning_once [ConstantVariable()] {}. This is with the latest PyTorch version (2.3.1), so I will make sure to change that / make it compile-compatible as well.

@Cyrilvallez (Member, Author)

DO NOT MERGE YET
Everything else is good, but I still need to sort out the logger.warning_once / torch.compile issue.

@Cyrilvallez force-pushed the logits-dtype branch 3 times, most recently from 9855f62 to f4da824 on July 17, 2024 at 11:46
@Cyrilvallez (Member, Author)

@ArthurZucker @gante everything is now ready.
From my tests, it seems that compile does not support any print-like functionality at the moment, whether from print, logger, or warnings.
I first wanted to add a logger.warning_once_compile_safe function, which I thought would simplify things and come in handy in the future as well, but couldn't, because it would require importing torch in the logging module, which breaks things.
So I just added a compile check everywhere.
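
For reference, the guard boils down to the pattern below (a simplified sketch; the helper name _warn_default_dtype_change is made up, but is_torchdynamo_compiling and logger.warning_once are the utilities quoted in the review below):

from transformers.utils import is_torchdynamo_compiling, logging

logger = logging.get_logger(__name__)

def _warn_default_dtype_change(labels):
    # Skip the warning while torch.compile is tracing: dynamo cannot trace
    # through logger.warning_once and would otherwise break the graph.
    if labels is None and not is_torchdynamo_compiling():
        logger.warning_once(
            "Starting from v4.45, the `logits` model output will have the same "
            "type as the model (except at train time, where it will always be FP32)"
        )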

@ringohoffman (Contributor)

@ArthurZucker is this planned for review this week? I’m pretty eager to consume this PR.

@ArthurZucker (Collaborator) commented Jul 26, 2024

Yes! Reviewing asap!

@Oxi84 commented Aug 1, 2024

Looking forward to testing this out; Gemma2 otherwise uses a lot of memory, and it is a top model.

@ArthurZucker (Collaborator) left a comment

Well, LGTM! The one thing missing is a test in the mixins.

logits = self.lm_head(hidden_states)
if labels is None and not is_torchdynamo_compiling():
    logger.warning_once(
        "Starting from v4.44, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"

Suggested change:
-        "Starting from v4.44, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"
+        "Starting from v4.45, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"


Maybe let's add a comment that we need .float() for a full-precision softmax.

@ringohoffman (Contributor)

Hey @Cyrilvallez, thanks for your work. Just checking in regarding this PR. Do you have a plan to finish it up some time soon? I'm very excited for it to land!

@Cyrilvallez (Member, Author)

Hi @ringohoffman, don't worry, I am not forgetting about this 😉 I'm currently on vacation, so I will try to wrap it up quickly at the end of August when I come back, if I have time. Worst-case scenario, it will be ready mid-September.

In the meantime, you can install transformers from my fork if you want to already benefit from it (pip install git+https://github.com/Cyrilvallez/transformers@logits-dtype). Or even better, you can clone my fork and rebase it onto transformers/main to get all the new stuff + this PR.

@Boubou78000

Does this PR actually fix Gemma2, or just Gemma?

@Cyrilvallez (Member, Author)

Gemma2 had not been released yet when I started this, but don't worry, I will add it as well; it's on the roadmap 🤗

@Cyrilvallez (Member, Author) commented Aug 21, 2024

@ArthurZucker I added support for Gemma2 as well as tests, ready for a last review 🤗
The red CIs are not related to this PR.

@Cyrilvallez (Member, Author)

No worries! All good on the CIs and ready to be merged 🤗

@gante merged commit 22e6f14 into huggingface:main on Aug 23, 2024 (23 checks passed)
@ringohoffman (Contributor)

Congrats, @Cyrilvallez!

When is this planned to be released? @ArthurZucker @gante

@gante (Member) commented Aug 23, 2024

@ringohoffman our rule of thumb is to release every month, so it should be in ~2 weeks 🤗

@ringohoffman mentioned this pull request on Oct 2, 2024
ringohoffman added a commit to ringohoffman/transformers that referenced this pull request Oct 14, 2024
ArthurZucker pushed a commit that referenced this pull request Oct 18, 2024
* Only cast logits to float when computing loss

Some misses from #31292 and #33902

* Move logits.float() into existing if labels is not None branch
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Oct 21, 2024
* Only cast logits to float when computing loss

Some misses from huggingface#31292 and huggingface#33902

* Move logits.float() into existing if labels is not None branch
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…e() (huggingface#31292)

* Add .float() in all generation methods logit outputs

* Switch float-casting of logits to training only for main models

* Add `num_logits_to_keep` in Llama and add it by default in generate

* Apply style

* Add num_logits_to_keep as arg in prepare_input_for_generation

* Add support for Mistral

* Revert models except llama and mistral

* Fix default None value in _supports_num_logits_to_keep()

* Fix dimension of dummy input

* Add exception for prophetnet in _supports_num_logits_to_keep()

* Update _supports_num_logits_to_keep() to use inspect.signature()

* Add deprecation cycle + remove modification with pretraining_tp

* Apply style

* Add most used models

* Apply style

* Make `num_logits_to_keep` an int in all cases to remove if-else clause

* Add compile check for the warning

* Fix torch versions

* style

* Add gemma2

* Update warning version

* Add comment about .float operations in generation utils

* Add tests in GenerationTesterMixin and ModelTesterMixin

* Fix batch size for assisted decoding in tests

* fix small issues in test

* refacor test

* fix slicing removing dim issue

* Add nemotron support (should fix check-copy issue in CIs)

* Trigger new CIs

* Trigger new CIs

* Bump version

* Bump version in TODO

* Trigger CIs

* remove blank space

* Trigger CIs
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Only cast logits to float when computing loss

Some misses from huggingface#31292 and huggingface#33902

* Move logits.float() into existing if labels is not None branch