
Init cache on meta device #35164

Merged 14 commits into main on Jan 22, 2025

Conversation

@zucchini-nlp (Member) commented on Dec 9, 2024

What does this PR do?

Fixes #33147.

Initializes the cache on the meta device until we get the first key/values and can infer the device from them. Removed layer_device_map, as it is no longer needed in most cases, because initializing the cache on the meta device is now the default when no device is given. The offloaded static cache still requires a layer device map or a single device, since it prefetches key/values in advance, so we cannot wait to infer the device from the input.
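
For reference, a minimal sketch of the lazy-init pattern described above (shapes, names, and the update helper are made up for illustration; this is not the actual StaticCache implementation):

```python
import torch

# Allocate the per-layer cache on the meta device (no memory, no device decision yet),
# then materialize each layer on the device of the first key states it receives.
num_layers, batch, heads, max_len, head_dim = 2, 1, 4, 16, 8
key_cache = [
    torch.zeros(batch, heads, max_len, head_dim, device="meta") for _ in range(num_layers)
]

def update(layer_idx: int, key_states: torch.Tensor) -> torch.Tensor:
    k_out = key_cache[layer_idx]
    if k_out.device.type == "meta":
        # First update for this layer: infer device/dtype from the incoming states.
        k_out = torch.zeros(*k_out.size(), device=key_states.device, dtype=key_states.dtype)
        key_cache[layer_idx] = k_out
    k_out[:, :, : key_states.shape[2]] = key_states
    return k_out

# Example: layer 0 receives CPU key states, so its cache is materialized on CPU.
update(0, torch.randn(batch, heads, 3, head_dim))
```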

One thing to note is that the offloaded static cache can never run on current main, because torch.cuda.stream() is not fullgraph-compile compatible. Found a related issue from the torch team on that: pytorch/pytorch#92804. On current main I can't run it even with graph breaks allowed; it fails after the 3rd layer on torch.cuda.default_stream(self.device).wait_stream(self._prefetch_stream). With this branch it runs if we allow graph breaks, until the compile cache limit is reached.
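
For context, a rough sketch of the stream-based prefetch pattern being discussed (purely illustrative; the names, shapes, and helper functions are made up and do not match the actual OffloadedStaticCache code):

```python
import torch

# Side-stream prefetch: copy the next layer's CPU cache to the GPU while the current
# layer computes. The torch.cuda.stream()/wait_stream() pair below is the part that
# fullgraph compilation currently cannot trace.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    prefetch_stream = torch.cuda.Stream(device=device)
    cpu_key_cache = [torch.zeros(1, 4, 16, 8).pin_memory() for _ in range(4)]
    device_key_cache = torch.zeros(1, 4, 16, 8, device=device)

    def prefetch_layer(layer_idx: int) -> None:
        # Issue the host-to-device copy asynchronously on the side stream.
        with torch.cuda.stream(prefetch_stream):
            device_key_cache.copy_(cpu_key_cache[layer_idx], non_blocking=True)

    def wait_for_prefetch() -> None:
        # Make the default stream wait until the prefetch copy has finished.
        torch.cuda.default_stream(device).wait_stream(prefetch_stream)

    prefetch_layer(0)
    wait_for_prefetch()
```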

Since this is the only cache that behaves differently, maybe we can make it not an instance of StaticCache, at least until fullgraph compile works for it.


  • Slow tests in test_cache_utils.py and on Llama are green when compared to the main branch. Some slow Llama tests are red on main and do not use the static cache, so those failures weren't caused by this PR.

Issue repro with 2 GPUs

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = 'google/gemma-2-2b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Split the model across two GPUs: layers 0-12 on cuda:0, layers 13-25 on cuda:1
device_map = {"model.embed_tokens": 0, "model.norm": 1, "model.rotary_emb": 1, "lm_head": 0}
num_hidden_layers = 26
for i in range(num_hidden_layers):
    device_map[f"model.layers.{i}"] = 0 if i < 13 else 1

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map=device_map,
)

inputs = tokenizer("Today is a beautiful day!", return_tensors='pt').to(0)
# On main, this call fails with "indices should be either on cpu or on the same device as the indexed tensor (cuda:1)"
out = model(**inputs)

Benchmark on llama + compile with meta-llama/Llama-3.2-1B-Instruct:

[benchmark figure]

@zucchini-nlp requested a review from gante on December 9, 2024 11:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp changed the title from "[WIP] Init cache on meta device" to "Init cache on meta device" on Dec 10, 2024

@gante (Member) left a comment:

Regarding the offloaded cache: is there a plausible use case where the current changes would be a fix? Given that it is meant to be used in a low-resource setting (= no multiple GPUs), I would leave it in its original implementation if multi-GPU support is the only purpose of the changes. In offloading we want to be precise with devices, which is the opposite of meta.

Other than that LGTM :)

This PR has many moving pieces associated with it, so I'm leaving a few questions below to ensure we dot all the i's and cross all the t's:

  1. The original issue is long. Let's document this PR with a minimal example to reproduce the issue. It will be useful in case we need to understand why we did this;
  2. I'm assuming you ran tests locally, but let's write down which commands were run (minimum: slow llama tests, slow cache tests);
  3. Have you benchmarked llama + static cache + compilation? If yes, leave a note in the PR header. If not, please double-check :)
  4. [If we want to keep the changes for the offloaded cache] Ditto for the offloaded cache: make sure it is benchmarked before and after these changes. I'm assuming existing tests would catch any correctness issue.

@@ -462,26 +462,6 @@ def test_static_cache_greedy_decoding_pad_right(self, attn_implementation, cache
with self.subTest(f"{attn_implementation}, static, eager"):
    self.assertListEqual(decoded, EXPECTED_GENERATION)

set_seed(0)

Member:

I suspect this is related to what you wrote about offloaded caches + cuda graphs.

We test compilation in other places, so I agree it is fine to delete :)

@zucchini-nlp (Member, Author) commented on Jan 15, 2025:

Yep, exactly! Since we compile whenever a StaticCache is used, the model gets compiled for cache_implementation="offloaded_cache".

And these tests were not being run, so they did not catch the graph breaks. If we want to keep cache_implementation="offloaded_cache" working, we either disable compile specifically for this cache type or make this cache not an instance of StaticCache. I am not sure if we are planning to keep the "auto-compile the forward if static cache" feature.
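
A hypothetical sketch of the first option (the gating function is made up; only the two cache classes are real):

```python
from transformers.cache_utils import OffloadedStaticCache, StaticCache

def should_auto_compile(cache) -> bool:
    # Hypothetical gate: auto-compile the forward only for the plain static cache,
    # skipping the offloaded variant whose CUDA-stream prefetch breaks fullgraph compile.
    return isinstance(cache, StaticCache) and not isinstance(cache, OffloadedStaticCache)
```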

@ArthurZucker (Collaborator) left a comment:

It's indeed a better way to do this! Let's iterate a bit tho!

Comment on lines 1205 to 1209
if k_out.device.type == "meta":
    k_out = torch.zeros(*k_out.size(), device=key_states.device, dtype=key_states.dtype)
    v_out = torch.zeros(*v_out.size(), device=value_states.device, dtype=value_states.dtype)
    self.key_cache[layer_idx] = k_out
    self.value_cache[layer_idx] = v_out

Collaborator:

IMO the else branch should do:

        k_out = self.key_cache[layer_idx]
        v_out = self.value_cache[layer_idx]
Not sure, memory-wise, whether k_out is freed first or whether more memory gets allocated at this point.

@zucchini-nlp (Member, Author):

Yes, it shows more memory allocation for me. I will rewrite it a bit to make the code easier to inspect.

Comment on lines +318 to +319
"Hello I am doing a project for my school and I am trying to make a program that will allow me to input a",
"Hello I am doing a project for my school and I am trying to make a program that will allow me to use a",

@zucchini-nlp (Member, Author):

The test was never passing because it had no num_beams (the hub config hasn't changed since release), so I just fixed it.

@SunMarc (Member) left a comment:

LGTM! Thanks for fixing this!

@ArthurZucker (Collaborator) left a comment:

Before merging, let's make sure we use the correct torch primitives and the simplest code possible!

zucchini-nlp and others added 2 commits January 21, 2025 11:49
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

@ArthurZucker (Collaborator) left a comment:

A lot better; just be careful with one change and it should be good!

Comment on lines -1813 to -1829
self.conv_states: torch.Tensor = torch.zeros(
    config.num_hidden_layers,
    self.max_batch_size,
    self.intermediate_size,
    self.conv_kernel_size,
    device=device,
    dtype=dtype,
)
self.ssm_states: torch.Tensor = torch.zeros(
    config.num_hidden_layers,
    self.max_batch_size,
    self.intermediate_size,
    self.ssm_state_size,
    device=device,
    dtype=dtype,
)

Collaborator:

why do we change the shape here?

@zucchini-nlp (Member, Author) commented on Jan 21, 2025:

So we can move tensors from "meta" to "cuda" layer by layer, whenever the cache is updated for a given layer. Otherwise we could move the whole 5D cache at once when the first layer is updated, but I just wanted to be consistent with the other static cache classes.
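
A minimal sketch of the per-layer layout being described, assuming illustrative shapes and a made-up materialize helper (not the actual mamba cache implementation):

```python
import torch

# One tensor per layer instead of a single stacked 5D tensor, so each layer's state
# can be moved off the meta device independently when that layer is first updated.
num_layers, batch, inter, kernel, ssm = 4, 1, 8, 4, 16
conv_states = [torch.zeros(batch, inter, kernel, device="meta") for _ in range(num_layers)]
ssm_states = [torch.zeros(batch, inter, ssm, device="meta") for _ in range(num_layers)]

def materialize(states, layer_idx, like):
    # Replace the meta placeholder for this layer with a real tensor on the device
    # and dtype of the incoming hidden states.
    if states[layer_idx].device.type == "meta":
        states[layer_idx] = torch.zeros(
            states[layer_idx].size(), device=like.device, dtype=like.dtype
        )
    return states[layer_idx]

# Example: layer 0 is materialized on CPU when CPU hidden states arrive.
materialize(conv_states, 0, torch.randn(batch, inter, kernel))
```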

Collaborator:

OK, sounds good; in that case the devices can indeed be different!

@zucchini-nlp merged commit 373e50e into huggingface:main on Jan 22, 2025
25 checks passed
bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Jan 31, 2025
* init cache on meta device

* offloaded static + enable tests

* tests weren't running before  :(

* update

* fix mamba

* fix copies

* update

* address comments and fix tests

* fix copies

* Update src/transformers/cache_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* update

* mamba fix

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
Successfully merging this pull request may close this issue: Multi-GPU setup: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)