Fix accelerate failing tests #30836
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks. Approving, but for the SSM cache I find it a bit weird that we have to do these `x.to()` calls.
Also, in general, accelerate is supposed to handle allllllll the devices, no? 👿
@@ -244,6 +244,7 @@ def slow_forward(self, input_states, cache_params: Optional[MambaCache]=None):
        # 2. Convolution sequence transformation
        if cache_params is not None:
            ssm_state = cache_params.ssm_states[self.layer_idx].clone()
If the cache param is always on the wrong device, we are always going to have to do the transfer here, no?
Yes, that's right!
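For context, here is a minimal, self-contained sketch of the transfer pattern being discussed (toy tensors and placeholder names, not the exact code from this PR):

```python
import torch

# Toy illustration: a cached SSM state created on one device is moved to the
# device of the current inputs before it is used, mirroring the `.to()`
# transfer discussed above for models dispatched across several devices.
ssm_states = {0: torch.zeros(2, 4)}        # cache, e.g. allocated on one device
input_states = torch.randn(2, 4)           # inputs, possibly on another device
ssm_state = ssm_states[0].clone()
if ssm_state.device != input_states.device:
    ssm_state = ssm_state.to(input_states.device)
out = input_states + ssm_state             # the op is now device-safe
```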
@@ -1388,7 +1388,7 @@ def forward(
            inputs_embeds, past_key_values_length=past_key_values_length, position_ids=position_ids
        )

-        hidden_states = inputs_embeds + positions
+        hidden_states = inputs_embeds + positions.to(inputs_embeds.device)
What was the issue with this? It's a bit weird here: is device placement of the inputs not automatic? Or is the embedding layer alone placed on a different device than the embed_positions?
`inputs_embeds` and `positions` can be on different devices (this could be fixed if we find a way to make sure that they are on the same device; however, we would probably need to add another attribute in `PretrainedConfig`). We can't really move these inputs to the right device automatically, since we do the ops in the forward itself. Wrapping them in an `nn.Module` to perform the op would work, but that's not a good solution.
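To make this concrete, a toy illustration (hypothetical module, not code from this PR): accelerate moves tensors automatically at `nn.Module` boundaries via hooks, but a bare tensor addition inside `forward` is not hooked, so when the two embedding layers end up on different devices the explicit `.to()` is needed.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Hypothetical decoder mimicking the embed_tokens / embed_positions split."""

    def __init__(self, vocab_size=32, max_pos=16, dim=8):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)      # may land on device A
        self.embed_positions = nn.Embedding(max_pos, dim)      # may land on device B

    def forward(self, input_ids):
        inputs_embeds = self.embed_tokens(input_ids)
        positions = self.embed_positions(
            torch.arange(input_ids.shape[1], device=input_ids.device)
        )
        # Plain tensor op in forward: no hook can move the operands for us,
        # so we transfer `positions` to wherever `inputs_embeds` lives.
        return inputs_embeds + positions.to(inputs_embeds.device)

print(ToyDecoder()(torch.randint(0, 32, (1, 10))).shape)  # torch.Size([1, 10, 8])
```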
Alright got it!
@@ -962,7 +962,7 @@ class BertModel(BertPreTrainedModel):
    `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
    """

-    _no_split_modules = ["BertEmbeddings"]
+    _no_split_modules = ["BertEmbeddings", "BertLayer"]
Is this a general solution that will work for everyone? (I mean, why was this failing the test?)
Yes, that will work for everyone! I don't know why the layers were not added to `_no_split_modules`; this is what we do for all models.
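For reference, a rough sketch of how this attribute is consumed on the accelerate side (assuming a standard BERT checkpoint; illustrative only, not code from this PR):

```python
from accelerate import infer_auto_device_map
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# _no_split_modules is forwarded as no_split_module_classes, so a whole
# BertLayer (attention, residuals, layernorms) stays on a single device and
# cross-device tensor ops inside the layer are avoided.
device_map = infer_auto_device_map(
    model, no_split_module_classes=model._no_split_modules
)
print(device_map)
```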
Long overdue! Thanks for updating 😄
Just wanted to know if those have been failing for quite some time or if it's a recent change (in either …)
I'm not sure how long these tests have been failing. However, most of the tests were failing for a good reason (didn't put the right …)
ok, thanks~
What does this PR do?
This PR fixes some tests (~30) that are covered by the accelerate_tests marker. Recently, I ran the following workflow to get an overview of the failing tests: https://github.com/huggingface/transformers/actions/runs/9081904684, or the following CLI:
RUN_SLOW=True pytest -m accelerate_tests tests/
Related PR #30808
Model fixed:
- … (needs `_preload_module_classes` to be used in `dispatch_model`), not sure if it is worth it if this is the only model for now; see the sketch after this list.

Tests skipped:
- See the run here with most of the tests passing. There are only 3 tests failing, but I skipped them in the latest commit (SiglipVisionModel).
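On the `_preload_module_classes` point in the list above, a hedged sketch of the accelerate API it maps to; `model` and `device_map` are assumed to be defined already, and the class name is a placeholder:

```python
from accelerate import dispatch_model

# preload_module_classes asks accelerate to load all of a module's weights up
# front, for modules whose forward uses weights owned by their submodules.
# "SomeCompositeModule" is a placeholder, not a real transformers class.
model = dispatch_model(
    model,
    device_map=device_map,
    preload_module_classes=["SomeCompositeModule"],
)
```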
cc @ydshieh