
🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. #35615

Merged: 6 commits into huggingface:main on Jan 17, 2025

Conversation

@rwightman (Contributor) commented Jan 10, 2025

What does this PR do?

As per #29554, this string replacement has been in transformers for a long time, dating back to the original Bert TensorFlow ports. Over the years, searching issues and reading code shows it has caused confusion and forced non-ideal workarounds (renaming weights just to avoid the substitution) when bringing models into transformers. I believe this should be fixed. It is impacting timm integrations with transformers: @qubvel worked around it for TimmWrapperModel, but it is not very practical to work around it for TimmBackbone.

This is a proposed fix. It keeps the rename, but reduces its scope considerably by prepending 'LayerNorm.' to the search string. That should limit the scope enough to be unlikely to impact other models being brought in, or timm integrations: 'LayerNorm' as an attribute name is not PEP 8 compliant or expected in most languages, so new code is unlikely to use it. It would be helpful to have someone with 5+ years of memory confirm that Bert was the only model likely impacted. The Bert weights on the Hub definitely need this rename.

I also made a few optimizations:

  • When searching strings, it is good practice to use .startswith or .endswith where appropriate, since it limits the search space considerably. Here we only care about the end of each key, so with .endswith most comparisons exit after comparing one or two characters, whereas the current 'in' check scans most of every key.

  • The fix_ functions that were added for timm compatibility were redoing a key != new_key comparison, which again compares many characters before exiting because the replaced substring is at the end. The fix function already knows whether a replacement happened, so returning a Tuple[str, bool] is, I feel, an appropriate tradeoff.

  • For logging, it is not necessary to search every key again; we only need to record a key if it actually changed, and we don't need to check whether it already exists in the dict, since `key in dict` and `dict[key] = value` are both O(1) and assignment simply overwrites any existing entry. (A rough sketch of the narrowed, .endswith-based rename follows after this list.)
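For illustration, here is a minimal sketch of the kind of rename helper described above; the function name fix_gamma_beta_key and the small demo loop are hypothetical, not the actual transformers implementation:

from typing import Tuple

# Sketch only: keys are renamed only when they end in 'LayerNorm.gamma' /
# 'LayerNorm.beta', and the caller is told whether anything changed so it
# does not have to re-compare the strings.
def fix_gamma_beta_key(key: str) -> Tuple[str, bool]:
    if key.endswith("LayerNorm.gamma"):
        return key[: -len("gamma")] + "weight", True
    if key.endswith("LayerNorm.beta"):
        return key[: -len("beta")] + "bias", True
    return key, False

# tiny demo of the "only log what actually changed" point
state_dict_keys = ["bert.embeddings.LayerNorm.gamma", "bert.encoder.layer.0.output.dense.weight"]
renamed = {}
for old_key in state_dict_keys:
    new_key, changed = fix_gamma_beta_key(old_key)
    if changed:
        renamed[old_key] = new_key  # overwriting an existing entry is fine and O(1)
print(renamed)  # {'bert.embeddings.LayerNorm.gamma': 'bert.embeddings.LayerNorm.weight'}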

The first few key names of the Bert weights that rely on the rename are as follows; this change assumes any other impacted models follow a similar pattern:

bert.embeddings		
bert.embeddings.position_embeddings.weight	[512, 768]	
bert.embeddings.token_type_embeddings.weight	[2, 768]	
bert.embeddings.word_embeddings.weight	[30522, 768]
bert.embeddings.LayerNorm.beta	[768]	
bert.embeddings.LayerNorm.gamma	[768]	
bert.encoder		
bert.encoder.layer.0.attention.self.key.bias	[768]	
bert.encoder.layer.0.attention.self.key.weight	[768, 768]	
bert.encoder.layer.0.attention.self.query.bias	[768]	
bert.encoder.layer.0.attention.self.query.weight	[768, 768]	
bert.encoder.layer.0.attention.self.value.bias	[768]	
bert.encoder.layer.0.attention.self.value.weight	[768, 768]	
bert.encoder.layer.0.attention.output.dense.bias	[768]	
bert.encoder.layer.0.attention.output.dense.weight	[768, 768]	
bert.encoder.layer.0.attention.output.LayerNorm.beta	[768]	
bert.encoder.layer.0.attention.output.LayerNorm.gamma	[768]	
bert.encoder.layer.0.intermediate.dense.bias	[3072]
bert.encoder.layer.0.intermediate.dense.weight	[3072, 768]
bert.encoder.layer.0.output.dense.bias	[768]
bert.encoder.layer.0.output.dense.weight	[768, 3072]
bert.encoder.layer.0.output.LayerNorm.beta
...

@rwightman (Contributor, Author):

Also, something I can add to this PR: the parametrization replacement should probably use .endswith as well?

@rwightman (Contributor, Author) commented Jan 10, 2025

Oh yeah, and the tests do need updating. One thing about the tests, though: they exercise a use of the rename that the old code would have changed but the new code will not, and I believe it does not reflect any valid real-world case.

Namely, test_modeling_utils.py defines this module:

import torch
from torch import nn
from transformers import PreTrainedModel

# as in test_modeling_utils.py: the only notable thing is the 'gamma' in the
# attribute name, which the legacy substring rename keyed on
class TestModelGamma(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.gamma_param = nn.Parameter(torch.ones(10))
        self.post_init()

    def forward(self):
        return self.gamma_param.sum()

Now, gamma_param would be replaced with weight_param by the old code; that was itself problematic, as the original rename was stepping well outside its intended scope. I don't believe there was any valid case where 'gamma' in the middle of a key should have been replaced, but I need some insight from others here...
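To make the difference concrete, here is a small illustration (not the actual transformers code) of why the old substring check rewrote this test parameter while the narrowed .endswith check leaves it alone:

key = "gamma_param"

# old behaviour: a bare substring check matches anything containing "gamma",
# so this parameter would be renamed to "weight_param"
old_match = "gamma" in key                   # True

# new behaviour: only keys ending in "LayerNorm.gamma" are renamed,
# so gamma_param is left untouched
new_match = key.endswith("LayerNorm.gamma")  # False

print(old_match, new_match)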

@rwightman rwightman removed the request for review from amyeroberts January 10, 2025 18:08
@qubvel (Member) left a review comment:

Agree! Thanks for the fix. This approach is safer than removing it entirely, so let's move forward and keep an eye on it in case anything breaks.

We just need to adjust the test case and add 🚨🚨🚨 to the PR name.

One more comment re the issue with renaming gamma/beta related to timm.


@rwightman (Contributor, Author):

@qubvel do the sirens go at the beginning or end of the PR name? :)

So, re the user's comment about timm convnext and the warning: doesn't the current warning fire on load? It was reduced in priority, so it won't appear unless you explicitly set the info logging level; maybe that's why?

There is still risk here, so we should make an effort to ensure this is unlikely to break anything. I'm not sure if there's a (somewhat) efficient way to check Hub weights, e.g. rip through safetensors metadata for key names? Or are there any extensive CI tests that try to load popular model weights? timm CI, when run locally (not in a GitHub runner), will try to load weights into every model looking for errors.

@qubvel (Member) commented Jan 10, 2025

do the sirens go at the beginning or end of the PR name? :)

It would be better to include it at the beginning of the PR title. We add it when a PR introduces a breaking change, so we can communicate the change upon release.

Or if there are any extensive CI tests that try to load popular model weights?

Not sure if there are any. cc @ydshieh re tests

One way could be to use huggingface_hub to fetch the most popular transformers checkpoints for each model type, but I don't know if we can fetch only the safetensors file metadata. And there could be old *.pth files as well. (A hedged sketch of this idea is below.)
cc @hanouticelina
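For reference, a rough sketch of what that Hub scan could look like with huggingface_hub; this is only an assumption about how one might do it (it is not part of this PR), and it only covers repos that actually ship safetensors files:

from huggingface_hub import HfApi

api = HfApi()
# take the most-downloaded transformers checkpoints and inspect their
# safetensors metadata (tensor names only, no weight download)
for model in api.list_models(library="transformers", sort="downloads", limit=20):
    try:
        meta = api.get_safetensors_metadata(model.id)
    except Exception:
        # repo may only have old *.pth / *.bin weights, or no weights at all
        continue
    for file_meta in meta.files_metadata.values():
        for tensor_name in file_meta.tensors:
            if "gamma" in tensor_name or "beta" in tensor_name:
                print(model.id, tensor_name)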

Another way to fix this issue might be to add an "excluding" pattern specifically for timm models, like if "gamma" in layer_name and not "<layer>.gamma" in layer_name

@rwightman (Contributor, Author) commented Jan 10, 2025

@qubvel while an excluding pattern could be used, I really think this is baggage that should be taken out of transformers, or at least significantly reduced in scope as done here. Over the years it has forced numerous renames when models are brought into transformers, causing naming to differ from the norms and paper equations (you can see traces of this in conversion code, past issue searches, etc.). Also, it's just a big code 'WTF', and I feel strongly that removing and improving such things really is worthwhile for a project over the long haul.

On the server (HF Hub backend) side of things, I thought there might be a way to rip through safetensors metadata at scale and do some sort of query...

@rwightman rwightman changed the title An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. 🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. Jan 10, 2025
@ydshieh (Collaborator) commented Jan 10, 2025

Or if there are any extensive CI tests that try to load popular model weights?

We have integration tests when a model is added to transformers. Those tests use one (or a few) Hub checkpoints that the contributors or team members decide to pick, usually the most popular/used/well-known weights for that model.

However, for a single model, we don't try to load weights from many Hub repositories to check that all of them work.

Also, those integration tests run only on a daily basis (or for some specific events) on GitHub Actions.

@rwightman (Contributor, Author) commented Jan 10, 2025

Or if there are any extensive CI tests that try to load popular model weights?

We have integration tests when a model is added to transformers. Those tests use one (or a few) Hub checkpoints that the contributors or team members decide to pick, usually the most popular/used/well-known weights for that model.

However, for a single model, we don't try to load weights from many Hub repositories to check that all of them work.

Also, those integration tests run only on a daily basis (or for some specific events) on GitHub Actions.

So in this case, if there are tests that load one or a few checkpoints for the most common models, I think that'd be a good signal. This rename should really only impact the set of models whose original weights were ported while LayerNorm.gamma/beta was in use but then needed to be updated to .weight/.bias (perhaps because nn.LayerNorm was formally introduced as a layer in torch but was a custom module before that?).

Any fine-tune of the impacted models (like Bert, as we know) done after all this was settled (roughly 4-5 years ago, I believe) would have been saved with the renamed keys. The old names should only appear in original weights ported from TF before a specific change in transformers' use of LayerNorm.

@ArthurZucker (Collaborator) left a review comment:

Makes sense, thankful for the detailed PR description!
The models that have self.LayerNorm:

  • align
  • albert
  • altclip
  • bert
  • bigbird
  • blip
  • blip2
  • clap
  • ...

A lot of them have this because Bert had it, or because at some point we wanted to be able to load TF models. TBH this does not matter anymore.
Super in favor of reducing the scope + improving perf!

@rwightman (Contributor, Author) commented Jan 13, 2025

@ArthurZucker any existing model that has the .LayerNorm.* pattern would be safe: it would be converted by the updated code if needed, and it would have been renamed by the old code as well. I did check some of those models and didn't find any cases where a .gamma/.beta appeared in Hub weight files, though.

My only concern here is that there is some model we're not aware of that was relying on the rename for 'gamma' or 'beta' without the preceding LayerNorm. But I doubt that's the case.

FWIW, most propagation of the unfortunate .LayerNorm name is not because of any original conversion, but because the original Bert module code was copy-pasted.

@rwightman (Contributor, Author):

While improving the gamma/beta rename, a similar optimization should be done for the parametrization renames. All of these string searches would be appropriate to change to .endswith(), correct? That is, keys always end with weight_g, parametrizations.weight.original0, etc. when those are present?

# to avoid logging parametrized weight norm renaming
if hasattr(nn.utils.parametrizations, "weight_norm"):
    if "weight_g" in key:
        return key.replace("weight_g", "parametrizations.weight.original0"), True
    if "weight_v" in key:
        return key.replace("weight_v", "parametrizations.weight.original1"), True
else:
    if "parametrizations.weight.original0" in key:
        return key.replace("parametrizations.weight.original0", "weight_g"), True
    if "parametrizations.weight.original1" in key:
        return key.replace("parametrizations.weight.original1", "weight_v"), True

@rwightman (Contributor, Author):

Okay, I found some models (wav2vec2, hubert, etc.) that rely on the weight norm parametrization rename. I verified that .endswith is appropriate, tested it, and added comments so people will have more context when they hit a WTF moment on these renames in the future...

@rwightman rwightman merged commit 8c1b5d3 into huggingface:main Jan 17, 2025
25 checks passed
@qubvel (Member) commented Jan 17, 2025

Thanks for the fix!

bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Jan 31, 2025
…a/beta rename scope, optimize string search. (huggingface#35615)

* An attempt to fix huggingface#29554. Include 'LayerNorm.' in gamma/beta rename scope, reduce number of characters searched on every load considerably.

* Fix fix on load issue

* Fix gamma/beta warning test

* A style complaint

* Improve efficiency of weight norm key rename. Add better comments about weight norm and layer norm renaming.

* Habitual elif redunant with the return
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Feb 21, 2025
SUMMARY:
- Requires the next transformers release (i.e. when the following commit is released): huggingface/transformers#35615
- Adds group act order case which previously could not be loaded through
the AutoModel pathway due to incorrect substring replacement of the
`weight_g_idx` parameter

Testing:
- Test passes locally with transformers 4.49