
🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. #35615

Merged: 6 commits into huggingface:main on Jan 17, 2025

Conversation

@rwightman (Contributor) commented Jan 10, 2025

What does this PR do?

As per #29554, this string replacement has been in transformers for a long time, dating back to the original Bert TensorFlow ports. Over the years, searching issues and reading code shows it has caused confusion and forced non-ideal workarounds (renaming weights just to avoid the substitution) when bringing models into transformers. I believe this should be fixed. It is impacting timm integrations with transformers: @qubvel worked around it for TimmWrapperModel, but it is not very practical to work around it for TimmBackbone.

This is a proposed fix. It keeps the rename, but reduces its scope considerably by prepending 'LayerNorm.' to the search string. That should limit the scope enough to be unlikely to impact other models being brought in, or timm integrations: 'LayerNorm' as an attribute name is not PEP 8 compliant or expected in most languages, so new code is unlikely to use it. It would be helpful to have someone with 5+ years of memory confirm that Bert was the only model likely impacted. The Bert weights on the Hub definitely need this rename.

I also made a few optimizations:

  • When searching strings, it is good practice to use .startswith or .endswith where appropriate, since it limits the search space considerably. Here we only care about the end of each key, so with .endswith most comparisons exit after comparing one or two characters, whereas the current 'in' check scans most of every key.

  • The fix_ functions that were added for timm compatibility were redoing a key != new_key comparison, which again compares many characters before exiting because the replaced substring is at the end. The fix function already knows whether a replacement happened, so returning a Tuple[str, bool] is, I feel, an appropriate tradeoff.

  • For logging, it is not necessary to search every key again; we only need to record a key if it actually changed, and we don't need to check whether it already exists in the dict, since `key in dict` and `dict[key] = value` are both O(1) and assignment simply overwrites any existing entry. (A rough sketch of the narrowed, .endswith-based rename follows after this list.)
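For illustration, here is a minimal sketch of the kind of rename helper described above; the function name fix_gamma_beta_key and the small demo loop are hypothetical, not the actual transformers implementation:

from typing import Tuple

# Sketch only: keys are renamed only when they end in 'LayerNorm.gamma' /
# 'LayerNorm.beta', and the caller is told whether anything changed so it
# does not have to re-compare the strings.
def fix_gamma_beta_key(key: str) -> Tuple[str, bool]:
    if key.endswith("LayerNorm.gamma"):
        return key[: -len("gamma")] + "weight", True
    if key.endswith("LayerNorm.beta"):
        return key[: -len("beta")] + "bias", True
    return key, False

# tiny demo of the "only log what actually changed" point
state_dict_keys = ["bert.embeddings.LayerNorm.gamma", "bert.encoder.layer.0.output.dense.weight"]
renamed = {}
for old_key in state_dict_keys:
    new_key, changed = fix_gamma_beta_key(old_key)
    if changed:
        renamed[old_key] = new_key  # overwriting an existing entry is fine and O(1)
print(renamed)  # {'bert.embeddings.LayerNorm.gamma': 'bert.embeddings.LayerNorm.weight'}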

The first few key names of the Bert weights that rely on the rename are as follows; this change assumes any other impacted models follow a similar pattern:

bert.embeddings		
bert.embeddings.position_embeddings.weight	[512, 768]	
bert.embeddings.token_type_embeddings.weight	[2, 768]	
bert.embeddings.word_embeddings.weight	[30522, 768]
bert.embeddings.LayerNorm.beta	[768]	
bert.embeddings.LayerNorm.gamma	[768]	
bert.encoder		
bert.encoder.layer.0.attention.self.key.bias	[768]	
bert.encoder.layer.0.attention.self.key.weight	[768, 768]	
bert.encoder.layer.0.attention.self.query.bias	[768]	
bert.encoder.layer.0.attention.self.query.weight	[768, 768]	
bert.encoder.layer.0.attention.self.value.bias	[768]	
bert.encoder.layer.0.attention.self.value.weight	[768, 768]	
bert.encoder.layer.0.attention.output.dense.bias	[768]	
bert.encoder.layer.0.attention.output.dense.weight	[768, 768]	
bert.encoder.layer.0.attention.output.LayerNorm.beta	[768]	
bert.encoder.layer.0.attention.output.LayerNorm.gamma	[768]	
bert.encoder.layer.0.intermediate.dense.bias	[3072]
bert.encoder.layer.0.intermediate.dense.weight	[3072, 768]
bert.encoder.layer.0.output.dense.bias	[768]
bert.encoder.layer.0.output.dense.weight	[768, 3072]
bert.encoder.layer.0.output.LayerNorm.beta
...

@rwightman (Contributor, Author):

Also, something I can add to this PR: the parametrization replacement should probably use .endswith as well?

@rwightman (Contributor, Author) commented Jan 10, 2025

Oh yeah, and the tests do need updating. One thing about the tests, though: they exercise a use of the rename that the old code would have changed but the new code will not, and I believe it does not reflect any valid real-world case.

Namely, test_modeling_utils.py defines this module:

import torch
from torch import nn
from transformers import PreTrainedModel

# as in test_modeling_utils.py: the only notable thing is the 'gamma' in the
# attribute name, which the legacy substring rename keyed on
class TestModelGamma(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.gamma_param = nn.Parameter(torch.ones(10))
        self.post_init()

    def forward(self):
        return self.gamma_param.sum()

Now, gamma_param would be replaced with weight_param by the old code; that was itself problematic, as the original rename was stepping well outside its intended scope. I don't believe there was any valid case where 'gamma' in the middle of a key should have been replaced, but I need some insight from others here...
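To make the difference concrete, here is a small illustration (not the actual transformers code) of why the old substring check rewrote this test parameter while the narrowed .endswith check leaves it alone:

key = "gamma_param"

# old behaviour: a bare substring check matches anything containing "gamma",
# so this parameter would be renamed to "weight_param"
old_match = "gamma" in key                   # True

# new behaviour: only keys ending in "LayerNorm.gamma" are renamed,
# so gamma_param is left untouched
new_match = key.endswith("LayerNorm.gamma")  # False

print(old_match, new_match)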

@rwightman rwightman removed the request for review from amyeroberts January 10, 2025 18:08
@qubvel (Member) left a review comment:

Agree! Thanks for the fix. This approach is safer than removing it entirely, so let's move forward and keep an eye on it in case anything breaks.

We just need to adjust the test case and add 🚨🚨🚨 to the PR name.

One more comment re the issue with renaming gamma/beta related to timm.


@rwightman (Contributor, Author):

@qubvel do the sirens go at the beginning or end of the PR name? :)

So, re the user's comment about timm convnext and the warning: doesn't the current warning fire on load? It was reduced in priority, so it won't appear unless you explicitly set the info logging level; maybe that's why?

There is still risk here, so we should make an effort to ensure this is unlikely to break anything. I'm not sure if there's a (somewhat) efficient way to check Hub weights, e.g. rip through safetensors metadata for key names? Or are there any extensive CI tests that try to load popular model weights? timm CI, when run locally (not in a GitHub runner), will try to load weights into every model looking for errors.

@qubvel (Member) commented Jan 10, 2025

do the sirens go at the beginning or end of the PR name? :)

It would be better to include it at the beginning of the PR title. We add it when a PR introduces a breaking change, so we can communicate the change upon release.

Or if there are any extensive CI tests that try to load popular model weights?

Not sure if there are any. cc @ydshieh re tests

One way could be to use huggingface_hub to fetch the most popular transformers checkpoints for each model type, but I don't know if we can fetch only the safetensors file metadata. And there could be old *.pth files as well. (A hedged sketch of this idea is below.)
cc @hanouticelina
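For reference, a rough sketch of what that Hub scan could look like with huggingface_hub; this is only an assumption about how one might do it (it is not part of this PR), and it only covers repos that actually ship safetensors files:

from huggingface_hub import HfApi

api = HfApi()
# take the most-downloaded transformers checkpoints and inspect their
# safetensors metadata (tensor names only, no weight download)
for model in api.list_models(library="transformers", sort="downloads", limit=20):
    try:
        meta = api.get_safetensors_metadata(model.id)
    except Exception:
        # repo may only have old *.pth / *.bin weights, or no weights at all
        continue
    for file_meta in meta.files_metadata.values():
        for tensor_name in file_meta.tensors:
            if "gamma" in tensor_name or "beta" in tensor_name:
                print(model.id, tensor_name)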

Another way to fix this issue might be to add an "excluding" pattern specifically for timm models, like if "gamma" in layer_name and not "<layer>.gamma" in layer_name

@rwightman (Contributor, Author) commented Jan 10, 2025

@qubvel while an excluding pattern could be used, I really think this is baggage that should be taken out of transformers, or at least significantly reduced in scope as done here. Over the years it has forced numerous renames when models are brought into transformers, causing naming to differ from the norms and paper equations (you can see traces of this in conversion code, past issue searches, etc.). Also, it's just a big code 'WTF', and I feel strongly that removing and improving such things really is worthwhile for a project over the long haul.

On the server (HF Hub backend) side of things, I thought there might be a way to rip through safetensors metadata at scale and do some sort of query...

@rwightman rwightman changed the title An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. 🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. Jan 10, 2025
@ydshieh (Collaborator) commented Jan 10, 2025

Or if there are any extensive CI tests that try to load popular model weights?

We have integration tests when a model is added to transformers. Those tests use one (or a few) Hub checkpoints that the contributors or team members decide to pick, usually the most popular/used/well-known weights for that model.

However, for a single model, we don't try to load weights from many Hub repositories to check that all of them work.

Also, those integration tests run only on a daily basis (or for some specific events) on GitHub Actions.

@rwightman (Contributor, Author) commented Jan 10, 2025

Or if there are any extensive CI tests that try to load popular model weights?

We have integration tests when a model is added to transformers. Those tests use one (or a few) Hub checkpoints that the contributors or team members decide to pick, usually the most popular/used/well-known weights for that model.

However, for a single model, we don't try to load weights from many Hub repositories to check that all of them work.

Also, those integration tests run only on a daily basis (or for some specific events) on GitHub Actions.

So in this case, if there are tests that load one or a few checkpoints for the most common models, I think that'd be a good signal. This rename should really only impact the set of models whose original weights were ported while LayerNorm.gamma/beta was in use but then needed to be updated to .weight/.bias (perhaps because nn.LayerNorm was formally introduced as a layer in torch but was a custom module before that?).

Any fine-tune of the impacted models (like Bert, as we know) done after all this was settled (roughly 4-5 years ago, I believe) would have been saved with the renamed keys. The old names should only appear in original weights ported from TF before a specific change in transformers' use of LayerNorm.

@ArthurZucker (Collaborator) left a review comment:

Makes sense, thankful for the detailed PR description!
The models that have self.LayerNorm:

  • align
  • albert
  • altclip
  • bert
  • bigbird
  • blip
  • blip2
  • clap
  • ...

A lot of them have this because Bert had it, or because at some point we wanted to be able to load TF models. TBH this does not matter anymore.
Super in favor of reducing the scope + improving perf!

@rwightman (Contributor, Author) commented Jan 13, 2025

@ArthurZucker any existing model that has the .LayerNorm.* pattern would be safe: it would be converted by the updated code if needed, and it would have been renamed by the old code as well. I did check some of those models and didn't find any cases where a .gamma/.beta appeared in Hub weight files, though.

My only concern here is that there is some model we're not aware of that was relying on the rename for 'gamma' or 'beta' without the preceding LayerNorm. But I doubt that's the case.

FWIW, most propagation of the unfortunate .LayerNorm name is not because of any original conversion, but because the original Bert module code was copy-pasted.

@rwightman (Contributor, Author):

While improving the gamma/beta rename, a similar optimization should be done for the parametrization renames. All of these string searches would be appropriate to change to .endswith(), correct? That is, keys always end with weight_g, parametrizations.weight.original0, etc. when those are present?

# to avoid logging parametrized weight norm renaming
if hasattr(nn.utils.parametrizations, "weight_norm"):
    if "weight_g" in key:
        return key.replace("weight_g", "parametrizations.weight.original0"), True
    if "weight_v" in key:
        return key.replace("weight_v", "parametrizations.weight.original1"), True
else:
    if "parametrizations.weight.original0" in key:
        return key.replace("parametrizations.weight.original0", "weight_g"), True
    if "parametrizations.weight.original1" in key:
        return key.replace("parametrizations.weight.original1", "weight_v"), True

@rwightman (Contributor, Author):

Okay, I found some models (wav2vec2, hubert, etc.) that rely on the weight norm parametrization rename. I verified that .endswith is appropriate, tested it, and added comments so people will have more context when they hit a WTF moment on these renames in the future...

@rwightman rwightman merged commit 8c1b5d3 into huggingface:main Jan 17, 2025
25 checks passed
@qubvel (Member) commented Jan 17, 2025

Thanks for the fix!

bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Jan 31, 2025
…a/beta rename scope, optimize string search. (huggingface#35615)

* An attempt to fix huggingface#29554. Include 'LayerNorm.' in gamma/beta rename scope, reduce number of characters searched on every load considerably.

* Fix fix on load issue

* Fix gamma/beta warning test

* A style complaint

* Improve efficiency of weight norm key rename. Add better comments about weight norm and layer norm renaming.

* Habitual elif redunant with the return
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Feb 21, 2025
SUMMARY:
- Requires the next transformers release (i.e. when the following commit is released): huggingface/transformers#35615
- Adds group act order case which previously could not be loaded through
the AutoModel pathway due to incorrect substring replacement of the
`weight_g_idx` parameter

Testing:
- Test passes locally with transformers 4.49