🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. #35615
Conversation
…ta rename scope, reduce number of characters searched on every load considerably.
Also, one thing that can be added to this PR: the parameterization replacement should probably use .endswith as well?
Oh yeah, and the tests do need updating. One thing about the tests, though: they test a use of the rename which would have been changed by the old code but not the new code, and I believe it is not reflective of any valid case.
Agree! Thanks for the fix. This approach is safer than removing it entirely, so let's move forward and keep an eye on it in case anything breaks.
Just need to adjust the test case and add 🚨🚨🚨 to the PR name.
One more comment re: the issue with renaming gamma/beta related to timm.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@qubvel do the sirens go at the beginning or end of the PR name? :) So, re the user's comment about timm convnext and the warning, doesn't the current warning fire on load? I mean, it got reduced in priority so it won't appear unless you explicitly set info level, so maybe that's why? There is still risk here, so we should make an effort to ensure this is unlikely to break anything. I'm not sure if there's a (somewhat) efficient way to check hub weights, rip through safetensors metadata for key names? Or if there are any extensive CI tests that try to load popular model weights? timm CI, when run locally (not in a GitHub runner), will try to load weights into every model looking for errors.
It would be better to include it at the beginning of the PR name. This is added in case we introduce a breaking change, so we can communicate it upon release.
Not sure if any. cc @ydshieh re tests.
One way could be to use
Another way to fix this issue might be to add an "excluding" pattern specifically for timm models, like
@qubvel while an excluding pattern could be used, I really think this is some baggage that should be taken out of transformers, or at least significantly reduced in scope as done here. This has made numerous renames necessary when models are brought into transformers over the years, causing naming to differ from norms and paper eqns (you can see traces of this in convert code, past issue searches, etc.). Also, it's just a big code 'WTF', and I feel strongly that removing and improving those really is worthwhile for a project over the long haul. On the server (HF Hub backend) side of things, I thought there might be a way to rip through safetensors metadata at scale and do some sort of query...
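Something like this rough client-side sketch is what I have in mind, assuming huggingface_hub's get_safetensors_metadata (available in recent huggingface_hub versions) and a placeholder repo list rather than any curated set:

```python
# Sketch: scan safetensors metadata on the Hub for legacy gamma/beta keys,
# without downloading the weights themselves. The repo list is illustrative.
from huggingface_hub import get_safetensors_metadata

repos_to_check = ["google-bert/bert-base-uncased", "facebook/wav2vec2-base"]

for repo_id in repos_to_check:
    try:
        meta = get_safetensors_metadata(repo_id)
    except Exception:
        continue  # repo may have no safetensors weights at all
    for filename, file_meta in meta.files_metadata.items():
        # Tensor names live in the per-file metadata; only their suffix matters.
        legacy = [k for k in file_meta.tensors if k.endswith((".gamma", ".beta"))]
        if legacy:
            print(repo_id, filename, legacy[:5])
```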
We have integration tests when a model is added to transformers. However, for a single model, we don't try to load weights from many hub repositories to check that all of them are working. Also, those integration tests are running only on a daily basis (or on some specific events) on GitHub Actions.
So in this case, if there are tests that load one/few weights for the most common models, I think that'd be a good signal. This rename should really just be impacting a set of models where the original weights were ported when LayerNorm.gamma/beta was used, but needed to be updated to .weight/.bias (perhaps because nn.LayerNorm got introduced formally as a layer in torch but was a custom module prior?). Any fine-tune of the impacted models (like Bert, as we know) that happened after all this was said and done (and this was like 4-5 years ago, I believe) would have been saved with the renamed name. It should only have been in some original weights ported from TF before a specific change in Transformers' use of LayerNorm.
Makes sense, thanks for the detailed PR description!
The models that have self.LayerNorm:
- align
- albert
- altclip
- bert
- bigbird
- blip
- blip2
- clap
- ...
A lot of them have this because bert had it, or because at some point we wanted to be able to load TF models. TBH this does not matter anymore.
In super favor of reducing the scope + improving perf!
@ArthurZucker any existing model that has the
My only concern here is that there's some model that we're not aware of that was relying on the rename for 'gamma' or 'beta' w/o the preceding LayerNorm. But I doubt that's the case. FWIW, most propagation of the unfortunate
While improving the gamma/beta rename, a similar optimization should be done for the parameterization renames; all of these string searches would be appropriate to change to .endswith.
…ut weight norm and layer norm renaming.
Okay, I found some models (wav2vec2, hubert, etc.) that rely on the weight norm parametrization rename. I verified that .endswith is appropriate. Tested, and added comments so people will have more context when they have a WTF moment on these renames in the future...
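For the record, the weight-norm half now follows the same endswith pattern; a simplified sketch of the load-direction rename (not the exact merged code):

```python
from typing import Tuple

import torch.nn as nn

# Simplified sketch (not the exact merged code) of the weight-norm key rename.
# Old torch weight_norm stored weight_g/weight_v; the newer parametrization API
# stores parametrizations.weight.original0/original1 instead.
def fix_weight_norm_key_on_load(key: str) -> Tuple[str, bool]:
    if hasattr(nn.utils.parametrizations, "weight_norm"):
        # New torch: map legacy checkpoint names to parametrized names.
        if key.endswith("weight_g"):
            return key.replace("weight_g", "parametrizations.weight.original0"), True
        if key.endswith("weight_v"):
            return key.replace("weight_v", "parametrizations.weight.original1"), True
    return key, False
```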
Thanks for the fix!
…a/beta rename scope, optimize string search. (huggingface#35615)
* An attempt to fix huggingface#29554. Include 'LayerNorm.' in gamma/beta rename scope, reduce number of characters searched on every load considerably.
* Fix fix on load issue
* Fix gamma/beta warning test
* A style complaint
* Improve efficiency of weight norm key rename. Add better comments about weight norm and layer norm renaming.
* Habitual elif redundant with the return
SUMMARY:
- Requires next transformers release / when the following commit is released: huggingface/transformers#35615
- Adds group act order case which previously could not be loaded through the AutoModel pathway due to incorrect substring replacement of the weight_g_idx parameter

Testing:
- Test passes locally with transformers 4.49
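That weight_g_idx case is a tidy illustration of why the switch to .endswith matters: a bare substring check matches keys it was never meant to touch.

```python
# Illustrative key name (the exact module path is hypothetical).
key = "model.layers.0.weight_g_idx"

print("weight_g" in key)         # True  -> old substring check would wrongly rename this
print(key.endswith("weight_g"))  # False -> the endswith check leaves it untouched
```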
What does this PR do?
As per #29554, this string replacement has been in transformers for a long time, since the original Bert TensorFlow ports. Over the years, searching issues and looking at code, there has been confusion and non-ideal workarounds to rename weights when bringing models into transformers to avoid this. I believe this should be fixed. It's impacting timm integrations with transformers; @qubvel worked around it for the TimmWrapperModel, but it's not very practical to work around it for the TimmBackbone.
This is a proposed fix: it still has a rename, but the scope of the rename is reduced considerably by prepending 'LayerNorm.' to the search string. This should limit the scope enough to be unlikely to impact other models being brought in, or timm integrations; 'LayerNorm' as an attribute name is not PEP 8 compliant or expected in most languages, so new code is unlikely to have it. It would be helpful to have someone with 5+ years of memory confirm that Bert was the only likely model impacted. The weights on the Hub for Bert definitely need this.
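In rough terms, the scoping change looks like this (a simplified sketch, not the literal diff):

```python
# Simplified sketch of the scoping change (not the literal diff).

def rename_old(key: str) -> str:
    # Before: any key containing 'gamma'/'beta' anywhere was rewritten.
    if "gamma" in key:
        key = key.replace("gamma", "weight")
    if "beta" in key:
        key = key.replace("beta", "bias")
    return key

def rename_new(key: str) -> str:
    # After: only keys ending in 'LayerNorm.gamma'/'LayerNorm.beta' are rewritten.
    if key.endswith("LayerNorm.gamma"):
        return key.replace("LayerNorm.gamma", "LayerNorm.weight")
    if key.endswith("LayerNorm.beta"):
        return key.replace("LayerNorm.beta", "LayerNorm.bias")
    return key
```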
I also made a few optimizations:
- When searching strings, it is good practice to use .startswith, or .endswith where appropriate, to limit the search space considerably. Here we are only looking at the endings of the strings, so most comparisons will exit after comparing one or two characters, vs right now where 'in' searches over most of every key.
- The fix_ fns that were added to address the timm compat were redoing a key != new_key check, which again compares a lot of characters before exiting since the replaced string is at the end. We already have the replacement info in the fix fn, so passing out a Tuple[str, bool] is, I feel, an appropriate tradeoff.
- For logging, it is not necessary to search every key again, only the ones that changed; and there's no need to check whether a key already exists in the dict, since the O() of key in dict is the same as dict[key] = value, which will just overwrite any existing entry (see the sketch below).

The key names (first few) of the Bert weights that rely on the rename are as follows; this change assumes any other impacted models would be dealing with a similar pattern...
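To make the second and third bullets concrete, a minimal sketch of the shape described (hypothetical helper names, not the merged code):

```python
from typing import Tuple

def fix_key(key: str) -> Tuple[str, bool]:
    # Return the possibly-renamed key plus a changed flag, so callers don't
    # need to re-compare key != new_key character by character.
    if key.endswith("LayerNorm.gamma"):
        return key.replace("LayerNorm.gamma", "LayerNorm.weight"), True
    if key.endswith("LayerNorm.beta"):
        return key.replace("LayerNorm.beta", "LayerNorm.bias"), True
    return key, False

state_dict = {"bert.embeddings.LayerNorm.gamma": 0}  # illustrative stand-in
renamed, new_state_dict = {}, {}
for key, value in state_dict.items():
    new_key, changed = fix_key(key)
    if changed:
        renamed[key] = new_key  # log only actual renames
    # Plain assignment overwrites any existing entry; no 'in dict' pre-check
    # needed since both operations are O(1) anyway.
    new_state_dict[new_key] = value
```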