Implement Maximal Update Parametrization (muP) #16157
Comments
Pinging maintainers for knowledge: @patrickvonplaten @sgugger @patil-suraj
This is only really relevant for pretraining, I assume, no? I wonder whether it might make more sense to add this directly to...
Hi Patrick, I'm another maintainer of the `mup` package. It's true that the biggest payoff will probably come from applying our technique to large-scale pretraining. We are more than happy to look into integration with other tools as well.
As for integrating into Transformers, I think everyone would be delighted to see it as easily accessible as possible. There is just the (big) catch of modifying every modeling file for this, which is not really an option for two reasons.
As such, it would be way more powerful if we could design a function that automatically converts a model to be used with muP. The first two points you mention are easy to do on an existing model (we can change the Linear layers on the fly and re-init the weights); the last one is a tiny bit more complex. I don't know if you have any ideas on this. If we don't manage to have such a function, we also have a feature where you can host any modeling code on the Hub and have it run with Transformers. Let me know your thoughts!
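For reference, the custom-code-on-the-Hub path is typically exercised through `trust_remote_code`; a minimal sketch, with a hypothetical repository name:

```python
from transformers import AutoModel

# trust_remote_code=True lets Transformers execute the modeling code shipped in the
# Hub repository itself rather than a built-in modeling file.
# "some-org/mup-bert" is a hypothetical repository used only for illustration.
model = AutoModel.from_pretrained("some-org/mup-bert", trust_remote_code=True)
```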
Hi @sgugger, is there any particular reason you say that muP is specific to Transformers?
You're right, I should have said that the adaptations you mention seem very targeted toward Transformers (in particular point 3 above).
Hi @sgugger, like Greg said, only the third item is Transformer-specific (we should have noted that clearly). I like the idea of having a converter function so we keep the model files as clean as possible. I'd also like to point out that muAdam is simply a wrapper on top of torch Adam which manipulates the parameter group dictionary to explicitly adjust learning rates according to muP. Perhaps this explicit conversion can be part of the converter function instead, to remove the dependency on the `mup` package.
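To illustrate that last suggestion, here is a rough sketch, not code from the `mup` package, of how the learning-rate adjustment could be baked directly into torch AdamW parameter groups; the 1/width scaling rule below is a simplification of what `mup.MuAdam` actually computes per parameter group.

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, lr: float, width_mult: float):
    """Build AdamW parameter groups with muP-style learning rates (simplified)."""
    matrix_like, vector_like = [], []
    for p in model.parameters():
        # Hidden weight matrices get lr / width_mult under muP (simplified rule);
        # vector-like parameters (biases, LayerNorm weights) keep the base lr.
        (matrix_like if p.ndim >= 2 else vector_like).append(p)
    return [
        {"params": matrix_like, "lr": lr / width_mult},
        {"params": vector_like, "lr": lr},
    ]

# Toy model standing in for a converted transformer; the width multiplier is hypothetical.
model = nn.Sequential(nn.Linear(256, 2048), nn.ReLU(), nn.Linear(2048, 10))
optimizer = torch.optim.AdamW(mup_param_groups(model, lr=1e-3, width_mult=8.0))
```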
After discussion with Edward, we think perhaps hosting custom model code on the Hub would be the best way to go. We have some questions about this:
You can create a randomly initialized model with:

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained(checkpoint_name)
model = AutoModel.from_config(config)
```

As for the second point, not really.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Sorry, still working on this!
Any update?
@sodabeta7 has been working on this. @sodabeta7 could you summarize your progress?
Hi, are there any updates after a year? Thanks!
Curious too, any news?
@thegregyang, I trained a model with muP. Just wondering how I could convert my muP model weights to SP so that I can load them with Huggingface?
🚀 Feature request
This request is to open up a discussion on 1) whether it makes sense to implement Maximal Update Parametrization (abbreviated muP) in Huggingface, and 2) if so, how to do it.
Motivation
Hi,
I'm a maintainer for the mup package (paper). This repo allows one to implement in their models a special parametrization, called maximal update parametrization or muP, which has the special property that narrow and wide networks share the same optimal hyperparameters (like learning rate, initialization, etc.). This is demonstrated below on a Transformer trained with Adam, where on the left we have the PyTorch default parametrization and on the right we have muP.
Most strikingly, this property can be used to tune hyperparameters for extremely large neural networks like GPT-3 that are too expensive to train more than once, by just tuning a tiny version of them. But even for "regular joe" users, muP can alleviate a lot of the pain of transitioning from exploration to scaling up, only to find that performance suffers for mysterious reasons. Transformers in particular are somewhat infamous for problems like training instability. So having muP integrated natively into Huggingface can benefit a lot of users at once.
muP can be implemented in a backward compatible way, as shown below, so users do not need to worry about it breaking existing codebases.
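For concreteness, here is a minimal sketch of the mup workflow on a toy model, adapted from the mup package's README; the model itself is made up for illustration, and exact function signatures may differ between package versions.

```python
import torch.nn as nn
import torch.nn.functional as F
from mup import MuReadout, MuAdam, set_base_shapes

class ToyMLP(nn.Module):
    def __init__(self, width=128, d_in=100, d_out=10):
        super().__init__()
        self.fc = nn.Linear(d_in, width)
        # The output ("readout") layer uses MuReadout instead of nn.Linear.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(F.relu(self.fc(x)))

# Base and delta models only record which dimensions count as "width";
# the target model is the one actually trained.
base_model = ToyMLP(width=64)
delta_model = ToyMLP(width=32)
model = ToyMLP(width=2048)
set_base_shapes(model, base_model, delta=delta_model)

# MuAdam wraps torch Adam and adjusts per-layer learning rates according to muP,
# so the same base learning rate stays near-optimal as width grows.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```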
See this twitter thread for more (but brief) information about how this works, and this blog post for a less brief overview.
Your contribution
Now let's return to the two questions at the beginning: 1) whether it makes sense to implement Maximal Update Parametrization (muP) in Huggingface, and 2) if so, how to do it.
For 1), the popularity (or not) of this issue should serve as an indicator of community interest, and the above makes the case for the utility of this integration.
For 2), we have examples of how to integrate muP with some common (PyTorch) Huggingface transformers in our mutransformers repo.
Current Example Implementation
In summary, to modify an existing Huggingface transformer to implement muP, one needs to:

1. swap the output (readout) `nn.Linear` layers to `mup.MuReadout`,
2. modify the `._init_weights` method to use `mup.init.*` methods instead of `nn.init.*` methods (or equivalent), and
3. use `mup.MuAdamW` instead of the pytorch or Huggingface version.

In addition, when using a `mutransformer`, one needs to provide a "base shape file" that lets the model know how to properly scale the learning rate and attention with width. This is designed so that if the model parameter shapes are the same as the "base shapes", then the model is in the original parametrization, i.e. backward compatible.
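To make these steps concrete, here is a sketch of converting and training a muP BERT with mutransformers, adapted from the mutransformers README; the sizes and arguments are illustrative and may differ from the current version of the repo.

```python
from mutransformers import BertConfig, BertForMaskedLM
from mup import make_base_shapes, set_base_shapes, MuAdamW

# Base and delta models differ only in the widths we intend to scale;
# together they define the "base shapes".
base_model = BertForMaskedLM(BertConfig(hidden_size=256, intermediate_size=1024, num_attention_heads=8))
delta_model = BertForMaskedLM(BertConfig(hidden_size=128, intermediate_size=512, num_attention_heads=4))
base_shapes = make_base_shapes(base_model, delta_model, savefile="bert_base_shapes.bsh")

# The target (wide) model we actually want to train.
target_model = BertForMaskedLM(BertConfig(hidden_size=1024, intermediate_size=4096, num_attention_heads=16))
set_base_shapes(target_model, base_shapes)       # tell the model how widths scale relative to base
target_model.apply(target_model._init_weights)   # re-initialize with the mup-aware _init_weights

# Use the muP optimizer wrapper instead of the pytorch/Huggingface AdamW.
optimizer = MuAdamW(target_model.parameters(), lr=1e-3)
```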
More Seamless Integration
Now, the mutransformers repo is primarily designed to serve as examples of how to implement muP into existing transformers. So all of the above can be streamlined if we really want seamless integration into Huggingface.
For example, the user interface for instantiating a model could just be the same as it is now, but we just have an additional flag `mup=True` in `BertConfig` that says to switch on `mup`. `BertConfig` itself may carry a default set of base shapes for use in this scenario, which the user can also modify if necessary.
In addition, `mup.MuAdamW` can be incorporated natively into Huggingface as well, so that there is no dependency on the `mup` package at all.
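A purely hypothetical sketch of what this interface could look like; none of the flags below exist in Transformers today.

```python
from transformers import BertConfig, BertForMaskedLM

# Hypothetical: a config flag that would switch on muP for this model.
# Today this flag is simply stored on the config and has no effect.
config = BertConfig(mup=True)
model = BertForMaskedLM(config)
# Under the proposal, the model would internally use MuReadout and mup-style init,
# and training would use a muP-aware optimizer (mup.MuAdamW or a native port).
```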
muP for All Transformers?
Since there is currently no automatic way of backfitting existing transformers, it could be quite a task to add muP to all of the transformers in Huggingface. So a good practical compromise is to implement muP just for the most commonly used models in Huggingface.
In the interim, research can be done on a method for such automatic backfitting. This could even involve a pull request into PyTorch core.
Conclusion
Again, this issue is intended to start the discussion of whether and how to make muP available to Huggingface users natively. It could be that the best course forward is to have users implement muP transformers themselves as in `mutransformers`, or even to build `mutransformers` into such a repo of muP transformers. And even if we do decide to integrate muP into Huggingface, there could be many ways to do it.
I hope discussion here could elucidate the right course of action.