more VRAM savings: no first moment and factored second moment #25
base: main
Conversation
Wow, yet again INCREDIBLY good work! I agree that these improvements are important and that the paper's original code can be moved to a
Thank you for the pull request! Konstantin and I are looking to review this and other pull requests we have received very soon, hopefully in the next week or two. I think that an Adafactor version is definitely very useful.
* Adapted from this pull request: konstmish/prodigy#25
Two issues have been reported with this PR that need further analysis:
Adafactor is used fairly often in practice; do the standard codebases for it give any indication of how to resolve the issues you're reporting? I will hold off merging until we have a better idea of what's going on. I'm a little suspicious of clipping as a solution, as it introduces additional parameters.
In my test, I trained Stable Cascade (using OneTrainer), both the UNet and the text encoder, in full bf16, and I managed to avoid NaNs when I increased eps2 from 1e-30 to 1e-25, or by turning off `factored`.
Did you run the same test on the original Adafactor? Did that work without changing eps2? If it works in the original Adafactor but not with this PR, I should reproduce and analyze it.
As a side note, because I was just thinking about what could be different between the original Adafactor and this PR: this PR updates exp_avg the way AdamW and Prodigy do. This is different from how the original Adafactor code updates exp_avg, but I think this is an oversight in the implementation of Adafactor that the authors of the paper did not intend. You can find more details here: huggingface/transformers#34506. I'm quite sure that is not the reason for the NaNs though, even though you did use beta1 > 0 in your tests, so this code path is used.
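Roughly, the difference looks like this (a simplified sketch, not the exact code of either implementation; names and signatures are illustrative):

```python
import torch

def first_moment_adamw_style(exp_avg, grad, denom, beta1, lr):
    # This PR / AdamW / Prodigy: the EMA is taken over the raw gradient,
    # and the result is then divided by the second-moment denominator.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    return lr * exp_avg / denom

def first_moment_adafactor_style(exp_avg, grad, denom, beta1, lr):
    # Original Adafactor code: the gradient is first divided by the
    # second-moment denominator, and the EMA is taken over that update.
    update = lr * grad / denom
    exp_avg.mul_(beta1).add_(update, alpha=1 - beta1)
    return exp_avg
```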
Sorry, I didn't provide screenshots of the OT config for the Cascade tests. I set Beta1 to None in both; only in the SD3.5L tenc1 test did I leave it at 0.9, because at that time I was testing the schedule-free edition.
I quickly looked at the code and I'm not sure what
The Adafactor authors first introduced it in their paper, in section 6: https://arxiv.org/pdf/1804.04235

Anecdotally, I could observe that training was worse when comparing the original Adafactor with this implementation before I implemented update clipping, even when I had set

This might also be why people have reported having issues with Prodigy and high values of beta2 here:

This is just a guess, but it would make sense: when Prodigy scales up
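For reference, update clipping as described in section 6 of the paper boils down to something like this (d is the clipping threshold, 1.0 by default in the paper; this is only a sketch, not the code in this PR):

```python
import torch

def rms(t: torch.Tensor) -> torch.Tensor:
    # root-mean-square of a tensor
    return t.square().mean().sqrt()

def clip_update(update: torch.Tensor, d: float = 1.0) -> torch.Tensor:
    # Scale the whole update down if its RMS exceeds the threshold d:
    # u_hat = u / max(1, RMS(u) / d). Small updates pass through unchanged.
    return update / torch.clamp(rms(update) / d, min=1.0)
```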
prodigyopt/prodigy.py
Outdated
exp_avg_sq_row = state["exp_avg_sq_row"]
exp_avg_sq_col = state["exp_avg_sq_col"]

grad_sq = grad.square() + eps2
I think it's not necessary to create this tensor as we're only interested in the sums of squares over rows and columns. Perhaps we could compute them directly using, for instance, torch.norm by specifying dim and squaring the result?
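For example, something like this (an untested sketch; the eps2 term from the line above turns into eps2 times the size of the reduced dimension):

```python
import torch

grad = torch.randn(1024, 768)  # example gradient for a 2D parameter
eps2 = 1e-30

# Current approach: materialize the full squared tensor first.
grad_sq = grad.square() + eps2
row_sums = grad_sq.sum(dim=-1)  # shape (1024,)
col_sums = grad_sq.sum(dim=-2)  # shape (768,)

# Without the intermediate tensor: the sum of squares along a dim is the
# squared 2-norm along that dim, and the eps2 term becomes eps2 * dim size.
row_sums_alt = torch.norm(grad, dim=-1).square() + eps2 * grad.size(-1)
col_sums_alt = torch.norm(grad, dim=-2).square() + eps2 * grad.size(-2)

assert torch.allclose(row_sums, row_sums_alt, rtol=1e-4)
assert torch.allclose(col_sums, col_sums_alt, rtol=1e-4)
```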
@konstmish I believe this is what the PyTorch implementation does: https://github.com/pytorch/pytorch/blob/ee3a4f068c7646ac2d645e7ad02e82c82c5238bf/torch/optim/_adafactor.py#L380-L392
Interesting, but isn't this a bet that (x*y) square()s are slower than (x+y) sqrt()s + (x+y) square()s? Do we know that this is the case?
Very interesting! Thank you for explaining the history behind this trick and sharing your experience with it. The code looks mostly good to me; I think we can merge it after making a few fixes and testing the method. We should decide on the default value of
Just a question, if you have the time: why did you choose to scale the EMAs by
The current default is taken from the Adafactor code, but I agree that further analysis is required, because someone above has reported that they had to change eps2 to make it work on Stable Cascade (1e-25 instead of 1e-30). It might take me a while to reproduce that because of a broken GPU :/
Hey @dxqbYD, since this PR is on hold at least for now, could you make another pull request with just the change to remove
It's really just because of the theory; that's how we derived the method. Even though this multiplication shouldn't be important in the long run once
Not sure either for training, but at the very least you can use beta1==0 for estimating the LR. In the SD community, Prodigy is often used to estimate a rough LR, and then Adam is used for training and fine-tuning the LR.
I was able to reproduce this now.
...while in the original Adafactor code, I'll change the code above to
Technically,
On the question of why this only happened with Cascade, I guess that 0-gradients are quite rare.
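To illustrate what I think the failure mode is (a simplified reconstruction of the factored second moment, not the actual code of this PR):

```python
import torch

grad = torch.zeros(4, 3)  # a parameter whose gradient is all zeros

# Factored second moment as in the Adafactor paper: keep only row and
# column sums and reconstruct the full matrix from their outer product.
row = grad.square().sum(dim=-1)
col = grad.square().sum(dim=-2)
approx = torch.outer(row, col) / row.sum()  # 0 / 0 -> NaN
print(approx.isnan().any())  # tensor(True)

# With eps2 added to the squared gradient, the statistics stay positive
# and the reconstruction stays finite.
eps2 = 1e-30
row = (grad.square() + eps2).sum(dim=-1)
col = (grad.square() + eps2).sum(dim=-2)
approx = torch.outer(row, col) / row.sum()
print(approx.isnan().any())  # tensor(False)
```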
I want to have the Adafactor version in the repo in the near future, so we can hopefully merge a variant of this PR soon. A few things that I need to understand:
@adefazio |
My implementation is strictly based on the transformers implementation of Adafactor: I have no opinion on which implementation is better but assumed that transformers is the reference implementation and tried to change as little as possible. About the epsilons, I might have added to an already confusing naming convention of multiple epsilons that aren't necessarily related:
I don't understand your point 1, but that might be because of the mixup of epsilons.
I don't have a strong opinion either way. Generally, it feels like we have 73 different optimizers, each with their own name, but many of them could just be another hyperparameter to Adam. But if you want to go that route, testers on the OneTrainer Discord have called it Prodifactor 🙂
Transformers' implementation is fine; I don't think there is any difference as long as it works and it's readable. I'd only try to keep it consistent with the notation in the Adafactor paper to the extent it's possible.
Thanks, I got confused indeed. Ignore my point about the default value of
True. I'm still somewhat inclined to use a different optimizer for Adafactor though as it requires a different way of estimating
Haha, that's a cool name :)
Makes sense. In the current implementation of Prodigy, we tried to avoid having numbers that are too small by using the ratio of
The only difference between the paper and the transformers implementation I've stumbled across is here:

But I wouldn't copy this to the Prodigy implementation. There is no theory behind this, other than that it is probably an oversight by the original authors.
If you do go ahead and re-test Prodigy, it would be great if you could do so with only one main loop. The current implementation does the EMA updates in one loop, then
This is another PR with the intention of making Prodigy usable on low-VRAM systems and large models.
Combined with #22, the VRAM overhead of Prodigy can be reduced by roughly 95%.
Total VRAM usage by OneTrainer again as examples:
This is achieved using some of the concepts proposed in this paper: https://arxiv.org/pdf/1804.04235. This PR therefore goes a bit beyond the scope of your paper.
I'd still propose to merge these PRs, because your repo is in wide production use but limited by its large VRAM footprint.
You can always keep a fork of the code for people just interested in the code accompanying the paper.
This PR does 3 things:
1. The paper above has shown that the first EMA can be disabled and only the second moment used, and still achieve good training results. Prodigy can already do that by setting `beta1` to 0, but with this PR it also doesn't allocate any VRAM anymore towards the then-unused `exp_avg` state. This saves 25% of VRAM.
2. If `factored` is set to `True`, it doesn't store a full tensor for the second EMA anymore, but a factored approximation. Details can be found in the paper above. Depending on the shape of the model's parameters, this saves another 24-25% of VRAM. (A rough sketch of the state layout follows this list.)
3. The approximation of the second moment above can introduce unwanted spikes in parameter updates that are not caught by regular gradient clipping. If `update_clip` is set to `True`, it does another round of clipping after the EMA has been applied to the gradients. This makes a clearly noticeable difference when using the above, but I suspect this might even be beneficial for Prodigy independently of these VRAM optimizations:
People have reported having problems with a high beta2 setting and Prodigy here: #8
This could be caused by the rapid changes in learning rate made by Prodigy, which increase the variance of the gradients while the second-moment EMA is out of date and still indicating low variance. At a high beta2 this can be the case for a long time.
Large parameter updates are the result, as shown here:
Update clipping scales down the parameter updates, not only the gradients before the EMA.
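As a rough sketch of what the factored state in point 2 replaces (illustrative shapes, not the exact PR code):

```python
import torch

n, m = 4096, 4096  # example weight-matrix shape

# Full second moment (what Prodigy stores by default): one value per weight.
exp_avg_sq = torch.zeros(n, m)        # n * m entries

# Factored second moment (Adafactor-style): only row and column statistics.
exp_avg_sq_row = torch.zeros(n)       # n entries
exp_avg_sq_col = torch.zeros(m)       # m entries

# When the denominator is needed, the full matrix is approximated as
# outer(row, col) / row.sum(), so memory per weight matrix drops from
# n * m to n + m. This is where the additional ~24-25% VRAM saving comes from.
```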
The remaining 45% of the 95% VRAM savings mentioned above come from #22.
Note again: with default settings, the new code doesn't have any effect. Set the values mentioned above if you want to test this PR.
This PR contains only the part described above. I have a merge of all 3 PRs in my GitHub fork if you want to test everything together.