OOM On Galore Axolotl #1448
Comments
Yes, same issue here. Even a 7B model takes a LOT of memory, much higher than the <24 GB promised in the original repo. Is "activation checkpointing" in the repo equivalent to "gradient checkpointing" in axolotl? My yaml for Yi (also adapted for Mistral-Hermes):
Did you try the galore 8bit variants?
I used the 8-bit optimiser as seen above. Hermes 7B takes a shocking 36 GB or so at seqlen 1200. In theory Yi is supposed to fit on an H100 with GaLore, but it will OOM. How can the above yml be optimised further?
@m626zNq @winglian I think I found the problem. In the axolotl Readme, I note that there are a number of layerwise optimisers:
According to a remark I saw in the original HF PR, these layerwise optimisers are essential to achieve the much higher memory savings, but they come with limitations, like perhaps not being able to work with multiple GPUs (see huggingface/transformers#29588 and the original GaLore GitHub). In any case, when I use galore_adamw_8bit_layerwise I can train Hermes-M 7B in 20 GB with a batch size of 2800 tokens. So @m626zNq can try it and probably close this "bug". But I do find the loss seems to fall much more slowly (if at all) for the layerwise optimiser compared to the normal GaLore one. I guess things are quite unstable still.
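For reference, a minimal sketch of the relevant axolotl keys as I understand the pass-through to the HF Trainer GaLore support from that PR. The values below are illustrative, not the exact config used above; please verify the key names against your axolotl version:

```yaml
optimizer: galore_adamw_8bit_layerwise   # one of the layerwise GaLore variants
optim_target_modules:                    # modules whose 2D weights get the GaLore projection
  - attn
  - mlp
optim_args: "rank=128, update_proj_gap=200, scale=0.25"  # illustrative GaLore hyperparameters
gradient_checkpointing: true             # axolotl's name for activation checkpointing
```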
@jaredquekjz the 24 GB from the paper is for a 7B parameter model. You're using Yi-34B; that's still going to require much more VRAM, probably at least an 80 GB A100 to do a full fine-tune with GaLore.
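A rough back-of-the-envelope check (assuming bf16 weights at 2 bytes per parameter): 34B × 2 bytes ≈ 68 GB for the weights alone, before gradients, activations, or any optimizer state. So even with GaLore shrinking the optimizer states, a single 80 GB card is already tight for Yi-34B.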
@m626zNq set
@winglian - thanks for the input. I used both Yi and Hermes-Mistral 7B for the trial. For both, we need the layerwise optimiser for the memory savings to be maximum (20 GB for Hermes, as reported). But as shared, the layerwise optimiser may not be fully working stably yet (no grad norm and no loss decrease), at least when I last trialled it. Yi can load on one H100 with layerwise, but it is very slow.
Yeah. Layerwise also requires that you use a gradient accumulation steps value of 1.
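In config terms, that means something like the following (a sketch; `micro_batch_size` here is just an illustrative value, not a requirement from this thread):

```yaml
gradient_accumulation_steps: 1   # required by the layerwise GaLore variants, per the comment above
micro_batch_size: 2              # illustrative; raise this instead of accumulating if memory allows
```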
@winglian I have tried all of those and still get OOM. I have tried every GaLore optimizer variant, flash attention, deepspeed, etc. (Sorry for the late response.)
Still OOM. No idea what is going wrong.
Just a thought (could be wrong) here, due to a similar discussion I had: as far as I understand, GaLore is run completely in BFloat16 precision without any automatic mixed precision. My sense is that with accelerate under the hood, AMP is being used, which obviously requires more memory (the SVD is done in float32, IIRC) - not sure exactly, though. Reference here
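If AMP is the suspect, one thing to check (a sketch; defaults and exact behaviour depend on your axolotl version and hardware) is that the config asks for plain bf16 rather than fp16 mixed precision:

```yaml
bf16: true    # train in bfloat16
fp16: false   # avoid fp16 mixed-precision scaling
tf32: true    # illustrative; only affects matmul precision on Ampere+ GPUs
```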
I am encountering similar issues - way too much VRAM is being used for GaLore tuning of Llama 8B for me (280 GB on 8x A6000s!). Something definitely seems wrong here. If the paper gives 24 GB for a 7B model, presumably it should not take 280 GB for an 8B, even with a larger tokenizer?
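One hedged observation on that number: if those 8 GPUs are running plain data parallelism, each rank holds its own full replica, so 280 GB total works out to roughly 280 / 8 ≈ 35 GB per GPU, which is closer to (though still above) the single-GPU figures reported earlier in this thread.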
Every time I've tried to fine-tune Mistral-7B on the SQuAD dataset, I've got an OOM error.
error:
Please check that this issue hasn't been reported before.
Expected Behavior
Should start training without OOM, like LLaMA-Factory does.
Current behaviour
My config causes an OOM on axolotl. LLaMA-Factory behaved fine, but axolotl is hating on me: on LLaMA-Factory I was able to train with 16-bit, rank 1024, and 8k context, and it worked fine on the same GPU; axolotl won't even run with 8-bit, rank 128, and 4k context (out of memory).
I have tried:
Steps to reproduce
1. Install GaLore: `pip install galore-torch`
2. Run the config posted below
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main
Acknowledgements