OOM on VRAM during quantization #148
Hi David, thank you for reporting this and doing the detailed analysis. It is pure gold to hear from users. I think there are two things: loading the state dict in the initialization context leads to duplication on the device, and we might just keep things on the CPU at the beginning. As a hot fix, please change Line 191 in 945ffb3 to load the model to CPU. The code will then move things to CUDA layer by layer.
I'll test this a bit more (I am at block number 10 in the 13B currently) and then submit a PR.
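In code, the suggested hot fix amounts to something like the sketch below (placeholders only, not the actual lit-llama source: LLaMA.from_name, quantize_block, and the exact module names are assumptions): load the checkpoint with map_location="cpu", build the model on the CPU, and let the quantization loop move one transformer block onto the GPU at a time.

```python
import torch

# Sketch of the hot fix described above (placeholders, not lit-llama's code):
# keep the checkpoint and the freshly built model on the CPU, then move one
# transformer block at a time onto the GPU for quantization.
checkpoint = torch.load(
    "checkpoints/lit-llama/13B/lit-llama.pth", map_location="cpu"  # stay on CPU
)

model = LLaMA.from_name("13B")        # placeholder for the model class
model.load_state_dict(checkpoint)     # weights live in host RAM, not VRAM

for block in model.transformer.h:     # layer by layer, as described above
    block.to("cuda")
    quantize_block(block)             # placeholder for the GPTQ pass on this block
    block.to("cpu")
    torch.cuda.empty_cache()          # release the block's cached VRAM
```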
So for quantizing 13B with this, I get:
OK, the patch fixes the issue, I've quantized and tested 7, 13, and 30B so far, and am quantizing 65B now!
Cool! Thank you for reporting back. Can you say how much GPU memory you need to run the quantized models (either for evaluate or generate)?
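(As an aside, not from the thread: one way to get that number is the peak-allocation counter in torch.cuda; generate, model, and prompt in the sketch below are placeholders for whatever entry point is used to run the quantized checkpoint.)

```python
import torch

torch.cuda.reset_peak_memory_stats()

# run one generation (or evaluation) pass with the quantized model;
# `generate`, `model`, and `prompt` are placeholders, not the repo's API
output = generate(model, prompt, max_new_tokens=100)

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gib:.2f} GiB")
```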
For generation using the quantised models on a 4090: This seems a little slow for a 4090.
Yeah, there is a substantial perf hit at the moment that we're looking at mitigating.
I just found a (potential) issue when quantizing the 13B+ models. Things start out fine and the first layer begins quantizing correctly, but after reaching the mlp module of layer zero:
0 mlp.c_fc2 collecting stats quantizing time 22s quantization error 351.9
I get an out-of-memory error.
Earlier I didn't get far enough due to problems, now resolved, from issue #141.
Although I can in theory quantize 13B, in practice I can't, even with 40 GB of VRAM. I can't even start quantizing the 30B model, as I get an OOM instantly.
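A general mitigation for this kind of per-layer OOM, sketched here under the assumption that the quantizer walks the sub-modules one at a time, is to drop the calibration statistics and flush the CUDA caching allocator after each sub-module; submodules, collect_stats, and gptq_quantize below are placeholders, not the repository's functions.

```python
import gc
import torch

# hypothetical per-sub-module loop; `submodules`, `collect_stats`, and
# `gptq_quantize` stand in for the real quantization code
for name, module in submodules:      # e.g. attn.c_attn, mlp.c_fc1, mlp.c_fc2, ...
    stats = collect_stats(module)    # accumulate activation/Hessian statistics
    gptq_quantize(module, stats)     # quantize this module in place
    del stats                        # release the (large) statistics tensors
    gc.collect()
    torch.cuda.empty_cache()         # return cached blocks to the CUDA driver
```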
I am using the command:
python quantize.py --model_size 30B --checkpoint_path checkpoints/lit-llama/30B/lit-llama.pth --tokenizer_path checkpoints/lit-llama/tokenizer.model --output_path llama-30b-gptq.4bit.pt --dtype bfloat16 --quantize gptq.int4
This is after using the new push, so the models are already in BF16 (the 30B model is now only 65 GB). I will try testing on our DGX with A100s, but I think it would be good to understand the high memory use.
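As a back-of-the-envelope check on those sizes (approximate parameter counts, not measured numbers): BF16 weights take 2 bytes per parameter, so a ~32.5B-parameter checkpoint is roughly 65 GB, and holding both the loaded state dict and an initialized copy of the model on the device roughly doubles that footprint.

```python
GB = 1e9

# approximate LLaMA parameter counts; exact figures differ slightly
sizes = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}
for name, params in sizes.items():
    bf16_gb = params * 2 / GB  # 2 bytes per parameter in bf16
    print(f"{name}: ~{bf16_gb:.0f} GB in bf16, ~{2 * bf16_gb:.0f} GB if duplicated on the device")
```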