
OOM on VRAM during quantization #148

Closed
dnhkng opened this issue Apr 17, 2023 · 6 comments · Fixed by #163

Comments

@dnhkng
Contributor

dnhkng commented Apr 17, 2023

I just found a (potential) issue when quantizing the 13B+ models. Things start out fine and the first layers quantize correctly, but after reaching the level-zero mlp layer,

0 mlp.c_fc2 collecting stats quantizing time 22s quantization error 351.9

I get an out-of-memory error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 730.00 MiB (GPU 0; 31.75 GiB total capacity; 29.65 GiB already allocated; 82.69 MiB free; 30.68 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
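
For reference, the allocator hint at the end of that traceback corresponds to setting an environment variable before launching, e.g. (the 128 MiB split size here is just an example value):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python quantize.py ...

but that only mitigates fragmentation; it doesn't help when the model genuinely exceeds VRAM.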

Earlier I didn't get this far, due to problems now resolved in issue #141.

Although I can in theory quantize 13B, in practice I can't, even with 40 GB of VRAM. I can't even start quantizing the 30B model, as I get an OOM instantly.

I am using the command:
python quantize.py --model_size 30B --checkpoint_path checkpoints/lit-llama/30B/lit-llama.pth --tokenizer_path checkpoints/lit-llama/tokenizer.model --output_path llama-30b-gptq.4bit.pt --dtype bfloat16 --quantize gptq.int4

This is after pulling the latest push, so the models are already in BF16 (the 30B model is now only 65 GB). I will try testing on our DGX with A100s, but I think it would be good to understand the high memory use.

@t-vi
Contributor

t-vi commented Apr 17, 2023

Hi David, thank you for reporting this and doing the detailed analysis. It is pure gold to hear from users.

I think there are two things: loading the state dict in the initialization context leads to duplication on the device, and we might just keep things on the CPU at the beginning.

As a hot fix, please change quantize.py around

device=device,

to load the model onto the CPU instead. The code will then move things to CUDA layer by layer:

    with EmptyInitOnDevice(
-        device=device,
+        device="cpu",
         dtype=dtype,
     ):
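
For context, a rough sketch of the block-by-block pattern the quantization loop then follows (the transformer.h attribute and the gptq_quantize_block call are illustrative names, not the actual quantize.py code):

    import gc
    import torch

    def quantize_block_by_block(model, device="cuda"):
        # Weights start on the CPU; only one transformer block lives on the GPU at a time.
        for block in model.transformer.h:    # illustrative attribute name
            block.to(device)                 # move just this block to the GPU
            gptq_quantize_block(block)       # hypothetical per-block GPTQ step
            block.to("cpu")                  # keep the quantized weights in host memory
            gc.collect()
            torch.cuda.empty_cache()         # release cached allocations before the next block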

I'll test this a bit more (I am at block number 10 in the 13B currently) and then submit a PR.
My apologies for the inconvenience and, again, thank you very much for reporting this.

@t-vi
Contributor

t-vi commented Apr 17, 2023

So for quantizing 13B with this, I get:
Time for quantization: 1287.86 sec total
Memory used: 10.14 GB
This works well on my 3090. I might get the 30B to work, too, but maybe by deleting the state dict after loading.
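
A minimal sketch of that idea (load_weights_cpu is just an illustrative helper, not the actual quantize.py code):

    import gc
    import torch

    def load_weights_cpu(model, checkpoint_path):
        # Load the checkpoint onto the host and drop the extra copy right away.
        state_dict = torch.load(checkpoint_path, map_location="cpu")
        model.load_state_dict(state_dict)
        del state_dict   # the model now holds the only copy of the weights
        gc.collect()
        return model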

@dnhkng
Contributor Author

dnhkng commented Apr 18, 2023

OK, the patch fixes the issue. I've quantized and tested 7B, 13B, and 30B so far, and am quantizing 65B now!

@t-vi
Contributor

t-vi commented Apr 18, 2023

Cool! Thank you for reporting back. Can you say how much GPU memory you need to run the quantized models (either for evaluate or generate)?

@dnhkng
Contributor Author

dnhkng commented Apr 18, 2023

For generation using the quantised models on a 4090:
7B + bfloat16: 4.60 GB, Time for inference: 10.34 sec total, 4.84 tokens/sec
7B: 5.19 GB, Time for inference: 16.08 sec total, 3.11 tokens/sec
13B + bfloat16: 8.12 GB, Time for inference: 20.39 sec total, 2.45 tokens/sec
13B: 8.91 GB, Time for inference: 31.48 sec total, 1.59 tokens/sec
30B + bfloat16: 18.74 GB, Time for inference: 53.21 sec total, 0.94 tokens/sec
30B: 19.39 GB, Time for inference: 80.77 sec total, 0.62 tokens/sec
65B: in progress...

This seems a little slow for a 4090.

@t-vi
Contributor

t-vi commented Apr 18, 2023

Yeah, there is a substantial perf hit at the moment that we're looking at mitigating.
