OOM on VRAM during quantization #148
Hi David, thank you for reporting this and doing the detailed analysis. It is pure gold to hear from users. I think there are two things: loading the state dict in the initialization context leads to duplication on the device, and we might just keep things on the CPU at the beginning. As a hot fix, please change Line 191 in 945ffb3 to load the model to CPU. The code will then move things to CUDA layer by layer.
I'll test this a bit more (I am at block number 10 in the 13B currently) and then submit a PR.
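In code, the suggested hot fix amounts to something like the sketch below (placeholders only, not the actual lit-llama source: LLaMA.from_name, quantize_block, and the exact module names are assumptions): load the checkpoint with map_location="cpu", build the model on the CPU, and let the quantization loop move one transformer block onto the GPU at a time.

```python
import torch

# Sketch of the hot fix described above (placeholders, not lit-llama's code):
# keep the checkpoint and the freshly built model on the CPU, then move one
# transformer block at a time onto the GPU for quantization.
checkpoint = torch.load(
    "checkpoints/lit-llama/13B/lit-llama.pth", map_location="cpu"  # stay on CPU
)

model = LLaMA.from_name("13B")        # placeholder for the model class
model.load_state_dict(checkpoint)     # weights live in host RAM, not VRAM

for block in model.transformer.h:     # layer by layer, as described above
    block.to("cuda")
    quantize_block(block)             # placeholder for the GPTQ pass on this block
    block.to("cpu")
    torch.cuda.empty_cache()          # release the block's cached VRAM
```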
So for quantizing 13B with this, I get:
OK, the patch fixes the issue, I've quantized and tested 7, 13, and 30B so far, and am quantizing 65B now!
Cool! Thank you for reporting back. Can you say how much GPU memory you need to run the quantized models (either for evaluate or generate)?
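(As an aside, not from the thread: one way to get that number is the peak-allocation counter in torch.cuda; generate, model, and prompt in the sketch below are placeholders for whatever entry point is used to run the quantized checkpoint.)

```python
import torch

torch.cuda.reset_peak_memory_stats()

# run one generation (or evaluation) pass with the quantized model;
# `generate`, `model`, and `prompt` are placeholders, not the repo's API
output = generate(model, prompt, max_new_tokens=100)

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gib:.2f} GiB")
```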
For generation using the quantised models on a 4090: This seems a little slow for a 4090.
Yeah, there is a substantial perf hit at the moment that we're looking at mitigating.
I just found a (potential) issue when quantizing the 13B+ models. Things start out fine and the first layer begins quantizing correctly, but after reaching the mlp module of layer zero:
0 mlp.c_fc2 collecting stats quantizing time 22s quantization error 351.9
I get an out-of-memory error.
Earlier I didn't get far enough due to problems, now resolved, from issue #141.
Although I can in theory quantize 13B, in practice I can't, even with 40 GB of VRAM. I can't even start quantizing the 30B model, as I get an OOM instantly.
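A general mitigation for this kind of per-layer OOM, sketched here under the assumption that the quantizer walks the sub-modules one at a time, is to drop the calibration statistics and flush the CUDA caching allocator after each sub-module; submodules, collect_stats, and gptq_quantize below are placeholders, not the repository's functions.

```python
import gc
import torch

# hypothetical per-sub-module loop; `submodules`, `collect_stats`, and
# `gptq_quantize` stand in for the real quantization code
for name, module in submodules:      # e.g. attn.c_attn, mlp.c_fc1, mlp.c_fc2, ...
    stats = collect_stats(module)    # accumulate activation/Hessian statistics
    gptq_quantize(module, stats)     # quantize this module in place
    del stats                        # release the (large) statistics tensors
    gc.collect()
    torch.cuda.empty_cache()         # return cached blocks to the CUDA driver
```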
I am using the command:
python quantize.py --model_size 30B --checkpoint_path checkpoints/lit-llama/30B/lit-llama.pth --tokenizer_path checkpoints/lit-llama/tokenizer.model --output_path llama-30b-gptq.4bit.pt --dtype bfloat16 --quantize gptq.int4
This is after using the new push, so the models are already in BF16 (the 30B model is now only 65 GB). I will try testing on our DGX with A100s, but I think it would be good to understand the high memory use.
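As a back-of-the-envelope check on those sizes (approximate parameter counts, not measured numbers): BF16 weights take 2 bytes per parameter, so a ~32.5B-parameter checkpoint is roughly 65 GB, and holding both the loaded state dict and an initialized copy of the model on the device roughly doubles that footprint.

```python
GB = 1e9

# approximate LLaMA parameter counts; exact figures differ slightly
sizes = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}
for name, params in sizes.items():
    bf16_gb = params * 2 / GB  # 2 bytes per parameter in bf16
    print(f"{name}: ~{bf16_gb:.0f} GB in bf16, ~{2 * bf16_gb:.0f} GB if duplicated on the device")
```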