
Your session crashed after using all available RAM #34

Closed
Abdullah-kwl opened this issue Apr 1, 2024 · 1 comment

Comments


Abdullah-kwl commented Apr 1, 2024

I am using the free Google Colab notebook with a GPU.

I want to quantize a 7B model, but I can't even get past downloading it from Hugging Face. I simply pass the model id "senseable/WestLake-7B-v2", but as soon as it starts downloading it occupies all of the RAM. Even though I passed CUDA so it would use the free Colab GPU, the model is still loaded into RAM and I get an error that the session used all available RAM.

Screenshots (2024-04-01): Colab session crashed after using all available RAM.

When I use BitsAndBytes quantization, I simply pass a BitsAndBytesConfig to AutoModelForCausalLM and it quantizes the model while it is being loaded. With 4-bit quantization it takes roughly 5.5 GB of GPU memory, so it fits on the free Colab GPU:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit NF4 on the fly while the checkpoint is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_flash_attention_2=False,
    torch_dtype=torch.bfloat16,
)


Can I perform HQQ quantization through AutoModelForCausalLM? I don't want to download the full model first and then run HQQ quantization, since that may not be possible on the free Colab GPU.

How can I perform HQQ quantization while the model is being loaded, the way I did with BitsAndBytes quantization?


mobicham (Collaborator) commented Apr 1, 2024

The current version requires the model to be on the CPU first, because the library is designed to work on any model, not necessarily a Hugging Face model. You need about 14 GB of RAM (not VRAM) to store a 7B model as fp16 on the CPU; free Google Colab doesn't offer that much RAM, which is why your session crashed.
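
For reference, the standalone flow looks roughly like this (a minimal sketch assuming the hqq library's HQQModelForCausalLM / BaseQuantizeConfig API from around this time; exact names and arguments may differ between versions): the full fp16 model is first loaded into CPU RAM, then quantized.

from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

# Step 1: load the full fp16 checkpoint into CPU RAM (~14 GB for a 7B model)
model = HQQModelForCausalLM.from_pretrained("senseable/WestLake-7B-v2")

# Step 2: quantize the weights to 4-bit with HQQ (this is the step that needs the CPU copy)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)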

If you want to HQQ-quantize Hugging Face models the way BNB does, you can use our branch of transformers that implements HQQ. It allows dynamic loading and quantization, so that RAM issue shouldn't happen: huggingface/transformers#29637
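
With that integration, usage mirrors your BitsAndBytes snippet above, with an HqqConfig in place of the BitsAndBytesConfig (a sketch assuming the HqqConfig name and the nbits/group_size parameters from the linked PR; check the branch for the exact API):

import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ settings, applied while the checkpoint is loaded (as with BitsAndBytes)
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="cuda",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)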
