I want to quantize a 7B model, but I can't even get past downloading it without an error. I pass the model id "senseable/WestLake-7B-v2", and as soon as the download starts it fills up all of the RAM. Even though I set the device to CUDA so it would use the free Colab GPU, the model is still loaded into RAM and I get an error saying all available RAM has been used.

When I use BitsAndBytes quantization, I simply pass a BitsAndBytesConfig to AutoModelForCausalLM and the model is quantized while it is being downloaded; in 4-bit it takes about 5.5 GB of GPU memory, so the free Colab GPU is enough.

Can I perform HQQ quantization through AutoModelForCausalLM? I don't want to download the full model first and then run HQQ quantization on it, because that may not be possible on the free Colab GPU. How can I perform HQQ quantization while the model is being downloaded, the same way I do with BitsAndBytes?
The current version requires the model to be on the CPU, because the library is designed to work with any model, not just Hugging Face models. You need about 14 GB of RAM (not VRAM) to hold a 7B model in fp16 on the CPU, and free Google Colab doesn't offer that much RAM, which is why it crashed.
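For context, the standalone HQQ workflow looks roughly like this. This is a minimal sketch based on the library's README around that time; the exact HQQModelForCausalLM / BaseQuantizeConfig names, arguments, and defaults may vary between versions:

```python
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "senseable/WestLake-7B-v2"

# The full fp16 model is first materialized on the CPU
# (~14 GB of RAM for a 7B model), which is what exceeds free Colab's RAM.
model = HQQModelForCausalLM.from_pretrained(model_id)

# 4-bit HQQ quantization; the quantized layers end up on the GPU.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)
```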
If you want to HQQ-quantize Hugging Face models the way BNB does, you can use our branch of transformers that implements HQQ. It supports on-the-fly loading and quantization, so that RAM issue shouldn't happen: huggingface/transformers#29637
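With that branch installed, usage would mirror the BitsAndBytes snippet from the question. This is a sketch assuming the branch exposes an HqqConfig that is passed through quantization_config, as later transformers releases do:

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ settings, applied layer by layer while the checkpoint is loaded
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="cuda",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
)
```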
I am using a free Google Colab notebook and its GPU.
This is how I quantize with BitsAndBytes while the model downloads; in 4-bit it takes about 5.5 GB of GPU memory on the free Colab GPU:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit NF4 on the fly while the weights are loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_flash_attention_2=False,
    torch_dtype=torch.bfloat16,
)
```
So, can I do the same thing with HQQ through AutoModelForCausalLM, i.e. quantize while the model is downloading instead of downloading the full model first (which probably isn't possible on the free Colab GPU)?