
Your session crashed after using all available RAM #34

Closed
Abdullah-kwl opened this issue Apr 1, 2024 · 1 comment

Comments


Abdullah-kwl commented Apr 1, 2024

I am using the free Google Colab notebook with a GPU.

I want to quantize a 7B model, but I can't even get past downloading it from Hugging Face. I simply pass the model id "senseable/WestLake-7B-v2", but as soon as it starts downloading it occupies all of the RAM. Even though I passed CUDA so it would use the free Colab GPU, the model is still loaded into RAM and I get an error that the session used all available RAM.

Screenshots (2024-04-01): Colab session crashed after using all available RAM.

When I use BitsAndBytes quantization, I simply pass a BitsAndBytesConfig to AutoModelForCausalLM and it quantizes the model while it is being loaded. With 4-bit quantization it takes roughly 5.5 GB of GPU memory, so it fits on the free Colab GPU:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit NF4 on the fly while the checkpoint is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_flash_attention_2=False,
    torch_dtype=torch.bfloat16,
)


Can I perform HQQ quantization through AutoModelForCausalLM? I don't want to download the full model first and then run HQQ quantization, since that may not be possible on the free Colab GPU.

How can I perform HQQ quantization while the model is being loaded, the way I did with BitsAndBytes quantization?


mobicham (Collaborator) commented Apr 1, 2024

The current version requires the model to be on the CPU first, because the library is designed to work on any model, not necessarily a Hugging Face model. You need about 14 GB of RAM (not VRAM) to store a 7B model as fp16 on the CPU; free Google Colab doesn't offer that much RAM, which is why your session crashed.
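
For reference, the standalone flow looks roughly like this (a minimal sketch assuming the hqq library's HQQModelForCausalLM / BaseQuantizeConfig API from around this time; exact names and arguments may differ between versions): the full fp16 model is first loaded into CPU RAM, then quantized.

from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

# Step 1: load the full fp16 checkpoint into CPU RAM (~14 GB for a 7B model)
model = HQQModelForCausalLM.from_pretrained("senseable/WestLake-7B-v2")

# Step 2: quantize the weights to 4-bit with HQQ (this is the step that needs the CPU copy)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)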

If you want to HQQ-quantize Hugging Face models the way BNB does, you can use our branch of transformers that implements HQQ. It allows dynamic loading and quantization, so that RAM issue shouldn't happen: huggingface/transformers#29637
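
With that integration, usage mirrors your BitsAndBytes snippet above, with an HqqConfig in place of the BitsAndBytesConfig (a sketch assuming the HqqConfig name and the nbits/group_size parameters from the linked PR; check the branch for the exact API):

import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ settings, applied while the checkpoint is loaded (as with BitsAndBytes)
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "senseable/WestLake-7B-v2",
    device_map="cuda",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)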
