Implement Half-Quadratic Quantization (HQQ) #28328
Comments
This is very cool! We are definitely interested in adding HQQ inference support in transformers. The nice thing is that indeed it seems you don't need to pre-quantize the weights in order to quantize the models. We'll explore a bit on our side and let you know how it goes.
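For reference, here is a minimal sketch of what that on-the-fly quantization looks like with the standalone hqq library; the API names (`BaseQuantizeConfig`, `HQQLinear`) and their arguments are assumed from the mobiusml/hqq README, so treat this as illustrative rather than definitive:

```python
# Sketch of on-the-fly quantization with the standalone hqq library
# (BaseQuantizeConfig / HQQLinear assumed from the mobiusml/hqq README).
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# A regular fp16 linear layer, e.g. taken from an already loaded model.
linear = nn.Linear(4096, 4096, bias=False).half()

# Quantize it directly from the fp16 tensor to 4-bit with group-wise
# scales/zero-points; no pre-quantized checkpoint is needed.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_linear = HQQLinear(linear, quant_config=quant_config,
                       compute_dtype=torch.float16, device="cuda")

# The quantized layer is used as a drop-in replacement for nn.Linear.
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_linear(x)
```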
Hi! I am the maintainer of the HQQ project, happy to assist with anything needed!
Very glad to e-meet you @mobicham! Do you have an email I can use so that we can contact you through Slack to iterate quickly?
Glad to e-meet you @younesbelkada as well! Sure: hicham@mobiuslabs.com
Thanks @mobicham, you should have received an invite by now!
Closing as HQQ is now part of the release!
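For anyone landing here later, a minimal sketch of what the released integration looks like on the user side, assuming the `HqqConfig` API described in the transformers quantization docs (the Mixtral model id is only an example):

```python
# Sketch of loading a model with HQQ quantization via transformers
# (HqqConfig parameters assumed from the transformers quantization docs).
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # example model id
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,         # weights are quantized on load
)
```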
Feature request
I would be curious whether https://github.com/mobiusml/hqq can be supported in a similar fashion to `autogptq` or `autoawq`. hqq is most similar to the `bitsandbytes` nf4/fp4 data types, but offers 2/3/4/8-bit quantization.

CC: @mobicham
Motivation
HQQ performs 2/3/4-bit quantization and can act as a drop-in replacement. It is fast for in-place quantization of non-pre-quantized weights and, similarly to bnb, performs an expansion to fp16 at runtime (or similar); see the sketch below.
It would be cool to support this for models like Mixtral to cut down the VRAM requirement.
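To illustrate the runtime expansion mentioned above, here is a sketch of plain group-wise affine (de)quantization back to fp16. This shows only the general storage/expansion idea; HQQ's actual contribution, optimizing the zero-point/scale with a half-quadratic objective, is deliberately not reproduced here:

```python
# Illustrative group-wise affine (de)quantization only; HQQ's solver that
# optimizes the zero-point via a half-quadratic objective is not shown.
import torch

def quantize_group(w: torch.Tensor, nbits: int = 4):
    """Quantize one weight group to unsigned nbits ints plus a scale and zero-point."""
    qmax = 2 ** nbits - 1
    scale = (w.max() - w.min()) / qmax
    zero = -w.min() / scale
    w_q = torch.clamp(torch.round(w / scale + zero), 0, qmax).to(torch.uint8)
    return w_q, scale, zero

def dequantize_group(w_q, scale, zero):
    """Runtime expansion back to fp16: W ~= (W_q - zero) * scale."""
    return ((w_q.to(torch.float16) - zero) * scale).to(torch.float16)

w = torch.randn(64)                           # one group of weights
w_q, scale, zero = quantize_group(w)          # store low-bit ints + metadata
w_fp16 = dequantize_group(w_q, scale, zero)   # expanded on the fly at matmul time
```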
Your contribution
I currently have no capacity for submitting an integration, but I am happy to review or assist.