
Optimised 4bit inference kernels #28568

Open
nivibilla opened this issue Jan 17, 2024 · 9 comments

Labels: Feature request

Comments

@nivibilla

Feature request

Integration of new 4bit kernels

https://github.com/IST-DASLab/marlin

Motivation

Provide faster inference than AWQ/ExLlama for batch sizes up to 32.

Your contribution

Just saw this today; I can try to provide a sample notebook.

@amyeroberts
Collaborator

cc @younesbelkada @SunMarc

@amyeroberts added the Feature request label on Jan 18, 2024
@younesbelkada
Contributor

Thanks!
@efrantar can confirm, but looking at the code it seems you essentially just need to replace all Linear layers with marlin.Linear (not 100% sure). If the interface is that simple, we can definitely add support by passing a MarlinConfig through quantization_config in from_pretrained.

We also have HQQ in the backlog (#28328), but we are waiting to finalize #26610 from @poedator before adding any new quantization scheme.

cc @Titus-von-Koeller just FYI
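
For reference, a minimal sketch of the interface described in the comment above, assuming a hypothetical MarlinConfig would be wired up the same way existing quantization configs are. The example below shows today's GPTQConfig pattern; the checkpoint id is only an example of an already-quantized GPTQ model.

```python
# Sketch of the `quantization_config` pattern mentioned above, using the existing
# GPTQConfig as the template a hypothetical MarlinConfig would follow.
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # example pre-quantized GPTQ checkpoint
    # A MarlinConfig (not yet in transformers) would be passed here instead,
    # triggering the replacement of Linear layers with marlin.Linear.
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
    device_map="auto",
)
```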

@SunMarc
Member

SunMarc commented Jan 18, 2024

@qwopqwop200 seems to be working on adding Marlin to AutoGPTQ. If it is merged, we will also get support in transformers quite easily: https://github.com/qwopqwop200/AutoGPTQ-add-marlin

@efrantar

Yes, replacing the layers is pretty much it. It might also be possible to write a (not too complex) kernel that converts a GPTQ-format model (group size 128, symmetric, no act-order; or any other quantization method that produces such models) to Marlin format on the fly when loading the model, in reasonable time, which would make it possible to keep only a single storage format. However, I am not sure how many of the current GPTQ models on the Hub already use the settings required by Marlin.
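
A rough sketch of the layer-swapping approach described above. The `make_marlin_layer` factory is a placeholder for whatever constructor the marlin package actually provides (plus the GPTQ-to-Marlin repacking mentioned); none of this is the official Marlin API.

```python
# Sketch only: recursively swap every nn.Linear for a Marlin replacement layer.
# `make_marlin_layer` is a hypothetical factory that builds the replacement from
# the existing Linear (e.g. by repacking GPTQ weights into Marlin's format).
import torch.nn as nn

def replace_linears(module: nn.Module, make_marlin_layer) -> None:
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # Swap the child in place on its parent module.
            setattr(module, name, make_marlin_layer(child))
        else:
            replace_linears(child, make_marlin_layer)
```

The traversal itself is trivial; as the comment above notes, the real work is repacking the quantized weights into Marlin's layout.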

@younesbelkada
Contributor

Thank you very much @efrantar for the clarification! We will update you as soon as we merge #26610.

@t3ls

t3ls commented Jul 23, 2024

Any update for this feature?

@SunMarc
Member

SunMarc commented Jul 24, 2024

I will have a look at it soon! Since it is available in AutoGPTQ, the integration should be straightforward.
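
For anyone wanting to try this before the transformers integration lands, a hedged sketch of using Marlin through AutoGPTQ directly. The `use_marlin` flag name and the example checkpoint should be checked against your installed AutoGPTQ version, and the model must meet Marlin's requirements (4-bit, symmetric, group size 128, no act-order).

```python
# Sketch, not verified against a specific AutoGPTQ release: load a compatible
# GPTQ checkpoint and ask AutoGPTQ to repack it for the Marlin kernel.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # example checkpoint; must be sym, g128, no act-order
    device="cuda:0",
    use_marlin=True,             # assumed flag name; check the AutoGPTQ docs/changelog
)
```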

@blap
blap commented Sep 14, 2024

Any update?
