
Optimised 4bit inference kernels #28568

Open
nivibilla opened this issue Jan 17, 2024 · 9 comments

Labels: Feature request

Comments

@nivibilla

Feature request

Integration of new 4bit kernels

https://github.com/IST-DASLab/marlin

Motivation

Provide faster inference than AWQ/ExLlama for batch sizes up to 32.

Your contribution

Just saw this today; I can try to provide a sample notebook.

@amyeroberts
Collaborator

cc @younesbelkada @SunMarc

@amyeroberts added the Feature request label on Jan 18, 2024
@younesbelkada
Contributor

Thanks!
@efrantar can confirm, but looking at the code it seems you essentially just need to replace all Linear layers with marlin.Linear (not 100% sure). If the interface is that simple, we can definitely add support by passing a MarlinConfig through quantization_config in from_pretrained.

We also have HQQ in the backlog (#28328), but we are waiting to finalize #26610 from @poedator before adding any new quantization scheme.

cc @Titus-von-Koeller just FYI
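
For reference, a minimal sketch of the interface described in the comment above, assuming a hypothetical MarlinConfig would be wired up the same way existing quantization configs are. The example below shows today's GPTQConfig pattern; the checkpoint id is only an example of an already-quantized GPTQ model.

```python
# Sketch of the `quantization_config` pattern mentioned above, using the existing
# GPTQConfig as the template a hypothetical MarlinConfig would follow.
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # example pre-quantized GPTQ checkpoint
    # A MarlinConfig (not yet in transformers) would be passed here instead,
    # triggering the replacement of Linear layers with marlin.Linear.
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
    device_map="auto",
)
```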

@SunMarc
Member

SunMarc commented Jan 18, 2024

@qwopqwop200 seems to be working on adding Marlin to AutoGPTQ. If it is merged, we will also get support in transformers quite easily: https://github.com/qwopqwop200/AutoGPTQ-add-marlin

@efrantar

Yes, replacing the layers is pretty much it. It might also be possible to write a (not too complex) kernel that converts a GPTQ-format model (group size 128, symmetric, no act-order; or any other quantization method that produces such models) to Marlin format on the fly when loading the model, in reasonable time, which would make it possible to keep only a single storage format. However, I am not sure how many of the current GPTQ models on the Hub already use the settings required by Marlin.
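
A rough sketch of the layer-swapping approach described above. The `make_marlin_layer` factory is a placeholder for whatever constructor the marlin package actually provides (plus the GPTQ-to-Marlin repacking mentioned); none of this is the official Marlin API.

```python
# Sketch only: recursively swap every nn.Linear for a Marlin replacement layer.
# `make_marlin_layer` is a hypothetical factory that builds the replacement from
# the existing Linear (e.g. by repacking GPTQ weights into Marlin's format).
import torch.nn as nn

def replace_linears(module: nn.Module, make_marlin_layer) -> None:
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # Swap the child in place on its parent module.
            setattr(module, name, make_marlin_layer(child))
        else:
            replace_linears(child, make_marlin_layer)
```

The traversal itself is trivial; as the comment above notes, the real work is repacking the quantized weights into Marlin's layout.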

@younesbelkada
Contributor

Thank you very much @efrantar for the clarification! We will update you as soon as we merge #26610.

@t3ls

t3ls commented Jul 23, 2024

Any update for this feature?

@SunMarc
Member

SunMarc commented Jul 24, 2024

I will have a look at it soon! Since it is available in AutoGPTQ, the integration should be straightforward.
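
For anyone wanting to try this before the transformers integration lands, a hedged sketch of using Marlin through AutoGPTQ directly. The `use_marlin` flag name and the example checkpoint should be checked against your installed AutoGPTQ version, and the model must meet Marlin's requirements (4-bit, symmetric, group size 128, no act-order).

```python
# Sketch, not verified against a specific AutoGPTQ release: load a compatible
# GPTQ checkpoint and ask AutoGPTQ to repack it for the Marlin kernel.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # example checkpoint; must be sym, g128, no act-order
    device="cuda:0",
    use_marlin=True,             # assumed flag name; check the AutoGPTQ docs/changelog
)
```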

@blap
blap commented Sep 14, 2024

Any update?
