
LLM: 2bit quantization support #10042

Merged
merged 8 commits from support_iq2 into intel:main on Feb 6, 2024

Conversation

@rnwang04 (Contributor) commented on Jan 30, 2024

Description

0. Workflow

  • add support for importance matrix loading and the related preprocessing
  • add iq2_xxs / iq2_xs support to the load_in_low_bit parameter and update the related quantization workflow
  • verify iq2_xxs / iq2_xs quantization
  • add dequantization support for iq2_xxs
  • verify iq2_xxs dequantization support
  • rest-token inference with iq2_xxs

1. Why the change?

To support 2-bit quantization.
https://github.com/analytics-zoo/nano/issues/1015
This PR is only a starting point for iq2 support and adds the basic functionality; q2_K mixed quantization will be investigated later.

2. User API changes

The related imatrix files are downloaded from https://huggingface.co/datasets/ikawrakow/imatrix-from-wiki-train/tree/main.

from bigdl.llm.transformers import AutoModelForCausalLM

# Load Llama-2-7B with 2-bit iq2_xxs quantization on Intel GPU, guided by its importance matrix
model = AutoModelForCausalLM.from_pretrained('meta/Llama-2-7b-hf',
                                             load_in_low_bit='iq2_xxs',
                                             optimize_model=False,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             imatrix='imatrix/llama-v2-7b.imatrix').to('xpu')

# The same API works for Mistral-7B-Instruct-v0.2 with its own imatrix file
model = AutoModelForCausalLM.from_pretrained('mistral/Mistral-7B-Instruct-v0.2',
                                             load_in_low_bit='iq2_xxs',
                                             optimize_model=False,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             imatrix='mistral-7b-instruct-v0.2.imatrix').to('xpu')
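For completeness, generation with the resulting model then follows the usual transformers flow on XPU. A minimal sketch, assuming an intel_extension_for_pytorch environment and reusing the model loaded above; the prompt, tokenizer path, and generation settings are illustrative, not from this PR:

import torch
import intel_extension_for_pytorch as ipex  # typically required to register the 'xpu' device
from transformers import AutoTokenizer

# Hypothetical rest-token inference with the iq2_xxs-quantized model loaded above
tokenizer = AutoTokenizer.from_pretrained('meta/Llama-2-7b-hf', trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to('xpu')

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))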

3. Summary of the change

  • Support loading and preprocessing of importance matrices produced in llama.cpp's imatrix format (see the sketch after this list)
  • Provide a new ggml_quantize_tensor_with_weights API
  • Support basic functionality for iq2_xxs mixed with q4_0
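For context, below is a minimal sketch of reading an importance matrix in llama.cpp's binary layout (entry count, then per entry: name length, name bytes, call count, value count, float32 values). This layout is an assumption based on llama.cpp's imatrix tooling, and load_imatrix is a hypothetical helper name, not the loader added in this PR:

import struct
import numpy as np

def load_imatrix(path):
    """Parse an assumed llama.cpp-style imatrix file into {tensor_name: importance values}."""
    imatrix = {}
    with open(path, 'rb') as f:
        n_entries, = struct.unpack('<i', f.read(4))
        for _ in range(n_entries):
            name_len, = struct.unpack('<i', f.read(4))
            name = f.read(name_len).decode('utf-8')
            ncall, = struct.unpack('<i', f.read(4))
            nval, = struct.unpack('<i', f.read(4))
            values = np.frombuffer(f.read(4 * nval), dtype=np.float32)
            # Average the accumulated per-element statistics over the number of calls
            imatrix[name] = values / max(ncall, 1)
    return imatrix

The per-tensor values obtained this way indicate which weights matter most, which is what lets a 2-bit scheme like iq2_xxs spend its few codes where they hurt accuracy the least.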

4. How to test?

  • Unit test
  • Local test

@rnwang04 rnwang04 added the llm label Jan 30, 2024
@rnwang04 rnwang04 marked this pull request as draft January 30, 2024 09:33
@rnwang04 rnwang04 changed the title from "LLM: 2bit quantization support" to "LLM: 2bit quantization support [WIP]" on Jan 30, 2024
@rnwang04 rnwang04 marked this pull request as ready for review February 1, 2024 02:04
@rnwang04 rnwang04 changed the title from "LLM: 2bit quantization support [WIP]" to "LLM: 2bit quantization support" on Feb 1, 2024
@rnwang04 rnwang04 merged commit 96c5d4d into intel:main Feb 6, 2024
19 checks passed
@rnwang04 rnwang04 deleted the support_iq2 branch February 6, 2024 06:58
Jasonzzt pushed a commit to Jasonzzt/BigDL that referenced this pull request Feb 19, 2024
* basis quantize support

* fix new module name

* small update

* and mixed int4 with iq2_xxs

* remove print

* code refactor

* fix style

* meet code review
liu-shaojun pushed a commit that referenced this pull request Mar 25, 2024
* basis quantize support

* fix new module name

* small update

* and mixed int4 with iq2_xxs

* remove print

* code refactor

* fix style

* meet code review