
LLM: 2bit quantization support #10042

Merged
merged 8 commits from support_iq2 into intel:main on Feb 6, 2024

Conversation

@rnwang04 (Contributor) commented on Jan 30, 2024

Description

0. Workflow

  • add support for importance matrix loading and the related preprocessing
  • add iq2_xxs / iq2_xs support to the load_in_low_bit parameter and update the related quantization workflow
  • verify iq2_xxs / iq2_xs quantization
  • add dequantization support for iq2_xxs
  • verify iq2_xxs dequantization support
  • rest-token inference with iq2_xxs

1. Why the change?

To support 2-bit quantization.
https://github.com/analytics-zoo/nano/issues/1015
This PR is only a starting point for iq2 support and adds the basic functionality; q2_K mixed quantization will be investigated later.

2. User API changes

The related imatrix files are downloaded from https://huggingface.co/datasets/ikawrakow/imatrix-from-wiki-train/tree/main.

from bigdl.llm.transformers import AutoModelForCausalLM

# Load Llama-2-7B with 2-bit iq2_xxs quantization on Intel GPU, guided by its importance matrix
model = AutoModelForCausalLM.from_pretrained('meta/Llama-2-7b-hf',
                                             load_in_low_bit='iq2_xxs',
                                             optimize_model=False,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             imatrix='imatrix/llama-v2-7b.imatrix').to('xpu')

# The same API works for Mistral-7B-Instruct-v0.2 with its own imatrix file
model = AutoModelForCausalLM.from_pretrained('mistral/Mistral-7B-Instruct-v0.2',
                                             load_in_low_bit='iq2_xxs',
                                             optimize_model=False,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             imatrix='mistral-7b-instruct-v0.2.imatrix').to('xpu')
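For completeness, generation with the resulting model then follows the usual transformers flow on XPU. A minimal sketch, assuming an intel_extension_for_pytorch environment and reusing the model loaded above; the prompt, tokenizer path, and generation settings are illustrative, not from this PR:

import torch
import intel_extension_for_pytorch as ipex  # typically required to register the 'xpu' device
from transformers import AutoTokenizer

# Hypothetical rest-token inference with the iq2_xxs-quantized model loaded above
tokenizer = AutoTokenizer.from_pretrained('meta/Llama-2-7b-hf', trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to('xpu')

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))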

3. Summary of the change

  • Support loading and preprocessing of importance matrices produced in llama.cpp's imatrix format (see the sketch after this list)
  • Provide a new ggml_quantize_tensor_with_weights API
  • Support basic functionality for iq2_xxs mixed with q4_0
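For context, below is a minimal sketch of reading an importance matrix in llama.cpp's binary layout (entry count, then per entry: name length, name bytes, call count, value count, float32 values). This layout is an assumption based on llama.cpp's imatrix tooling, and load_imatrix is a hypothetical helper name, not the loader added in this PR:

import struct
import numpy as np

def load_imatrix(path):
    """Parse an assumed llama.cpp-style imatrix file into {tensor_name: importance values}."""
    imatrix = {}
    with open(path, 'rb') as f:
        n_entries, = struct.unpack('<i', f.read(4))
        for _ in range(n_entries):
            name_len, = struct.unpack('<i', f.read(4))
            name = f.read(name_len).decode('utf-8')
            ncall, = struct.unpack('<i', f.read(4))
            nval, = struct.unpack('<i', f.read(4))
            values = np.frombuffer(f.read(4 * nval), dtype=np.float32)
            # Average the accumulated per-element statistics over the number of calls
            imatrix[name] = values / max(ncall, 1)
    return imatrix

The per-tensor values obtained this way indicate which weights matter most, which is what lets a 2-bit scheme like iq2_xxs spend its few codes where they hurt accuracy the least.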

4. How to test?

  • Unit test
  • Local test

@rnwang04 rnwang04 added the llm label Jan 30, 2024
@rnwang04 rnwang04 marked this pull request as draft January 30, 2024 09:33
@rnwang04 rnwang04 changed the title from "LLM: 2bit quantization support" to "LLM: 2bit quantization support [WIP]" on Jan 30, 2024
@rnwang04 rnwang04 marked this pull request as ready for review February 1, 2024 02:04
@rnwang04 rnwang04 changed the title from "LLM: 2bit quantization support [WIP]" to "LLM: 2bit quantization support" on Feb 1, 2024
@rnwang04 rnwang04 merged commit 96c5d4d into intel:main Feb 6, 2024
19 checks passed
@rnwang04 rnwang04 deleted the support_iq2 branch February 6, 2024 06:58
Jasonzzt pushed a commit to Jasonzzt/BigDL that referenced this pull request Feb 19, 2024
* basis quantize support

* fix new module name

* small update

* and mixed int4 with iq2_xxs

* remove print

* code refactor

* fix style

* meet code review
liu-shaojun pushed a commit that referenced this pull request Mar 25, 2024
* basis quantize support

* fix new module name

* small update

* and mixed int4 with iq2_xxs

* remove print

* code refactor

* fix style

* meet code review