The Method of Quantization and Inference for Yuan2.0-M32

0. Model Downloads

| Model | Sequence Length | Type | Download |
| :---- | :----: | :----: | :---- |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | ModelScope \| HuggingFace \| Netdisk \| Wisemodel |
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | ModelScope \| HuggingFace \| Netdisk \| Wisemodel |

1. Environment of AutoGPTQ

  • Environment requirements: CUDA version > 11.8
  • Container: create a container from the image provided by the vLLM instructions
# enter docker containers
docker exec -it vllm_yuan bash

# enter directory
cd /mnt

# clone
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git

# enter project
cd  Yuan2.0-M32/3rd_party/AutoGPTQ

# install autogptq
pip install auto-gptq --no-build-isolation
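
As an optional sanity check (a minimal sketch, not part of the official setup), you can verify inside the container that the CUDA build satisfies the requirement and that auto-gptq imports cleanly:

```python
# Optional sanity check: confirm the CUDA version and that auto-gptq is importable.
import torch
import auto_gptq

print(torch.version.cuda)         # expected to report a version > 11.8
print(torch.cuda.is_available())  # True if the GPUs are visible in the container
print(auto_gptq.__version__)
```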

2. Quantize Yuan2.0-M32-HF model

The steps for quantizing the Yuan2.0-M32 model:

  • Step 1: Download the Yuan2.0-M32-HF model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to the vLLM instructions.
  • Step 2: Download the calibration dataset and move it to the specified path (/mnt/beegfs2/); a sketch of the expected data format follows this list.
  • Step 3: Adjust the parameters in the following script, then run it to perform quantization.
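
The quantization script reads calibration samples with "instruction" and "output" fields (see the excerpt below), so instruct_data.json is expected to be a JSON list of such records. The records shown here are hypothetical, purely to illustrate the layout:

```python
# Hypothetical illustration of the layout the script expects in instruct_data.json:
# a JSON list of objects, each carrying "instruction" and "output" strings.
import json

example_records = [
    {"instruction": "Explain what GPTQ quantization does.",
     "output": "GPTQ compresses model weights to low-bit integers using calibration data."},
    {"instruction": "Write a Python function that adds two numbers.",
     "output": "def add(a, b):\n    return a + b"},
]

print(json.dumps(example_records, ensure_ascii=False, indent=2))
```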
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py

'''
pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"

tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)

examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file: # path of datasets
    data = json.load(file)

for i, item in enumerate(data):
    if i >= 2000:
        break
    instruction = item.get('instruction', '')
    output = item.get('output', '')
    combined_text = instruction + " " + output
    examples.append(tokenizer(combined_text))

max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
'''

# Modify pretrained_model_dir and set quantized_model_dir to the output path for the quantized model.
# Modify the path of the calibration dataset.
# max_memory specifies the GPUs to be used.
# Adjust the quantization parameters: for int4 set bits=4, for int8 set bits=8.
# Other parameters can keep their default values.


# Run
python Yuan2-M32-int4.py

# The model quantization and packing process takes approximately 8 hours.
# You can use different GPUs to quantize the model to int4 and int8 separately at the same time.
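
For reference, the core of Yuan2-M32-int4.py roughly follows the standard AutoGPTQ flow sketched below. This is a minimal sketch that reuses the variables set in the excerpt above (pretrained_model_dir, quantized_model_dir, max_memory, quantize_config, examples); the actual script may differ in detail:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Load the full-precision model with the quantization config, spreading it
# across the GPUs listed in max_memory.
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    max_memory=max_memory,
    trust_remote_code=True,
)

# Run GPTQ calibration on the tokenized examples, then save the packed
# checkpoint as .safetensors files under quantized_model_dir.
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)
```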

3. Inference with Quantized Model

Once quantization has completed, the output folder contains the checkpoint files with the '.safetensors' suffix, config.json, and quantize_config.json. Before running inference, you first need to copy the tokenizer-related files from the Yuan2-M32-HF path.

# the path of Yuan2-M32-HF
cd /mnt/beegfs2/Yuan2-M32-HF

# copy tokenizer files to the path of Yuan2-M32-GPTQ-int4
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4

# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py

'''
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"

tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
'''
# edit paths of quantized_model_dir and tokenizer

# run inference.py
python inference.py
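
The excerpt above only shows how the quantized checkpoint and tokenizer are loaded. A minimal generation call would look roughly like the following sketch (the prompt is a hypothetical example; adjust the generation settings as needed):

```python
# Tokenize a prompt, move it to the model's device, and generate a completion.
prompt = "Write a quicksort function in Python."  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```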

4. Evaluation

Generation parameters used for HumanEval: generation_params = { "max_new_tokens": 512, "top_k": 1, "top_p": 0, "temperature": 1.0 }
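
These are greedy-decoding settings (top_k=1). As a sketch, they can be forwarded directly to generate() using the quantized model and tokenizer loaded in section 3 (the prompt below is hypothetical):

```python
# HumanEval generation settings from above, forwarded to generate().
generation_params = {"max_new_tokens": 512, "top_k": 1, "top_p": 0, "temperature": 1.0}

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")  # hypothetical prompt
outputs = model.generate(**inputs, **generation_params)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```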

Yuan2-M32-HF inference was run on two 80GB GPUs; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each ran inference on a single 80GB GPU.

Results:

| Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
| :---- | :----: | :----: | :----: | :----: |
| Yuan2-M32-HF | BF16 | 73.17% | 13.16 tokens/s | 76.34 GB |
| Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 tokens/s | 39.81 GB |
| Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 tokens/s | 23.27 GB |