| Model | Sequence Length | Type | Download |
| --- | --- | --- | --- |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | ModelScope \| HuggingFace \| Netdisk \| Wisemodel |
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | ModelScope \| HuggingFace \| Netdisk \| Wisemodel |
- Environment requirements: CUDA version > 11.8
- Container: create a container using the image provided in the vLLM section
```bash
# enter the docker container
docker exec -it vllm_yuan bash
# enter the working directory
cd /mnt
# clone the repository
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git
# enter the project
cd Yuan2.0-M32/3rd_party/AutoGPTQ
# install auto-gptq
pip install auto-gptq --no-build-isolation
```
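Before starting the quantization, a quick sanity check that the toolchain is in place can save time. This is a minimal sketch that only assumes the packages used in this guide (torch comes with the vLLM image; auto-gptq was just installed) and checks that they import and that GPUs are visible:

```python
import torch
import auto_gptq  # confirms the auto-gptq install above succeeded

# This workflow requires CUDA > 11.8.
print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("GPUs visible:", torch.cuda.device_count())
print("auto-gptq imported from:", auto_gptq.__file__)
```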
Steps for quantizing the Yuan2.0-M32 model:
- Step 1: Download the Yuan2.0-M32-HF model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to the vLLM section.
- Step 2: Download the calibration datasets and move them to the specified path (/mnt/beegfs2/); the expected file format is sketched after this list.
- Step 3: Adjust the parameters in the following script and run the quantization.
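Based on how the quantization script reads the calibration file (see the excerpt below), it expects a JSON array of objects with `instruction` and `output` fields. The following sketch writes a correctly shaped placeholder purely to illustrate the format; use the real downloaded dataset for actual calibration:

```python
import json

# Purely illustrative records; replace with the real calibration data.
samples = [
    {"instruction": "Explain what GPTQ quantization does.",
     "output": "GPTQ compresses trained model weights into low-bit integers."},
    {"instruction": "Write a Python one-liner that reverses a string.",
     "output": "reversed_s = s[::-1]"},
]

# The quantization script reads this path and uses up to the first
# 2000 records as calibration examples.
with open("/mnt/beegfs2/instruct_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```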
```bash
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py
```
```python
# Yuan2-M32-int4.py (excerpt)
import json

from transformers import LlamaTokenizer
from auto_gptq import BaseQuantizeConfig

pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"

tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)

examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file:  # path of the calibration dataset
    data = json.load(file)
    for i, item in enumerate(data):
        if i >= 2000:  # use at most 2000 calibration samples
            break
        instruction = item.get('instruction', '')
        output = item.get('output', '')
        combined_text = instruction + " " + output
        examples.append(tokenizer(combined_text))

max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # 128 is the recommended group size
    desc_act=False,  # False significantly speeds up inference, at a small cost in perplexity
)
```
- Modify `pretrained_model_dir`, and set `quantized_model_dir` to where the quantized model should be written.
- Modify the path of the calibration dataset.
- `max_memory` specifies which GPUs to use and how much memory each may hold.
- Adjust the quantization parameters: set `bits=4` for INT4 or `bits=8` for INT8.
- The other parameters can be left at their default values.
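The excerpt above only covers the configuration. The quantization itself typically follows the standard AutoGPTQ flow sketched below; this continues from the variables defined in the excerpt and is a sketch of that flow, not necessarily the exact remainder of Yuan2-M32-int4.py:

```python
from auto_gptq import AutoGPTQForCausalLM

# Load the full-precision checkpoint, sharded across the GPUs listed in max_memory.
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    max_memory=max_memory,
    trust_remote_code=True,
)

# Run GPTQ calibration on the tokenized examples; this is the long-running step.
model.quantize(examples)

# Write the packed low-bit checkpoint (*.safetensors, config.json, quantize_config.json).
model.save_quantized(quantized_model_dir, use_safetensors=True)
```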
```bash
# run the quantization
python Yuan2-M32-int4.py
```
- The model quantization and packing process takes approximately 8 hours.
- You can quantize the model to INT4 and INT8 at the same time on separate GPUs.
Once quantization completes, the output folder contains checkpoint files with the '.safetensors' suffix, config.json, and quantize_config.json. Before running inference, you first need to copy the tokenizer-related files from the Yuan2-M32-HF path.
```bash
# go to the Yuan2-M32-HF path
cd /mnt/beegfs2/Yuan2-M32-HF
# copy the tokenizer files to the Yuan2-M32-GPTQ-int4 path
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4
```
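A quick, optional check that the quantized folder now contains everything inference needs (a minimal sketch; the expected files are the ones produced and copied in the steps above):

```python
import os

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
files = set(os.listdir(quantized_model_dir))

# Written by the quantization step, plus the tokenizer files copied above.
required = {"config.json", "quantize_config.json", "special_tokens_map.json"}

print("missing:", required - files or "none")
print("packed weights (.safetensors):", any(f.endswith(".safetensors") for f in files))
print("tokenizer files:", any(f.startswith("tokenizer") for f in files))
```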
```bash
# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py
```
```python
# inference.py (excerpt)
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
```
```bash
# edit the paths of quantized_model_dir and the tokenizer as needed
# run inference.py
python inference.py
```
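The excerpt above only shows how the quantized model and tokenizer are loaded. Below is a minimal end-to-end sketch of running a prompt through them, reusing those objects; the prompt and decoding settings are illustrative, not the exact contents of inference.py:

```python
import torch

# Continues from the tokenizer and model loaded in the inference.py excerpt above.
prompt = "Write a Python function that reverses a string."  # illustrative prompt

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,  # matches the HumanEval setting listed below
        do_sample=False,     # greedy decoding, comparable to top_k=1
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```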
HumanEval generation parameters: `generation_params = {"max_new_tokens": 512, "top_k": 1, "top_p": 0, "temperature": 1.0}`
Yuan2-M32-HF was run on 2×80GB GPUs for inference; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each ran on a single 80GB GPU.
Results:
Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
---|---|---|---|---|
Yuan2-M32-HF | BF16 | 73.17% | 13.16 token/s | 76.34 GB |
Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 token/s | 39.81 GB |
Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 token/s | 23.27 GB |
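These footprints are roughly what simple weight-size arithmetic predicts. Assuming about 40B total parameters for Yuan2.0-M32 (all experts stay resident in memory even though only a few are active per token), the sketch below estimates raw weight memory per precision; the measured numbers also include activations, KV cache, and quantization metadata, so they deviate somewhat:

```python
# Back-of-the-envelope weight-memory estimate (approximation only).
total_params = 40e9  # assumed total parameter count for Yuan2.0-M32

bytes_per_param = {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = total_params * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")
# BF16 ~80 GB, INT8 ~40 GB, INT4 ~20 GB -- in the same range as the measured usage above.
```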