| Model | Sequence Length | Type | Download |
| --- | --- | --- | --- |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | ModelScope \| HuggingFace \| Netdisk \| Wisemodel |
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | ModelScope \| HuggingFace \| Netdisk \| Wisemodel |
- Environment requirements: CUDA version > 11.8
- Container: create a container using the image provided in the vLLM section
```bash
# enter the docker container
docker exec -it vllm_yuan bash
# enter the working directory
cd /mnt
# clone the repository
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git
# enter the project
cd Yuan2.0-M32/3rd_party/AutoGPTQ
# install auto-gptq
pip install auto-gptq --no-build-isolation
```
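Before starting the quantization, a quick sanity check that the toolchain is in place can save time. This is a minimal sketch that only assumes the packages used in this guide (torch comes with the vLLM image; auto-gptq was just installed) and checks that they import and that GPUs are visible:

```python
import torch
import auto_gptq  # confirms the auto-gptq install above succeeded

# This workflow requires CUDA > 11.8.
print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("GPUs visible:", torch.cuda.device_count())
print("auto-gptq imported from:", auto_gptq.__file__)
```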
Steps for quantizing the Yuan2.0-M32 model:
- Step 1: Download the Yuan2.0-M32-HF model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to the vLLM section.
- Step 2: Download the calibration datasets and move them to the specified path (/mnt/beegfs2/); the expected file format is sketched after this list.
- Step 3: Adjust the parameters in the following script and run the quantization.
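Based on how the quantization script reads the calibration file (see the excerpt below), it expects a JSON array of objects with `instruction` and `output` fields. The following sketch writes a correctly shaped placeholder purely to illustrate the format; use the real downloaded dataset for actual calibration:

```python
import json

# Purely illustrative records; replace with the real calibration data.
samples = [
    {"instruction": "Explain what GPTQ quantization does.",
     "output": "GPTQ compresses trained model weights into low-bit integers."},
    {"instruction": "Write a Python one-liner that reverses a string.",
     "output": "reversed_s = s[::-1]"},
]

# The quantization script reads this path and uses up to the first
# 2000 records as calibration examples.
with open("/mnt/beegfs2/instruct_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```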
```bash
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py
```
```python
# Yuan2-M32-int4.py (excerpt)
import json

from transformers import LlamaTokenizer
from auto_gptq import BaseQuantizeConfig

pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"

tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)

examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file:  # path of the calibration dataset
    data = json.load(file)
    for i, item in enumerate(data):
        if i >= 2000:  # use at most 2000 calibration samples
            break
        instruction = item.get('instruction', '')
        output = item.get('output', '')
        combined_text = instruction + " " + output
        examples.append(tokenizer(combined_text))

max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # 128 is the recommended group size
    desc_act=False,  # False significantly speeds up inference, at a small cost in perplexity
)
```
- Modify `pretrained_model_dir`, and set `quantized_model_dir` to where the quantized model should be written.
- Modify the path of the calibration dataset.
- `max_memory` specifies which GPUs to use and how much memory each may hold.
- Adjust the quantization parameters: set `bits=4` for INT4 or `bits=8` for INT8.
- The other parameters can be left at their default values.
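The excerpt above only covers the configuration. The quantization itself typically follows the standard AutoGPTQ flow sketched below; this continues from the variables defined in the excerpt and is a sketch of that flow, not necessarily the exact remainder of Yuan2-M32-int4.py:

```python
from auto_gptq import AutoGPTQForCausalLM

# Load the full-precision checkpoint, sharded across the GPUs listed in max_memory.
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    max_memory=max_memory,
    trust_remote_code=True,
)

# Run GPTQ calibration on the tokenized examples; this is the long-running step.
model.quantize(examples)

# Write the packed low-bit checkpoint (*.safetensors, config.json, quantize_config.json).
model.save_quantized(quantized_model_dir, use_safetensors=True)
```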
```bash
# run the quantization
python Yuan2-M32-int4.py
```
- The model quantization and packing process takes approximately 8 hours.
- You can quantize the model to INT4 and INT8 at the same time on separate GPUs.
Once quantization completes, the output folder contains checkpoint files with the '.safetensors' suffix, config.json, and quantize_config.json. Before running inference, you first need to copy the tokenizer-related files from the Yuan2-M32-HF path.
```bash
# go to the Yuan2-M32-HF path
cd /mnt/beegfs2/Yuan2-M32-HF
# copy the tokenizer files to the Yuan2-M32-GPTQ-int4 path
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4
```
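A quick, optional check that the quantized folder now contains everything inference needs (a minimal sketch; the expected files are the ones produced and copied in the steps above):

```python
import os

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
files = set(os.listdir(quantized_model_dir))

# Written by the quantization step, plus the tokenizer files copied above.
required = {"config.json", "quantize_config.json", "special_tokens_map.json"}

print("missing:", required - files or "none")
print("packed weights (.safetensors):", any(f.endswith(".safetensors") for f in files))
print("tokenizer files:", any(f.startswith("tokenizer") for f in files))
```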
```bash
# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py
```
```python
# inference.py (excerpt)
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
```
```bash
# edit the paths of quantized_model_dir and the tokenizer as needed
# run inference.py
python inference.py
```
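The excerpt above only shows how the quantized model and tokenizer are loaded. Below is a minimal end-to-end sketch of running a prompt through them, reusing those objects; the prompt and decoding settings are illustrative, not the exact contents of inference.py:

```python
import torch

# Continues from the tokenizer and model loaded in the inference.py excerpt above.
prompt = "Write a Python function that reverses a string."  # illustrative prompt

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,  # matches the HumanEval setting listed below
        do_sample=False,     # greedy decoding, comparable to top_k=1
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```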
HumanEval generation parameters: `generation_params = {"max_new_tokens": 512, "top_k": 1, "top_p": 0, "temperature": 1.0}`
Yuan2-M32-HF was run on 2×80GB GPUs for inference; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each ran on a single 80GB GPU.
Results:
Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
---|---|---|---|---|
Yuan2-M32-HF | BF16 | 73.17% | 13.16 token/s | 76.34 GB |
Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 token/s | 39.81 GB |
Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 token/s | 23.27 GB |
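These footprints are roughly what simple weight-size arithmetic predicts. Assuming about 40B total parameters for Yuan2.0-M32 (all experts stay resident in memory even though only a few are active per token), the sketch below estimates raw weight memory per precision; the measured numbers also include activations, KV cache, and quantization metadata, so they deviate somewhat:

```python
# Back-of-the-envelope weight-memory estimate (approximation only).
total_params = 40e9  # assumed total parameter count for Yuan2.0-M32

bytes_per_param = {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = total_params * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")
# BF16 ~80 GB, INT8 ~40 GB, INT4 ~20 GB -- in the same range as the measured usage above.
```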