Important
The End for QwenLM/vllm-gptq
vLLM has supported 4-bit GPTQ since December 2023 and 8-bit GPTQ since March 2024. It now also includes Marlin and MoE support.
This repository has fulfilled its role. We recommend switching to upstream vLLM for Qwen models to benefit from the latest features and ongoing improvements.
This repository is a fork of vLLM (version 0.2.2), modified to support GPTQ quantized inference for the Qwen series of large language models.
The main difference from the official vLLM 0.2.2 is added support for GPTQ int4 quantized models. We benchmarked the quantized model on Qwen-72B-Chat; the results are shown in the table below.
All throughput values are in tokens/s.

| context length | generate length | tp=8, fp16 (a16w16) | tp=8, int4 (a16w4) | tp=4, fp16 (a16w16) | tp=4, int4 (a16w4) | tp=2, fp16 (a16w16) | tp=2, int4 (a16w4) | tp=1, fp16 (a16w16) | tp=1, int4 (a16w4) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2k | 26.42 | 27.68 | 24.98 | 27.19 | 17.39 | 20.76 | - | 14.63 |
| 6k | 2k | 24.93 | 25.98 | 22.76 | 24.56 | - | 18.07 | - | - |
| 14k | 2k | 22.67 | 22.87 | 19.38 | 19.28 | - | 14.51 | - | - |
| 30k | 2k | 19.95 | 19.87 | 17.05 | 16.93 | - | - | - | - |
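To make the comparison in the table easier to read, the following minimal sketch computes the int4-to-fp16 throughput ratio for the tp=8 columns. The numbers are copied from the table; the names `TP8` and `int4_speedup` are illustrative and not part of this repository.

```python
# Throughput (tokens/s) copied from the table above for tp=8, 2k generated tokens.
# Keys are context lengths; values are (fp16 a16w16, int4 a16w4).
TP8 = {
    "1": (26.42, 27.68),
    "6k": (24.93, 25.98),
    "14k": (22.67, 22.87),
    "30k": (19.95, 19.87),
}


def int4_speedup(fp16_tps: float, int4_tps: float) -> float:
    """Return int4 throughput relative to fp16 (above 1.0 means int4 is faster)."""
    return int4_tps / fp16_tps


if __name__ == "__main__":
    for context, (fp16, int4) in TP8.items():
        ratio = int4_speedup(fp16, int4)
        print(f"context={context:>3}  int4/fp16 = {ratio:.3f}")
```

At tp=8 the ratio stays close to 1.0 and shrinks as the context grows. The tp=1 columns show the other side of the trade-off: the table reports a result there only for int4.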
To install vLLM, your environment must meet the following requirements:
- torch >= 2.0
- CUDA 11.8 or 12.1
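The requirements above can be checked before building. This is a minimal sketch: the helper names `parse_major_minor` and `meets_requirements` are hypothetical, and the check takes version strings as input so that it can run without PyTorch installed.

```python
import re
from typing import Optional, Tuple

# CUDA versions listed in the requirements above.
SUPPORTED_CUDA = {(11, 8), (12, 1)}


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.1.0+cu121' or '12.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirements(torch_version: str, cuda_version: Optional[str]) -> bool:
    """Check torch >= 2.0 and CUDA 11.8 or 12.1."""
    if cuda_version is None:  # e.g. a CPU-only PyTorch build
        return False
    torch_ok = parse_major_minor(torch_version) >= (2, 0)
    cuda_ok = parse_major_minor(cuda_version) in SUPPORTED_CUDA
    return torch_ok and cuda_ok
```

With PyTorch installed, the inputs can be taken from `torch.__version__` and `torch.version.cuda`, for example `meets_requirements(torch.__version__, torch.version.cuda)`.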
Currently, vLLM can only be installed from source.
If you use CUDA 12.1 and torch 2.1, you can install it as follows:
git clone https://github.com/QwenLM/vllm-gptq.git
cd vllm-gptq
pip install -e .
In other cases, installation may be more involved. One possible approach is to install the matching versions of CUDA and PyTorch, delete the torch dependency from `requirements.txt`, delete `pyproject.toml`, and then run `pip install -e .` again.
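The manual edit to `requirements.txt` can also be scripted. This is a minimal sketch, assuming the torch dependency sits on its own line; `drop_torch_requirement` is a hypothetical helper, and the sample contents in the comment are made up for illustration rather than taken from this repository's file.

```python
import re

# Matches a requirement line whose package name is exactly "torch",
# e.g. "torch", "torch>=2.1.0" or "torch == 2.1.0".
# Lines such as "torchvision" are left alone.
_TORCH_LINE = re.compile(r"^\s*torch\s*($|[<>=!~;\[#])")


def drop_torch_requirement(requirements_text: str) -> str:
    """Return the requirements text with the torch line(s) removed."""
    kept = [
        line
        for line in requirements_text.splitlines()
        if not _TORCH_LINE.match(line)
    ]
    return "\n".join(kept) + "\n"


# Illustrative input only: "ninja\ntorch >= 2.1.0\nxformers\n"
# becomes "ninja\nxformers\n".
```

To apply it, read `requirements.txt`, write the returned text back, delete `pyproject.toml`, and then run `pip install -e .` from the repository root.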
This section covers only how to run Qwen's quantized models.

- To learn more about using the Qwen series models, visit [Qwen's official repo](https://github.com/qwenlm/qwen).
- To use other vLLM features, read the [official documentation](https://github.com/vllm-project/vllm).
Example code for Qwen quantized models is in the `tests/qwen/` directory.

Note: This repository currently supports only Int4 quantized models. Int8 quantized models will be supported later.

Note: To run the following code, first change into the `tests/qwen/` directory.
from vllm_wrapper import vLLMWrapper

if __name__ == '__main__':
    model = "Qwen/Qwen-72B-Chat-Int4"

    # Load the GPTQ int4 checkpoint on a single GPU.
    vllm_model = vLLMWrapper(model,
                             quantization='gptq',
                             dtype="float16",
                             tensor_parallel_size=1)

    # Query: "Hello"
    response, history = vllm_model.chat(query="你好",
                                        history=None)
    print(response)

    # Query: "Tell me a story about a young person who works hard to start
    # a business and eventually succeeds."
    response, history = vllm_model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。",
                                        history=history)
    print(response)

    # Query: "Give this story a title"
    response, history = vllm_model.chat(query="给这个故事起一个标题",
                                        history=history)
    print(response)
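The example above passes the `history` returned by each `chat` call into the next one. The loop below sketches that pattern in isolation. `run_conversation` and `echo_chat` are hypothetical names, and `echo_chat` is only a stand-in that mimics the `(response, history)` return shape; storing the history as a list of `(query, response)` pairs is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

ChatFn = Callable[..., Tuple[str, list]]


def run_conversation(chat: ChatFn, queries: List[str]) -> List[str]:
    """Send queries in order, feeding each returned history into the next turn."""
    history = None
    responses = []
    for query in queries:
        response, history = chat(query=query, history=history)
        responses.append(response)
    return responses


def echo_chat(query: str, history=None) -> Tuple[str, list]:
    """Stand-in for vllm_model.chat, for illustration only (no model involved)."""
    history = list(history or [])
    response = f"echo: {query}"
    history.append((query, response))  # assumed history layout
    return response, history


if __name__ == "__main__":
    for reply in run_conversation(echo_chat, ["Hello", "Give this story a title"]):
        print(reply)
```

With a real model, `run_conversation(vllm_model.chat, queries)` would follow the same flow, provided `chat` keeps the keyword arguments shown in the example.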
Besides installing vLLM, calling the model through an API also requires FastChat:
pip install fschat
Step 1. Launch the controller
python -m fastchat.serve.controller
Step 2. Launch the model worker
python -m fastchat.serve.vllm_worker --model-path $model_path --tensor-parallel-size 1 --trust-remote-code
Step 3. Launch the OpenAI API server
python -m fastchat.serve.openai_api_server --host localhost --port 8000
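Once the three processes are running, it helps to confirm which model names the server has registered before sending chat requests. This is a minimal sketch, assuming the server exposes the OpenAI-style `/v1/models` endpoint on the host and port used above; `model_ids` and `fetch_model_ids` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def model_ids(models_payload: dict) -> list:
    """Extract model names from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_model_ids(api_base: str = "http://localhost:8000/v1") -> list:
    """Query the running server for its registered models (assumed endpoint)."""
    with urlopen(f"{api_base}/models", timeout=10) as response:
        return model_ids(json.load(response))
```

Calling `fetch_model_ids()` against the running server returns the names that can be passed as `model` in later requests.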
Step 1. Install openai-python. The example below uses the `openai.ChatCompletion` interface, which was removed in openai 1.0, so install an earlier release.
pip install "openai<1.0"
Step 2. Query the API
import openai

# To get proper authentication, make sure to use a valid key that's listed in
# the --api-keys flag. If no flag value is provided, the `api_key` will be ignored.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

model = "qwen"
call_args = {
    'temperature': 1.0,
    'top_p': 1.0,
    'top_k': -1,
    'max_tokens': 2048,  # output-len
    'presence_penalty': 1.0,
    'frequency_penalty': 0.0,
}

# Create a chat completion
completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
    **call_args
)

# Print the completion
print(completion.choices[0].message.content)
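The same request can be sketched with only the standard library, which avoids any dependency on a particular openai-python release. This assumes the server accepts OpenAI-style POST requests at `/v1/chat/completions` under the `api_base` used above; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(api_base: str, model: str, messages: list, **call_args) -> Request:
    """Build an OpenAI-style chat completion request (assumed endpoint path)."""
    body = {"model": model, "messages": messages, **call_args}
    return Request(
        f"{api_base}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(api_base: str, model: str, messages: list, **call_args) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(api_base, model, messages, **call_args)
    with urlopen(request, timeout=600) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

For example, `chat("http://localhost:8000/v1", "qwen", [{"role": "user", "content": "Hello! What is your name?"}], max_tokens=2048)` sends the same prompt as the example above.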