Using vLLM

1. Start vLLM first, with a model of your choice

```bash
python -m vllm.entrypoints.openai.api_server --model /home/hmp/llm/cache/Qwen1___5-32B-Chat --tensor-parallel-size 2 --dtype=half
```

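This example serves a local model stored at /home/hmp/llm/cache/Qwen1___5-32B-Chat; change the path to suit your needs. `--tensor-parallel-size 2` shards the model across two GPUs and `--dtype=half` loads the weights in FP16, so adjust both to match your hardware.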
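To confirm the server is up, you can query the OpenAI-compatible `GET /v1/models` endpoint. This also shows the exact model name it serves, which by default is the `--model` value and is the name requests must use. The sketch below uses only the Python standard library and assumes the default address `http://localhost:8000`. The helper names `parse_model_ids` and `list_served_models` are hypothetical.

```python
import json
import urllib.request

# Assumption: vLLM is listening on its default address.
BASE_URL = "http://localhost:8000"


def parse_model_ids(payload: dict) -> list:
    """Extract the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list:
    """GET /v1/models from the vLLM server and return the served model names."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

After the server has finished loading, `list_served_models()` should return a list containing `/home/hmp/llm/cache/Qwen1___5-32B-Chat` for the command above.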

2. Test vLLM

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/hmp/llm/cache/Qwen1___5-32B-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "How do I implement a decentralized controller?"}
    ]
  }'
```
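If you prefer to test from Python instead of curl, here is a minimal standard-library sketch of the same request. It assumes the server address and model path used above. `build_chat_request` and `chat` are hypothetical names. The reply is read from the OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request

# Assumption: same endpoint and model path as the curl command.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "/home/hmp/llm/cache/Qwen1___5-32B-Chat"


def build_chat_request(prompt: str, model: str = MODEL, url: str = ENDPOINT) -> urllib.request.Request:
    """Build the same JSON POST request that the curl example sends."""
    body = {
        "model": model,  # must match the name vLLM serves
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        result = json.load(resp)
    return result["choices"][0]["message"]["content"]
```

For example, `print(chat("How do I implement a decentralized controller?"))` asks the same question as the curl command.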

3. Configure this project

```python
API_KEY = "sk-123456789xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123456789"
LLM_MODEL = "vllm-/home/hmp/llm/cache/Qwen1___5-32B-Chat(max_token=4096)"
API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "http://localhost:8000/v1/chat/completions"}
```
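The value "vllm-/home/hmp/llm/cache/Qwen1___5-32B-Chat(max_token=4096)" breaks down as follows:
  "vllm-"                                     is the prefix (required)
  "/home/hmp/llm/cache/Qwen1___5-32B-Chat"    is the model name (required); it must match the name vLLM serves, i.e. the --model value from step 1
  "(max_token=4096)"                          is an optional setting (the maximum token count)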
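A small parser can illustrate how the `LLM_MODEL` string is structured. This is a hypothetical sketch of the format described above, not the project's own parsing code. It assumes the optional suffix always has exactly the form `(max_token=N)`.

```python
import re
from typing import Optional, Tuple

# Hypothetical illustration of the "vllm-<model>(max_token=N)" format;
# this is NOT the project's actual parser.
_PATTERN = re.compile(r"^vllm-(?P<model>.+?)(?:\(max_token=(?P<max_token>\d+)\))?$")


def parse_llm_model(value: str) -> Tuple[str, Optional[int]]:
    """Split an LLM_MODEL string into (model name, max_token or None)."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a vllm- model string: {value!r}")
    raw_max = match.group("max_token")
    max_token = int(raw_max) if raw_max is not None else None
    return match.group("model"), max_token
```

With the configuration above, `parse_llm_model(LLM_MODEL)` yields the model path `/home/hmp/llm/cache/Qwen1___5-32B-Chat` and `4096`. Without the suffix, the second element is `None`.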

4. Launch!

```bash
python main.py
```