Deploy LLMs across multiple worker nodes. This template makes heavy use of the vLLM library and is designed to work with GPU workers.
workers
: Number of workers per deployment (for tensor and pipeline parallelism, i.e. how many pieces to divide the model into)

model_id
: Huggingface repository to load the model from. This usually takes the form of OrgName/ModelID

working_memory
: Temporary disk space where model weights are placed for loading. It needs to be big enough to hold the entire model weights on a single worker node.

hf_token (optional)
: Huggingface token, required to load licensed model weights

extra (optional)
: Any extra parameters to pass to the vLLM engine. Expected format: --parameter1_name parameter1_value --parameterX_name parameterX_value

tensor_parallel_size
: Tensor parallelism (use the number of GPUs per node)

pipeline_parallel_size
: Pipeline parallelism (use the number of nodes)
If you have a LiteLLM server deployed in your pool (the default for the public LLM pool), you can pass the following parameters to register the model with it:

litellm_key
: API key for the LiteLLM server

litellm_base_url
: Endpoint of the LiteLLM job
Get default values, edit them and deploy:
kalavai job defaults vllm > values.yaml
# edit values.yaml as required
kalavai job run vllm --values values.yaml
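As an illustration, an edited values.yaml for a small Qwen deployment might look like the sketch below. The field names are assumed to mirror the template parameters described above, and every value (model, disk size, flags, LiteLLM address) is a placeholder; the file produced by kalavai job defaults vllm is the authoritative reference for the exact schema and defaults.
# Illustrative values only -- replace with your own settings
workers: 2                               # total worker nodes for this deployment
model_id: Qwen/Qwen2.5-0.5B-Instruct     # Huggingface repository (OrgName/ModelID)
working_memory: 10Gi                     # temporary disk space to hold the model weights
hf_token: ""                             # only needed for licensed/gated models
tensor_parallel_size: 1                  # GPUs per node
pipeline_parallel_size: 2                # number of nodes
extra: "--dtype float16"                 # extra flags passed to the vLLM engine
litellm_key: ""                          # optional: API key to register with LiteLLM
litellm_base_url: ""                     # optional: LiteLLM endpoint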
Find out the URL endpoint of the model with:
$ kalavai job list
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Deployment        ┃ Status                            ┃ Endpoint               ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ vllm-deployment-1 │ Available: All replicas are ready │ http://100.8.0.2:31992 │
└───────────────────┴───────────────────────────────────┴────────────────────────┘
This is a model endpoint that you can interact with as you would with any LLM server. For example:
curl http://100.8.0.2:31992/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 100,
        "temperature": 0
    }'
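Since vLLM exposes an OpenAI-compatible API, you can also list the models served at the endpoint to confirm the deployment is up:
curl http://100.8.0.2:31992/v1/models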
Or from Python:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://100.8.0.2:31992/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
Check out the qwen example, ready for deployment.