vLLM template

Deploy large language models across multiple worker nodes using the vLLM library. This template is designed to work with GPU workers.

External references

This template makes heavy use of the vLLM library.

Key template variables

  • workers: Number of worker nodes per deployment (used for tensor and pipeline parallelism, i.e. how many pieces the model is divided into)
  • model_id: Hugging Face repository to load the model from. This usually takes the form OrgName/ModelID
  • working_memory: Temporary disk space where model weights are staged for loading. Must be large enough to hold the full model weights on a single worker node.
  • hf_token (optional): Hugging Face token, required to load gated or licensed model weights
  • extra (optional): any extra parameters to pass to the vLLM engine. Expected format: --parameter1_name parameter1_value --parameterX_name parameterX_value
  • tensor_parallel_size: Tensor parallelism (use the number of GPUs per node)
  • pipeline_parallel_size: Pipeline parallelism (use the number of nodes)

If you have a LiteLLM server deployed in your pool (the default for public LLM pools), you can pass the following parameters to register the model with it:

  • litellm_key as the API key.
  • litellm_base_url as the endpoint for the LiteLLM job.
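
Once registered, the model can also be queried through the LiteLLM endpoint. Below is a minimal sketch assuming LiteLLM's standard OpenAI-compatible proxy API; the API key, base URL and model name are placeholders to be replaced with your deployment's values:

from openai import OpenAI

# Placeholders: use the litellm_key and litellm_base_url values from your deployment.
client = OpenAI(
    api_key="sk-your-litellm-key",            # litellm_key
    base_url="http://your-litellm-endpoint",  # litellm_base_url
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # the name the model was registered under
    prompt="San Francisco is a",
)
print(completion)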

How to use

Get default values, edit them and deploy:

kalavai job defaults vllm > values.yaml
# edit values.yaml as required
kalavai job run vllm --values values.yaml

Find out the URL endpoint of the model with:

$ kalavai job list 

┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Deployment        ┃ Status                            ┃ Endpoint               ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ vllm-deployment-1 │ Available: All replicas are ready │ http://100.8.0.2:31992 │
└───────────────────┴───────────────────────────────────┴────────────────────────┘

This is a model endpoint that can be interacted with as you would any OpenAI-compatible LLM server. For example:

curl http://100.8.0.2:31992/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 100,
        "temperature": 0
    }'

Or from Python:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://100.8.0.2:31992/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

Examples

Check out the qwen example, ready for deployment.