Add vLLM TPU Multihost example #234

Merged · 6 commits · Oct 24, 2024
File renamed without changes.
@@ -16,16 +16,12 @@ spec:
   - name: vllm-leader
     image: <image-built-from-dockerfile>
     env:
-      - name: RAY_CLUSTER_SIZE
-        valueFrom:
-          fieldRef:
-            fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
       - name: HUGGING_FACE_HUB_TOKEN
         value: <your-hf-token>
     command:
       - sh
       - -c
-      - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE;
+      - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
         python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
     resources:
       limits:
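
This hunk replaces the downward-API lookup of the group-size annotation with `$(LWS_GROUP_SIZE)`, an environment variable that LWS injects into the pods of each group. A quick, hedged way to confirm the variable is present in a running leader (the pod and container names here follow the GPU example's naming):

```shell
# Print the injected group size from the leader container
kubectl exec vllm-0 -c vllm-leader -- printenv LWS_GROUP_SIZE
```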
File renamed without changes.
118 changes: 118 additions & 0 deletions docs/examples/vllm/TPU/README.md
@@ -0,0 +1,118 @@
# Deploy Distributed Inference Service with vLLM and LWS on TPUs

In this example, we use LeaderWorkerSet (LWS) to deploy a multi-host inference service with vLLM on a multi-host TPU slice, using [Ray](https://docs.ray.io/en/latest/index.html) as the distributed runtime. The main differences from the GPU deployment are the node selector, the resources requested, and the vLLM image used.

## Install LeaderWorkerSet

Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)
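
For reference, installation usually amounts to applying a released manifest; the version below is only a placeholder, so use the release recommended in the guide:

```shell
# Install a released version of LWS (the version tag is an assumption; check the install guide for the latest)
LWS_VERSION=v0.4.0
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/lws/releases/download/${LWS_VERSION}/manifests.yaml"
```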


## Deploy LeaderWorkerSet of vLLM
In this example, we use LeaderWorkerSet to deploy two vLLM server replicas serving the Llama 3 70B model, one per v5e-16 TPU slice. You can find instructions for creating a cluster with multiple TPU slices [here](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus); a rough provisioning sketch is also shown below. Since a v5e-16 slice has four hosts, each vLLM replica consists of 4 pods (a leader plus 3 workers), and each pod consumes 4 TPU chips (with tensor_parallel_size=16).
The leader pod runs the Ray head and the vLLM server, while the workers run the Ray workers.
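
As a rough sketch (not the authoritative procedure; see the guide linked above), a single multi-host v5e slice with a 4x4 topology can be provisioned on GKE roughly as follows. The cluster, zone, and pool names are placeholders, and you need one such slice per replica:

```shell
# Create one 4-host v5e-16 slice (4 chips per host); add a second pool for the second replica
gcloud container node-pools create tpu-v5e-16-pool \
  --cluster=CLUSTER_NAME \
  --zone=COMPUTE_ZONE \
  --machine-type=ct5lp-hightpu-4t \
  --tpu-topology=4x4 \
  --num-nodes=4
```

Once the TPU slices are available, deploy the LeaderWorkerSet: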

```shell
kubectl apply -f lws.yaml
```

Verify the status of the vLLM pods
```shell
kubectl get pods
```

You should see output similar to the following
```shell
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-0-2   1/1     Running   0          2s
vllm-0-3   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
vllm-1-2   1/1     Running   0          2s
vllm-1-3   1/1     Running   0          2s
```
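
Model download and Ray startup can take a while. If you prefer to block until the whole group is ready, the standard LWS name label (the same label used by the Service selector later in this example) works with `kubectl wait`:

```shell
# Wait for all leader and worker pods of the "vllm" LeaderWorkerSet to become Ready
kubectl wait --for=condition=Ready pod \
  -l leaderworkerset.sigs.k8s.io/name=vllm \
  --timeout=20m
```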

Verify that it works by viewing the leader logs
```shell
kubectl logs vllm-0 -c vllm-leader
```

The output should be similar to the following
```
INFO 10-02 00:21:40 launcher.py:20] Available routes are:
INFO 10-02 00:21:40 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /health, Methods: GET
INFO 10-02 00:21:40 launcher.py:28] Route: /tokenize, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /detokenize, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/models, Methods: GET
INFO 10-02 00:21:40 launcher.py:28] Route: /version, Methods: GET
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 10-02 00:21:40 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [6803]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
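
Since each replica is a 4-host Ray cluster, it can also be worth confirming that all hosts joined the Ray head before sending traffic. A hedged check, assuming the `ray` CLI is available on the leader container's PATH:

```shell
# Show the Ray cluster's node and resource summary; a healthy v5e-16 replica should report 4 hosts
kubectl exec vllm-0 -c vllm-leader -- ray status
```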

## Deploy ClusterIP Service

Apply the `service.yaml` manifest

```shell
kubectl apply -f service.yaml
```

Use `kubectl port-forward` to forward local port 8080 to a pod.
```shell
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following
```shell
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
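
With the port-forward running, a quick smoke test against the endpoints listed in the leader's startup log confirms the server is reachable before sending real prompts:

```shell
# Both routes appear in the leader's log above; /v1/models also reports the served model name
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```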

## Serve the model

Open another terminal and send a request
```shell
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Meta-Llama-3-70B",
  "prompt": "San Francisco is a",
  "max_tokens": 7,
  "temperature": 0
}'
```

The output should be similar to the following
```json
{
  "id": "cmpl-7e795f36f17545eabd451a6dd8f70ce2",
  "object": "text_completion",
  "created": 1727733988,
  "model": "meta-llama/Meta-Llama-3-70B",
  "choices": [
    {
      "index": 0,
      "text": " top holiday destination featuring scenic beauty and",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
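
## Clean up

When you are done with the example, delete the resources created above:

```shell
kubectl delete -f service.yaml -f lws.yaml
```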
77 changes: 77 additions & 0 deletions docs/examples/vllm/TPU/lws.yaml
@@ -0,0 +1,77 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 1
  replicas: 2
  leaderWorkerTemplate:
    size: 4
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 4x4
        containers:
          - name: vllm-leader
            image: <image-built-from-dockerfile>
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3-70B --tensor-parallel-size=16"
            resources:
              limits:
                google.com/tpu: "4"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 30Gi
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 4x4
        containers:
          - name: vllm-worker
            image: <image-built-from-dockerfile>
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                google.com/tpu: "4"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 30Gi
14 changes: 14 additions & 0 deletions docs/examples/vllm/TPU/service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
@@ -1,2 +1,2 @@
-FROM docker.io/vllm/vllm-openai:v0.5.3.post1
+FROM docker.io/vllm/vllm-openai:latest
 COPY ray_init.sh /vllm-workspace/ray_init.sh
4 changes: 4 additions & 0 deletions docs/examples/vllm/build/Dockerfile.TPU
@@ -0,0 +1,4 @@
# The TPU vLLM image can be built using the Dockerfile found
# in the vLLM repo: https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu
FROM tpu_vllm_image
COPY ray_init.sh /vllm-workspace/ray_init.sh
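
A hedged sketch of the overall build flow, assuming you build the upstream TPU image locally and tag it `tpu_vllm_image` to match the `FROM` line above; the registry and image names are placeholders:

```shell
# Build the upstream vLLM TPU base image (the Dockerfile lives in the vLLM repo)
git clone https://github.com/vllm-project/vllm.git
docker build -f vllm/Dockerfile.tpu -t tpu_vllm_image vllm/

# Layer ray_init.sh on top using this example's Dockerfile.TPU, then push somewhere your cluster can pull from
docker build -f Dockerfile.TPU -t <your-registry>/vllm-tpu-lws:latest .
docker push <your-registry>/vllm-tpu-lws:latest
```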