Add vLLM TPU Multihost example #234

Merged · 6 commits · Oct 24, 2024
File renamed without changes.
@@ -16,16 +16,12 @@ spec:
   - name: vllm-leader
     image: <image-built-from-dockerfile>
     env:
-      - name: RAY_CLUSTER_SIZE
-        valueFrom:
-          fieldRef:
-            fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
       - name: HUGGING_FACE_HUB_TOKEN
         value: <your-hf-token>
     command:
       - sh
       - -c
-      - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE;
+      - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
         python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
     resources:
       limits:
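
This hunk replaces the downward-API lookup of the group-size annotation with `$(LWS_GROUP_SIZE)`, an environment variable that LWS injects into the pods of each group. A quick, hedged way to confirm the variable is present in a running leader (the pod and container names here follow the GPU example's naming):

```shell
# Print the injected group size from the leader container
kubectl exec vllm-0 -c vllm-leader -- printenv LWS_GROUP_SIZE
```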
File renamed without changes.
118 changes: 118 additions & 0 deletions docs/examples/vllm/TPU/README.md
@@ -0,0 +1,118 @@
# Deploy Distributed Inference Service with vLLM and LWS on TPUs

In this example, we use LeaderWorkerSet (LWS) to deploy a multi-host inference service with vLLM on a multi-host TPU slice, using [Ray](https://docs.ray.io/en/latest/index.html) as the distributed runtime. The main differences from the GPU deployment are the node selector, the resources requested, and the vLLM image used.

## Install LeaderWorkerSet

Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)
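
For reference, installation usually amounts to applying a released manifest; the version below is only a placeholder, so use the release recommended in the guide:

```shell
# Install a released version of LWS (the version tag is an assumption; check the install guide for the latest)
LWS_VERSION=v0.4.0
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/lws/releases/download/${LWS_VERSION}/manifests.yaml"
```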


## Deploy LeaderWorkerSet of vLLM
In this example, we use LeaderWorkerSet to deploy two vLLM server replicas serving the Llama 3 70B model, one per v5e-16 TPU slice. You can find instructions for creating a cluster with multiple TPU slices [here](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus); a rough provisioning sketch is also shown below. Since a v5e-16 slice has four hosts, each vLLM replica consists of 4 pods (a leader plus 3 workers), and each pod consumes 4 TPU chips (with tensor_parallel_size=16).
The leader pod runs the Ray head and the vLLM server, while the workers run the Ray workers.
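
As a rough sketch (not the authoritative procedure; see the guide linked above), a single multi-host v5e slice with a 4x4 topology can be provisioned on GKE roughly as follows. The cluster, zone, and pool names are placeholders, and you need one such slice per replica:

```shell
# Create one 4-host v5e-16 slice (4 chips per host); add a second pool for the second replica
gcloud container node-pools create tpu-v5e-16-pool \
  --cluster=CLUSTER_NAME \
  --zone=COMPUTE_ZONE \
  --machine-type=ct5lp-hightpu-4t \
  --tpu-topology=4x4 \
  --num-nodes=4
```

Once the TPU slices are available, deploy the LeaderWorkerSet: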

```shell
kubectl apply -f lws.yaml
```

Verify the status of the vLLM pods
```shell
kubectl get pods
```

You should see output similar to the following
```shell
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-0-2   1/1     Running   0          2s
vllm-0-3   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
vllm-1-2   1/1     Running   0          2s
vllm-1-3   1/1     Running   0          2s
```
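
Model download and Ray startup can take a while. If you prefer to block until the whole group is ready, the standard LWS name label (the same label used by the Service selector later in this example) works with `kubectl wait`:

```shell
# Wait for all leader and worker pods of the "vllm" LeaderWorkerSet to become Ready
kubectl wait --for=condition=Ready pod \
  -l leaderworkerset.sigs.k8s.io/name=vllm \
  --timeout=20m
```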

Verify that it works by viewing the leader logs
```shell
kubectl logs vllm-0 -c vllm-leader
```

The output should be similar to the following
```
INFO 10-02 00:21:40 launcher.py:20] Available routes are:
INFO 10-02 00:21:40 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 10-02 00:21:40 launcher.py:28] Route: /health, Methods: GET
INFO 10-02 00:21:40 launcher.py:28] Route: /tokenize, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /detokenize, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/models, Methods: GET
INFO 10-02 00:21:40 launcher.py:28] Route: /version, Methods: GET
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 10-02 00:21:40 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 10-02 00:21:40 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [6803]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
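
Since each replica is a 4-host Ray cluster, it can also be worth confirming that all hosts joined the Ray head before sending traffic. A hedged check, assuming the `ray` CLI is available on the leader container's PATH:

```shell
# Show the Ray cluster's node and resource summary; a healthy v5e-16 replica should report 4 hosts
kubectl exec vllm-0 -c vllm-leader -- ray status
```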

## Deploy ClusterIP Service

Apply the `service.yaml` manifest

```shell
kubectl apply -f service.yaml
```

Use `kubectl port-forward` to forward local port 8080 to a pod.
```shell
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following
```shell
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
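
With the port-forward running, a quick smoke test against the endpoints listed in the leader's startup log confirms the server is reachable before sending real prompts:

```shell
# Both routes appear in the leader's log above; /v1/models also reports the served model name
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```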

## Serve the model

Open another terminal and send a request
```shell
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Meta-Llama-3-70B",
  "prompt": "San Francisco is a",
  "max_tokens": 7,
  "temperature": 0
}'
```

The output should be similar to the following
```json
{
  "id": "cmpl-7e795f36f17545eabd451a6dd8f70ce2",
  "object": "text_completion",
  "created": 1727733988,
  "model": "meta-llama/Meta-Llama-3-70B",
  "choices": [
    {
      "index": 0,
      "text": " top holiday destination featuring scenic beauty and",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
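
## Clean up

When you are done with the example, delete the resources created above:

```shell
kubectl delete -f service.yaml -f lws.yaml
```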
77 changes: 77 additions & 0 deletions docs/examples/vllm/TPU/lws.yaml
@@ -0,0 +1,77 @@
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 1
  replicas: 2
  leaderWorkerTemplate:
    size: 4
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 4x4
        containers:
          - name: vllm-leader
            image: <image-built-from-dockerfile>
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3-70B --tensor-parallel-size=16"
            resources:
              limits:
                google.com/tpu: "4"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 30Gi
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 4x4
        containers:
          - name: vllm-worker
            image: <image-built-from-dockerfile>
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                google.com/tpu: "4"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 30Gi
14 changes: 14 additions & 0 deletions docs/examples/vllm/TPU/service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
@@ -1,2 +1,2 @@
-FROM docker.io/vllm/vllm-openai:v0.5.3.post1
+FROM docker.io/vllm/vllm-openai:latest
 COPY ray_init.sh /vllm-workspace/ray_init.sh
4 changes: 4 additions & 0 deletions docs/examples/vllm/build/Dockerfile.TPU
@@ -0,0 +1,4 @@
# The TPU vLLM image can be built using the Dockerfile found
# in the vLLM repo: https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu
FROM tpu_vllm_image
COPY ray_init.sh /vllm-workspace/ray_init.sh
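
A hedged sketch of the overall build flow, assuming you build the upstream TPU image locally and tag it `tpu_vllm_image` to match the `FROM` line above; the registry and image names are placeholders:

```shell
# Build the upstream vLLM TPU base image (the Dockerfile lives in the vLLM repo)
git clone https://github.com/vllm-project/vllm.git
docker build -f vllm/Dockerfile.tpu -t tpu_vllm_image vllm/

# Layer ray_init.sh on top using this example's Dockerfile.TPU, then push somewhere your cluster can pull from
docker build -f Dockerfile.TPU -t <your-registry>/vllm-tpu-lws:latest .
docker push <your-registry>/vllm-tpu-lws:latest
```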