
add vllm distributed inference examples #133

Merged · 2 commits merged from the doc/distributed-vLLM branch into kubernetes-sigs:main on May 10, 2024

Conversation

@gujingit (Contributor) commented May 8, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it

Add distributed vLLM inference examples

Which issue(s) this PR fixes

Fixes #124

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label May 8, 2024
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and liurupeng May 8, 2024 04:31
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 8, 2024
@k8s-ci-robot (Contributor)

Hi @gujingit. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 8, 2024
@gujingit gujingit changed the title add distributed vllm inference examples add vllm distributed inference examples May 8, 2024
@liurupeng (Collaborator)

exciting! this is what I have been waiting for!

@liurupeng (Collaborator)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2024
@kerthcet (Contributor) commented May 8, 2024

/assign

@kerthcet (Contributor) commented May 8, 2024

Hi @gujingit, have you tried this across hosts? Could you also provide the device info, like GPU cards, number of nodes, and network, to make this clearer? Thanks a lot.

@kerthcet (Contributor) left a comment

I had a try but it always OOMed; I'm trying to find the root cause.

```yaml
- sh
- -c
- "/vllm-workspace/ray_init.sh head --ray_cluster_size=$RAY_CLUSTER_SIZE;
   python3 -m vllm.entrypoints.openai.api_server --port 8080 --model facebook/opt-125m --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
```
Contributor

Let's use the default GPU memory utilization; it's already high enough, I think. That means removing the --gpu-memory-utilization flag.

Collaborator

So the autoscaling is still done by Ray, right? If Ray wants to spin up more workers for scaling, right now that won't be possible.

Contributor

I don't think we want Ray to do any autoscaling; ideally, the leader emits a metric that triggers an HPA to autoscale the LWS.
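As a purely illustrative sketch of that idea (not part of this PR): assuming the LeaderWorkerSet exposes the scale subresource and the leader's metric is surfaced through a custom metrics adapter under a hypothetical name vllm_num_requests_waiting, an HPA targeting the LWS could look roughly like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1   # adjust to the API group/version of the installed LWS release
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # hypothetical metric name emitted by the vLLM leader
      target:
        type: AverageValue
        averageValue: "10"
```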

Contributor

vLLM provisions Ray Actors based on the tensor parallel size. The LWS configuration sets up the Ray Nodes necessary to run the requested Ray Actors, so it's static in this setup.

Contributor

No configuration, no scaling, I think.


Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
Contributor

Add a blank line below.

```shell
kubectl apply -f lws.yaml
```

Verify the status of the vLLM Deployment
Contributor

Suggested change
Verify the status of the vLLM Deployment
Verify the status of the vLLM Pods
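As a hedged aside for readers following the guide (the exact commands in the README may differ), the vLLM pods created by the LeaderWorkerSet can be checked with the name label that appears later in this thread:

```shell
# List the leader and worker pods created by the LeaderWorkerSet named "vllm"
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=vllm

# Tail the leader pod's logs to confirm the Ray cluster formed and the vLLM API server started
# ("vllm-0" assumes the default LWS naming of the first group's leader pod)
kubectl logs -f vllm-0
```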

Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replicas has two pods (tp=2).
Contributor

Suggested change
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replicas has two pods (tp=2).
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replica has two pods (tp=2).
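For readers following along, a minimal sketch of what such a LeaderWorkerSet could look like; the apiVersion, image placeholders, and omitted fields are illustrative assumptions, and the actual lws.yaml in this PR is the authoritative version:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1   # adjust to the API group/version of the installed LWS release
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                  # two vLLM model replicas
  leaderWorkerTemplate:
    size: 2                    # two pods per replica (leader + 1 worker), matching --tensor-parallel-size 2
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: <vllm-image-with-ray_init.sh>   # placeholder; the PR builds its own image
          # command, env (e.g. RAY_CLUSTER_SIZE, LWS_NAME) and resource requests/limits
          # are defined in the PR's lws.yaml and omitted here
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: <vllm-image-with-ray_init.sh>   # placeholder
```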


Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
Contributor

Suggested change
## Deploy LeaderWorkerSet Deployment
## Deploy LeaderWorkerSet of vLLM

docs/example/vllm/README.md (outdated review thread, resolved)
@@ -0,0 +1,101 @@
# Deploy distributed vLLM with LWS on GPUs
@kerthcet (Contributor), May 8, 2024

Maybe "Deploy Distributed Inference Service with vLLM and LWS on GPUs"

docs/example/vllm/lws.yaml (review thread, resolved)
@kerthcet (Contributor) commented May 8, 2024

Found the reason for the OOM: vLLM uses a default swap space of 4GB, and I put a memory limit of 4GB in the LWS template, which leads to the OOM. Can we append a new flag to the vLLM api_server, --swap-space 2, so that with a request of 4GB no OOM happens?
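Combining the two review suggestions (use the default GPU memory utilization, add --swap-space 2), the leader command discussed above would end up looking roughly like this; a sketch only, the merged lws.yaml is the authoritative version:

```yaml
- sh
- -c
# "--gpu-memory-utilization" dropped (vLLM's default is used) and "--swap-space 2" added;
# the ray_init.sh subcommand was later renamed from "head" to "leader" in this PR
- "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE;
   python3 -m vllm.entrypoints.openai.api_server --port 8080 --model facebook/opt-125m --swap-space 2 --tensor-parallel-size 2"
```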

```yaml
kind: Service
metadata:
  name: vllm-leader
  namespace: default
```
Contributor

Remove this, as we don't specify a namespace in the LWS.

```yaml
selector:
  leaderworkerset.sigs.k8s.io/name: vllm
  role: leader
type: ClusterIP
```
Contributor

Add a blank line below.
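Taken together with the namespace comment above, the reviewed Service would end up roughly as follows; the ports section is an assumption based on the API server's --port 8080, and the file in the PR remains authoritative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
  ports:
  - name: http
    port: 8080        # assumed to match the vLLM API server's --port 8080
    targetPort: 8080
    protocol: TCP
```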

```yaml
- name: LWS_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/name']
```
Contributor

Add a blank line below.

@gujingit (Contributor, Author) commented May 9, 2024

Added a new commit to address all the comments above. If the two commits need to be merged into one, please let me know.

  1. update README.md
  2. update the leader pod commands
  3. remove namespace in service.yaml
  4. update subcommand from head to leader

@kerthcet

@gujingit (Contributor, Author) commented May 9, 2024

> Hi @gujingit, have you tried this across hosts? Could you also provide the device info, like GPU cards, number of nodes, and network, to make this clearer? Thanks a lot.

Yes, I tried deploying vLLM across hosts. Kubernetes version: 1.28.9. There are 4 GPU nodes, each with 1 NVIDIA A10 (24GB GPU memory), 16 CPUs, and 60GB of memory. CUDA version: 12.2.

@ahg-g (Contributor) left a comment

Thanks a lot for this!

docs/example/vllm/build/ray_init.sh (5 review threads resolved, 3 of them outdated)
@gujingit gujingit force-pushed the doc/distributed-vLLM branch 2 times, most recently from 6aaaedf to b3a8023 on May 9, 2024 08:42
@gujingit (Contributor, Author) commented May 9, 2024

Modified ray_init.sh by removing set -e to ensure that the script continues running even if starting the Ray worker fails.
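For context, a rough sketch of how a script like this might behave after that change. Only the leader subcommand, the --ray_cluster_size flag, and the removal of set -e come from this thread; the worker branch, the wait loop, and the address handling are assumptions, and the actual ray_init.sh in the PR is authoritative:

```shell
#!/bin/bash
# Note: no `set -e`, so a failed `ray start` on one pod does not abort the whole script.

subcommand=$1
shift

case "$subcommand" in
  leader)
    ray_cluster_size=2
    for arg in "$@"; do
      case "$arg" in
        --ray_cluster_size=*) ray_cluster_size="${arg#*=}" ;;
      esac
    done
    # Start the Ray head process on the leader pod.
    ray start --head --port=6379
    # Block until the expected number of Ray nodes have joined, then return so vLLM can start.
    until [ "$(python3 -c 'import ray; ray.init(address="auto"); print(len(ray.nodes()))')" -ge "$ray_cluster_size" ]; do
      echo "Waiting for all Ray workers to join..."
      sleep 5
    done
    ;;
  worker)
    # Join the cluster started by the leader; $RAY_HEAD_ADDRESS is a placeholder for however
    # the real script learns the leader's address (e.g. an env var injected into the pod).
    ray start --address="${RAY_HEAD_ADDRESS}:6379" --block
    ;;
esac
```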

@kerthcet (Contributor) left a comment

Only one nit, otherwise LGTM. Great work @gujingit !

/approve
/hold for @ahg-g @liurupeng

docs/example/vllm/README.md (outdated review thread, resolved)
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 9, 2024
…pace in service.yaml 4) update subcommand from head to leader

Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
@gujingit gujingit force-pushed the doc/distributed-vLLM branch from b3a8023 to f73d87f on May 10, 2024 10:42
@ahg-g (Contributor) commented May 10, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, gujingit, kerthcet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahg-g (Contributor) commented May 10, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit decf166 into kubernetes-sigs:main May 10, 2024
7 checks passed
@ahg-g (Contributor) commented May 20, 2024

Hi @gujingit would you be able to add the following diagram to the guide?

[diagram: a Ray head with 4 Ray workers]

@gujingit (Contributor, Author) commented Jun 26, 2024

> Hi @gujingit would you be able to add the following diagram to the guide?
>
> [diagram: a Ray head with 4 Ray workers]

@ahg-g tp=2 in the vLLM example, but the picture has 4 Ray workers. Can I re-draw the picture with only 2 Ray workers?

@kerthcet (Contributor) commented Jul 9, 2024

I think you can simply mention that we set size=2 for a minimal deployment, which is a bit different from the diagram.
Whisper: maybe the Ray head could be labeled Ray Driver, because the head node = driver process + worker process.

Successfully merging this pull request may close these issues.

Add example for multi-host serving of llama 70B on GPUs
7 participants