add vllm distributed inference examples #133
Conversation
Hi @gujingit. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Exciting! This is what I have been waiting for!
/ok-to-test
/assign
Hi @gujingit, have you tried this across hosts? Can you also provide the device info (GPU cards, number of nodes, network) to make this clearer? Thanks a lot.
I gave it a try but it always OOMed; I'm trying to find the root cause.
docs/example/vllm/lws.yaml
Outdated
- sh
- -c
- "/vllm-workspace/ray_init.sh head --ray_cluster_size=$RAY_CLUSTER_SIZE;
  python3 -m vllm.entrypoints.openai.api_server --port 8080 --model facebook/opt-125m --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
Let's use the default GPU memory utilization; it's already high enough, I think. That means removing the `--gpu-memory-utilization` flag.
So the autoscaling is still done by Ray, right? If Ray wants to spin up more workers for scaling, right now that won't be possible.
I don't think we want Ray to do any autoscaling; ideally the leader emits a metric that triggers HPA to autoscale the LWS.
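To make that idea concrete, here is a hypothetical sketch of an HPA targeting the LeaderWorkerSet's scale subresource. The metric name, target value, and the metrics pipeline that would expose it are all assumptions for illustration; nothing like this is part of this PR:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_pending_requests   # hypothetical metric emitted by the leader pods
        target:
          type: AverageValue
          averageValue: "10"
```

This assumes a custom-metrics adapter is installed so the HPA can read the leader's metric; the point is only that scaling the number of replica groups would be HPA's job, not Ray's.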
vLLM provisions Ray actors based on the tensor-parallel size. The LWS configuration sets up the Ray nodes necessary to run the requested Ray actors, so it's static in this setup.
No configuration, no scaling, I think.
docs/example/vllm/README.md
Outdated
Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
Add a blank line below.
docs/example/vllm/README.md
Outdated
kubectl apply -f lws.yaml

Verify the status of the vLLM Deployment
Suggested change: "Verify the status of the vLLM Deployment" → "Verify the status of the vLLM Pods"
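For example, a quick way to check those pods once the manifest is applied; this relies on the leaderworkerset.sigs.k8s.io/name label that also appears in the service selector later in this PR:

```shell
# List all pods belonging to the vllm LeaderWorkerSet (leaders and workers)
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=vllm
```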
docs/example/vllm/README.md
Outdated
Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replicas has two pods (tp=2).
Suggested change: "We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replicas has two pods (tp=2)." → "We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replica has two pods (tp=2)."
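As a trimmed-down sketch of what that sentence means at the manifest level (pod templates elided; only the replica/size fields are the point, and the exact container spec is whatever lws.yaml in this PR defines):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                # two vLLM model replicas
  leaderWorkerTemplate:
    size: 2                  # two pods per replica (leader + worker), matching --tensor-parallel-size 2
    leaderTemplate:
      # ... pod template that runs ray_init.sh as the Ray head plus the vLLM API server ...
    workerTemplate:
      # ... pod template that joins the Ray cluster started by the leader ...
```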
docs/example/vllm/README.md
Outdated
Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
Suggested change: "## Deploy LeaderWorkerSet Deployment" → "## Deploy LeaderWorkerSet of vLLM"
docs/example/vllm/README.md
Outdated
@@ -0,0 +1,101 @@
# Deploy distributed vLLM with LWS on GPUs
Maybe "Deploy Distributed Inference Service with vLLM and LWS on GPUs"?
Found the reason for the OOM: vLLM uses a default swap space of 4GB, and I put a memory limit of 4GB in the LWS template, which leads to the OOM. Can we append a new flag to the vLLM api_server?
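The comment above is cut off, so the exact flag isn't visible in this thread. One plausible candidate is vLLM's --swap-space flag, which controls the CPU swap space per GPU in GiB and defaults to 4. A sketch of the leader command with it lowered explicitly (the value 2 is illustrative only, and --swap-space is an assumption about the truncated suggestion):

```yaml
# --swap-space is one possible flag here; the thread truncates the actual suggestion
- sh
- -c
- "/vllm-workspace/ray_init.sh head --ray_cluster_size=$RAY_CLUSTER_SIZE;
   python3 -m vllm.entrypoints.openai.api_server --port 8080 --model facebook/opt-125m --tensor-parallel-size 2 --swap-space 2"
```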
docs/example/vllm/service.yaml
Outdated
kind: Service
metadata:
  name: vllm-leader
  namespace: default
Remove this, as we don't specify a namespace in lws.yaml.
docs/example/vllm/service.yaml
Outdated
selector:
  leaderworkerset.sigs.k8s.io/name: vllm
  role: leader
type: ClusterIP
Add a blank line below.
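Putting both review comments together, the service would end up looking roughly like this. The port fields are an assumption here, chosen to match the API server's --port 8080; only the fields visible in this review are otherwise shown:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader          # namespace omitted, per the comment above
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  ports:
    - name: http
      protocol: TCP
      port: 8080             # assumed to match the vLLM API server's --port 8080
      targetPort: 8080
  type: ClusterIP
```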
docs/example/vllm/lws.yaml
Outdated
- name: LWS_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/name']
Add a blank line below.
Added a new commit to address all the comments above. If the two commits need to be squashed into one, please let me know.
Yes, I tried deploying vLLM across hosts. Kubernetes version: 1.28.9. There are 4 GPU nodes, each with 1 NVIDIA A10 (24GB GPU memory), 16 CPUs, and 60GB of memory. CUDA version: 12.2.
Thanks a lot for this!
Force-pushed from 6aaaedf to b3a8023.
Modify ray_init.sh by removing …
Only one nit, otherwise LGTM. Great work @gujingit !
/approve
/hold for @ahg-g @liurupeng
…pace in service.yaml 4) update subcommand from head to leader Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Force-pushed from b3a8023 to f73d87f.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, gujingit, kerthcet. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
Hi @gujingit, would you be able to add the following diagram to the guide?
I think you can simply mention that we set size=2 for the minimum deployment, which is a bit different from the diagram.
What type of PR is this?
/kind documentation
What this PR does / why we need it
Add distributed vLLM inference examples.
Which issue(s) this PR fixes
Fixes #124