
add vllm distributed inference examples #133

Merged · 2 commits merged from the doc/distributed-vLLM branch into kubernetes-sigs:main on May 10, 2024

Conversation

@gujingit (Contributor) commented May 8, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it

Add distributed vLLM inference examples

Which issue(s) this PR fixes

Fixes #124

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label May 8, 2024
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and liurupeng May 8, 2024 04:31
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 8, 2024
@k8s-ci-robot (Contributor)

Hi @gujingit. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 8, 2024
@gujingit gujingit changed the title add distributed vllm inference examples add vllm distributed inference examples May 8, 2024
@liurupeng (Collaborator)

exciting! this is what I have been waiting for!

@liurupeng (Collaborator)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2024
@kerthcet (Contributor) commented May 8, 2024

/assign

@kerthcet (Contributor) commented May 8, 2024

Hi @gujingit, have you tried this across hosts? Could you also provide the device info, like GPU cards, number of nodes, and network, to make this clearer? Thanks a lot.

@kerthcet (Contributor) left a comment

I had a try but it always OOMed; I'm trying to find the root cause.

```yaml
- sh
- -c
- "/vllm-workspace/ray_init.sh head --ray_cluster_size=$RAY_CLUSTER_SIZE;
   python3 -m vllm.entrypoints.openai.api_server --port 8080 --model facebook/opt-125m --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
```
Contributor

Let's use the default GPU memory utilization; it's already high enough, I think. That means removing the --gpu-memory-utilization flag.

Collaborator

So the autoscaling is still done by Ray, right? If Ray wants to spin up more workers for scaling, right now that won't be possible.

Contributor

I don't think we want Ray to do any autoscaling; ideally, the leader emits a metric that triggers an HPA to autoscale the LWS.
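As a purely illustrative sketch of that idea (not part of this PR): assuming the LeaderWorkerSet exposes the scale subresource and the leader's metric is surfaced through a custom metrics adapter under a hypothetical name vllm_num_requests_waiting, an HPA targeting the LWS could look roughly like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1   # adjust to the API group/version of the installed LWS release
    kind: LeaderWorkerSet
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # hypothetical metric name emitted by the vLLM leader
      target:
        type: AverageValue
        averageValue: "10"
```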

Contributor

vLLM provisions Ray Actors based on the tensor parallel size. The LWS configuration sets up the Ray Nodes necessary to run the requested Ray Actors, so it's static in this setup.

Contributor

No configuration, no scaling, I think.


Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
Contributor

Add a blank line below.

```shell
kubectl apply -f lws.yaml
```

Verify the status of the vLLM Deployment
Contributor

Suggested change
Verify the status of the vLLM Deployment
Verify the status of the vLLM Pods
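As a hedged aside for readers following the guide (the exact commands in the README may differ), the vLLM pods created by the LeaderWorkerSet can be checked with the name label that appears later in this thread:

```shell
# List the leader and worker pods created by the LeaderWorkerSet named "vllm"
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=vllm

# Tail the leader pod's logs to confirm the Ray cluster formed and the vLLM API server started
# ("vllm-0" assumes the default LWS naming of the first group's leader pod)
kubectl logs -f vllm-0
```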

Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replicas has two pods (tp=2).
Contributor

Suggested change
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replicas has two pods (tp=2).
We use LeaderWorkerSet to deploy two vLLM model replicas, and each vLLM replica has two pods (tp=2).
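For readers following along, a minimal sketch of what such a LeaderWorkerSet could look like; the apiVersion, image placeholders, and omitted fields are illustrative assumptions, and the actual lws.yaml in this PR is the authoritative version:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1   # adjust to the API group/version of the installed LWS release
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                  # two vLLM model replicas
  leaderWorkerTemplate:
    size: 2                    # two pods per replica (leader + 1 worker), matching --tensor-parallel-size 2
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: <vllm-image-with-ray_init.sh>   # placeholder; the PR builds its own image
          # command, env (e.g. RAY_CLUSTER_SIZE, LWS_NAME) and resource requests/limits
          # are defined in the PR's lws.yaml and omitted here
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: <vllm-image-with-ray_init.sh>   # placeholder
```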


Follow the step-by-step guide on how to install LWS. [View installation guide](https://github.com/kubernetes-sigs/lws/blob/main/docs/setup/install.md)

## Deploy LeaderWorkerSet Deployment
Contributor

Suggested change
## Deploy LeaderWorkerSet Deployment
## Deploy LeaderWorkerSet of vLLM

docs/example/vllm/README.md (outdated review thread, resolved)
@@ -0,0 +1,101 @@
# Deploy distributed vLLM with LWS on GPUs
@kerthcet (Contributor), May 8, 2024

Maybe "Deploy Distributed Inference Service with vLLM and LWS on GPUs"

docs/example/vllm/lws.yaml (review thread, resolved)
@kerthcet (Contributor) commented May 8, 2024

Found the reason for the OOM: vLLM uses a default swap space of 4GB, and I put a memory limit of 4GB in the LWS template, which leads to the OOM. Can we append a new flag to the vLLM api_server, --swap-space 2, so that with a request of 4GB no OOM happens?
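Combining the two review suggestions (use the default GPU memory utilization, add --swap-space 2), the leader command discussed above would end up looking roughly like this; a sketch only, the merged lws.yaml is the authoritative version:

```yaml
- sh
- -c
# "--gpu-memory-utilization" dropped (vLLM's default is used) and "--swap-space 2" added;
# the ray_init.sh subcommand was later renamed from "head" to "leader" in this PR
- "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE;
   python3 -m vllm.entrypoints.openai.api_server --port 8080 --model facebook/opt-125m --swap-space 2 --tensor-parallel-size 2"
```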

```yaml
kind: Service
metadata:
  name: vllm-leader
  namespace: default
```
Contributor

Remove this, as we don't specify a namespace in the LWS.

```yaml
selector:
  leaderworkerset.sigs.k8s.io/name: vllm
  role: leader
type: ClusterIP
```
Contributor

Add a blank line below.
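Taken together with the namespace comment above, the reviewed Service would end up roughly as follows; the ports section is an assumption based on the API server's --port 8080, and the file in the PR remains authoritative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
  ports:
  - name: http
    port: 8080        # assumed to match the vLLM API server's --port 8080
    targetPort: 8080
    protocol: TCP
```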

```yaml
- name: LWS_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/name']
```
Contributor

Add a blank line below.

@gujingit (Contributor, Author) commented May 9, 2024

Added a new commit to address all the comments above. If the two commits need to be merged into one, please let me know.

  1. update README.md
  2. update the leader pod commands
  3. remove namespace in service.yaml
  4. update subcommand from head to leader

@kerthcet

@gujingit (Contributor, Author) commented May 9, 2024

> Hi @gujingit, have you tried this across hosts? Could you also provide the device info, like GPU cards, number of nodes, and network, to make this clearer? Thanks a lot.

Yes, I tried deploying vLLM across hosts. Kubernetes version: 1.28.9. There are 4 GPU nodes, each with 1 NVIDIA A10 (24GB GPU memory), 16 CPUs, and 60GB of memory. CUDA version: 12.2.

@ahg-g (Contributor) left a comment

Thanks a lot for this!

docs/example/vllm/build/ray_init.sh (5 review threads resolved, 3 of them outdated)
@gujingit gujingit force-pushed the doc/distributed-vLLM branch 2 times, most recently from 6aaaedf to b3a8023 on May 9, 2024 08:42
@gujingit (Contributor, Author) commented May 9, 2024

Modified ray_init.sh by removing set -e to ensure that the script continues running even if starting the Ray worker fails.
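For context, a rough sketch of how a script like this might behave after that change. Only the leader subcommand, the --ray_cluster_size flag, and the removal of set -e come from this thread; the worker branch, the wait loop, and the address handling are assumptions, and the actual ray_init.sh in the PR is authoritative:

```shell
#!/bin/bash
# Note: no `set -e`, so a failed `ray start` on one pod does not abort the whole script.

subcommand=$1
shift

case "$subcommand" in
  leader)
    ray_cluster_size=2
    for arg in "$@"; do
      case "$arg" in
        --ray_cluster_size=*) ray_cluster_size="${arg#*=}" ;;
      esac
    done
    # Start the Ray head process on the leader pod.
    ray start --head --port=6379
    # Block until the expected number of Ray nodes have joined, then return so vLLM can start.
    until [ "$(python3 -c 'import ray; ray.init(address="auto"); print(len(ray.nodes()))')" -ge "$ray_cluster_size" ]; do
      echo "Waiting for all Ray workers to join..."
      sleep 5
    done
    ;;
  worker)
    # Join the cluster started by the leader; $RAY_HEAD_ADDRESS is a placeholder for however
    # the real script learns the leader's address (e.g. an env var injected into the pod).
    ray start --address="${RAY_HEAD_ADDRESS}:6379" --block
    ;;
esac
```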

@kerthcet (Contributor) left a comment

Only one nit, otherwise LGTM. Great work @gujingit !

/approve
/hold for @ahg-g @liurupeng

docs/example/vllm/README.md (outdated review thread, resolved)
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 9, 2024
…pace in service.yaml 4) update subcommand from head to leader

Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
@gujingit gujingit force-pushed the doc/distributed-vLLM branch from b3a8023 to f73d87f on May 10, 2024 10:42
@ahg-g (Contributor) commented May 10, 2024

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, gujingit, kerthcet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahg-g (Contributor) commented May 10, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit decf166 into kubernetes-sigs:main May 10, 2024
7 checks passed
@ahg-g (Contributor) commented May 20, 2024

Hi @gujingit would you be able to add the following diagram to the guide?

[diagram: a Ray head with 4 Ray workers]

@gujingit (Contributor, Author) commented Jun 26, 2024

> Hi @gujingit would you be able to add the following diagram to the guide?
>
> [diagram: a Ray head with 4 Ray workers]

@ahg-g tp=2 in the vLLM example, but the picture has 4 Ray workers. Can I re-draw the picture with only 2 Ray workers?

@kerthcet (Contributor) commented Jul 9, 2024

I think you can simply mention that we set size=2 for a minimal deployment, which is a bit different from the diagram.
Whisper: maybe the Ray head could be labeled Ray Driver, because the head node = driver process + worker process.

Successfully merging this pull request may close these issues.

Add example for multi-host serving of llama 70B on GPUs
7 participants