[Feature]: Support Ray-free multi-node distributed inference on resource managers like Kubernetes #3902
Comments
Oops - realized you're talking about multi-node. Deleted my comment.
I think this makes sense. Adding @njhill here for context about local multiprocessing and @youkaichao for the nccl wrapper/abstraction for distributed communication. The main work here seems to be setting the right env vars through K8s (a StatefulSet?) so that the containers know each other's addresses and their respective group/rank.
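For concreteness, here is a minimal sketch of how each container could derive its rank and the master address under that kind of StatefulSet setup, assuming a headless service; the env var names are hypothetical and this is not an existing vLLM interface:

```python
# Illustrative sketch (not an existing vLLM API) of how a pod could derive its
# rank and the master address under a Kubernetes StatefulSet with a headless
# service. Env var names like VLLM_MASTER_PORT are hypothetical.
import os
import socket


def discover_rank_and_master():
    # StatefulSet pods are named <statefulset>-0, <statefulset>-1, ...; the
    # ordinal suffix can serve as the per-node rank.
    hostname = socket.gethostname()
    base, _, ordinal = hostname.rpartition("-")
    rank = int(ordinal) if ordinal.isdigit() else int(os.environ.get("RANK", "0"))

    # A headless service gives every pod a stable DNS name, so pod 0 can act
    # as the rendezvous master for torch.distributed.
    service = os.environ.get("VLLM_HEADLESS_SERVICE", base or hostname)
    master_addr = f"{base or hostname}-0.{service}"
    master_port = int(os.environ.get("VLLM_MASTER_PORT", "29500"))
    return rank, master_addr, master_port


if __name__ == "__main__":
    print(discover_rank_and_master())
```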
Yeah, I think the main todo is that we have to know the process was launched by kube etc., and only the master node & process should start the API server while the other processes just join the task group. Currently we always let the main process launch all the other tasks. That's why we rely on Ray, e.g. for launching tasks on another node.
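A rough sketch of that control flow, assuming an external launcher (Kubernetes, torchrun, ...) sets RANK for every process; the helper functions are placeholders, not real vLLM entry points:

```python
# Placeholder sketch of "only rank 0 starts the API server, everyone else
# joins the task group". The helpers are stubs, not real vLLM code.
import os


def start_api_server():
    # Placeholder for "run the OpenAI-compatible API server on the master".
    print("rank 0: starting API server and driving the workers")


def join_worker_group():
    # Placeholder for "join the tensor-parallel group and wait for work".
    print("non-zero rank: joining the worker group only")


def entrypoint():
    # RANK is expected to be provided by the external launcher.
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        start_api_server()
    else:
        join_worker_group()


if __name__ == "__main__":
    entrypoint()
```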
For multi-GPU single-node without Ray, we have #3466, which works very well for us and which I hope to get merged soon.
Yes, to get to multi-node without Ray, there are two parts needed: (1) distributed process orchestration and (2) some IPC mechanism that will work between nodes. #3763 is a step towards (2), and we are thinking torch.distributed (probably with CPU/gloo) could be used for all IPC (i.e. including the message-passing that Ray is currently used for). For (1) we would need to support an option to launch worker processes independently of the main process. Then Kubernetes could be used with either a pod per worker/GPU or per node. There is a proposal for a new Kube API to make this part easier, but in the meantime it could be achieved with more explicit manual configuration.
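As a hedged sketch of (2), torch.distributed with the gloo (CPU) backend could carry the control-plane messages that Ray handles today. This assumes the launcher sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE; the payload is purely illustrative:

```python
# Sketch: gloo-backed torch.distributed as a cross-node message-passing
# channel, standing in for the RPC role Ray currently plays.
import torch.distributed as dist


def main():
    # env:// rendezvous reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Rank 0 (the driver) broadcasts a Python object to all workers; the
    # payload here is made up, just to show the message-passing pattern.
    payload = [{"prompt_ids": [1, 2, 3]}] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(payload, src=0)
    print(f"rank {dist.get_rank()} received: {payload[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```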
By the way, this is not tied to any particular orchestrator. For example, one option is torchrun:

```bash
# single node, multi-gpu
torchrun --nproc-per-node=n python -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip:port} python -m vllm.entrypoints.openai.api_server $args

# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip:port} python -m vllm.entrypoints.openai.api_server $args
```

This way, the method is agnostic to different cluster management tools. As long as we can execute commands on each node, we are good: the only thing needed from the cluster management tool is to assign a master IP and a master port, and then execute the same command on each node.
@youkaichao yes, that's what I was referring to by "we would need to support an option to launch worker processes independently of the main process".
Agree, once we have that support then an approach like that could work. However, IMHO we should support both modes. At least for single-node, it's also nice to be able to just launch one vLLM process and have it run the other workers in a self-contained way, as it does now.
Agree. And that's kind of a UX problem, which can be solved in a unified way; it is not limited to one particular launcher.
@Jeffwan Were you able to run vLLM distributed inference in a RayCluster with tensor parallelism successfully? If so, could you please post the script you used (and the RayCluster configuration)? We have been trying to get vLLM to run in a distributed manner with tensor parallelism on Ray but have failed so far.
@youkaichao @simon-mo Seems we'd like to refactor the interface and abstraction first. I will do more testing downstream and keep an eye on #3587 at the same time.
@pravingadakh yeah. Let me file a PR to improve the distributed inference guidance.
Just wanted to share that with LWS (LeaderWorkerSet), Ray becomes an implementation detail of vLLM when deploying on k8s: see https://docs.vllm.ai/en/latest/serving/deploying_with_lws.html for how that works.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
🚀 The feature, motivation and pitch
Currently, distributed inference (TP) in vLLM relies on Ray to orchestrate the GPU workers. I briefly checked the code, and it seems the core distributed communication is provided by torch.distributed with the nccl backend; the actors do not communicate over Ray's own protocol. In this case, Ray just plays the role of orchestration and resource reservation (placement groups). Please correct me if I am wrong.

We do use Ray and KubeRay on Kubernetes, and I've successfully tested vLLM distributed inference on this setup, confirming that it works. However, we have many users/platforms, and we do not want to lock in on Ray, since some teams may not have enough Ray knowledge to cover the operational side. My proposal is to provide a simple orchestration layer on top of GPUExecutor for users who are familiar with cloud-native technologies and would like to use Kubernetes's capabilities in place of Ray's orchestration (Ray actors) and scheduling (placement groups). Ideally, we would have both Ray and Kubernetes available as orchestrators for vLLM, providing our platform users with alternative options for their needs.
Please help check whether this proposal makes sense. I can contribute to this feature.
Alternatives
No response
Additional context
No response