[Feature]: Support Ray-free multi-node distributed inference on resource managers like Kubernetes #3902
Comments
Oops - realized you're talking about multi-node. Deleted my comment.
I think this makes sense. Adding @njhill here for context about local multiprocessing and @youkaichao for the nccl wrapper/abstraction for distributed communication. The main work here seems to be setting the right env vars through K8s (a StatefulSet?) so that the containers know each other's addresses and their respective group/rank.
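For concreteness, here is a minimal sketch of how each container could derive its rank and the master address under that kind of StatefulSet setup, assuming a headless service; the env var names are hypothetical and this is not an existing vLLM interface:

```python
# Illustrative sketch (not an existing vLLM API) of how a pod could derive its
# rank and the master address under a Kubernetes StatefulSet with a headless
# service. Env var names like VLLM_MASTER_PORT are hypothetical.
import os
import socket


def discover_rank_and_master():
    # StatefulSet pods are named <statefulset>-0, <statefulset>-1, ...; the
    # ordinal suffix can serve as the per-node rank.
    hostname = socket.gethostname()
    base, _, ordinal = hostname.rpartition("-")
    rank = int(ordinal) if ordinal.isdigit() else int(os.environ.get("RANK", "0"))

    # A headless service gives every pod a stable DNS name, so pod 0 can act
    # as the rendezvous master for torch.distributed.
    service = os.environ.get("VLLM_HEADLESS_SERVICE", base or hostname)
    master_addr = f"{base or hostname}-0.{service}"
    master_port = int(os.environ.get("VLLM_MASTER_PORT", "29500"))
    return rank, master_addr, master_port


if __name__ == "__main__":
    print(discover_rank_and_master())
```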
Yeah, I think the main todo is that we have to know the process was launched by kube etc., and only the master node & process should start the API server while the other processes just join the task group. Currently we always let the main process launch all the other tasks. That's why we rely on Ray, e.g. for launching tasks on another node.
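A rough sketch of that control flow, assuming an external launcher (Kubernetes, torchrun, ...) sets RANK for every process; the helper functions are placeholders, not real vLLM entry points:

```python
# Placeholder sketch of "only rank 0 starts the API server, everyone else
# joins the task group". The helpers are stubs, not real vLLM code.
import os


def start_api_server():
    # Placeholder for "run the OpenAI-compatible API server on the master".
    print("rank 0: starting API server and driving the workers")


def join_worker_group():
    # Placeholder for "join the tensor-parallel group and wait for work".
    print("non-zero rank: joining the worker group only")


def entrypoint():
    # RANK is expected to be provided by the external launcher.
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        start_api_server()
    else:
        join_worker_group()


if __name__ == "__main__":
    entrypoint()
```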
For multi-GPU single-node without Ray, we have #3466, which works very well for us and which I hope to get merged soon.
Yes, to get to multi-node without Ray, there are two parts needed: (1) distributed process orchestration and (2) some IPC mechanism that will work between nodes. #3763 is a step towards (2), and we are thinking torch.distributed (probably with CPU/gloo) could be used for all IPC (i.e. including the message-passing that Ray is currently used for). For (1) we would need to support an option to launch worker processes independently of the main process. Then Kubernetes could be used with either a pod per worker/GPU or per node. There is a proposal for a new Kube API to make this part easier, but in the meantime it could be achieved with more explicit manual configuration.
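As a hedged sketch of (2), torch.distributed with the gloo (CPU) backend could carry the control-plane messages that Ray handles today. This assumes the launcher sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE; the payload is purely illustrative:

```python
# Sketch: gloo-backed torch.distributed as a cross-node message-passing
# channel, standing in for the RPC role Ray currently plays.
import torch.distributed as dist


def main():
    # env:// rendezvous reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Rank 0 (the driver) broadcasts a Python object to all workers; the
    # payload here is made up, just to show the message-passing pattern.
    payload = [{"prompt_ids": [1, 2, 3]}] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(payload, src=0)
    print(f"rank {dist.get_rank()} received: {payload[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```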
By the way, this is not tied to any particular orchestrator. For example, one option is torchrun:

```bash
# single node, multi-gpu
torchrun --nproc-per-node=n python -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip:port} python -m vllm.entrypoints.openai.api_server $args

# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip:port} python -m vllm.entrypoints.openai.api_server $args
```

This way, the method is agnostic to different cluster management tools. As long as we can execute commands on each node, we are good: the only thing needed from the cluster management tool is to assign a master IP and a master port, and then execute the same command on each node.
@youkaichao yes, that's what I was referring to by "we would need to support an option to launch worker processes independently of the main process".
Agree, once we have that support then an approach like that could work. However, IMHO we should support both modes. At least for single-node, it's also nice to be able to just launch one vLLM process and have it run the other workers in a self-contained way, as it does now.
Agree. And that's kind of a UX problem, which can be solved in a unified way; it is not limited to one particular launcher.
@Jeffwan Were you able to run vLLM distributed inference in a RayCluster with tensor parallelism successfully? If so, could you please post the script you used (and the RayCluster configuration)? We have been trying to get vLLM to run in a distributed manner with tensor parallelism on Ray but have failed so far.
@youkaichao @simon-mo Seems we'd like to refactor the interface and abstraction first. I will do more testing downstream and keep an eye on #3587 at the same time.
@pravingadakh yeah. Let me file a PR to improve the distributed inference guidance.
Just wanted to share that with LWS (LeaderWorkerSet), Ray becomes an implementation detail of vLLM when deploying on k8s: see https://docs.vllm.ai/en/latest/serving/deploying_with_lws.html for how that works.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
🚀 The feature, motivation and pitch
Currently, distributed inference (TP) in vLLM relies on Ray to orchestrate the GPU workers. I briefly checked the code, and it seems the core distributed communication is provided by torch.distributed with the nccl backend; the actors do not communicate over Ray's own protocol. In this case, Ray just plays the role of orchestration and resource reservation (placement groups). Please correct me if I am wrong.

We do use Ray and KubeRay on Kubernetes, and I've successfully tested vLLM distributed inference on this setup, confirming that it works. However, we have many users/platforms, and we do not want to lock in on Ray, since some teams may not have enough Ray knowledge to cover the operational side. My proposal is to provide a simple orchestration layer on top of GPUExecutor for users who are familiar with cloud-native technologies and would like to use Kubernetes's capabilities in place of Ray's orchestration (Ray actors) and scheduling (placement groups). Ideally, we would have both Ray and Kubernetes available as orchestrators for vLLM, providing our platform users with alternative options for their needs.
Please help check whether this proposal makes sense. I can contribute to this feature.
Alternatives
No response
Additional context
No response