
[Feature]: Support Ray-free multi-node distributed inference on resource managers like Kubernetes #3902

Closed
Jeffwan opened this issue Apr 7, 2024 · 13 comments

Comments

@Jeffwan
Contributor

Jeffwan commented Apr 7, 2024

🚀 The feature, motivation and pitch

Currently, distributed inference (TP) in vLLM relies on Ray to orchestrate the GPU workers. I briefly checked the code, and it seems the core distributed communication is provided by torch.distributed with the NCCL backend; the actors' communication is not done over Ray's own protocol. In this case, Ray just plays the role of orchestration and resource reservation (placement groups). Please correct me if I am wrong.

We do use Ray and KubeRay on Kubernetes, and I've successfully tested vLLM distributed inference on this setup, confirming that it works. However, we have many users and platforms, and we do not want to lock them into Ray, since some teams may not have enough Ray expertise to operate it. My proposal is to provide a simple orchestration layer on top of GPUExecutor for users who are familiar with cloud-native technologies and would like to use Kubernetes's capabilities for orchestration (in place of Ray actors) and scheduling (in place of placement groups).

Ideally, we would support both Ray and Kubernetes as orchestrators for vLLM, giving our platform users alternative options for their needs.

Please let me know whether this proposal makes sense; I can contribute this feature.

Alternatives

No response

Additional context

No response

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Apr 7, 2024

Oops - realized you're talking about multi-node

Deleted my comment

@simon-mo
Collaborator

simon-mo commented Apr 8, 2024

I think this makes sense. Adding @njhill here for context about local multiprocessing, and @youkaichao, who is implementing the NCCL wrapper/abstraction for distributed communication. The main work here seems to be setting the right env vars through K8s (a StatefulSet?) so that the containers know each other's addresses and their respective group/rank.
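As a rough illustration (not an official vLLM or Kubernetes recipe), a StatefulSet-style deployment could derive those variables in the container entrypoint, since each pod gets a stable ordinal hostname; the service and pod names below are assumptions:

# Hypothetical entrypoint snippet for a StatefulSet pod named vllm-<ordinal>,
# fronted by an assumed headless Service named vllm-headless.
export RANK="${HOSTNAME##*-}"              # pod ordinal, e.g. vllm-1 -> 1
export WORLD_SIZE=2                        # number of replicas in the StatefulSet
export MASTER_ADDR="vllm-0.vllm-headless"  # DNS name of the rank-0 pod
export MASTER_PORT=29500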

@youkaichao
Member

Yeah, I think the main TODO is that we have to know the process was launched by Kubernetes etc., and only the master node & process should start the API server, while the other processes just join the task group.

Currently we always let the main process launch all the other tasks. That's why we rely on Ray, e.g. for launching tasks on another node.
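To make that concrete, here is a minimal sketch (not existing vLLM code; the helper functions are hypothetical) of what an externally launched entrypoint could do:

import os

def main() -> None:
    # RANK is set by whatever launched the process (Kubernetes, torchrun, slurm, ...).
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        # Master process: start the API server and coordinate the workers.
        start_api_server()   # hypothetical helper
    else:
        # Every other process just joins the task group as a worker.
        run_worker_loop()    # hypothetical helper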

@njhill
Member

njhill commented Apr 9, 2024

For multi-GPU, single-node without Ray, we have #3466, which works very well for us and which I hope to get merged soon.

Currently we always let the main process launch all the other tasks. That's why we rely on Ray, e.g. for launching tasks on another node.

Yes, to get to multi-node without Ray, two parts are needed: (1) distributed process orchestration, and (2) an IPC mechanism that works between nodes.

#3763 is a step towards (2), and we are thinking torch.distributed (probably with CPU/gloo) could be used for all IPC (i.e. including the message-passing that Ray is currently used for).

For (1) we would need to support an option to launch worker processes independently of the main process. Then Kubernetes could be used with either a pod per worker/gpu or per node. There is a proposal for a new Kube API to make this part easier, but in the meantime it could be achieved with more explicit manual configuration.
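As a rough sketch of (2), and not vLLM's actual IPC code, torch.distributed with the gloo backend can pass CPU-side messages between independently launched processes, using the standard env:// rendezvous (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT set by the orchestrator):

import torch
import torch.distributed as dist

# Each process is launched independently (e.g. one per pod/node) with the
# rendezvous environment variables already set by the orchestrator.
dist.init_process_group(backend="gloo", init_method="env://")

rank = dist.get_rank()
world_size = dist.get_world_size()

# Control-plane style message passing over CPU tensors:
# rank 0 broadcasts a small payload to all other ranks.
msg = torch.zeros(8, dtype=torch.int64)
if rank == 0:
    msg = torch.arange(8, dtype=torch.int64)  # e.g. an encoded request header
dist.broadcast(msg, src=0)
print(f"rank {rank}/{world_size} received: {msg.tolist()}")

dist.destroy_process_group()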

@youkaichao
Member

By the way, this is not tied to Ray or Kubernetes. A more general approach is to free vLLM from launching processes itself. Instead, we let something else launch the vLLM processes, and only the master process launches the API server.

For example, one option is torchrun, e.g.

# single node, multi-gpu
torchrun --nproc-per-node=n -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args
# multi node, on node 1 (same command)
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args

This way, torchrun launches multiple processes, and each process executes the module vllm.entrypoints.openai.api_server. Inside the module, all environment variables are already set, and only the process with rank == 0 launches an API server. The remaining processes directly join the master process for distributed inference.

This method is agnostic to the cluster management tool. As long as we can execute commands on each node, we are good:

The only thing needed from the cluster management tool is to assign a master IP and a master port, and then execute the same command on each node.

@njhill
Member

njhill commented Apr 9, 2024

By the way, this is not tied to Ray or Kubernetes. A more general approach is to free vLLM from launching processes itself. Instead, we let something else launch the vLLM processes, and only the master process launches the API server.

@youkaichao yes, that's what I was referring to by "we would need to support an option to launch worker processes independently of the main process".

This way, torchrun launches multiple processes, and each process executes the module vllm.entrypoints.openai.api_server. Inside the module, all environment variables are already set, and only the process with rank == 0 launches an API server. The remaining processes directly join the master process for distributed inference.

Agreed, once we have that support, torchrun could be one option. See also the discussion here: #3691 (comment).

However, IMHO we should support both modes. At least for single-node, it's also nice to be able to launch just one vLLM process and have it run the other workers in a self-contained way, as it does now.

@youkaichao
Member

At least for single-node, it's also nice to be able to launch just one vLLM process and have it run the other workers in a self-contained way, as it does now.

Agreed. That's more of a UX problem, which can be handled in a unified way: vLLM can detect whether it is running inside torchrun by inspecting environment variables, and when it is not inside torchrun but the tensor parallel size is > 1, it knows to launch the workers itself.

This is not limited to torchrun. If we go with other options for launching worker processes independently of the main process, vLLM can also detect whether it was launched by one of those.
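For example (a hedged sketch, not vLLM's actual detection logic; the function names are hypothetical), the check could look at the environment variables that torchrun sets for every worker:

import os

def launched_by_torchrun() -> bool:
    # torchrun (torch.distributed.elastic) sets these for each worker process.
    required = ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")
    return all(var in os.environ for var in required)

def decide_launch_mode(tensor_parallel_size: int) -> str:
    if launched_by_torchrun():
        return "external"        # workers were started externally; just join the group
    if tensor_parallel_size > 1:
        return "self_contained"  # no external launcher; spawn workers ourselves
    return "single_process"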

@pravingadakh

@Jeffwan Were you able to run vLLM distributed inference in a RayCluster with tensor parallelism successfully? If so, could you please post the script you used (and the RayCluster configuration)? We have been trying to get vLLM to run in a distributed manner with tensor parallelism on Ray but have failed so far.

@Jeffwan
Contributor Author

Jeffwan commented Apr 19, 2024

@youkaichao @simon-mo It seems we'd like to refactor the interface and abstraction first. I will do more testing downstream and keep an eye on #3587 in the meantime.

@Jeffwan
Contributor Author

Jeffwan commented Apr 19, 2024

@pravingadakh yeah. Let me file a PR to improve the distributed inference guidance.

@ahg-g

ahg-g commented May 20, 2024

Just wanted to share that with LWS, Ray becomes an implementation detail of vLLM when deploying on k8s; see https://docs.vllm.ai/en/latest/serving/deploying_with_lws.html for how that works.


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Oct 29, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
