
[Usage]: Is multi-node, multi-card (NPU) inference supported? #9

Open

glowwormX opened this issue Jan 8, 2025 · 4 comments

Comments

@glowwormX

Your current environment

I merged the latest code from npu_support and vllm_main; it runs on a single-node NPU. When I tried multi-node with ray, I ran into a problem.
The ray environment looks fine, 2 nodes × 4 NPUs, as shown below:

    + ray status
    2025-01-08 09:54:50,931 - INFO - Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    2025-01-08 09:54:50,931 - INFO - Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
    2025-01-08 09:54:50,932 - INFO - NumExpr defaulting to 8 threads.
    ======== Autoscaler status: 2025-01-08 09:54:50.500592 ========
    Node status
    ---------------------------------------------------------------
    Active:
     1 node_6319b4be5d08477d895022d694a6562d9232a238c5c81122fe4a9f68
     1 node_0e31426c60128dba1e1b0e2562b027314a504bdc6d64cf914b285876
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/180.0 CPU
     0.0/8.0 NPU
     0B/1.01TiB memory
     0B/372.53GiB object_store_memory

My launch command:
vllm serve Qwen2_5/Qwen2.5-72B-Instruct/ --tensor-parallel-size=4 --pipeline-parallel-size 2 --block-size=128 --trust-remote-code --uvicorn-log-level=debug
Error: please ensure that world_size (8) is less than than max local Ascend npu count (4). Ray was not picked up.
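A minimal sketch of the arithmetic behind that first error: the world size implied by the serve flags is the product of the two parallelism degrees, which exceeds the four NPUs available on a single node, so a distributed executor backend (e.g. Ray) is required.

```python
# World size implied by the launch flags above.
tensor_parallel_size = 4
pipeline_parallel_size = 2
npus_per_node = 4  # each of the two nodes has 4 NPUs

world_size = tensor_parallel_size * pipeline_parallel_size
print(world_size)              # 8
print(world_size > npus_per_node)  # True: cannot fit on one node,
                                   # hence the "max local Ascend npu
                                   # count (4)" complaint
```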

After adding the --distributed-executor-backend=ray flag, I get a different error:
ray.exceptions.RaySystemError: System error: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu
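One way to read this error, as a small sketch built from the backend list quoted in the message itself: "npu" is not one of the device types stock PyTorch knows how to parse, because Ascend support is registered at import time by torch_npu. Any process that parses or deserializes a torch.device("npu") therefore needs torch_npu imported first.

```python
# Device types stock PyTorch 2.1 accepts, copied verbatim from the
# error message above.
builtin_device_types = (
    "cpu cuda ipu xpu mkldnn opengl opencl ideep hip ve fpga maia "
    "xla lazy vulkan mps meta hpu mtia privateuseone"
).split()

# "npu" is absent: torch_npu registers the Ascend backend when it is
# imported, so a Ray worker that rebuilds a torch.device("npu") before
# importing torch_npu hits exactly this RuntimeError.
print("npu" in builtin_device_types)  # False
```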

How would you like to use vllm

I would like to get multi-node inference working.

@wangshuai09
Owner

ray.exceptions.RaySystemError: System error: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu — which exact line of code raises this error?

@glowwormX
Author

Thanks for the reply. It looks like the error starts from worker_ip = ray.get(worker.get_node_ip.remote()) in ray_npu_executor.py:

      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_npu_executor.py", line 231, in __init__
        super().__init__(*args, **kwargs)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_gpu_executor.py", line 504, in __init__
        super().__init__(*args, **kwargs)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
        super().__init__(*args, **kwargs)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/executor_base.py", line 36, in __init__
        self._init_executor()
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_gpu_executor.py", line 64, in _init_executor
        self._init_workers_ray(placement_group)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_npu_executor.py", line 73, in _init_workers_ray
        worker_ip = ray.get(worker.get_node_ip.remote())
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
        return fn(*args, **kwargs)
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
        return func(*args, **kwargs)
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/worker.py", line 2753, in get
        values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/worker.py", line 906, in get_objects
        raise value
    ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::RayWorkerWrapper.__init__() (pid=3119, ip=172.16.11.218, actor_id=b23b7f6d8d9b96e753fcdded01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0xffd00edac700>)
      At least one of the input arguments for this task could not be computed:
    ray.exceptions.RaySystemError: System error: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu
    traceback: Traceback (most recent call last):
    RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu
    [ERROR] 2025-01-08-11:38:18 (PID:2586, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception

@wangshuai09
Owner

I have seen a similar problem in other third-party libraries before; it was caused by torch_npu not being imported. Could you try adding import torch_npu before the failing code and see whether that fixes it?

@glowwormX
Copy link
Author

> I have seen a similar problem in other third-party libraries before; it was caused by torch_npu not being imported. Could you try adding import torch_npu before the failing code and see whether that fixes it?

I tried that; it doesn't work, the error is the same.
