
[Usage]: Is multi-node, multi-card (NPU) inference supported? #9

Open

glowwormX opened this issue Jan 8, 2025 · 4 comments

Comments

@glowwormX

Your current environment

I merged the latest code from npu_support and vllm_main; it runs on a single-node NPU. When I tried multi-node with ray, I ran into a problem.
The ray environment looks fine, 2 nodes × 4 NPUs, as shown below:

    + ray status
    2025-01-08 09:54:50,931 - INFO - Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
    2025-01-08 09:54:50,931 - INFO - Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
    2025-01-08 09:54:50,932 - INFO - NumExpr defaulting to 8 threads.
    ======== Autoscaler status: 2025-01-08 09:54:50.500592 ========
    Node status
    ---------------------------------------------------------------
    Active:
     1 node_6319b4be5d08477d895022d694a6562d9232a238c5c81122fe4a9f68
     1 node_0e31426c60128dba1e1b0e2562b027314a504bdc6d64cf914b285876
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/180.0 CPU
     0.0/8.0 NPU
     0B/1.01TiB memory
     0B/372.53GiB object_store_memory

My launch command:
vllm serve Qwen2_5/Qwen2.5-72B-Instruct/ --tensor-parallel-size=4 --pipeline-parallel-size 2 --block-size=128 --trust-remote-code --uvicorn-log-level=debug
Error: please ensure that world_size (8) is less than than max local Ascend npu count (4). Ray was not picked up.
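A minimal sketch of the arithmetic behind that first error: the world size implied by the serve flags is the product of the two parallelism degrees, which exceeds the four NPUs available on a single node, so a distributed executor backend (e.g. Ray) is required.

```python
# World size implied by the launch flags above.
tensor_parallel_size = 4
pipeline_parallel_size = 2
npus_per_node = 4  # each of the two nodes has 4 NPUs

world_size = tensor_parallel_size * pipeline_parallel_size
print(world_size)              # 8
print(world_size > npus_per_node)  # True: cannot fit on one node,
                                   # hence the "max local Ascend npu
                                   # count (4)" complaint
```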

After adding the --distributed-executor-backend=ray flag, I get a different error:
ray.exceptions.RaySystemError: System error: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu
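One way to read this error, as a small sketch built from the backend list quoted in the message itself: "npu" is not one of the device types stock PyTorch knows how to parse, because Ascend support is registered at import time by torch_npu. Any process that parses or deserializes a torch.device("npu") therefore needs torch_npu imported first.

```python
# Device types stock PyTorch 2.1 accepts, copied verbatim from the
# error message above.
builtin_device_types = (
    "cpu cuda ipu xpu mkldnn opengl opencl ideep hip ve fpga maia "
    "xla lazy vulkan mps meta hpu mtia privateuseone"
).split()

# "npu" is absent: torch_npu registers the Ascend backend when it is
# imported, so a Ray worker that rebuilds a torch.device("npu") before
# importing torch_npu hits exactly this RuntimeError.
print("npu" in builtin_device_types)  # False
```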

How would you like to use vllm

I would like to get multi-node inference working.

@wangshuai09
Owner

ray.exceptions.RaySystemError: System error: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu — which exact line of code raises this error?

@glowwormX
Author

Thanks for the reply. It looks like the error starts from worker_ip = ray.get(worker.get_node_ip.remote()) in ray_npu_executor.py:

      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_npu_executor.py", line 231, in __init__
        super().__init__(*args, **kwargs)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_gpu_executor.py", line 504, in __init__
        super().__init__(*args, **kwargs)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
        super().__init__(*args, **kwargs)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/executor_base.py", line 36, in __init__
        self._init_executor()
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_gpu_executor.py", line 64, in _init_executor
        self._init_workers_ray(placement_group)
      File "/opt/dataset/data_dir/vllm_community/vllm/executor/ray_npu_executor.py", line 73, in _init_workers_ray
        worker_ip = ray.get(worker.get_node_ip.remote())
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
        return fn(*args, **kwargs)
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
        return func(*args, **kwargs)
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/worker.py", line 2753, in get
        values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
      File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/ray/_private/worker.py", line 906, in get_objects
        raise value
    ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::RayWorkerWrapper.__init__() (pid=3119, ip=172.16.11.218, actor_id=b23b7f6d8d9b96e753fcdded01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0xffd00edac700>)
      At least one of the input arguments for this task could not be computed:
    ray.exceptions.RaySystemError: System error: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu
    traceback: Traceback (most recent call last):
    RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, maia, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: npu
    [ERROR] 2025-01-08-11:38:18 (PID:2586, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception

@wangshuai09
Owner

I have seen a similar problem in other third-party libraries before; it was caused by torch_npu not being imported. Could you try adding import torch_npu before the failing code and see whether that fixes it?

@glowwormX
Copy link
Author

> I have seen a similar problem in other third-party libraries before; it was caused by torch_npu not being imported. Could you try adding import torch_npu before the failing code and see whether that fixes it?

I tried that; it doesn't work, the error is the same.
