[Bug]: 2 nodes serving hanging #8447
Comments
I tried test.py: a single node passes it, but multiple nodes fail.

```
NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 test.py
```

It dies with a traceback ("The above exception was the direct cause of the following exception: Traceback (most recent call last): ..."). Did I type the IP address wrong? The Ray cluster can be initialized with this address, and "ray status" shows the correct information.
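For reference, the sanity check in test.py boils down to an all-reduce over both NCCL and gloo; here is a minimal sketch (condensed from vLLM's debugging docs, so the exact contents are an assumption):

```bash
# Sketch of the multi-node sanity check: write a minimal all-reduce test,
# then launch it with the same torchrun command on BOTH nodes.
cat > test.py <<'EOF'
import torch
import torch.distributed as dist

# NCCL (GPU) path: after summing ones across ranks, every element == world size
dist.init_process_group(backend="nccl")  # env:// rendezvous, populated by torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
world_size = dist.get_world_size()
assert data.mean().item() == world_size

# gloo (CPU) path: same check over TCP sockets, to separate NCCL/fabric
# problems from plain networking problems
gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.ones(128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
assert cpu_data.mean().item() == world_size
print("sanity check is successful!")
EOF

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 \
    --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 test.py
```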
usually this means you need to set `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME` to the right network interface.
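For example (eth0 is a placeholder; use whichever interface carries the node's reachable IP):

```bash
# set on every node, in the environment the workers actually see
# (e.g. passed through as -e flags on docker run in run_cluster.sh)
export GLOO_SOCKET_IFNAME=eth0   # eth0 is an assumption; check `ip addr` first
export NCCL_SOCKET_IFNAME=eth0
```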
I set them both in the docker run command when I run run_cluster.sh.
sometimes it might be a DNS problem, which can be complicated. you might want to try skipping hostname resolution, i.e. manually assign the IP address.
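A quick reachability check from the worker node could look like this (29400 is torchrun's default c10d rendezvous port; treat the port as an assumption):

```bash
ping -c 1 10.1.160.69      # basic reachability of the head node
nc -zv 10.1.160.69 29400   # can we reach the c10d rendezvous port?
```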
that's strange then. what's your network config?
This is my ifconfig output:

```
bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
bond2: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
bond3: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
bond4: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
bond5: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
bond6: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
bond7: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
cbr0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 1500
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
docker_gwbridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
enp100s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp100s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp132s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp132s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp164s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp164s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp196s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp196s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp228s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp228s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp36s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp36s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp52s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp52s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp68s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
enp68s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
veth2b847be: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth3c302e63: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth41503f00: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
vethfae2868a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
```
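With that many interfaces, it is worth confirming which one actually owns the rendezvous address, e.g.:

```bash
# print the interface that carries 10.1.160.69, then use that name
# for GLOO_SOCKET_IFNAME / NCCL_SOCKET_IFNAME
ip -o addr show | grep 10.1.160.69
```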
your network config is so complicated. i suggest you talk to your network admin, to make sure the test script can run normally first.
I replaced the arguments --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 with --master_addr and --master_port.
if you use --master_addr and --master_port, you also need to add --node-rank for each node.
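i.e. something like the following (29500 is torchrun's default master port, assumed here):

```bash
# node 0 (the master):
torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 \
    --master_addr 10.1.160.69 --master_port 29500 test.py
# node 1: identical command except for the rank
torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 \
    --master_addr 10.1.160.69 --master_port 29500 test.py
```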
yeah, --node-rank is already set on both nodes, but the NCCL result is still wrong.
interesting. what's your full command?
your hardware / driver / nccl might be broken.
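One way to test that outside vLLM/torchrun is NVIDIA's nccl-tests (a sketch; it assumes the binaries are built from https://github.com/NVIDIA/nccl-tests and that OpenMPI can reach both nodes):

```bash
# 2 nodes x 2 GPUs: one process per GPU, all-reduce from 8 B up to 128 MB;
# clean output and consistent bus bandwidth suggests NCCL/driver/fabric are OK
mpirun -np 4 -H 10.1.160.69:2,10.1.160.68:2 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```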
if I use master_addr and master_port to run test.py and ignore the NCCL output assertion (the gloo test passes), the program hangs.
looking at the logs above (the NET/IB "Got completion ... vendor err" warnings): this means your IB config is wrong, and there are errors there. you need to contact your admin to fix it first.
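Standard first-pass IB checks look like this (the tools come with infiniband-diags / rdma-core, which is an assumption about the image):

```bash
ibstat        # per-port state and link layer (InfiniBand vs Ethernet/RoCE)
ibv_devinfo   # device and port attributes, including GID info
```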
`export NCCL_IB_GID_INDEX=3` fixed this (the default setting uses GID 0/1, and IPv6 is not set up correctly), thank you!
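For context: on a RoCE NIC the GID table typically has the IPv6 link-local entries at indices 0/1 and the RoCE v2 / IPv4 entry at index 3, which matches the fe80:: localGid in the logs above. The table can be listed with the show_gids script from Mellanox OFED (assuming MLNX_OFED is installed):

```bash
show_gids                     # GID index / RoCE version / IP for each port
export NCCL_IB_GID_INDEX=3    # pick the RoCE v2 / IPv4 entry
```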
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
running command:

```bash
export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export VLLM_TRACE_FUNCTION=1
python3 api_server.py \
    --model my_model \
    -tp 8 \
    -pp 2 \
    --enforce-eager \
    --max-num-seqs=32 \
    --dtype=bfloat16 \
    --worker-use-ray \
    --gpu-memory-utilization 0.8
```
the logs stop here (the server hangs):
```
INFO 09-12 23:40:47 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-12 23:40:47 pynccl.py:63] vLLM is using nccl==2.20.5
VM-160-69-tencentos:561:561 [0] NCCL INFO Using non-device net plugin version 0
VM-160-69-tencentos:561:561 [0] NCCL INFO Using network IB
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0xa015ae7e0a50005d - Init START
(RayWorkerWrapper pid=248, ip=10.1.160.68) INFO 09-12 23:40:47 utils.py:977] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=248, ip=10.1.160.68) INFO 09-12 23:40:47 pynccl.py:63] vLLM is using nccl==2.20.5
VM-160-69-tencentos:561:561 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ffffffff,ffffffff,ffffffff
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/02 : 0 1
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/02 : 0 1
VM-160-69-tencentos:561:561 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
VM-160-69-tencentos:561:561 [0] NCCL INFO P2P Chunksize set to 131072
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Connected all rings
VM-160-69-tencentos:561:561 [0] NCCL INFO Connected all trees
VM-160-69-tencentos:561:561 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-160-69-tencentos:561:561 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0xa015ae7e0a50005d - Init COMPLETE
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 249 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 244 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 249 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
```