[Bug]: 2 nodes serving hanging #8447

Closed · AlvL1225 opened this issue Sep 13, 2024 · 19 comments · Fixed by #8514
Labels: bug (Something isn't working)
Comments

@AlvL1225 commented Sep 13, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H800
GPU 1: NVIDIA H800
GPU 2: NVIDIA H800
GPU 3: NVIDIA H800
GPU 4: NVIDIA H800
GPU 5: NVIDIA H800
GPU 6: NVIDIA H800
GPU 7: NVIDIA H800

Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          200
On-line CPU(s) list:             0-199
Thread(s) per core:              2
Core(s) per socket:              50
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8480+
Stepping:                        6
CPU MHz:                         2000.000
BogoMIPS:                        4000.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.7 MiB
L1i cache:                       3.1 MiB
L2 cache:                        200 MiB
L3 cache:                        210 MiB
NUMA node0 CPU(s):               0-99
NUMA node1 CPU(s):               100-199
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq movdiri movdir64b fsrm arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@COMMIT_HASH_PLACEHOLDER
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV8	NV8	NV8	NV8	NV8	NV8	NV8	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-99	0		N/A
GPU1	NV8	 X 	NV8	NV8	NV8	NV8	NV8	NV8	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	0-99	0		N/A
GPU2	NV8	NV8	 X 	NV8	NV8	NV8	NV8	NV8	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	0-99	0		N/A
GPU3	NV8	NV8	NV8	 X 	NV8	NV8	NV8	NV8	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	0-99	0		N/A
GPU4	NV8	NV8	NV8	NV8	 X 	NV8	NV8	NV8	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	100-199	1		N/A
GPU5	NV8	NV8	NV8	NV8	NV8	 X 	NV8	NV8	SYS	SYS	SYS	SYS	NODE	PIX	NODE	NODE	100-199	1		N/A
GPU6	NV8	NV8	NV8	NV8	NV8	NV8	 X 	NV8	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	100-199	1		N/A
GPU7	NV8	NV8	NV8	NV8	NV8	NV8	NV8	 X 	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	100-199	1		N/A
NIC0	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS
NIC1	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	SYS	SYS	SYS	SYS
NIC2	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	SYS	SYS	SYS	SYS
NIC3	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	SYS	SYS	SYS	SYS
NIC4	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE
NIC5	SYS	SYS	SYS	SYS	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE
NIC6	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE
NIC7	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7

Model Input Dumps

No response

🐛 Describe the bug

running command:

export NCCL_DEBUG=INFO
export CUDA_LAUNCH_BLOCKING=1
export VLLM_TRACE_FUNCTION=1

python3 api_server.py \
    --model my_model \
    -tp 8 \
    -pp 2 \
    --enforce-eager \
    --max-num-seqs=32 \
    --dtype=bfloat16 \
    --worker-use-ray \
    --gpu-memory-utilization 0.8

logs are hanging here:

INFO 09-12 23:40:47 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-12 23:40:47 pynccl.py:63] vLLM is using nccl==2.20.5
VM-160-69-tencentos:561:561 [0] NCCL INFO Using non-device net plugin version 0
VM-160-69-tencentos:561:561 [0] NCCL INFO Using network IB
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0xa015ae7e0a50005d - Init START
(RayWorkerWrapper pid=248, ip=10.1.160.68) INFO 09-12 23:40:47 utils.py:977] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=248, ip=10.1.160.68) INFO 09-12 23:40:47 pynccl.py:63] vLLM is using nccl==2.20.5
VM-160-69-tencentos:561:561 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ffffffff,ffffffff,ffffffff
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/02 : 0 1
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/02 : 0 1
VM-160-69-tencentos:561:561 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
VM-160-69-tencentos:561:561 [0] NCCL INFO P2P Chunksize set to 131072
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1/GDRDMA
VM-160-69-tencentos:561:561 [0] NCCL INFO Connected all rings
VM-160-69-tencentos:561:561 [0] NCCL INFO Connected all trees
VM-160-69-tencentos:561:561 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
VM-160-69-tencentos:561:561 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
VM-160-69-tencentos:561:561 [0] NCCL INFO comm 0x1476dca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0xa015ae7e0a50005d - Init COMPLETE

VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 249 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 244 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=5 opcode=0 len=0 vendor err 249 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c
VM-160-69-tencentos:561:13506 [0] NCCL INFO transport/net.cc:1298 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:694 -> 6
VM-160-69-tencentos:561:13506 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@AlvL1225 added the bug label Sep 13, 2024
@youkaichao (Member)

did you try https://docs.vllm.ai/en/latest/getting_started/debugging.html ?

@AlvL1225 (Author)

did you try https://docs.vllm.ai/en/latest/getting_started/debugging.html ?

I tried the test.py: the single-node run passes, but the multi-node run fails.

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 test.py
W0912 23:57:32.917000 140318955108160 torch/distributed/run.py:779]
W0912 23:57:32.917000 140318955108160 torch/distributed/run.py:779] *****************************************
W0912 23:57:32.917000 140318955108160 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0912 23:57:32.917000 140318955108160 torch/distributed/run.py:779] *****************************************
[E912 23:58:31.579984615 socket.cpp:957] [c10d] The client socket has timed out after 60s while trying to connect to (10.1.160.69, 29400).
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 158, in _create_tcp_store
    store = TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 60s while trying to connect to (10.1.160.69, 29400).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/registry.py", line 66, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/api.py", line 347, in create_handler
    handler = creator(params)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 256, in create_backend
    store = _create_tcp_store(params)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 182, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

Did I type the IP address wrong? The Ray cluster can be initialized with this address, and "ray status" shows the correct information.
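
For context, the test.py referenced in the debugging guide is essentially a multi-node all_reduce sanity check. A minimal sketch of such a script (an approximation, not necessarily the exact script from the docs; run with the same torchrun command on every node) looks like this:

# sanity_check.py -- a hypothetical minimal multi-node all_reduce test,
# launched via torchrun on every node; not the exact script from the vLLM docs.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
world_size = dist.get_world_size()

# gloo all_reduce on CPU: checks basic TCP connectivity between the nodes.
cpu_data = torch.ones(1)
dist.all_reduce(cpu_data)
assert cpu_data.item() == world_size, f"Expected {world_size}, got {cpu_data.item()}"

# NCCL all_reduce on GPU: checks the GPU / IB fabric between the nodes.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
nccl_group = dist.new_group(backend="nccl")
gpu_data = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(gpu_data, group=nccl_group)
torch.cuda.synchronize()
assert gpu_data.item() == world_size, f"Expected {world_size}, got {gpu_data.item()}"

print(f"rank {dist.get_rank()} of {world_size}: sanity check passed")

If the gloo (CPU) assert passes but the NCCL (GPU) assert fails or hangs, the problem is on the GPU/IB path rather than basic TCP reachability.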

@youkaichao (Member)

usually this means you need to set GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME . see https://docs.vllm.ai/en/latest/getting_started/debugging.html

@AlvL1225 (Author)

usually this means you need to set GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME . see https://docs.vllm.ai/en/latest/getting_started/debugging.html

I set them both when I run run_cluster.sh:

docker run \
    --privileged \
    -e NCCL_IB_HCA=mlx5 \
    -e NCCL_SOCKET_IFNAME=eth0 \
    -e GLOO_SOCKET_IFNAME=eth0 \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${MOUNT_PATH}:${MOUNT_PATH}" \
    ${ADDITIONAL_ARGS} \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"

@youkaichao (Member)

sometimes it might be a DNS problem, which can be complicated.

you might want to try

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 test.py
NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 test.py

i.e. manually assign --node-rank in the command. of course, please make sure the node with --node-rank 0 has the IP 10.1.160.69.

@AlvL1225 (Author)

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 test.py

[screenshot]

I tried it with 10.1.160.69 as node rank 0 and 10.1.160.68 as node rank 1, but still got the timeout error.

I can ping from 10.1.160.68 to 10.1.160.69 with very low latency.

@youkaichao (Member)

that's strange then. what's your network config?

@AlvL1225 (Author)

that's strange then. what's your network config?

This is my ifconfig output:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.138 netmask 255.255.255.252 broadcast 30.140.15.139
inet6 fe80::a288:c2ff:fe16:5c9c prefixlen 64 scopeid 0x20
ether a0:88:c2:16:5c:9c txqueuelen 1000 (Ethernet)
RX packets 63224789 bytes 7447411194 (6.9 GiB)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 21118659 bytes 2565794851 (2.3 GiB)
TX errors 0 dropped 5 overruns 0 carrier 0 collisions 0

bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.162 netmask 255.255.255.252 broadcast 30.140.15.163
inet6 fe80::a288:c2ff:fe16:389c prefixlen 64 scopeid 0x20
ether a0:88:c2:16:38:9c txqueuelen 1000 (Ethernet)
RX packets 63324266 bytes 7473656708 (6.9 GiB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 21119640 bytes 2565859398 (2.3 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

bond2: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.166 netmask 255.255.255.252 broadcast 30.140.15.167
inet6 fe80::a288:c2ff:fe16:b13c prefixlen 64 scopeid 0x20
ether a0:88:c2:16:b1:3c txqueuelen 1000 (Ethernet)
RX packets 63202899 bytes 7444408387 (6.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 21127245 bytes 2565742553 (2.3 GiB)
TX errors 88 dropped 1 overruns 0 carrier 88 collisions 0

bond3: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.154 netmask 255.255.255.252 broadcast 30.140.15.155
inet6 fe80::a288:c2ff:fe16:65ac prefixlen 64 scopeid 0x20
ether a0:88:c2:16:65:ac txqueuelen 1000 (Ethernet)
RX packets 63224943 bytes 7447418511 (6.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 21119098 bytes 2565821123 (2.3 GiB)
TX errors 0 dropped 1 overruns 0 carrier 0 collisions 0

bond4: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.170 netmask 255.255.255.252 broadcast 30.140.15.171
inet6 fe80::a288:c2ff:fe12:1384 prefixlen 64 scopeid 0x20
ether a0:88:c2:12:13:84 txqueuelen 1000 (Ethernet)
RX packets 63225641 bytes 7447462276 (6.9 GiB)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 21120236 bytes 2565889621 (2.3 GiB)
TX errors 0 dropped 1 overruns 0 carrier 0 collisions 0

bond5: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.202 netmask 255.255.255.252 broadcast 30.140.15.203
inet6 fe80::a288:c2ff:fe16:3bec prefixlen 64 scopeid 0x20
ether a0:88:c2:16:3b:ec txqueuelen 1000 (Ethernet)
RX packets 63224723 bytes 7447406167 (6.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 21121490 bytes 2566000212 (2.3 GiB)
TX errors 0 dropped 1 overruns 0 carrier 0 collisions 0

bond6: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.206 netmask 255.255.255.252 broadcast 30.140.15.207
inet6 fe80::a288:c2ff:fe14:3c96 prefixlen 64 scopeid 0x20
ether a0:88:c2:14:3c:96 txqueuelen 1000 (Ethernet)
RX packets 63301126 bytes 7465396999 (6.9 GiB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 21296651 bytes 2608830284 (2.4 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

bond7: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9100
inet 30.140.15.194 netmask 255.255.255.252 broadcast 30.140.15.195
inet6 fe80::a288:c2ff:fe16:5a7c prefixlen 64 scopeid 0x20
ether a0:88:c2:16:5a:7c txqueuelen 1000 (Ethernet)
RX packets 63699185 bytes 7475868473 (6.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 357381921 bytes 41217258691 (38.3 GiB)
TX errors 0 dropped 1 overruns 0 carrier 0 collisions 0

cbr0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 1500
inet 10.32.82.65 netmask 255.255.255.192 broadcast 10.32.82.127
inet6 fe80::459:ecff:fe34:f3df prefixlen 64 scopeid 0x20
ether 06:59:ec:34:f3:df txqueuelen 1000 (Ethernet)
RX packets 113471191750 bytes 613839555096150 (558.2 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 160573624391 bytes 2005040284425576 (1.7 PiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
inet6 fe80::42:f9ff:fe85:8660 prefixlen 64 scopeid 0x20
ether 02:42:f9:85:86:60 txqueuelen 0 (Ethernet)
RX packets 37745 bytes 91972412 (87.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 45472 bytes 2948688 (2.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

docker_gwbridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.18.0.1 netmask 255.255.0.0 broadcast 172.18.255.255
inet6 fe80::42:ffff:fe1c:5dd5 prefixlen 64 scopeid 0x20
ether 02:42:ff:1c:5d:d5 txqueuelen 0 (Ethernet)
RX packets 52657212 bytes 6151133805 (5.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10559805 bytes 1282925921 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp100s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:65:ac txqueuelen 1000 (Ethernet)
RX packets 52608457 bytes 6138207935 (5.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10559560 bytes 1282913634 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp100s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:65:ac txqueuelen 1000 (Ethernet)
RX packets 10616486 bytes 1309210576 (1.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10559540 bytes 1282907737 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp132s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:12:13:84 txqueuelen 1000 (Ethernet)
RX packets 10617800 bytes 1309291719 (1.2 GiB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 10560135 bytes 1282948720 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp132s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:12:13:84 txqueuelen 1000 (Ethernet)
RX packets 52607841 bytes 6138170557 (5.7 GiB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 10560103 bytes 1282941149 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp164s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:3b:ec txqueuelen 1000 (Ethernet)
RX packets 52606869 bytes 6138115215 (5.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10560764 bytes 1283004321 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp164s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:3b:ec txqueuelen 1000 (Ethernet)
RX packets 10617854 bytes 1309290952 (1.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10560728 bytes 1282996139 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp196s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:14:3c:96 txqueuelen 1000 (Ethernet)
RX packets 52646292 bytes 6147130872 (5.7 GiB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 10648310 bytes 1304433008 (1.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp196s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:14:3c:96 txqueuelen 1000 (Ethernet)
RX packets 10654835 bytes 1318266242 (1.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10648343 bytes 1304397524 (1.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp228s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:5a:7c txqueuelen 1000 (Ethernet)
RX packets 52844136 bytes 6152346379 (5.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 178692129 bytes 20608765676 (19.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp228s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:5a:7c txqueuelen 1000 (Ethernet)
RX packets 10855051 bytes 1323522324 (1.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 178689794 bytes 20608493263 (19.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp36s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:5c:9c txqueuelen 1000 (Ethernet)
RX packets 52605682 bytes 6138042213 (5.7 GiB)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 10559361 bytes 1282902935 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp36s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:5c:9c txqueuelen 1000 (Ethernet)
RX packets 10619108 bytes 1309369096 (1.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10559300 bytes 1282892164 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp52s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:38:9c txqueuelen 1000 (Ethernet)
RX packets 10667054 bytes 1322522903 (1.2 GiB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 10559837 bytes 1282933725 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp52s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:38:9c txqueuelen 1000 (Ethernet)
RX packets 52657212 bytes 6151133805 (5.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10559805 bytes 1282925921 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp68s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:b1:3c txqueuelen 1000 (Ethernet)
RX packets 10595389 bytes 1306253865 (1.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10553781 bytes 1281830040 (1.1 GiB)
TX errors 88 dropped 0 overruns 0 carrier 88 collisions 0

enp68s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9100
ether a0:88:c2:16:b1:3c txqueuelen 1000 (Ethernet)
RX packets 52607526 bytes 6138156270 (5.7 GiB)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 10573470 bytes 1283913117 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.1.160.69 netmask 255.255.224.0 broadcast 10.1.191.255
inet6 fe80::5054:ff:fe96:5162 prefixlen 64 scopeid 0x20
ether 52:54:00:96:51:62 txqueuelen 1000 (Ethernet)
RX packets 1461339011379 bytes 2084136517831586 (1.8 PiB)
RX errors 0 dropped 93 overruns 0 frame 0
TX packets 523726552885 bytes 641622140322310 (583.5 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 1000 (Local Loopback)
RX packets 143825491 bytes 185308359189 (172.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 143825491 bytes 185308359189 (172.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

veth2b847be: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::7482:c0ff:fe68:f97a prefixlen 64 scopeid 0x20
ether 76:82:c0:68:f9:7a txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1014 bytes 71132 (69.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

veth3c302e63: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::906f:81ff:fed4:f70 prefixlen 64 scopeid 0x20
ether 92:6f:81:d4:0f:70 txqueuelen 0 (Ethernet)
RX packets 42761986 bytes 201845674061 (187.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 37748579 bytes 126737158120 (118.0 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

veth41503f00: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::c00a:78ff:fe78:20e6 prefixlen 64 scopeid 0x20
ether c2:0a:78:78:20:e6 txqueuelen 0 (Ethernet)
RX packets 28286218 bytes 3422794430 (3.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 33335691 bytes 403511488681 (375.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vethfae2868a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::580a:5cff:fed2:ee59 prefixlen 64 scopeid 0x20
ether 5a:0a:5c:d2:ee:59 txqueuelen 0 (Ethernet)
RX packets 3044708 bytes 2039749923 (1.8 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3317867 bytes 430805180 (410.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

@youkaichao (Member)

your network config is so complicated. I suggest you talk to your network admin to make sure the test script can run normally first.

@AlvL1225 (Author) commented Sep 13, 2024

your network config is so complicated. I suggest you talk to your network admin to make sure the test script can run normally first.

I replaced the arguments --rdzv_backend=c10d --rdzv_endpoint=10.1.160.69 with --master_addr 10.1.160.69 --master_port 23456.
The connection can be built, but the first assertion triggered: AssertionError: Expected 4, got 2.0.
It seems that the all_reduce sum got the wrong value.

@youkaichao (Member)

if you use --master_addr 10.1.160.69 --master_port 23456, I think you still need to add --node-rank?

@AlvL1225 (Author)

if you use --master_addr 10.1.160.69 --master_port 23456, I think you still need to add --node-rank?

yeah, --node-rank is still added on both nodes, but the NCCL result is wrong

@youkaichao (Member)

interesting. what's your full command?

@AlvL1225 (Author) commented Sep 13, 2024

interesting. what's your full command?

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --node-rank 0 --master_addr 10.1.160.68 --master_port 23456 test.py

global world size = 16, however:

node0: [screenshot]
node1: [screenshot]

@AlvL1225 (Author) commented Sep 13, 2024

interesting. what's your full command?

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --node-rank 0 --master_addr 10.1.160.68 --master_port 23456 test.py (this time I changed 10.1.160.68 to be the master)

global world size = 16, however:

node0: [screenshot]

node1: [screenshot]

node 1 got an all_reduce_sum result of 0, which is strange.

@youkaichao (Member)

your hardware / driver / nccl might be broken.

@AlvL1225 (Author) commented Sep 13, 2024

your hardware / driver / nccl might be broken.

if I use master_addr and master_port to run test.py and ignore the NCCL output assertion (the gloo test passes), the program hangs at
"pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)"
and the specific position is in the __init__ function of PyNcclCommunicator:

with torch.cuda.device(device):
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
        self.world_size, self.unique_id, self.rank)
    self.stream = torch.cuda.Stream()

    # A small all_reduce for warmup.
    data = torch.zeros(1, device=device)
    self.all_reduce(data)
    # hang here ############
    self.stream.synchronize()
    del data

@youkaichao (Member)

looking at the logs above:

VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c

this means your IB config is wrong, and there are errors there. you need to contact your admin to fix it first.
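
As a side note on narrowing this down before involving the admin: the GID table of each HCA port can be read from sysfs, which shows which GID index holds a routable RoCE v2 entry (the index NCCL selects via NCCL_IB_GID_INDEX). A rough sketch only, assuming the device name mlx5_bond_0 from the NIC legend above and port 1:

# dump_gids.py -- hypothetical helper: list the non-empty GID table entries
# of one HCA port, so you can see which index carries which address type.
from pathlib import Path

port = Path("/sys/class/infiniband/mlx5_bond_0/ports/1")  # adjust to your HCA

for gid_file in sorted((port / "gids").iterdir(), key=lambda p: int(p.name)):
    gid = gid_file.read_text().strip()
    if gid == "0000:0000:0000:0000:0000:0000:0000:0000":
        continue  # unused table slot
    try:
        gid_type = (port / "gid_attrs" / "types" / gid_file.name).read_text().strip()
    except OSError:
        gid_type = "unknown"
    print(f"GID index {gid_file.name}: {gid} ({gid_type})")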

@AlvL1225 (Author)

looking at the logs above:

VM-160-69-tencentos:561:13506 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.1.160.68<57552> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::a288:c2ff:fe16:5c9c remoteGidsfe80::a288:c2ff:fe16:4e4c

this means your IB config is wrong, and there are errors there. you need to contact your admin to fix it first.

export NCCL_IB_GID_INDEX=3 fixed this (the default setting uses GID 0/1, and IPv6 is not set up correctly), thank you!
