[BUG] Unexpected Error with Existing IP Address on bridge0 Interface #1903

Open · hotwa opened this issue Feb 25, 2025 · 3 comments

hotwa commented Feb 25, 2025

Describe the bug

I encountered an issue on my Mac mini where the bridge0 interface has a valid IP address, yet the launch still fails with an error reporting that the IP does not exist. Here is the output of ifconfig:

bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 9000
        options=63<RXCSUM,TXCSUM,TSO4,TSO6>
        ether 36:6a:0f:f5:b0:80
        inet6 fe80::1442:a0bf:3f10:95ba%bridge0 prefixlen 64 secured scopeid 0x12 
        inet 10.25.0.1 netmask 0xffffff00 broadcast 10.25.0.255
        Configuration:
                id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
                maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
                root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
                ipfilter disabled flags 0x0
        member: en2 flags=3<LEARNING,DISCOVER>
                ifmaxaddr 0 port 11 priority 0 path cost 0
        member: en3 flags=3<LEARNING,DISCOVER>
                ifmaxaddr 0 port 12 priority 0 path cost 0
        member: en4 flags=3<LEARNING,DISCOVER>
                ifmaxaddr 0 port 13 priority 0 path cost 0
        nd6 options=201<PERFORMNUD,DAD>
        media: autoselect
        status: active

As shown, the inet 10.25.0.1 address is clearly assigned to the bridge0 interface.
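
As a sanity check, the same address can also be queried with the stock macOS tools (nothing mlx-specific here):

ipconfig getifaddr bridge0      # should print 10.25.0.1
ifconfig bridge0 inet           # should show only the inet 10.25.0.1 line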

To Reproduce

mlx.launch \
  --hostfile /Volumes/long990max/hosts.json \
  --backend mpi \
  --mpi-arg "--mca btl tcp,self \
             --mca btl_tcp_if_include 10.25.0.0/24 \
             --mca oob_tcp_if_include 10.25.0.0/24 \
             --mca oob_tcp_disable_family ipv6 \
             --mca btl_tcp_links 4 \
             --mca plm_base_verbose 100 \
             --mca btl_base_verbose 100" \
  /Volumes/long990max/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 64 \
  --model /Volumes/long990max/exo_data/downloads/mlx-community--DeepSeek-R1-4bit \
  --verbose
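
A note on the MCA parameters above: Open MPI's btl_tcp_if_include and oob_tcp_if_include accept either CIDR subnets or interface names, but the two forms cannot be mixed in a single value. If the CIDR form triggers the "IP does not exist" error, the interface-name form is an equivalent alternative here, since bridge0 carries 10.25.0.1/24:

--mca btl_tcp_if_include bridge0 \
--mca oob_tcp_if_include bridge0 \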

Expected behavior
The system or application should recognize the IP address 10.25.0.1 on the bridge0 interface without reporting an error.

Desktop (please complete the following information):

  • OS Version: [e.g. MacOS 15.3.1]
  • mlx 0.23.1
  • mlx-lm 0.21.4



hotwa commented Feb 25, 2025

run.log

This is my log.


hotwa commented Feb 25, 2025

Inference Across 8 Machines with mlx.launch – Thunderbolt Network Instability and Remote Daemon Failure

Description:
While running mlx.launch for inference across an 8-machine cluster, I encountered an issue where one of the machines (Mac-mini-3) reported a remote daemon failure and was disconnected during the process. The shared storage located at /Volumes/long990max is provided via a Thunderbolt network bridge with Samba service. During the inference run, I was monitoring Mac-mini-3 via VNC, and it suddenly disconnected. Shortly after, the shared storage was unmounted, and the following error appeared:

PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-Mac-mini-1-39364@0,0] on node Mac-mini-1
  Remote daemon: [prterun-Mac-mini-1-39364@0,2] on node Mac-mini-3

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

Setup Details:
  • Shared Storage: NAS connected via a Thunderbolt 5 cable and shared over Samba (/Volumes/long990max)
  • Cluster: 8 Mac minis connected through a Thunderbolt network bridge

Command Used:

mlx.launch \
  --hostfile /Volumes/long990max/hosts.json \
  --backend mpi \
  --mpi-arg "--mca btl tcp,self \
             --mca btl_tcp_if_include bridge0 \
             --mca oob_tcp_if_include bridge0 \
             --mca oob_tcp_disable_family ipv6 \
             --mca btl_tcp_links 4 \
             --mca plm_base_verbose 100 \
             --mca btl_base_verbose 100" \
  /Volumes/long990max/pipeline_generate.py \
  --prompt 'What number is larger 6.9 or 6.11?' \
  --max-tokens 64 \
  --model /Volumes/long990max/exo_data/downloads/mlx-community--DeepSeek-R1-4bit
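
Since --mca btl_tcp_if_include bridge0 requires every rank to find a bridge0 interface with an address in the same subnet, it may be worth confirming that on all eight machines before launching. A rough sketch (the full host list and ssh access are assumptions based on the node names above):

# Verify every node has bridge0 up with an IPv4 address (hostnames assumed)
for host in Mac-mini-1 Mac-mini-2 Mac-mini-3 Mac-mini-4 \
            Mac-mini-5 Mac-mini-6 Mac-mini-7 Mac-mini-8; do
  echo "== $host =="
  ssh "$host" 'ipconfig getifaddr bridge0 || echo "no IPv4 on bridge0"'
done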

Observed Issues:
  • Network Instability: the VNC connection to Mac-mini-3 was interrupted during the inference process.
  • Storage Unmounting: the Samba-based shared storage (/Volumes/long990max) was unexpectedly unmounted.
  • PRTE Daemon Failure: the remote daemon on Mac-mini-3 was lost, causing the entire job to terminate.

Potential Causes:
  • Thunderbolt 5 Cable Quality: could this issue be related to the quality or stability of the Thunderbolt 5 cable used for the network bridge?
  • Network Bridge Configuration: is there a chance the bridge0 interface is not handling high-throughput MPI traffic efficiently?
  • Samba Service Instability: could the Samba service on the shared storage have been disrupted by the network issue, causing the unmounting?

Request for Help:
  • Has anyone experienced similar issues with Thunderbolt-based network bridges during high-performance computing tasks?
  • Could this be related to Thunderbolt 5 cable quality, or is it more likely a software/network configuration issue?
  • Are there recommended diagnostic steps or settings to ensure stability in such a setup? (A small monitoring sketch follows this list.)
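
On the last point, one way to separate a flaky physical link from a software problem is to leave a continuous probe running on the bridge while a job executes. A minimal sketch, assuming Mac-mini-3 is reachable at 10.25.0.3 (that address is an assumption based on the 10.25.0.0/24 subnet above):

# Log every ping over the bridge with a timestamp, so a link drop can
# be matched against the PRTE failure time (10.25.0.3 is assumed here)
ping -i 1 10.25.0.3 | while read -r line; do
  printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
done >> /tmp/bridge0_ping.log &

# In parallel, watch for the Samba share disappearing
while sleep 5; do
  mount | grep -q '/Volumes/long990max' || echo "$(date) share unmounted"
done >> /tmp/smb_mount.log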

angeloskath (Member) commented

Well, unfortunately we have also observed some network instability when using the bridge interface. A possible solution, which would however be a bit more obtrusive as it disables the bridge, is to set up a Thunderbolt ring using the script in #1902.

I haven't tested it with DeepSeek pipelining, and the ring backend doesn't yet implement all_gather, nor send/recv between nodes that aren't direct neighbors.
