This is for the LANL/SLAC project, low priority (since we have a workaround).
The following program freezes when you run it on multiple nodes:
On a single node, even with multiple ranks, this finishes in about 30 seconds or less:
$ LEGATE_TEST=1 legate --nodes=1 --ranks-per-node=4 --launcher=mpirun --launcher-extra="--oversubscribe" --gpus=1 --fbmem=30000 --gpu-bind=0/1/2/3 --omps=1 --ompthreads=16 --sysmem=50000 --cpu-bind=0-31/32-63/64-95/96-127 ./test_sparse.py
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host: cn0
  Local device: mlx5_0
--------------------------------------------------------------------------
[cn0:2492802] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn0:2492802] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The same script on multiple nodes simply never completes:
(Prints the same warnings and then freezes.)
I have confirmed that simpler scripts work on multiple nodes on this machine. In fact, the only problematic line is the shuffle:
I have been working around this with:
That's fine for now, but could be a problem later for me.
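For reference, here is a minimal sketch of one generic way to avoid an in-place shuffle. It is not necessarily the workaround referred to above, which is not reproduced here. It assumes a NumPy-compatible API and uses plain NumPy; `shuffled_copy` and its `seed` parameter are hypothetical names, and whether the same calls run correctly on multiple nodes under Legate is untested.

```python
import numpy as np


def shuffled_copy(values, seed=0):
    """Return a randomly permuted copy of `values` without calling shuffle().

    Hypothetical helper (not from the original report): it draws one random
    key per element and orders the elements by those keys, so only random
    number generation, argsort, and indexing are required.
    """
    rng = np.random.default_rng(seed)
    keys = rng.random(len(values))   # one uniform key per element
    order = np.argsort(keys)         # sorting the keys yields a random permutation
    return np.asarray(values)[order]


data = np.arange(10)
result = shuffled_copy(data, seed=42)
```

The input array is left untouched and the result holds the same elements in a different order, so this can stand in for an in-place shuffle wherever a permuted copy is acceptable.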
Versions: