
[BUG] Multi-node freeze on array shuffle #1167

Open
elliottslaughter opened this issue Feb 6, 2025 · 0 comments
This is for the LANL/SLAC project, low priority (since we have a workaround).

The following program freezes when you run it on multiple nodes:

import cupynumeric as cpn

N = 1_000_000
M = 100

a = cpn.random.rand(N)       # sort keys
b = cpn.random.rand(M, N)    # values to shuffle
order = cpn.argsort(a)       # permutation that sorts a
c = b[:, order]              # apply the permutation along axis 1
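
The indexing expression c = b[:, order] gathers the columns of b in the order produced by argsort. For reference, the same semantics in plain NumPy on a tiny input (just an illustration, independent of cupynumeric):

import numpy as np

a = np.array([0.3, 0.1, 0.2])
b = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

order = np.argsort(a)            # [1, 2, 0]
c = b[:, order]                  # columns permuted: [[1, 2, 0], [4, 5, 3]]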

On a single node, even with multiple ranks, this finishes in about 30 seconds or less:

$ LEGATE_TEST=1 legate --nodes=1 --ranks-per-node=4 --launcher=mpirun --launcher-extra="--oversubscribe" --gpus=1 --fbmem=30000 --gpu-bind=0/1/2/3 --omps=1 --ompthreads=16 --sysmem=50000 --cpu-bind=0-31/32-63/64-95/96-127 ./test_sparse.py
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   cn0
  Local device: mlx5_0
--------------------------------------------------------------------------
[cn0:2492802] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn0:2492802] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The same script on multiple nodes simply never completes:

$ LEGATE_TEST=1 legate --nodes=2 --ranks-per-node=4 --launcher=mpirun --launcher-extra="--oversubscribe" --gpus=1 --fbmem=30000 --gpu-bind=0/1/2/3 --omps=1 --ompthreads=16 --sysmem=50000 --cpu-bind=0-31/32-63/64-95/96-127 ../test_sparse.py

(It prints the same warnings and then freezes.)

I have confirmed that simpler scripts work on multiple nodes on this machine. In fact, the only problematic line is the shuffle:

c = b[:, order]

I have been working around this by gathering on the host with NumPy (this requires import numpy as np):

c = cpn.array(np.array(b)[:, order])

That's fine for now, but since the round trip materializes the whole array in host memory, it could become a problem for me later.
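
For completeness, the full reproducer with the workaround applied looks like this (same N and M as above):

import cupynumeric as cpn
import numpy as np

N = 1_000_000
M = 100

a = cpn.random.rand(N)
b = cpn.random.rand(M, N)
order = cpn.argsort(a)

# Workaround: copy b to the host, gather there with NumPy, then copy back.
c = cpn.array(np.array(b)[:, order])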

Versions:

$ conda list legate                                                  
# packages in environment at /vast/projects/heliosteam/eslaught/miniforge3/envs/legate-2025-02-03:
#
# Name                    Version                   Build  Channel
legate                    25.03.00.dev8   cuda12_py312_g016ec28b_8_ucx_gpu    legate/label/experimental
$ nvidia-smi
Thu Feb  6 16:55:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:01:00.0 Off |                    0 |
| N/A   25C    P0             58W /  400W |   33475MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:41:00.0 Off |                    0 |
| N/A   22C    P0             56W /  400W |   33451MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   21C    P0             58W /  400W |   33451MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:C1:00.0 Off |                    0 |
| N/A   23C    P0             61W /  400W |   33987MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+