
[BUG] Default device memory allocation is too aggressive #1160

Open
JerryHOTS opened this issue Dec 4, 2024 · 4 comments

JerryHOTS commented Dec 4, 2024

Software versions

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
[0 - 7f06e35b3600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f06e35b3600]    0.000000 {4}{openmp}: numa support not found (or not working)
[0 - 7f06e35b3600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

For reference, here is the output of nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.72                 Driver Version: 566.14         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8              1W /  120W |    7825MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3830      C   /python3.12                                 N/A      |
+-----------------------------------------------------------------------------------------+

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

No error

Observed behavior

When I run legate-issue, or run import cupynumeric as np in Python, it raises the following error:

[0 - 7f06e35b3600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f06e35b3600]    0.000000 {4}{openmp}: numa support not found (or not working)
[0 - 7f06e35b3600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

Example code or instructions

legate-issue
import cupynumeric as np

Stack traceback or browser console output

No response

@manopapad
Contributor

Multiple issues here:

  • Legate is attempting by default to reserve a large proportion of the device memory, and that's failing.
    • @JerryHOTS could you please try running with LEGATE_SHOW_CONFIG=1? The run will still fail, but we'll be able to see how much device memory Legate is attempting to reserve.
    • @JerryHOTS could you please try running with LEGATE_CONFIG="--fbmem 1000"? This will instruct Legate to try and reserve less memory.
    • @muraj looks like memory allocation is failing with CUDA_ERROR_OUT_OF_MEMORY, but in cuIpcGetMemHandle. Is it possible that there is, in fact, enough memory, but we're hitting a different system limit related to CUDA-IPC?
  • Realm is complaining that we tried to instantiate NUMA-aligned memory, but that's not possible on the current system. This is not actually causing a failure in this case, but it is polluting the error message.
    • @eddy16112 is working on a Realm fix that would allow us to detect this issue before we put in our request for NUMA-aligned memory.
    • @JerryHOTS you can possibly make this go away by adding --omps 0 to LEGATE_CONFIG.
  • legate-issue is not working. This is because legate-issue itself tries to import cupynumeric, so if initialization is what's failing, legate-issue will hit the same error.
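The diagnostic runs suggested above can be written out as shell commands. This is a sketch: the flag values are the ones from this thread, and the final (commented-out) import assumes cupynumeric is installed on the machine being debugged.

```shell
# Ask Legate to print the configuration it computed at startup.
export LEGATE_SHOW_CONFIG=1
# Cap the device-memory ("framebuffer") reservation at 1000 MiB, and
# disable OpenMP processors to silence the NUMA warning.
export LEGATE_CONFIG="--fbmem 1000 --omps 0"
echo "$LEGATE_CONFIG"
# python -c "import cupynumeric"   # would now initialize with the smaller pool
```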

@JerryHOTS JerryHOTS closed this as not planned Dec 4, 2024
@JerryHOTS JerryHOTS reopened this Dec 4, 2024
@JerryHOTS
Author

JerryHOTS commented Dec 4, 2024

Thank you very much for your prompt response!

Running with LEGATE_SHOW_CONFIG=1 gives the following (the run still fails partway through):

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
Legate hardware configuration: --cpus=1 --gpus=1 --omps=1 --ompthreads=6 --utility=2 --sysmem=12677 --numamem=0 --fbmem=7778 --zcmem=128 --regmem=0
[0 - 7fc0c6d24600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7fc0c6d24600]    0.000000 {4}{openmp}: numa support not found (or not working)
[0 - 7fc0c6d24600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

Running with LEGATE_CONFIG="--fbmem 1000" gives: (successful run)

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
[0 - 7f67147be600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f67147be600]    0.000000 {4}{openmp}: numa support not found (or not working)
cuPynumeric :  24.11.02
Numpy       :  1.26.4
Scipy       :  1.14.1
Numba       :  (failed to detect)
CTK package :  cuda-version-12.6-h7480c83_3 (conda-forge)
GPU driver  :  566.14
GPU devices :  
  GPU 0: NVIDIA GeForce RTX 4070 Laptop GPU

Setting --omps 0 doesn't make the error go away, with or without "--fbmem 1000":

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
Legate hardware configuration: --cpus=1 --gpus=1 --omps=0 --ompthreads=0 --utility=2 --sysmem=12677 --numamem=0 --fbmem=7778 --zcmem=128 --regmem=0
[0 - 7fb3d0369600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7fb3d0369600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

@manopapad
Contributor

OK, so it looks like indeed Legate's default device memory reservation was too aggressive. It tried reserving 7778MiB, based on the total memory size of 8188MiB. But based on your nvidia-smi output, it looks like only 7825MiB is actually available, i.e. quite close to what Legate requested (although it's not clear where the rest is going).

It's still curious that the error occurs at cuIpcGetMemHandle rather than cuMemAlloc.

There is an ongoing discussion about shifting our device memory allocation logic to happen dynamically through CUDA memory allocation calls, rather than allocating a pool in the beginning and allocating out of that. But there are multiple issues to work through, so it won't be available in the immediate term.

For now, I suggest you explicitly tell Legate how much memory to reserve, e.g. LEGATE_CONFIG="--fbmem 7000", or whatever the maximum number is that works.
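One way to apply this advice is to derive --fbmem from the memory that is actually free rather than from the device total. The sketch below is hypothetical (safe_fbmem and the 512 MiB headroom are illustrative choices, not part of Legate); only the LEGATE_CONFIG environment variable comes from this thread, and it must be set before cupynumeric is imported.

```python
import os

def safe_fbmem(free_mib: int, headroom_mib: int = 512, floor_mib: int = 256) -> int:
    """Reserve somewhat less than the currently free device memory,
    never going below a small floor."""
    return max(floor_mib, free_mib - headroom_mib)

# With 7825 MiB available (the figure read from nvidia-smi above),
# request 7313 MiB instead of the default 7778 MiB.
fbmem = safe_fbmem(7825)
os.environ["LEGATE_CONFIG"] = f"--fbmem {fbmem}"
# import cupynumeric  # must happen *after* LEGATE_CONFIG is set
```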

@JerryHOTS
Author

Thank you again for your help! When it's convenient, I would appreciate it if you could let me know when the fix becomes available.

@manopapad manopapad changed the title [BUG] Module numa can not detect resources [BUG] Default device memory allocation is too aggressive Jan 2, 2025