
[BUG] Default device memory allocation is too aggressive #1160

Open
JerryHOTS opened this issue Dec 4, 2024 · 4 comments

JerryHOTS commented Dec 4, 2024

Software versions

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
[0 - 7f06e35b3600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f06e35b3600]    0.000000 {4}{openmp}: numa support not found (or not working)
[0 - 7f06e35b3600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

For reference, here is the output of nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.72                 Driver Version: 566.14         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8              1W /  120W |    7825MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3830      C   /python3.12                                 N/A      |
+-----------------------------------------------------------------------------------------+

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

No error

Observed behavior

When I run legate-issue, or run import cupynumeric as np in Python, it raises the following error:

[0 - 7f06e35b3600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f06e35b3600]    0.000000 {4}{openmp}: numa support not found (or not working)
[0 - 7f06e35b3600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

Example code or instructions

legate-issue
import cupynumeric as np

Stack traceback or browser console output

No response

@manopapad
Contributor

Multiple issues here:

  • Legate is attempting by default to reserve a large proportion of the device memory, and that's failing.
    • @JerryHOTS could you please try running with LEGATE_SHOW_CONFIG=1? The run will still fail, but we'll be able to see how much device memory Legate is attempting to reserve.
    • @JerryHOTS could you please try running with LEGATE_CONFIG="--fbmem 1000"? This will instruct Legate to try and reserve less memory.
    • @muraj looks like memory allocation is failing with CUDA_ERROR_OUT_OF_MEMORY, but in cuIpcGetMemHandle. Is it possible that there is, in fact, enough memory, but we're hitting a different system limit related to CUDA-IPC?
  • Realm is complaining that we tried to instantiate NUMA-aligned memory, but that's not possible on the current system. This is not actually causing a failure in this case, but it is polluting the error message.
    • @eddy16112 is working on a Realm fix that would allow us to detect this issue before we put in our request for NUMA-aligned memory.
    • @JerryHOTS you can possibly make this go away by adding --omps 0 to LEGATE_CONFIG.
  • legate-issue is not working. This is because legate-issue itself tries to import cupynumeric, so if initialization is what's failing, legate-issue will hit the same error.
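The diagnostic runs suggested above can be written out as shell commands. This is a sketch: the flag values are the ones from this thread, and the final (commented-out) import assumes cupynumeric is installed on the machine being debugged.

```shell
# Ask Legate to print the configuration it computed at startup.
export LEGATE_SHOW_CONFIG=1
# Cap the device-memory ("framebuffer") reservation at 1000 MiB, and
# disable OpenMP processors to silence the NUMA warning.
export LEGATE_CONFIG="--fbmem 1000 --omps 0"
echo "$LEGATE_CONFIG"
# python -c "import cupynumeric"   # would now initialize with the smaller pool
```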

@JerryHOTS JerryHOTS closed this as not planned Dec 4, 2024
@JerryHOTS JerryHOTS reopened this Dec 4, 2024
@JerryHOTS
Author

JerryHOTS commented Dec 4, 2024

Thank you very much for your prompt response!

Running with LEGATE_SHOW_CONFIG=1 gives the following (the run still fails partway through):

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
Legate hardware configuration: --cpus=1 --gpus=1 --omps=1 --ompthreads=6 --utility=2 --sysmem=12677 --numamem=0 --fbmem=7778 --zcmem=128 --regmem=0
[0 - 7fc0c6d24600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7fc0c6d24600]    0.000000 {4}{openmp}: numa support not found (or not working)
[0 - 7fc0c6d24600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

Running with LEGATE_CONFIG="--fbmem 1000" gives: (successful run)

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
[0 - 7f67147be600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f67147be600]    0.000000 {4}{openmp}: numa support not found (or not working)
cuPynumeric :  24.11.02
Numpy       :  1.26.4
Scipy       :  1.14.1
Numba       :  (failed to detect)
CTK package :  cuda-version-12.6-h7480c83_3 (conda-forge)
GPU driver  :  566.14
GPU devices :  
  GPU 0: NVIDIA GeForce RTX 4070 Laptop GPU

Setting --omps 0 doesn't make the error go away, with or without "--fbmem 1000":

Python      :  3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0]
Platform    :  Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Legion      :  24.11.1 (commit: ac6aae07cb18fa9de978b073766dd9e3def29dbb)
Legate      :  24.11.1
Legate hardware configuration: --cpus=1 --gpus=1 --omps=0 --ompthreads=0 --utility=2 --sysmem=12677 --numamem=0 --fbmem=7778 --zcmem=128 --regmem=0
[0 - 7fb3d0369600]    0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7fb3d0369600]    0.000000 {5}{gpu}: /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/cuda/cuda_module.cc(3985):CUDA_DRIVER_FNPTR(cuIpcGetMemHandle)(&alloc.ipc_handle, alloc.dev_ptr) = 2(CUDA_ERROR_OUT_OF_MEMORY): out of memory

@manopapad
Contributor

OK, so it looks like indeed Legate's default device memory reservation was too aggressive. It tried reserving 7778MiB, based on the total memory size of 8188MiB. But based on your nvidia-smi output, it looks like only 7825MiB is actually available, i.e. quite close to what Legate requested (although it's not clear where the rest is going).

It's still curious that the error occurs at cuIpcGetMemHandle rather than cuMemAlloc.

There is an ongoing discussion about shifting our device memory allocation logic to happen dynamically through CUDA memory allocation calls, rather than allocating a pool in the beginning and allocating out of that. But there are multiple issues to work through, so it won't be available in the immediate term.

For now, I suggest you explicitly tell Legate how much memory to reserve, e.g. LEGATE_CONFIG="--fbmem 7000", or whatever the maximum number is that works.
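One way to apply this advice is to derive --fbmem from the memory that is actually free rather than from the device total. The sketch below is hypothetical (safe_fbmem and the 512 MiB headroom are illustrative choices, not part of Legate); only the LEGATE_CONFIG environment variable comes from this thread, and it must be set before cupynumeric is imported.

```python
import os

def safe_fbmem(free_mib: int, headroom_mib: int = 512, floor_mib: int = 256) -> int:
    """Reserve somewhat less than the currently free device memory,
    never going below a small floor."""
    return max(floor_mib, free_mib - headroom_mib)

# With 7825 MiB available (the figure read from nvidia-smi above),
# request 7313 MiB instead of the default 7778 MiB.
fbmem = safe_fbmem(7825)
os.environ["LEGATE_CONFIG"] = f"--fbmem {fbmem}"
# import cupynumeric  # must happen *after* LEGATE_CONFIG is set
```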

@JerryHOTS
Author

Thank you again for your help! When it's convenient, I would appreciate it if you could let me know when the fix becomes available.

@manopapad manopapad changed the title [BUG] Module numa can not detect resources [BUG] Default device memory allocation is too aggressive Jan 2, 2025