pytorch support #4

Open
d4l3k opened this issue Apr 30, 2024 · 20 comments

@d4l3k

d4l3k commented Apr 30, 2024

I just tried this out with PyTorch. It seems to work for the CUDA state, but I'm hitting issues with CRIU when dumping the parent process: CRIU fails on a file descriptor that belongs to the NVIDIA driver.

Are there any plans to expand support for this with criu for common ML frameworks?

~/D/torch-criu (main)> third-party/cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --toggle --pid 125704
~/D/torch-criu (main)> sudo criu dump --shell-job --images-dir demo --tree 125704
Error (criu/files-ext.c:94): Can't dump file 19 of that type [20666] (chr 195:255)
Error (criu/cr-dump.c:1669): Dump files (pid: 125704) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.

After toggling, there is no longer an active CUDA process, but the process still seems to hold an NVIDIA device open.

~/D/torch-criu (main) [1]> nvidia-smi
Mon Apr 29 17:11:21 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   32C    P8             16W /  350W |       5MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0 Off |                  N/A |
|  0%   34C    P8             21W /  350W |       5MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The file that failed to dump (chr 195:255) is an NVIDIA character device; major number 195 belongs to the nvidia driver:

~/D/torch-criu (main)> grep 195 /proc/devices 
195 nvidia
195 nvidia-modeset
195 nvidiactl
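
The lookup above can also be scripted. This is a minimal sketch, not part of the original report: the helper names `parse_proc_devices` and `lookup` are hypothetical, and it assumes the `chr MAJOR:MINOR` text has been copied by hand out of the CRIU error message. The sample data is the `grep` output from this comment.

```python
# Sample shaped like /proc/devices; the 195 entries are the ones shown above.
SAMPLE_PROC_DEVICES = """Character devices:
  1 mem
195 nvidia
195 nvidia-modeset
195 nvidiactl

Block devices:
  7 loop
"""


def parse_proc_devices(text):
    """Map (kind, major) -> list of driver names from /proc/devices content."""
    table = {}
    kind = None
    for line in text.splitlines():
        line = line.strip()
        if line == "Character devices:":
            kind = "chr"
        elif line == "Block devices:":
            kind = "blk"
        elif line and kind:
            major, name = line.split(None, 1)
            table.setdefault((kind, int(major)), []).append(name)
    return table


def lookup(spec, table):
    """Resolve a 'chr 195:255' style string to (driver names, minor number)."""
    kind, numbers = spec.split()
    major, minor = (int(n) for n in numbers.split(":"))
    return table.get((kind, major), []), minor


if __name__ == "__main__":
    names, minor = lookup("chr 195:255", parse_proc_devices(SAMPLE_PROC_DEVICES))
    print("drivers:", names, "minor:", minor)
```

On a real machine, pass the contents of `/proc/devices` instead of the sample string. The major number only identifies the driver; the minor number (255 here) is what distinguishes the individual device node.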

Test script

import time
import os
import torch

device = torch.device("cuda")

a = torch.tensor(10, device=device)

print(os.getpid())
time.sleep(1000)

~/D/torch-criu (main)> criu -V
Version: 3.18
GitID: v3.18
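
For anyone reproducing this, the toggle-then-dump sequence from the top of this comment could be wrapped roughly as follows. This is only a sketch: the function names are hypothetical, it assumes `cuda-checkpoint` and `criu` are on `PATH` (the report uses a path under `third-party/`), and it uses only the flags that appear in the commands above.

```python
import shlex
import subprocess


def checkpoint_commands(pid, cuda_checkpoint="cuda-checkpoint", images_dir="demo"):
    """Build the two commands from the report: toggle CUDA state, then dump with CRIU."""
    pid = str(pid)
    return [
        [cuda_checkpoint, "--toggle", "--pid", pid],
        ["sudo", "criu", "dump", "--shell-job", "--images-dir", images_dir, "--tree", pid],
    ]


def run_checkpoint(pid, **options):
    """Run both steps in order, stopping at the first one that fails."""
    for command in checkpoint_commands(pid, **options):
        print("+", shlex.join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_checkpoint(pid) to execute them.
    for command in checkpoint_commands(125704):
        print(shlex.join(command))
```

With `check=True`, a failing `criu dump` such as the one shown above raises `subprocess.CalledProcessError` instead of being silently ignored.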
@d4l3k
Author

d4l3k commented Apr 30, 2024

The issue seems to be specifically nvidiactl: the toggle doesn't close all of the NVIDIA file descriptors. In the listings below, fds 19-27 are still open afterwards.

Before

~/D/torch-criu (main)> lsof -p 74739 | rg nvidia
pt_main_t 74739 rice mem       CHR              195,0                940 /dev/nvidia0
pt_main_t 74739 rice mem       CHR            195,255                938 /dev/nvidiactl
pt_main_t 74739 rice mem       CHR              237,0                894 /dev/nvidia-uvm
pt_main_t 74739 rice mem       REG              254,1   2078360  2973095 /usr/lib/libnvidia-ml.so.550.76
pt_main_t 74739 rice mem       CHR              195,1                976 /dev/nvidia1
pt_main_t 74739 rice   8u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice   9u      CHR              237,0       0t0      894 /dev/nvidia-uvm
pt_main_t 74739 rice  10u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  11u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  12u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  13u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  14u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  15u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  16u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  17u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  19u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice  20u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  21u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  22u      CHR              237,0       0t0      894 /dev/nvidia-uvm
pt_main_t 74739 rice  23r      CHR              240,2       0t0      981 /dev/nvidia-caps/nvidia-cap2
pt_main_t 74739 rice  24u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  25u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  26u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  27u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  29u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  30u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  31u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  32u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  33u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  34u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  35u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  36u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  39u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  41u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  43u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  45u      CHR              195,0       0t0      940 /dev/nvidia0

After

~/D/torch-criu (main)> lsof -p 74739 | rg nvidia
pt_main_t 74739 rice mem       REG              254,1   2078360  2973095 /usr/lib/libnvidia-ml.so.550.76
pt_main_t 74739 rice  19u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice  20u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  21u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  22u      CHR              237,0       0t0      894 /dev/nvidia-uvm
pt_main_t 74739 rice  23r      CHR              240,2       0t0      981 /dev/nvidia-caps/nvidia-cap2
pt_main_t 74739 rice  24u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  25u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  26u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  27u      CHR              195,1       0t0      976 /dev/nvidia1
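
The before/after comparison can be automated. The sketch below is not from the original comment: `open_fds` and `compare` are hypothetical helpers, and they assume the default `lsof -p PID` column layout shown above, with the FD in the fourth column and the path last. Paths containing spaces are not handled. The sample lines are an excerpt of the two listings.

```python
import re

BEFORE = """\
pt_main_t 74739 rice mem       CHR            195,255                938 /dev/nvidiactl
pt_main_t 74739 rice   8u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice  19u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice  45u      CHR              195,0       0t0      940 /dev/nvidia0
"""

AFTER = """\
pt_main_t 74739 rice  19u      CHR            195,255       0t0      938 /dev/nvidiactl
"""


def open_fds(lsof_text):
    """Return {fd_number: path} for the numbered descriptors in lsof output."""
    fds = {}
    for line in lsof_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Matches "19u" or "23r"; skips "mem", "cwd", "txt" and the header row.
        match = re.fullmatch(r"(\d+)[a-zA-Z]*", fields[3])
        if match:
            fds[int(match.group(1))] = fields[-1]
    return fds


def compare(before_text, after_text):
    """Return (fds closed by the toggle, fds still open afterwards)."""
    before, after = open_fds(before_text), open_fds(after_text)
    closed = sorted(set(before) - set(after))
    surviving = {fd: after[fd] for fd in sorted(after)}
    return closed, surviving


if __name__ == "__main__":
    closed, surviving = compare(BEFORE, AFTER)
    print("closed by toggle:", closed)
    print("still open:", surviving)
```

Anything left in the second result is a descriptor CRIU will still encounter during the dump.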

@d4l3k
Author

d4l3k commented Apr 30, 2024

I've also rebuilt PyTorch from source with CUDA 12.4.1 and cuDNN 8.9.7.29-1, and hit the same issue.

@d4l3k
Author

d4l3k commented Apr 30, 2024

This seems to work with JAX, so there's something odd going on with PyTorch.

@d4l3k
Author

d4l3k commented Apr 30, 2024

I hacked around this by modifying CRIU to skip those FDs instead of failing. It is able to checkpoint now, but I'm not sure how safe this is.

diff --git a/criu/files-ext.c b/criu/files-ext.c
index 95ec8e37c..2a150c546 100644
--- a/criu/files-ext.c
+++ b/criu/files-ext.c
@@ -90,7 +90,9 @@ int dump_unsupp_fd(struct fd_parms *p, int lfd, char *more, char *info, FdinfoEn
     ret = do_dump_gen_file(p, lfd, &ext_dump_ops, e);
     if (ret == 0)
         return 0;
-    if (ret == -ENOTSUP)
+    if (ret == -ENOTSUP) {
         pr_err("Can't dump file %d of that type [%o] (%s %s)\n", p->fd, p->stat.st_mode, more, info);
+        return 0;
+    }
     return -1;
 }
diff --git a/criu/files.c b/criu/files.c
index 3b653e24b..2ea8ac3ef 100644
--- a/criu/files.c
+++ b/criu/files.c
@@ -847,7 +847,7 @@ int collect_fd(int pid, FdinfoEntry *e, struct rst_info *rst_info, bool fake)
     fdesc = find_file_desc(e);
     if (fdesc == NULL) {
         pr_err("No file for fd %d id %#x\n", e->fd, e->id);
-        return -1;
+        return 0;
     }
     if (!collect_fd_to(pid, e, rst_info, fdesc, fake, false))

@sgurfinkel
Collaborator

This is a known issue and we're working on fixing it!

@ZingLix

ZingLix commented May 10, 2024

> This is a known issue and we're working on fixing it!

This feature is truly helpful! Could you please share if there's a rough timeline or estimated date for this feature to be implemented?

@ethxnp

ethxnp commented Aug 14, 2024

@sgurfinkel any update on this?

@thundergolfer

thundergolfer commented Sep 7, 2024

@jesus-ramos is there a rough timeline on when PyTorch support will land?

@913887524gsd

913887524gsd commented Sep 18, 2024

> I hacked around this by modifying criu to discard those FDs without errors -- it is able to checkpoint now but I'm not sure how safe it is
>
> diff --git a/criu/files-ext.c b/criu/files-ext.c
> index 95ec8e37c..2a150c546 100644
> --- a/criu/files-ext.c
> +++ b/criu/files-ext.c
> @@ -90,7 +90,9 @@ int dump_unsupp_fd(struct fd_parms *p, int lfd, char *more, char *info, FdinfoEn
>      ret = do_dump_gen_file(p, lfd, &ext_dump_ops, e);
>      if (ret == 0)
>          return 0;
> -    if (ret == -ENOTSUP)
> +    if (ret == -ENOTSUP) {
>          pr_err("Can't dump file %d of that type [%o] (%s %s)\n", p->fd, p->stat.st_mode, more, info);
> +        return 0;
> +    }
>      return -1;
>  }
> diff --git a/criu/files.c b/criu/files.c
> index 3b653e24b..2ea8ac3ef 100644
> --- a/criu/files.c
> +++ b/criu/files.c
> @@ -847,7 +847,7 @@ int collect_fd(int pid, FdinfoEntry *e, struct rst_info *rst_info, bool fake)
>      fdesc = find_file_desc(e);
>      if (fdesc == NULL) {
>          pr_err("No file for fd %d id %#x\n", e->fd, e->id);
> -        return -1;
> +        return 0;
>      }
>      if (!collect_fd_to(pid, e, rst_info, fdesc, fake, false))

This works fine for FDs tied to CUDA devices, but it struggles with PyTorch programs that use pinned memory, which is commonly used to speed up host-to-device transfers. It's still some way from being fully practical...

@gflarity

Is this still on the roadmap?

@sgurfinkel
Collaborator

> Is this still on the roadmap?

Yes, it is!

@lianghao208

@sgurfinkel
Hi, do you know when the next version will be released? And will the code be open source in the next release?

@thundergolfer

Hey @sgurfinkel is this fixed on 565.57.01?

@sgurfinkel
Collaborator

> Hey @sgurfinkel is this fixed on 565.57.01?

No, not quite yet!

@thundergolfer

Thanks for the update anyway :) A Google engineer indicated to us that it may be fixed on the latest driver.

@sgurfinkel would NVIDIA be up for doing a community-focused video meeting for this project? I'm thinking of something similar to what the AWS Firecracker team did for planning NVIDIA GPU support.

We (at modal.com) are very excited about this technology but it's hard to adopt it with little visibility into the system or roadmap :)

@sgurfinkel
Collaborator

> Thanks for the update anyway :) A Google engineer indicated to us that it may be fixed on the latest driver.
>
> @sgurfinkel would NVIDIA be up for doing a community-focused video meeting for this project? I'm thinking of something similar to what the AWS Firecracker team did for planning NVIDIA GPU support.
>
> We (at modal.com) are very excited about this technology but it's hard to adopt it with little visibility into the system or roadmap :)

I'll look into the video meeting, but I do otherwise have an update!
Single-process PyTorch support is planned to be released in early 2025!

@d4l3k
Author

d4l3k commented Nov 11, 2024

@sgurfinkel when you say "single-process" does that mean things like NCCL won't be supported?

@sgurfinkel
Collaborator

> @sgurfinkel when you say "single-process" does that mean things like NCCL won't be supported?

Yes, that's right. CUDA IPC support won't be present in the early 2025 release.

@laochonlam

Hi! Do we now have a more concrete release plan for this? Looking forward to it. Thank you!

@zhuangqh

Also looking forward to this ability. Thank you!
