pytorch support #4
Issue seems to specifically be
Before
After
I've also rebuilt PyTorch from source with CUDA 12.4.1 and cuDNN 8.9.7.29-1 and hit the same issue.
This seems to work on JAX, so there's something odd going on with PyTorch.
I hacked around this by modifying CRIU:

diff --git a/criu/files-ext.c b/criu/files-ext.c
index 95ec8e37c..2a150c546 100644
--- a/criu/files-ext.c
+++ b/criu/files-ext.c
@@ -90,7 +90,9 @@ int dump_unsupp_fd(struct fd_parms *p, int lfd, char *more, char *info, FdinfoEn
 	ret = do_dump_gen_file(p, lfd, &ext_dump_ops, e);
 	if (ret == 0)
 		return 0;
-	if (ret == -ENOTSUP)
+	if (ret == -ENOTSUP) {
 		pr_err("Can't dump file %d of that type [%o] (%s %s)\n", p->fd, p->stat.st_mode, more, info);
+		return 0;
+	}
 	return -1;
 }
diff --git a/criu/files.c b/criu/files.c
index 3b653e24b..2ea8ac3ef 100644
--- a/criu/files.c
+++ b/criu/files.c
@@ -847,7 +847,7 @@ int collect_fd(int pid, FdinfoEntry *e, struct rst_info *rst_info, bool fake)
 	fdesc = find_file_desc(e);
 	if (fdesc == NULL) {
 		pr_err("No file for fd %d id %#x\n", e->fd, e->id);
-		return -1;
+		return 0;
 	}
 
 	if (!collect_fd_to(pid, e, rst_info, fdesc, fake, false))
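In effect, the patch downgrades unknown or unsupported fds from hard dump errors to skipped entries so the dump can proceed. A rough sketch of the checkpoint flow such a patched build would be driven from, assuming a rebuilt criu binary on PATH, root privileges, and the cuda-checkpoint utility's --toggle/--pid interface (the PID and image directory below are placeholders):

```python
import os
import subprocess

PID = 12345                 # placeholder: PID of the running PyTorch process
IMAGES_DIR = "/tmp/ckpt"    # placeholder: where criu writes its image files

os.makedirs(IMAGES_DIR, exist_ok=True)

# 1. Move the process's CUDA state off the GPU so only CPU state remains.
subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(PID)], check=True)

# 2. Dump the now GPU-free process tree with the (patched) criu build.
subprocess.run(
    ["criu", "dump", "--tree", str(PID), "--images-dir", IMAGES_DIR, "--shell-job"],
    check=True,
)

# 3. Later: restore the process, then toggle its CUDA state back onto the GPU.
subprocess.run(
    ["criu", "restore", "--images-dir", IMAGES_DIR, "--restore-detached", "--shell-job"],
    check=True,
)
subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(PID)], check=True)
```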
This is a known issue and we're working on fixing it!
This feature is truly helpful! Could you please share if there's a rough timeline or estimated date for this feature to be implemented?
@sgurfinkel any update on this?
@jesus-ramos is there a rough timeline on when PyTorch support will land?
This works fine for fds tied to CUDA devices, but it struggles with PyTorch programs that use pinned memory, which is commonly used to speed up data transfers. It's still some way from being fully practical...
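For reference, a minimal sketch of the pinned-memory pattern that comment refers to; the tensor shapes and DataLoader setup here are illustrative, not taken from any report above:

```python
import torch

# Pinned (page-locked) host memory is registered with the CUDA driver, so it
# adds host-side driver state beyond ordinary device allocations, which is
# presumably the extra state a checkpoint has to account for.
host = torch.randn(1024, 1024).pin_memory()   # pinned host tensor
dev = host.to("cuda", non_blocking=True)      # async H2D copy enabled by pinning

# The more common form: a DataLoader that returns pinned batches.
dataset = torch.utils.data.TensorDataset(torch.randn(256, 16))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True)
```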
Is this still on the roadmap?
Yes, it is!
@sgurfinkel
Hey @sgurfinkel is this fixed on
No, not quite yet!
Thanks for the update anyway :) A Google engineer indicated to us that it may be fixed in the latest driver. @sgurfinkel, would NVIDIA be up for doing a community-focused video meeting for this project? I'm thinking of something similar to what the AWS Firecracker team did when planning NVIDIA GPU support. We (at modal.com) are very excited about this technology, but it's hard to adopt with little visibility into the system or roadmap :)
I'll look into the video meeting, but I do otherwise have an update!
@sgurfinkel when you say "single-process", does that mean things like NCCL won't be supported?
Yes, that's right. CUDA IPC support won't be present in the early 2025 release.
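To make the scope concrete, the pattern below is the kind of multi-process NCCL setup that falls outside "single-process" support, since communication between ranks on a node goes through CUDA IPC; the launcher command and world size are illustrative:

```python
# Launch with, e.g.: torchrun --nproc_per_node=2 nccl_example.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")       # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
torch.cuda.set_device(local_rank)

t = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(t)                            # crosses process boundaries via NCCL / CUDA IPC
print(f"rank {dist.get_rank()}: sum = {t.item()}")

dist.destroy_process_group()
```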
Hi! Do we now have a more concrete release plan for this? Looking forward to it. Thank you!
Also looking forward to this capability. Thank you!
I just tried this out on PyTorch and it seems to work for the CUDA state, but I'm hitting issues with criu when saving the parent process. It seems like the issue is with saving the nvidia driver in criu. Are there any plans to expand support for this with criu for common ML frameworks?

There's no longer an active CUDA process after toggling, but it still seems to have access to a nvidia device. The file that failed to save seems to be nvidia.

Test script
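The collapsed test script is not reproduced in this thread. Purely as an illustration (not the original script), a minimal program that exercises this path by allocating CUDA state through PyTorch and then idling, so it can be toggled with cuda-checkpoint and dumped with criu from another shell, might look like:

```python
import os
import time
import torch

# Allocate some GPU state via PyTorch.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x
torch.cuda.synchronize()

print(f"pid={os.getpid()}, "
      f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated on the GPU")

# Idle so the process can be checkpointed externally
# (e.g. cuda-checkpoint --toggle followed by criu dump).
while True:
    time.sleep(1)
```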