pytorch support #4
Issue seems to specifically be
Before
After
I've also rebuilt PyTorch from source with CUDA 12.4.1 and cuDNN 8.9.7.29-1 and hit the same issue.
This seems to work on JAX, so there's something odd going on with PyTorch.
I hacked around this by modifying CRIU:

diff --git a/criu/files-ext.c b/criu/files-ext.c
index 95ec8e37c..2a150c546 100644
--- a/criu/files-ext.c
+++ b/criu/files-ext.c
@@ -90,7 +90,9 @@ int dump_unsupp_fd(struct fd_parms *p, int lfd, char *more, char *info, FdinfoEn
 	ret = do_dump_gen_file(p, lfd, &ext_dump_ops, e);
 	if (ret == 0)
 		return 0;
-	if (ret == -ENOTSUP)
+	if (ret == -ENOTSUP) {
 		pr_err("Can't dump file %d of that type [%o] (%s %s)\n", p->fd, p->stat.st_mode, more, info);
+		return 0;
+	}
 	return -1;
 }
diff --git a/criu/files.c b/criu/files.c
index 3b653e24b..2ea8ac3ef 100644
--- a/criu/files.c
+++ b/criu/files.c
@@ -847,7 +847,7 @@ int collect_fd(int pid, FdinfoEntry *e, struct rst_info *rst_info, bool fake)
 	fdesc = find_file_desc(e);
 	if (fdesc == NULL) {
 		pr_err("No file for fd %d id %#x\n", e->fd, e->id);
-		return -1;
+		return 0;
 	}
 
 	if (!collect_fd_to(pid, e, rst_info, fdesc, fake, false))
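In effect, the patch downgrades unknown or unsupported fds from hard dump errors to skipped entries so the dump can proceed. A rough sketch of the checkpoint flow such a patched build would be driven from, assuming a rebuilt criu binary on PATH, root privileges, and the cuda-checkpoint utility's --toggle/--pid interface (the PID and image directory below are placeholders):

```python
import os
import subprocess

PID = 12345                 # placeholder: PID of the running PyTorch process
IMAGES_DIR = "/tmp/ckpt"    # placeholder: where criu writes its image files

os.makedirs(IMAGES_DIR, exist_ok=True)

# 1. Move the process's CUDA state off the GPU so only CPU state remains.
subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(PID)], check=True)

# 2. Dump the now GPU-free process tree with the (patched) criu build.
subprocess.run(
    ["criu", "dump", "--tree", str(PID), "--images-dir", IMAGES_DIR, "--shell-job"],
    check=True,
)

# 3. Later: restore the process, then toggle its CUDA state back onto the GPU.
subprocess.run(
    ["criu", "restore", "--images-dir", IMAGES_DIR, "--restore-detached", "--shell-job"],
    check=True,
)
subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(PID)], check=True)
```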
This is a known issue and we're working on fixing it!
This feature is truly helpful! Could you please share if there's a rough timeline or estimated date for this feature to be implemented?
@sgurfinkel any update on this?
@jesus-ramos is there a rough timeline on when PyTorch support will land?
This works fine for fds tied to CUDA devices, but it struggles with PyTorch programs that use pinned memory, which is commonly used to speed up data transfers. It's still some way from being fully practical...
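For reference, a minimal sketch of the pinned-memory pattern that comment refers to; the tensor shapes and DataLoader setup here are illustrative, not taken from any report above:

```python
import torch

# Pinned (page-locked) host memory is registered with the CUDA driver, so it
# adds host-side driver state beyond ordinary device allocations, which is
# presumably the extra state a checkpoint has to account for.
host = torch.randn(1024, 1024).pin_memory()   # pinned host tensor
dev = host.to("cuda", non_blocking=True)      # async H2D copy enabled by pinning

# The more common form: a DataLoader that returns pinned batches.
dataset = torch.utils.data.TensorDataset(torch.randn(256, 16))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True)
```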
Is this still on the roadmap?
Yes, it is!
@sgurfinkel
Hey @sgurfinkel is this fixed on
No, not quite yet!
Thanks for the update anyway :) A Google engineer indicated to us that it may be fixed in the latest driver. @sgurfinkel, would NVIDIA be up for doing a community-focused video meeting for this project? I'm thinking of something similar to what the AWS Firecracker team did when planning NVIDIA GPU support. We (at modal.com) are very excited about this technology, but it's hard to adopt with little visibility into the system or roadmap :)
I'll look into the video meeting, but I do otherwise have an update!
@sgurfinkel when you say "single-process", does that mean things like NCCL won't be supported?
Yes, that's right. CUDA IPC support won't be present in the early 2025 release.
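To make the scope concrete, the pattern below is the kind of multi-process NCCL setup that falls outside "single-process" support, since communication between ranks on a node goes through CUDA IPC; the launcher command and world size are illustrative:

```python
# Launch with, e.g.: torchrun --nproc_per_node=2 nccl_example.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")       # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
torch.cuda.set_device(local_rank)

t = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(t)                            # crosses process boundaries via NCCL / CUDA IPC
print(f"rank {dist.get_rank()}: sum = {t.item()}")

dist.destroy_process_group()
```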
Hi! Do we now have a more concrete release plan for this? Looking forward to it. Thank you!
Also looking forward to this capability. Thank you!
I just tried this out on PyTorch and it seems to work for the CUDA state, but I'm hitting issues with criu when saving the parent process. It seems like the issue is with saving the nvidia driver in criu. Are there any plans to expand support for this with criu for common ML frameworks?

There's no longer an active CUDA process after toggling, but it still seems to have access to a nvidia device. The file that failed to save seems to be nvidia.

Test script
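The collapsed test script is not reproduced in this thread. Purely as an illustration (not the original script), a minimal program that exercises this path by allocating CUDA state through PyTorch and then idling, so it can be toggled with cuda-checkpoint and dumped with criu from another shell, might look like:

```python
import os
import time
import torch

# Allocate some GPU state via PyTorch.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x
torch.cuda.synchronize()

print(f"pid={os.getpid()}, "
      f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated on the GPU")

# Idle so the process can be checkpointed externally
# (e.g. cuda-checkpoint --toggle followed by criu dump).
while True:
    time.sleep(1)
```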