CRIU dump failed since it failed to dump external device file #7

Closed
zobinHuang opened this issue May 11, 2024 · 2 comments

zobinHuang commented May 11, 2024

Hi NVIDIA folks, thanks for this wonderful transparent C/R tool, really helpful!

I ran into an issue when running criu dump on a PyTorch process inside a container, after using this tool to suspend the CUDA state.

The commands I used are:

$ cuda-checkpoint --toggle --pid $pid
$ criu dump --tree $pid --images-dir $ckpt_dir --shell-job --display-stats

The error message from the criu dump command looks like:

Warn  (criu/kerndat.c:1593): CRIU was built without libnftables support
Error (criu/files-ext.c:94): Can't dump file 17 of that type [20666] (chr 195:255)
Error (criu/cr-dump.c:1674): Dump files (pid: 406) failed with -1
Error (criu/cr-dump.c:2098): Dumping FAILED.

The software environment in which I hit this issue is:

It looks like CRIU failed to dump an external file, and that file seems to be an NVIDIA device file (character device major number 195):

$ grep 195 /proc/devices  
195 nvidia
195 nvidia-modeset
195 nvidiactl
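For anyone hitting the same error, the lookup above can be scripted. This is only a rough sketch: `criu_err_dev` and `dev_names_for_major` are hypothetical helper names made up for illustration, not part of CRIU or cuda-checkpoint. The first pulls the `major:minor` pair out of a "Can't dump file" error line, and the second lists the driver names registered for that major number in a `/proc/devices`-style file.

```shell
#!/bin/sh
# Hypothetical helper: extract "major:minor" from a CRIU error line
# such as "... Can't dump file 17 of that type [20666] (chr 195:255)".
criu_err_dev() {
    printf '%s\n' "$1" | sed -n 's/.*(chr \([0-9][0-9]*:[0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: print the driver names registered for a major number.
# The second argument is the devices file to read; it defaults to /proc/devices.
# Note: /proc/devices lists character and block devices, so a block device
# with the same major number would also be printed.
dev_names_for_major() {
    awk -v major="$1" '$1 == major { print $2 }' "${2:-/proc/devices}"
}

line="Error (criu/files-ext.c:94): Can't dump file 17 of that type [20666] (chr 195:255)"
dev=$(criu_err_dev "$line")
major=${dev%%:*}   # strip everything from the first ':' onward
echo "device $dev has major $major"

# On the affected machine, the names could then be listed with:
#   dev_names_for_major "$major"
```

On a typical NVIDIA driver install, minor 255 under major 195 is usually `/dev/nvidiactl`, which can be checked with `ls -l /dev/nvidiactl`.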

Does anyone have any comments on this issue? Thanks for any suggestions :-)

rst0git commented May 11, 2024

I encountered an issue when criu dump a pytorch process within a container

@zobinHuang There is another GitHub issue about checkpointing PyTorch with more information about this problem: #4.

@zobinHuang

I encountered an issue when criu dump a pytorch process within a container

@zobinHuang There is another GitHub issue about checkpointing PyTorch with more information about this problem: #4.

Thanks, I'll close this issue. Looking forward to the fix!
