Restore shows the GPU process has been restored successfully but the process does not exist #2525

Closed
lianghao208 opened this issue Nov 20, 2024 · 2 comments

@lianghao208

Description

Steps to reproduce the issue:

  1. Restore a GPU process:
./criu restore --shell-job --restore-detached --images-dir demo -L /root/criu -v4
  2. The log shows that the process has been restored successfully:
(00.012454) pie: 1008543: Restoring scheduler params 0.0.0
(00.012460) pie: 1008543: rseq: nothing to restore
(00.012537) pie: 1008543: 1008543: Restored
(00.012536) pie: 1008573: Restoring scheduler params 0.0.0
(00.012546) pie: 1008574: Restoring scheduler params 0.0.0
(00.012577) pie: 1008573: rseq: nothing to restore
(00.012603) pie: 1008574: rseq: nothing to restore
(00.012608) pie: 1008573: 1008573: Restored
(00.012617) pie: 1008574: 1008574: Restored
(00.012641) Running post-restore scripts
(00.012650) net: Unlock network
(00.012793) pie: 1008574: seccomp: mode 0 on tid 1008574
(00.012797) pie: 1008543: seccomp: mode 0 on tid 1008543
(00.012797) pie: 1008573: seccomp: mode 0 on tid 1008573
(00.020665) 1008574 was trapped
(00.020688) 1008543 was trapped
(00.020695) 1008574 was trapped
(00.020698) 1008574 (native) is going to execute the syscall 15, required is 15
(00.020706) 1008574 was stopped
(00.020708) 1008543 was trapped
(00.020710) 1008543 (native) is going to execute the syscall 15, required is 15
(00.020724) 1008543 was stopped
(00.020742) 1008573 was trapped
(00.020759) 1008573 was trapped
(00.020783) 1008573 (native) is going to execute the syscall 15, required is 15
(00.020792) 1008573 was stopped
(00.020826) 1008543 was trapped
(00.020830) 1008543 (native) is going to execute the syscall 11, required is 11
(00.020853) 1008543 was stopped
(00.020880) Run late stage hook from criu master for external devices
(00.020882) plugin: `gpu_migration_plugin' hook 9 -> 0x7f209489e2ae
(00.020884) restore late stage hook for external plugin failed
(00.020886) Running pre-resume scripts
(00.020894) Restore finished successfully. Tasks resumed.
(00.020895) Writing stats
(00.021007) Running post-resume scripts
  3. But the process does not exist:
ps aux|grep 1008543|grep -v grep
  4. To dump and restore the GPU process, I modified the CRIU code to ignore /dev/nvidia* in /proc/{pid}/fd and /proc/{pid}/map_files.

I know the GPU process cannot be completely restored, because I didn't handle the GPU device and its VMAs in the CUDA plugin.
But I expected the process to still exist after the restore.
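
The log above prints "Restore finished successfully" shortly after reporting that the late-stage plugin hook failed, so the final status line alone is not a reliable health check. Below is a minimal Python sketch of cross-checking it: scan the restore log for failure lines and probe the restored PID directly. The function names are hypothetical helpers and not part of CRIU, and the sketch assumes a Linux host.

```python
import os
import re

def find_failure_lines(log_text):
    """Return log lines that mention a failure or an error (case-insensitive)."""
    pattern = re.compile(r"\b(fail(ed|ure)?|error)\b", re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]

def pid_exists(pid):
    """Probe a PID with signal 0; no signal is actually delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # process exists but belongs to another user
    return True

# Two lines copied from the restore log in this report.
sample = (
    "(00.020884) restore late stage hook for external plugin failed\n"
    "(00.020894) Restore finished successfully. Tasks resumed."
)

print(find_failure_lines(sample))
print(pid_exists(os.getpid()))
```

Note that a zombie process still passes the signal-0 probe, so this only rules out a PID that is gone entirely, which is the case described here.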

@lianghao208
Author

I just found errors in dmesg after the restore:

[Fri Nov 22 09:51:18 2024] cuda-EvtHandlr[1008574]: segfault at 7f9c9b175000 ip 00007f9ea1d4194c sp 00007f9c9bc159f0 error 4 in libcuda.so.535.161.08[7f9ea1a01000+520000]
[Fri Nov 22 09:51:18 2024] Code: 00 85 c0 0f 85 55 01 00 00 48 8b 83 80 18 00 00 48 85 c0 0f 84 45 01 00 00 48 8b 40 10 48 85 c0 0f 84 38 01 00 00 48 8b 40 10 <8b> 10 89 55 c0 8b 50 04 89 55 c4 8b 50 08 89 55 c8 0f b7 50 0c 66
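
The segfault line can be decoded mechanically. Below is a minimal Python sketch, assuming an x86 kernel (where the page-fault error code uses bit 0 for "page present", bit 1 for "write access" and bit 2 for "user mode") and the line format shown above. `parse_segfault` is a hypothetical helper. The computed offset is relative to the start of the mapping printed in brackets, which is not necessarily offset 0 of the library file.

```python
import re

SEGV_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line):
    """Decode one kernel 'segfault at ...' line; return None if it does not match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)   # the kernel prints the error code in hex
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        "ip_offset_in_mapping": ip - base,
        "page_present": bool(err & 1),   # 0 means the page was not mapped
        "write_access": bool(err & 2),   # 0 means the access was a read
        "user_mode": bool(err & 4),      # 1 means the fault came from user space
    }

line = (
    "[Fri Nov 22 09:51:18 2024] cuda-EvtHandlr[1008574]: segfault at 7f9c9b175000 "
    "ip 00007f9ea1d4194c sp 00007f9c9bc159f0 error 4 "
    "in libcuda.so.535.161.08[7f9ea1a01000+520000]"
)
info = parse_segfault(line)
print(info["comm"], info["pid"], hex(info["ip_offset_in_mapping"]))
```

For the line above, error 4 decodes as a user-mode read of an address with no page mapped, raised by thread 1008574 while executing inside libcuda. That would be consistent with GPU-related memory not being present after the restore, although the log alone does not prove it.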

@rst0git
Member

rst0git commented Nov 22, 2024

> I modify the CRIU code to ignore /dev/nvidia* in /proc/{pid}/fd and /proc/{pid}/map_files.

In many cases, these changes would break the restore functionality. There is not much we can do to help.
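
To see how much state such a patch leaves out, one can list which descriptors and mappings of the process refer to NVIDIA device nodes before dumping. The Python sketch below is only an illustration and does not reflect how CRIU itself walks these directories. `nvidia_references` and its `proc_root` parameter are hypothetical, and reading another process's `map_files` entries may require elevated privileges.

```python
import fnmatch
import os

def nvidia_references(pid, pattern="/dev/nvidia*", proc_root="/proc"):
    """List (subdir, entry, target) for symlinks under fd/ and map_files/
    whose target matches the pattern."""
    found = []
    for subdir in ("fd", "map_files"):
        directory = os.path.join(proc_root, str(pid), subdir)
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # process is gone, or the directory is not readable
        for name in names:
            try:
                target = os.readlink(os.path.join(directory, name))
            except OSError:
                continue  # entry vanished or is not readable
            if fnmatch.fnmatchcase(target, pattern):
                found.append((subdir, name, target))
    return sorted(found)

# Inspect the current process; this is an empty list unless it uses the GPU.
print(nvidia_references(os.getpid()))
```

Every entry this reports is a file descriptor or memory mapping that the restored process will still expect to find, which is why skipping them at dump time tends to end in a crash like the one in dmesg.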

As mentioned in NVIDIA/cuda-checkpoint#4, this problem will be addressed with the next CUDA driver release.

@rst0git closed this as completed Nov 22, 2024