Inquiry about the docker checkpoint function for AMD GPU #2160

Closed
GYDmedwin opened this issue Apr 19, 2023 · 11 comments

@GYDmedwin

Hello, I see that a plugin for AMD GPUs is already available. It works fine when I use it directly on a process that uses the GPU.

But when I try the docker checkpoint function on a container that uses the AMD GPU, it fails.

Therefore, I wonder if the dump function for AMD GPU can only be used outside of docker? Or is there any other way to dump a docker with AMD GPU? Thank you!

@adrianreber
Member

> Therefore, I wonder if the dump function for AMD GPU can only be used outside of docker?

I guess nobody tried it so far. So we don't really know. It would be interesting to see the errors you get.

@rst0git
Member

rst0git commented Apr 19, 2023

@GYDmedwin the plugin for AMD GPUs was released in version 3.17. However, it is currently not enabled by default in CRIU packages because it requires hardware specific dependencies. Thus, you might need to build and install CRIU from source to use this functionality.

> I wonder if the dump function for AMD GPU can only be used outside of docker? Or is there any other way to dump a docker with AMD GPU?

There is an example Docker container with PyTorch that can be used for testing:
https://github.com/checkpoint-restore/criu/blob/criu-dev/scripts/build/Dockerfile.amd-rocm

@GYDmedwin
Author

GYDmedwin commented Apr 20, 2023

@adrianreber Thank you for your reply!

The command I use is: docker checkpoint create test checkpoint1
The following is the error message:
Error response from daemon: Cannot checkpoint container pytorch: runc did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/088a3512b65d3467fc64ae7c66ee3cdd76b4eb7f90ad4d08606444d5faddaedc/criu-dump.log: unknown

And the following is the content of the criu-dump.log file:

(00.000000) Unable to get $HOME directory, local configuration file will not be used.
(00.000090) Version: 3.17 (gitid v3.17-210-gcf21367d6)
(00.000114) Running on gxn1275 Linux 5.16.0 #1 SMP PREEMPT Tue Sep 20 02:10:11 UTC 2022 x86_64
(00.000123) Would overwrite RPC settings with values from /etc/criu/runc.conf
(00.015868) Loaded kdat cache from /run/criu.kdat
(00.015942) Hugetlb size 2 Mb is supported but cannot get dev's number
(00.015970) Hugetlb size 1024 Mb is supported but cannot get dev's number
(00.017074) ========================================
(00.017104) Dumping processes (pid: 1205653 comm: bash)
(00.017107) ========================================
(00.017112) rlimit: RLIMIT_NOFILE unlimited for self
(00.017121) Running pre-dump scripts
(00.017125) RPC
(00.017374) irmap: Searching irmap cache in work dir
(00.017386) No irmap-cache image
(00.017390) irmap: Searching irmap cache in parent
(00.017399) No parent images directory provided
(00.017402) irmap: No irmap cache
(00.017411) cpu: x86_family 6 x86_vendor_id GenuineIntel x86_model_id Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
(00.017415) cpu: fpu: xfeatures_mask 0x2f5 xsave_size 2696 xsave_size_max 2696 xsaves_size 2568
(00.017422) cpu: fpu: x87 floating point registers xstate_offsets 0 / 0 xstate_sizes 160 / 160
(00.017427) cpu: fpu: AVX registers xstate_offsets 576 / 576 xstate_sizes 256 / 256
(00.017430) cpu: fpu: MPX CSR xstate_offsets 1024 / 832 xstate_sizes 64 / 64
(00.017434) cpu: fpu: AVX-512 opmask xstate_offsets 1088 / 896 xstate_sizes 64 / 64
(00.017437) cpu: fpu: AVX-512 Hi256 xstate_offsets 1152 / 960 xstate_sizes 512 / 512
(00.017440) cpu: fpu: AVX-512 ZMM_Hi256 xstate_offsets 1664 / 1472 xstate_sizes 1024 / 1024
(00.017443) cpu: fpu: Protection Keys User registers xstate_offsets 2688 / 2496 xstate_sizes 8 / 8
(00.017446) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
(00.017560) cg-prop: Parsing controller "cpu"
(00.017565) cg-prop: Strategy "replace"
(00.017569) cg-prop: Property "cpu.shares"
(00.017572) cg-prop: Property "cpu.cfs_period_us"
(00.017575) cg-prop: Property "cpu.cfs_quota_us"
(00.017577) cg-prop: Property "cpu.rt_period_us"
(00.017580) cg-prop: Property "cpu.rt_runtime_us"
(00.017583) cg-prop: Parsing controller "memory"
(00.017586) cg-prop: Strategy "replace"
(00.017589) cg-prop: Property "memory.limit_in_bytes"
(00.017592) cg-prop: Property "memory.memsw.limit_in_bytes"
(00.017594) cg-prop: Property "memory.swappiness"
(00.017597) cg-prop: Property "memory.soft_limit_in_bytes"
(00.017600) cg-prop: Property "memory.move_charge_at_immigrate"
(00.017603) cg-prop: Property "memory.oom_control"
(00.017605) cg-prop: Property "memory.use_hierarchy"
(00.017608) cg-prop: Property "memory.kmem.limit_in_bytes"
(00.017611) cg-prop: Property "memory.kmem.tcp.limit_in_bytes"
(00.017614) cg-prop: Parsing controller "cpuset"
(00.017620) cg-prop: Strategy "replace"
(00.017623) cg-prop: Property "cpuset.cpus"
(00.017626) cg-prop: Property "cpuset.mems"
(00.017629) cg-prop: Property "cpuset.memory_migrate"
(00.017631) cg-prop: Property "cpuset.cpu_exclusive"
(00.017634) cg-prop: Property "cpuset.mem_exclusive"
(00.017637) cg-prop: Property "cpuset.mem_hardwall"
(00.017639) cg-prop: Property "cpuset.memory_spread_page"
(00.017642) cg-prop: Property "cpuset.memory_spread_slab"
(00.017645) cg-prop: Property "cpuset.sched_load_balance"
(00.017648) cg-prop: Property "cpuset.sched_relax_domain_level"
(00.017650) cg-prop: Parsing controller "blkio"
(00.017653) cg-prop: Strategy "replace"
(00.017656) cg-prop: Property "blkio.weight"
(00.017659) cg-prop: Parsing controller "freezer"
(00.017662) cg-prop: Strategy "replace"
(00.017665) cg-prop: Parsing controller "perf_event"
(00.017668) cg-prop: Strategy "replace"
(00.017671) cg-prop: Parsing controller "net_cls"
(00.017673) cg-prop: Strategy "replace"
(00.017676) cg-prop: Property "net_cls.classid"
(00.017688) cg-prop: Parsing controller "net_prio"
(00.017691) cg-prop: Strategy "replace"
(00.017694) cg-prop: Property "net_prio.ifpriomap"
(00.017697) cg-prop: Parsing controller "pids"
(00.017700) cg-prop: Strategy "replace"
(00.017703) cg-prop: Property "pids.max"
(00.017705) cg-prop: Parsing controller "devices"
(00.017708) cg-prop: Strategy "replace"
(00.017711) cg-prop: Property "devices.list"
(00.017754) Preparing image inventory (version 1)
(00.017783) Add pid ns 1 pid 1210737
(00.017793) Add net ns 2 pid 1210737
(00.017804) Add ipc ns 3 pid 1210737
(00.017818) Add uts ns 4 pid 1210737
(00.017829) Add time ns 5 pid 1210737
(00.017845) Add mnt ns 6 pid 1210737
(00.017854) Add user ns 7 pid 1210737
(00.017864) Add cgroup ns 8 pid 1210737
(00.017868) cg: Dumping cgroups for thread 1210737
(00.017900) cg: - New css ID 1
(00.017903) cg: - [] -> [/system.slice/containerd.service] [0]
(00.017907) cg: - [blkio] -> [/system.slice/containerd.service] [0]
(00.017910) cg: - [cpu,cpuacct] -> [/system.slice/containerd.service] [0]
(00.017912) cg: - [cpuset] -> [/] [0]
(00.017915) cg: - [devices] -> [/system.slice/containerd.service] [0]
(00.017918) cg: - [freezer] -> [/] [0]
(00.017921) cg: - [hugetlb] -> [/] [0]
(00.017923) cg: - [memory] -> [/system.slice/containerd.service] [0]
(00.017926) cg: - [name=systemd] -> [/system.slice/containerd.service] [0]
(00.017929) cg: - [net_cls,net_prio] -> [/] [0]
(00.017931) cg: - [perf_event] -> [/] [0]
(00.017934) cg: - [pids] -> [/system.slice/containerd.service] [0]
(00.017937) cg: - [rdma] -> [/] [0]
(00.017939) cg: Set 1 is criu one
(00.017965) Detected cgroup V1 freezer
(00.017968) freezing processes: 100000 attempts with 100 ms steps
(00.017984) freezer.state=FROZEN
(00.018205) SEIZE 1205653 (comm bash): success
(00.018226) SEIZE 1210195 (comm python3): success

... ...

/proc/self/fd/19/docker/088a3512b65d3467fc64ae7c66ee3cdd76b4eb7f90ad4d08606444d5faddaedc/tasks
(00.128515) cg: Set 2 is root one
(00.128589) ----------------------------------------
(00.128602) Waiting for 1205653 to trap
(00.128634) Daemon 1205653 exited trapping
(00.128646) Sent msg to daemon 3 0 0
pie: 1: __fetched msg: 3 0 0
pie: 1: 1: new_sp=0x7fa093ee0748 ip 0x7fa093fdfc38
(00.156862) 1205653 was trapped
(00.156940) 1205653 was trapped
(00.156956) 1205653 (native) is going to execute the syscall 15, required is 15
(00.157001) 1205653 was stopped
(00.157269)
(00.157281) Dumping mm (pid: 1205653)
(00.157289) ----------------------------------------
(00.157299) 0x55d473fde000-0x55d47400b000 (180K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp shmid: 0x1
(00.157311) 0x55d47400b000-0x55d4740bc000 (708K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x2d000 reg fp shmid: 0x1
(00.157321) 0x55d4740bc000-0x55d4740f3000 (220K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0xde000 reg fp shmid: 0x1
(00.157330) 0x55d4740f3000-0x55d4740f7000 (16K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x114000 reg fp shmid: 0x1
(00.157339) 0x55d4740f7000-0x55d474100000 (36K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x118000 reg fp shmid: 0x1
(00.157348) 0x55d474100000-0x55d47410a000 (40K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap shmid: 0
(00.157357) 0x55d475461000-0x55d4754c4000 (396K) prot 0x3 flags 0x22 fdflags 0 st 0x221 off 0 reg heap ap shmid: 0
(00.157365) 0x7fa093ee6000-0x7fa093ee9000 (12K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp shmid: 0x2
(00.157374) 0x7fa093ee9000-0x7fa093ef0000 (28K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x3000 reg fp shmid: 0x2
(00.157383) 0x7fa093ef0000-0x7fa093ef2000 (8K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0xa000 reg fp shmid: 0x2
(00.157391) 0x7fa093ef2000-0x7fa093ef3000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0xb000 reg fp shmid: 0x2
(00.157399) 0x7fa093ef3000-0x7fa093ef4000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0xc000 reg fp shmid: 0x2
(00.157408) 0x7fa093ef4000-0x7fa093efd000 (36K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap shmid: 0
(00.157416) 0x7fa093efd000-0x7fa093f1f000 (136K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp shmid: 0x3
(00.157424) 0x7fa093f1f000-0x7fa094097000 (1504K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x22000 reg fp shmid: 0x3
(00.157432) 0x7fa094097000-0x7fa0940e5000 (312K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x19a000 reg fp shmid: 0x3
(00.157441) 0x7fa0940e5000-0x7fa0940e9000 (16K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1e7000 reg fp shmid: 0x3
(00.157449) 0x7fa0940e9000-0x7fa0940eb000 (8K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x1eb000 reg fp shmid: 0x3
(00.157457) 0x7fa0940eb000-0x7fa0940ef000 (16K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap shmid: 0
(00.157465) 0x7fa0940ef000-0x7fa0940f0000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp shmid: 0x4
(00.157473) 0x7fa0940f0000-0x7fa0940f2000 (8K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x1000 reg fp shmid: 0x4
(00.157481) 0x7fa0940f2000-0x7fa0940f3000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x3000 reg fp shmid: 0x4
(00.157489) 0x7fa0940f3000-0x7fa0940f4000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x3000 reg fp shmid: 0x4
(00.157497) 0x7fa0940f4000-0x7fa0940f5000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x4000 reg fp shmid: 0x4
(00.157505) 0x7fa0940f5000-0x7fa094103000 (56K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp shmid: 0x5
(00.157534) 0x7fa094103000-0x7fa094112000 (60K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0xe000 reg fp shmid: 0x5
(00.157542) 0x7fa094112000-0x7fa094120000 (56K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x1d000 reg fp shmid: 0x5
(00.157551) 0x7fa094120000-0x7fa094124000 (16K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x2a000 reg fp shmid: 0x5
(00.157559) 0x7fa094124000-0x7fa094125000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x2e000 reg fp shmid: 0x5
(00.157568) 0x7fa094125000-0x7fa094127000 (8K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap shmid: 0
(00.157576) 0x7fa094135000-0x7fa094136000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp shmid: 0x6
(00.157585) 0x7fa094136000-0x7fa094159000 (140K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x1000 reg fp shmid: 0x6
(00.157593) 0x7fa094159000-0x7fa094161000 (32K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x24000 reg fp shmid: 0x6
(00.157601) 0x7fa094162000-0x7fa094163000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x2c000 reg fp shmid: 0x6
(00.157609) 0x7fa094163000-0x7fa094164000 (4K) prot 0x3 flags 0x2 fdflags 0 st 0x41 off 0x2d000 reg fp shmid: 0x6
(00.157617) 0x7fa094164000-0x7fa094165000 (4K) prot 0x3 flags 0x22 fdflags 0 st 0x201 off 0 reg ap shmid: 0
(00.157626) 0x7ffd61be0000-0x7ffd61c01000 (132K) prot 0x3 flags 0x122 fdflags 0 st 0x201 off 0 reg ap shmid: 0
(00.157634) 0x7ffd61cd5000-0x7ffd61cd9000 (16K) prot 0x1 flags 0x22 fdflags 0 st 0x1201 off 0 reg vvar ap shmid: 0
(00.157643) 0x7ffd61cd9000-0x7ffd61cdb000 (8K) prot 0x5 flags 0x22 fdflags 0 st 0x209 off 0 reg vdso ap shmid: 0
(00.157651) 0xffffffffff600000-0xffffffffff601000 (4K) prot 0x4 flags 0x22 fdflags 0 st 0x204 off 0 vsys ap shmid: 0
(00.157661) Obtaining task auvx ...
(00.158004) Dumping path for -3 fd via self 19 [/home/tensorflow_mnist]
(00.158106) Dumping path for -3 fd via self 19 [/]
(00.158127) Dumping task cwd id 0x9 root id 0xa
(00.158303) ========================================
(00.158338) Dumping task (pid: 1210195 comm: python3)
(00.158347) ========================================
(00.158353) Obtaining task stat ...
(00.158508)
(00.158516) Collecting mappings (pid: 1210195)
(00.158524) ----------------------------------------
(00.158822) Found regular file mapping, OK
(00.158902) Dumping path for -3 fd via self 15 [/usr/bin/python3.9]
(00.159058) vma 41f000 borrows vfi from previous 400000
(00.159091) vma 6fa000 borrows vfi from previous 41f000
(00.159120) vma 938000 borrows vfi from previous 6fa000
(00.159148) vma 939000 borrows vfi from previous 938000
(00.169023) Error (criu/proc_parse.c:114): handle_device_vma plugin failed: No such file or directory
(00.169040) Error (criu/proc_parse.c:619): Can't handle non-regular mapping on 1210195's map 7fee01000000
(00.169166) Error (criu/cr-dump.c:1558): Collect mappings (pid: 1210195) failed with -1
(00.169276) net: Unlock network
(00.169280) Running network-unlock scripts
(00.169284) RPC
(00.176860) Unfreezing tasks into 1
(00.176893) Unseizing 1205653 into 1
(00.176907) Unseizing 1210195 into 1
(00.178698) Error (criu/cr-dump.c:2093): Dumping FAILED.
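
In a log this long, the failing step is easier to spot once the `Error` lines are pulled out on their own. The following is a minimal Python sketch, assuming the log keeps the `(timestamp) Error (file:line): message` shape seen above. The function name `extract_errors` is made up for illustration and is not part of CRIU.

```python
import re

# Matches lines such as:
# (00.169023) Error (criu/proc_parse.c:114): handle_device_vma plugin failed: ...
ERROR_RE = re.compile(
    r"^\((?P<ts>\d+\.\d+)\)\s+Error\s+"
    r"\((?P<src>[^:()]+):(?P<line>\d+)\):\s*(?P<msg>.*)$"
)


def extract_errors(log_text):
    """Return (timestamp, source file, line number, message) per Error line."""
    errors = []
    for raw in log_text.splitlines():
        m = ERROR_RE.match(raw.strip())
        if m:
            errors.append((float(m["ts"]), m["src"], int(m["line"]), m["msg"]))
    return errors


if __name__ == "__main__":
    # A few lines copied from the log above.
    sample = (
        "(00.159148) vma 939000 borrows vfi from previous 938000\n"
        "(00.169023) Error (criu/proc_parse.c:114): handle_device_vma plugin "
        "failed: No such file or directory\n"
        "(00.169040) Error (criu/proc_parse.c:619): Can't handle non-regular "
        "mapping on 1210195's map 7fee01000000\n"
        "(00.178698) Error (criu/cr-dump.c:2093): Dumping FAILED.\n"
    )
    for ts, src, line, msg in extract_errors(sample):
        print(f"{ts:.6f} {src}:{line} {msg}")
```

To run it against a real file, read the log into a string first, for example with `open(path).read()`, and pass that to `extract_errors`.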

@adrianreber
Member

That line:

(00.169023) Error (criu/proc_parse.c:114): handle_device_vma plugin failed: No such file or directory

claims that the plugin cannot be found. Not sure why.

@fxkamd can you help here?
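
The second error in the log names a mapping at `7fee01000000` in process 1210195. One way to see which file backs such a mapping is to look the address up in `/proc/<pid>/maps` while the process is still running. The sketch below only parses the standard maps format. The sample text and the device path in it are invented for illustration, since the real mapping is not shown in this thread, and `find_mapping` is a hypothetical helper name.

```python
def find_mapping(maps_text, start_addr):
    """Return the /proc/<pid>/maps entry whose range starts at start_addr, or None."""
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, _, end = fields[0].partition("-")
        if int(start, 16) == start_addr:
            path = fields[5].strip() if len(fields) == 6 else ""
            return {
                "start": int(start, 16),
                "end": int(end, 16),
                "perms": fields[1],
                "path": path,
            }
    return None


if __name__ == "__main__":
    # Invented sample; on a live system use open(f"/proc/{pid}/maps").read().
    sample = (
        "00400000-0041f000 r--p 00000000 08:01 1234 /usr/bin/python3.9\n"
        "7fee01000000-7fee01200000 rw-s 00000000 00:06 567 /dev/hypothetical-device\n"
    )
    print(find_mapping(sample, 0x7FEE01000000))
```

If the path turns out to be a device node, that is the mapping CRIU handed to a plugin through `handle_device_vma`.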

@GYDmedwin
Author

> @GYDmedwin the plugin for AMD GPUs was released in version 3.17. However, it is currently not enabled by default in CRIU packages because it requires hardware specific dependencies. Thus, you might need to build and install CRIU from source to use this functionality.
>
> > I wonder if the dump function for AMD GPU can only be used outside of docker? Or is there any other way to dump a docker with AMD GPU?
>
> There is an example Docker container with PyTorch that can be used for testing: https://github.com/checkpoint-restore/criu/blob/criu-dev/scripts/build/Dockerfile.amd-rocm

Thank you, I will try!

@GYDmedwin
Author

@adrianreber
Hi, adrianreber. I took the test a step further.

On my machine, the current CRIU has problems running the AMD plugin. The test fails whether or not I use docker.

But things go well when I get CRIU from here:

https://github.com/RadeonOpenCompute/criu

With that version, the AMD plugin works even with docker's checkpoint function. So I suspect that some change in the current CRIU is affecting the proper functioning of the AMD plugin.

I hope you can also test whether my results are correct. Thank you very much.

@adrianreber
Member

How are you installing CRIU?

https://github.com/RadeonOpenCompute/criu

That is more or less the same as here, just older.

@GYDmedwin
Author

I install CRIU from source.

The commands I use are:

  1. cd criu
  2. make -j 30
  3. sudo make install

@adrianreber
Member

That sounds correct if you have the corresponding libraries installed. If you say that https://github.com/RadeonOpenCompute/criu works, it could be that the AMD GPU support is broken in CRIU. We cannot test it as we do not have access to AMD GPUs in CI.

@GYDmedwin
Author

> That sounds correct if you have the corresponding libraries installed. If you say that https://github.com/RadeonOpenCompute/criu works, it could be that the AMD GPU support is broken in CRIU. We cannot test it as we do not have access to AMD GPUs in CI.

Oh, then I hope that the AMD developers can see this problem and verify it further.

Thanks a lot for your response, adrianreber!

Problem solved; I have closed this issue.

@fxkamd
Contributor

fxkamd commented Apr 20, 2023

https://github.com/RadeonOpenCompute/criu is probably not up to date. We have not touched this repository since upstreaming amdgpu CRIU support.

I can't tell what's going wrong just looking at the error message. Assuming it found the plugin, I'd expect more diagnostic messages. Our plugin is sprinkled with lots of pr_error, pr_info and pr_debug messages.
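
A quick way to test the "assuming it found the plugin" part is to check whether the plugin's shared object exists in the directory CRIU would search. The Python sketch below is an illustration under stated assumptions: the file name `amdgpu_plugin.so`, the two default directories, and the `CRIU_LIBS_DIR` environment variable are assumptions about a typical source install and are not confirmed anywhere in this thread, so verify them against the installed build. The helper names are hypothetical.

```python
import os

PLUGIN_NAME = "amdgpu_plugin.so"  # assumed plugin file name
DEFAULT_DIRS = ["/usr/lib/criu", "/usr/local/lib/criu"]  # assumed install dirs


def candidate_dirs(env=None):
    """List directories to search, putting an assumed env override first."""
    env = os.environ if env is None else env
    dirs = []
    override = env.get("CRIU_LIBS_DIR")  # assumed override variable
    if override:
        dirs.append(override)
    dirs.extend(DEFAULT_DIRS)
    return dirs


def find_plugin(dirs, name=PLUGIN_NAME):
    """Return the full path of every directory entry that holds the plugin file."""
    return [
        os.path.join(d, name)
        for d in dirs
        if os.path.isfile(os.path.join(d, name))
    ]


if __name__ == "__main__":
    found = find_plugin(candidate_dirs())
    if found:
        print("plugin found at:", ", ".join(found))
    else:
        print("plugin not found in:", ", ".join(candidate_dirs()))
```

When the checkpoint is driven through docker, the check has to be run in the environment where runc invokes CRIU, since that is where the plugin directory is resolved.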
