Checkpoint of chrome fails: pagemap-cache: Can't read 9397's pagemap file: No such file or directory #2365
Comments
@muvaf Any luck here? Running into the same issue.
@muvaf could you show /proc/pid/maps for the target process?
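I think it hits the MAX_RW_COUNT (0x7ffff000) limit. The length of the target VMA is 0x104fff3c000. CRIU reads 8 bytes of pagemap data per page, so with 4 KiB pages a single read covering this VMA would request 0x827ff9e0 bytes, which is more than the limit.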
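The arithmetic in this comment can be checked with a short sketch. It assumes 4 KiB pages (the usual amd64 default, not stated in the thread) and takes the VMA bounds from the "filling VMA" line in the issue description; the function name is made up for illustration.

```python
PAGE_SIZE = 0x1000          # assumption: 4 KiB pages (amd64 default)
PAGEMAP_ENTRY_SIZE = 8      # bytes of pagemap data per page, as stated above
MAX_RW_COUNT = 0x7FFFF000   # per-call limit quoted in the comment above


def pagemap_bytes(vma_start, vma_end, page_size=PAGE_SIZE):
    """Bytes of pagemap entries needed to cover [vma_start, vma_end)."""
    pages = (vma_end - vma_start) // page_size
    return pages * PAGEMAP_ENTRY_SIZE


# Bounds from the log line "filling VMA 738000c4000-83d00000000".
start, end = 0x738000C4000, 0x83D00000000
vma_len = end - start
needed = pagemap_bytes(start, end)

print(hex(vma_len))           # length of the VMA in bytes
print(vma_len // 1024)        # the same length in KiB, as shown in the log
print(hex(needed))            # size of one read covering the whole VMA
print(needed > MAX_RW_COUNT)  # whether that read exceeds the limit
```

The KiB figure matches the 1094712560K printed in the log, so the mapping size reported there is consistent with the VMA bounds.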
@muvaf could you try out avagin@9405da0? It should fix the problem. By the way, for such large dummy mappings, the pagemap file interface works slowly. Recently, the new PAGEMAP_SCAN ioctl was merged into the mainline kernel, and its support was implemented in CRIU (#2292). With these changes, CRIU handles huge dummy mappings much faster.
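The contents of the referenced patch are not shown in this thread. As a generic illustration only, one common way to stay under a per-call read limit is to split a large read into bounded chunks. The sketch below assumes that approach; `plan_reads` and `read_all` are hypothetical helper names, not CRIU functions.

```python
import os

MAX_RW_COUNT = 0x7FFFF000  # per-call limit quoted earlier in the thread


def plan_reads(offset, length, max_chunk=MAX_RW_COUNT, align=8):
    """Split [offset, offset + length) into (offset, size) pieces.

    Each piece is at most max_chunk bytes, rounded down to a multiple of
    `align` so that no 8-byte pagemap entry is split across two reads.
    """
    step = max_chunk - (max_chunk % align)
    if step <= 0:
        raise ValueError("max_chunk must be at least one aligned entry")
    chunks = []
    done = 0
    while done < length:
        size = min(step, length - done)
        chunks.append((offset + done, size))
        done += size
    return chunks


def read_all(fd, offset, length, max_chunk=MAX_RW_COUNT):
    """Read `length` bytes at `offset` from fd, one bounded chunk at a time."""
    buf = bytearray()
    for off, size in plan_reads(offset, length, max_chunk):
        while size > 0:
            data = os.pread(fd, size, off)  # may return fewer bytes than asked
            if not data:
                raise EOFError("unexpected end of file")
            buf += data
            off += len(data)
            size -= len(data)
    return bytes(buf)
```

For the 0x827ff9e0-byte request discussed above, `plan_reads(0, 0x827FF9E0)` yields two pieces, each within the limit.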
@avagin That patch did make the error go away and I was able to take the checkpoint. Thank you! However, I wasn't able to validate that the checkpoint is correctly taken. The restore command fails with the following even though
criu restore --images-dir /checkpoint --tcp-established --file-locks --evasive-devices --tcp-close --manage-cgroups=ignore -v4 --log-file restore.log --inherit-fd fd[1]:pipe:[1687037] --inherit-fd fd[2]:pipe:[1687038] --external mnt[zoneinfo]:/usr/share/zoneinfo --external mnt[null]:/dev/null --external mnt[random]:/dev/random --external mnt[urandom]:/dev/urandom --external mnt[tty]:/dev/tty --external mnt[zero]:/dev/zero --external mnt[full]:/dev/full
@lukejmann I needed to add
@avagin Huh I didn't realize that. I think, for starters, having an To go further may not be feasible due to the same issue TCP has in regards to change of IP addresses, so at least we'd give users an escape hatch if they really have to change the IP address.
@avagin FWIW, if you can give me a pointer, I can try to get a PR going to add the
@muvaf I am skeptical about the idea of "--udp-close." There is a significant difference between TCP and UDP. TCP is connection-oriented, and the situation where a connection is interrupted is entirely normal and must be handled in the code. UDP, on the other hand, is connectionless. Therefore, applications may be caught off guard if "send" or "recv" return errors. You can try out the following patch to see how your workload handles closed UDP sockets after restore:
For connected UDP sockets, it might be a good idea to skip binding to the local address. When CRIU calls "connect" to restore the destination address and port, the socket will be bound to the source address and a "random" port. I believe this should work in many cases. Could you please try the next patch, which implements this behavior?
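The kernel behavior described here (a UDP socket that is connected without an explicit bind gets a local address and port assigned automatically) can be observed with a small sketch. This is not the patch mentioned above. It assumes a Linux host with a working loopback interface, and the helper name is made up for illustration.

```python
import socket


def connected_udp_local_addr(peer_addr):
    """Connect an unbound UDP socket and report its local address
    before and after the connect() call."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        before = s.getsockname()  # not bound yet
        s.connect(peer_addr)      # no bind(): the kernel picks address and port
        after = s.getsockname()
    return before, after


# A throwaway UDP peer on loopback so there is a destination to connect to.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as peer:
    peer.bind(("127.0.0.1", 0))
    before, after = connected_udp_local_addr(peer.getsockname())

print(before)  # local address before connect
print(after)   # local address after connect
```

After `connect`, the local address is the source address the kernel chose for reaching the peer, and the port is an ephemeral one, which is why the original source port is not preserved with this approach.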
Description
Taking checkpoint of chrome fails with the following error:
filling VMA 738000c4000-83d00000000 (1094712560K)
- 1094712560K sounds too big?
Steps to reproduce the issue:
The dump happens inside a Kubernetes pod and the image is proprietary. I can create a new image if it turns out the problem is not straightforward and requires a full reproduction to debug.
I used the following command to dump:
The process tree was like the following:
crit commands like `crit x . fds` also fail because dump is not complete.
Describe the results you received:
Got error during dump:
Describe the results you expected:
Expected success.
Additional information you deem important (e.g. issue happens only occasionally):
Happens consistently.
CRIU logs and information:
dump.log
Output of `criu --version`:
Version: 3.19 (gitid 0)
Output of `criu check --all`:
Additional environment details:
It's running inside a Kubernetes pod where container runtime is containerd and node arch is amd64.