Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cgroup: fix --manage-cgroups=ignore #1800

Conversation

adrianreber
Copy link
Member

Using --manage-cgroups=ignore was still trying to restore cgroup properties at certain places.

Restoring a OCI container in a different cgroup (than the one it was checkpointed from) resulted in CRIU trying to put the container in the original cgroup but the container runtime (runc, crun) was putting it another cgroup. A second checkpoint of the container with two different cgroup settings did no longer work.

With this change CRIU really completely ignores the cgroup and the container can be checkpointed and restored multiple times even if the cgroup changes.

This PR includes a test to verify --manage-cgroups=ignore' works as expected.

@codecov-commenter
Copy link

codecov-commenter commented Apr 3, 2022

Codecov Report

Merging #1800 (0772605) into criu-dev (5b2e585) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head 0772605 differs from pull request most recent head 6d0b0b2. Consider uploading reports for the commit 6d0b0b2 to get more accurate results

@@             Coverage Diff              @@
##           criu-dev    #1800      +/-   ##
============================================
+ Coverage     68.90%   68.96%   +0.05%     
============================================
  Files           128      128              
  Lines         33191    33194       +3     
============================================
+ Hits          22871    22892      +21     
+ Misses        10320    10302      -18     
Impacted Files Coverage Δ
criu/cgroup.c 77.23% <100.00%> (+0.05%) ⬆️
compel/src/lib/infect.c 56.10% <0.00%> (+0.38%) ⬆️
criu/uffd.c 78.88% <0.00%> (+2.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b2e585...6d0b0b2. Read the comment docs.

@avagin
Copy link
Member

avagin commented Apr 4, 2022

I don't agree that runsc or crun has to use --manage-cgroups=ignore. A container can have a local hierarchy that we need to restore. I think a proper use of --cgroup-root should fix the origin problem.

@avagin
Copy link
Member

avagin commented Apr 4, 2022

You need to add a commit message for the test. It has to describe what test does.

@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch 2 times, most recently from f221917 to aedc3be Compare April 5, 2022 07:00
Copy link
Member

@rst0git rst0git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I can confirm that the patch in this PR fixes containers/podman#13672

@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch 4 times, most recently from 4897c12 to 10d6512 Compare April 5, 2022 14:14
@adrianreber
Copy link
Member Author

I don't agree that runsc or crun has to use --manage-cgroups=ignore.

Currently with the use of --manage-cgroups=soft we are using something which restores the container in the wrong cgroup. With ignore it works correctly.

A container can have a lock hierarchy that we need to restore.

Sorry, I don't understand this.

I think a proper use of --cgroup-root should fix the origin problem.

Currently the container runtimes are moving the restored process into the cgroup after restore and that seems to work.

@avagin
Copy link
Member

avagin commented Apr 5, 2022

Sorry, I don't understand this.

It was my typo. I mean it can have a local hierarchy. Do you mount cgroups into a container? Can processes in a container create new cgroups?

@avagin
Copy link
Member

avagin commented Apr 5, 2022

I think a proper use of --cgroup-root should fix the origin problem.

Currently the container runtimes are moving the restored process into the cgroup after restore and that seems to work.

This works only in cases when processes in a container don't create cgroups...

@adrianreber
Copy link
Member Author

Sorry, I don't understand this.

It was my type. I mean it can have a local hierarchy. Do you mount cgroups into a container? Can processes in a container create new cgroups?

Probably yes, but so far no one seems to be using it in OCI containers, because until now I have not seen any reports of people complaining about non working cgroups inside of the container.

During run in Podman the cgroup limit is specified. During restore this limit is again set in the cgroup and the container is restored into that newly created cgroup and the resulting container has the same limits as the original containers.

cgroup limits which are set during runtime of the container using exec are not preserved this way, but I am not sure anybody really uses it in OCI containers.

@adrianreber
Copy link
Member Author

Anyway, the fix for CRIU is to correctly work if ignore is specified. Currently CRIU does not correctly handle ignore.

What the container engine does is not really important here.

@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch from 10d6512 to 6d0b0b2 Compare April 5, 2022 15:25
@avagin
Copy link
Member

avagin commented Apr 5, 2022

Anyway, the fix for CRIU is to correctly work if ignore is specified. Currently CRIU does not correctly handle ignore.

It works how it was designed by its author. It doesn't restore a cgroup tree, but it moves tasks in their cgroups. This behaviour was introduced in commit f95b05e:

Date:   Wed Aug 6 22:23:00 2014 +0400

    opts: add --manage-cgroups option

    criu managed cgroups is now an opt-in thing, so by default criu does not manage
    (i.e. dump or restore) cgroups. This allows users to use the previous behavior.

Do we have uses that use this behaviour and whom new changes will break? I think the chance of breaking someone is low.

@adrianreber
Copy link
Member Author

adrianreber commented Apr 5, 2022

Anyway, the fix for CRIU is to correctly work if ignore is specified. Currently CRIU does not correctly handle ignore.

It works how it was designed by its author. It doesn't restore a cgroup tree, but it moves tasks in their cgroups. This behaviour was introduced in commit f95b05e:

Date:   Wed Aug 6 22:23:00 2014 +0400

    opts: add --manage-cgroups option

    criu managed cgroups is now an opt-in thing, so by default criu does not manage
    (i.e. dump or restore) cgroups. This allows users to use the previous behavior.

No, I do not agree. According to the man page ignore should Don’t deal with cgroups and pretend that they don’t exist. and if I restore a process with ignore that does not happen:

# sleep infinity &
# cat /proc/`pidof sleep`/cgroup
0::/user.slice/user-0.slice/session-1.scope
# mkdir /sys/fs/cgroup/test1
# echo `pidof sleep` > /sys/fs/cgroup/test1/cgroup.procs 
# cat /proc/`pidof sleep`/cgroup
0::/test1
# criu dump -t `pidof sleep` -j -D /tmp/2 --manage-cgroups=ignore
# criu restore -j -D /tmp/2 -v4 -d --manage-cgroup=ignore
(00.000005) Version: 3.16.1 (gitid 0)
(00.000037) Running on fedora02 Linux 5.16.18-200.fc35.x86_64 #1 SMP PREEMPT Mon Mar 28 14:10:07 UTC 2022 x86_64
(00.000130) Loaded kdat cache from /run/criu/criu.kdat
(00.000149) rlimit: RLIMIT_NOFILE unlimited for self
(00.000509) cpu: x86_family 25 x86_vendor_id AuthenticAMD x86_model_id AMD Ryzen 5 5600G with Radeon Graphics         
(00.000561) cpu: fpu: xfeatures_mask 0x205 xsave_size 2440 xsave_size_max 2440 xsaves_size 840
(00.000590) cpu: fpu: x87 floating point registers     xstate_offsets      0 / 0      xstate_sizes    160 / 160   
(00.000598) cpu: fpu: AVX registers                    xstate_offsets    576 / 576    xstate_sizes    256 / 256   
(00.000611) cpu: fpu: Protection Keys User registers   xstate_offsets   2432 / 832    xstate_sizes      8 / 8     
(00.000616) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
(00.000708) kernel pid_max=4194304
(00.000741) Reading image tree
(00.000906) Add mnt ns 6 pid 769
(00.000921) Add net ns 2 pid 769
(00.000924) Add pid ns 1 pid 769
(00.000929) pstree pid_max=769
(00.000938) Will restore in 0 namespaces
(00.000949) NS mask to use 0
(00.001031) Collecting 51/56 (flags 3)
(00.001121) No memfd.img image
(00.001146)  `- ... done
(00.001174) Collecting 40/54 (flags 2)
(00.001252) Collected [usr/bin/sleep] ID 0x1
(00.001280) Collected [usr/lib/locale/locale-archive] ID 0x2
(00.001304) Collected [usr/lib64/libc.so.6] ID 0x3
(00.001326) Collected [usr/lib64/ld-linux-x86-64.so.2] ID 0x4
(00.001363) Collected [dev/pts/0] ID 0x6
(00.001439) Collected [tmp] ID 0x7
(00.001496) Collected [.] ID 0x8
(00.001518)  `- ... done
(00.001529) Collecting 46/68 (flags 0)
(00.001536) No remap-fpath.img image
(00.001581)  `- ... done
(00.001771) No apparmor.img image
(00.001897) Running pre-restore scripts
(00.002084) No pidns-1.img image
(00.002246) Forking task with 769 pid (flags 0x0)
(00.002295) Creating process using clone3()
(00.002717) PID: real 769 virt 769
(00.003161)    769: timens: monotonic -3 172033336
(00.003195)    769: timens: boottime -3 171999592
(00.003281)    769: cg: Move into 2
(00.003299)    769: cg:   `-> unifie//test1/cgroup.procs
(00.003324)    769: Error (criu/cgroup.c:1092): cg: Can't move 769 into unifie//test1/cgroup.procs (-1/-1): Bad file descriptor
(00.003337)    769: Error (criu/cgroup.c:1206): cg: Can't move into unifie//test1/cgroup.procs (-1/-1): Bad file descriptor
(00.003390) Error (criu/cr-restore.c:2447): Restoring FAILED.
(00.003806) Error (criu/cr-restore.c:1480): 769 killed by signal 9: Killed

With my patch that works. After restore the process is back in the current cgroup:

# cat /proc/`pidof sleep`/cgroup
0::/user.slice/user-0.slice/session-1.scope

The ignore functionality is partially implemented currently but it is missing at two places. That is what my patch fixes.

@avagin
Copy link
Member

avagin commented Apr 5, 2022

No, I do not agree. According to the man page ignore should Don’t deal with cgroups and pretend that they don’t exist.

You can be disagree and it is what it is. The source of truth is the commit message and the code but not the man page that was written much later than the code.

And I lean to accept this patch rather than introduce a new mode.

@adrianreber
Copy link
Member Author

And I lean to accept this patch rather than introduce a new mode.

When did we talk about a new mode? This is just about fixing the existing ignore mode. Nothing else.

@adrianreber
Copy link
Member Author

manage_cgroup as a boolean is deprecated and according to the source the right thing to do is --manage-cgroups-mode=MODE.

@kolyshkin
Copy link
Contributor

Ahh! That must explain why my earlier use of --manage-cgroup-mode=ignore did not fix the issues I was seeing with lazy migration (opencontainers/runc#2924 and opencontainers/runc#2760)!

@Snorch
Copy link
Member

Snorch commented Apr 7, 2022

I agree with @avagin, "ignore" should only mean that the container runtime is in charge of creating the whole container cgroup directory tree, but criu should still put processes it creates into same topology it had seen them on dump.

In Virtuozzo criu when we migrate container with change of container ID we do dump it with something like --cgroup-root /machine.slice/$old_ct_ID and restore with --cgroup-root /machine.slice/$new_ct_ID. This way if process was dumped in /sys/fs/cgroup/memory/machine.slice/$old_ct_ID/sub/directory it would restore in /sys/fs/cgroup/memory/machine.slice/$new_ct_ID/sub/directory.

Also note that in "soft" mode we don't restore cgroup options on root container cgroups if cgroup directories were pre-created by container environment, so container environment is in charge to put the same values there as was on dump. Using "full" mode solves it but requires some fixes, see e.g.: https://src.openvz.org/projects/OVZ/repos/criu/commits/6cab5493c59d699fdd231fd1ad6d808fe3bc2278

@adrianreber
Copy link
Member Author

I agree with @avagin, "ignore" should only mean that the container runtime is in charge of creating the whole container cgroup directory tree, but criu should still put processes it creates into same topology it had seen them on dump.

Independent of what the container runtime should do, ignore currently does not work. See my example in #1800 (comment)

@Snorch
Copy link
Member

Snorch commented Apr 7, 2022

I agree with @avagin, "ignore" should only mean that the container runtime is in charge of creating the whole container cgroup directory tree, but criu should still put processes it creates into same topology it had seen them on dump.

Independent of what the container runtime should do, ignore currently does not work. See my example in #1800 (comment)

Yeh probably you are right, proving hunks from the original patch:

c7d646afb ("cgroups: Introduce cgroup management modes")
...
+#define CG_MODE_IGNORE         (0u << 0)       /* Zero is important here */
...
@@ -1122,10 +1130,14 @@ static int prepare_cgroup_sfd(CgroupEntry *ce)
        int off, i, ret;
        char paux[PATH_MAX];
 
-       pr_info("Preparing cgroups yard\n");
+       pr_info("Preparing cgroups yard (cgroups restore mode %#x)\n",
+               opts.manage_cgroups);
+       if (!opts.manage_cgroups)
+               return 0;

Cgroup yard was never initialized for "ignore", so it was never possible to put tasks in right cgroups with "ignore".

Please also apply this fix to you pr, so that "unified" does not convert to "unifie":

diff --git a/criu/cgroup.c b/criu/cgroup.c
index 82d9b16a2..7490cebf2 100644
--- a/criu/cgroup.c
+++ b/criu/cgroup.c
@@ -1035,7 +1035,7 @@ static int ctrl_dir_and_opt(CgControllerEntry *ctl, char *dir, int ds, char *opt
                }
 
                if (n[0] == 0)
-                       doff += snprintf(dir + doff, ds - doff, "unified");
+                       doff += snprintf(dir + doff, ds - doff, "unified,");
                else
                        doff += snprintf(dir + doff, ds - doff, "%s,", n);
                if (opt)

After this fix it is obvious that with "ignore" the fail happens as we always have -1 CGROUP_YARD service fd in userns_move.

@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch from 6d0b0b2 to 91af633 Compare April 8, 2022 07:52
Copy link
Member

@Snorch Snorch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch from 91af633 to 6425209 Compare April 8, 2022 13:46
@avagin
Copy link
Member

avagin commented Apr 8, 2022

pls remove the passage about oci containers from the commit message. It is not correlate directly with the patch.

@rst0git
Copy link
Member

rst0git commented Apr 8, 2022

pls remove the passage about oci containers from the commit message. It is not correlate directly with the patch.

Actually, the commit message describes the problem we are trying to fix: containers/podman#13672

@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch 2 times, most recently from 67a9557 to d62380c Compare April 11, 2022 06:55
@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch from d62380c to 4db6280 Compare April 12, 2022 08:01
@avagin
Copy link
Member

avagin commented Apr 13, 2022

BTW @Snorch explained me that what I think the ignore mode means is actually covered by the none mode.

The code expected that the cgroup directory ends with a ',' and
unconditionally removes the last character. For the "unified" case this
resulted in the last 'd' being remove instead of the non existing comma.

This just adds a comma after "unified" so that the last removed
character is not the 'd'.

Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Using '--manage-cgroups=ignore' fails during restore. This fixes the use
of '--manage-cgroups=ignore'.

Signed-off-by: Adrian Reber <areber@redhat.com>
Test to ensure that --manage-cgroups=ignore works correctly.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber adrianreber force-pushed the 2022-04-03-really-ignore-cgroups branch from 4db6280 to bed20ce Compare April 14, 2022 09:11
@avagin avagin merged commit c1125b1 into checkpoint-restore:criu-dev Apr 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants