
Make cgroup freezer only care about current control group #3065

Closed
wants to merge 2 commits

Conversation

odinuge
Contributor

@odinuge odinuge commented Jul 5, 2021

If a control group is frozen, all its descendants will report FROZEN.
This causes runc to freeze a control group, and not thaw it afterwards.
When its parent is thawed, the descendant will remain frozen.

This causes issues for k8s.
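
To make the v1 semantics above concrete, here is a minimal standalone sketch (the cgroup paths are hypothetical, and it assumes the v1 freezer hierarchy is mounted at /sys/fs/cgroup/freezer):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readTrim returns the trimmed contents of a cgroup file, or the error text.
func readTrim(dir, file string) string {
	b, err := os.ReadFile(filepath.Join(dir, file))
	if err != nil {
		return "<" + err.Error() + ">"
	}
	return strings.TrimSpace(string(b))
}

func main() {
	// Hypothetical paths: a frozen parent and a child created inside it.
	parent := "/sys/fs/cgroup/freezer/the-scope.scope"
	child := filepath.Join(parent, "container.slice")

	// With the parent frozen, the child's freezer.state reports FROZEN even
	// though the child itself was never frozen (freezer.self_freezing is 0).
	// Reading freezer.state is what makes runc re-freeze the child and leave
	// it frozen after the parent is thawed.
	fmt.Println("child freezer.state:        ", readTrim(child, "freezer.state"))
	fmt.Println("child freezer.self_freezing:", readTrim(child, "freezer.self_freezing"))
}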

@odinuge
Contributor Author

odinuge commented Jul 5, 2021

This should probably get some integration tests, but I would need help in that case.

Something like:

  • Create the-scope.scope
  • Freeze the-scope.scope
  • Create container container.slice in the-scope.scope
  • Unfreeze the-scope.scope

This should "unfreeze" container.slice, and work as expected.

@kolyshkin
Contributor

I can and will create some tests tomorrow (on PTO today) but I think this should be tested in kubernetes.

Note that commit bd8e070 (included in runc 1.0.0) brings the systemd v1 driver almost back to the state it was in before 108ee85 (part of 1.0.0-rc92).

Anyway, will take a look first thing tomorrow.

@kolyshkin
Contributor

This should probably get some integration tests, but I would need help in that case.

Something like:

  • Create the-scope.scope
  • Freeze the-scope.scope
  • Create container container.slice in the-scope.scope
  • Unfreeze the-scope.scope

This should "unfreeze" container.slice, and work as expected.

I'm not quite sure this is the scenario we are seeing in kubernetes/kubernetes#102508 (comment) -- I don't think kubernetes explicitly freezes anything.

Or do you mean we are having a race between

(1) the parent (pod) cgroup Set() (during which it is temporarily frozen)

and

(2) the child (container) cgroup Set() (which gets the current status as FROZEN because the parent temporarily froze it, and thus sets the container to be frozen after Set() is finished)?

This is a possible scenario.

Just to add some context, we have recently added various hacks to the cgroup v1 freezer implementation to make it work more reliably (#2774, #2791, #2918, #2941) for some scenarios. This might have resulted in what we see with kubernetes, although I don't quite see how. Perhaps all the freeze retries lead to a widened race window?

In such a case this PR makes sense. I am working on a test case.

@kolyshkin
Contributor

Here's my test but it passes even without your patch. You can play with it yourself by adding the following code to the end of runc's libcontainer/cgroups/systemd/systemd_test.go file:

func TestFreezePodCgroup(t *testing.T) {
	if !IsRunningSystemd() {
		t.Skip("Test requires systemd.")
	}
	if os.Geteuid() != 0 {
		t.Skip("Test requires root.")
	}

	podConfig := &configs.Cgroup{
		Parent: "system.slice",
		Name:   "system-runc_test_pods.slice",
		Resources: &configs.Resources{
			SkipDevices: true,
			Freezer:     configs.Frozen,
		},
	}
	// Create a "pod" cgroup (a systemd slice to hold containers),
	// which is frozen initially.
	pm := newManager(podConfig)
	defer pm.Destroy() //nolint:errcheck
	if err := pm.Apply(-1); err != nil {
		t.Fatal(err)
	}

	if err := pm.Freeze(configs.Frozen); err != nil {
		t.Fatal(err)
	}
	if err := pm.Set(podConfig.Resources); err != nil {
		t.Fatal(err)
	}

	// Check the pod is frozen.
	pf, err := pm.GetFreezerState()
	if err != nil {
		t.Fatal(err)
	}
	if pf != configs.Frozen {
		t.Fatalf("expected pod to be frozen, got %v", pf)
	}
	t.Log("pod frozen")

	// Create a "container" within the "pod" cgroup.
	// This is not a real container, just a process in the cgroup.
	config := &configs.Cgroup{
		Parent:      "system-runc_test_pods.slice",
		ScopePrefix: "test",
		Name:        "FreezeParent",
	}

	cmd := exec.Command("bash", "-c", "while read; do echo $REPLY; done")
	cmd.Env = append(os.Environ(), "LANG=C")

	// Setup stdin.
	stdinR, stdinW, err := os.Pipe()
	if err != nil {
		t.Fatal(err)
	}
	cmd.Stdin = stdinR

	// Setup stdout.
	stdoutR, stdoutW, err := os.Pipe()
	if err != nil {
		t.Fatal(err)
	}
	cmd.Stdout = stdoutW
	rdr := bufio.NewReader(stdoutR)

	// Setup stderr.
	var stderr bytes.Buffer
	cmd.Stderr = &stderr

	err = cmd.Start()
	stdinR.Close()
	stdoutW.Close()
	defer func() {
		_ = stdinW.Close()
		_ = stdoutR.Close()
	}()
	if err != nil {
		t.Fatal(err)
	}
	t.Log("container started")
	// Make sure to not leave a zombie.
	defer func() {
		// These may fail, we don't care.
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()

	// Put the process into a cgroup.
	m := newManager(config)
	defer m.Destroy() //nolint:errcheck

	if err := m.Apply(cmd.Process.Pid); err != nil {
		t.Fatal(err)
	}
	t.Log("container pid added")
	// Check that we put the "container" into the "pod" cgroup.
	if !strings.HasPrefix(m.Path("freezer"), pm.Path("freezer")) {
		t.Fatalf("expected container cgroup path %q to be under pod cgroup path %q",
			m.Path("freezer"), pm.Path("freezer"))
	}
	// Check the pod is still reported as frozen.
	cf, err := pm.GetFreezerState()
	if err != nil {
		t.Fatal(err)
	}
	if cf != configs.Frozen {
		t.Fatalf("expected pod to still be frozen, got %v", cf)
	}
	t.Log("pod is still frozen -- thawing")

	// Unfreeze the pod.
	if err := pm.Freeze(configs.Thawed); err != nil {
		t.Fatal(err)
	}

	// Check the "container" works.
	marker := "one two\n"
	if _, err := stdinW.WriteString(marker); err != nil {
		t.Fatal(err)
	}
	reply, err := rdr.ReadString('\n')
	if err != nil {
		t.Fatalf("reading from container: %v", err)
	}
	if reply != marker {
		t.Fatalf("expected %q, got %q", marker, reply)
	}
}

@kolyshkin
Contributor

Another theory. What's happening is that the whole pod (aka slice) is frozen (together with the frozen sub-cgroups, but this is both normal and irrelevant, I think).

This can happen in two scenarios:

  1. Set() is stuck in the middle -- after the pod/slice is frozen and before it is thawed.
  2. Two Set() methods racing on the same pod/slice (the first Set() freezes the cgroup, the second reads the cgroup state as FROZEN and thus sets it to frozen).

Issue 2 can be fixed by reusing the cgroup mutex for Set().

In general, I think, the (*legacyManager).Set() code is somewhat hairy wrt freezing -- for one thing, it uses the public Freeze method, which changes container resources, and I don't think it should. Also, having the Freezer setting in resources is weird, too.
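
As a rough sketch of the mutex idea for scenario 2 (self-contained stand-ins, not the actual legacyManager code; real names and helpers differ):

package cgroupsketch

import "sync"

// FreezerState mirrors the tri-state used by libcontainer's configs package;
// defined locally so the sketch is self-contained.
type FreezerState int

const (
	Undefined FreezerState = iota
	Frozen
	Thawed
)

// podManager is a stand-in for the systemd legacy manager; the real type and
// its helpers look different -- this only shows where the lock would go.
type podManager struct {
	mu    sync.Mutex
	state FreezerState // last state explicitly requested by a caller
}

// freeze is a placeholder for the real freezer write.
func (m *podManager) freeze(s FreezerState) { m.state = s }

// Set serializes the freeze/apply/thaw sequence, so a second concurrent Set
// cannot observe the temporary FROZEN state and persist it by mistake.
func (m *podManager) Set(apply func() error) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	prev := m.state
	if prev != Frozen {
		m.freeze(Frozen)
		defer m.freeze(prev) // restore the state the caller last asked for
	}
	return apply()
}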

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

I'm not quite sure this is the scenario we are seeing in kubernetes/kubernetes#102508 (comment) -- I don't think kubernetes explicitly freezes anything.

It does not do it explicitly, and it currently does not do it at all. However, when using runc v1.0.0, the cgroups will be frozen for a brief period when setting resources. This is done on startup, and once in a while after that. The freezing itself is not a huge problem, but together with the issue I am (hopefully) fixing here, containers and other control groups end up frozen permanently.

For k8s usage of libcontainer, there is no reason to freeze the control groups when updating resources, and thus it should not be done.

Or do you mean we are having a race between

(1) the parent (pod) cgroup Set() (during which it is temporarily frozen)

and

(2) the child (container) cgroup Set() (which gets the current status as FROZEN because the parent temporarily froze it, and thus sets the container to be frozen after Set() is finished)?

This is a possible scenario.

Yes, that is the case for cgroup v1.

Just to add some context, we have recently added various hacks to the cgroup v1 freezer implementation to make it work more reliably (#2774, #2791, #2918, #2941) for some scenarios. This might have resulted in what we see with kubernetes, although I don't quite see how. Perhaps all the freeze retries lead to a widened race window?

In such a case this PR makes sense. I am working on a test case.

Thanks. For k8s, we should never freeze the kubepods.slice or qos-class.slice control groups, as there is no need to do it. It has never been done before, since k8s didn't use the Set from runc before kubernetes/kubernetes#102147, and at that point we didn't freeze either.

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

I have added a test based on your suggestion now @kolyshkin. As you see, it fails without this change: #3066.

The behavior after this is the same for cgroup v1 as for cgroup v2. The test passes for both, and the changes only touch v1.

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

Also, separate from this PR, k8s needs a way to skip the freeze altogether. Should we add a new flag like SkipDeviceFreeze, or should we revert to the old behavior prior to bd8e070, and handle runc update separately?

odinuge and others added 2 commits July 7, 2021 20:02
If a control group is frozen, all its descendants will report FROZEN.
This causes runc to freeze a control group, and not thaw it afterwards.
When its parent is thawed, the descendant will remain frozen.

Signed-off-by: Odin Ugedal <odin@uged.al>
Initial test work was done by Kir Kolyshkin (@kolyshkin).

Co-Authored-By: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Odin Ugedal <odin@uged.al>
@odinuge changed the title from "[WIP] Make cgroup freezer only care about current control group" to "Make cgroup freezer only care about current control group" on Jul 7, 2021
@kolyshkin
Contributor

So I took a fresh look today and I think we need to do it a bit differently.

Instead of having a public method GetFreezerState() to return "THAWED" while in fact the container is frozen (because its parent is frozen), I think it's better to use the contents of freezer.self_freezing directly from Set (as in this particular case we're not interested in whether the cgroup/container is frozen, but rather whether it was frozen directly).

The logic in the function should also be cleaned up a bit -- for example, we don't want to freeze the frozen container, or unfreeze the container that is about to be frozen again.

We should probably also introduce the "big freezing lock" because the "get freezer state / freeze / set unit properties / return to old state" code sequence should be protected.

I'll try to write some code...
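
For illustration only, a sketch of what reading freezer.self_freezing could look like (simplified path handling, not the actual runc code; the package name is arbitrary):

package freezer

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// selfFrozen reports whether the cgroup at dir froze itself, based on the
// cgroup v1 freezer.self_freezing file ("0" or "1"). Freezes inherited from
// ancestors are deliberately ignored.
func selfFrozen(dir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(dir, "freezer.self_freezing"))
	if err != nil {
		return false, err
	}
	switch strings.TrimSpace(string(data)) {
	case "0":
		return false, nil
	case "1":
		return true, nil
	default:
		return false, fmt.Errorf("unexpected freezer.self_freezing value: %q", data)
	}
}

Set could then consult selfFrozen to decide whether the cgroup needs to be returned to a frozen state afterwards.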

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

Instead of having a public method GetFreezerState() to return "THAWED" while in fact the container is frozen (because its parent is frozen),

Yeah, this all depends on how we define a frozen container. With this PR cgroup v1 and v2 will work in the same way, and I think having them work differently is a bad idea in general, no matter the naming.

I think it's better to use the contents of freezer.self_freezing directly from Set (as in this particular case we're not interested in whether the cgroup/container is frozen, but rather whether it was frozen directly).

Again, it depends on the definition of "is frozen". For v1, "frozen" includes the case where a parent is frozen, but for v2 it does not. But yeah, we can invert the logic: if freezer.self_freezing == 1, wait until we see FROZEN in the state. That works as well.
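
A rough sketch of that inverted logic, purely illustrative (arbitrary retry count and sleep interval; real code would need proper limits and error handling):

package freezer

import (
	"errors"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitSelfFrozen waits for freezer.state to report FROZEN, but only if this
// cgroup froze itself (freezer.self_freezing == "1").
func waitSelfFrozen(dir string) error {
	self, err := os.ReadFile(filepath.Join(dir, "freezer.self_freezing"))
	if err != nil {
		return err
	}
	if strings.TrimSpace(string(self)) != "1" {
		return nil // not frozen by itself; nothing to wait for
	}
	for i := 0; i < 1000; i++ {
		state, err := os.ReadFile(filepath.Join(dir, "freezer.state"))
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(state)) == "FROZEN" {
			return nil
		}
		time.Sleep(time.Millisecond)
	}
	return errors.New("timed out waiting for FROZEN")
}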

The logic in the function should also be cleaned up a bit -- for example, we don't want to freeze the frozen container, or unfreeze the container that is about to be frozen again.

Not sure that I follow this logic (but again, it depends on the definition of a frozen container). If the container is frozen via a parent, we would still like to freeze it ourselves in case the parent is thawed. If freezer.self_freezing == 1, writing "FROZEN" to the state is essentially a no-op, so IMO it doesn't matter. (We can, however, discuss whether we should do it, but then we should do the same with other cgroup files as well.)

We should probably also introduce the "big freezing lock" because the "get freezer state / freeze / set unit properties / return to old state" code sequence should be protected.

Are you thinking about an internal lock/mutex in runc? That would probably not work when runc update is executed twice at the same time?

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

But overall, this doesn't really matter for k8s, since k8s only needs a way to disable this freezing mechanism altogether...

@kolyshkin
Contributor

But overall, this doesn't really matter for k8s, since k8s only needs a way to disable this freezing mechanism altogether...

Can you elaborate? I did disable the freezing in 108ee85 but later realized that it was causing issues (#3014) and thus had it reverted (#3019). Even if no Device* systemd unit properties are set, the container sees a brief EPERM. I believe the issue is the same if we're modifying parent/pod systemd unit properties for cgroup v1.

@kolyshkin
Contributor

@odinuge PTAL #3072 which (I hope) solves the same issue in a slightly different way.

@cyphar
Member

cyphar commented Jul 7, 2021

@odinuge Can you clarify why the freezing is not necessary in Kubernetes? Are resources never updated for running containers? If they are, then the freezing is arguably necessary (under systemd) because we need to avoid spurious device errors from occurring during the device rules update (a long-term solution would be to fix systemd's device rule updating code to do what we do for device updates).

As @kolyshkin said, we need this even if the devices aren't being updated because it seems systemd will still re-apply the device rules in certain cases. (I was about to suggest we skip freezing if SkipDevices was set, but then remembered we tried that already and it had issues -- see the issues Kir linked.)

@odinuge
Contributor Author

odinuge commented Jul 8, 2021

@kolyshkin:

Can you elaborate? I did disable the freezing in 108ee85 but later realized that it was causing issues (#3014) and thus had it reverted (#3019). Even if no Device* systemd unit properties are set, the container sees a brief EPERM. I believe the issue is the same if we're modifying parent/pod systemd unit properties for cgroup v1.

Yes and no.

When talking about k8s, we talk about two distinct use cases of runc, where I am referring to the latter:

  • Use of the runc binary to create containers; not directly used by k8s, but by the container runtimes.
    • Freezing here is sound, since systemd will write * to devices.deny and then iterate through the devices, writing them to devices.allow.
    • If no freeze is done, containers can briefly be denied device access.
  • Use of runc/libcontainer as a dependency to deal with control groups.
    • Freezing here is not necessary, since we don't set any device parameters; systemd will only write * to devices.allow once in a while/during updates.
    • If no freeze is done, nothing is changed.

For ref: https://github.com/systemd/systemd/blob/dd376574fd62c6fcf446ee500ca85558a71d645e/src/core/cgroup.c#L1130-L1133

Does that make sense?

@odinuge PTAL #3072 which (I hope) solves the same issue in a slightly different way.

Well, yeah, it kinda does the same. Will comment on the PR. 👍

@cyphar

@odinuge Can you clarify why the freezing is not necessary in Kubernetes? Are resources never updated for running containers? If they are, then the freezing is arguably necessary (under systemd) because we need to avoid spurious device errors from occurring during the device rules update (a long-term solution would be to fix systemd's device rule updating code to do what we do for device updates).

As @kolyshkin said, we need this even if the devices aren't being updated because it seems systemd will still re-apply the device rules in certain cases.

Does my answer to @kolyshkin above make sense to answer this?

(I was about to suggest we skip freezing if SkipDevices was set, but then remembered we tried that already and it had issues -- see the issues Kir linked.)

SkipDevices worked well for cgroup management in k8s when it skipped the freezing part. The issue was introduced with #2994, where runc update uses the same mechanism, but the two are different use cases. The only thing k8s cgroup management will need is a flag to say "we allow all devices, and therefore there is no need to freeze during update".

Freezing the internal cgroups in k8s (not the containers themselves) would work, even though we shouldn't do it, but because of the bug fixed in this PR, containers using runc v1.0.0 will permanently freeze upon creation and/or update. See more about it here: kubernetes/kubernetes#102508 (comment).

@kolyshkin
Contributor

So, one way to fix this would be to have code that tells whether systemd is going to do deny-all on SetUnitProperties, and skip the freeze if it won't.

This is a kludge on top of a kludge, and it depends on systemd internals (which may change), but in these circumstances it may be the best way to proceed.
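
Something along these lines, perhaps (assuming go-systemd dbus.Property values; matching on the "Device" prefix is a simplification of what actually makes systemd rewrite device rules, not an authoritative check):

package freezecheck

import (
	"strings"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

// needsFreeze guesses whether SetUnitProperties will make systemd rewrite
// device rules (deny-all, then re-add the allowed devices), which is when
// freezing is needed to hide the transient EPERMs from the container.
func needsFreeze(props []systemdDbus.Property) bool {
	for _, p := range props {
		if strings.HasPrefix(p.Name, "Device") { // DeviceAllow, DevicePolicy, ...
			return true
		}
	}
	return false
}

Set() would then only do the freeze/thaw dance when needsFreeze returns true.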

	case "1\n":
		return configs.Frozen, nil
	default:
		return configs.Undefined, fmt.Errorf(`unknown "freezer.self_freezing" state: %q`, state)

This should be self, not state.

@kolyshkin
Contributor

I have created #3080 which has both commits from here (with a few minor changes, a fix for #3065 (review), and the suggestion from #3072 (comment)), as well as the commit from #3072 (which is not that essential).

... and then I realized that this PR is better as it provides a minimal fix which we can backport to 1.0.

Let me re-review everything one more time.

@kolyshkin
Contributor

Let me re-review everything one more time.

In fact it's easier for me to reuse what I already have. @odinuge please see #3081, I propose we merge that one instead of this.

@odinuge
Contributor Author

odinuge commented Jul 9, 2021

In fact it's easier for me to reuse what I already have. @odinuge please see #3081, I propose we merge that one instead of this.

lgtm
