
Make cgroup freezer only care about current control group #3065

Closed
wants to merge 2 commits

Conversation

odinuge
Contributor

@odinuge odinuge commented Jul 5, 2021

If a control group is frozen, all its descendants will report FROZEN.
This causes runc to freeze a control group, and not thaw it afterwards.
When its parent is thawed, the descendant will remain frozen.

This causes issues for k8s.
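
To make the v1 semantics above concrete, here is a minimal standalone sketch (the cgroup paths are hypothetical, and it assumes the v1 freezer hierarchy is mounted at /sys/fs/cgroup/freezer):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readTrim returns the trimmed contents of a cgroup file, or the error text.
func readTrim(dir, file string) string {
	b, err := os.ReadFile(filepath.Join(dir, file))
	if err != nil {
		return "<" + err.Error() + ">"
	}
	return strings.TrimSpace(string(b))
}

func main() {
	// Hypothetical paths: a frozen parent and a child created inside it.
	parent := "/sys/fs/cgroup/freezer/the-scope.scope"
	child := filepath.Join(parent, "container.slice")

	// With the parent frozen, the child's freezer.state reports FROZEN even
	// though the child itself was never frozen (freezer.self_freezing is 0).
	// Reading freezer.state is what makes runc re-freeze the child and leave
	// it frozen after the parent is thawed.
	fmt.Println("child freezer.state:        ", readTrim(child, "freezer.state"))
	fmt.Println("child freezer.self_freezing:", readTrim(child, "freezer.self_freezing"))
}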

@odinuge
Contributor Author

odinuge commented Jul 5, 2021

This should probably get some integration tests, but I would need help in that case.

Something like:

  • Create the-scope.scope
  • Freeze the-scope.scope
  • Create container container.slice in the-scope.scope
  • Unfreeze the-scope.scope

This should "unfreeze" container.slice, and work as expected.

@kolyshkin
Contributor

I can and will create some tests tomorrow (on PTO today) but I think this should be tested in kubernetes.

Note that commit bd8e070 (included in runc 1.0.0) brings the systemd v1 driver almost back to the state it was in before 108ee85 (part of 1.0.0-rc92).

Anyway, will take a look first thing tomorrow.

@kolyshkin
Contributor

This should probably get some integration tests, but I would need help in that case.

Something like:

  • Create the-scope.scope
  • Freeze the-scope.scope
  • Create container container.slice in the-scope.scope
  • Unfreeze the-scope.scope

This should "unfreeze" container.slice, and work as expected.

I'm not quite sure this is the scenario we are seeing in kubernetes/kubernetes#102508 (comment) -- I don't think kubernetes explicitly freezes anything.

Or do you mean we are having a race between

(1) the parent (pod) cgroup Set() (during which it is temporarily frozen)

and

(2) the child (container) cgroup Set() (which gets the current status as FROZEN because the parent temporarily froze it, and thus sets the container to be frozen after Set() is finished)?

This is a possible scenario.

Just to add some context, we have recently added various hacks to the cgroup v1 freezer implementation to make it work more reliably (#2774, #2791, #2918, #2941) for some scenarios. This might have resulted in what we see with kubernetes, although I don't quite see how. Perhaps all the freeze retries lead to a widened race window?

In such a case this PR makes sense. I am working on a test case.

@kolyshkin
Contributor

Here's my test but it passes even without your patch. You can play with it yourself by adding the following code to the end of runc's libcontainer/cgroups/systemd/systemd_test.go file:

func TestFreezePodCgroup(t *testing.T) {
	if !IsRunningSystemd() {
		t.Skip("Test requires systemd.")
	}
	if os.Geteuid() != 0 {
		t.Skip("Test requires root.")
	}

	podConfig := &configs.Cgroup{
		Parent: "system.slice",
		Name:   "system-runc_test_pods.slice",
		Resources: &configs.Resources{
			SkipDevices: true,
			Freezer:     configs.Frozen,
		},
	}
	// Create a "pod" cgroup (a systemd slice to hold containers),
	// which is frozen initially.
	pm := newManager(podConfig)
	defer pm.Destroy() //nolint:errcheck
	if err := pm.Apply(-1); err != nil {
		t.Fatal(err)
	}

	if err := pm.Freeze(configs.Frozen); err != nil {
		t.Fatal(err)
	}
	if err := pm.Set(podConfig.Resources); err != nil {
		t.Fatal(err)
	}

	// Check the pod is frozen.
	pf, err := pm.GetFreezerState()
	if err != nil {
		t.Fatal(err)
	}
	if pf != configs.Frozen {
		t.Fatalf("expected pod to be frozen, got %v", pf)
	}
	t.Log("pod frozen")

	// Create a "container" within the "pod" cgroup.
	// This is not a real container, just a process in the cgroup.
	config := &configs.Cgroup{
		Parent:      "system-runc_test_pods.slice",
		ScopePrefix: "test",
		Name:        "FreezeParent",
	}

	cmd := exec.Command("bash", "-c", "while read; do echo $REPLY; done")
	cmd.Env = append(os.Environ(), "LANG=C")

	// Setup stdin.
	stdinR, stdinW, err := os.Pipe()
	if err != nil {
		t.Fatal(err)
	}
	cmd.Stdin = stdinR

	// Setup stdout.
	stdoutR, stdoutW, err := os.Pipe()
	if err != nil {
		t.Fatal(err)
	}
	cmd.Stdout = stdoutW
	rdr := bufio.NewReader(stdoutR)

	// Setup stderr.
	var stderr bytes.Buffer
	cmd.Stderr = &stderr

	err = cmd.Start()
	stdinR.Close()
	stdoutW.Close()
	defer func() {
		_ = stdinW.Close()
		_ = stdoutR.Close()
	}()
	if err != nil {
		t.Fatal(err)
	}
	t.Log("container started")
	// Make sure to not leave a zombie.
	defer func() {
		// These may fail, we don't care.
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()

	// Put the process into a cgroup.
	m := newManager(config)
	defer m.Destroy() //nolint:errcheck

	if err := m.Apply(cmd.Process.Pid); err != nil {
		t.Fatal(err)
	}
	t.Log("container pid added")
	// Check that we put the "container" into the "pod" cgroup.
	if !strings.HasPrefix(m.Path("freezer"), pm.Path("freezer")) {
		t.Fatalf("expected container cgroup path %q to be under pod cgroup path %q",
			m.Path("freezer"), pm.Path("freezer"))
	}
	// Check the pod is still reported as frozen.
	cf, err := pm.GetFreezerState()
	if err != nil {
		t.Fatal(err)
	}
	if cf != configs.Frozen {
		t.Fatalf("expected pod to still be frozen, got %v", cf)
	}
	t.Log("pod is still frozen -- thawing")

	// Unfreeze the pod.
	if err := pm.Freeze(configs.Thawed); err != nil {
		t.Fatal(err)
	}

	// Check the "container" works.
	marker := "one two\n"
	if _, err := stdinW.WriteString(marker); err != nil {
		t.Fatal(err)
	}
	reply, err := rdr.ReadString('\n')
	if err != nil {
		t.Fatalf("reading from container: %v", err)
	}
	if reply != marker {
		t.Fatalf("expected %q, got %q", marker, reply)
	}
}

@kolyshkin
Contributor

Another theory. What's happening is that the whole pod (aka slice) is frozen (together with the frozen sub-cgroups, but this is both normal and irrelevant, I think).

This can happen in two scenarios:

  1. Set() is stuck in the middle -- after the pod/slice is frozen and before it is thawed.
  2. Two Set() methods racing on the same pod/slice (the first Set() freezes the cgroup, the second reads the cgroup state as FROZEN and thus sets it to frozen).

Issue 2 can be fixed by reusing the cgroup mutex for Set().

In general, I think, the (*legacyManager).Set() code is somewhat hairy wrt freezing -- for one thing, it uses the public Freeze method, which changes container resources, and I don't think it should. Also, having the Freezer setting in resources is weird, too.
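
As a rough sketch of the mutex idea for scenario 2 (self-contained stand-ins, not the actual legacyManager code; real names and helpers differ):

package cgroupsketch

import "sync"

// FreezerState mirrors the tri-state used by libcontainer's configs package;
// defined locally so the sketch is self-contained.
type FreezerState int

const (
	Undefined FreezerState = iota
	Frozen
	Thawed
)

// podManager is a stand-in for the systemd legacy manager; the real type and
// its helpers look different -- this only shows where the lock would go.
type podManager struct {
	mu    sync.Mutex
	state FreezerState // last state explicitly requested by a caller
}

// freeze is a placeholder for the real freezer write.
func (m *podManager) freeze(s FreezerState) { m.state = s }

// Set serializes the freeze/apply/thaw sequence, so a second concurrent Set
// cannot observe the temporary FROZEN state and persist it by mistake.
func (m *podManager) Set(apply func() error) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	prev := m.state
	if prev != Frozen {
		m.freeze(Frozen)
		defer m.freeze(prev) // restore the state the caller last asked for
	}
	return apply()
}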

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

I'm not quite sure this is the scenario we are seeing in kubernetes/kubernetes#102508 (comment) -- I don't think kubernetes explicitly freezes anything.

It does not do it explicitly, and it currently does not do it at all. However, when using runc v1.0.0, the cgroups will be frozen for a brief period when setting resources. This is done on startup, and once in a while after that. The freezing itself is not a huge problem, but together with the issue I am (hopefully) fixing here, containers and other control groups end up frozen permanently.

For k8s usage of libcontainer, there is no reason to freeze the control groups when updating resources, and thus it should not be done.

Or do you mean we are having a race between

(1) the parent (pod) cgroup Set() (during which it is temporarily frozen)

and

(2) the child (container) cgroup Set() (which gets the current status as FROZEN because the parent temporarily froze it, and thus sets the container to be frozen after Set() is finished)?

This is a possible scenario.

Yes, that is the case for cgroup v1.

Just to add some context, we have recently added various hacks to the cgroup v1 freezer implementation to make it work more reliably (#2774, #2791, #2918, #2941) for some scenarios. This might have resulted in what we see with kubernetes, although I don't quite see how. Perhaps all the freeze retries lead to a widened race window?

In such a case this PR makes sense. I am working on a test case.

Thanks. For k8s, we should never freeze the kubepods.slice or qos-class.slice control groups, as there is no need to do it. It has never been done before, since k8s didn't use the Set from runc before kubernetes/kubernetes#102147, and at that point we didn't freeze either.

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

I have added a test based on your suggestion now @kolyshkin. As you see, it fails without this change: #3066.

The behavior after this is the same for cgroup v1 as for cgroup v2. The test passes for both, and the changes only touch v1.

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

Also, separate from this PR, k8s needs a way to skip the freeze altogether. Should we add a new flag like SkipDeviceFreeze, or should we revert to the old behavior prior to bd8e070, and handle runc update separately?

odinuge and others added 2 commits July 7, 2021 20:02
If a control group is frozen, all its descendants will report FROZEN.
This causes runc to freeze a control group, and not thaw it afterwards.
When its parent is thawed, the descendant will remain frozen.

Signed-off-by: Odin Ugedal <odin@uged.al>
Initial test work was done by Kir Kolyshkin (@kolyshkin).

Co-Authored-By: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Odin Ugedal <odin@uged.al>
@odinuge changed the title from "[WIP] Make cgroup freezer only care about current control group" to "Make cgroup freezer only care about current control group" on Jul 7, 2021
@kolyshkin
Contributor

So I took a fresh look today and I think we need to do it a bit differently.

Instead of having a public method GetFreezerState() to return "THAWED" while in fact the container is frozen (because its parent is frozen), I think it's better to use the contents of freezer.self_freezing directly from Set (as in this particular case we're not interested in whether the cgroup/container is frozen, but rather whether it was frozen directly).

The logic in the function should also be cleaned up a bit -- for example, we don't want to freeze the frozen container, or unfreeze the container that is about to be frozen again.

We should probably also introduce the "big freezing lock" because the "get freezer state / freeze / set unit properties / return to old state" code sequence should be protected.

I'll try to write some code...
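
For illustration only, a sketch of what reading freezer.self_freezing could look like (simplified path handling, not the actual runc code; the package name is arbitrary):

package freezer

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// selfFrozen reports whether the cgroup at dir froze itself, based on the
// cgroup v1 freezer.self_freezing file ("0" or "1"). Freezes inherited from
// ancestors are deliberately ignored.
func selfFrozen(dir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(dir, "freezer.self_freezing"))
	if err != nil {
		return false, err
	}
	switch strings.TrimSpace(string(data)) {
	case "0":
		return false, nil
	case "1":
		return true, nil
	default:
		return false, fmt.Errorf("unexpected freezer.self_freezing value: %q", data)
	}
}

Set could then consult selfFrozen to decide whether the cgroup needs to be returned to a frozen state afterwards.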

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

Instead of having a public method GetFreezerState() to return "THAWED" while in fact the container is frozen (because its parent is frozen),

Yeah, this all depends on how we define a frozen container. With this PR cgroup v1 and v2 will work in the same way, and I think having them work differently is a bad idea in general, no matter the naming.

I think it's better to use the contents of freezer.self_freezing directly from Set (as in this particular case we're not interested in whether the cgroup/container is frozen, but rather whether it was frozen directly).

Again, it depends on the definition of "is frozen". For v1, "frozen" includes the case where a parent is frozen, but for v2 it does not. But yeah, we can invert the logic: if freezer.self_freezing == 1, wait until we see FROZEN in the state. That works as well.
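
A rough sketch of that inverted logic, purely illustrative (arbitrary retry count and sleep interval; real code would need proper limits and error handling):

package freezer

import (
	"errors"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitSelfFrozen waits for freezer.state to report FROZEN, but only if this
// cgroup froze itself (freezer.self_freezing == "1").
func waitSelfFrozen(dir string) error {
	self, err := os.ReadFile(filepath.Join(dir, "freezer.self_freezing"))
	if err != nil {
		return err
	}
	if strings.TrimSpace(string(self)) != "1" {
		return nil // not frozen by itself; nothing to wait for
	}
	for i := 0; i < 1000; i++ {
		state, err := os.ReadFile(filepath.Join(dir, "freezer.state"))
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(state)) == "FROZEN" {
			return nil
		}
		time.Sleep(time.Millisecond)
	}
	return errors.New("timed out waiting for FROZEN")
}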

The logic in the function should also be cleaned up a bit -- for example, we don't want to freeze the frozen container, or unfreeze the container that is about to be frozen again.

Not sure that I follow this logic (but again, it depends on the definition of a frozen container). If the container is frozen via a parent, we would still like to freeze it ourselves in case the parent is thawed. If freezer.self_freezing == 1, writing "FROZEN" to the state is essentially a no-op, so IMO it doesn't matter. (We can, however, discuss whether we should do it, but then we should do the same with other cgroup files as well.)

We should probably also introduce the "big freezing lock" because the "get freezer state / freeze / set unit properties / return to old state" code sequence should be protected.

Are you thinking about an internal lock/mutex in runc? That would probably not work when runc update is executed twice at the same time?

@odinuge
Contributor Author

odinuge commented Jul 7, 2021

But overall, this doesn't really matter for k8s, since k8s only needs a way to disable this freezing mechanism altogether...

@kolyshkin
Contributor

But overall, this doesn't really matter for k8s, since k8s only needs a way to disable this freezing mechanism altogether...

Can you elaborate? I did disable the freezing in 108ee85 but later realized that it was causing issues (#3014) and thus had it reverted (#3019). Even if no Device* systemd unit properties are set, the container sees a brief EPERM. I believe the issue is the same if we're modifying parent/pod systemd unit properties for cgroup v1.

@kolyshkin
Contributor

@odinuge PTAL #3072 which (I hope) solves the same issue in a slightly different way.

@cyphar
Member

cyphar commented Jul 7, 2021

@odinuge Can you clarify why the freezing is not necessary in Kubernetes? Are resources never updated for running containers? If they are, then the freezing is arguably necessary (under systemd) because we need to avoid spurious device errors from occurring during the device rules update (a long-term solution would be to fix systemd's device rule updating code to do what we do for device updates).

As @kolyshkin said, we need this even if the devices aren't being updated because it seems systemd will still re-apply the device rules in certain cases. (I was about to suggest we skip freezing if SkipDevices was set, but then remembered we tried that already and it had issues -- see the issues Kir linked.)

@odinuge
Contributor Author

odinuge commented Jul 8, 2021

@kolyshkin:

Can you elaborate? I did disable the freezing in 108ee85 but later realized that it was causing issues (#3014) and thus had it reverted (#3019). Even if no Device* systemd unit properties are set, the container sees a brief EPERM. I believe the issue is the same if we're modifying parent/pod systemd unit properties for cgroup v1.

Yes and no.

When talking about k8s, we talk about two distinct use cases of runc, where I am referring to the latter:

  • Use of the runc binary to create containers; not directly used by k8s, but by the container runtimes.
    • Freezing here is sound, since systemd will write * to devices.deny and then iterate through the devices, writing them to devices.allow.
    • If no freeze is done, containers can briefly be denied device access.
  • Use of runc/libcontainer as a dependency to deal with control groups.
    • Freezing here is not necessary, since we don't set any device parameters; systemd will only write * to devices.allow once in a while/during updates.
    • If no freeze is done, nothing is changed.

For ref: https://github.com/systemd/systemd/blob/dd376574fd62c6fcf446ee500ca85558a71d645e/src/core/cgroup.c#L1130-L1133

Does that make sense?

@odinuge PTAL #3072 which (I hope) solves the same issue in a slightly different way.

Well, yeah, it kinda does the same. Will comment on the PR. 👍

@cyphar

@odinuge Can you clarify why the freezing is not necessary in Kubernetes? Are resources never updated for running containers? If they are, then the freezing is arguably necessary (under systemd) because we need to avoid spurious device errors from occurring during the device rules update (a long-term solution would be to fix systemd's device rule updating code to do what we do for device updates).

As @kolyshkin said, we need this even if the devices aren't being updated because it seems systemd will still re-apply the device rules in certain cases.

Does my answer to @kolyshkin above make sense to answer this?

(I was about to suggest we skip freezing if SkipDevices was set, but then remembered we tried that already and it had issues -- see the issues Kir linked.)

SkipDevices worked well for cgroup management in k8s when it skipped the freezing part. The issue was introduced with #2994, where runc update uses the same mechanism, but the two are different use cases. The only thing k8s cgroup management will need is a flag to say "we allow all devices, and therefore there is no need to freeze during update".

Freezing the internal cgroups in k8s (not the containers themselves) would work, even though we shouldn't do it, but because of the bug fixed in this PR, containers using runc v1.0.0 will permanently freeze upon creation and/or update. See more about it here: kubernetes/kubernetes#102508 (comment).

@kolyshkin
Contributor

So, one way to fix this would be to have code that tells whether systemd is going to do deny-all on SetUnitProperties, and skip the freeze if it won't.

This is a kludge on top of a kludge, and it depends on systemd internals (which may change), but in these circumstances it may be the best way to proceed.
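
Something along these lines, perhaps (assuming go-systemd dbus.Property values; matching on the "Device" prefix is a simplification of what actually makes systemd rewrite device rules, not an authoritative check):

package freezecheck

import (
	"strings"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

// needsFreeze guesses whether SetUnitProperties will make systemd rewrite
// device rules (deny-all, then re-add the allowed devices), which is when
// freezing is needed to hide the transient EPERMs from the container.
func needsFreeze(props []systemdDbus.Property) bool {
	for _, p := range props {
		if strings.HasPrefix(p.Name, "Device") { // DeviceAllow, DevicePolicy, ...
			return true
		}
	}
	return false
}

Set() would then only do the freeze/thaw dance when needsFreeze returns true.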

	case "1\n":
		return configs.Frozen, nil
	default:
		return configs.Undefined, fmt.Errorf(`unknown "freezer.self_freezing" state: %q`, state)

This should be self, not state.

@kolyshkin
Contributor

I have created #3080 which has both commits from here (with a few minor changes, a fix for #3065 (review), and the suggestion from #3072 (comment)), as well as the commit from #3072 (which is not that essential).

... and then I realized that this PR is better as it provides a minimal fix which we can backport to 1.0.

Let me re-review everything one more time.

@kolyshkin
Contributor

Let me re-review everything one more time.

In fact it's easier for me to reuse what I already have. @odinuge please see #3081, I propose we merge that one instead of this.

@odinuge
Contributor Author

odinuge commented Jul 9, 2021

In fact it's easier for me to reuse what I already have. @odinuge please see #3081, I propose we merge that one instead of this.

lgtm
