Make cgroup freezer only care about current control group #3065
Conversation
This should probably get some integration tests, but I would need help in that case. Something like:
This should "unfreeze" container.slice, and work as expected. |
I can and will create some tests tomorrow (on PTO today), but I think this should be tested in kubernetes. Note that commit bd8e070 (included in runc 1.0.0) brings the systemd v1 driver almost back to the state it was in before 108ee85 (included in 1.0.0-rc92). Anyway, will take a look first thing tomorrow.
I'm not quite sure this is the scenario we are seeing in kubernetes/kubernetes#102508 (comment) -- I don't think kubernetes explicitly freezes anything. Or do you mean we are having a race between (1) the parent (pod) cgroup and (2) the child (container) cgroup? This is a possible scenario. Just to add some context, we have recently added various hacks to the cgroup v1 freezer implementation to make it work more reliably in some scenarios (#2774, #2791, #2918, #2941). This might have resulted in what we see with kubernetes, although I don't quite see how. Perhaps all the freeze retries lead to a widened race window? In that case this PR makes sense. I am working on a test case.
Here's my test, but it passes even without your patch. You can play with it yourself by adding the following code to the end of runc's tests:

func TestFreezePodCgroup(t *testing.T) {
	if !IsRunningSystemd() {
		t.Skip("Test requires systemd.")
	}
	if os.Geteuid() != 0 {
		t.Skip("Test requires root.")
	}
	podConfig := &configs.Cgroup{
		Parent: "system.slice",
		Name:   "system-runc_test_pods.slice",
		Resources: &configs.Resources{
			SkipDevices: true,
			Freezer:     configs.Frozen,
		},
	}
	// Create a "pod" cgroup (a systemd slice to hold containers),
	// which is frozen initially.
	pm := newManager(podConfig)
	defer pm.Destroy() //nolint:errcheck
	if err := pm.Apply(-1); err != nil {
		t.Fatal(err)
	}
	if err := pm.Freeze(configs.Frozen); err != nil {
		t.Fatal(err)
	}
	if err := pm.Set(podConfig.Resources); err != nil {
		t.Fatal(err)
	}
	// Check the pod is frozen.
	pf, err := pm.GetFreezerState()
	if err != nil {
		t.Fatal(err)
	}
	if pf != configs.Frozen {
		t.Fatalf("expected pod to be frozen, got %v", pf)
	}
	t.Log("pod frozen")
	// Create a "container" within the "pod" cgroup.
	// This is not a real container, just a process in the cgroup.
	config := &configs.Cgroup{
		Parent:      "system-runc_test_pods.slice",
		ScopePrefix: "test",
		Name:        "FreezeParent",
	}
	cmd := exec.Command("bash", "-c", "while read; do echo $REPLY; done")
	cmd.Env = append(os.Environ(), "LANG=C")
	// Setup stdin.
	stdinR, stdinW, err := os.Pipe()
	if err != nil {
		t.Fatal(err)
	}
	cmd.Stdin = stdinR
	// Setup stdout.
	stdoutR, stdoutW, err := os.Pipe()
	if err != nil {
		t.Fatal(err)
	}
	cmd.Stdout = stdoutW
	rdr := bufio.NewReader(stdoutR)
	// Setup stderr.
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	err = cmd.Start()
	stdinR.Close()
	stdoutW.Close()
	defer func() {
		_ = stdinW.Close()
		_ = stdoutR.Close()
	}()
	if err != nil {
		t.Fatal(err)
	}
	t.Log("container started")
	// Make sure to not leave a zombie.
	defer func() {
		// These may fail, we don't care.
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()
	// Put the process into a cgroup.
	m := newManager(config)
	defer m.Destroy() //nolint:errcheck
	if err := m.Apply(cmd.Process.Pid); err != nil {
		t.Fatal(err)
	}
	t.Log("container pid added")
	// Check that we put the "container" into the "pod" cgroup.
	if !strings.HasPrefix(m.Path("freezer"), pm.Path("freezer")) {
		t.Fatalf("expected container cgroup path %q to be under pod cgroup path %q",
			m.Path("freezer"), pm.Path("freezer"))
	}
	// Check the "container" is also frozen.
	cf, err := pm.GetFreezerState()
	if err != nil {
		t.Fatal(err)
	}
	if cf != configs.Frozen {
		t.Fatalf("expected container to be frozen, got %v", cf)
	}
	t.Log("pod is still frozen -- thawing")
	// Unfreeze the pod.
	if err := pm.Freeze(configs.Thawed); err != nil {
		t.Fatal(err)
	}
	// Check the "container" works.
	marker := "one two\n"
	if _, err := stdinW.WriteString(marker); err != nil {
		t.Fatal(err)
	}
	reply, err := rdr.ReadString('\n')
	if err != nil {
		t.Fatalf("reading from container: %v", err)
	}
	if reply != marker {
		t.Fatalf("expected %q, got %q", marker, reply)
	}
}
Another theory. What's happening is the whole pod (aka slice) is frozen (together with the frozen sub-cgroups, but this is both normal and irrelevant I think). This can happen in two scenarios:
Issue 2 can be fixed by reusing the cgroup mutex for In general, I think, the
It does not explicitly do it, and it currently does not. However, when using runc For k8s usage of libcontainer, there is no reason to freeze the control groups when updating resources, and thus it should not be done.
Yes, that is the case for cgroup v1.
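To make the v1 behaviour being discussed concrete: when a parent cgroup is frozen, a child's freezer.state reports FROZEN even though the child itself never asked to be frozen, while its freezer.self_freezing stays at 0. Below is a minimal standalone sketch (not runc code; the cgroup path is a made-up example) that prints both files for a child cgroup:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Hypothetical child cgroup living under a frozen parent slice.
	child := "/sys/fs/cgroup/freezer/parent.slice/child.scope"

	read := func(name string) string {
		data, err := os.ReadFile(filepath.Join(child, name))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		return strings.TrimSpace(string(data))
	}

	// freezer.state shows the effective state (FROZEN if any ancestor is frozen);
	// freezer.self_freezing shows only what was requested on this cgroup (0 or 1).
	fmt.Println("freezer.state:        ", read("freezer.state"))
	fmt.Println("freezer.self_freezing:", read("freezer.self_freezing"))
}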
Thanks. For k8s, we should never freeze the
I have added a test based on your suggestion now, @kolyshkin. As you can see, it fails without this change: #3066. The behavior after this is the same for cgroup v1 as it is for cgroup v2. The test passes for both, and the changes are only made for v1.
Also, separate from this PR, k8s needs a way to skip the
If a control group is frozen, all its descendants will report FROZEN. This causes runc to freeze a control group and not THAW it afterwards. When its parent is thawed, the descendant will remain frozen.
Signed-off-by: Odin Ugedal <odin@uged.al>
Initial test work was done by Kir Kolyshkin (@kolyshkin).
Co-Authored-By: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Odin Ugedal <odin@uged.al>
So I took a fresh look today and I think we need to do it a bit differently. Instead of having a public method GetFreezerState() return "THAWED" while in fact the container is frozen (because its parent is frozen), I think it's better to use the contents of freezer.self_freezing. The logic in the function should also be cleaned up a bit -- for example, we don't want to freeze an already-frozen container, or unfreeze a container that is about to be frozen again. We should probably also introduce the "big freezing lock", because the "get freezer state / freeze / set unit properties / return to old state" code sequence should be protected. I'll try to write some code...
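For illustration only, here is a rough, self-contained sketch of that protected sequence. The names (manager, GetSelfFreezing, SetFreezer, applyUnitProperties) are hypothetical stand-ins, not runc's real API; the point is only the ordering and the single mutex guarding it:

package main

import "sync"

type FreezerState string

const (
	Thawed FreezerState = "THAWED"
	Frozen FreezerState = "FROZEN"
)

// manager is a hypothetical stand-in for a cgroup manager.
type manager interface {
	GetSelfFreezing() (FreezerState, error) // based on freezer.self_freezing, not freezer.state
	SetFreezer(FreezerState) error
}

// freezeMu serializes the whole freeze / set / thaw sequence.
var freezeMu sync.Mutex

func setWithFreeze(m manager, applyUnitProperties func() error) error {
	freezeMu.Lock()
	defer freezeMu.Unlock()

	// Remember the state this cgroup itself requested, not the effective
	// (parent-influenced) state.
	old, err := m.GetSelfFreezing()
	if err != nil {
		return err
	}
	if old != Frozen {
		// Don't re-freeze a container that already asked to be frozen.
		if err := m.SetFreezer(Frozen); err != nil {
			return err
		}
	}
	err = applyUnitProperties() // systemd may (re-)apply device rules here
	// Restore whatever the container itself requested before we started.
	if thawErr := m.SetFreezer(old); err == nil {
		err = thawErr
	}
	return err
}

func main() {}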
Yeah, this all depends on the definition of whether a container is frozen or not. With this PR, cgroup v1 and v2 will work in the same way, and I think having them work differently is a bad idea in general, no matter the naming.
Again, it depends on the definition of "is frozen". For v1, a cgroup reports "frozen" when a parent is frozen, but for v2 it does not. But yeah, we can invert the logic to say: if "freezer.self_freezing == 1", wait until we see "FROZEN" in the state. That works as well.
Not sure that I follow this logic (but again, it depends on the definition of a frozen container). If the container is frozen via a parent, we would still like to freeze it in case the parent is thawed. If "freezer.self_freezing == 1", writing "FROZEN" to the state is essentially a no-op, so IMO it doesn't matter. (We can, however, discuss whether we should do it, but then we should do the same with other cgroup files as well.)
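A minimal sketch of the inverted check mentioned above: if freezer.self_freezing is 1, poll freezer.state until FROZEN appears. The path and the retry budget are arbitrary assumptions, and this is not runc's implementation:

package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitFrozen polls freezer.state, but only if this cgroup itself asked to be
// frozen (freezer.self_freezing == "1").
func waitFrozen(cgroupPath string) error {
	read := func(name string) (string, error) {
		data, err := os.ReadFile(filepath.Join(cgroupPath, name))
		return strings.TrimSpace(string(data)), err
	}

	self, err := read("freezer.self_freezing")
	if err != nil {
		return err
	}
	if self != "1" {
		// This cgroup did not request freezing; nothing to wait for.
		return nil
	}
	for i := 0; i < 100; i++ {
		state, err := read("freezer.state")
		if err != nil {
			return err
		}
		if state == "FROZEN" {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return errors.New("timed out waiting for FROZEN")
}

func main() {
	// Hypothetical cgroup path; adjust to something real to try it out.
	fmt.Println(waitFrozen("/sys/fs/cgroup/freezer/parent.slice/child.scope"))
}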
Are you thinking about an internal lock/mutex in runc? That would probably not work when
But overall, this doesn't really matter for k8s, since k8s only needs a way to disable this freezing mechanism altogether...
Can you elaborate? I did disable the freezing in 108ee85 but later realized that it was causing issues (#3014) and thus had it reverted (#3019). Even if no Device* systemd unit properties are set, the container sees a brief EPERM. I believe the issue is the same if we're modifying parent/pod systemd unit properties for cgroup v1.
@odinuge Can you clarify why the freezing is not necessary in Kubernetes? Are resources never updated for running containers? If they are, then the freezing is arguably necessary (under systemd) because we need to avoid spurious device errors from occurring during the device rules update (a long-term solution would be to fix systemd's device rule updating code to do what we do for device updates). As @kolyshkin said, we need this even if the devices aren't being updated because it seems systemd will still re-apply the device rules in certain cases. (I was about to suggest we skip freezing if
Yes and no. When talking about k8s, we talk about two distinct use cases of runc, where I am referring to the latter:
Does that make sense?
Well, yeah, it kinda does the same. Will comment on the PR. 👍
Does my answer to @kolyshkin above make sense to answer this?
Freezing the internal cgroups in k8s (not the containers themselves) would work, even though we shouldn't do it, but because of the bug fixed in this PR, containers using runc
So, one way to fix this would be to have code that tells whether systemd is going to do deny-all on SetUnitProperties, and skip the freeze if it won't. This is a kludge on top of a kludge, and it depends on systemd internals (which may change), but in these circumstances this may be the best way to proceed.
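As a rough illustration of that idea (purely hypothetical, not runc code): inspect the systemd unit properties about to be applied and only freeze when any of them are device-related, since the device-rule rewrite is the only reason the freeze exists. A real check would also have to account for systemd re-applying device rules on its own in some cases, which is exactly why this remains a kludge:

package main

import (
	"fmt"
	"strings"
)

// needsFreeze reports whether applying these unit properties is expected to
// make systemd rewrite the device cgroup (deny-all followed by re-allow),
// which is the only reason to freeze the cgroup around SetUnitProperties.
func needsFreeze(propNames []string) bool {
	for _, name := range propNames {
		if strings.HasPrefix(name, "Device") { // e.g. DeviceAllow, DevicePolicy
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsFreeze([]string{"CPUQuotaPerSecUSec", "MemoryMax"})) // false: no device rules touched
	fmt.Println(needsFreeze([]string{"DevicePolicy", "DeviceAllow"}))     // true: freeze around the update
}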
case "1\n": | ||
return configs.Frozen, nil | ||
default: | ||
return configs.Undefined, fmt.Errorf(`unknown "freezer.self_freezing" state: %q`, state) |
This should be self, not state.
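For clarity, this is presumably how the hunk reads with that review comment applied (assuming the variable holding the freezer.self_freezing contents is named self; the "0\n" case is inferred from context):

	switch self {
	case "0\n":
		return configs.Thawed, nil
	case "1\n":
		return configs.Frozen, nil
	default:
		return configs.Undefined, fmt.Errorf(`unknown "freezer.self_freezing" state: %q`, self)
	}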
I have created #3080 which has both commits from here (with a few minor changes, a fix for #3065 (review), and the suggestion from #3072 (comment)), as well as the commit from #3072 (which is not that essential). ... and then I realized that this PR is better as it provides a minimal fix which we can backport to 1.0. Let me re-review everything one more time.
Make cgroup freezer only care about current control group (carry #3065)
If a control group is frozen, all its descendants will report FROZEN.
This causes runc to freeze a control group, and not THAW it afterwards.
When its parent is thawed, the descendant will remain frozen.
This causes issues for k8s.