notify_socket.go: use sd_notify_barrier mechanism #3291

indyjo · 2021-11-22T09:15:42Z

Add synchronization to how runc reports container readiness to systemd.

It was previously possible for runc to terminate immediately after sending the READY signal. This was sometimes too quick for systemd to find out the sender's cgroup. As a consequence, systemd would ignore the READY signal.

Fixes: #3293

notify_socket.go

AkihiroSuda · 2021-11-23T17:30:52Z

Please squash commits

kolyshkin · 2021-11-24T19:09:36Z

Overall this looks good to me. Is there any way to have a test for this @indyjo ?

indyjo · 2021-11-25T13:52:41Z

What I can imagine is to write a test against a mocked NOTIFY_SOCKET server, i.e. a mocked systemd. Concerning how to do it I think I need some guidance. I could try adding a notify_socket_test.go alongside notify_socket.go. I also noticed there are "bats tests" in tests/integration, but I'm completely unfamiliar with that framework. What do you think would be a good approach?

Also, my code is currently assuming that whoever listens on the other end of NOTIFY_SOCKET actually understands and honors the BARRIER=1 protocol. This is certainly the case with recent systemd versions (the feature seems to be about two years old). Other implementations might not. For example, sdnotify-proxy is too simple. Currently, this would lead to runc running into the 5 second timeout and failing to start a container.

Is this behavior a) accepted or should I b) ignore the timeout case or should there be c) a command-line option to suppress the sd_notify_barrier feature?

@kolyshkin @AkihiroSuda

indyjo · 2021-11-26T14:38:15Z

I have added a test (notify_socket_test.go). The question of how to deal with sd_notify_barrier timeouts is still open. crun waits for 30 seconds and then ignores the error (see PR). systemd-notify waits for 5 seconds and then fails (source).

notify_socket.go

notify_socket_test.go

kolyshkin · 2021-12-06T19:42:29Z

notify_socket.go

+	var out bytes.Buffer
+	_, err := out.Write(ready)
+	if err != nil {
+		return err
+	}

-			_, err = client.Write(out.Bytes())
-			if err != nil {
-				return err
-			}
+	_, err = out.Write([]byte{'\n'})
+	if err != nil {
+		return err
+	}

-			// now we can inform systemd to use pid1 as the pid to monitor
-			newPid := "MAINPID=" + strconv.Itoa(pid1)
-			_, err := client.Write([]byte(newPid + "\n"))
-			if err != nil {
-				return err
-			}
-			return nil
-		}
+	_, err = client.Write(out.Bytes())


Hmm, wonder why we avoid using bytes.Buffer at all and don't just do something like

_, err = client.WriteString(string(ready)+"\n")

There is no WriteString on client but of course we can change to:

_, err = io.WriteString(client, string(ready) + "\n")

or, more directly:

_, err = client.Write([]byte(string(ready) + "\n"))

or, more efficient but mutating the array backing ready:

_, err = client.Write(append(ready, '\n'))

Any preference?

Hmm, wonder why we avoid using bytes.Buffer at all

Looking at git history, this was done in #1308 but I don't see why.

_, err = client.Write(append(ready, '\n'))

This one looks fine to me (most probably there's \n already anyway at that position), as we are not reusing the buffer anywhere.

Can you implement it (as a separate first commit, I guess)?

kolyshkin · 2021-12-09T09:06:59Z

@indyjo thanks for the update! Can you please

reverse the order of commits (first remove bytes.Buffer usage, then add your changes)
add Signed-off-by to both commits

kolyshkin · 2021-12-09T09:09:58Z

reverse the order of commits (first remove bytes.Buffer usage, then add your changes)

I'm asking for this because it makes the git log -p smaller (and, consequently, it will be easier to figure out what was changed, especially a few years from now).

indyjo · 2021-12-10T15:07:53Z

Thanks for clarifying, @kolyshkin! The two commits are now in the preferred order.

Should we discuss how to handle the case where the supervising process doesn't (correctly) handle the sd_notify_barrier mechanism and runc fails with a timeout? I'm leaning towards simply doing what crun does, although the 30s timeout seems extreme.

kolyshkin · 2021-12-10T17:20:46Z

I'm leaning towards simply doing what crun does

I tend to agree.

although the 30s timeout seems extreme.

I guess on an overloaded system 5s might not be enough, so the timeout should be way above the reasonable expected reply time.

But let's ask @giuseppe -- why you ended up with 30s timeout?

giuseppe · 2021-12-10T18:48:20Z

But let's ask @giuseppe -- why you ended up with 30s timeout?

I've copied the timeout you set here: https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/common.go#L345 😄

Although I've decided to ignore any error, so it is equivalent to the version without sd_notify_barrier where it would not raise an error and because I am not sure as well what a correct value would be for the timeout

kolyshkin · 2022-01-25T07:08:15Z

@AkihiroSuda PTAL

kolyshkin · 2022-02-16T01:28:26Z

@opencontainers/runc-maintainers PTAL

kolyshkin · 2022-03-08T01:11:49Z

@opencontainers/runc-maintainers @AkihiroSuda PTAL

MofX · 2022-03-14T13:33:53Z

Does it really make sense to implement the barrier-machanism inside of runc?

This will fail if systemd is too old (I guess this does not matter, because it only triggers a warning?)
For non-detached containers, if systemd-notify or any other binary, that uses sd_notify_barrier is used inside of the container it will fail, because the pipe and the barrier message is not relayed to systemd leading to a timeout in sd_notify_barrier.
For detached containers, sd_notify_barrier can or cannot timeout, depending on a race. If I don't miss anything, runc terminates, after it received the ready message (and relayed it to systemd). After the patch it will also do the barrier synchronization. A process using sd_notify_barrier will most likely work with this PR, if it is able to send the barrier message with it's own pipe fd before runc terminates. I think the pipe handles are closed, when runc terminates (or on read). But if runc terminates before it read the BUFFER message and the pipe fd, sd_notify_barrier will timeout.

My implementation idea would be to also relay SCM_RIGHT data and let the client and systemd handle the synchronization transparently.
I do not have a real idea, how to implement this for detached containers, because runc now has to run until after barrier-pipe was closed by systemd. runc can only know this, if it intercepts the barrier message and injects it's own pipe. But when can it terminate if no barrier message is sent by the client? Maybe after the timeout?

indyjo · 2022-04-08T12:14:16Z

Hi @MofX, let me give my point of view. And sorry about it taking so long.

First of all, whether this patch ends up in an official release or not is not absolutely critical for me. The company I work for, KUKA, maintains their own patched version of runc. We have made good experiences so far, but of course we only test a very limited scenario.

This will fail if systemd is too old (I guess this does not matter, because it only triggers a warning?)

Probably true. The protocol basically mandates the use of the barrier, though. There is no way for a client to query whether it's supported.

Concerning your point 2 and 3:

I don't know enough about runc to understand the implications of detached and detached containers. It is true, however, that systemd-notify inside a container can still fail. That's because this patch doesn't address the barrier mechanism between runc and the container, but between runc and its supervising process.

Actually, we never observed any timeout happening in systemd-notify. The way systemd-notify may fail is that sometimes runc has already closed the connection directly after receiving the READY state, causing a "Connection refused" when systemd-notify tries to establish the barrier. We usually disable the barrier mechanism for this reason, when running systemd-notify in a container, using the --no-block option. That's a separate issue, though.

indyjo · 2022-05-09T08:05:50Z

@kolyshkin May I ask if there's a chance of including this in v1.1.2?

kolyshkin · 2022-05-09T18:58:07Z

@kolyshkin May I ask if there's a chance of including this in v1.1.2?

The target milestone is set for 1.2.0 (which I hope will be released some time in the summer). We can definitely do a backport to 1.1.x once this is merged, and so it will also be included into latest 1.1.x (whatever that would be).

We need more reviews on this, @opencontainers/runc-maintainers PTAL 🙏🏻

kolyshkin · 2022-05-09T18:58:40Z

@indyjo can you please rebase it (this will also re-run the CI on golang 1.18)?

indyjo · 2022-05-12T07:39:12Z

@indyjo can you please rebase it (this will also re-run the CI on golang 1.18)?

@kolyshkin Technically, it's done. There's a failing check though: ci / centos-7.

indyjo · 2022-11-12T11:02:25Z

Friendly reminder that there's still a chance for this baby to be merged before its first birthday! 🎂
Is there anything I can help with? @kolyshkin @thaJeztah @cyphar @AkihiroSuda

AkihiroSuda · 2022-11-12T12:18:28Z

The last rebase was 6 months ago, could you rebase again to ensure CI is still green? Thanks

Signed-off-by: Jonas Eschenburg <jonas.eschenburg@kuka.com>

indyjo · 2022-11-14T21:43:48Z

Rebased and green. @AkihiroSuda @kolyshkin

indyjo · 2022-12-08T09:49:53Z

@kolyshkin Just a friendly reminder

kolyshkin

LGTM

kolyshkin reviewed Nov 22, 2021

View reviewed changes