Use HTTP liveness probe instead of tetra status #2474
Conversation
Force-pushed from bd2eb34 to 4ae2b3d
Force-pushed from d6c18d1 to 56363a8
Currently, we use the tetra status command to report the status of the Tetragon agent. This comes with some overhead, as the tetra binary has a lot of additional functionality and using it just for status reporting seems like overkill. On the other hand, Kubernetes supports liveness probes via an HTTP endpoint (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#http-probes). This patch first creates a new HTTP endpoint that reports the agent status and can be used for the liveness probe. Signed-off-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
Force-pushed from 56363a8 to f3c11ed
# -- Use http-based livenessProbe for the tetragon container. livenessProbe has a higher priority.
livenessHttpProbe:
  enabled: true
  # host: "localhost"
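For context, "livenessProbe has a higher priority" presumably means the template renders an explicit livenessProbe override when one is given and only falls back to the HTTP probe otherwise. A hypothetical sketch of that ordering (not the actual template in this PR) could look like:
{{- if .Values.tetragon.livenessProbe }}
livenessProbe:
  {{- toYaml .Values.tetragon.livenessProbe | nindent 2 }}
{{- else if .Values.tetragon.livenessHttpProbe.enabled }}
livenessProbe:
  httpGet:
    path: /liveness
    port: 6789
    {{- if .Values.tetragon.livenessHttpProbe.host }}
    host: {{ .Values.tetragon.livenessHttpProbe.host }}
    {{- end }}
{{- end }}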
I think it would be better to have this under livenessProbe somehow.
Maybe we could do something like:
{{- if .Values.tetragon.livenessProbe.enabled }}
livenessProbe:
  timeoutSeconds: 60
  {{- if .Values.tetragon.livenessProbe.custom }}
  {{- toYaml .Values.tetragon.livenessProbe.custom | nindent 4 }}
  {{- else if .Values.tetragon.livenessProbe.http.enabled }}
  httpGet:
    path: /liveness
    port: 6789
    {{- if .Values.tetragon.livenessProbe.http.host }}
    host: {{ .Values.tetragon.livenessProbe.http.host }}
    {{- end }}
  {{- else if .Values.tetragon.grpc.enabled }}
  exec:
    command:
      - tetra
      - status
      - --server-address
      - {{ .Values.tetragon.grpc.address }}
      - --retries
      - "5"
  {{- end -}}
{{- end -}}
Using that we can have:
livenessProbe:
  enabled: false
to disable the liveness probe.
livenessProbe:
  enabled: true
  custom:
    grpc:
      port: 54321
to define a custom liveness probe.
livenessProbe:
  enabled: true
  http:
    enabled: true
to define an http-based liveness probe.
livenessProbe:
  enabled: true
to define our legacy "tetra status"-based liveness probe (assuming tetragon.grpc.enabled is true).
Maybe I'm missing something, but can we just change the default livenessProbe and keep everything else the same? That is, change the default values to:
livenessProbe:
  timeoutSeconds: 60
  httpGet:
    path: /liveness
    port: 6789
and keep _container_tetragon.tpl unchanged. Then users can still use the tetra probe by setting livenessProbe: {}.
Yes, this makes sense (easier compared to what I am trying to do). But what I am not able to achieve is setting livenessProbe: {}. Possibly I am missing something.
First, change the existing values.yaml with:
@@ -64,9 +64,11 @@ tetragon:
   securityContext:
     privileged: true
   # -- Overrides the default livenessProbe for the tetragon container.
-  livenessProbe: {}
-  # grpc:
-  #   port: 54321
+  livenessProbe:
+    timeoutSeconds: 60
+    httpGet:
+      path: "/liveness"
+      port: 6789
   # Tetragon puts processes in an LRU cache. The cache is used to find ancestors
   # for subsequently exec'ed processes.
Then I run:
$ cat values.yaml
tetragon:
  livenessProbe: {}
$ helm template ./install/kubernetes/tetragon -f ./values.yaml
[...]
  livenessProbe:
    httpGet:
      path: /liveness
      port: 6789
    timeoutSeconds: 60
[...]
Which seems to be the same as the default, i.e. the {} override has no effect.
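That matches how Helm merges values (my understanding, not stated in the thread): user-supplied values are deep-merged over the chart defaults, so an empty map neither adds nor removes keys, while an explicit null deletes the key from the merged result. Roughly:
# chart default (values.yaml)
livenessProbe:
  timeoutSeconds: 60
  httpGet:
    path: /liveness
    port: 6789

# user override "livenessProbe: {}"   -> merged result keeps the default httpGet probe above
# user override "livenessProbe: null" -> merged result has no livenessProbe key at all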
ok, I think I found a way to do that. Updating the PR now.
We should define
$ cat values.yaml
tetragon:
  livenessProbe: null
instead of
$ cat values.yaml
tetragon:
  livenessProbe: {}
to use the old tetra-based liveness probe.
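Assuming the chart then falls back to the tetra-based exec probe once the key is nulled out (as in the template suggested earlier in this thread), the rendered probe would look roughly like the following; the server address is a hypothetical placeholder for tetragon.grpc.address:
livenessProbe:
  timeoutSeconds: 60
  exec:
    command:
      - tetra
      - status
      - --server-address
      - "localhost:54321"   # hypothetical placeholder
      - --retries
      - "5"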
PR is ready for review.
The previous commit introduced an HTTP endpoint that can be used for the liveness probe. This patch changes helm to make that the default instead of the tetra status based liveness probe. The user can still use the tetra status based liveness probe by defining a values file similar to:
$ cat values.yaml
tetragon:
  livenessProbe: null
Signed-off-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
Force-pushed from f3c11ed to d9bd92c
I have my opinion about this. I'll add my review here tomorrow 🚀 🙏
If the gRPC server is disabled then this new healthcheck will always fail, right? I think it shouldn't.
I don't have context on why the GetHealth endpoint was introduced, but it looks a bit awkward to me. gRPC itself defines the health checking service, so I would use the upstream proto and start the health service alongside the regular Tetragon gRPC server.
Also, Kubernetes now supports gRPC liveness probes, so unless we want to support old Kubernetes versions, we can use them instead of starting another HTTP server.
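For reference, a native Kubernetes gRPC liveness probe (assuming a reasonably recent cluster and that the agent exposes the standard grpc.health.v1.Health service) would look roughly like this; the port is a hypothetical value taken from the grpc example earlier in the thread:
livenessProbe:
  grpc:
    port: 54321        # hypothetical Tetragon gRPC port
  timeoutSeconds: 60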
I'm not sure why the new check would fail with the gRPC server disabled. How is that related?
I also don't have context about the details of [...]
Ok, I actually checked [...]
Force-pushed from d47e9c9 to d9bd92c
Closing in favor of #2478 as we chose to proceed with a gRPC-based liveness probe.
Currently, we use the tetra status command to report the status of the Tetragon agent. This comes with some overhead, as the tetra binary has a lot of additional functionality and using it just for status reporting seems like overkill.
On the other hand, Kubernetes supports liveness probes via an HTTP endpoint (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#http-probes). This patch first creates a new HTTP endpoint to report agent status and then changes helm to make use of it.
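For illustration, the resulting HTTP-based probe in the container spec would look roughly like this (path and port as discussed above; the timing fields are hypothetical, not taken from the chart):
livenessProbe:
  httpGet:
    path: /liveness
    port: 6789
  timeoutSeconds: 60
  periodSeconds: 10      # hypothetical
  failureThreshold: 3    # hypothetical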