As a tekton pipeline user, I want to use liveness/readiness probes to check controller pod health #3111

Closed
xiujuan95 opened this issue Aug 18, 2020 · 17 comments · Fixed by #3489
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@xiujuan95
Contributor

xiujuan95 commented Aug 18, 2020

Feature request

I want to use liveness and readiness probes to detect whether my Tekton controller pod is healthy. However, it seems the liveness and readiness fields are not included in the controller deployment: https://github.com/tektoncd/pipeline/blob/master/config/controller.yaml

I have already run some experiments for this request. I configured the liveness and readiness probes as follows:

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /metrics
    port: 9090
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /metrics
    port: 9090
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

When I checked the pod events, they showed that the probes had failed:

Events:
  Type     Reason     Age                From                  Message
  ----     ------     ----               ----                  -------
  Normal   Scheduled  35m                default-scheduler     Successfully assigned tekton-pipelines/tekton-pipelines-controller-7cd74569b7-mm96v to 10.242.0.19
  Normal   Pulled     35m                kubelet, 10.242.0.19  Container image "icr.io/obs/codeengine/tekton-pipeline/controller-10a3e32792f33651396d02b6855a6e36:v0.14.2-rc2@sha256:845358c3bb68e6900b421545ee40352391baccb751ef63915758016c4745bdbe" already present on machine
  Normal   Created    35m                kubelet, 10.242.0.19  Created container tekton-pipelines-controller
  Normal   Started    35m                kubelet, 10.242.0.19  Started container tekton-pipelines-controller
  Warning  Unhealthy  35m (x2 over 35m)  kubelet, 10.242.0.19  Readiness probe failed: Get http://172.30.18.233:9090/metrics: dial tcp 172.30.18.233:9090: connect: connection refused
  Warning  Unhealthy  35m (x2 over 35m)  kubelet, 10.242.0.19  Liveness probe failed: Get http://172.30.18.233:9090/metrics: dial tcp 172.30.18.233:9090: connect: connection refused

However, my pod is still running normally:

kubectl get pod -n tekton-pipelines -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP               NODE            NOMINATED NODE   READINESS GATES
tekton-pipelines-controller-7cd74569b7-mm96v   1/1     Running   0          52m   172.30.18.233    10.242.0.19     <none>           <none>

This is not expected.

BTW, I can run curl http://localhost:9090/metrics successfully from inside the Tekton controller container:

kubectl exec -ti tekton-pipelines-controller-7cd74569b7-mm96v -n tekton-pipelines sh
sh-4.4# curl http://localhost:9090/metrics
# HELP tekton_client_latency How long Kubernetes API requests take
# TYPE tekton_client_latency histogram
tekton_client_latency_bucket{name="",le="1e-05"} 13
tekton_client_latency_bucket{name="",le="0.0001"} 819
tekton_client_latency_bucket{name="",le="0.001"} 820
tekton_client_latency_bucket{name="",le="0.01"} 826
tekton_client_latency_bucket{name="",le="0.1"} 10358

Use case

@xiujuan95 xiujuan95 added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 18, 2020
@xiujuan95
Contributor Author

xiujuan95 commented Aug 19, 2020

Now, when I check the pod events, the liveness/readiness probe failure messages are gone:

kubectl describe pod tekton-pipelines-controller-7cd74569b7-mm96v -n tekton-pipelines
Name:         tekton-pipelines-controller-7cd74569b7-mm96v
Namespace:    tekton-pipelines
Priority:     0
Node:         10.242.0.19/10.242.0.19
Start Time:   Tue, 18 Aug 2020 04:49:50 -0400
Labels:       app=tekton-pipelines-controller
              app.kubernetes.io/component=controller
              app.kubernetes.io/instance=default
              app.kubernetes.io/name=controller
              app.kubernetes.io/part-of=tekton-pipelines
              app.kubernetes.io/version=devel
              pipeline.tekton.dev/release=devel
              pod-template-hash=7cd74569b7
              version=devel
Annotations:  container.apparmor.security.beta.kubernetes.io/tekton-pipelines-controller: runtime/default
              kubernetes.io/psp: ibm-coligo-restricted-psp
              prometheus.io/port: 9090
              prometheus.io/scrape: true
              seccomp.security.alpha.kubernetes.io/pod: docker/default
Status:       Running
IP:           172.30.18.233
IPs:
  IP:           172.30.18.233
Controlled By:  ReplicaSet/tekton-pipelines-controller-7cd74569b7
Containers:
  tekton-pipelines-controller:
    Container ID:  containerd://ac980aa330e0e2f0da70b013926e7528d278f0a41f27cc80c9a3ea02db051030
    Image:         icr.io/obs/codeengine/tekton-pipeline/controller-10a3e32792f33651396d02b6855a6e36:v0.14.2-rc2@sha256:845358c3bb68e6900b421545ee40352391baccb751ef63915758016c4745bdbe
    Image ID:      icr.io/obs/codeengine/tekton-pipeline/controller-10a3e32792f33651396d02b6855a6e36@sha256:845358c3bb68e6900b421545ee40352391baccb751ef63915758016c4745bdbe
    Port:          <none>
    Host Port:     <none>
    Args:
      -kubeconfig-writer-image
      icr.io/obs/codeengine/tekton-pipeline/kubeconfigwriter-3d37fea0b053ea82d66b7c0bae03dcb0:v0.14.2-rc2@sha256:d9163204ebd1f9b8d7bbafd888e9b2d661834dfda97d02002ef964b538fbc803
      -creds-image
      icr.io/obs/codeengine/tekton-pipeline/creds-init-c761f275af7b3d8bea9d50cc6cb0106f:v0.14.2-rc2@sha256:2d3fca0f61c115ba1c092d49fa328012f245d1a041467f4d34ee409b17537cfe
      -git-image
      icr.io/obs/codeengine/tekton-pipeline/git-init-4874978a9786b6625dd8b6ef2a21aa70:v0.14.2-rc2@sha256:aed72cf82ad06aedd4d185334cc4b2790e074626064ea1517e46429c7540a2eb
      -entrypoint-image
      icr.io/obs/codeengine/tekton-pipeline/entrypoint-bff0a22da108bc2f16c818c97641a296:v0.14.2-rc2@sha256:3bce35f04e04e74a539b7511bbd8db00bad4ffb8698aca65d1fb8e48db8e958a
      -imagedigest-exporter-image
      icr.io/obs/codeengine/tekton-pipeline/imagedigestexporter-6e7c518e6125f31761ebe0b96cc63971:v0.14.2-rc2@sha256:3174897711d6dc697834ebf8bf5ab79aaf1b68ab0922804999199f5fab08276c
      -pr-image
      icr.io/obs/codeengine/tekton-pipeline/pullrequest-init-4e60f6acf9725cba4c9b0c81d0ba89b8:v0.14.2-rc2@sha256:fc8589362d32095dd25fd4200174fc9b050b704d16c30159058ff89f8613ed2f
      -build-gcs-fetcher-image
      icr.io/obs/codeengine/tekton-pipeline/gcs-fetcher-029518c065a5d298216f115c6595f133:v0.14.2-rc2@sha256:ce5cf198fdc17ddd4c09666b34f4c7a9becd89fe6b97be2a99ae880c772f55af
      -affinity-assistant-image
      nginx@sha256:c870bf53de0357813af37b9500cb1c2ff9fb4c00120d5fe1d75c21591293c34d
      -nop-image
      tianon/true@sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
      -gsutil-image
      google/cloud-sdk@sha256:37654ada9b7afbc32828b537030e85de672a9dd468ac5c92a36da1e203a98def
      -shell-image
      gcr.io/distroless/base@sha256:f79e093f9ba639c957ee857b1ad57ae5046c328998bf8f72b30081db4d8edbe4
    State:          Running
      Started:      Tue, 18 Aug 2020 04:49:51 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:      500m
      memory:   512Mi
    Liveness:   http-get http://:9090/metrics delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:9090/metrics delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SYSTEM_NAMESPACE:             tekton-pipelines (v1:metadata.namespace)
      CONFIG_LOGGING_NAME:          config-logging
      CONFIG_OBSERVABILITY_NAME:    config-observability
      CONFIG_ARTIFACT_BUCKET_NAME:  config-artifact-bucket
      CONFIG_ARTIFACT_PVC_NAME:     config-artifact-pvc
      CONFIG_FEATURE_FLAGS_NAME:    feature-flags
      CONFIG_LEADERELECTION_NAME:   config-leader-election
      METRICS_DOMAIN:               tekton.dev/pipeline
    Mounts:
      /etc/config-logging from config-logging (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from tekton-pipelines-controller-token-zls7s (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-logging:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      config-logging
    Optional:  false
  tekton-pipelines-controller-token-zls7s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tekton-pipelines-controller-token-zls7s
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
Events:          <none>

So I think the earlier failures were expected: the pod was still starting up, so the liveness and readiness checks failed. Once the pod is ready, the liveness and readiness checks pass.

Still, it would be better for you to add explicit probes to the controller. I think they are necessary.
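
One way to read the events above: with failureThreshold: 3, the kubelet acts only after three consecutive probe failures, so two early failures followed by successes leave the pod Running with zero restarts. The sketch below models that counting rule in Python. The function name `probe_outcome` and the "unknown" starting state are simplifications chosen for illustration, not kubelet code.

```python
def probe_outcome(results, failure_threshold=3, success_threshold=1):
    """Return the probe state after each result, using consecutive-count thresholds.

    results: iterable of booleans, True for a successful probe.
    The state becomes "failing" only after `failure_threshold` consecutive
    failures, and "passing" after `success_threshold` consecutive successes.
    """
    state = "unknown"  # simplified starting state (an assumption of this sketch)
    failures = 0
    successes = 0
    history = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if successes >= success_threshold:
                state = "passing"
        else:
            failures += 1
            successes = 0
            if failures >= failure_threshold:
                state = "failing"
        history.append(state)
    return history


# Two failures while the metrics server is still coming up, then successes.
print(probe_outcome([False, False, True, True]))
# Three consecutive failures cross the threshold.
print(probe_outcome([True, False, False, False]))
```

In the first sequence the state never reaches "failing", which matches a pod that logs two Unhealthy events at startup and then stays ready.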

@imjasonh
Member

@mattmoor is there a health check endpoint provided by knative/pkg by any chance? Perhaps hooked into any signal that reconciling is happening, and into the new HA stuff?

@mattmoor
Member

Yeah, but I think it's probably only exposed when the webhook is active, since the main use case would be lame ducking webhooks so the K8s Service drops the endpoint before it quits.

@qu1queee
Contributor

@imjasonh any comments around using the /metrics endpoint as the probe?

@imjasonh
Member

@imjasonh any comments around using the /metrics endpoint as the probe?

It's honestly a bit surprising that it reports unhealthy in the example above. I'd want to look into that and figure out why that is. Maybe for simplicity we should add a new /health handler that simply always responds successfully, to remove potential noise.

That seems a bit less useful though, since ideally we'd like to only report "ready" when listers/informers are set up, or when the webhook is registered, and should probably have some way to programmatically be able to report unhealthy/non-live. That's why I roped in @mattmoor, in case these considerations have already come up and been solved in Knative-land.
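
A minimal sketch of the two ideas in this comment, assuming a shallow /health endpoint that always succeeds and a separate readiness endpoint that reports ready only after startup work has finished. The /readiness path, the `ready` flag, and the helper names are illustrative assumptions. The code uses only the Python standard library to keep the sketch short; it is not Tekton or knative/pkg code.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once startup work (for example, informer cache sync) has finished.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = 200  # shallow liveness: the process can serve HTTP
        elif self.path == "/readiness":
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_probe_server(port=0):
    """Serve the probe endpoints on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def get_status(port, path):
    """Return the HTTP status code for a GET against the local probe server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

With this split, a liveness probe could target /health while a readiness probe targets /readiness, so the pod is not marked ready until `ready.set()` has been called.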

@xiujuan95
Contributor Author

@imjasonh Thanks for your attention!

@imjasonh any comments around using the /metrics endpoint as the probe?

It's honestly a bit surprising that it reports unhealthy in the example above. I'd want to look into that and figure out why that is. Maybe for simplicity we should add a new /health handler that simply always responds successfully, to remove potential noise.

  • It reports unhealthy, I think, because the pod was still starting up. Once the pod is ready, the probes succeed. You can see here.

  • Yes, I agree that we should add a simple /health endpoint for the liveness and readiness checks. Please consider it, thanks a lot!

@zhangtbj
Contributor

These liveness and readiness probes are also needed for the Tekton Pipelines webhook and for the Tekton Triggers controller and webhook.

@ywluogg
Contributor

ywluogg commented Sep 9, 2020

This sounds the same as the request in #1586. I tried adding a port to the controller (commit), but it still didn't work. I'm still trying to add this to the controller. The two probes have just been added to the webhook via this commit.

@xiujuan95
Contributor Author

@ywluogg Why do you use port 8080 instead of 9090?

@ywluogg
Contributor

ywluogg commented Sep 10, 2020

@xiujuan95 Ah, that's because I wanted to separate the probes' port from the metrics port.

I'm able to add the probes and the pending changes are in: 1d0f3d3

But as @imjasonh mentioned, it seems more useful to connect the probes to a signal that clearly tells whether the controller is processing reconciliations, which needs many more changes. The current setup is a single controller replica, and the time it takes to restart itself after a crash is roughly the same as the time it would take to be restarted by the probes. Are you trying to use the probes for some other goal?

I'm going to wait for the discussion about this and then probably send a PR.
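
To make the idea of tying liveness to reconcile progress concrete, here is a hedged sketch of one possible signal: a heartbeat that the work loop updates, with liveness failing when the heartbeat goes stale. `ReconcileHeartbeat`, `beat`, `is_live`, and `max_age_seconds` are hypothetical names invented for this illustration. Nothing here reflects how Tekton or knative/pkg implements health checks.

```python
import time


class ReconcileHeartbeat:
    """Tracks when the work loop last made progress (hypothetical helper)."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self._clock = clock  # injectable so the behaviour can be checked without sleeping
        self._last_beat = clock()

    def beat(self):
        # Intended to be called once per work-loop iteration, even when the
        # queue is empty, so an idle controller is not mistaken for a stuck one.
        self._last_beat = self._clock()

    def is_live(self):
        # Live while the most recent heartbeat is no older than max_age_seconds.
        return self._clock() - self._last_beat <= self.max_age_seconds
```

A liveness endpoint could then return success only while `is_live()` is true. Choosing `max_age_seconds` is a trade-off: too small risks restarts during long reconciles, too large delays detection of a stuck loop.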

@xiujuan95
Contributor Author

Thanks @ywluogg! No, I just want to use probes to detect the health of the controller; I don't have other goals.

@ywluogg
Contributor

ywluogg commented Sep 14, 2020

@imjasonh do you think we should add the probes for simple health check purposes for now?

@qu1queee
Contributor

Any updates on this issue? It seems that if we use the /metrics endpoint for probes, it will eventually conflict with HA for the controllers, as explained in #2735 (comment).

@afrittoli
Member

The probes are available now on the webhook:

livenessProbe:
  tcpSocket:
    port: https-webhook
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
readinessProbe:
  tcpSocket:
    port: https-webhook
  initialDelaySeconds: 5
  periodSeconds: 10
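
For readers unfamiliar with tcpSocket probes: the check passes if a TCP connection to the named port can be opened within the timeout, and it does not send any HTTP request. A rough Python equivalent of that check is sketched below. `tcp_probe` is a hypothetical helper name, and the host and port passed to it are up to the caller.

```python
import socket


def tcp_probe(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Because this only proves that something is listening on the port, it is a weaker signal than an HTTP health endpoint that inspects internal state.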

@afrittoli
Member

Yeah, but I think it's probably only exposed when the webhook is active, since the main use case would be lame ducking webhooks so the K8s Service drops the endpoint before it quits.

@mattmoor @pmorie I found knative/pkg#1048 but it's not clear to me then whether it is available yet or not. If not we could perhaps add a "shallow" /healthz for now, that always reports "ok" (as @imjasonh suggested) and switch to the knative/pkg one once it becomes available.

@ywluogg
Contributor

ywluogg commented Oct 6, 2020

Agreed with @afrittoli. It still seems more suitable to add the health checks after knative/pkg#1048 becomes available.

@xiujuan95
Contributor Author

Hi @ywluogg, any updates on adding liveness/readiness probes for the controller?

7 participants