Too much memory used with large number of namespaces #1268

Closed
guillaumerose opened this issue Nov 24, 2021 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@guillaumerose
Contributor

Expected Behavior

Same memory usage as the pipeline controller

Actual Behavior

The memory used by the interceptor and the triggers controller is around 1GB.

This is not the case for the pipeline controller, which stays around 50MB.

[Screenshot 2021-11-24 at 15 30 56: memory usage of pods in the Tekton namespace during the load test - the highest are the tekton-triggers-controller and tekton-triggers-core-interceptors pods]

Steps to Reproduce the Problem

  1. Create 2000 namespaces. In the first 600, create a PipelineRun like this (a client-go sketch automating these steps is included after the list):
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipeline1
spec:
  pipelineSpec:
    tasks:
      - name: step1
        taskSpec:
          steps:
          - name: step1
            image: busybox
            imagePullPolicy: IfNotPresent
            script: ls /
  2. Observe the memory used by the two controllers
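
For reference, a rough load-generation sketch of the steps above (not part of the original report; the namespace naming and error handling are illustrative only), using client-go and the dynamic client to create the namespaces and apply the PipelineRun in the first 600 of them:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

// The PipelineRun manifest from step 1.
const pipelineRunYAML = `
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipeline1
spec:
  pipelineSpec:
    tasks:
      - name: step1
        taskSpec:
          steps:
          - name: step1
            image: busybox
            imagePullPolicy: IfNotPresent
            script: ls /
`

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kube := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)
	prGVR := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "pipelineruns"}

	// Decode the manifest once and reuse a deep copy per namespace.
	var obj map[string]interface{}
	if err := yaml.Unmarshal([]byte(pipelineRunYAML), &obj); err != nil {
		panic(err)
	}
	base := &unstructured.Unstructured{Object: obj}

	ctx := context.Background()
	for i := 0; i < 2000; i++ {
		ns := fmt.Sprintf("load-test-%04d", i)
		if _, err := kube.CoreV1().Namespaces().Create(ctx,
			&corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: ns}}, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		// Only the first 600 namespaces get a PipelineRun.
		if i >= 600 {
			continue
		}
		if _, err := dyn.Resource(prGVR).Namespace(ns).Create(ctx, base.DeepCopy(), metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}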

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:31:32Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1+d8c4430", GitCommit:"2ade4d46986c94405aa0542f8d6ff7691bddb153", GitTreeState:"clean", BuildDate:"2021-11-02T02:02:53Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

pipeline: v0.28.2
triggers: v0.16.1

  • Other additional information

It is stable: if I kill the pod, the controller is recreated and takes almost the same amount of memory.

The cluster has 681 deployments and 14180 configMaps.

I turned on profiling and the profile shows the memory being held by the informers.

[Screenshot 2021-11-24 at 15 31 32: profiling output]

@guillaumerose guillaumerose added the kind/bug Categorizes issue or PR as related to a bug. label Nov 24, 2021
@dibyom
Member

dibyom commented Nov 24, 2021

Thanks for the detailed bug report. At a first glance, it looks like the watch for the lister/informer cache may be watching too many things.
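
For illustration (a minimal client-go sketch, not the actual Triggers wiring): a shared informer built with no namespace or label restriction performs a cluster-wide LIST+WATCH and keeps every matching object in its in-memory store, so its footprint grows with the whole cluster (14k+ configMaps in the report above) rather than with the Tekton installation.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// No WithNamespace/WithTweakListOptions option: this factory caches every
	// configMap in the cluster.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Everything listed here stays in memory for the life of the process.
	fmt.Println("cached configMaps:", len(cmInformer.GetStore().ListKeys()))
}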

guillaumerose added a commit to guillaumerose/triggers that referenced this issue Nov 25, 2021
It prevents the controller from watching configMaps in all namespaces.

Before this change:
the controller logs show:
2021/11/25 08:46:36 Registering 5 informers

crictl stats reports 66.88MB with around 4k configMaps of 2.5kB each, on kind.

After this change:
the controller logs show:
2021/11/25 08:48:48 Registering 4 informers

crictl stats reports 16.29MB

Related to tektoncd#1268
@guillaumerose
Contributor Author

guillaumerose commented Nov 25, 2021

I found an explanation for the Triggers controller. See #1273

For the interceptors container:

I observe that all secrets of the cluster are kept in memory. See https://github.com/tektoncd/triggers/blob/main/cmd/interceptors/main.go#L67.
I see two possible solutions:

  1. Don't use a secret lister but instead use a secret getter (see the sketch after this list)

Pros:

  • no more informer
  • this is what pipeline already does

Cons:

  • for each webhook, the interceptor will issue an API call to get the secret. Perhaps a cache can mitigate this?
  2. Ask users to add a specific label on their secrets so that we can filter on them.

Pros:

  • less code to change

Cons:

  • breaking change for the user.
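
A rough sketch of what option 1 could look like (the helper name and signature are illustrative, not the actual interceptors code): each webhook delivery fetches only the referenced secret from the API server on demand, so no cluster-wide secret cache is needed.

package interceptors

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getSecretToken looks up a single key in a single secret on demand, instead
// of reading it from a lister backed by an informer that caches all secrets.
func getSecretToken(ctx context.Context, kc kubernetes.Interface, namespace, name, key string) ([]byte, error) {
	secret, err := kc.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return secret.Data[key], nil
}

The trade-off is the extra GET request per webhook noted in the cons above.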

guillaumerose added a commit to guillaumerose/triggers that referenced this issue Nov 26, 2021
Replace the secret informer with a secret getter.
This means that, for each webhook, an API call is issued against the
k8s API to get the secret.
Previously, all secrets in the cluster were held in the interceptor's memory.

Before this change:
crictl stats reports 79.18MB with around 5k secrets of 2.5kB each, on kind.

After this change:
crictl stats reports 8.221MB

Related to tektoncd#1268
@guillaumerose
Contributor Author

guillaumerose commented Nov 26, 2021

I also realized that the triggers controller is watching all deployments and all services in the cluster. It might be a good idea to filter them (one possible filtering approach is sketched below).
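
One way such filtering could be done (sketch only; the label selector shown here is hypothetical, not an existing Triggers convention) is to build the shared informer factory with a tweak-list-options function, so that only labelled objects are listed and watched:

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newScopedFactory returns a factory whose informers only LIST+WATCH objects
// carrying the given (hypothetical) label, instead of everything in the cluster.
func newScopedFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "app.kubernetes.io/managed-by=EventListener"
		}),
	)
}

Deployment and Service informers created from this factory would then only cache the labelled objects.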

guillaumerose added a commit to guillaumerose/triggers that referenced this issue Nov 30, 2021

tekton-robot pushed a commit that referenced this issue Dec 1, 2021
@lbernick
Member

/priority important-longterm

@tekton-robot tekton-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 13, 2021
@tekton-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2022
@dibyom
Member

dibyom commented Mar 21, 2022

This was mostly fixed. The remaining work is around caching only the secrets needed by the interceptors vs all secrets.
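
As an illustration of that remaining work (a minimal sketch of the idea only, not what the eventual fix implements), a small TTL cache in front of an on-demand getter would keep in memory only the secrets that interceptors actually reference:

package interceptors

import (
	"context"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type cachedSecret struct {
	data    map[string][]byte
	fetched time.Time
}

// secretCache holds only the secrets that have actually been requested,
// each for at most ttl, rather than every secret in the cluster.
type secretCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	kc      kubernetes.Interface
	entries map[string]cachedSecret // keyed by namespace/name
}

func newSecretCache(kc kubernetes.Interface, ttl time.Duration) *secretCache {
	return &secretCache{kc: kc, ttl: ttl, entries: map[string]cachedSecret{}}
}

func (c *secretCache) get(ctx context.Context, namespace, name, key string) ([]byte, error) {
	cacheKey := namespace + "/" + name
	c.mu.Lock()
	entry, ok := c.entries[cacheKey]
	c.mu.Unlock()
	if ok && time.Since(entry.fetched) < c.ttl {
		return entry.data[key], nil
	}
	secret, err := c.kc.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[cacheKey] = cachedSecret{data: secret.Data, fetched: time.Now()}
	c.mu.Unlock()
	return secret.Data[key], nil
}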

@dibyom dibyom added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2022
@tekton-robot

/lifecycle stale

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2022
@dibyom dibyom removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2022
@khrm
Contributor

khrm commented Jul 27, 2022

Fixed by #1359

@khrm khrm closed this as completed Jul 27, 2022