Too much memory used with large number of namespaces #1268

Closed
guillaumerose opened this issue Nov 24, 2021 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@guillaumerose
Contributor

Expected Behavior

Same memory usage as the pipeline controller

Actual Behavior

The memory used by the interceptor and the triggers controller is around 1GB.

This is not the case for the pipeline controller, which stays around 50MB.

[Screenshot 2021-11-24 at 15 30 56: memory usage of pods in the Tekton namespace during the load test - the highest are the tekton-triggers-controller and tekton-triggers-core-interceptors pods]

Steps to Reproduce the Problem

  1. Create 2000 namespaces. In the first 600, create a PipelineRun like this (a client-go sketch automating these steps is included after the list):
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipeline1
spec:
  pipelineSpec:
    tasks:
      - name: step1
        taskSpec:
          steps:
          - name: step1
            image: busybox
            imagePullPolicy: IfNotPresent
            script: ls /
  2. Observe the memory used by the two controllers
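
For reference, a rough load-generation sketch of the steps above (not part of the original report; the namespace naming and error handling are illustrative only), using client-go and the dynamic client to create the namespaces and apply the PipelineRun in the first 600 of them:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

// The PipelineRun manifest from step 1.
const pipelineRunYAML = `
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipeline1
spec:
  pipelineSpec:
    tasks:
      - name: step1
        taskSpec:
          steps:
          - name: step1
            image: busybox
            imagePullPolicy: IfNotPresent
            script: ls /
`

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kube := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)
	prGVR := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "pipelineruns"}

	// Decode the manifest once and reuse a deep copy per namespace.
	var obj map[string]interface{}
	if err := yaml.Unmarshal([]byte(pipelineRunYAML), &obj); err != nil {
		panic(err)
	}
	base := &unstructured.Unstructured{Object: obj}

	ctx := context.Background()
	for i := 0; i < 2000; i++ {
		ns := fmt.Sprintf("load-test-%04d", i)
		if _, err := kube.CoreV1().Namespaces().Create(ctx,
			&corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: ns}}, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		// Only the first 600 namespaces get a PipelineRun.
		if i >= 600 {
			continue
		}
		if _, err := dyn.Resource(prGVR).Namespace(ns).Create(ctx, base.DeepCopy(), metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}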

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:31:32Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1+d8c4430", GitCommit:"2ade4d46986c94405aa0542f8d6ff7691bddb153", GitTreeState:"clean", BuildDate:"2021-11-02T02:02:53Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

pipeline: v0.28.2
triggers: v0.16.1

  • Other additional information

It is stable: if I kill the pod, the controller is recreated and takes almost the same amount of memory.

The cluster has 681 deployments and 14180 configMaps.

I turned on profiling and the profile shows the memory being held by the informers.

[Screenshot 2021-11-24 at 15 31 32: profiling output]

@guillaumerose guillaumerose added the kind/bug Categorizes issue or PR as related to a bug. label Nov 24, 2021
@dibyom
Member

dibyom commented Nov 24, 2021

Thanks for the detailed bug report. At a first glance, it looks like the watch for the lister/informer cache may be watching too many things.
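
For illustration (a minimal client-go sketch, not the actual Triggers wiring): a shared informer built with no namespace or label restriction performs a cluster-wide LIST+WATCH and keeps every matching object in its in-memory store, so its footprint grows with the whole cluster (14k+ configMaps in the report above) rather than with the Tekton installation.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// No WithNamespace/WithTweakListOptions option: this factory caches every
	// configMap in the cluster.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Everything listed here stays in memory for the life of the process.
	fmt.Println("cached configMaps:", len(cmInformer.GetStore().ListKeys()))
}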

guillaumerose added a commit to guillaumerose/triggers that referenced this issue Nov 25, 2021
It prevents the controller from watching configMaps in all namespaces.

Before this change:
the controller logs show:
2021/11/25 08:46:36 Registering 5 informers

crictl stats reports 66.88MB with around 4k configMaps of 2.5kB each, on kind.

After this change:
the controller logs show:
2021/11/25 08:48:48 Registering 4 informers

crictl stats reports 16.29MB

Related to tektoncd#1268
@guillaumerose
Contributor Author

guillaumerose commented Nov 25, 2021

I found an explanation for the Triggers controller. See #1273

For the interceptors container:

I observe that all secrets of the cluster are kept in memory. See https://github.com/tektoncd/triggers/blob/main/cmd/interceptors/main.go#L67.
I see two possible solutions:

  1. Don't use a secret lister but instead use a secret getter (see the sketch after this list)

Pros:

  • no more informer
  • this is what pipeline already does

Cons:

  • for each webhook, the interceptor will issue an API call to get the secret. Perhaps a cache can mitigate this?
  2. Ask users to add a specific label on their secrets so that we can filter on them.

Pros:

  • less code to change

Cons:

  • breaking change for the user.
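
A rough sketch of what option 1 could look like (the helper name and signature are illustrative, not the actual interceptors code): each webhook delivery fetches only the referenced secret from the API server on demand, so no cluster-wide secret cache is needed.

package interceptors

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getSecretToken looks up a single key in a single secret on demand, instead
// of reading it from a lister backed by an informer that caches all secrets.
func getSecretToken(ctx context.Context, kc kubernetes.Interface, namespace, name, key string) ([]byte, error) {
	secret, err := kc.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return secret.Data[key], nil
}

The trade-off is the extra GET request per webhook noted in the cons above.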

guillaumerose added a commit to guillaumerose/triggers that referenced this issue Nov 26, 2021
Replace the secret informer with a secret getter.
This means that, for each webhook, an API call is issued against the
k8s API to get the secret.
Previously, all secrets in the cluster were held in the interceptor's memory.

Before this change:
crictl stats reports 79.18MB with around 5k secrets of 2.5kB each, on kind.

After this change:
crictl stats reports 8.221MB

Related to tektoncd#1268
@guillaumerose
Contributor Author

guillaumerose commented Nov 26, 2021

I also realized that the triggers controller is watching all deployments and all services in the cluster. It might be a good idea to filter them (one possible filtering approach is sketched below).
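
One way such filtering could be done (sketch only; the label selector shown here is hypothetical, not an existing Triggers convention) is to build the shared informer factory with a tweak-list-options function, so that only labelled objects are listed and watched:

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newScopedFactory returns a factory whose informers only LIST+WATCH objects
// carrying the given (hypothetical) label, instead of everything in the cluster.
func newScopedFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "app.kubernetes.io/managed-by=EventListener"
		}),
	)
}

Deployment and Service informers created from this factory would then only cache the labelled objects.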

guillaumerose added a commit to guillaumerose/triggers that referenced this issue Nov 30, 2021

tekton-robot pushed a commit that referenced this issue Dec 1, 2021
@lbernick
Member

/priority important-longterm

@tekton-robot tekton-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 13, 2021
@tekton-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2022
@dibyom
Member

dibyom commented Mar 21, 2022

This was mostly fixed. The remaining work is around caching only the secrets needed by the interceptors vs all secrets.
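
As an illustration of that remaining work (a minimal sketch of the idea only, not what the eventual fix implements), a small TTL cache in front of an on-demand getter would keep in memory only the secrets that interceptors actually reference:

package interceptors

import (
	"context"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type cachedSecret struct {
	data    map[string][]byte
	fetched time.Time
}

// secretCache holds only the secrets that have actually been requested,
// each for at most ttl, rather than every secret in the cluster.
type secretCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	kc      kubernetes.Interface
	entries map[string]cachedSecret // keyed by namespace/name
}

func newSecretCache(kc kubernetes.Interface, ttl time.Duration) *secretCache {
	return &secretCache{kc: kc, ttl: ttl, entries: map[string]cachedSecret{}}
}

func (c *secretCache) get(ctx context.Context, namespace, name, key string) ([]byte, error) {
	cacheKey := namespace + "/" + name
	c.mu.Lock()
	entry, ok := c.entries[cacheKey]
	c.mu.Unlock()
	if ok && time.Since(entry.fetched) < c.ttl {
		return entry.data[key], nil
	}
	secret, err := c.kc.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[cacheKey] = cachedSecret{data: secret.Data, fetched: time.Now()}
	c.mu.Unlock()
	return secret.Data[key], nil
}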

@dibyom dibyom added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2022
@tekton-robot

/lifecycle stale

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2022
@dibyom dibyom removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2022
@khrm
Contributor

khrm commented Jul 27, 2022

Fixed by #1359

@khrm khrm closed this as completed Jul 27, 2022