VSO is getting OOMKilled on OpenShift cluster #973

Closed
erzhan46 opened this issue Nov 20, 2024 · 8 comments · Fixed by #982
Labels
bug (Something isn't working) · memory usage (Issues with memory consumption by the operator Pod)

Comments

@erzhan46

Describe the bug
VSO recently started to get OOMKilled on one of the OpenShift clusters (v.4.14.37).
Increasing the memory limit to 2Gi and trying to put VSO into guaranteed QoS didn't help.
There are several other OpenShift clusters where VSO runs just fine with default resource specs.

To Reproduce

  1. Deployed VSO using the standard Helm chart, increasing the memory limit to 2Gi.
  2. Tried to set the VSO pod to guaranteed QoS by setting resource specs for the manager and kube-rbac-proxy containers (see the values sketch below).
  3. In both cases, VSO gets OOMKilled on one cluster while it runs just fine on several others, even with default resource specs.
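
For reference, a minimal Helm values sketch of the overrides described in steps 1-2. The controller.manager.resources and controller.kubeRbacProxy.resources paths are assumed from the chart's values layout, and the CPU and kube-rbac-proxy figures are illustrative; guaranteed QoS requires every container's requests to equal its limits.

  controller:
    manager:
      resources:
        requests:
          cpu: 500m        # illustrative; must match limits for guaranteed QoS
          memory: 2Gi      # raised from the chart default per step 1
        limits:
          cpu: 500m
          memory: 2Gi
    kubeRbacProxy:
      resources:
        requests:
          cpu: 100m        # illustrative values; requests equal limits here as well
          memory: 128Mi
        limits:
          cpu: 100m
          memory: 128Mi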

Expected behavior
VSO should run using default resource specs.

Environment

  • Kubernetes version: OpenShift on-prem v.4.14.37
  • Other configuration options or runtime services (istio, etc.):
  • vault-secrets-operator version: Helm chart 0.7.1

Additional context
This seems to be the same issue experienced by others recently.

erzhan46 added the bug (Something isn't working) label Nov 20, 2024
@erzhan46
Author

This seems to be related to AppRole authentication failures.
VSO eventually came up, spiking to 2G on startup, and is now using 1.2G.
It currently logs 'invalid role or secret' errors.

tvoran added the memory usage (Issues with memory consumption by the operator Pod) label Nov 21, 2024
@tvoran
Member

tvoran commented Nov 21, 2024

Hi @erzhan46, that level of memory usage is unexpected. Are the AppRole authentication failures expected, and unique to this cluster? How many and what kind of secrets are being synced? Are there other auth methods besides AppRole in use?

@erzhan46
Author

Hi @tvoran

We fixed the issue with AppRole authentication - however, the memory problem still persists.
VSO gets OOMKilled several times upon startup before starting successfully.
Memory metrics show VSO spikes to about 2G and then runs consistently at 1G.
One thing I noticed is the following in the VSO logs on that cluster.
As you can see, an 'Objects listed' trace entry reporting 33246ms appears, probably related to 'SecretTransformation' processing.
On other clusters where VSO runs fine, this entry is not present.

{"level":"info","ts":"2024-11-21T16:34:02Z","msg":"Starting EventSource","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation","source":"kind source: *v1beta1.SecretTransformation"}
{"level":"info","ts":"2024-11-21T16:34:02Z","msg":"Starting Controller","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation"}
I1121 16:34:35.727589 1 trace.go:236] Trace[1704102856]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.1/tools/cache/reflector.go:232 (21-Nov-2024 16:34:02.275) (total time: 33451ms):
Trace[1704102856]: ---"Objects listed" error: 33246ms (16:34:35.522)
Trace[1704102856]: [33.451708284s] [33.451708284s] END
{"level":"info","ts":"2024-11-21T16:34:37Z","msg":"Starting workers","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation","worker count":1}

@erzhan46
Author

There are just a few StaticSecrets synced and a couple of SecretTransformations.
We cannot use authentication methods other than AppRole because of an issue with private domain name resolution for Vault instances deployed in HCP.

@benashz
Collaborator

benashz commented Nov 27, 2024

Hi @erzhan46,

The error you reported in #973 (comment) is possibly related to the VSO CRDs being out of sync. As of VSO v0.8.0, the CRDs are automatically updated when the VSO Helm release is upgraded (docs).

If you aren't ready to upgrade, would you mind following the instructions here for the version you are currently running?

Thanks,

Ben

@benashz
Collaborator

benashz commented Nov 27, 2024

@erzhan46 - would you mind running VSO with the log level set to trace? The configuration docs can be found here.
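
For example, something along these lines in the Helm values should do it. The controller.manager.extraArgs path is assumed from the chart's values layout; --zap-log-level takes a numeric verbosity, with higher numbers producing more detail.

  controller:
    manager:
      # assumed value path for passing extra flags to the manager container
      extraArgs:
        - "--zap-log-level=6"   # numeric zap verbosity; 6 yields trace-level detail

After applying the change, the flag should show up in the manager container's args on the operator Deployment.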

Thanks!

@erzhan46
Author

erzhan46 commented Dec 4, 2024

@benashz - we upgraded VSO on all nonprod clusters to 0.9.0 - it continues to get OOMKilled on one of them.
Tried to raise the log level to trace - the deployment and pod were updated with the new zap-log-level argument set to 6 - however, no extra entries are produced in the pod logs.
There are 8 OCP clusters (nonprod/prod), all at the same k8s/OCP level. On all of them VSO is deployed using the same Helm chart via ArgoCD, with the VSO memory limit/request set to 2G. On 2 of them VSO is getting OOMKilled; the rest have VSO running fine, with memory rising to ~350M on startup and then staying below 200M.

@erzhan46
Author

erzhan46 commented Dec 4, 2024

A few of these clusters have the VSO pod deployed with the 'default' VaultConnection CR only; there are no other hashicorp.com CRs there (VaultAuth, VaultStaticSecret, etc.).
VSO is getting OOMKilled on one of these 'empty' clusters, while it runs just fine on the others.

benashz linked pull request #982 (Dec 9, 2024) that will close this issue
benashz added this to the v0.9.1 milestone Dec 10, 2024