VSO is getting OOMKilled on OpenShift cluster #973

Closed
erzhan46 opened this issue Nov 20, 2024 · 8 comments · Fixed by #982
Labels
bug (Something isn't working) · memory usage (Issues with memory consumption by the operator Pod)

Comments

@erzhan46

Describe the bug
VSO recently started to get OOMKilled on one of the OpenShift clusters (v.4.14.37).
Increasing the memory limit to 2Gi and trying to put VSO into guaranteed QoS didn't help.
There are several other OpenShift clusters where VSO runs just fine with default resource specs.

To Reproduce

  1. Deployed VSO using the standard Helm chart, increasing the memory limit to 2Gi.
  2. Tried to set the VSO pod to guaranteed QoS by setting resource specs for the manager and kube-rbac-proxy containers (see the values sketch below).
  3. In both cases, VSO gets OOMKilled on one cluster while it runs just fine on several others, even with default resource specs.
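
For reference, a minimal Helm values sketch of the overrides described in steps 1-2. The controller.manager.resources and controller.kubeRbacProxy.resources paths are assumed from the chart's values layout, and the CPU and kube-rbac-proxy figures are illustrative; guaranteed QoS requires every container's requests to equal its limits.

  controller:
    manager:
      resources:
        requests:
          cpu: 500m        # illustrative; must match limits for guaranteed QoS
          memory: 2Gi      # raised from the chart default per step 1
        limits:
          cpu: 500m
          memory: 2Gi
    kubeRbacProxy:
      resources:
        requests:
          cpu: 100m        # illustrative values; requests equal limits here as well
          memory: 128Mi
        limits:
          cpu: 100m
          memory: 128Mi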

Expected behavior
VSO should run using default resource specs.

Environment

  • Kubernetes version: OpenShift on-prem v.4.14.37
  • Other configuration options or runtime services (istio, etc.):
  • vault-secrets-operator version: Helm chart 0.7.1

Additional context
This seems to be the same issue experienced by others recently.

erzhan46 added the bug (Something isn't working) label Nov 20, 2024
@erzhan46
Author

This seems to be related to AppRole authentication failures.
VSO eventually came up, spiking to 2G on startup, and is now using 1.2G.
It currently logs 'invalid role or secret' errors.

tvoran added the memory usage (Issues with memory consumption by the operator Pod) label Nov 21, 2024
@tvoran
Member

tvoran commented Nov 21, 2024

Hi @erzhan46, that level of memory usage is unexpected. Are the AppRole authentication failures expected, and unique to this cluster? How many and what kind of secrets are being synced? Are there other auth methods besides AppRole in use?

@erzhan46
Author

Hi @tvoran

We fixed the issue with AppRole authentication - however, the memory problem still persists.
VSO gets OOMKilled several times upon startup before starting successfully.
Memory metrics show VSO spikes to about 2G and then runs consistently at 1G.
One thing I noticed is the following in the VSO logs on that cluster.
As you can see, an 'Objects listed' trace entry reporting 33246ms appears, probably related to 'SecretTransformation' processing.
On other clusters where VSO runs fine, this entry is not present.

{"level":"info","ts":"2024-11-21T16:34:02Z","msg":"Starting EventSource","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation","source":"kind source: *v1beta1.SecretTransformation"}
{"level":"info","ts":"2024-11-21T16:34:02Z","msg":"Starting Controller","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation"}
I1121 16:34:35.727589 1 trace.go:236] Trace[1704102856]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.1/tools/cache/reflector.go:232 (21-Nov-2024 16:34:02.275) (total time: 33451ms):
Trace[1704102856]: ---"Objects listed" error: 33246ms (16:34:35.522)
Trace[1704102856]: [33.451708284s] [33.451708284s] END
{"level":"info","ts":"2024-11-21T16:34:37Z","msg":"Starting workers","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation","worker count":1}

@erzhan46
Author

There are just a few StaticSecrets synced and a couple of SecretTransformations.
We cannot use authentication methods other than AppRole because of an issue with private domain name resolution for Vault instances deployed in HCP.

@benashz
Collaborator

benashz commented Nov 27, 2024

Hi @erzhan46,

The error you reported in #973 (comment) is possibly related to the VSO CRDs being out of sync. As of VSO v0.8.0, the CRDs are automatically updated when the VSO Helm release is upgraded (docs).

If you aren't ready to upgrade, would you mind following the instructions here for the version you are currently running?

Thanks,

Ben

@benashz
Collaborator

benashz commented Nov 27, 2024

@erzhan46 - would you mind running VSO with the log level set to trace? The configuration docs can be found here.
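
For example, something along these lines in the Helm values should do it. The controller.manager.extraArgs path is assumed from the chart's values layout; --zap-log-level takes a numeric verbosity, with higher numbers producing more detail.

  controller:
    manager:
      # assumed value path for passing extra flags to the manager container
      extraArgs:
        - "--zap-log-level=6"   # numeric zap verbosity; 6 yields trace-level detail

After applying the change, the flag should show up in the manager container's args on the operator Deployment.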

Thanks!

@erzhan46
Author

erzhan46 commented Dec 4, 2024

@benashz - we upgraded VSO on all nonprod clusters to 0.9.0 - it continues to get OOMKilled on one of them.
Tried to raise the log level to trace - the deployment and pod were updated with the new zap-log-level argument set to 6 - however, no extra entries are produced in the pod logs.
There are 8 OCP clusters (nonprod/prod), all at the same k8s/OCP level. On all of them VSO is deployed using the same Helm chart via ArgoCD, with the VSO memory limit/request set to 2G. On 2 of them VSO is getting OOMKilled; the rest have VSO running fine, with memory rising to ~350M on startup and then staying below 200M.

@erzhan46
Author

erzhan46 commented Dec 4, 2024

A few of these clusters have the VSO pod deployed with the 'default' VaultConnection CR only; there are no other hashicorp.com CRs there (VaultAuth, VaultStaticSecret, etc.).
VSO is getting OOMKilled on one of these 'empty' clusters, while it runs just fine on the others.

benashz linked pull request #982 (Dec 9, 2024) that will close this issue
benashz added this to the v0.9.1 milestone Dec 10, 2024