Installing argocd causes unbounded etcd memory usage #3556

Closed
stujb opened this issue May 7, 2020 · 4 comments
Labels
• bug (Something isn't working)
• component:core (Syncing, diffing, cluster state cache)
• more-information-needed (Further information is requested)
• type:scalability (Issues related to scalability and performance related issues)
Milestone: v1.7

Comments


stujb commented May 7, 2020

If you are trying to resolve an environment-specific issue or have a one-off question about an edge case that does not require a feature, then please consider asking a question in the argocd slack channel.

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: http://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [Multiple versions] I've pasted the output of argocd version.

Describe the bug

ArgoCD causes unbounded increases in etcd memory usage

To Reproduce

Install ArgoCD and monitor etcd memory usage. We are observing that the memory increases steadily, to the extent that etcd processes get OOM killed.
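
A minimal sketch of one way to monitor this, for anyone reproducing it. The endpoint, port, and certificate paths below are placeholders that vary per cluster, and the metric names are from etcd 3.x and may differ between minor versions:

```sh
# Poll the etcd metrics endpoint for resident memory and active watcher count.
# Cert paths and endpoint are examples only; adjust for your cluster.
while true; do
  curl -s --cacert /etc/etcd/ca.crt \
       --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key \
       https://127.0.0.1:2379/metrics \
    | grep -E '^(process_resident_memory_bytes|etcd_debugging_mvcc_watcher_total) '
  sleep 60
done
```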

Expected behavior

ArgoCD doesn't cause etcd memory usage to grow without bound.

Screenshots

We restarted etcd on a cluster, as this is the only way we have so far discovered to relieve the problem and reduce the memory usage. The metrics graphs below were captured after that etcd restart to show the issue.

ArgoCD on this cluster is configured with an app-of-apps containing two applications. One of these apps contains a single SealedSecret. The second app contains a ConfigMap, Deployment, Service, SealedSecret and a custom resource (from a CRD) called a TLSRoute, which creates a Route and injects the required certificates into it.
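
For illustration only, a minimal "app of apps" layout similar to the one described above; the application name, repo URL, and path are placeholders, not our actual manifests:

```sh
# Hypothetical parent Application pointing at a directory of child Application manifests.
cat <<'EOF' | kubectl apply -n argocd -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/ops/argocd-apps.git   # placeholder repo
    targetRevision: HEAD
    path: apps        # directory containing the two child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
EOF
```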

After we restarted etcd, this is the memory usage we see over approximately 20-21 hours:

[screenshot: etcd memory usage graph]

We believe the memory usage may be caused by an unbounded increase in the number of watches in etcd. Graph for the same cluster over the same period, showing the number of watches in etcd:

[screenshot: etcd watch count graph]

We have investigated changing the snapshot-count of the etcd cluster to resolve this, but reducing it from the default of 100000 to 10000, and then to 10, makes no difference (the graphs above were taken with snapshot-count set to 10).
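
For anyone repeating that test, a sketch of one way to lower snapshot-count; the config file path and restart method are assumptions and depend on how etcd is deployed (on an RPM-based OKD 3.11 master the config is commonly /etc/etcd/etcd.conf):

```sh
# Equivalent to passing --snapshot-count=10 on the etcd command line.
echo 'ETCD_SNAPSHOT_COUNT=10' >> /etc/etcd/etcd.conf
systemctl restart etcd   # or `master-restart etcd` if etcd runs as a static pod
```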

We have found this etcd bug and wonder if it could be related, and whether there is anything ArgoCD does with watches that could exacerbate it: etcd-io/etcd#9416 (comment)

I would also note that we see an increase in the etcd memory usage and number of watches even if we install ArgoCD and do NOT configure any applications in it.

Version

We have observed this issue with argocd 1.5.1, 1.5.2 and 1.5.3.
ArgoCD is installed on an OKD 3.11 cluster.

stujb added the bug label on May 7, 2020

stujb commented May 7, 2020

Another memory graph, from a production cluster on which we installed ArgoCD on 23rd April and removed it on 4th May. The graph shows the memory % free on each of the etcd nodes in the cluster.

[screenshot: memory % free on each etcd node]

jannfis added the component:core and type:scalability labels on May 14, 2020
alexmt added this to the v1.7 milestone on May 20, 2020
rokrokss (Contributor) commented

What's your etcd version? Did the etcd DB size increase too?
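
One way to gather both answers (the endpoint and cert paths are placeholders); `endpoint status` reports the server version alongside the DB size:

```sh
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key \
        endpoint status --write-out=table
```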

jessesuen (Member) commented

v1.7 significantly reduced etcd churn by utilizing redis as a pub/sub channel for application events. Previously, we caused many more updates in etcd. Can you confirm whether v1.7 helped this behavior?
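
A rough way to confirm the Redis pub/sub path is being exercised on v1.7+, assuming the default non-HA install and its standard labels (pod label and namespace are assumptions):

```sh
# Find the argocd-redis pod and list the active pub/sub channels.
POD=$(kubectl -n argocd get pods -l app.kubernetes.io/name=argocd-redis \
        -o jsonpath='{.items[0].metadata.name}')
kubectl -n argocd exec "$POD" -- redis-cli PUBSUB CHANNELS '*'
```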

jessesuen added the more-information-needed label on Sep 16, 2020

no-response bot commented Sep 30, 2020

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

no-response bot closed this as completed on Sep 30, 2020