Persistent memory escalation within Tetragon leads to Out-of-Memory errors #1485
Comments
Hi, thanks for the report and the detailed information! I see that you are deploying a number of tracing policies. Does the issue appear if you don't use any policies at all? If not, is it possible to pinpoint a single policy that leads to this issue? Also, as far as I understand, the issue does not appear on a different cluster with the same Tetragon configuration? Could you please provide the same information from the cluster where the issue does not exist?
@kkourt Hi, I've noticed that Tetragon's memory in the test cluster is also constantly increasing, just not as rapidly as in the production environment, and in this environment I haven't added any resource limits. Here's the bugtool file from the test cluster: Is there any data available on Tetragon's resource usage over long runs? Its memory usage seems to increase steadily the longer it runs. Here is the trend chart showing the increase in memory usage.
Thank you for the info!
One known cause of increasing memory footprint was having pod labels as a dimension in Prometheus metrics. That issue was recently addressed in #1279, which is part of 0.11 (see 16a9408). I'm not aware of any other causes of increasing memory (especially if there are no policies loaded). Could you also check with metrics disabled and see if the issue persists?
@kkourt Thank you for your reply. I have removed the CPU resource constraints and disabled metrics: no resource limits, no TracingPolicy, and no metrics. Unfortunately, during testing I found that memory usage still increases gradually. I believe this indicates an issue within Tetragon, possibly a memory leak. Below is the chart of memory usage over the 19 hours after metrics were disabled. As you can see, memory is still slowly increasing. The tetragon bugtool is:
@kkourt I discovered that this issue was due to some processes that were never garbage collected, leading to a continuous increase in the number of entries in the process cache. The original process PIDs of all non-GC'd processes were 1, and their reference count values typically ranged from 4294967200 to 4294967295, which I believe is because the reference counter undergoes more Dec() than Inc() operations, so the unsigned counter wraps around and never reaches zero. As a result the process is skipped during garbage collection, eventually leading to this problem. After modifying the RefDec function, memory no longer increased and the processes were garbage collected successfully; the code is as follows. However, there are some pieces of code I'm struggling to understand, and I have yet to identify the cause of the superfluous Dec() on the reference count. I would appreciate your assistance in investigating this further.
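As a sketch of the idea (illustrative only, not the exact RefDec change, and assuming the reference counter behaves like an unsigned 32-bit integer, which the observed values near 2^32 suggest): one extra Dec() on a counter that is already zero wraps it to 4294967295, so a zero-refcount GC check never matches; a guarded decrement avoids the wrap.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical guarded decrement: skip the Dec() when the counter is already
 * zero instead of letting it wrap around to ~2^32. */
static uint32_t ref_dec(uint32_t *refcnt)
{
	if (*refcnt == 0)
		return 0;          /* already released; ignore the extra Dec() */
	return --(*refcnt);
}

int main(void)
{
	uint32_t refcnt = 0;

	refcnt--;                                   /* unguarded extra Dec()  */
	printf("wrapped: %u\n", refcnt);            /* prints 4294967295      */

	refcnt = 0;
	printf("guarded: %u\n", ref_dec(&refcnt));  /* stays at 0             */
	return 0;
}
```

Guarding the decrement only hides the symptom, of course; the open question is still where the superfluous Dec() comes from.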
Hello @Jay-JHe, thanks for looking into that. What workload does the cluster run? More specifically, are new pods/containers continuously created and destroyed, or do the same pods/containers run for long periods of time? Thanks!
@tpapagian Yes, there are some pods in the cluster that are constantly being restarted and recreated. In addition, some Tetragon pods fail to start because the host kernel version is older than 5.4, and they are also constantly restarted and recreated.
OK, thanks, that makes sense! I will try to reproduce the issue locally and let you know of any updates.
Thanks @Jay-JHe for those. We have already identified one cause (#1507 (comment)); we will try to fix it and see whether it is the only one.
@tpapagian @tixxdz I think this issue is due to improper concurrency handling. When hooking do_task_dead, it's necessary to check whether the current task is the main process. Here is my bug fix.
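As a rough illustration of the leader check described above (an assumed sketch, not the actual proposed patch): a kprobe on do_task_dead can compare the thread id with the thread-group id from bpf_get_current_pid_tgid() and only treat the event as a process exit when the dying task is the thread-group leader. The follow-up comments below explain why such a check is not sufficient on its own.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only (not Tetragon's probe and not the proposed patch):
 * act on do_task_dead() only when the dying task is the thread-group leader,
 * i.e. its thread id equals its thread-group id. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_task_dead")
int exit_leader_only(void *ctx)
{
	__u64 pid_tgid = bpf_get_current_pid_tgid();
	__u32 tgid = pid_tgid >> 32;   /* process (thread group) id */
	__u32 tid  = (__u32)pid_tgid;  /* id of the dying thread    */

	if (tid != tgid)
		return 0;              /* a non-leader thread is dying; skip */

	/* the leader is exiting: emit the exit event / release cache refs here */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```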
CC: @olsajiri
Thanks for the proposed fix, but this could block us from receiving the event at all. If I'm reading it correctly, it will trigger only if the leader is exiting and all the other threads are doing the same (racing on the counter). But there are cases where those other (non-leader) threads may keep running while the leader has actually finished, and in that case we won't send an event.
Yes, I think that's the case. Also, it's still racy with respect to live.counter.
Fixed by #1509.
What happened?
Hey everyone, I'm currently using Tetragon v0.11.0. For resource reasons, I've limited Tetragon's resource allocation to 250m CPU and 400MB memory.
However, when deploying Tetragon on a production cluster, I noticed that memory usage starts at 250MB and continuously increases until the container hits OOM (Out of Memory), which usually takes only about half an hour.
I checked the event generation rate of Tetragon and it's roughly 200 events per second. Given this magnitude, such event traffic shouldn't result in such high memory usage. Upon analyzing the Tetragon events, I found that over 90% of them were either process execve or exit.
On a different test cluster, I tested Tetragon's performance by hooking tcp_connect and driving load with netperf short-lived TCP connections; it ran stably within a 400MB memory allocation at an event rate of 1000-2000 events per second. Hence, this puzzles me.
I've tried versions v0.9, v0.10, and v0.11 of Tetragon; they all seem to have this issue. My cluster host kernel version is 5.4.119.
Can anyone advise on how to tackle this issue?
Below are the pprof heap profile and memory usage:
pprof-heap.zip
profile002.pdf
Tetragon Version
v0.9, v0.10, v0.11
Kernel Version
5.4.119
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.6-tke.21.tkex.1.43+f47562b7e0c24c-dirty", GitCommit:"f47562b7e0c24c2b70a8ee9b17b4b012bb61706f", GitTreeState:"dirty", BuildDate:"2022-09-30T07:39:44Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Bugtool
tetragon-bugtool.tar.gz
Relevant log output
Anything else?
No response