Persistent memory escalation within Tetragon leads to Out-of-Memory errors #1485
Comments
Hi, thanks for the report and the detailed information! I see that you are deploying a number of tracing policies. Does the issue appear if you don't use any policies at all? If not, is it possible to pinpoint a single policy that leads to this issue? Also, as far as I understand, the issue does not appear on a different cluster with the same Tetragon configuration? Could you please provide the same information from the cluster where the issue does not exist?
@kkourt Hi, I've noticed that Tetragon's memory in the test cluster is also constantly increasing, just not as rapidly as in the production environment, and in this environment I haven't added any resource limits. Here's the bugtool file from the test cluster: Is there any data available on Tetragon's resource usage over long runs? Its memory usage seems to increase steadily the longer it runs. Here is the trend chart showing the increase in memory usage.
Thank you for the info!
One known cause of increasing memory footprint was having pod labels as a dimension in Prometheus metrics. That issue was recently addressed in #1279, which is part of 0.11 (see 16a9408). I'm not aware of any other causes of increasing memory (especially if there are no policies loaded). Could you also check with metrics disabled and see if the issue persists?
@kkourt Thank you for your reply. I have removed the CPU resource constraints and disabled metrics: no resource limits, no TracingPolicy, and no metrics. Unfortunately, during testing I found that memory usage still increases gradually. I believe this indicates an issue within Tetragon, possibly a memory leak. Below is the chart of memory usage over the 19 hours after metrics were disabled. As you can see, memory is still slowly increasing. The tetragon bugtool is:
@kkourt I discovered that this issue was due to some processes that were never garbage collected, leading to a continuous increase in the number of entries in the process cache. The original process PIDs of all non-GC'd processes were 1, and their reference count values typically ranged from 4294967200 to 4294967295, which I believe is because the reference counter undergoes more Dec() than Inc() operations, so the unsigned counter wraps around and never reaches zero. As a result the process is skipped during garbage collection, eventually leading to this problem. After modifying the RefDec function, memory no longer increased and the processes were garbage collected successfully; the code is as follows. However, there are some pieces of code I'm struggling to understand, and I have yet to identify the cause of the superfluous Dec() on the reference count. I would appreciate your assistance in investigating this further.
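As a sketch of the idea (illustrative only, not the exact RefDec change, and assuming the reference counter behaves like an unsigned 32-bit integer, which the observed values near 2^32 suggest): one extra Dec() on a counter that is already zero wraps it to 4294967295, so a zero-refcount GC check never matches; a guarded decrement avoids the wrap.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical guarded decrement: skip the Dec() when the counter is already
 * zero instead of letting it wrap around to ~2^32. */
static uint32_t ref_dec(uint32_t *refcnt)
{
	if (*refcnt == 0)
		return 0;          /* already released; ignore the extra Dec() */
	return --(*refcnt);
}

int main(void)
{
	uint32_t refcnt = 0;

	refcnt--;                                   /* unguarded extra Dec()  */
	printf("wrapped: %u\n", refcnt);            /* prints 4294967295      */

	refcnt = 0;
	printf("guarded: %u\n", ref_dec(&refcnt));  /* stays at 0             */
	return 0;
}
```

Guarding the decrement only hides the symptom, of course; the open question is still where the superfluous Dec() comes from.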
Hello @Jay-JHe, thanks for looking into that. What workload does the cluster run? More specifically, are new pods/containers continuously created and destroyed, or do the same pods/containers run for long periods of time? Thanks!
@tpapagian Yes, there are some pods in the cluster that are constantly being restarted and recreated. In addition, some Tetragon pods fail to start because the host kernel version is older than 5.4, and they are also constantly restarted and recreated.
OK, thanks, that makes sense! I will try to reproduce the issue locally and let you know of any updates.
Thanks @Jay-JHe for those. We have already identified one cause (#1507 (comment)); we will try to fix it and see whether it is the only one.
@tpapagian @tixxdz I think this issue is due to improper concurrency handling. When hooking do_task_dead, it's necessary to check whether the current task is the main process. Here is my bug fix.
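As a rough illustration of the leader check described above (an assumed sketch, not the actual proposed patch): a kprobe on do_task_dead can compare the thread id with the thread-group id from bpf_get_current_pid_tgid() and only treat the event as a process exit when the dying task is the thread-group leader. The follow-up comments below explain why such a check is not sufficient on its own.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only (not Tetragon's probe and not the proposed patch):
 * act on do_task_dead() only when the dying task is the thread-group leader,
 * i.e. its thread id equals its thread-group id. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_task_dead")
int exit_leader_only(void *ctx)
{
	__u64 pid_tgid = bpf_get_current_pid_tgid();
	__u32 tgid = pid_tgid >> 32;   /* process (thread group) id */
	__u32 tid  = (__u32)pid_tgid;  /* id of the dying thread    */

	if (tid != tgid)
		return 0;              /* a non-leader thread is dying; skip */

	/* the leader is exiting: emit the exit event / release cache refs here */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```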
CC: @olsajiri
Thanks for the proposed fix, but this could block us from receiving the event at all. If I'm reading it correctly, it will trigger only if the leader is exiting and all the other threads are doing the same (racing on the counter). But there are cases where those other (non-leader) threads may keep running while the leader has actually finished, and in that case we won't send an event.
Yes, I think that's the case. Also, it's still racy with respect to live.counter.
Fixed by #1509.
What happened?
Hey everyone, I'm currently using Tetragon v0.11.0. For resource reasons, I've limited Tetragon's resource allocation to 250m CPU and 400MB memory.
However, when deploying Tetragon on a production cluster, I noticed that memory usage starts at 250MB and continuously increases until the container hits OOM (Out of Memory), which usually takes only about half an hour.
I checked the event generation rate of Tetragon and it's roughly 200 events per second. Given this magnitude, such event traffic shouldn't result in such high memory usage. Upon analyzing the Tetragon events, I found that over 90% of them were either process execve or exit.
On a different test cluster, I tested Tetragon's performance by hooking tcp_connect and driving load with netperf short-lived TCP connections; it ran stably within a 400MB memory allocation at an event rate of 1000-2000 events per second. Hence, this puzzles me.
I've tried versions v0.9, v0.10, and v0.11 of Tetragon; they all seem to have this issue. My cluster host kernel version is 5.4.119.
Can anyone advise on how to tackle this issue?
Below are the pprof heap profile and memory usage:
pprof-heap.zip
profile002.pdf
Tetragon Version
v0.9, v0.10, v0.11
Kernel Version
5.4.119
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.6-tke.21.tkex.1.43+f47562b7e0c24c-dirty", GitCommit:"f47562b7e0c24c2b70a8ee9b17b4b012bb61706f", GitTreeState:"dirty", BuildDate:"2022-09-30T07:39:44Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Bugtool
tetragon-bugtool.tar.gz
Relevant log output
Anything else?
No response