
[solved] Memory usage on in_tail (WAL) #3073

Closed

loburm opened this issue Feb 12, 2021 · 18 comments

@loburm
Contributor

loburm commented Feb 12, 2021

Bug Report

Describe the bug
I have managed to reproduce the issue in both 1.6.10 and 1.7.0-rc5. It also seems to affect 1.5.7, although there I don't see the large growth at the beginning that consumes an additional 5-6 MB. On 1.3.11 I don't see such a big problem (the higher amount of RAM is explained by the initial load the cluster was experiencing).

Screenshots
(screenshot: memory usage across Fluent Bit versions)

Your Environment

  • Version used: 1.6.10 and 1.7.0-rc5.
  • Configuration:

I have removed a few input blocks that didn't receive any logs during the test, but I can provide the full config if needed. Our metrics show that the overall load during that period was around 400 bytes/s. It was mainly coming from three sources: node-problem-detector, kubelet and kube_containers_kube-system.

[SERVICE]
    Flush           5
    Grace           120
    Log_Level       info
    Log_File        /var/log/fluentbit.log
    Daemon          off
    Parsers_File    parsers.conf
    HTTP_Server     On
    storage.metrics On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020


[INPUT]
    Name             tail
    Alias            kube_containers_kube-system
    Tag              kube_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*_kube-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[INPUT]
    Name             tail
    Alias            kube_containers
    Tag              kube_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*.log
    Exclude_Path     /var/log/containers/*_kube-system_*.log,/var/log/containers/*_istio-system_*.log,/var/log/containers/*_knative-serving_*.log,/var/log/containers/*_gke-system_*.log,/var/log/containers/*_config-management-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     log
    Reserve_Data True
    Parser       docker
    Parser       containerd

[INPUT]
    Name            tail
    Alias           kube-proxy
    Tag             kube-proxy
    Path            /var/log/kube-proxy.log
    DB              /var/run/google-fluentbit/pos-files/kube-proxy.db
    DB.locking      true
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Read_from_Head  On
    Parser          glog

[INPUT]
    Name            systemd
    Alias           node-problem-detector
    Tag             node-problem-detector
    Systemd_Filter  _SYSTEMD_UNIT=node-problem-detector.service
    Path            /var/log/journal
    DB              /var/run/google-fluentbit/pos-files/node-problem-detector.db
    DB.Sync         Normal
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB

[INPUT]
    Name            systemd
    Alias           kubelet
    Tag             kubelet
    Systemd_Filter  _SYSTEMD_UNIT=kubelet.service
    Path            /var/log/journal
    DB              /var/run/google-fluentbit/pos-files/kubelet.db
    DB.Sync         Normal
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB

[FILTER]
    Name        modify
    Match       *
    Hard_rename log message

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     message
    Reserve_Data True
    Parser       glog
    Parser       json

[OUTPUT]
    Name        http
    Match       *
    Host        127.0.0.1
    Port        2021
    URI         /logs
    header_tag  FLUENT-TAG
    Format      msgpack
    Retry_Limit 2

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

[PARSER]
    Name        containerd
    Format      regex
    Regex       ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

[PARSER]
    Name        json
    Format      json

[PARSER]
    Name        syslog
    Format      regex
    Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
    Time_Key    time
    Time_Format %b %d %H:%M:%S

[PARSER]
    Name        glog
    Format      regex
    Regex       ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
    Time_Key    time
    Time_Format %m%d %H:%M:%S.%L

[PARSER]
    Name        network-log
    Format      json
    Time_Key    timestamp
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
  • Environment name and version (e.g. Kubernetes? What version?): GKE 1.19
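
Since the [SERVICE] section above enables the HTTP server and storage metrics, the built-in monitoring API can be used to cross-check what each aliased input is buffering, separately from the process RSS shown in the screenshot. For example:

    # Per-plugin metrics (records, bytes, errors, retries) for every aliased input/filter/output
    curl -s http://127.0.0.1:2020/api/v1/metrics
    # Storage/chunk information (available because storage.metrics is On)
    curl -s http://127.0.0.1:2020/api/v1/storage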
@edsiper
Member

edsiper commented Feb 13, 2021

hi @loburm

I was digging into this, and the static analyzer reported a leak in the Stackdriver connector, which I just fixed here: 7962ad6 (I assume you tested with the Stackdriver output).

That code was introduced around Nov 2020 (~v1.6.4)... not sure if it's related.

@edsiper
Member

edsiper commented Feb 13, 2021

@loburm can you please try the following new RC image:

fluent/fluent-bit:1.7.0-rc9

@loburm
Contributor Author

loburm commented Feb 14, 2021

Hi Eduardo,
I'll test the latest release candidate, but in our configuration we are not using the Stackdriver connector. All data goes through the http output plugin.

@edsiper
Member

edsiper commented Feb 15, 2021

Is there a way you can remove some filters to identify the possible root cause of the leak?

@loburm
Contributor Author

loburm commented Feb 16, 2021

So far I have run three more tests:

  1. All input plugins except the three that are generating logs are disabled.
  2. Of those three, only the tail input plugin and its corresponding filters are enabled.
  3. Of those three, only the systemd input plugins and their corresponding filters are enabled.

Results:
image

As we can see, the problem should be somewhere in the following config:

[INPUT]
    Name             tail
    Alias            kube_containers_kube-system
    Tag              kube_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*_kube-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     log
    Reserve_Data True
    Parser       docker
    Parser       containerd

[FILTER]
    Name        modify
    Match       *
    Hard_rename log message

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     message
    Reserve_Data True
    Parser       glog
    Parser       json

I'm going to run a few more tests to figure out which plugin is causing the problem here.

@loburm
Contributor Author

loburm commented Feb 16, 2021

As a side note, I was testing with 1.7.0-rc9.

@loburm
Contributor Author

loburm commented Feb 16, 2021

Last result:

(screenshot: memory usage for each combination)

I have been running different combinations:

  1. tail + parser + modify
  2. tail + parser + parser
  3. tail

In all cases we see a similar pattern, so I assume the memory leak is in the tail plugin. @edsiper, should I try some other options to narrow the problem down further?

@agup006
Member

agup006 commented Feb 19, 2021

Thanks for the repro. My initial thought is that it might be related to Tag_Regex. Any chance you could run tail without it, @loburm? Appreciate your help.
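
Concretely, that would mean dropping Tag_Regex and the templated tag; a sketch (the static Tag value is only illustrative):

[INPUT]
    Name             tail
    Alias            kube_containers_kube-system
    Tag              kube_containers_kube-system
    Path             /var/log/containers/*_kube-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5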

@loburm
Contributor Author

loburm commented Mar 3, 2021

Hi Anurag,
I don't think it's related to Tag_Regex. Here are two runs, first without it and then with it; the memory footprint looks exactly the same.

(screenshot: memory usage with and without Tag_Regex)

@loburm
Contributor Author

loburm commented Mar 4, 2021

Result from the latest test:

(screenshot: memory usage across versions with the minimal config)

As visible from the graph, the regression was introduced just before 1.6.0. I have used the simplest possible configuration to rule out problems with the output plugin:

[INPUT]
    Name             tail
    Alias            kube_containers
    Tag              kube_containers
    Path             /var/log/containers/*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube.db
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[OUTPUT]
    Name        null 
    Match       *
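
One simple way to track the container's resident memory during such a run (not necessarily how the graphs above were produced; pod and namespace names are illustrative, and kubectl top requires metrics-server):

    # Sample the Fluent Bit container's memory every 30 seconds while the test runs
    watch -n 30 kubectl top pod fluent-bit-xxxxx -n kube-system --containers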

@loburm
Contributor Author

loburm commented Mar 6, 2021

I have used a binary-search approach and managed to find the change after which we see the issue: a79bc25

Additional investigation has shown that reverting the line:

PRAGMA journal_mode=WAL;

to:

PRAGMA journal_mode=OFF;

resolves the issue for all Fluent Bit versions. I have cherry-picked, built and run tests for 1.6.0, 1.7.0 and 1.7.2.
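
For anyone reproducing this, the journal mode recorded in an existing position file can be inspected and flipped with the sqlite3 CLI; an illustrative check, assuming Fluent Bit is stopped and using the pos-file path from the config above:

    # Show the journal mode currently stored in the tail position DB
    sqlite3 /var/run/google-fluentbit/pos-files/flb_kube.db "PRAGMA journal_mode;"
    # In WAL mode, extra -wal/-shm files appear next to the DB while it is open
    ls -l /var/run/google-fluentbit/pos-files/
    # Switching back mirrors what the source-level revert above does
    sqlite3 /var/run/google-fluentbit/pos-files/flb_kube.db "PRAGMA journal_mode=OFF;"

Note that Fluent Bit applies its own journal-mode pragma when it reopens the DB, so this only demonstrates the difference; the lasting fix is the source change (or a config option).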

@loburm
Contributor Author

loburm commented Mar 12, 2021

@edsiper what do you think, can we revert that line to fix the issue?

@AkshayDubey29
Contributor

AkshayDubey29 commented Mar 14, 2021

@edsiper My pods are getting into CrashLoopBackOff after some time. When I removed DB.Sync and all DB-related config it worked fine, so I suspect this might be due to the same issue.

a0d00qf@m-c02dj3hfmd6n fluent-bit % kubectl logs -f fluent-bit-lv7h4 -n logging
Fluent Bit v1.6.10

* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/03/14 12:00:48] [engine] caught signal (SIGSEGV)
#0 0x55f10349991e in ???() at ???:0
#1 0x55f103480296 in ???() at ???:0
#2 0x55f1033a63b3 in ???() at ???:0
#3 0x55f1033a6727 in ???() at ???:0
#4 0x55f1033a6866 in ???() at ???:0
#5 0x55f1033a7083 in ???() at ???:0
#6 0x55f1032d7000 in ???() at ???:0
#7 0x55f1032feda9 in ???() at ???:0
#8 0x55f10331ae6b in ???() at ???:0
#9 0x55f10331cb08 in ???() at ???:0
#10 0x55f1033172ea in ???() at ???:0
#11 0x55f1032d6ab6 in ???() at ???:0
#12 0x55f1032e3e57 in ???() at ???:0
#13 0x55f1032d0823 in ???() at ???:0
#14 0x7ff816f41fa2 in ???() at ???:0
#15 0x7ff8165814ce in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0

DB         /tail-db/tail-kube-state.db
DB.Sync    Normal
DB.locking true

@edsiper
Member

edsiper commented Mar 15, 2021

I am testing that specific issue...

@edsiper
Member

edsiper commented Mar 15, 2021

NOTE: this is not a memory leak. It's the extra memory map used by the WAL feature. I will take two actions:

  • Document WAL mode
  • Make WAL mode optional
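
If WAL becomes optional, it would presumably be controllable per input; a sketch of what that could look like, assuming the DB.journal_mode option described in later in_tail documentation (verify the option name and accepted values against the docs for your version):

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    DB              /var/run/google-fluentbit/pos-files/flb_kube.db
    # Assumed option name; WAL trades the extra memory map discussed here for less disk I/O
    DB.journal_mode Off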

@loburm
Contributor Author

loburm commented Mar 15, 2021

Wow, but that's a lot (4MB per input plugin). Thanks for the investigation.

@edsiper
Member

edsiper commented Mar 15, 2021

everything has a price: 4MB for less I/O to disk :)

@edsiper edsiper changed the title Memory leak in Fluent Bit starting from 1.6 [solved] Memory usage on in_tail (WAL) Mar 15, 2021
@edsiper edsiper self-assigned this Mar 15, 2021
@edsiper edsiper closed this as completed Mar 15, 2021