
[solved] Memory usage on in_tail (WAL) #3073

Closed

loburm opened this issue Feb 12, 2021 · 18 comments

@loburm
Contributor

loburm commented Feb 12, 2021

Bug Report

Describe the bug
I have managed to reproduce the issue in both 1.6.10 and 1.7.0-rc5. It also seems to affect 1.5.7, although there I don't see the large growth at the beginning that consumes an additional 5-6 MB. On 1.3.11 I don't see such a big problem (the higher amount of RAM is explained by the initial load the cluster was experiencing).

Screenshots
(screenshot: memory usage across Fluent Bit versions)

Your Environment

  • Version used: 1.6.10 and 1.7.0-rc5.
  • Configuration:

I have removed a few input blocks that didn't receive any logs during the test, but I can provide the full config if needed. Our metrics show that the overall load during that period was around 400 bytes/s. It was mainly coming from three sources: node-problem-detector, kubelet and kube_containers_kube-system.

[SERVICE]
    Flush           5
    Grace           120
    Log_Level       info
    Log_File        /var/log/fluentbit.log
    Daemon          off
    Parsers_File    parsers.conf
    HTTP_Server     On
    storage.metrics On
    HTTP_Listen     0.0.0.0
    HTTP_PORT       2020


[INPUT]
    Name             tail
    Alias            kube_containers_kube-system
    Tag              kube_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*_kube-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[INPUT]
    Name             tail
    Alias            kube_containers
    Tag              kube_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*.log
    Exclude_Path     /var/log/containers/*_kube-system_*.log,/var/log/containers/*_istio-system_*.log,/var/log/containers/*_knative-serving_*.log,/var/log/containers/*_gke-system_*.log,/var/log/containers/*_config-management-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     log
    Reserve_Data True
    Parser       docker
    Parser       containerd

[INPUT]
    Name            tail
    Alias           kube-proxy
    Tag             kube-proxy
    Path            /var/log/kube-proxy.log
    DB              /var/run/google-fluentbit/pos-files/kube-proxy.db
    DB.locking      true
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Read_from_Head  On
    Parser          glog

[INPUT]
    Name            systemd
    Alias           node-problem-detector
    Tag             node-problem-detector
    Systemd_Filter  _SYSTEMD_UNIT=node-problem-detector.service
    Path            /var/log/journal
    DB              /var/run/google-fluentbit/pos-files/node-problem-detector.db
    DB.Sync         Normal
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB

[INPUT]
    Name            systemd
    Alias           kubelet
    Tag             kubelet
    Systemd_Filter  _SYSTEMD_UNIT=kubelet.service
    Path            /var/log/journal
    DB              /var/run/google-fluentbit/pos-files/kubelet.db
    DB.Sync         Normal
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB

[FILTER]
    Name        modify
    Match       *
    Hard_rename log message

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     message
    Reserve_Data True
    Parser       glog
    Parser       json

[OUTPUT]
    Name        http
    Match       *
    Host        127.0.0.1
    Port        2021
    URI         /logs
    header_tag  FLUENT-TAG
    Format      msgpack
    Retry_Limit 2

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

[PARSER]
    Name        containerd
    Format      regex
    Regex       ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

[PARSER]
    Name        json
    Format      json

[PARSER]
    Name        syslog
    Format      regex
    Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
    Time_Key    time
    Time_Format %b %d %H:%M:%S

[PARSER]
    Name        glog
    Format      regex
    Regex       ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
    Time_Key    time
    Time_Format %m%d %H:%M:%S.%L

[PARSER]
    Name        network-log
    Format      json
    Time_Key    timestamp
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
  • Environment name and version (e.g. Kubernetes? What version?): GKE 1.19
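
Since the [SERVICE] section above enables the HTTP server and storage metrics, the built-in monitoring API can be used to cross-check what each aliased input is buffering, separately from the process RSS shown in the screenshot. For example:

    # Per-plugin metrics (records, bytes, errors, retries) for every aliased input/filter/output
    curl -s http://127.0.0.1:2020/api/v1/metrics
    # Storage/chunk information (available because storage.metrics is On)
    curl -s http://127.0.0.1:2020/api/v1/storage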
@edsiper
Member

edsiper commented Feb 13, 2021

hi @loburm

I was digging into this, and the static analyzer reported a leak in the Stackdriver connector, which I just fixed here: 7962ad6 (I assume you tested with the Stackdriver output).

That code was introduced around Nov 2020 (~v1.6.4)... not sure if it's related.

@edsiper
Member

edsiper commented Feb 13, 2021

@loburm can you please try the following new RC image:

fluent/fluent-bit:1.7.0-rc9

@loburm
Contributor Author

loburm commented Feb 14, 2021

Hi Eduardo,
I'll test the latest release candidate, but in our configuration we are not using the Stackdriver connector. All data goes through the http output plugin.

@edsiper
Member

edsiper commented Feb 15, 2021

Is there a way you can remove some filters to identify the possible root cause of the leak?

@loburm
Contributor Author

loburm commented Feb 16, 2021

So far I have run three more tests:

  1. All input plugins except the three that are generating logs are disabled.
  2. Of those three, only the tail input plugin and its corresponding filters are enabled.
  3. Of those three, only the systemd input plugins and their corresponding filters are enabled.

Results:
image

As we can see, the problem should be somewhere in the following config:

[INPUT]
    Name             tail
    Alias            kube_containers_kube-system
    Tag              kube_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*_kube-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     log
    Reserve_Data True
    Parser       docker
    Parser       containerd

[FILTER]
    Name        modify
    Match       *
    Hard_rename log message

[FILTER]
    Name         parser
    Match        kube_*
    Key_Name     message
    Reserve_Data True
    Parser       glog
    Parser       json

I'm going to run a few more tests to figure out which plugin is causing the problem here.

@loburm
Contributor Author

loburm commented Feb 16, 2021

As a side note, I was testing with 1.7.0-rc9.

@loburm
Contributor Author

loburm commented Feb 16, 2021

Last result:

(screenshot: memory usage for each combination)

I have been running different combinations:

  1. tail + parser + modify
  2. tail + parser + parser
  3. tail

In all cases we see a similar pattern, so I assume the memory leak is in the tail plugin. @edsiper, should I try some other options to narrow the problem down further?

@agup006
Member

agup006 commented Feb 19, 2021

Thanks for the repro. My initial thought is that it might be related to Tag_Regex. Any chance you could run tail without it, @loburm? Appreciate your help.
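
Concretely, that would mean dropping Tag_Regex and the templated tag; a sketch (the static Tag value is only illustrative):

[INPUT]
    Name             tail
    Alias            kube_containers_kube-system
    Tag              kube_containers_kube-system
    Path             /var/log/containers/*_kube-system_*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube_kube-system.db
    DB.locking       true
    Read_from_Head   On
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5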

@loburm
Contributor Author

loburm commented Mar 3, 2021

Hi Anurag,
I don't think it's related to Tag_Regex. Here are two runs, first without it and then with it; the memory footprint looks exactly the same.

(screenshot: memory usage with and without Tag_Regex)

@loburm
Contributor Author

loburm commented Mar 4, 2021

Result from the latest test:

(screenshot: memory usage across versions with the minimal config)

As visible from the graph, the regression was introduced just before 1.6.0. I have used the simplest possible configuration to rule out problems with the output plugin:

[INPUT]
    Name             tail
    Alias            kube_containers
    Tag              kube_containers
    Path             /var/log/containers/*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube.db
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    Refresh_Interval 5

[OUTPUT]
    Name        null 
    Match       *
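
One simple way to track the container's resident memory during such a run (not necessarily how the graphs above were produced; pod and namespace names are illustrative, and kubectl top requires metrics-server):

    # Sample the Fluent Bit container's memory every 30 seconds while the test runs
    watch -n 30 kubectl top pod fluent-bit-xxxxx -n kube-system --containers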

@loburm
Contributor Author

loburm commented Mar 6, 2021

I have used a binary-search approach and managed to find the change after which we see the issue: a79bc25

Additional investigation has shown that reverting the line:

PRAGMA journal_mode=WAL;

to:

PRAGMA journal_mode=OFF;

resolves the issue for all Fluent Bit versions. I have cherry-picked, built and run tests for 1.6.0, 1.7.0 and 1.7.2.
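
For anyone reproducing this, the journal mode recorded in an existing position file can be inspected and flipped with the sqlite3 CLI; an illustrative check, assuming Fluent Bit is stopped and using the pos-file path from the config above:

    # Show the journal mode currently stored in the tail position DB
    sqlite3 /var/run/google-fluentbit/pos-files/flb_kube.db "PRAGMA journal_mode;"
    # In WAL mode, extra -wal/-shm files appear next to the DB while it is open
    ls -l /var/run/google-fluentbit/pos-files/
    # Switching back mirrors what the source-level revert above does
    sqlite3 /var/run/google-fluentbit/pos-files/flb_kube.db "PRAGMA journal_mode=OFF;"

Note that Fluent Bit applies its own journal-mode pragma when it reopens the DB, so this only demonstrates the difference; the lasting fix is the source change (or a config option).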

@loburm
Contributor Author

loburm commented Mar 12, 2021

@edsiper what do you think, can we revert that line to fix the issue?

@AkshayDubey29
Contributor

AkshayDubey29 commented Mar 14, 2021

@edsiper My pods are getting into CrashLoopBackOff after some time. When I removed DB.Sync and all DB-related config it worked fine, so I suspect this might be due to the same issue.

a0d00qf@m-c02dj3hfmd6n fluent-bit % kubectl logs -f fluent-bit-lv7h4 -n logging
Fluent Bit v1.6.10

* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/03/14 12:00:48] [engine] caught signal (SIGSEGV)
#0 0x55f10349991e in ???() at ???:0
#1 0x55f103480296 in ???() at ???:0
#2 0x55f1033a63b3 in ???() at ???:0
#3 0x55f1033a6727 in ???() at ???:0
#4 0x55f1033a6866 in ???() at ???:0
#5 0x55f1033a7083 in ???() at ???:0
#6 0x55f1032d7000 in ???() at ???:0
#7 0x55f1032feda9 in ???() at ???:0
#8 0x55f10331ae6b in ???() at ???:0
#9 0x55f10331cb08 in ???() at ???:0
#10 0x55f1033172ea in ???() at ???:0
#11 0x55f1032d6ab6 in ???() at ???:0
#12 0x55f1032e3e57 in ???() at ???:0
#13 0x55f1032d0823 in ???() at ???:0
#14 0x7ff816f41fa2 in ???() at ???:0
#15 0x7ff8165814ce in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0

DB         /tail-db/tail-kube-state.db
DB.Sync    Normal
DB.locking true

@edsiper
Member

edsiper commented Mar 15, 2021

I am testing that specific issue...

@edsiper
Member

edsiper commented Mar 15, 2021

NOTE: this is not a memory leak. It's the extra memory map used by the WAL feature. I will take two actions:

  • Document WAL mode
  • Make WAL mode optional
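
If WAL becomes optional, it would presumably be controllable per input; a sketch of what that could look like, assuming the DB.journal_mode option described in later in_tail documentation (verify the option name and accepted values against the docs for your version):

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    DB              /var/run/google-fluentbit/pos-files/flb_kube.db
    # Assumed option name; WAL trades the extra memory map discussed here for less disk I/O
    DB.journal_mode Off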

@loburm
Contributor Author

loburm commented Mar 15, 2021

Wow, but that's a lot (4MB per input plugin). Thanks for the investigation.

@edsiper
Member

edsiper commented Mar 15, 2021

everything has a price: 4MB for less I/O to disk :)

@edsiper edsiper changed the title Memory leak in Fluent Bit starting from 1.6 [solved] Memory usage on in_tail (WAL) Mar 15, 2021
@edsiper edsiper self-assigned this Mar 15, 2021
@edsiper edsiper closed this as completed Mar 15, 2021