
Loki docker driver: sporadic high CPU usage #3319

Closed
glebsam opened this issue Feb 10, 2021 · 18 comments · Fixed by #3566

Comments

@glebsam

glebsam commented Feb 10, 2021

Description

Sometimes I observe high CPU usage by /bin/docker-driver that is not correlated with the amount of logs flowing through it. CPU usage is tracked via htop. The only thing that consistently helps is restarting the Docker engine. By "high CPU usage" I mean 20-40% instead of the usual 1-9%.

To Reproduce

No reliable steps to reproduce. A possible trigger is a temporarily increased (and then decreased) volume of logs.

Environment:

We use Amazon Linux v1 and v2 (both affected). The instance type is t3a.small. The Docker driver is used on an ECS instance, with the following settings in /etc/docker/daemon.json:

{
  "log-opts": {
    "loki-url": "http://loki:3100/loki/api/v1/push",
    "no-file": "true",
    "labels": "com.amazonaws.ecs.cluster,com.amazonaws.ecs.container-name,com.amazonaws.ecs.task-definition-family,com.amazonaws.ecs.task-definition-version",
    "loki-relabel-config": "- {action: labelmap, regex: com_amazonaws_(.+)}\n- {action: labelkeep, regex: (host|image_id|image_name|ecs_cluster|ecs_container_name|ecs_task_definition_family|ecs_task_definition_version)}\n",
    "max-buffer-size": "100m",
    "mode": "non-blocking",
    "loki-external-labels": "image_id={{.ImageID}},image_name={{.ImageName}}"
  },
  "log-driver": "loki"
}

Additional info

I understand that this issue is hard to solve without concrete steps to reproduce. If you can, please describe what actions I should take to help you track down the root cause the next time I notice it (and I bet I will; I have seen it more than twice over the past three months).

@glebsam
Author

glebsam commented Feb 10, 2021

Found a couple of affected instances. Inspected pid 2810 (see screenshot) with strace. Hope it helps.

[htop screenshot showing /bin/docker-driver CPU usage for pid 2810]

# strace -f -c -p 2810
Process 2810 attached with 11 threads
^CProcess 2810 detached
Process 2824 detached
Process 2825 detached
Process 2826 detached
Process 2827 detached
Process 2994 detached
Process 2995 detached
Process 578 detached
Process 16488 detached
Process 25690 detached
Process 21758 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.78  113.265659         259    436753     57080 futex
 20.05   31.630500         168    188246         2 epoll_pwait
  7.50   11.826787          57    205964           nanosleep
  0.46    0.732902          52     14125           write
  0.21    0.329093          23     14415       104 read
  0.00    0.002044          37        55           tgkill
  0.00    0.001639          30        55         2 rt_sigreturn
  0.00    0.000962          17        55           getpid
  0.00    0.000168          17        10           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00  157.789754                859678     57188 total

Docker plugin inspect:

[
    {
        "Config": {
            "Args": {
                "Description": "",
                "Name": "",
                "Settable": null,
                "Value": null
            },
            "Description": "Loki Logging Driver",
            "DockerVersion": "17.09.0-ce",
            "Documentation": "https://github.com/grafana/loki",
            "Entrypoint": [
                "/bin/docker-driver"
            ],
            "Env": [
                {
                    "Description": "Set log level to output for plugin logs",
                    "Name": "LOG_LEVEL",
                    "Settable": [
                        "value"
                    ],
                    "Value": "info"
                }
            ],
            "Interface": {
                "Socket": "loki.sock",
                "Types": [
                    "docker.logdriver/1.0"
                ]
            },
            "IpcHost": false,
            "Linux": {
                "AllowAllDevices": false,
                "Capabilities": null,
                "Devices": null
            },
            "Mounts": null,
            "Network": {
                "Type": "host"
            },
            "PidHost": false,
            "PropagatedMount": "",
            "User": {},
            "WorkDir": "",
            "rootfs": {
                "diff_ids": [
                    "sha256:712bd6e70d0e97729b18f19d6a59069cd88262f41deeca5eaadc48a973a756f5"
                ],
                "type": "layers"
            }
        },
        "Enabled": true,
        "Id": "ae3b954388a6478908b680e2d44e0e99bc4523ce0e68c46c206b807199d02c41",
        "Name": "loki:latest",
        "PluginReference": "docker.io/grafana/loki-docker-driver:2.0.0",
        "Settings": {
            "Args": [],
            "Devices": [],
            "Env": [
                "LOG_LEVEL=info"
            ],
            "Mounts": []
        }
    }
]

Current traffic from the instance per minute:

[traffic graph]

Traffic from the instance for the last 7 days per minute:

[traffic graph]

So this is yet another case of high CPU usage after a large volume of logs.

@glebsam
Author

glebsam commented Feb 10, 2021

We are using non-blocking mode; maybe the issue starts to reproduce after the buffer size is exceeded?
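For context: Docker's non-blocking log mode decouples the container from the log driver with a bounded in-memory buffer (the max-buffer-size option, 100m in the config above); when the buffer is full, messages are dropped rather than back-pressuring the container. A rough sketch of the idea (not Docker's actual implementation; the real eviction policy may differ):

package main

import "fmt"

// ringBuffer is a rough sketch of the idea behind Docker's non-blocking log
// mode: writes from the container never block; when the bounded buffer is
// full, the message is simply dropped (this sketch drops the newest entry;
// Docker's actual eviction policy may differ).
type ringBuffer struct {
    entries chan string
}

func newRingBuffer(size int) *ringBuffer {
    return &ringBuffer{entries: make(chan string, size)}
}

// Enqueue never blocks the caller (the container's stdout/stderr pipe).
func (r *ringBuffer) Enqueue(msg string) {
    select {
    case r.entries <- msg:
    default: // buffer full: drop instead of applying back pressure
    }
}

func main() {
    buf := newRingBuffer(2)
    for i := 0; i < 4; i++ {
        buf.Enqueue(fmt.Sprintf("line %d", i)) // lines 2 and 3 are dropped
    }
    fmt.Println(len(buf.entries), "of 4 lines buffered") // prints: 2 of 4 lines buffered
}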

@cyriltovena
Contributor

I'm looking at adding more ways to figure out what's going on; testing it right now.

@cyriltovena
Contributor

cyriltovena commented Feb 12, 2021

Alright, I just pushed this plugin: grafana/loki-docker-driver:master-8a9b94a-WIP.

Install it, and then enable pprof by doing this:

❯ docker plugin set {plugin_name} PPROF_PORT=6060

This will open a port on the plugin that will let you grab a CPU and a memory profile, which I can use to do my detective work and find out how this is happening.

All you need to do next is grab the 2 profiles and send them to me:

curl localhost:6060/debug/pprof/profile?seconds=20 -o cpu.pprof
curl localhost:6060/debug/pprof/heap -o heap.pprof

Obviously do that when the problem occurs.

Let me know!
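For reference, wiring a PPROF_PORT switch like this into a Go plugin usually amounts to conditionally starting the standard net/http/pprof handlers; a minimal sketch under that assumption (not the actual source of the WIP build):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    "os"
)

// startPprofIfConfigured exposes Go's standard pprof endpoints when
// PPROF_PORT is set (e.g. via `docker plugin set loki PPROF_PORT=6060`),
// so profiles can be fetched with curl as shown above.
func startPprofIfConfigured() {
    port := os.Getenv("PPROF_PORT")
    if port == "" {
        return
    }
    go func() {
        if err := http.ListenAndServe(":"+port, nil); err != nil {
            log.Printf("pprof server stopped: %v", err)
        }
    }()
}

func main() {
    startPprofIfConfigured()
    select {} // stand-in for the plugin's real work
}

The resulting files can then be inspected locally with go tool pprof cpu.pprof.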

@glebsam
Author

glebsam commented Feb 12, 2021

@cyriltovena got an error while upgrading the plugin following the official documentation:

[root@ip-10-0-6-38 ec2-user]# docker plugin disable loki --force
loki
[root@ip-10-0-6-38 ec2-user]# docker plugin upgrade loki grafana/loki-docker-driver:master-8a9b94a-WIP --grant-all-permissions
Upgrading plugin loki:latest from grafana/loki-docker-driver:2.0.0 to grafana/loki-docker-driver:master-8a9b94a-WIP
Plugin images do not match, are you sure? [y/N] y
master-8a9b94a-WIP: Pulling from grafana/loki-docker-driver
92f8bb3e063f: Download complete
layers from manifest don't match image configuration
[root@ip-10-0-6-38 ec2-user]# docker plugin ls
ID                  NAME                DESCRIPTION           ENABLED
ae3b954388a6        loki:latest         Loki Logging Driver   false

docker plugin inspect still shows me the old Id: ae3b954388a6478908b680e2d44e0e99bc4523ce0e68c46c206b807199d02c41

@cyriltovena
Contributor

Not sure; can you uninstall and reinstall?

@till
Contributor

till commented Mar 15, 2021

I had the same problem, but couldn't install your driver (due to lack of time).

I straced the process (with -f to follow forks), and it seemed like the reason for the high CPU usage was constant connection timeouts, which I couldn't pin down exactly due to time constraints. Is there anything you think an strace would uncover? I probably have other nodes exhibiting this behavior.

@till
Contributor

till commented Mar 15, 2021

Did a little more tracing. It seems that at times the DNS resolution fails, but it doesn't seem to be anything I can verify outside of the Loki code. DNS generally works, but it fails when logs are written to the remote Loki (from the Docker plugin). This is a somewhat older version (1.5.0) of the plugin, though. I guess the best course of action is to update the plugin first?

@till
Contributor

till commented Mar 15, 2021

Leaving this here for now:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0      6325      1736 read
  0.00    0.000000           0      2530           write
  0.00    0.000000           0       301           close
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0    462140           sched_yield
  0.00    0.000000           0    295916           nanosleep
  0.00    0.000000           0       291           socket
  0.00    0.000000           0       291        97 connect
  0.00    0.000000           0       291           getsockname
  0.00    0.000000           0       291           getpeername
  0.00    0.000000           0       679           setsockopt
  0.00    0.000000           0        97           getsockopt
  0.00    0.000000           0         2           uname
  0.00    0.000000           0         4           fcntl
  0.00    0.000000           0   3616592   2745443 futex
  0.00    0.000000           0         4           getdents64
  0.00    0.000000           0         6         5 restart_syscall
  0.00    0.000000           0       602        16 epoll_ctl
  0.00    0.000000           0        10           openat
  0.00    0.000000           0         2           mkdirat
  0.00    0.000000           0        27         2 newfstatat
  0.00    0.000000           0        10         6 unlinkat
  0.00    0.000000           0      7070           epoll_pwait
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000               4393485   2747305 total

@till
Contributor

till commented Mar 16, 2021

I looked at this again with a fresh set of 👀. It's not DNS resolution; it seems to be the actual connects that fail. The host is resolved to an IP, and then the connect attempts happen again and again, and again. A restart of Docker temporarily fixes it, and then eventually it happens again.

Any suggestions for what else to look at?

@glebsam
Author

glebsam commented Mar 18, 2021

@till Sounds scary! Did you notice any loss of logs on the Loki server?

@till
Contributor

till commented Mar 18, 2021

@glebsam not really, this is a bit of an edge case. The customer is mostly using an old ELK stack and the plugin is rarely used, so I'm not sure it's usage-related. It's also CentOS 7, which is not a great base anyway, so maybe this will go away with a new system/kernel/everything.

@glebsam
Author

glebsam commented Mar 25, 2021

I was able to install the driver with the loki:2.2.0 tag and gather statistics from an instance that is currently suffering from high CPU usage of the Docker driver (but of the previous version). I ran a simple application that does nothing except generate 3,000 log records per second. Please take a look:
loki_log_driver_pprof_github_3319.zip

To be clear, here is what I did:

docker plugin install grafana/loki-docker-driver:2.2.0 \
    --alias loki:2.2.0 \
    --grant-all-permissions

docker plugin disable loki:2.2.0
docker plugin set loki:2.2.0 PPROF_PORT=6060
docker plugin enable loki:2.2.0

docker run -d --rm -it \
    --name loki-test \
    --log-driver=loki:2.2.0 \
    --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
    --log-opt no-file=true \
    --log-opt max-buffer-size=100m \
    --log-opt mode=non-blocking \
    loggen:latest -log-mps 3_000

curl localhost:6060/debug/pprof/heap -o heap.pprof
curl localhost:6060/debug/pprof/profile?seconds=20 -o cpu.pprof
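For completeness, loggen is just a throwaway load generator local to my setup; something along these lines is enough to produce the test load (a hypothetical sketch, not the actual image's source):

package main

import (
    "fmt"
    "time"
)

// A rough sketch of a load generator like the loggen container above (not its
// actual source): print a fixed number of log lines per second to stdout,
// which the configured Docker log driver then has to ship.
func main() {
    const linesPerSecond = 3000 // roughly what `-log-mps 3_000` asks for
    ticker := time.NewTicker(time.Second / linesPerSecond)
    defer ticker.Stop()

    for i := 0; ; i++ {
        <-ticker.C
        fmt.Printf("%s loggen line %d\n", time.Now().Format(time.RFC3339Nano), i)
    }
}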

@cyriltovena
Contributor

I think I have an idea of what's going on. I'm going to send a custom build; let's see.

@cyriltovena
Contributor

I should have jumped straight on that futex. I'm sorry it took me so long to react; I have good hope this is going to be fixed.

@cyriltovena
Contributor

You can try it out with this one: grafana/loki-docker-driver:close-ticker-eff4ff1-WIP

Sending the PR too.

cyriltovena added a commit to cyriltovena/loki that referenced this issue Mar 31, 2021
I think this was only impacting the docker driver, which starts/stops for each newly followed container.
My assumption is that this would slowly build up and eat CPU over time.

Fixes grafana#3319

I arrived at this conclusion since strace was showing a crazy number of futex syscalls and nothing else seemed to be the cause.

:pray:

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
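Judging by the close-ticker tag above and the commit message, the suspected leak is a per-container ticker that is never stopped. A minimal sketch of that pattern and the corresponding fix (an illustration, not the actual patch):

package main

import (
    "context"
    "time"
)

// followContainer sketches the suspected pattern: the driver starts a loop
// with its own ticker for every container it follows. If the ticker is never
// stopped, each started-and-stopped container leaves a runtime timer firing
// forever, which slowly builds up and shows in strace as futex/nanosleep churn.
func followContainer(ctx context.Context, flush func()) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop() // the fix: release the ticker when the container goes away

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            flush()
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    followContainer(ctx, func() {}) // returns after 3s without leaking the ticker
}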
cyriltovena added a commit that referenced this issue Mar 31, 2021
@till
Contributor

till commented Apr 1, 2021

@cyriltovena thanks for your prompt work on this! I'm looking at updating a few nodes next week. Is a 2.2.1 coming soon?

@cyriltovena
Contributor

@slim-bean can you include that in 2.2.1 please?

slim-bean pushed a commit that referenced this issue Apr 6, 2021
(cherry picked from commit ecca5d3)