
Loki docker driver: sporadic high CPU usage #3319

Closed
glebsam opened this issue Feb 10, 2021 · 18 comments · Fixed by #3566

Comments

@glebsam

glebsam commented Feb 10, 2021

Description

Sometimes I observe high CPU usage by /bin/docker-driver that is not correlated with the amount of logs flowing through it. CPU usage is tracked via htop. The only thing that consistently helps is restarting the Docker engine. By "high CPU usage" I mean 20-40% instead of the usual 1-9%.

To Reproduce

No reliable steps to reproduce. A possible trigger is a temporarily increased (and then decreased) volume of logs.

Environment:

We use Amazon Linux v1 and v2 (both affected). The instance type is t3a.small. The Docker driver is used on an ECS instance, with the following settings in /etc/docker/daemon.json:

{
  "log-opts": {
    "loki-url": "http://loki:3100/loki/api/v1/push",
    "no-file": "true",
    "labels": "com.amazonaws.ecs.cluster,com.amazonaws.ecs.container-name,com.amazonaws.ecs.task-definition-family,com.amazonaws.ecs.task-definition-version",
    "loki-relabel-config": "- {action: labelmap, regex: com_amazonaws_(.+)}\n- {action: labelkeep, regex: (host|image_id|image_name|ecs_cluster|ecs_container_name|ecs_task_definition_family|ecs_task_definition_version)}\n",
    "max-buffer-size": "100m",
    "mode": "non-blocking",
    "loki-external-labels": "image_id={{.ImageID}},image_name={{.ImageName}}"
  },
  "log-driver": "loki"
}

Additional info

I understand that this issue is hard to solve without concrete steps to reproduce. If you can, please describe what actions I should take to help you track down the root cause the next time I notice it (and I bet I will; I have seen it more than twice over the past three months).

@glebsam
Author

glebsam commented Feb 10, 2021

Found a couple of affected instances. Inspected pid 2810 (see screenshot) with strace. Hope it helps.

[htop screenshot showing /bin/docker-driver CPU usage for pid 2810]

# strace -f -c -p 2810
Process 2810 attached with 11 threads
^CProcess 2810 detached
Process 2824 detached
Process 2825 detached
Process 2826 detached
Process 2827 detached
Process 2994 detached
Process 2995 detached
Process 578 detached
Process 16488 detached
Process 25690 detached
Process 21758 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.78  113.265659         259    436753     57080 futex
 20.05   31.630500         168    188246         2 epoll_pwait
  7.50   11.826787          57    205964           nanosleep
  0.46    0.732902          52     14125           write
  0.21    0.329093          23     14415       104 read
  0.00    0.002044          37        55           tgkill
  0.00    0.001639          30        55         2 rt_sigreturn
  0.00    0.000962          17        55           getpid
  0.00    0.000168          17        10           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00  157.789754                859678     57188 total

Docker plugin inspect:

[
    {
        "Config": {
            "Args": {
                "Description": "",
                "Name": "",
                "Settable": null,
                "Value": null
            },
            "Description": "Loki Logging Driver",
            "DockerVersion": "17.09.0-ce",
            "Documentation": "https://github.com/grafana/loki",
            "Entrypoint": [
                "/bin/docker-driver"
            ],
            "Env": [
                {
                    "Description": "Set log level to output for plugin logs",
                    "Name": "LOG_LEVEL",
                    "Settable": [
                        "value"
                    ],
                    "Value": "info"
                }
            ],
            "Interface": {
                "Socket": "loki.sock",
                "Types": [
                    "docker.logdriver/1.0"
                ]
            },
            "IpcHost": false,
            "Linux": {
                "AllowAllDevices": false,
                "Capabilities": null,
                "Devices": null
            },
            "Mounts": null,
            "Network": {
                "Type": "host"
            },
            "PidHost": false,
            "PropagatedMount": "",
            "User": {},
            "WorkDir": "",
            "rootfs": {
                "diff_ids": [
                    "sha256:712bd6e70d0e97729b18f19d6a59069cd88262f41deeca5eaadc48a973a756f5"
                ],
                "type": "layers"
            }
        },
        "Enabled": true,
        "Id": "ae3b954388a6478908b680e2d44e0e99bc4523ce0e68c46c206b807199d02c41",
        "Name": "loki:latest",
        "PluginReference": "docker.io/grafana/loki-docker-driver:2.0.0",
        "Settings": {
            "Args": [],
            "Devices": [],
            "Env": [
                "LOG_LEVEL=info"
            ],
            "Mounts": []
        }
    }
]

Current traffic from the instance per minute:

[traffic graph]

Traffic from the instance for the last 7 days per minute:

[traffic graph]

So this is yet another case of high CPU usage after a large volume of logs.

@glebsam
Author

glebsam commented Feb 10, 2021

We are using non-blocking mode; maybe the issue starts to reproduce after the buffer size is exceeded?
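For context: Docker's non-blocking log mode decouples the container from the log driver with a bounded in-memory buffer (the max-buffer-size option, 100m in the config above); when the buffer is full, messages are dropped rather than back-pressuring the container. A rough sketch of the idea (not Docker's actual implementation; the real eviction policy may differ):

package main

import "fmt"

// ringBuffer is a rough sketch of the idea behind Docker's non-blocking log
// mode: writes from the container never block; when the bounded buffer is
// full, the message is simply dropped (this sketch drops the newest entry;
// Docker's actual eviction policy may differ).
type ringBuffer struct {
    entries chan string
}

func newRingBuffer(size int) *ringBuffer {
    return &ringBuffer{entries: make(chan string, size)}
}

// Enqueue never blocks the caller (the container's stdout/stderr pipe).
func (r *ringBuffer) Enqueue(msg string) {
    select {
    case r.entries <- msg:
    default: // buffer full: drop instead of applying back pressure
    }
}

func main() {
    buf := newRingBuffer(2)
    for i := 0; i < 4; i++ {
        buf.Enqueue(fmt.Sprintf("line %d", i)) // lines 2 and 3 are dropped
    }
    fmt.Println(len(buf.entries), "of 4 lines buffered") // prints: 2 of 4 lines buffered
}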

@cyriltovena
Contributor

I'm looking at adding more ways to figure out what's going on; testing it right now.

@cyriltovena
Contributor

cyriltovena commented Feb 12, 2021

Alright, I just pushed this plugin: grafana/loki-docker-driver:master-8a9b94a-WIP.

Install it, and then enable pprof by doing this:

❯ docker plugin set {plugin_name} PPROF_PORT=6060

This will open a port on the plugin that will let you grab a CPU and a memory profile, which I can use to do my detective work and find out how this is happening.

All you need to do next is grab the 2 profiles and send them to me:

curl localhost:6060/debug/pprof/profile?seconds=20 -o cpu.pprof
curl localhost:6060/debug/pprof/heap -o heap.pprof

Obviously do that when the problem occurs.

Let me know!
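For reference, wiring a PPROF_PORT switch like this into a Go plugin usually amounts to conditionally starting the standard net/http/pprof handlers; a minimal sketch under that assumption (not the actual source of the WIP build):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    "os"
)

// startPprofIfConfigured exposes Go's standard pprof endpoints when
// PPROF_PORT is set (e.g. via `docker plugin set loki PPROF_PORT=6060`),
// so profiles can be fetched with curl as shown above.
func startPprofIfConfigured() {
    port := os.Getenv("PPROF_PORT")
    if port == "" {
        return
    }
    go func() {
        if err := http.ListenAndServe(":"+port, nil); err != nil {
            log.Printf("pprof server stopped: %v", err)
        }
    }()
}

func main() {
    startPprofIfConfigured()
    select {} // stand-in for the plugin's real work
}

The resulting files can then be inspected locally with go tool pprof cpu.pprof.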

@glebsam
Author

glebsam commented Feb 12, 2021

@cyriltovena got an error while upgrading the plugin following the official documentation:

[root@ip-10-0-6-38 ec2-user]# docker plugin disable loki --force
loki
[root@ip-10-0-6-38 ec2-user]# docker plugin upgrade loki grafana/loki-docker-driver:master-8a9b94a-WIP --grant-all-permissions
Upgrading plugin loki:latest from grafana/loki-docker-driver:2.0.0 to grafana/loki-docker-driver:master-8a9b94a-WIP
Plugin images do not match, are you sure? [y/N] y
master-8a9b94a-WIP: Pulling from grafana/loki-docker-driver
92f8bb3e063f: Download complete
layers from manifest don't match image configuration
[root@ip-10-0-6-38 ec2-user]# docker plugin ls
ID                  NAME                DESCRIPTION           ENABLED
ae3b954388a6        loki:latest         Loki Logging Driver   false

docker plugin inspect still shows me the old Id: ae3b954388a6478908b680e2d44e0e99bc4523ce0e68c46c206b807199d02c41

@cyriltovena
Contributor

Not sure; can you uninstall and reinstall?

@till
Contributor

till commented Mar 15, 2021

I had the same problem, but couldn't install your driver (due to lack of time).

I straced the process (with -f to follow forks), and it seemed like the reason for the high CPU usage was constant connection timeouts, which I couldn't pin down exactly due to time constraints. Is there anything you think an strace would uncover? I probably have other nodes exhibiting this behavior.

@till
Contributor

till commented Mar 15, 2021

Did a little more tracing. It seems that at times the DNS resolution fails, but it doesn't seem to be anything I can verify outside of the Loki code. DNS generally works, but it fails when logs are written to the remote Loki (from the Docker plugin). This is a somewhat older version (1.5.0) of the plugin, though. I guess the best course of action is to update the plugin first?

@till
Contributor

till commented Mar 15, 2021

Leaving this here for now:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0      6325      1736 read
  0.00    0.000000           0      2530           write
  0.00    0.000000           0       301           close
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0    462140           sched_yield
  0.00    0.000000           0    295916           nanosleep
  0.00    0.000000           0       291           socket
  0.00    0.000000           0       291        97 connect
  0.00    0.000000           0       291           getsockname
  0.00    0.000000           0       291           getpeername
  0.00    0.000000           0       679           setsockopt
  0.00    0.000000           0        97           getsockopt
  0.00    0.000000           0         2           uname
  0.00    0.000000           0         4           fcntl
  0.00    0.000000           0   3616592   2745443 futex
  0.00    0.000000           0         4           getdents64
  0.00    0.000000           0         6         5 restart_syscall
  0.00    0.000000           0       602        16 epoll_ctl
  0.00    0.000000           0        10           openat
  0.00    0.000000           0         2           mkdirat
  0.00    0.000000           0        27         2 newfstatat
  0.00    0.000000           0        10         6 unlinkat
  0.00    0.000000           0      7070           epoll_pwait
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000               4393485   2747305 total

@till
Contributor

till commented Mar 16, 2021

I looked at this again with a fresh set of 👀. It's not DNS resolution; it seems to be the actual connects that fail. The host is resolved to an IP, and then the connect attempts happen again and again, and again. A restart of Docker temporarily fixes it, and then eventually it happens again.

Any suggestions for what else to look at?

@glebsam
Author

glebsam commented Mar 18, 2021

@till Sounds scary! Did you notice any loss of logs on the Loki server?

@till
Contributor

till commented Mar 18, 2021

@glebsam not really, this is a bit of an edge case. The customer is mostly using an old ELK stack and the plugin is rarely used, so I'm not sure it's usage-related. It's also CentOS 7, which is not a great base anyway, so maybe this will go away with a new system/kernel/everything.

@glebsam
Author

glebsam commented Mar 25, 2021

I was able to install the driver with the loki:2.2.0 tag and gather statistics from an instance that is currently suffering from high CPU usage of the Docker driver (but of the previous version). I ran a simple application that does nothing except generate 3,000 log records per second. Please take a look:
loki_log_driver_pprof_github_3319.zip

To be clear, here is what I did:

docker plugin install grafana/loki-docker-driver:2.2.0 \
    --alias loki:2.2.0 \
    --grant-all-permissions

docker plugin disable loki:2.2.0
docker plugin set loki:2.2.0 PPROF_PORT=6060
docker plugin enable loki:2.2.0

docker run -d --rm -it \
    --name loki-test \
    --log-driver=loki:2.2.0 \
    --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
    --log-opt no-file=true \
    --log-opt max-buffer-size=100m \
    --log-opt mode=non-blocking \
    loggen:latest -log-mps 3_000

curl localhost:6060/debug/pprof/heap -o heap.pprof
curl localhost:6060/debug/pprof/profile?seconds=20 -o cpu.pprof
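For completeness, loggen is just a throwaway load generator local to my setup; something along these lines is enough to produce the test load (a hypothetical sketch, not the actual image's source):

package main

import (
    "fmt"
    "time"
)

// A rough sketch of a load generator like the loggen container above (not its
// actual source): print a fixed number of log lines per second to stdout,
// which the configured Docker log driver then has to ship.
func main() {
    const linesPerSecond = 3000 // roughly what `-log-mps 3_000` asks for
    ticker := time.NewTicker(time.Second / linesPerSecond)
    defer ticker.Stop()

    for i := 0; ; i++ {
        <-ticker.C
        fmt.Printf("%s loggen line %d\n", time.Now().Format(time.RFC3339Nano), i)
    }
}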

@cyriltovena
Contributor

I think I have an idea of what's going on. I'm going to send a custom build; let's see.

@cyriltovena
Contributor

I should have jumped straight on that futex. I'm sorry it took me so long to react; I have good hope this is going to be fixed.

@cyriltovena
Contributor

You can try it out with this one: grafana/loki-docker-driver:close-ticker-eff4ff1-WIP

Sending the PR too.

cyriltovena added a commit to cyriltovena/loki that referenced this issue Mar 31, 2021
I think this was only impacting the docker driver, which starts/stops for each newly followed container.
My assumption is that this would slowly build up and eat CPU over time.

Fixes grafana#3319

I arrived at this conclusion since strace was showing a crazy number of futex syscalls and nothing else seemed to be the cause.

:pray:

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
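Judging by the close-ticker tag above and the commit message, the suspected leak is a per-container ticker that is never stopped. A minimal sketch of that pattern and the corresponding fix (an illustration, not the actual patch):

package main

import (
    "context"
    "time"
)

// followContainer sketches the suspected pattern: the driver starts a loop
// with its own ticker for every container it follows. If the ticker is never
// stopped, each started-and-stopped container leaves a runtime timer firing
// forever, which slowly builds up and shows in strace as futex/nanosleep churn.
func followContainer(ctx context.Context, flush func()) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop() // the fix: release the ticker when the container goes away

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            flush()
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    followContainer(ctx, func() {}) // returns after 3s without leaking the ticker
}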
cyriltovena added a commit that referenced this issue Mar 31, 2021
@till
Contributor

till commented Apr 1, 2021

@cyriltovena thanks for your prompt work on this! I'm looking at updating a few nodes next week. Is a 2.2.1 coming soon?

@cyriltovena
Contributor

@slim-bean can you include that in 2.2.1 please?

slim-bean pushed a commit that referenced this issue Apr 6, 2021
(cherry picked from commit ecca5d3)