Loki docker driver: sporadic high CPU usage #3319
Found a couple of instances. Inspected pid 2810 (see pic) with docker plugin inspect:

[
  {
    "Config": {
      "Args": {
        "Description": "",
        "Name": "",
        "Settable": null,
        "Value": null
      },
      "Description": "Loki Logging Driver",
      "DockerVersion": "17.09.0-ce",
      "Documentation": "https://github.com/grafana/loki",
      "Entrypoint": [
        "/bin/docker-driver"
      ],
      "Env": [
        {
          "Description": "Set log level to output for plugin logs",
          "Name": "LOG_LEVEL",
          "Settable": [
            "value"
          ],
          "Value": "info"
        }
      ],
      "Interface": {
        "Socket": "loki.sock",
        "Types": [
          "docker.logdriver/1.0"
        ]
      },
      "IpcHost": false,
      "Linux": {
        "AllowAllDevices": false,
        "Capabilities": null,
        "Devices": null
      },
      "Mounts": null,
      "Network": {
        "Type": "host"
      },
      "PidHost": false,
      "PropagatedMount": "",
      "User": {},
      "WorkDir": "",
      "rootfs": {
        "diff_ids": [
          "sha256:712bd6e70d0e97729b18f19d6a59069cd88262f41deeca5eaadc48a973a756f5"
        ],
        "type": "layers"
      }
    },
    "Enabled": true,
    "Id": "ae3b954388a6478908b680e2d44e0e99bc4523ce0e68c46c206b807199d02c41",
    "Name": "loki:latest",
    "PluginReference": "docker.io/grafana/loki-docker-driver:2.0.0",
    "Settings": {
      "Args": [],
      "Devices": [],
      "Env": [
        "LOG_LEVEL=info"
      ],
      "Mounts": []
    }
  }
]

Current traffic from the instance per minute:
Traffic from the instance for the last 7 days per minute:

So, it is yet another case: high CPU usage after a big amount of logs.
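As an aside, a quick way to confirm that a busy PID like the one above really belongs to the plugin process is something along these lines (a sketch only; 2810 is the PID from the screenshot, and loki is assumed to be the alias the plugin was installed under):

```sh
# Show what the hot PID actually is and how much CPU it is burning
# (substitute your own PID for 2810).
ps -o pid,ppid,pcpu,etime,cmd -p 2810

# Confirm that the Loki logging driver plugin is the one running /bin/docker-driver.
docker plugin inspect loki --format '{{.Name}} {{.Config.Entrypoint}} {{.Enabled}}'
```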
We are using
I'm looking at adding more ways to figure out what's up; testing it right now.
Alright, I just pushed this plugin: grafana/loki-docker-driver:master-8a9b94a-WIP. Install it and then enable pprof by doing this.
This will open a port on the plugin that will let you grab a CPU and a memory profile, which I can use to do my detective work and find out how this is happening. All you need to do next is grab the 2 profiles and send them to me:
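For reference, once pprof is enabled, grabbing the two profiles is typically just a pair of HTTP requests against Go's standard pprof endpoints. This is only a sketch; the port (6060 here) is an assumption, use whatever port the plugin actually exposes:

```sh
# 30-second CPU profile from the plugin's pprof endpoint
# (localhost:6060 is a placeholder for the port the plugin opens).
curl -o cpu.pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Heap (memory) profile from the same endpoint.
curl -o heap.pprof "http://localhost:6060/debug/pprof/heap"
```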
Obviously, do that when the problem occurs. Let me know!
@cyriltovena got an error while upgrading the plugin using the official documentation:
docker plugin inspect shows me an old Id:
Not sure, can you uninstall and reinstall?
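For anyone following along, the usual uninstall/reinstall sequence for the plugin looks roughly like this (a sketch assuming the plugin was installed under the alias loki; adjust the tag to whichever build you are testing):

```sh
# Stop and remove the currently installed plugin
# (--force disables it even if containers still reference it).
docker plugin disable loki --force
docker plugin rm loki

# Install the desired version under the same alias.
docker plugin install grafana/loki-docker-driver:2.0.0 \
  --alias loki --grant-all-permissions
```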
I had the same problem, but couldn't install your driver (due to lack of time). I
Did a little more tracing. It seems that at times the DNS resolution fails, but it doesn't seem to be anything that I can verify outside of the Loki code. DNS generally works, but it fails when logs are written to the remote Loki (from the
Leaving this here for now:
I looked at this again with a fresh set of 👀. It's not DNS resolution. It seems to be actual connects that fail. The host is translated into an IP, then the connect attempts happen again and again, and again. A restart of Docker temporarily fixes it, and then eventually it'll happen again. Any suggestions for what else to look at?
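One low-tech way to see what the driver is doing when it spins (this is how the futex storm mentioned later in the thread shows up) is to attach strace to the process; a sketch, with the PID as a placeholder:

```sh
# Attach to the running driver process and count syscalls;
# press Ctrl-C after 30 seconds or so to get the summary table.
# <pid> is the /bin/docker-driver process found via ps or htop.
sudo strace -f -c -p <pid> -e trace=connect,futex
```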
@till Sounds scary! Did you notice any loss of logs on the Loki server?
@glebsam not really, this is a bit of an edge case. The customer is mostly using an old ELK stack; the plugin is rarely used. So not sure if it's usage-related. It's also CentOS 7, which is not a great base anyway. So maybe this will go away with a new system/kernel/everything.
I was able to install the driver with
To be clear, what I've done:
I think I have an idea of what's going on. I'm going to send a custom build, let's see.
I should have jumped straight on that futex. I'm sorry I took so long to react; I have good hope this is going to be fixed.
You can try it out with this one: grafana/loki-docker-driver:close-ticker-eff4ff1-WIP. Sending the PR too.
I think this was only impacting the docker driver, which would start/stop for each newly followed container. My assumption is that this would slowly build up and eat CPU over time. Fixes #3319. I arrived at this conclusion since strace was showing a crazy amount of futex syscalls and nothing else seemed to be the cause. :pray: Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
@cyriltovena thanks for your prompt work on this! I'm looking at updating a few nodes next week. Is a
@slim-bean can you include that in 2.2.1 please?
Fixes #3319 (cherry picked from commit ecca5d3). Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Description
Sometimes I can observe high CPU usage by /bin/docker-driver which is not correlated with the amount of logs flowing through it. CPU usage is tracked via htop. The only thing which always helps is a docker engine restart. By "high CPU usage" I mean 20-40% instead of the usual 1-9%.
To Reproduce
No stable steps to reproduce. A possible cause is a temporarily increased (and then decreased) amount of logs.
Environment:
We use AWS Linux v1 and v2 (both affected). The instance type is AWS t3a.small. The Docker driver is used on an ECS instance; the settings are described in /etc/docker/daemon.json:
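For context, a minimal daemon.json that routes container logs through the Loki driver looks roughly like this (a sketch only; the URL and values are placeholders rather than the reporter's actual settings, and the log-opts shown are documented options of the driver):

```sh
# Sketch of /etc/docker/daemon.json with the Loki logging driver enabled.
# loki-url / loki-batch-size / max-size / max-file are documented log-opts;
# the endpoint below is a placeholder.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "https://<loki-host>/loki/api/v1/push",
    "loki-batch-size": "400",
    "max-size": "50m",
    "max-file": "10"
  }
}
EOF

# Restart the Docker daemon for the change to take effect.
sudo service docker restart
```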
Additional info
I understand that this issue is hard to solve without concrete steps to reproduce. If you can, please describe the actions I should take to help you track down the root cause of the issue, in case I notice it again (I bet I will; I've seen it more than twice in the past three months).