promtail skip logs on Loki unavailable for a short time #915
Comments
For network timeouts etc., I don't see why it shouldn't just retry forever. It's only tailing a logfile after all, so the source data is going to remain forever (or at least until the logfile is rotated). Retrying on error is more nuanced. A 400 error means the data you sent is bad; you need to give up and move on to the next batch. A 500 error means the server had a problem (e.g. disk full) and in principle you should retry indefinitely. However, it is possible that bad client data can return 500 (e.g. #929), and you don't want log ingestion to freeze as a result. So it would be reasonable to retry for a limited period. Still, it should be much longer than the defaults you show. |
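The policy proposed in this comment can be sketched roughly as follows. This is only an illustration of the suggestion, not promtail's actual behavior: the names `Action`, `decide` and `server_error_limit_s`, the one-hour default, and the reading of "a 400 error" as any 4xx status are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    DROP = "drop"


def decide(status: Optional[int], elapsed_s: float,
           server_error_limit_s: float = 3600.0) -> Action:
    """Decide what to do with a batch after a failed push.

    status is the HTTP status code, or None for a network error or timeout.
    elapsed_s is how long this batch has been retried so far.
    server_error_limit_s is a hypothetical cap for retrying on 5xx responses.
    """
    if status is None:
        # Network problem: the log file is still on disk, so keep retrying.
        return Action.RETRY
    if 400 <= status < 500:
        # The batch itself is bad: give up and move on to the next one.
        return Action.DROP
    if status >= 500:
        # Server-side problem: retry, but only for a limited period so that
        # bad client data answered with a 5xx cannot freeze ingestion.
        return Action.RETRY if elapsed_s < server_error_limit_s else Action.DROP
    raise ValueError(f"status {status} is not an error response")
```

With this split, a short Loki outage never costs data, while a batch that the server keeps rejecting is eventually dropped.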
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
Please keep it alive, since there's already an open PR. |
Hi. This is a problem that I am facing as well. As it happens, when we deploy on client and server machines, Loki may not necessarily be up when promtail starts. This means that the first few log lines are always missed. But this is not the biggest issue. The problem is when Loki is unreachable because of a pure network issue (weak wifi). In that case we easily lose logs. How can we make sure that we don't lose data for an extended period of time (say 1 hr, or at least 15 mins)? |
You can configure proper backoff, see https://grafana.com/docs/loki/latest/clients/promtail/configuration/#clients backoff config. |
Yes. That's what I did. Thanks a lot. Recent versions actually have much better backoff settings by default.
|
Describe the bug
When Loki is unavailable (down, network partitioning, ...), promtail keeps sending a batch of tailed logs until the backoff_config maxretries is reached. Once it is reached, the batch is discarded and promtail tries to push the next batch of lines read from a file.
The backoff implementation is inherited from Cortex, which implements a "Full jitter" backoff algorithm (the minbackoff is not a guaranteed minimum, just the base for the initial upper range for the random sleep time). I've sketched an example in a Google spreadsheet.
The production/helm/promtail/values.yaml is configured with:
This means that, in the average case, it's enough for Loki to be unavailable for 1.55 seconds before promtail starts losing logs.
Is there any reason for such a low number of maxretries? Shouldn't we try to reduce the likelihood of data loss by increasing it significantly?
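The 1.55 second average can be reproduced with a short calculation. The sketch below is not the Cortex code. It assumes that each sleep is drawn uniformly between 0 and an upper bound that starts at minbackoff, doubles after every attempt, and is capped at maxbackoff. The values 100 ms, 5 s and 5 retries are also assumptions: they do not appear in the text above and were chosen because they yield the quoted figure. The function name is hypothetical.

```python
def expected_total_wait_ms(min_backoff_ms: int, max_backoff_ms: int,
                           max_retries: int) -> float:
    """Expected total sleep time before a batch is discarded (full jitter)."""
    total = 0.0
    upper = min_backoff_ms
    for _ in range(max_retries):
        # The mean of a uniform draw from [0, upper] is upper / 2.
        total += min(upper, max_backoff_ms) / 2
        upper *= 2
    return total


# Assumed settings: minbackoff=100ms, maxbackoff=5s, maxretries=5
print(expected_total_wait_ms(100, 5000, 5))  # 1550.0 ms, i.e. 1.55 s
```

Because the upper bound is capped, every retry beyond the cap adds only half of maxbackoff on average, so under these assumptions maxretries has to grow roughly linearly with the length of the outage that should be survived.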