
promtail skips logs when Loki is unavailable for a short time #915

Closed
pracucci opened this issue Aug 19, 2019 · 6 comments · Fixed by #1083
Labels
component/agent type/bug Something is not working as expected

Comments

@pracucci
Contributor

Describe the bug
When Loki is unavailable (down, network partitioning, ...) promtail keeps retrying the current batch of tailed logs until the backoff_config maxretries limit is reached. Once reached, the batch is discarded and promtail tries to push the next batch of lines read from the file.

The backoff implementation is inherited from Cortex, which implements a "Full jitter" backoff algorithm (minbackoff is not a guaranteed minimum, just the base used to compute the upper bound of each randomised sleep). I've sketched an example in a google spreadsheet.

The production/helm/promtail/values.yaml is configured with:

    backoff_config:
      # Initial backoff time between retries
      minbackoff: 100ms
      # Maximum backoff time between retries
      maxbackoff: 5s
      # Maximum number of retries when sending batches, 0 means infinite retries
      maxretries: 5

This means that, in the average case, it's enough for Loki to be unavailable for 1.55 seconds before promtail starts losing logs.
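
For illustration, here is a minimal Go sketch of the full-jitter behaviour described above. It approximates the Cortex-style backoff rather than reproducing the actual implementation: each retry sleeps a random duration between 0 and min(minbackoff * 2^n, maxbackoff), so the expected sleep per retry is half that upper bound, and the defaults add up to 50ms + 100ms + 200ms + 400ms + 800ms = 1.55s on average.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // fullJitterSleep returns the sleep before retry n (0-based): a random
    // duration in [0, min(minBackoff*2^n, maxBackoff)). Note that minBackoff
    // is only the base of the range, not a guaranteed lower bound.
    func fullJitterSleep(n int, minBackoff, maxBackoff time.Duration) time.Duration {
        upper := minBackoff << uint(n)
        if upper <= 0 || upper > maxBackoff {
            upper = maxBackoff
        }
        return time.Duration(rand.Int63n(int64(upper)))
    }

    func main() {
        minBackoff, maxBackoff, maxRetries := 100*time.Millisecond, 5*time.Second, 5

        var expectedTotal time.Duration
        for n := 0; n < maxRetries; n++ {
            upper := minBackoff << uint(n)
            if upper > maxBackoff {
                upper = maxBackoff
            }
            expectedTotal += upper / 2
            fmt.Printf("retry %d: sleep in [0, %v), e.g. %v\n", n, upper, fullJitterSleep(n, minBackoff, maxBackoff))
        }
        // With the defaults above this prints 1.55s: the average outage
        // needed before a batch is dropped.
        fmt.Printf("expected total wait before dropping the batch: %v\n", expectedTotal)
    }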

Is there any reason for such a low number of maxretries? Shouldn't we try to reduce the likelihood of data loss by increasing it significantly?

@candlerb
Contributor

For network timeouts etc., I don't see why it shouldn't just retry forever. It's only tailing a logfile, after all, so the source data remains available (at least until the logfile is rotated).

Retrying on HTTP errors is more nuanced. A 400 error means the data you sent is bad; you need to give up and move on to the next batch. A 500 error means the server had a problem (e.g. disk full), and in principle you should retry indefinitely. However, it is possible for bad client data to return a 500 (e.g. #929), and you don't want log ingestion to freeze as a result. So it would be reasonable to retry for a limited period - still, one much longer than the defaults shown above.
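
To make that distinction concrete, here is a hedged sketch of the classification being proposed - this is not promtail's actual client code, and shouldRetry is a hypothetical helper:

    package main

    import (
        "fmt"
        "net/http"
    )

    // shouldRetry sketches the policy described above: 4xx means the batch
    // itself was rejected and retrying cannot help, while 5xx means the
    // server had a problem and the push is worth retrying (for much longer
    // than the current defaults). Rate limiting (429) is also commonly
    // treated as retryable.
    func shouldRetry(statusCode int) bool {
        switch {
        case statusCode == http.StatusTooManyRequests:
            return true // rate limited: back off and try again
        case statusCode >= 500:
            return true // server-side problem, e.g. ingester down or disk full
        default:
            return false // 4xx: bad payload, drop it and move on to the next batch
        }
    }

    func main() {
        for _, code := range []int{400, 429, 500} {
            fmt.Printf("status %d -> retry: %v\n", code, shouldRetry(code))
        }
    }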

@stale

stale bot commented Oct 6, 2019

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Oct 6, 2019
@pracucci
Contributor Author

pracucci commented Oct 7, 2019

Please keep it alive, since there's already an open PR.

@stale stale bot removed the stale A stale issue or PR that will automatically be closed. label Oct 7, 2019
@muzammil360

Hi. This is a problem that I am facing as well. As it happens, when we deploy on client and server machines, Loki may not necessarily be up when promtail starts. This means that the first few log lines are always missed.

But this is not the biggest issue. The problem is when Loki is unreachable because of a pure network issue (weak wifi). In that case we easily lose logs. How can we make sure that we don't lose data for an extended period of time (say 1 hour, or at least 15 minutes)?

@cyriltovena
Contributor

You can configure a more generous backoff; see the backoff_config section under https://grafana.com/docs/loki/latest/clients/promtail/configuration/#clients.
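
For example, with full jitter each retry sleeps on average half of its current upper bound, so once the backoff has ramped up to maxbackoff every additional retry buys roughly maxbackoff/2 of outage tolerance. A sketch using the same keys as the values.yaml above (exact key names and defaults depend on your promtail version, so double-check against the docs):

    backoff_config:
      # Initial backoff time between retries
      minbackoff: 500ms
      # Maximum backoff time between retries
      maxbackoff: 5m
      # ~10 retries to ramp up to maxbackoff, then ~2.5m expected per retry:
      # roughly 55 minutes of tolerated outage on average
      maxretries: 30

As noted in the comment in the default config above, setting maxretries: 0 instead makes promtail retry forever.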

@muzammil360

muzammil360 commented Jan 10, 2021 via email

@chaudum chaudum added the type/bug Something is not working as expected label Jun 14, 2023