
promtail skips logs when Loki is unavailable for a short time #915

Closed
pracucci opened this issue Aug 19, 2019 · 6 comments · Fixed by #1083
Labels
component/agent type/bug Something is not working as expected

Comments

@pracucci
Contributor

Describe the bug
When Loki is unavailable (down, network partitioning, ...) promtail keeps retrying the current batch of tailed logs until the backoff_config maxretries limit is reached. Once reached, the batch is discarded and promtail tries to push the next batch of lines read from the file.

The backoff implementation is inherited from Cortex, which implements a "Full jitter" backoff algorithm (minbackoff is not a guaranteed minimum, just the base used to compute the upper bound of each randomised sleep). I've sketched an example in a google spreadsheet.

The production/helm/promtail/values.yaml is configured with:

    backoff_config:
      # Initial backoff time between retries
      minbackoff: 100ms
      # Maximum backoff time between retries
      maxbackoff: 5s
      # Maximum number of retries when sending batches, 0 means infinite retries
      maxretries: 5

This means that, in the average case, it's enough for Loki to be unavailable for 1.55 seconds before promtail starts losing logs.
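
For illustration, here is a minimal Go sketch of the full-jitter behaviour described above. It approximates the Cortex-style backoff rather than reproducing the actual implementation: each retry sleeps a random duration between 0 and min(minbackoff * 2^n, maxbackoff), so the expected sleep per retry is half that upper bound, and the defaults add up to 50ms + 100ms + 200ms + 400ms + 800ms = 1.55s on average.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // fullJitterSleep returns the sleep before retry n (0-based): a random
    // duration in [0, min(minBackoff*2^n, maxBackoff)). Note that minBackoff
    // is only the base of the range, not a guaranteed lower bound.
    func fullJitterSleep(n int, minBackoff, maxBackoff time.Duration) time.Duration {
        upper := minBackoff << uint(n)
        if upper <= 0 || upper > maxBackoff {
            upper = maxBackoff
        }
        return time.Duration(rand.Int63n(int64(upper)))
    }

    func main() {
        minBackoff, maxBackoff, maxRetries := 100*time.Millisecond, 5*time.Second, 5

        var expectedTotal time.Duration
        for n := 0; n < maxRetries; n++ {
            upper := minBackoff << uint(n)
            if upper > maxBackoff {
                upper = maxBackoff
            }
            expectedTotal += upper / 2
            fmt.Printf("retry %d: sleep in [0, %v), e.g. %v\n", n, upper, fullJitterSleep(n, minBackoff, maxBackoff))
        }
        // With the defaults above this prints 1.55s: the average outage
        // needed before a batch is dropped.
        fmt.Printf("expected total wait before dropping the batch: %v\n", expectedTotal)
    }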

Is there any reason for such a low number of maxretries? Shouldn't we try to reduce the likelihood of data loss by increasing it significantly?

@candlerb
Contributor

For network timeouts etc., I don't see why it shouldn't just retry forever. It's only tailing a logfile, after all, so the source data remains available (at least until the logfile is rotated).

Retrying on HTTP errors is more nuanced. A 400 error means the data you sent is bad; you need to give up and move on to the next batch. A 500 error means the server had a problem (e.g. disk full), and in principle you should retry indefinitely. However, it is possible for bad client data to return a 500 (e.g. #929), and you don't want log ingestion to freeze as a result. So it would be reasonable to retry for a limited period - still, one much longer than the defaults shown above.
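
To make that distinction concrete, here is a hedged sketch of the classification being proposed - this is not promtail's actual client code, and shouldRetry is a hypothetical helper:

    package main

    import (
        "fmt"
        "net/http"
    )

    // shouldRetry sketches the policy described above: 4xx means the batch
    // itself was rejected and retrying cannot help, while 5xx means the
    // server had a problem and the push is worth retrying (for much longer
    // than the current defaults). Rate limiting (429) is also commonly
    // treated as retryable.
    func shouldRetry(statusCode int) bool {
        switch {
        case statusCode == http.StatusTooManyRequests:
            return true // rate limited: back off and try again
        case statusCode >= 500:
            return true // server-side problem, e.g. ingester down or disk full
        default:
            return false // 4xx: bad payload, drop it and move on to the next batch
        }
    }

    func main() {
        for _, code := range []int{400, 429, 500} {
            fmt.Printf("status %d -> retry: %v\n", code, shouldRetry(code))
        }
    }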

@stale

stale bot commented Oct 6, 2019

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Oct 6, 2019
@pracucci
Contributor Author

pracucci commented Oct 7, 2019

Please keep it alive, since there's already an open PR.

@stale stale bot removed the stale A stale issue or PR that will automatically be closed. label Oct 7, 2019
@muzammil360

Hi. This is a problem that I am facing as well. As it happens, when we deploy on client and server machines, Loki may not necessarily be up when promtail starts. This means that the first few log lines are always missed.

But this is not the biggest issue. The problem is when Loki is unreachable because of a pure network issue (weak wifi). In that case we easily lose logs. How can we make sure that we don't lose data for an extended period of time (say 1 hour, or at least 15 minutes)?

@cyriltovena
Contributor

You can configure a more generous backoff; see the backoff_config section under https://grafana.com/docs/loki/latest/clients/promtail/configuration/#clients.
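
For example, with full jitter each retry sleeps on average half of its current upper bound, so once the backoff has ramped up to maxbackoff every additional retry buys roughly maxbackoff/2 of outage tolerance. A sketch using the same keys as the values.yaml above (exact key names and defaults depend on your promtail version, so double-check against the docs):

    backoff_config:
      # Initial backoff time between retries
      minbackoff: 500ms
      # Maximum backoff time between retries
      maxbackoff: 5m
      # ~10 retries to ramp up to maxbackoff, then ~2.5m expected per retry:
      # roughly 55 minutes of tolerated outage on average
      maxretries: 30

As noted in the comment in the default config above, setting maxretries: 0 instead makes promtail retry forever.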

@muzammil360

muzammil360 commented Jan 10, 2021 via email

@chaudum chaudum added the type/bug Something is not working as expected label Jun 14, 2023