Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promtail: added action_on_failure support to timestamp stage #1123

Merged
merged 1 commit into from
Oct 10, 2019

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Oct 6, 2019

What this PR does / why we need it:

Currently, when the timestamp stage fails to parse the timestamp on a log entry, it keeps the time of when the entry has been scraped by Promtail. This may cause logs to fail to ingest due to out of order timestamps, because the "scrape timestamp" of an entry which failed parsing may be more recent than the "parsed timestamp" of a subsequent log entry.

In this PR I'm introducing the timestamp stage setting action_on_failure which can be:

  • fudge (new default): change the timestamp to the last known timestamp, summing up 1 nanosecond
  • skip (old behavior): do not change the timestamp and keep the time when the log entry has been scraped by Promtail

Unfortunately, just keeping an overall "last parsed timestamp" is not enough, because a single pipeline instance can process multiple targets (ie. logs from multiple files) and log timestamps should be expected to be ordered per single file only. For this reason, I've introduced an LRU cache to keep the "last known timestamp per target file" (the reason why I used an LRU is to add a cap to the max number of entries stored). So far, I haven't found a good solution for the journal target: any suggestion?

Which issue(s) this PR fixes:
Fixes #1093

Special notes for your reviewer:

  • I've picked github.com/hashicorp/golang-lru for the LRU, which is already a dependency of other dependecies we have

Checklist

  • Documentation added
  • Tests updated

@pracucci pracucci force-pushed the add-fudging-to-timestamp-stage branch from d06a986 to 1f93fb9 Compare October 6, 2019 21:07
@rfratto
Copy link
Member

rfratto commented Oct 7, 2019

I'm trying to think of other ways we might be able to implement this so we can include journal support. One way might be to add extra metadata into the entry handler so we can pass a unique identifier to the pipeline instead of having the pipeline figure it out per target. Then we could also configure the journal target to pass through the unit name as the unique ID, which should be good enough in most cases.

This is just a rough idea though, WDYT?

@pracucci
Copy link
Contributor Author

pracucci commented Oct 7, 2019

Thanks @rfratto for the feedback. I do agree with you that guessing the ID in the timestamp stage is not the right approach.

One way might be to add extra metadata into the entry handler so we can pass a unique identifier to the pipeline instead of having the pipeline figure it out per target. Then we could also configure the journal target to pass through the unit name as the unique ID, which should be good enough in most cases.

The unique identifier passed through the pipeline has no real value until that identifier is added as value of a label, because we need to guarantee uniqueness to the log streams pushed to Loki. So in the case we pick the unit name as "unique stream ID" (within a promtail instance) we'll need to also guarantee the unit name is always added to the labels, in the same way we do with filename for the file target.

What do you think?

@rfratto
Copy link
Member

rfratto commented Oct 7, 2019

The unique identifier passed through the pipeline has no real value until that identifier is added as value of a label, because we need to guarantee uniqueness to the log streams pushed to Loki.

I think I'm following you here; is the issue that the unique identifier may not be pushed to Loki, causing a bad entry in the LRU cache?

Overall I'm not sure why we couldn't just compute the fingerprint from the entire labelset and use that as the key into the LRU cache. It'd be the same mechanism the ingesters use to divide the streams to validate ordering, and I think it's fair to rely on the labelset the pipeline receives to be unique enough without having to explicitly check for a specific label (like filename, systemd unit, etc).

So in the case we pick the unit name as "unique stream ID" (within a promtail instance) we'll need to also guarantee the unit name is always added to the labels, in the same way we do with filename for the file target.

I'm a little hesitant about this. I do think users will almost always include the unit name as a label for journal targets, but I don't want to take away the choice of omitting it.

@pracucci
Copy link
Contributor Author

pracucci commented Oct 7, 2019

I think I'm following you here; is the issue that the unique identifier may not be pushed to Loki, causing a bad entry in the LRU cache?

Yes, correct.

Overall I'm not sure why we couldn't just compute the fingerprint from the entire labelset

I thought about it, but unless the user doesn't add a "unique stream ID" (ie. unit name) as label via the relabelling config, the labels set may not guarantee stream uniqueness anyway, so we're shifting this responsability to the user instead of forcing it. Am I missing anything?

@rfratto
Copy link
Member

rfratto commented Oct 7, 2019

No, you're right. It's already the user's responsibility to guarantee uniqueness per-file/unit if they're using a pipeline with a timestamp stage. I guess it's a question of whether or not it's a good idea to make it a hard requirement.

I'm not sure I'm informed enough to have an opinion on this. We could make the unique identifier label configurable in the stage, but that might be over engineering this.

/cc @cyriltovena @slim-bean any thoughts?

@cyriltovena
Copy link
Contributor

I'm fine with this solution. AFAIK you can have pipeline with journald targets ? Not sure what's the problem there.

@pracucci pracucci force-pushed the add-fudging-to-timestamp-stage branch from 1f93fb9 to 7645b12 Compare October 10, 2019 07:09
@pracucci pracucci requested a review from cyriltovena October 10, 2019 07:09
@pracucci
Copy link
Contributor Author

@rfratto @cyriltovena Thanks for your feedback. I've switched the "stream ID" to "labels fingerprint", which is a better solution compared to the previous one.

AFAIK you can have pipeline with journald targets ? Not sure what's the problem there.

My only concern is that - contrary to the file target (which always has the filename label) - the journal target doesn't have any default labels, so it's the user responsability to add a label - before the timestamp stage - which uniquely identify the source log stream, otherwise the fudging will not be per source stream (required to avoid issues).

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Pretty sure journald is fine if you have one stream it would still work.

@cyriltovena cyriltovena merged commit 7af0c2b into grafana:master Oct 10, 2019
@pracucci pracucci deleted the add-fudging-to-timestamp-stage branch October 10, 2019 15:35
cyriltovena pushed a commit to cyriltovena/loki that referenced this pull request Jun 11, 2021
Download chunks as tarball so you can run tests against them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timestamp stage add a fudged timestamp for lines missing a timestamp
3 participants