Add non-indexed metadata to chunks #9700
Should we also check for equality of `metaLabels` here, to allow entries with the same timestamp and log line but different non-indexed labels? However, this would mean generating a hash over both the log line and the non-indexed labels in `SampleIterator`, which uses that hash for deduping duplicate data caused by replication. It would be rare for someone to have log entries with the same timestamp and log line but different non-indexed labels, so I think we should keep it as is and avoid paying the cost of supporting this rare case. Just putting it out there to see if anyone wants to share their thoughts. If we decide not to support it, we should make that clear in the docs.
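To make the tradeoff concrete, here is a minimal Go sketch of the two comparisons under discussion. The `entry` and `label` types and both function names are hypothetical, introduced only for illustration; they are not the actual types used in the head block or in `SampleIterator`.

```go
package main

import "fmt"

// Hypothetical entry shape, only to illustrate the dedupe question; the
// actual head block entry type in Loki differs.
type label struct{ name, value string }

type entry struct {
	ts         int64
	line       string
	metaLabels []label // non-indexed labels
}

// equalForDedup reflects the behaviour being kept: two entries are duplicates
// when timestamp and line match, regardless of their non-indexed labels.
func equalForDedup(a, b entry) bool {
	return a.ts == b.ts && a.line == b.line
}

// equalWithMeta is the alternative discussed above: also compare the
// non-indexed labels, which would require hashing them for deduping as well.
func equalWithMeta(a, b entry) bool {
	if a.ts != b.ts || a.line != b.line || len(a.metaLabels) != len(b.metaLabels) {
		return false
	}
	for i := range a.metaLabels {
		if a.metaLabels[i] != b.metaLabels[i] {
			return false
		}
	}
	return true
}

func main() {
	a := entry{ts: 1, line: "msg", metaLabels: []label{{"trace_id", "abc"}}}
	b := entry{ts: 1, line: "msg", metaLabels: []label{{"trace_id", "def"}}}
	fmt.Println(equalForDedup(a, b)) // true: treated as duplicates today
	fmt.Println(equalWithMeta(a, b)) // false: would be kept as distinct entries
}
```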
Taking promtail as an example, I don't see any stage that can duplicate log lines. Therefore, this situation would only happen if the user is using a custom client that somehow is running into this scenario. IMO, to avoid having to compute a hash over the meta labels, and to keep the code simple, we should not check the non-indexed labels here.
I was only considering the case where the same logline is being emitted by some jobs, and for whatever reason, they all have the same stream labels.
I'm ok with just checking `(indexed_labels, ts, line)` for equality here without taking a formal stance yet. We can handle this differently later if we choose, as long as it doesn't seriously break backwards compatibility in the future; basically, we can revisit it before we include this in a release.
I am inclined towards keeping it as is, to avoid supporting it only partially, i.e. without also making the required changes in `SampleIterator`. If you strongly feel we should make this change, please let me know, and I will take care of it along with my string interning change.
We had decided to write the length of the whole nonIndexedLabels section first so that we can choose to skip it altogether and jump to the beginning of the next entry. I can take care of it while working on string interning.
Good point. We could do it here, but we'd need to iterate through all the labels and sum up their lengths. With string interning, IIUC, the label names and values would be refs to wherever the actual string is stored, so we wouldn't need to iterate through all the label-value pairs; we could just multiply `len(metaLabels)` by twice the ref size, right? In that case, I think it would make more sense to do it directly in the string interning PR, to avoid complicating this PR more than necessary given that this part will change anyway. Wdyt?
Since we are writing the data with variable-length encoding, the number of bytes can't be estimated beforehand. We would have to write the data to a temp buffer first, write its size, and then copy the data over from the temp buffer. I will take care of it while working on string interning.
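To make the "temp buffer, then size, then data" approach concrete, here is a minimal Go sketch assuming a simple uvarint-based layout. The `label` type and both function names (`appendNonIndexedLabels`, `skipNonIndexedLabels`) are hypothetical, and the encoding is simplified compared to the real head block serialization; it only shows how a length prefix lets a reader skip the whole non-indexed labels section.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

type label struct{ name, value string }

// appendNonIndexedLabels serializes the labels into a temp buffer, then
// writes the buffer's length followed by its contents, so readers can skip
// the whole section and jump to the beginning of the next entry.
func appendNonIndexedLabels(out *bytes.Buffer, labels []label) {
	var tmp bytes.Buffer
	var scratch [binary.MaxVarintLen64]byte

	writeString := func(b *bytes.Buffer, s string) {
		n := binary.PutUvarint(scratch[:], uint64(len(s)))
		b.Write(scratch[:n])
		b.WriteString(s)
	}

	n := binary.PutUvarint(scratch[:], uint64(len(labels)))
	tmp.Write(scratch[:n])
	for _, l := range labels {
		writeString(&tmp, l.name)
		writeString(&tmp, l.value)
	}

	// Only now do we know the section size; write it, then the section itself.
	n = binary.PutUvarint(scratch[:], uint64(tmp.Len()))
	out.Write(scratch[:n])
	out.Write(tmp.Bytes())
}

// skipNonIndexedLabels shows how a reader can jump over the section without
// decoding the individual labels.
func skipNonIndexedLabels(data []byte) []byte {
	size, n := binary.Uvarint(data)
	return data[n+int(size):]
}

func main() {
	var buf bytes.Buffer
	appendNonIndexedLabels(&buf, []label{{"trace_id", "abc123"}, {"user", "42"}})
	rest := skipNonIndexedLabels(buf.Bytes())
	fmt.Println(len(rest)) // 0: nothing follows the section in this example
}
```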
Keep in mind, this will break compatibility between this PR and a future PR unless we add a newer format than `UnorderedWithMetadataHeadBlockFmt`. This is one of the benefits of not setting this as the default yet (via `DefaultHeadBlockFmt = UnorderedHeadBlockFmt`) -- we can change the implementation in a later PR before we start using it.
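As a rough illustration of how a head block format constant gates the new encoding, a hedged sketch follows. The constant names mirror the ones mentioned above, but the method and the default shown here are simplified placeholders rather than Loki's actual chunkenc code.

```go
package main

import "fmt"

// HeadBlockFmt sketches how a format version can gate the on-disk layout.
// Only the constant names echo the discussion above; the rest is illustrative.
type HeadBlockFmt byte

const (
	UnorderedHeadBlockFmt HeadBlockFmt = iota
	UnorderedWithMetadataHeadBlockFmt
)

// Keeping the default on the old format means the new encoding is not written
// by default yet, so its layout can still change in a follow-up PR without
// breaking already-written chunks.
const DefaultHeadBlockFmt = UnorderedHeadBlockFmt

func (f HeadBlockFmt) supportsNonIndexedLabels() bool {
	return f >= UnorderedWithMetadataHeadBlockFmt
}

func main() {
	fmt.Println(DefaultHeadBlockFmt.supportsNonIndexedLabels())               // false
	fmt.Println(UnorderedWithMetadataHeadBlockFmt.supportsNonIndexedLabels()) // true
}
```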