-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implementing string interning to optimize resource usage for storing and processing non-indexed labels #10044
implementing string interning to optimize resource usage for storing and processing non-indexed labels #10044
Conversation
…and processing non-indexed labels
805a495
to
238b174
Compare
@@ -316,11 +320,11 @@ func (hb *headBlock) LoadBytes(b []byte) error { | |||
return nil | |||
} | |||
|
|||
func (hb *headBlock) Convert(version HeadBlockFmt) (HeadBlock, error) { | |||
func (hb *headBlock) Convert(version HeadBlockFmt, symbolizer *symbolizer) (HeadBlock, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see, all calls to Convert pass nil as the symbolized. I wonder if there is a workaround to not having to pass the symbolizer here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a mistake passing the symbolizer as nil everywhere. I am going to fix it. Nice catch!
pkg/chunkenc/memchunk.go
Outdated
metasLen := uint64(0) | ||
if version >= chunkFormatV4 { | ||
// version >= 4 starts writing length of sections after their offsets | ||
metasLen, metasOffset = readSectionLenAndOffset(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't hardcode the index of the section here. but rather have a const defining it.
type symbolizer struct { | ||
mtx sync.RWMutex | ||
symbolsMap map[string]uint32 | ||
labels []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this name a bit confusing. IIUC, this is an array with all the symbols. I think it should be named symbols
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is a list of labels in string form. symbols
are actually references by index to the deduped list of labels.
pkg/chunkenc/symbols.go
Outdated
|
||
idx, ok = s.symbolsMap[lbl] | ||
if !ok { | ||
idx = uint32(len(s.symbolsMap)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since idx
is the index within the labels
slice, I think it would be easier to understand this if we look at the len of labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should have the same length. symbolsMap
is actually for deduping while labels
is used at query time to find labels referenced by index. I will change it the way you suggested since I am fine with either ways.
pkg/chunkenc/symbols.go
Outdated
|
||
s.compressedSize = len(b) | ||
|
||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this loop will read until there is an EOF. Since in the future, we might want to put something after the symbols. I think we should rather store the number of symbols first and then read that many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already store the number of labels. I will update the code to use it.
pkg/chunkenc/unordered.go
Outdated
@@ -88,14 +90,14 @@ func (hb *unorderedHeadBlock) UncompressedSize() int { | |||
return hb.size | |||
} | |||
|
|||
func (hb *unorderedHeadBlock) Reset() { | |||
x := newUnorderedHeadBlock(hb.format) | |||
func (hb *unorderedHeadBlock) Reset(symbolizer *symbolizer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like, the same symbolizer is reused after calling Reset
.
Line 905 in 238b174
c.head.Reset(c.symbolizer) |
To minimize passing the symbolizer around, I think it makes more sense to remove the symbolizer argument and just pass hb.symbolizer
to newUnorderedHeadBlock
pkg/chunkenc/unordered.go
Outdated
line string | ||
nonIndexedLabels labels.Labels | ||
line string | ||
symbols symbols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be renamed to nonIndexedLabelsSymbols
to account for what the symbols are used for
pkg/chunkenc/unordered.go
Outdated
@@ -570,23 +547,21 @@ func (hb *unorderedHeadBlock) LoadBytes(b []byte) error { | |||
lineLn := db.uvarint() | |||
line := string(db.bytes(lineLn)) | |||
|
|||
var metaLabels labels.Labels | |||
var symbols symbols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with naming. I think it should be named nonIndexedLabelsSymbols
pkg/chunkenc/unordered.go
Outdated
@@ -262,8 +261,8 @@ func (hb *unorderedHeadBlock) Iterator( | |||
direction, | |||
mint, | |||
maxt, | |||
func(statsCtx *stats.Context, ts int64, line string, nonIndexedLabels labels.Labels) error { | |||
newLine, parsedLbs, matches := pipeline.ProcessString(ts, line, nonIndexedLabels...) | |||
func(statsCtx *stats.Context, ts int64, line string, symbols symbols) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with naming. I think it should be named nonIndexedLabelsSymbols
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small comments
pkg/chunkenc/symbols.go
Outdated
s := symbolizer{ | ||
symbolsMap: map[string]uint32{}, | ||
} | ||
numLabels := db.uvarint() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can init symbolsMap and labels with size numLabels := db.uvarint() to prevent dynamic grow of the collections.
s := symbolizer{ | |
symbolsMap: map[string]uint32{}, | |
} | |
numLabels := db.uvarint() | |
numLabels := db.uvarint() | |
s := symbolizer{ | |
symbolsMap: make(map[string]uint32{}, numLabels), | |
labels: make([]string, 0, numLabels), | |
} | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, not sure how I missed it. Thanks!
pkg/chunkenc/symbols.go
Outdated
var ( | ||
readBuf [10]byte // Enough bytes to store one varint. | ||
readBufValid int // How many bytes are left in readBuf from previous read. | ||
s symbolizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we init symbolized with slice and map size enough to store numSymbols
?
…and processing non-indexed labels (#10044) **What this PR does / why we need it**: In PR #9700, we added support for storing non-indexed labels in chunks. This PR optimizes resource usage for storing and processing non-indexed labels by doing string interning. We will store deduped label names and values as a list in chunks in the newly added non-indexed labels section. The labels would then be referenced in blocks by their index(called symbols). Additionally, I have started the convention of writing lengths of sections with their offsets within chunks, making it easier to introduce new sections. The section offsets and lengths would be stored at the end of the chunk, similar to [TOC](https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/#a-toc) in TSDB. **Checklist** - [x] Tests updated (cherry picked from commit 9b554bb)
What this PR does / why we need it:
In PR #9700, we added support for storing non-indexed labels in chunks. This PR optimizes resource usage for storing and processing non-indexed labels by doing string interning. We will store deduped label names and values as a list in chunks
in the newly added non-indexed labels section. The labels would then be referenced in blocks by their index(called symbols).
Additionally, I have started the convention of writing lengths of sections with their offsets within chunks, making it easier to introduce new sections. The section offsets and lengths would be stored at the end of the chunk, similar to TOC in TSDB.
Checklist