implementing string interning to optimize resource usage for storing and processing non-indexed labels #10044

sandeepsukhani · 2023-07-24T15:10:09Z

What this PR does / why we need it:
In PR #9700, we added support for storing non-indexed labels in chunks. This PR reduces the resources needed to store and process non-indexed labels by interning strings. Deduplicated label names and values are stored as a list in the chunk's newly added non-indexed labels section. Blocks then reference the labels by their index in that list (these references are called symbols).

Additionally, I have started the convention of writing the lengths of sections alongside their offsets within chunks, which makes it easier to introduce new sections. The section offsets and lengths are stored at the end of the chunk, similar to the TOC in TSDB (https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/#a-toc).
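To illustrate the general idea of a trailing table of section lengths and offsets, here is a rough sketch. The layout (fixed 8-byte big-endian length followed by an 8-byte offset per section) and the function names buildChunk and readSection are assumptions made for this example, not the chunk format this PR defines.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const entrySize = 16 // assumed: 8-byte length + 8-byte offset per section

// buildChunk concatenates the sections and appends one (length, offset)
// entry per section at the end. The layout is illustrative only.
func buildChunk(parts [][]byte) []byte {
	var out []byte
	type section struct{ offset, length uint64 }
	toc := make([]section, 0, len(parts))
	for _, p := range parts {
		toc = append(toc, section{offset: uint64(len(out)), length: uint64(len(p))})
		out = append(out, p...)
	}
	for _, s := range toc {
		var buf [entrySize]byte
		binary.BigEndian.PutUint64(buf[:8], s.length)
		binary.BigEndian.PutUint64(buf[8:], s.offset)
		out = append(out, buf[:]...)
	}
	return out
}

// readSection returns section i out of n by consulting the trailing table.
func readSection(chunk []byte, i, n int) ([]byte, error) {
	footer := n * entrySize
	if i < 0 || i >= n || len(chunk) < footer {
		return nil, errors.New("invalid section index or short chunk")
	}
	dataLen := uint64(len(chunk) - footer)
	entry := chunk[len(chunk)-footer+i*entrySize:]
	length := binary.BigEndian.Uint64(entry[:8])
	offset := binary.BigEndian.Uint64(entry[8:16])
	if offset > dataLen || length > dataLen-offset {
		return nil, errors.New("section out of bounds")
	}
	return chunk[offset : offset+length], nil
}

func main() {
	chunk := buildChunk([][]byte{[]byte("symbols"), []byte("blocks"), []byte("metas")})
	metas, err := readSection(chunk, 2, 3)
	fmt.Printf("%q %v\n", metas, err)
}
```

Because each entry records a length as well as an offset, a reader can locate or skip a section without knowing what follows it.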

Checklist

Tests updated

salvacorts · 2023-07-25T07:43:00Z

pkg/chunkenc/memchunk.go

@@ -316,11 +320,11 @@ func (hb *headBlock) LoadBytes(b []byte) error {
 	return nil
 }

-func (hb *headBlock) Convert(version HeadBlockFmt) (HeadBlock, error) {
+func (hb *headBlock) Convert(version HeadBlockFmt, symbolizer *symbolizer) (HeadBlock, error) {


As far as I can see, all calls to Convert pass nil as the symbolizer. I wonder if there is a way to avoid passing the symbolizer here.

It was a mistake passing the symbolizer as nil everywhere. I am going to fix it. Nice catch!

salvacorts · 2023-07-25T07:46:51Z

pkg/chunkenc/memchunk.go

+	metasLen := uint64(0)
+	if version >= chunkFormatV4 {
+		// version >= 4 starts writing length of sections after their offsets
+		metasLen, metasOffset = readSectionLenAndOffset(1)


I think we shouldn't hardcode the index of the section here, but rather have a const defining it.
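A small sketch of what named section indexes could look like. The constant names, the 16-byte entry width, and the helper sectionEntryStart are all hypothetical and only illustrate the suggestion; they are not taken from the PR.

```go
package main

import "fmt"

// Hypothetical section positions, counted back from the end of the chunk.
const (
	metasSectionIdx            = 1
	nonIndexedLabelsSectionIdx = 2
)

// Assumed entry width: an 8-byte length followed by an 8-byte offset.
const sectionEntrySize = 16

// sectionEntryStart returns the byte position where the table entry for the
// given section begins in a chunk of chunkLen bytes.
func sectionEntryStart(chunkLen, sectionIdx int) int {
	return chunkLen - sectionIdx*sectionEntrySize
}

func main() {
	const chunkLen = 4096
	fmt.Println(sectionEntryStart(chunkLen, metasSectionIdx))
	fmt.Println(sectionEntryStart(chunkLen, nonIndexedLabelsSectionIdx))
}
```

With named constants, a call site states which section it reads instead of relying on a bare 1 or 2.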

salvacorts · 2023-07-25T08:05:19Z

pkg/chunkenc/symbols.go

+type symbolizer struct {
+	mtx            sync.RWMutex
+	symbolsMap     map[string]uint32
+	labels         []string


I find this name a bit confusing. IIUC, this is an array with all the symbols. I think it should be named symbols.

No, it is a list of labels in string form. Symbols are actually references by index into the deduped list of labels.

salvacorts · 2023-07-25T08:09:05Z

pkg/chunkenc/symbols.go

+
+	idx, ok = s.symbolsMap[lbl]
+	if !ok {
+		idx = uint32(len(s.symbolsMap))


Since idx is the index within the labels slice, I think it would be easier to understand this if we look at the len of labels.

They should have the same length. symbolsMap is used for deduping, while labels is used at query time to find labels referenced by index. I will change it the way you suggested since I am fine either way.

salvacorts · 2023-07-25T08:14:25Z

pkg/chunkenc/symbols.go

+
+	s.compressedSize = len(b)
+
+	for {


IIUC, this loop will read until there is an EOF. Since we might want to put something after the symbols in the future, I think we should rather store the number of symbols first and then read that many.

We already store the number of labels. I will update the code to use it.
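A sketch of the count-prefixed approach discussed here, using uvarint encoding from the Go standard library. The wire layout (uvarint count, then uvarint-length-prefixed strings) and the names encodeLabels and decodeLabels are assumptions for illustration; the PR's real encoding also involves compression, which is omitted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendUvarint appends v to b in uvarint form.
func appendUvarint(b []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(b, tmp[:n]...)
}

// encodeLabels writes a uvarint count followed by length-prefixed strings.
func encodeLabels(labels []string) []byte {
	out := appendUvarint(nil, uint64(len(labels)))
	for _, l := range labels {
		out = appendUvarint(out, uint64(len(l)))
		out = append(out, l...)
	}
	return out
}

// decodeLabels reads exactly the stored number of labels and returns
// whatever bytes follow them untouched.
func decodeLabels(b []byte) (labels []string, rest []byte, err error) {
	n, w := binary.Uvarint(b)
	if w <= 0 {
		return nil, nil, errors.New("bad label count")
	}
	b = b[w:]
	if n > uint64(len(b)) { // every label needs at least one length byte
		return nil, nil, errors.New("label count exceeds input")
	}
	labels = make([]string, 0, n) // preallocate, since the count is known
	for i := uint64(0); i < n; i++ {
		l, w := binary.Uvarint(b)
		if w <= 0 || l > uint64(len(b)-w) {
			return nil, nil, errors.New("truncated label")
		}
		labels = append(labels, string(b[w:w+int(l)]))
		b = b[w+int(l):]
	}
	return labels, b, nil
}

func main() {
	enc := append(encodeLabels([]string{"app", "loki"}), 0xFF) // trailing data
	labels, rest, err := decodeLabels(enc)
	fmt.Println(labels, len(rest), err)
}
```

Reading a known count means the decoder stops at the right place even if more data is appended after the symbols later.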

salvacorts · 2023-07-25T08:18:37Z

pkg/chunkenc/unordered.go

@@ -88,14 +90,14 @@ func (hb *unorderedHeadBlock) UncompressedSize() int {
 	return hb.size
 }

-func (hb *unorderedHeadBlock) Reset() {
-	x := newUnorderedHeadBlock(hb.format)
+func (hb *unorderedHeadBlock) Reset(symbolizer *symbolizer) {


Looks like the same symbolizer is reused after calling Reset.

loki/pkg/chunkenc/memchunk.go

Line 905 in 238b174

c.head.Reset(c.symbolizer)

To minimize passing the symbolizer around, I think it makes more sense to remove the symbolizer argument and just pass hb.symbolizer to newUnorderedHeadBlock.
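A rough sketch of that suggestion, with simplified stand-in types. The symbolizer and headBlock types below are hypothetical reductions for illustration and do not mirror the real unorderedHeadBlock.

```go
package main

import "fmt"

// symbolizer is a placeholder for the shared label table.
type symbolizer struct{ labels []string }

// headBlock is a simplified stand-in for a head block that holds a symbolizer.
type headBlock struct {
	symbolizer *symbolizer
	entries    []string
}

func newHeadBlock(s *symbolizer) *headBlock {
	return &headBlock{symbolizer: s}
}

// Reset clears the block's contents but keeps the symbolizer it already
// holds, so callers do not need to pass it in again.
func (hb *headBlock) Reset() {
	*hb = *newHeadBlock(hb.symbolizer)
}

func main() {
	s := &symbolizer{}
	hb := newHeadBlock(s)
	hb.entries = append(hb.entries, "line 1", "line 2")
	hb.Reset()
	fmt.Println(len(hb.entries), hb.symbolizer == s)
}
```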

salvacorts · 2023-07-25T08:19:55Z

pkg/chunkenc/unordered.go

-	line             string
-	nonIndexedLabels labels.Labels
+	line    string
+	symbols symbols


I think this should be renamed to nonIndexedLabelsSymbols to reflect what the symbols are used for.

salvacorts · 2023-07-25T08:26:22Z

pkg/chunkenc/unordered.go

@@ -570,23 +547,21 @@ func (hb *unorderedHeadBlock) LoadBytes(b []byte) error {
 		lineLn := db.uvarint()
 		line := string(db.bytes(lineLn))

-		var metaLabels labels.Labels
+		var symbols symbols


Same naming comment here. I think it should be named nonIndexedLabelsSymbols.

salvacorts · 2023-07-25T08:31:40Z

pkg/chunkenc/unordered.go

@@ -262,8 +261,8 @@ func (hb *unorderedHeadBlock) Iterator(
 		direction,
 		mint,
 		maxt,
-		func(statsCtx *stats.Context, ts int64, line string, nonIndexedLabels labels.Labels) error {
-			newLine, parsedLbs, matches := pipeline.ProcessString(ts, line, nonIndexedLabels...)
+		func(statsCtx *stats.Context, ts int64, line string, symbols symbols) error {


Same naming comment here. I think it should be named nonIndexedLabelsSymbols.

vlad-diachenko

LGTM. Small comments.

vlad-diachenko · 2023-07-25T11:33:08Z

pkg/chunkenc/symbols.go

+	s := symbolizer{
+		symbolsMap: map[string]uint32{},
+	}
+	numLabels := db.uvarint()


We can read numLabels := db.uvarint() first and initialize symbolsMap and labels with that size to prevent the collections from growing dynamically.

Suggested change

-	s := symbolizer{
-		symbolsMap: map[string]uint32{},
-	}
-	numLabels := db.uvarint()
+	numLabels := db.uvarint()
+	s := symbolizer{
+		symbolsMap: make(map[string]uint32, numLabels),
+		labels:     make([]string, 0, numLabels),
+	}

good catch, not sure how I missed it. Thanks!

vlad-diachenko · 2023-07-25T14:18:02Z

pkg/chunkenc/symbols.go

+	var (
+		readBuf      [10]byte // Enough bytes to store one varint.
+		readBufValid int      // How many bytes are left in readBuf from previous read.
+		s            symbolizer


Can we initialize the symbolizer with a slice and map large enough to store numSymbols?

sandeepsukhani requested a review from a team as a code owner July 24, 2023 15:10

pull-request-size bot added the size/XXL label Jul 24, 2023

implementing string interning to optimize resource usage for storing and processing non-indexed labels

238b174

sandeepsukhani force-pushed the non-indexed-labels-string-interning branch from 805a495 to 238b174 Compare July 24, 2023 15:51

sandeepsukhani added 3 commits July 25, 2023 11:37

fix test and lint

215e85b

fix test and lint again

dc6c466

store length of symbols section before it to be able to skip reading it altogether

811c140

salvacorts reviewed Jul 25, 2023

changes suggested from PR review

216fa53

vlad-diachenko approved these changes Jul 25, 2023

sandeepsukhani added 3 commits July 26, 2023 09:33

changes suggested from PR review

6dc0746

do not convert head block if the format is already unordered

73e1a28

better tests

42fc3f0

sandeepsukhani merged commit 9b554bb into grafana:main Jul 26, 2023
