Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implementing string interning to optimize resource usage for storing and processing non-indexed labels #10044

Merged

Conversation

sandeepsukhani
Copy link
Contributor

What this PR does / why we need it:
In PR #9700, we added support for storing non-indexed labels in chunks. This PR optimizes resource usage for storing and processing non-indexed labels by doing string interning. We will store deduped label names and values as a list in chunks
in the newly added non-indexed labels section. The labels would then be referenced in blocks by their index(called symbols).

Additionally, I have started the convention of writing lengths of sections with their offsets within chunks, making it easier to introduce new sections. The section offsets and lengths would be stored at the end of the chunk, similar to TOC in TSDB.

Checklist

  • Tests updated

@sandeepsukhani sandeepsukhani requested a review from a team as a code owner July 24, 2023 15:10
@sandeepsukhani sandeepsukhani force-pushed the non-indexed-labels-string-interning branch from 805a495 to 238b174 Compare July 24, 2023 15:51
@@ -316,11 +320,11 @@ func (hb *headBlock) LoadBytes(b []byte) error {
return nil
}

func (hb *headBlock) Convert(version HeadBlockFmt) (HeadBlock, error) {
func (hb *headBlock) Convert(version HeadBlockFmt, symbolizer *symbolizer) (HeadBlock, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, all calls to Convert pass nil as the symbolized. I wonder if there is a workaround to not having to pass the symbolizer here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake passing the symbolizer as nil everywhere. I am going to fix it. Nice catch!

metasLen := uint64(0)
if version >= chunkFormatV4 {
// version >= 4 starts writing length of sections after their offsets
metasLen, metasOffset = readSectionLenAndOffset(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't hardcode the index of the section here. but rather have a const defining it.

pkg/chunkenc/memchunk.go Outdated Show resolved Hide resolved
type symbolizer struct {
mtx sync.RWMutex
symbolsMap map[string]uint32
labels []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this name a bit confusing. IIUC, this is an array with all the symbols. I think it should be named symbols

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is a list of labels in string form. symbols are actually references by index to the deduped list of labels.


idx, ok = s.symbolsMap[lbl]
if !ok {
idx = uint32(len(s.symbolsMap))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since idx is the index within the labels slice, I think it would be easier to understand this if we look at the len of labels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should have the same length. symbolsMap is actually for deduping while labels is used at query time to find labels referenced by index. I will change it the way you suggested since I am fine with either ways.


s.compressedSize = len(b)

for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this loop will read until there is an EOF. Since in the future, we might want to put something after the symbols. I think we should rather store the number of symbols first and then read that many.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already store the number of labels. I will update the code to use it.

@@ -88,14 +90,14 @@ func (hb *unorderedHeadBlock) UncompressedSize() int {
return hb.size
}

func (hb *unorderedHeadBlock) Reset() {
x := newUnorderedHeadBlock(hb.format)
func (hb *unorderedHeadBlock) Reset(symbolizer *symbolizer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like, the same symbolizer is reused after calling Reset.

c.head.Reset(c.symbolizer)

To minimize passing the symbolizer around, I think it makes more sense to remove the symbolizer argument and just pass hb.symbolizer to newUnorderedHeadBlock

line string
nonIndexedLabels labels.Labels
line string
symbols symbols
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be renamed to nonIndexedLabelsSymbols to account for what the symbols are used for

@@ -570,23 +547,21 @@ func (hb *unorderedHeadBlock) LoadBytes(b []byte) error {
lineLn := db.uvarint()
line := string(db.bytes(lineLn))

var metaLabels labels.Labels
var symbols symbols
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with naming. I think it should be named nonIndexedLabelsSymbols

@@ -262,8 +261,8 @@ func (hb *unorderedHeadBlock) Iterator(
direction,
mint,
maxt,
func(statsCtx *stats.Context, ts int64, line string, nonIndexedLabels labels.Labels) error {
newLine, parsedLbs, matches := pipeline.ProcessString(ts, line, nonIndexedLabels...)
func(statsCtx *stats.Context, ts int64, line string, symbols symbols) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with naming. I think it should be named nonIndexedLabelsSymbols

Copy link
Contributor

@vlad-diachenko vlad-diachenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small comments

Comment on lines 300 to 303
s := symbolizer{
symbolsMap: map[string]uint32{},
}
numLabels := db.uvarint()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can init symbolsMap and labels with size numLabels := db.uvarint() to prevent dynamic grow of the collections.

Suggested change
s := symbolizer{
symbolsMap: map[string]uint32{},
}
numLabels := db.uvarint()
numLabels := db.uvarint()
s := symbolizer{
symbolsMap: make(map[string]uint32{}, numLabels),
labels: make([]string, 0, numLabels),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, not sure how I missed it. Thanks!

pkg/chunkenc/symbols.go Show resolved Hide resolved
var (
readBuf [10]byte // Enough bytes to store one varint.
readBufValid int // How many bytes are left in readBuf from previous read.
s symbolizer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we init symbolized with slice and map size enough to store numSymbols ?

pkg/chunkenc/unordered.go Show resolved Hide resolved
@sandeepsukhani sandeepsukhani merged commit 9b554bb into grafana:main Jul 26, 2023
sandeepsukhani added a commit that referenced this pull request Jul 26, 2023
…and processing non-indexed labels (#10044)

**What this PR does / why we need it**:
In PR #9700, we added support for storing non-indexed labels in chunks.
This PR optimizes resource usage for storing and processing non-indexed
labels by doing string interning. We will store deduped label names and
values as a list in chunks
in the newly added non-indexed labels section. The labels would then be
referenced in blocks by their index(called symbols).

Additionally, I have started the convention of writing lengths of
sections with their offsets within chunks, making it easier to introduce
new sections. The section offsets and lengths would be stored at the end
of the chunk, similar to
[TOC](https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index/#a-toc)
in TSDB.

**Checklist**
- [x] Tests updated

(cherry picked from commit 9b554bb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants