-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log error message for invalid checksum #1713
Conversation
@cyriltovena Have few queries.
|
pkg/ingester/transfer.go
Outdated
return err | ||
err := instance.consumeChunk(userCtx, lbls, chunk) err != nil { | ||
if err.Error() == "invalid checksum" { | ||
level.Error(util.Logger).Log("msg", "Checksum does not match for chunk, this chunk will be skipped", "err", "invalid checksum", "chunk", chunk.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if logging chuck.String()
is a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not able reproduce this issue(#1346). Can we request the author of this issue to validate from his side using this commit and see what he sees in his logs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this change is necessary since now you're not sending back the invalid error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave the transfer erroring (rollback that file), I don't think this is causing any issue for #1346.
4d734c3
to
f45ec22
Compare
Codecov Report
@@ Coverage Diff @@
## master #1713 +/- ##
==========================================
+ Coverage 64.75% 64.79% +0.04%
==========================================
Files 122 122
Lines 9271 9283 +12
==========================================
+ Hits 6003 6015 +12
- Misses 2853 2854 +1
+ Partials 415 414 -1
|
pkg/storage/iterator.go
Outdated
@@ -385,6 +385,7 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error { | |||
} | |||
level.Debug(log).Log("msg", "loading lazy chunks", "chunks", totalChunks) | |||
|
|||
invalidChunks := []*chunk.Fetcher{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason for creating this array is later when we have to remove invalid chunks, we can iterate only invalid chunks rather than iterating all fetchers and chunks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this.
pkg/storage/iterator.go
Outdated
@@ -403,8 +404,15 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error { | |||
} | |||
chks, err := fetcher.FetchChunks(ctx, chks, keys) | |||
if err != nil { | |||
if err.Error() == chunk.ErrInvalidChecksum.Error() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err.Error() == chunk.ErrInvalidChecksum.Error() { | |
if err == chunk.ErrInvalidChecksum { |
pkg/storage/iterator.go
Outdated
@@ -403,8 +404,15 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error { | |||
} | |||
chks, err := fetcher.FetchChunks(ctx, chks, keys) | |||
if err != nil { | |||
if err.Error() == chunk.ErrInvalidChecksum.Error() { | |||
level.Error(util.Logger).Log("msg", "checksum of chunks does not match", "err", chunk.ErrInvalidChecksum) | |||
invalidChunks = append(invalidChunks, fetcher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this invalid chunks array.
pkg/storage/iterator.go
Outdated
@@ -415,6 +423,13 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error { | |||
}(fetcher, chunks) | |||
} | |||
|
|||
// remove data of any invalid chunk | |||
for _, fetcher := range invalidChunks { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to remove empty chunk from the chunks
variable. And this needs to happens when we have waited for all errors.
after:
var lastErr error
for i := 0; i < len(chksByFetcher); i++ {
if err := <-errChan; err != nil {
lastErr = err
}
}
if lastErr != nil {
return lastErr
}
i := 0 // output index
for _, c := range chunks {
if c.Chunk.Data != nil {
chunks[i] = c
i++
}
}
chunks = chunks[:i]
return nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyriltovena I see a problem here. I may not be correct though.
From the following function https://github.com/grafana/loki/blob/master/pkg/storage/iterator.go#L280, we first partitionBySeriesChunks, which partitions chunks based on overlapping iterator. At this stage say chunksBySeries
has chunks as below
chksBySeries after partition
[0xc00011c400]
[0xc00011c480]
[0xc00011c500 0xc00011c600]
[0xc00011c580]
Later first chunks are loaded from loadFirstChunks
which is passed with value chksBySeries
. After this chksBySeries
looks as below
[0xc00011c400]
[0xc00011c480]
[0xc00011c500 0xc00011c600]
[0xc00011c580]
and in turn fetchLazyChunks
would have picked following chunks
[0xc00011c400 0xc00011c480 0xc00011c500 0xc00011c580]
Even at this point chkBySeries
has below values
[0xc00011c400]
[0xc00011c480]
[0xc00011c500 0xc00011c600]
[0xc00011c580]
Later filterSeriesByMatchers
is called with value chksBySeries
and chksBySeries
has following values
[[0xc00011c500 0xc00011c600][0xc00011c580]]
Now, say 0xc00011c600
chunk returns invalid chunk checksum
when doing fetchLazyChunks
and the chunks picked are [0xc00011c500],[0xc00011c580]
, Iterator is not affected with this because buildIterators
is passed with chksBySeries
which iterates following chunks
[[0xc00011c500 0xc00011c600][0xc00011c580]]
. Since, 0xc00011c600
is having no data, iterator errors saying chunks not loaded
.
May be we should check for nil Data
chunk here https://github.com/grafana/loki/blob/master/pkg/storage/iterator.go#L335 and skip it
OR
Re arrange chkBySeries
after fetching lazy chunks before calling buildIterators
here https://github.com/grafana/loki/blob/master/pkg/storage/iterator.go#L305
PS: I am not sure how efficient I was in conveying the message 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm off for a week. We'll get back to you then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Tried to add a dummy code here https://play.golang.org/p/HFXOy5Ek2xd. This should explain what's actually happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No currently it crashes. I want this function to be changed like we discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was explaining things without pushing code. The code added fails here. The test code is not neat. Just validating things. However, we get into below scenario.
The chunks may be initially chunk1, chunk2, chunk3
after we run this https://github.com/grafana/loki/pull/1713/files#diff-7d61465906fc121b68e860420fcd1c87R440 chunks may be
[chunk1, chunk3]
. However, chksBySeries
here https://github.com/grafana/loki/blob/64a1c0d96bd033d67c634aff195fe1b0b96441c5/pkg/storage/iterator.go#L305 might have chunks as [[chunk1, chunk2][chunk3]]
. So, if chunk2
is having nil
data due to invalid checksum
chksBySeries
don't drop the chunk. Which leads to chunk is not loaded
here https://github.com/grafana/loki/blob/64a1c0d96bd033d67c634aff195fe1b0b96441c5/pkg/chunkenc/lazy_chunk.go#L29
We should re-arrange the chunks before calling buildIterators
https://github.com/grafana/loki/blob/64a1c0d96bd033d67c634aff195fe1b0b96441c5/pkg/storage/iterator.go#L305
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes That's what I tried to explain you should remove the lazy chunk from the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just checking for easy ways to do this Removing a chunk from an array is quiet a costly operation. We would have to remove an element and shift other elements. When we have large number of chunks to load, this can turn out to be time taking process.
One easy way is to label the chunk as invalid
or valid
or may be a boolean. Then in the iterators we can check if the chunk is valid or not. If invalid we can skip and continue.
what's your thought on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyriltovena Need your feedback on this
c427cda
to
64a1c0d
Compare
I think you're still working on this right ? We can close and open a new one if you want to clear things up when you're ready. |
Yeah, I am still working on this. Anything is fine with me. However, since we have had a good discussion on this PR, I would like to get this PR merged rather opening new PR. It's your call finally. |
pkg/storage/iterator.go
Outdated
var validChunks map[*chunkenc.LazyChunk]bool | ||
for _, series := range chksBySeries { | ||
for _, chunks := range series { | ||
for _, chunk := range chunks { | ||
if contains(chunk, allChunks) { | ||
validChunks[chunk] = true | ||
} | ||
} | ||
} | ||
} | ||
|
||
iters, err := buildIterators(ctx, chksBySeries, validChunks, filter, direction, from, through) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyriltovena This is one way we can check and avoid iterating invalid chunks. However, this is going to be O(n^3) operation.
We can add a label to LazyChunk
like below but we will have to change many references and disturb the code a bit.
type LazyChunk struct {
Chunk chunk.Chunk
IsValid bool
Fetcher *chunk.Fetcher
}
What do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type LazyChunk struct {
Chunk chunk.Chunk
IsValid bool
Fetcher *chunk.Fetcher
}
Is less intrusive for this code base IMO I prefer this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks
pkg/storage/iterator.go
Outdated
for i := range chks { | ||
iterators := make([]iter.EntryIterator, 0, len(chks[i])) | ||
for j := range chks[i] { | ||
if _, ok := validChunks[chks[i][j]]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly where you should filter invalid chunk. But I don't like a map. I prefer that you check a boolean on the chunk. I don't think you need a log also, as this will already be logged by the memchunk.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pkg/storage/iterator.go
Outdated
@@ -403,8 +406,14 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error { | |||
} | |||
chks, err := fetcher.FetchChunks(ctx, chks, keys) | |||
if err != nil { | |||
if err.Error() == chunkenc.ErrInvalidChecksum.Error() || strings.Contains(err.Error(), chunkenc.ErrInvalidChecksum.Error()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking err == chunkenc.ErrInvalidChecksum
is not working. We have to do err.Error() == chunkenc.ErrInvalidChecksum.Error()
.
Also, the error might come from cortex from here https://github.com/cortexproject/cortex/blob/02af2474da0cef0cb4076106e03528eed8830166/pkg/chunk/chunk.go#L263. So, we have to check if stack
contains invalid chunk checksum
error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the err not the stringer, as the stringer function could change.
If it can also come from cortex you can also check that since they are exposing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a catch here. Cortex returns errors with different types. Here it is promql
https://github.com/cortexproject/cortex/blob/master/pkg/chunk/chunk_store_utils.go#L165.
So, comparing err == chunkenc.ErrInvalidChecksum
is not possible since chunkenc.ErrInvalidChecksum
is of type errors.Error
and err
will be of type promql.ErrStorage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally understand I think you should make a function.
func isInvalidCheckSum(err error) bool
You could check if it is a promql.ErrStorage
first, if so unwrap the error, then check it.
Bonus point this means this function can be easily testable and you should send it chunkenc.ErrInvalidChecksum from cortex and from Loki and wrapped with promql.ErrStorage and with a random error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. Will add this
pkg/storage/iterator.go
Outdated
func isInvalidChunkError(err error) bool { | ||
err = errors.Cause(err) | ||
if err, ok := err.(promql.ErrStorage); ok { | ||
return err.Err == chunk.ErrInvalidChecksum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the error could be this one and the Loki one ? shouldn't you test that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can add the Loki one as well. It's just the same error. Will change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyriltovena I have made this change. Also, I will rebase on top of this #1914
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the error could be this one and the Loki one ?
I think it can be both. You also need a small test for that function, this was the point.
Almost there !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just use chunk.ErrInvalidChecksum
from cortex here as well https://github.com/grafana/loki/blob/master/pkg/chunkenc/memchunk.go#L209 ?
If any change on cortex, we should make sure to bring that change to Loki as well otherwise we would break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyriltovena Added the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series test Test_store_GetSeries
is failing. Have to identify some other way to do chunk test. Doing this doesn't work out
https://github.com/grafana/loki/blob/7c2274093f7a76f433a6fd32aad2bd0205e7c1c6/pkg/storage/util_test.go#L208
I will re work on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the Batchsize in Test_store_GetSeries
test. That works.
WDYT we should?
7c22740
to
147fb4b
Compare
pkg/storage/iterator_test.go
Outdated
func Test_IsInvalidChunkError(t *testing.T) { | ||
tests := []struct { | ||
name string | ||
err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add a case at least where it is not an invalid chunk error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
pkg/storage/store_test.go
Outdated
@@ -373,7 +423,7 @@ func Test_store_GetSeries(t *testing.T) { | |||
{Labels: mustParseLabels("{foo=\"bar\"}")}, | |||
{Labels: mustParseLabels("{foo=\"bazz\"}")}, | |||
}, | |||
1, | |||
2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
? why do you need to change that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required. Reverted my change. Changed logic here https://github.com/grafana/loki/blob/30a0657e0e7e01617dc14ccd7bb78eae86d4678e/pkg/storage/util_test.go#L209
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still this is a hacky thing. However, I just want to keep this because this is kind of complete testing for chunk along with stream
2e7c297
to
5dd4a5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit scared of those tests changes. I think I prefer a new set of tests only for that change. Doesn't have to be complex at least test isInvalidChunkError
. The rest is less a problem.
@cyriltovena tests for |
}, | ||
}, | ||
}) | ||
secondChunk = newLazyInvalidChunk(logproto.Stream{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
secondChunk = newLazyInvalidChunk(logproto.Stream{ | |
invalid = newLazyInvalidChunk(logproto.Stream{ |
pkg/storage/util_test.go
Outdated
@@ -77,10 +86,11 @@ func newChunk(stream logproto.Stream) chunk.Chunk { | |||
} | |||
chk.Close() | |||
c := chunk.NewChunk("fake", client.Fingerprint(l), l, chunkenc.NewFacade(chk), from, through) | |||
// force the checksum creation | |||
// // force the checksum creation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// // force the checksum creation | |
// force the checksum creation |
}, | ||
} { | ||
t.Run(fmt.Sprintf("%d", i), func(t *testing.T) { | ||
ctx = user.InjectOrgID(context.Background(), "test-user") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See I think this test is sufficient. Remove any changes from the other tests in pkg/storage/store_test.go
and other changes for that tests, let's keep only this one. And rename the chunk variable to invalidChunk
.
@cyriltovena I have given one more try to add tests to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Test are failing I'm fine if you remove all those new tests and leave the one on heapIterator, It cover what we need. |
I have removed the new tests. Have retained |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1713 +/- ##
==========================================
+ Coverage 64.29% 64.30% +0.01%
==========================================
Files 131 131
Lines 9996 10013 +17
==========================================
+ Hits 6427 6439 +12
- Misses 3086 3091 +5
Partials 483 483
|
What this PR does / why we need it:
Skip
chunks
withinvalid checkusm
and log error message.Which issue(s) this PR fixes:
Fixes #1346
Checklist