
store: Prevent store from panic after range object storage #3322

Closed
MegaEle wants to merge 2 commits

Conversation

@MegaEle commented Oct 15, 2020

Signed-off-by: MegaEle ajmy789456123@gmail.com

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

When the store uses an object store exposed through the S3 interface, there is a chance that the loadChunks or loadSeries method panics during a query. We found that the S3 interface sometimes does not return the complete requested range; the store should return an error instead of panicking.
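To make the failure mode concrete, the panic comes from slicing the fetched buffer at an offset that lies past its end. A minimal, self-contained sketch of turning that panic into an error (illustrative only; safeSlice is a hypothetical helper, not the actual change in this PR):

```go
package main

import "fmt"

// safeSlice is a hypothetical helper (not part of Thanos): it returns the
// sub-slice of b starting at offset, or an error instead of letting the
// slice expression panic when the offset lies past the end of the fetched bytes.
func safeSlice(b []byte, offset uint64) ([]byte, error) {
	if offset > uint64(len(b)) {
		return nil, fmt.Errorf("offset %d is beyond the %d fetched bytes", offset, len(b))
	}
	return b[offset:], nil
}

func main() {
	b := make([]byte, 48762) // pretend the object store returned fewer bytes than requested
	if _, err := safeSlice(b, 2723780); err != nil {
		fmt.Println("returning an error instead of panicking:", err)
	}
}
```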

Verification

I built the modified code and compared the behaviour under the same circumstances: the store no longer panics, and the query returns an error like:
Mint: 1601021768826 Maxt: 1602734400000: rpc error: code = Aborted desc = fetch series for block 01EMMWANSPNASD41ER54W9B2WQ: preload chunks: read range for 0: can't read chunk for required length, required 2723780, actually 48762

@pracucci (Contributor) left a comment

Hi!

> When the store uses an object store exposed through the S3 interface, there is a chance that the loadChunks or loadSeries method panics during a query. We found that the S3 interface sometimes does not return the complete requested range; the store should return an error instead of panicking.

I would be interested in hearing more about the issue. The protection you propose here is OK (unless it turns out to add too much complexity), but I'm not sure I understand the root cause of this issue, which is the more interesting part to me. Have you opened an issue with details about it?

@@ -1908,7 +1909,11 @@ func (r *bucketIndexReader) loadSeries(ctx context.Context, ids []uint64, refetc
 	r.mtx.Unlock()
 
 	for i, id := range ids {
-		c := b[id-start:]
+		seriesOffset := id - start
+		if int(seriesOffset) > indexLength {
Contributor

This just checks the offset, but the following code could overread anyway. Am I missing anything?
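For illustration, a fuller guard could check not only the offset but also that the varint-encoded series length fits within the bytes that were actually fetched. A simplified sketch under those assumptions (readSeriesAt and its layout are illustrative, not the real loadSeries):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// readSeriesAt is an illustrative sketch: it validates both the offset and the
// encoded length before slicing, so a later read cannot run past the end of
// the buffer even when the offset check alone passes.
func readSeriesAt(b []byte, offset uint64) ([]byte, error) {
	if offset >= uint64(len(b)) {
		return nil, fmt.Errorf("offset %d is beyond the %d fetched bytes", offset, len(b))
	}
	c := b[offset:]
	l, n := binary.Uvarint(c)
	if n < 1 {
		return nil, fmt.Errorf("reading series length failed")
	}
	end := uint64(n) + l
	if end > uint64(len(c)) {
		return nil, fmt.Errorf("series needs %d bytes but only %d remain", end, len(c))
	}
	return c[uint64(n):end], nil
}

func main() {
	// Offset 0 passes the offset check, but the encoded length claims 1024
	// bytes while nothing follows the length prefix.
	b := []byte{0x80, 0x08}
	if _, err := readSeriesAt(b, 0); err != nil {
		fmt.Println("guarded over-read:", err)
	}
}
```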

@MegaEle (Author)

Hi, we ran into the same problem that is described in the following issue:
#3057

While debugging, I found that the data returned by readIndexRange and readChunkRange is not guaranteed to be at least the required length, but the callers compute offsets as if the full range had been read, so this situation can occur.

We use the S3 interface with Ceph as the underlying storage. I am still looking for the root cause of the incorrect data length, but have made no progress so far. In the meantime, I would like the store not to panic in this case.

I tried modifying the readChunkRange and readIndexRange methods themselves, but other upstream callers do not check the returned data length, so I only changed the places where the panic occurs. :)
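For illustration, the length check described above could also live at the range-read call site. A minimal sketch with a hypothetical wrapper (getRangeChecked is not a Thanos API), assuming the backend may return fewer bytes than requested:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// getRangeChecked is a hypothetical wrapper: it reads up to length bytes from
// an object-range reader and returns an error when the backend delivers a
// short read, instead of letting callers slice past the end later on.
func getRangeChecked(r io.Reader, length int64) ([]byte, error) {
	buf, err := io.ReadAll(io.LimitReader(r, length))
	if err != nil {
		return nil, err
	}
	if int64(len(buf)) < length {
		return nil, fmt.Errorf("short range read: required %d bytes, actually %d", length, len(buf))
	}
	return buf, nil
}

func main() {
	// The backend "returns" only 6 bytes although 10 were requested.
	short := bytes.NewReader([]byte("abcdef"))
	if _, err := getRangeChecked(short, 10); err != nil {
		fmt.Println(err)
	}
}
```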

@GiedriusS (Member) left a comment

@MegaEle do you use radosgw, perchance? I have seen the same thing a few times but never got to the bottom of it. It's probably some kind of bug in that API where it doesn't follow the spec. Have you ever found a way to consistently reproduce it?

Contributor

Not sure if the issue I just encountered is also relevant: #3339

I use radosgw/Ceph, so I can confirm what @GiedriusS says.

stale bot commented Dec 20, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Dec 20, 2020
@GiedriusS (Member)

I think #3621 is a more reasonable defensive coding measure than this one because that other change adds a checker that is closer to the actual problem.

@stale stale bot removed the stale label Dec 21, 2020
stale bot commented Feb 21, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Feb 21, 2021
Base automatically changed from master to main February 26, 2021 16:30
@stale stale bot removed the stale label Feb 26, 2021
@GiedriusS (Member)

This has been fixed by #3795, thank you for your contribution!

@GiedriusS GiedriusS closed this Mar 18, 2021