Add better observability to queryReadiness #5946

DylanGuedes · 2022-04-18T13:31:21Z

What this PR does / why we need it:
Adds a new query_readiness_duration_seconds metric that reports the query readiness duration of a table manager/index gateway instance. We should use it later to compare performance against the ring mode.

Adds a new usersToBeQueryReadyForTotal metric that reports the number of users involved in the query readiness operation. We should use it later to correlate the number of users with the query readiness duration.

Logs the users involved in the query readiness operation. This will be especially useful for the ring mode, so that we can track down the users assigned to each instance.

Logs the duration spent in the query readiness operation.

Which issue(s) this PR fixes:
N/A

Special notes for your reviewer:
N/A

Checklist

Documentation added
Tests updated
Add an entry in the CHANGELOG.md about the changes.

- Add a new `query_readiness_duration_seconds` metric, that reports query readiness duration of a tablemanager/index gateway instance. We should use it later to report performance against the ring mode.
- Add a new `usersToBeQueryReadyForTotal` metric, that reports number of users involved in the query readiness operation. We should use it later to correlate number of users with the query readiness duration.

pkg/storage/stores/shipper/downloads/table_manager.go

pkg/storage/stores/shipper/downloads/metrics.go

pkg/storage/stores/shipper/downloads/table_manager.go

DylanGuedes · 2022-04-19T10:10:24Z

> Is the plan to do this ☝️ soon?

I'm working on it right now; I'm just not sure how long it will take to get it merged, as I'm having some difficulties with the implementation 😅 Do you think it makes sense to remove this metric for now and only add it at a more appropriate time?

cstyan · 2022-04-20T05:01:45Z

Personally I would err on the side of adding it earlier rather than later. It's not that big of a deal to remove a metric later on (especially given that the metric wouldn't make it into a big release for quite a while). OTOH, it can be a pain in the ass to build and deploy a new image in a live environment just because you don't have a metric you need to debug something during an incident.

DylanGuedes · 2022-04-20T16:21:24Z

> Personally I would err on the side of adding it earlier rather than later. It's not that big of a deal to remove a metric later on (especially given that the metric wouldn't make it into a big release for quite a while). OTOH, it can be a pain in the ass to build and deploy a new image in a live environment just because you don't have a metric you need to debug something during an incident.

I agree! But in the end I decided to remove it, just because I already added a log message that serves a similar purpose (i.e., we can work out how the balancing is going using Loki instead of Prometheus).

JordanRushing

LGTM

pkg/storage/stores/shipper/downloads/table_manager.go

DylanGuedes · 2022-04-28T12:40:37Z

FYI: following a suggestion, I decided to remove the queryReadinessDuration metric, as it was redundant with one of the new log lines.

slim-bean

LGTM

pull-request-size bot added the size/S label Apr 18, 2022

sandeepsukhani reviewed Apr 18, 2022

pkg/storage/stores/shipper/downloads/table_manager.go (outdated, resolved)

DylanGuedes marked this pull request as ready for review April 18, 2022 14:33

DylanGuedes requested a review from a team as a code owner April 18, 2022 14:33

cstyan reviewed Apr 19, 2022

pkg/storage/stores/shipper/downloads/metrics.go (outdated, resolved)

pkg/storage/stores/shipper/downloads/table_manager.go (outdated, resolved)

pkg/storage/stores/shipper/downloads/table_manager.go (outdated, resolved)

DylanGuedes added 3 commits April 20, 2022 13:16

Remove usersToBeQueryReadyForTotal.

63321f9

- It will report all users always for now, so it isn't too helpful the way it is.

Rename metric help text to not mislead people.

cbbd4b6

Log queryReadiness duration.

f9c520a

kavirajk requested review from sandeepsukhani and cstyan April 21, 2022 06:05

JordanRushing approved these changes Apr 22, 2022

sandeepsukhani reviewed Apr 25, 2022

pkg/storage/stores/shipper/downloads/table_manager.go (outdated, resolved)

pkg/storage/stores/shipper/downloads/table_manager.go (outdated, resolved)

Fix where log message and duration and triggered.

b4737dc

DylanGuedes mentioned this pull request Apr 25, 2022

Loki: Modifies TableManager to use IndexGateway ring #5972

Merged

4 tasks

JordanRushing mentioned this pull request Apr 25, 2022

Add IndexGateway Query Readiness Duration panel to Loki - Reads Resources dashboard in production/loki-mixin #6014

Merged

4 tasks

DylanGuedes added 2 commits April 26, 2022 10:19

Join users list in a single string.

ddd4380

- This is necessary since go-kit doesn't support array type.

Tweak queryReadiness log messages.

e90a816

- As suggested by Ed on grafana#5972 (comment) and grafana#5972 (comment)
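
The idea behind the "Join users list in a single string" commit above can be sketched as follows. This is a hedged illustration using only the standard library: `joinUsers`, the tenant names, and the log format are hypothetical and not taken from the PR. It shows how a slice of user IDs can be flattened into one string so that it fits a logger that, per the commit note, does not accept array values.

```go
package main

import (
	"fmt"
	"strings"
)

// joinUsers flattens a list of user IDs into one comma-separated string so the
// whole list can be passed as a single log value. Hypothetical helper name.
func joinUsers(users []string) string {
	return strings.Join(users, ",")
}

func main() {
	// Example user IDs, made up for this sketch.
	users := []string{"tenant-a", "tenant-b", "tenant-c"}
	fmt.Printf("msg=%q users=%q\n", "query readiness setup completed", joinUsers(users))
}
```

A single joined string stays searchable in log queries, at the cost of needing a separator that cannot appear inside a user ID.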

DylanGuedes changed the title ~~Add more queryReadiness metrics to the IndexGateway~~ Add better observability to queryReadiness Apr 28, 2022

Ensure queryReadinessDuration metric.

bb1ed79

- It is redundant with a recently added log line.

pull-request-size bot added size/XS and removed size/S labels Apr 28, 2022

noop

b1da770

slim-bean approved these changes Apr 28, 2022

slim-bean merged commit cd02d6a into grafana:main Apr 28, 2022

dannykopping mentioned this pull request Jun 28, 2022

dannykopping/remove cache stats dannykopping/loki#13

Closed

dannykopping mentioned this pull request Jun 28, 2022

dannykopping/remove cache stats dannykopping/loki#14

Draft
