-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add better observability to queryReadiness #5946
Conversation
- Add a new `query_readiness_duration_seconds` metric, that reports query readiness duration of a tablemanager/index gateway instance. We should use it later to report performance against the ring mode - Add a new `usersToBeQueryReadyForTotal` metric, that reports number of users involved in the query readiness operation. We should use it later to correlate number of users with the query readiness duration.
I'm working on it right now, I'm just not sure how long it will take to have this merged as I'm having some difficulties in the implementation 😅 do you think it makes sense to remove this metric for now and only add it in a more appropriate time? |
Personally I would err on the side of adding it earlier rather than later. It's not that big of a deal to remove a metric later on (especially given that the metric wouldn't make it into a big release for quite a while). OTOH, it can be a pain in the ass to build and deploy a new image in a live environment just because you don't have a metric you need to debug something during an incident. |
- It will report all users always for now, so it isn't too helpful the way it is.
I agree! But in the end I decided to remove it just because I already added a log message that will serve a similar purpose (i.e: we can derive/debug how is the balancing going with Loki instead of Prometheus). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- This is necessary since go-kit doesn't support array type.
- As suggested by Ed on grafana#5972 (comment) and grafana#5972 (comment)
- It is redundant with a recently added log line.
fyi: after some suggestion I decided to remove the queryReadinessDuration metric as it was redundant with one of the new log lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What this PR does / why we need it:
Adds a newquery_readiness_duration_seconds
metric, that reports query readiness duration of a tablemanager/index gateway instance. We should use it later to report performance against the ring mode.Adds a newusersToBeQueryReadyForTotal
metric, that reports number of users involved in the query readiness operation. We should use it later to correlate number of users with the query readiness duration.Logs the users involved in the query readiness operation. Will be especially useful for the ring mode so that we can track down the users assigned to each instance.
Logs the duration spent in the query readiness operation.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
N/A
Checklist
CHANGELOG.md
about the changes.