query-scheduler querier inflight requests: convert to summary metric #8417

Conversation

@francoposa (Member) commented Jun 18, 2024

What this PR does

Using a gauge meant we almost never saw the peak values in the scraped data.

Using this summary instead and querying it like

sum by(query_component) (last_over_time(cortex_query_scheduler_querier_inflight_requests{quantile="0.99"}[1m]))

gives us a much better view of the real-time peak values of these statistics, which will be used for query-scheduler load-balancing decisions.
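
For reference, a summary with explicit quantile objectives can be declared roughly like this with client_golang. This is a minimal sketch: the objectives, MaxAge, AgeBuckets, and help text below are illustrative assumptions, not necessarily the exact values in this PR.

```go
package queue

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newQuerierInflightRequestsSummary is a hedged sketch of declaring the
// summary; the merged code's exact options may differ.
func newQuerierInflightRequestsSummary(reg prometheus.Registerer) *prometheus.SummaryVec {
	return promauto.With(reg).NewSummaryVec(prometheus.SummaryOpts{
		Name: "cortex_query_scheduler_querier_inflight_requests",
		Help: "Number of inflight requests being processed by queriers, sampled by the query-scheduler.",
		// Track high quantiles so scrapes can see peak values.
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		MaxAge:     time.Minute,
		AgeBuckets: 6,
	}, []string{"query_component"})
}
```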

Which issue(s) this PR fixes or relates to

Previously, queries on these metrics never reached their theoretical max (equal to the total number of querier worker connections), even though we could see in the logs that the value was maxed out. This is fixed now:

[screenshot: inflight requests metric now reaching its theoretical max]

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@francoposa francoposa changed the title query scheduler querier inflight requests convert to summary metric query-scheduler querier inflight requests: convert to summary metric Jun 18, 2024
@francoposa francoposa marked this pull request as ready for review June 18, 2024 21:20
@francoposa francoposa requested a review from a team as a code owner June 18, 2024 21:20
@francoposa francoposa enabled auto-merge (squash) June 18, 2024 21:20
@chencs (Contributor) left a comment

Changelog?

@francoposa (Member, Author)

> Changelog?

We didn't add a changelog entry yet because I knew this would probably still have issues, so I wasn't going to changelog it until it was more stable.

Will fix the weird period issues.

@francoposa francoposa requested a review from chencs June 18, 2024 22:11
@pracucci (Collaborator) left a comment

I'm not sure you're implementing it the right way. Please see my comment on Slack.

Comment on lines 269 to 270
case <-q.observeInflightRequests:
q.processObserveInflightRequests()
Contributor
Isn't the dispatcher loop supposed to be what dispatches queries? Can we move this directly into the running goroutine (which means we can also remove observeInflightRequests)?

@francoposa (Member, Author), Jun 21, 2024

The dispatcher loop actually does everything: it processes querier operation updates, the every-5-seconds notifications to forgetDisconnectedQueriers, and the requestsSent and requestsCompleted notifications, in exactly the same manner as it handles these observeInflightRequests notifications.

I had previously done something similar to your idea for the inflight request tracking, instead of the requestsSent and requestsCompleted notification channels - essentially using atomics to observe and update from outside the dispatcherLoop instead of message passing. Charles wanted to avoid atomics if at all possible, though we didn't do any benchmarking of atomic vs. non-atomic approaches under various conditions.

I am not necessarily convinced that using atomics to observe these metrics would be a hot path, since the dispatcherLoop is already single-threaded, but the channel approach is consistent with what we already have.
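
For illustration, the atomics approach discussed here could look roughly like the sketch below; the type, field, and method names are assumptions, not the PR's code.

```go
import "sync/atomic"

// inflightRequestCounts tracks per-query-component inflight requests with
// atomics, so they can be read from outside the dispatcherLoop without
// message passing. Illustrative sketch only.
type inflightRequestCounts struct {
	ingester     atomic.Int64
	storeGateway atomic.Int64
}

// increment is called when a request is sent to a querier.
func (c *inflightRequestCounts) increment(isIngester, isStoreGateway bool) {
	if isIngester {
		c.ingester.Add(1)
	}
	if isStoreGateway {
		c.storeGateway.Add(1)
	}
}

// decrement is called when a request completes.
func (c *inflightRequestCounts) decrement(isIngester, isStoreGateway bool) {
	if isIngester {
		c.ingester.Add(-1)
	}
	if isStoreGateway {
		c.storeGateway.Add(-1)
	}
}
```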

Contributor

I don't have a strong opinion on atomics vs. channels. Perhaps sticking with channels lets us keep the same pattern as the rest of the struct.

> The dispatcher loop actually does everything: it processes querier operation updates, the every-5-seconds notifications to forgetDisconnectedQueriers, and the requestsSent and requestsCompleted notifications, in exactly the same manner as it handles these observeInflightRequests notifications.

The reason I brought it up was so that we don't shove even more responsibility into that function. I see Marco has suggested breaking it out into a separate goroutine so that the frequency is more predictable; that also seems good.

@francoposa (Member, Author) commented Jun 21, 2024

So I have updated this to be a summary observed on a 250ms ticker, just like the scheduler inflight requests.

I also tried adding a separate summary metric, observed every time these values change, intended to show a more real-time view by using a shorter MaxAge, more AgeBuckets for higher granularity, and only the 99th and 99.9th percentiles.

Unfortunately it is still not nearly granular enough for this stat to accurately reflect the moments when we cross some TBD utilization threshold. While I was able to get it to rise faster than the 250ms-observed summary, it would not fall as fast, for reasons I have not quite figured out, so it does not end up providing the "more real-time" view I was seeking and just confuses things when compared with the more standard summary.

For the 250ms ticker observation, we mirror the TimerService implementation but implement running ourselves, because we need to manage two tickers: the existing 5-second ticker for forgetDisconnectedQueriers and the new 250ms metrics observation ticker.
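
A rough sketch of the two-ticker running loop described above, under the assumption that each tick is forwarded to the dispatcher loop as a channel notification; the channel names and intervals are taken from this comment, not from the merged code.

```go
// running replaces the plain TimerService loop because we need two tickers:
// one for forgetting disconnected queriers, one for observing inflight requests.
// Sketch only; assumes RequestQueue has the two notification channels below.
func (q *RequestQueue) running(ctx context.Context) error {
	forgetTicker := time.NewTicker(5 * time.Second)
	defer forgetTicker.Stop()

	observeTicker := time.NewTicker(250 * time.Millisecond)
	defer observeTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil
		case <-forgetTicker.C:
			// Notify the dispatcher loop to forget disconnected queriers.
			q.forgetDisconnectedQueriers <- struct{}{}
		case <-observeTicker.C:
			// Notify the dispatcher loop to observe the current inflight
			// request counts into the summary.
			q.observeInflightRequests <- struct{}{}
		}
	}
}
```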

Here's the updated summary working:
[screenshot: updated summary metric working as expected]

@@ -239,6 +266,8 @@ func (q *RequestQueue) dispatcherLoop() {
case <-q.stopRequested:
// Nothing much to do here - fall through to the stop logic below to see if we can stop immediately.
stopping = true
case <-q.observeInflightRequests:
Collaborator

We need a dedicated goroutine just to track it at fixed, regular intervals (keep it simple: use a timer, don't publish messages through a chan). The problem with doing it here is that the frequency is not guaranteed, because the for loop could be busy doing other stuff.

@francoposa (Member, Author)

OK, Charles had requested I avoid atomics, but I will switch back to the atomics I had used originally.

@francoposa francoposa requested a review from Logiraptor June 25, 2024 18:05
@francoposa (Member, Author)

Planning to convert back to atomics:
Channels were used to avoid locking, but Marco's vision for this metric is for it to be similar to schedulerInflightRequests, where being observed at a regular interval is essential to the interpretation of the metric.

While I think it would be rare for the dispatcher loop to block long enough for the scraped metrics to matter, we technically have no way to control or guarantee this. And since the locking works fine in the outer scheduler process, which has much higher concurrency, I don't expect it to be a hotspot for the request queue either.
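
For illustration, the dedicated observation goroutine might look roughly like this, sampling atomic counters into the summary on a fixed interval; the field names, label values, and interval are assumptions for the sketch, not the merged implementation.

```go
// observeInflightRequestsLoop samples the atomic inflight-request counters
// into the summary every 250ms, independently of the dispatcher loop.
// Sketch only; assumes the counters and SummaryVec fields exist on RequestQueue.
func (q *RequestQueue) observeInflightRequestsLoop(ctx context.Context) {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			q.querierInflightRequests.WithLabelValues("ingester").
				Observe(float64(q.inflightIngesterRequests.Load()))
			q.querierInflightRequests.WithLabelValues("store-gateway").
				Observe(float64(q.inflightStoreGatewayRequests.Load()))
		}
	}
}
```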

@francoposa francoposa merged commit d5ed105 into main Jun 25, 2024
29 checks passed
@francoposa francoposa deleted the francoposa/query-scheduler-querier-inflight-requests-convert-to-summary-metric branch June 25, 2024 22:04