query-scheduler querier inflight requests: convert to summary metric #8417
Conversation
Changelog?
We didn't document this yet because I knew it would probably have issues, so I wasn't going to changelog it until it was more stable. Will fix the weird period issues.
I'm not sure you're implementing it the right way. Please see my comment on Slack.
pkg/scheduler/queue/queue.go
Outdated
case <-q.observeInflightRequests:
	q.processObserveInflightRequests()
Isn't the dispatcher loop what's supposed to dispatch queries? Can we move this directly into the `running` goroutine (which means we can also remove `observeInflightRequests`)?
The dispatcher loop actually does everything: it processes querier operation updates and the every-5-seconds notifications to `forgetDisconnectedQueriers`, as well as the `requestsSent` and `requestsCompleted` notifications, in the exact same manner as it handles these `observeInflightRequests`.
I had previously done something similar to your idea for the inflight request tracking, instead of the `requestsSent` and `requestsCompleted` notification channels: essentially, just use atomics to observe and update from outside the dispatcherLoop instead of message passing. Charles wanted to avoid atomics if at all possible, though we didn't do any benchmarking of atomic vs. non-atomic under various conditions.
I am not necessarily convinced that using atomics to observe these metrics would be a hot path, since the dispatcherLoop is already single-threaded, but the channel approach is consistent with what we have already.
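For context, the channel-based pattern described above looks roughly like this. This is a simplified sketch with assumed field names, not the actual Mimir code:

```go
package queue

// All state changes funnel through channels consumed by the single
// dispatcherLoop goroutine, so the inflight count needs no locks or
// atomics. Names here are illustrative.
type RequestQueue struct {
	requestsSent      chan struct{}
	requestsCompleted chan struct{}
	inflight          int // owned exclusively by dispatcherLoop
}

func (q *RequestQueue) dispatcherLoop() {
	for {
		select {
		case <-q.requestsSent:
			q.inflight++
		case <-q.requestsCompleted:
			q.inflight--
		}
	}
}
```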
I don't have a strong opinion on atomics vs. channels. Perhaps sticking with channels allows us to keep the same pattern as the rest of the struct.
> The dispatcher loop actually does everything: it processes querier operation updates and the every-5-seconds notifications to `forgetDisconnectedQueriers`, as well as the `requestsSent` and `requestsCompleted` notifications, in the exact same manner as it handles these `observeInflightRequests`.
The reason I brought it up was so that we don't shove even more responsibility into that function. I see Marco has suggested breaking it out into a separate goroutine so that the frequency is more predictable. That also seems good.
So I have updated this to be a summary updated on a 250ms ticker, just like the scheduler inflight requests.

I also tried a separate summary metric, observed every time these values change, that would show a more real-time view by having a shorter `MaxAge`, more `AgeBuckets` for higher granularity, and only observing the 99th and 99.9th percentiles. Unfortunately it is still not nearly granular enough for this stat to be an accurate reflection of the moments when we cross over some TBD utilization threshold. While I was able to get it to rise faster than the 250ms-observed summary, it would not fall as fast, for reasons I have not quite figured out, so it does not end up providing the "more real-time" view I was seeking and just confuses things when compared with the more standard summary.

For the 250ms ticker observation, we mirror the TimerService implementation but implement …
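As a rough illustration of the kind of summary configuration described above, here is a sketch assuming the Prometheus Go client; the metric name, objectives, `MaxAge`, and `AgeBuckets` values are assumptions, not the PR's actual choices:

```go
package queue

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// newInflightRequestsSummary sketches the "more real-time" summary
// described above. All names and values here are illustrative.
func newInflightRequestsSummary() prometheus.Summary {
	return prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "query_scheduler_querier_inflight_requests",
		Help: "Inflight requests per querier connection.",
		// Track only high quantiles so short-lived peaks stand out.
		Objectives: map[float64]float64{0.99: 0.001, 0.999: 0.0001},
		// A shorter MaxAge with more AgeBuckets decays old observations
		// faster, approximating a more real-time view.
		MaxAge:     30 * time.Second,
		AgeBuckets: 6,
	})
}
```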
pkg/scheduler/queue/queue.go
Outdated
@@ -239,6 +266,8 @@ func (q *RequestQueue) dispatcherLoop() {
case <-q.stopRequested:
	// Nothing much to do here - fall through to the stop logic below to see if we can stop immediately.
	stopping = true
case <-q.observeInflightRequests:
We need a dedicated goroutine just to track it at fixed regular intervals (keep it simple, using a timer, not publishing messages through a chan). The problem with doing it here is that the frequency is not guaranteed, because the `for` loop could be busy doing other stuff.
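A minimal sketch of that suggestion, with assumed field names (the atomic counter is shown in the later sketch below):

```go
package queue

import (
	"sync/atomic"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Assumed fields for the sketch; the real struct differs.
type RequestQueue struct {
	inflightSummary prometheus.Summary
	inflightCount   atomic.Int64
}

// observeInflightLoop samples at a fixed 250ms interval. A dedicated
// ticker goroutine guarantees the observation frequency regardless of
// how busy the dispatcher loop's select is.
func (q *RequestQueue) observeInflightLoop(stop <-chan struct{}) {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			q.inflightSummary.Observe(float64(q.inflightCount.Load()))
		case <-stop:
			return
		}
	}
}
```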
OK. Charles had requested I avoid atomics, but I will switch back to the atomics I had used originally.
Planning to convert back to atomics: while I think it would be rare for it to block long enough for the scraped metrics to matter, we technically do not have any way to control or guarantee this, and since the locking works fine on the outer scheduler process with much higher concurrency, I don't expect it to be a hotspot for the request queue.
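A minimal sketch of the atomics approach, with assumed names; callers update the counter directly instead of sending `requestsSent`/`requestsCompleted` messages through channels to the dispatcher loop:

```go
package queue

import "sync/atomic"

// Assumed field and method names for illustration only.
type RequestQueue struct {
	inflightCount atomic.Int64
}

// These are called from the querier connection handlers, concurrently,
// which is safe because the counter is atomic.
func (q *RequestQueue) requestSentToQuerier()      { q.inflightCount.Add(1) }
func (q *RequestQueue) requestCompletedByQuerier() { q.inflightCount.Add(-1) }
```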
What this PR does
The use of a gauge was preventing us from almost ever seeing the peak values in the scraped data.
Using this summary instead, and querying it accordingly, gives us a better look at real-time peak values for these statistics, which will be used for query-scheduler load balancing decisions.
Which issue(s) this PR fixes or relates to
Previously, querying these metrics never reached their theoretical max (which is equal to all querier worker connections), even though we could see in logs that it was maxed out. This is fixed now:
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.