Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Original vs hedged request metrics #46

Merged
merged 5 commits into from
Aug 10, 2023
Merged

Original vs hedged request metrics #46

merged 5 commits into from
Aug 10, 2023

Conversation

cristaloleg
Copy link
Member

Fixes #42

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly how I was planning to implement it 🙂
LGTM!

@cristaloleg
Copy link
Member Author

Thanks, then I will add tests to cover this case and release a new version soon.

@dannykopping
Copy link
Contributor

Really appreciate it @cristaloleg 🙌

hedged.go Show resolved Hide resolved
@cristaloleg
Copy link
Member Author

Tests are added @dannykopping @joe-elliott . Not sure about thread above, IMO having 2 counters is much simpler and costs nothing, WDYT ?

I will refactor tests in the next PR, looks too wordy now.

hedged_test.go Show resolved Hide resolved
hedged_test.go Show resolved Hide resolved
@cristaloleg cristaloleg merged commit 6504b04 into main Aug 10, 2023
@cristaloleg cristaloleg deleted the orig-vs-hedg-metric branch August 10, 2023 07:18
@dannykopping
Copy link
Contributor

🎉🎉🎉

@dannykopping
Copy link
Contributor

dannykopping commented Aug 17, 2023

I tried to integrate this, but long story short we have many hedged clients but only register the metrics once; this makes it difficult to use the stats because one client might have 0 for HedgedRequestWins() and another might have some other value - which one would we use to update the metrics? The answer is: without major refactoring, we can't.

I solved this by using the return value of the roundtrip:

func (rt *limitedHedgingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	isHedged := hedgedhttp.IsHedgedRequest(req)
	if isHedged {
		if !rt.limiter.Allow() {
			totalRateLimitedHedgeRequests.Inc()
			return nil, ErrTooManyHedgeRequests
		}
		totalHedgeRequests.Inc()
	}

	resp, err := rt.next.RoundTrip(req)
	if err == nil {
		if isHedged {
			requestsWon.WithLabelValues("hedged").Inc()
		} else {
			requestsWon.WithLabelValues("original").Inc()
		}
	}

	return resp, err
}

Edit: here's the PR grafana/loki#10281

@cristaloleg
Copy link
Member Author

Thanks for the update! So...as I understood it's not library's problem but only a case that you've many hedged clients in 1 app. Am I right?

BTW, any numbers how many wons? :D (such info is probably under NDA but I decided to try :D )

@dannykopping
Copy link
Contributor

So...as I understood it's not library's problem but only a case that you've many hedged clients in 1 app. Am I right?

That's correct!

BTW, any numbers how many wons? :D (such info is probably under NDA but I decided to try :D )

I'll update you once I roll this out to production next week 🙂 I'm sure I can send a % of hedging effectiveness with no problems.

@dannykopping
Copy link
Contributor

image
image

This is from a 12 hour range in our pre-prod environment. Seems like we're averaging about a 20% effectiveness rate (i.e. only 1 of every 5 hedged requests wins vs the original).

Do you have a good intuition about what this rate should be in order to justify the extra expense?

@cristaloleg
Copy link
Member Author

Do you have a good intuition about what this rate should be in order to justify the extra expense?

Sorry, no real formulas, someone with a better probability theory background should comment on that.

But the intuition suggests that it should not be high (or even lower than that). Request hedging is only about tail latency, so basically 1-5% of the requests.

I don't think there is are exact numbers for everyone. Sleep between calls, amount of hedged calls and success rate as a result heavily depends on the systems and it's behaviour. The only thing I can really suggest is to play with the numbers.

Probably the target latency should be a const (so, SLO) and after that tweaking sleeps & amount should be adjusted to minimise numbers of calls but keeping latency at the desired level. My 2c.

CC: @storozhukBM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Determining the effectiveness of hedging
3 participants