
Retry with backoff when rate-limited #35

Merged: 6 commits into heroku:master on Nov 7, 2018
Conversation

@mars mars commented Oct 23, 2018

Context

Heroku Platform API:

Proposal

Let's make this client real friendly with HTTP status 429 response errors 🤗 to improve durability of the Terraform Provider.

Implement retry with exponential backoff when rate-limited, via a new RoundTripWithRetryBackoff transport that wraps the default net/http RoundTrip().

Proposed (and now implemented) usage; note the Transport field:

heroku.NewService(&http.Client{
  Transport: &heroku.Transport{
    Username:          c.Email,
    Password:          c.APIKey,
    UserAgent:         heroku.DefaultUserAgent,
    AdditionalHeaders: c.Headers,
    Debug:             debugHTTP,
    Transport:         heroku.RoundTripWithRetryBackoff{
      // These are default values for ExponentialBackOff config
      InitialIntervalSeconds: 30,
      RandomizationFactor:    0.25,
      Multiplier:             2,
      MaxIntervalSeconds:     900,
      MaxElapsedTimeSeconds:  0,
    },
  },
})

The client reacts to the actual HTTP status 429 response by retrying with back-off, instead of trying to predict and back off before being rate-limited. Using the Ratelimit-Remaining header intelligently would seem to require global state shared across Terraform's parallel execution of nodes, so this proposal avoids that seemingly unnecessary complexity.
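
To make the shape of the proposal concrete, here is a condensed sketch of what such a RoundTrip method can look like, assembled from the snippets quoted later in this review and the github.com/cenkalti/backoff API. It is a sketch, not the verbatim merged code: the notify logger and the Clock field are assumptions, and the real implementation applies defaults via helpers like int64WithDefault.

// Assumed imports: fmt, log, net/http, time, github.com/cenkalti/backoff
func (r RoundTripWithRetryBackoff) RoundTrip(req *http.Request) (*http.Response, error) {
	var lastResponse *http.Response
	var lastError error

	retryableRoundTrip := func() error {
		lastResponse, lastError = http.DefaultTransport.RoundTrip(req)
		// Only a 429 response is reported to the retrier, so only
		// rate-limited requests are retried; other errors pass through.
		if lastResponse != nil && lastResponse.StatusCode == 429 {
			return fmt.Errorf("Heroku API rate limited: 429 Too Many Requests")
		}
		return nil
	}

	rateLimitRetryConfig := &backoff.ExponentialBackOff{
		InitialInterval:     time.Duration(r.InitialIntervalSeconds) * time.Second,
		RandomizationFactor: r.RandomizationFactor,
		Multiplier:          r.Multiplier,
		MaxInterval:         time.Duration(r.MaxIntervalSeconds) * time.Second,
		MaxElapsedTime:      time.Duration(r.MaxElapsedTimeSeconds) * time.Second,
		Clock:               backoff.SystemClock,
	}
	rateLimitRetryConfig.Reset()

	notify := func(err error, wait time.Duration) {
		log.Printf("Will retry Heroku API request in %s, because %s", wait, err)
	}

	if err := backoff.RetryNotify(retryableRoundTrip, rateLimitRetryConfig, notify); err != nil {
		return nil, err
	}
	return lastResponse, lastError
}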

Fixes #32

@mars mars force-pushed the retry-with-backoff branch from efd52a4 to 176c52c on October 23, 2018 23:56
@mars mars changed the title from "Retry with backoff" to "Retry with backoff when rate-limited" on Oct 24, 2018
var lastError error

retryableRoundTrip := func() error {
	lastResponse, lastError = http.DefaultTransport.RoundTrip(req)
@sclasen sclasen commented Oct 24, 2018
probably need to check lastError and return it if not nil?

@mars mars (Member Author) replied
lastError is checked just a few lines down in the parent function, because this backoff.Retry function should only return an error when it should be retried (status 429 for this rate-limiting use-case).
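
(For readers following along, the error handling being described sits just after the retry call in the parent RoundTrip method, roughly like this; a paraphrase of the flow, not the exact diff:)

err := backoff.Retry(retryableRoundTrip, rateLimitRetryConfig)
if err != nil {
	// Retries were exhausted while still rate-limited (HTTP 429).
	return nil, err
}
if lastError != nil {
	// Any non-429 transport error is propagated without retrying.
	return nil, lastError
}
return lastResponse, nil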

@talbright talbright (Contributor) commented Oct 24, 2018
You could dry it up by dropping lastError and doing something like this:

retryableRoundTrip := func() (err error) {
	lastResponse, err = http.DefaultTransport.RoundTrip(req)
	// Detect Heroku API rate limiting
	// https://devcenter.heroku.com/articles/platform-api-reference#client-error-responses
	if lastResponse.StatusCode == 429 {
		err = fmt.Errorf("Heroku API rate limited: 429 Too Many Requests")
	}
	return err
}

That would also mean that you would only need to handle the error once below...

@mars mars (Member Author) replied
But we only want this to retry for HTTP status 429 responses.

If that function returns other errors, it will retry for them too.

We only want to retry when rate-limited, right?

A contributor replied
Oh yeah 😊

@mars mars commented Oct 24, 2018

I've verified that this retrier is doing the right thing in the Terraform Provider by switching this code to retry on HTTP status 200, and then watching the logs during tests with TF_LOG=DEBUG to see it retrying.

@mars mars commented Oct 25, 2018

To test, I revised the function passed to backoff.RetryNotify:

retryableRoundTrip := func() error {
	lastResponse, lastError = http.DefaultTransport.RoundTrip(req)
	// Detect Heroku API rate limiting
	// https://devcenter.heroku.com/articles/platform-api-reference#client-error-responses
	if lastResponse.StatusCode == 429 {
		return fmt.Errorf("Heroku API rate limited: 429 Too Many Requests")
	}

	// Fake retry testing.
	if lastResponse.StatusCode == 200 {
		return fmt.Errorf("Fake retrier!!!")
	}

	return nil
}

…and can see the randomized backoff working in the logs:

2018/10/25 09:14:00 Will retry Heroku API request in 33.872687193s, because Fake retrier!!!
2018/10/25 09:14:35 Will retry Heroku API request in 46.371947138s, because Fake retrier!!!
2018/10/25 09:15:21 Will retry Heroku API request in 1m46.253268969s, because Fake retrier!!!
2018/10/25 09:17:08 Will retry Heroku API request in 4m0.01901876s, because Fake retrier!!!
2018/10/25 09:21:09 Will retry Heroku API request in 8m21.662152299s, because Fake retrier!!!
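
(Those intervals track the defaults in the configuration above: a roughly 30-second initial interval doubling on each attempt (30s, 60s, 2m, 4m, 8m), with the 0.25 randomization factor providing the jitter.)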

So, my confidence is building that this solution is good 😇

@mars mars force-pushed the retry-with-backoff branch from b2db372 to 4a8c894 on October 26, 2018 21:54
@mars mars commented Nov 5, 2018

I think this is ready to go!

The new retry functionality is not enabled by default: Transport: heroku.RoundTripWithRetryBackoff{} must be set explicitly. So there's very little chance of this affecting existing usage.

Please review and let me know what you think.

@talbright talbright (Contributor) left a comment
LGTM, though I'm not clear on why Reset() has to be called immediately after the struct is created.


// net/http RoundTripper interface, a.k.a. Transport
// https://godoc.org/net/http#RoundTripper
type RoundTripWithRetryBackoff struct {
A contributor commented
✅ good call -- was going to suggest in an earlier commit you allow some configuration to be passed through
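
For context, the configuration being referred to is the set of fields from the usage example at the top of this PR; the full struct presumably looks something like this (field types reconstructed from those names, not copied from the diff):

// net/http RoundTripper interface, a.k.a. Transport
// https://godoc.org/net/http#RoundTripper
type RoundTripWithRetryBackoff struct {
	// Backoff configuration; zero values fall back to the defaults shown
	// in the usage example above (30, 0.25, 2, 900, 0).
	InitialIntervalSeconds int64
	RandomizationFactor    float64
	Multiplier             float64
	MaxIntervalSeconds     int64
	MaxElapsedTimeSeconds  int64
}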

		MaxInterval:    time.Duration(int64WithDefault(r.MaxIntervalSeconds, int64(900))) * time.Second,
		MaxElapsedTime: time.Duration(int64WithDefault(r.MaxElapsedTimeSeconds, int64(0))) * time.Second,
	}
	rateLimitRetryConfig.Reset()
@mars mars (Member Author) replied Nov 7, 2018
I followed the way the backoff module's built-in convenience constructor works:
https://github.com/cenkalti/backoff/blob/adb73d5bf0d9237fab19ff58aebf658449e326df/exponential.go#L92

Not sure it's required, but it seemed prudent to follow the creator's lead 😅
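
For reference, that constructor builds the backoff with the package defaults and then calls Reset() before returning, roughly like this (paraphrased from the linked source, not quoted verbatim):

func NewExponentialBackOff() *ExponentialBackOff {
	b := &ExponentialBackOff{
		InitialInterval:     DefaultInitialInterval,
		RandomizationFactor: DefaultRandomizationFactor,
		Multiplier:          DefaultMultiplier,
		MaxInterval:         DefaultMaxInterval,
		MaxElapsedTime:      DefaultMaxElapsedTime,
		Clock:               SystemClock,
	}
	b.Reset()
	return b
}

Reset() essentially stamps the start time and sets the current interval back to InitialInterval, so calling it right after construction is cheap and leaves the struct in a known-good state.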

A contributor replied
That sounds logical to me 👍


@mars mars commented Nov 7, 2018

Thanks @talbright 😄

@mars mars merged commit 3b96879 into heroku:master Nov 7, 2018
@schneems commented
Hello rate throttlers! I've been poking at this problem of adding rate throttling to the platform-api gem for a bit.

I did some benchmarking, and it looks like if one of the goals is to reduce the total number of requests to the origin server, an exponential backoff strategy doesn't help that much. In my benchmark it has an 80% retry rate. If the primary goal is to not have failures due to rate limiting, then it does a great job there. Here are my benchmark results:

https://github.com/zombocom/rate_throttle_client#ratethrottleclientexponentialbackoff-results-duration-300-minutes-multiplier-12

And here's some of my research and how I went about iterating on a solution with good properties that has a lower retry rate (~3% instead of 80%): https://github.com/schneems/rate_throttle_clients#how-to-write-a-rate-throttling-algorithm

Here's the platform-api PR heroku/platform-api#103.

There's no immediate action needed by y'all but if you're interested in various rate-throttling strategies, it's an active area of research/interest of mine.

@mars mars commented Apr 23, 2020

So thoughtful @schneems 🙌 As you intuited, the reason rate-throttling was implemented here was to avoid terraform apply failures.

Merging this pull request closes: Built-in retry with exponential back-offs