Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BootstrapRefreshInterval #2052

Merged
merged 6 commits into from
Apr 13, 2023
Merged

Add BootstrapRefreshInterval #2052

merged 6 commits into from
Apr 13, 2023

Conversation

ciaozhang
Copy link
Contributor

@ciaozhang ciaozhang commented Apr 6, 2023

Add a new timer with exponential backoff for failing to fetch metadata at bootstrap til we get the first config (30s)

@ciaozhang ciaozhang changed the title retry up to 4 times if no metadata has been successfully retrieved Add BootstrapRefreshInterval Apr 6, 2023
_bootstrapRefreshRetryCount = (_bootstrapRefreshRetryCount < int.MaxValue) ? _bootstrapRefreshRetryCount + 1 : int.MaxValue;

if (_bootstrapRefreshRetryCount <= BootstrapRefreshMaxAttempt)
_syncAfter = DateTimeUtil.Add(now.UtcDateTime, BootstrapRefreshInterval);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we come in here we should log which retry # we're on at a fairly high log level, maybe max-1

catch (Exception firstFetchMetadataFailure)
{
// Refresh interval is BootstrapRefreshInterval
var syncAfter = configManager.GetType().GetField("_syncAfter", BindingFlags.NonPublic | BindingFlags.Instance).GetValue(configManager);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just expose an internal getter here for the tests? Using reflection seems unnecessarily brittle to me. If someone updates this we'll have to wait til runtime to catch the break.

Copy link
Contributor

@TimHannMSFT TimHannMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I think we need range validation on the new public controls. Given these are new configurable settings they should be exposed via the config layer too. Perhaps we can keep internal for now and then expose publicly when we're able to add via config?

@TimHannMSFT
Copy link
Contributor

In general would also be good to ensure the description contains not only what we're doing (I think you cover that pretty well) but why we're doing the work (i.e. because some partners don't have network at bootstrap and thus would, before this fix, be stuck not able to validate tokens for 5 minutes)

@TimHannMSFT
Copy link
Contributor

As discussed in standup, let's remove the public getter/setter for now and not have any new public parts AND let's switch to an exponential backoff with jitter up to 5 min

@ciaozhang ciaozhang requested a review from TimHannMSFT April 12, 2023 17:16
if (_currentConfiguration == null) // Throw an exception if there's no configuration to return.
{
if (_bootstrapRefreshInterval < DefaultRefreshInterval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This evaluation needs to be done AFTER we calculate the time for sync after. e.g. It's possible 1_bootstrapRefreshInterval` here will be 256 seconds which is less than 5 minutes but THEN we will double it on line 200 resulting in >8 minutes of wait time so it will go from ~4min of wait to ~8 min of wait, then back down to 5 minutes of wait. DefaultRefreshInterval should always be the max for the failure condition retry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, when this is done in the correct spot, I think it should be keying off he MinimulmRefreshInterval, not the default. Reason being we don't want this more aggressive strategy to be LESS aggressive than the configured default. We seem to currently allow folks to set this to retry every second which seems a bit dangerous but at least that won't be the default.

@brentschmaltz do you think we should look into a higher min refresh time? Allowing people to refresh every second on error seems awfully agressive. NOTE: we don't have to close on that here but want to flag it.

{
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes.
_bootstrapRefreshInterval += _bootstrapRefreshInterval;
_syncAfter = DateTimeUtil.Add(now.UtcDateTime, _bootstrapRefreshInterval + TimeSpan.FromMilliseconds(new Random().Next((int)(_bootstrapRefreshInterval.TotalMilliseconds / 2))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look at the link @victorm-hernandez shared: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

It seems to suggest the "Full Jitter" or "Decorrelated Jitter" approach. I think what you have is something close to "Equal Jitter" which is mostly okay but let's make sure to cap it at the RefreshInterval to avoid cases where the more aggressive mode actually waits LONGER in between calls than the steady state.

Copy link
Contributor

@TimHannMSFT TimHannMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to see us cap out at the RefreshInterval with exponential backoff til then. Current implementation has strange, unintuitive behavior from a client perspective.

if (_bootstrapRefreshInterval < DefaultRefreshInterval)
{
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes.
_bootstrapRefreshInterval += _bootstrapRefreshInterval;
Copy link
Contributor

@victorm-hernandez victorm-hernandez Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bootstrapRefreshInterval

This increment is not enough to be called exponential. You can do something like Math.Pow(2, retryAttempt) #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this identical? Doubling each time is the same as (2 ** retry attempt) if we start at 1

2 ** 0 == 1 == 1
2 ** 1 == 2 == 1+1
2 ** 2 == 4 == 2+2
2 * 3 == 8 == 4+4
...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind this, Tim pointed out this is actually equivalent. Since you are duplicating the value each time.

if (_currentConfiguration == null) // Throw an exception if there's no configuration to return.
{
if (_bootstrapRefreshInterval < DefaultRefreshInterval)
Copy link
Contributor

@victorm-hernandez victorm-hernandez Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bootstrapRefreshInterval

We should check the type of exception, only transient HTTP errors should be included here not every single exception. #WontFix

{
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes.
_bootstrapRefreshInterval += _bootstrapRefreshInterval;
_syncAfter = DateTimeUtil.Add(now.UtcDateTime, _bootstrapRefreshInterval + TimeSpan.FromMilliseconds(new Random().Next((int)(_bootstrapRefreshInterval.TotalMilliseconds / 2))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bootstrapRefreshInterval.TotalMilliseconds

The jitter could be increased as well, up to 3 secs maybe?

if (_bootstrapRefreshInterval < DefaultRefreshInterval)
{
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes.
_bootstrapRefreshInterval += _bootstrapRefreshInterval;
Copy link
Contributor

@victorm-hernandez victorm-hernandez Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bootstrapRefreshInterval

This value is never reset, we should set it back to the default value upon successful retrieval of the config. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only bootstrap once. Once we get a config we use that stale config forever and only retry every RefreshInterval

{
var context = new CompareContext($"{this}.BootstrapRefreshIntervalTest");

var documentRetriever = new HttpDocumentRetriever(HttpResponseMessageUtils.SetupHttpClientThatReturns("OpenIdConnectMetadata.json", HttpStatusCode.NotFound));
Copy link
Contributor

@victorm-hernandez victorm-hernandez Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not found is not a transient error. We should not retry if this happens. This means that the URL we are using is wrong. #WontFix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have a collection of HTTP status codes that are transient and those that are permanent to decide if a retry is required.

In addition we should also have a list of .NET exceptions for scenarios like timeouts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we should take into consideration retriable vs non-retriable errors but IMO can wait til a subsequent interation.

@@ -190,15 +191,32 @@ public async Task<T> GetConfigurationAsync(CancellationToken cancel)
catch (Exception ex)
Copy link
Contributor

@victorm-hernandez victorm-hernandez Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception

The change here is limiting when the next refresh can be requested rather than performing another request.

Have we discussed the possibility of doing the retry in here? #WontFix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the initial fetch already has a retry so that would be multiple levels of retries. That's not to say we shouldn't do this but my preference would be to have that as a separate change.

Copy link
Contributor

@victorm-hernandez victorm-hernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants