-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BootstrapRefreshInterval #2052
Conversation
src/Microsoft.IdentityModel.Protocols/Configuration/ConfigurationManager.cs
Outdated
Show resolved
Hide resolved
_bootstrapRefreshRetryCount = (_bootstrapRefreshRetryCount < int.MaxValue) ? _bootstrapRefreshRetryCount + 1 : int.MaxValue; | ||
|
||
if (_bootstrapRefreshRetryCount <= BootstrapRefreshMaxAttempt) | ||
_syncAfter = DateTimeUtil.Add(now.UtcDateTime, BootstrapRefreshInterval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we come in here we should log which retry # we're on at a fairly high log level, maybe max-1
catch (Exception firstFetchMetadataFailure) | ||
{ | ||
// Refresh interval is BootstrapRefreshInterval | ||
var syncAfter = configManager.GetType().GetField("_syncAfter", BindingFlags.NonPublic | BindingFlags.Instance).GetValue(configManager); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just expose an internal getter here for the tests? Using reflection seems unnecessarily brittle to me. If someone updates this we'll have to wait til runtime to catch the break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. I think we need range validation on the new public controls. Given these are new configurable settings they should be exposed via the config layer too. Perhaps we can keep internal for now and then expose publicly when we're able to add via config?
In general would also be good to ensure the description contains not only what we're doing (I think you cover that pretty well) but why we're doing the work (i.e. because some partners don't have network at bootstrap and thus would, before this fix, be stuck not able to validate tokens for 5 minutes) |
As discussed in standup, let's remove the public getter/setter for now and not have any new public parts AND let's switch to an exponential backoff with jitter up to 5 min |
if (_currentConfiguration == null) // Throw an exception if there's no configuration to return. | ||
{ | ||
if (_bootstrapRefreshInterval < DefaultRefreshInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This evaluation needs to be done AFTER we calculate the time for sync after. e.g. It's possible 1_bootstrapRefreshInterval` here will be 256 seconds which is less than 5 minutes but THEN we will double it on line 200 resulting in >8 minutes of wait time so it will go from ~4min of wait to ~8 min of wait, then back down to 5 minutes of wait. DefaultRefreshInterval should always be the max for the failure condition retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, when this is done in the correct spot, I think it should be keying off he MinimulmRefreshInterval, not the default. Reason being we don't want this more aggressive strategy to be LESS aggressive than the configured default. We seem to currently allow folks to set this to retry every second which seems a bit dangerous but at least that won't be the default.
@brentschmaltz do you think we should look into a higher min refresh time? Allowing people to refresh every second on error seems awfully agressive. NOTE: we don't have to close on that here but want to flag it.
{ | ||
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes. | ||
_bootstrapRefreshInterval += _bootstrapRefreshInterval; | ||
_syncAfter = DateTimeUtil.Add(now.UtcDateTime, _bootstrapRefreshInterval + TimeSpan.FromMilliseconds(new Random().Next((int)(_bootstrapRefreshInterval.TotalMilliseconds / 2)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a look at the link @victorm-hernandez shared: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
It seems to suggest the "Full Jitter" or "Decorrelated Jitter" approach. I think what you have is something close to "Equal Jitter" which is mostly okay but let's make sure to cap it at the RefreshInterval
to avoid cases where the more aggressive mode actually waits LONGER in between calls than the steady state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to see us cap out at the RefreshInterval with exponential backoff til then. Current implementation has strange, unintuitive behavior from a client perspective.
if (_bootstrapRefreshInterval < DefaultRefreshInterval) | ||
{ | ||
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes. | ||
_bootstrapRefreshInterval += _bootstrapRefreshInterval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this identical? Doubling each time is the same as (2 ** retry attempt) if we start at 1
2 ** 0 == 1 == 1
2 ** 1 == 2 == 1+1
2 ** 2 == 4 == 2+2
2 * 3 == 8 == 4+4
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind this, Tim pointed out this is actually equivalent. Since you are duplicating the value each time.
if (_currentConfiguration == null) // Throw an exception if there's no configuration to return. | ||
{ | ||
if (_bootstrapRefreshInterval < DefaultRefreshInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes. | ||
_bootstrapRefreshInterval += _bootstrapRefreshInterval; | ||
_syncAfter = DateTimeUtil.Add(now.UtcDateTime, _bootstrapRefreshInterval + TimeSpan.FromMilliseconds(new Random().Next((int)(_bootstrapRefreshInterval.TotalMilliseconds / 2)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (_bootstrapRefreshInterval < DefaultRefreshInterval) | ||
{ | ||
// Adopt exponential backoff for bootstrap refresh interval with a jitter if it is not longer than 5 minutes. | ||
_bootstrapRefreshInterval += _bootstrapRefreshInterval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only bootstrap once. Once we get a config we use that stale config forever and only retry every RefreshInterval
{ | ||
var context = new CompareContext($"{this}.BootstrapRefreshIntervalTest"); | ||
|
||
var documentRetriever = new HttpDocumentRetriever(HttpResponseMessageUtils.SetupHttpClientThatReturns("OpenIdConnectMetadata.json", HttpStatusCode.NotFound)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not found is not a transient error. We should not retry if this happens. This means that the URL we are using is wrong. #WontFix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have a collection of HTTP status codes that are transient and those that are permanent to decide if a retry is required.
In addition we should also have a list of .NET exceptions for scenarios like timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree we should take into consideration retriable vs non-retriable errors but IMO can wait til a subsequent interation.
@@ -190,15 +191,32 @@ public async Task<T> GetConfigurationAsync(CancellationToken cancel) | |||
catch (Exception ex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the initial fetch already has a retry so that would be multiple levels of retries. That's not to say we shouldn't do this but my preference would be to have that as a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new timer with exponential backoff for failing to fetch metadata at bootstrap til we get the first config (30s)