Issue with _currentConfiguration being null after the first call #2050

dmitrymal · 2023-04-05T19:22:15Z

Summary:
We ran into an issue with our Azure Function that has a very high throughput. With high volume functions a lot of new function instances start and go. Sometime on the startup of a new instance, the very first call to the GetConfigurationAsync fails (e.g. OKTA server was unavailable).
This causes all the subsequent calls to fail for the next 5 minutes.

Details:
When this line var configuration = await _configRetriever.GetConfigurationAsync(MetadataAddress, _docRetriever, CancellationToken.None).ConfigureAwait(false); fails, the exception is caught and _syncAfter is set to now + 5 mins. However, _currentConfiguration is still null.

On the subsequent calls the condition if (_syncAfter <= now) is true and the code falls through to this line

azure-activedirectory-identitymodel-extensions-for-dotnet/src/Microsoft.IdentityModel.Protocols/Configuration/ConfigurationManager.cs

Line 209 in 614642a

throw LogHelper.LogExceptionMessage(

causing the another exception.

This keeps happing for 5 mins until _syncAfter becomes > than now
In our case this behavior causes 10 - 50K exceptions in 5 min timeframe.

Proposed fix:
add || _currentConfiguration == null condition to the following if statement:

azure-activedirectory-identitymodel-extensions-for-dotnet/src/Microsoft.IdentityModel.Protocols/Configuration/ConfigurationManager.cs

Line 170 in 614642a

if (_syncAfter <= now)

Something like this:
if (_currentConfiguration == null || _syncAfter <= now)

Question:
Before this is fixed, is there any workaround that you could suggest?

Disclaimer:
All this analysis is done just by reviewing the code, so I might have overlooked something. Let me know if this sounds correct to you.

The text was updated successfully, but these errors were encountered:

brentschmaltz · 2023-04-10T17:39:33Z

@dmitrymal we are modifying this so that until configuration is obtained, we will retry at a 30 second interval 4 times.
Does this seem reasonable?

dmitrymal · 2023-04-10T19:51:45Z

@dmitrymal we are modifying this so that until configuration is obtained, we will retry at a 30 second interval 4 times. Does this seem reasonable?

It is your call, but personally, I would have left it up to the client to decide what retry strategy to use. '30 second interval 4 times' seems just a random decision and may not work for all clients. For instance, in our case with the rate of 10K requests per minute, we might end up with 40K queued requests waiting for a retry (which is probably not good).

After all, the issue is not about not being able to get the configuration or retry logic. The issue is in the code which does not check whether or not the _currentConfiguration is null.

In my opinion, the code should check if currentConfiguration == null and if it is, try to get it. The code should not simply throw an exception leaving the client without any option to fix the issue. This logical bug should be fixed regardless of retry logic. The easiest and quickest way to fix it is to have it something like this:
if (_currentConfiguration == null || _syncAfter <= now)

I could create a PR with the proposed changes if that would help.

By the way, I see you labeled it as an 'Enhancement'. I would consider this as a Bug (not an Enhancement). Thoughts?

brentschmaltz · 2023-04-11T02:21:03Z

@dmitrymal A request has arrived and we need to validate the token. Metadata may be needed since TokenValidationParameters.ConfigurationManager is not null the user must have set it.
If metadata cannot be obtained, we may not be able to validate the token, but we will try.

It seems correct that if ConfigurationManger never obtained metadata, we should try again next time through.
A simple change would be to not set _syncAfter in the exception block here.

I need to circle back with the team to understand why continuous attempts without a throttle were not acceptable.

brentschmaltz · 2023-04-11T17:55:53Z

@dmitrymal we discussed this as a team and landed on exponential backoff.

dmitrymal · 2023-04-11T18:58:50Z

@dmitrymal we discussed this as a team and landed on exponential backoff.

The retry logic might help (in some cases). However, the client code should have control of the retry logic. Otherwise, it might lead to more issues. At the very least the client should be able to set/define retry parameters.

Anyway, all this (the retry discussion) is a separate concern and has nothing to do with the reported issue in the code.
If _currentConfiguration ends up to be null even after retries, the client would still run into the same issue.

Any thoughts on fixing that?

brentschmaltz · 2023-04-13T15:46:13Z

@dmitrymal we have implemented a exponential increase in this PR #2052

We should leave this open to add the ability for users to set their own algorithm as you are correct in that we can assume to understand all the scenarios.

brentschmaltz added Enhancement The issue is a new feature Customer reported Indicates issue was opened by customer labels Apr 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with _currentConfiguration being null after the first call #2050

Issue with _currentConfiguration being null after the first call #2050

dmitrymal commented Apr 5, 2023

brentschmaltz commented Apr 10, 2023

dmitrymal commented Apr 10, 2023 •

edited

Loading

brentschmaltz commented Apr 11, 2023

brentschmaltz commented Apr 11, 2023

dmitrymal commented Apr 11, 2023

brentschmaltz commented Apr 13, 2023

Issue with _currentConfiguration being null after the first call #2050

Issue with _currentConfiguration being null after the first call #2050

Comments

dmitrymal commented Apr 5, 2023

brentschmaltz commented Apr 10, 2023

dmitrymal commented Apr 10, 2023 • edited Loading

brentschmaltz commented Apr 11, 2023

brentschmaltz commented Apr 11, 2023

dmitrymal commented Apr 11, 2023

brentschmaltz commented Apr 13, 2023

dmitrymal commented Apr 10, 2023 •

edited

Loading