Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with _currentConfiguration being null after the first call #2050

Open
dmitrymal opened this issue Apr 5, 2023 · 6 comments
Open

Issue with _currentConfiguration being null after the first call #2050

dmitrymal opened this issue Apr 5, 2023 · 6 comments
Labels
Customer reported Indicates issue was opened by customer Enhancement The issue is a new feature

Comments

@dmitrymal
Copy link

Summary:
We ran into an issue with our Azure Function that has a very high throughput. With high volume functions a lot of new function instances start and go. Sometime on the startup of a new instance, the very first call to the GetConfigurationAsync fails (e.g. OKTA server was unavailable).
This causes all the subsequent calls to fail for the next 5 minutes.

Details:
When this line var configuration = await _configRetriever.GetConfigurationAsync(MetadataAddress, _docRetriever, CancellationToken.None).ConfigureAwait(false); fails, the exception is caught and _syncAfter is set to now + 5 mins. However, _currentConfiguration is still null.

On the subsequent calls the condition if (_syncAfter <= now) is true and the code falls through to this line


causing the another exception.

This keeps happing for 5 mins until _syncAfter becomes > than now
In our case this behavior causes 10 - 50K exceptions in 5 min timeframe.

Proposed fix:
add || _currentConfiguration == null condition to the following if statement:


Something like this:
if (_currentConfiguration == null || _syncAfter <= now)

Question:
Before this is fixed, is there any workaround that you could suggest?

Disclaimer:
All this analysis is done just by reviewing the code, so I might have overlooked something. Let me know if this sounds correct to you.

@brentschmaltz
Copy link
Member

@dmitrymal we are modifying this so that until configuration is obtained, we will retry at a 30 second interval 4 times.
Does this seem reasonable?

@brentschmaltz brentschmaltz added Enhancement The issue is a new feature Customer reported Indicates issue was opened by customer labels Apr 10, 2023
@dmitrymal
Copy link
Author

dmitrymal commented Apr 10, 2023

@dmitrymal we are modifying this so that until configuration is obtained, we will retry at a 30 second interval 4 times. Does this seem reasonable?

It is your call, but personally, I would have left it up to the client to decide what retry strategy to use. '30 second interval 4 times' seems just a random decision and may not work for all clients. For instance, in our case with the rate of 10K requests per minute, we might end up with 40K queued requests waiting for a retry (which is probably not good).

After all, the issue is not about not being able to get the configuration or retry logic. The issue is in the code which does not check whether or not the _currentConfiguration is null.

In my opinion, the code should check if currentConfiguration == null and if it is, try to get it. The code should not simply throw an exception leaving the client without any option to fix the issue. This logical bug should be fixed regardless of retry logic. The easiest and quickest way to fix it is to have it something like this:
if (_currentConfiguration == null || _syncAfter <= now)

I could create a PR with the proposed changes if that would help.

By the way, I see you labeled it as an 'Enhancement'. I would consider this as a Bug (not an Enhancement). Thoughts?

@brentschmaltz
Copy link
Member

@dmitrymal A request has arrived and we need to validate the token. Metadata may be needed since TokenValidationParameters.ConfigurationManager is not null the user must have set it.
If metadata cannot be obtained, we may not be able to validate the token, but we will try.

It seems correct that if ConfigurationManger never obtained metadata, we should try again next time through.
A simple change would be to not set _syncAfter in the exception block here.

I need to circle back with the team to understand why continuous attempts without a throttle were not acceptable.

@brentschmaltz
Copy link
Member

@dmitrymal we discussed this as a team and landed on exponential backoff.

@dmitrymal
Copy link
Author

@dmitrymal we discussed this as a team and landed on exponential backoff.

The retry logic might help (in some cases). However, the client code should have control of the retry logic. Otherwise, it might lead to more issues. At the very least the client should be able to set/define retry parameters.

Anyway, all this (the retry discussion) is a separate concern and has nothing to do with the reported issue in the code.
If _currentConfiguration ends up to be null even after retries, the client would still run into the same issue.

Any thoughts on fixing that?

@brentschmaltz
Copy link
Member

@dmitrymal we have implemented a exponential increase in this PR #2052

We should leave this open to add the ability for users to set their own algorithm as you are correct in that we can assume to understand all the scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Customer reported Indicates issue was opened by customer Enhancement The issue is a new feature
Projects
None yet
Development

No branches or pull requests

2 participants