Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic http failover base on Srv Dns #442

Merged
merged 39 commits into from
Dec 12, 2023

Conversation

RichardChen820
Copy link
Contributor

@RichardChen820 RichardChen820 commented Aug 1, 2023

Automatic http failover base on Srv Dns

For supporting automatic Http failover, there are SRV DNS record set being added for each replica by RP.

  1. Each replica has _origin._tcp.* SRV record been created to indicate the origin store of the replica
    _origin._tcp. contoso-replica1.azconfig.io --> SRV 1 100 443 contoso.azconfig.io

  2. Each replica has _alt*._tcp.* SRV record created in _alt{i}._tcp record set for querying auto failover hosts.
    _alt0._tcp. contoso.azconfig.io --> SRV 1 100 443 contoso-replica1.azconfig.io
    _alt0._tcp. contoso.azconfig.io --> SRV 1 100 443 contoso-replica1.azconfig.io

    "i" is a self-increment index since each record set only supports to accommodate up to 20 records, if there're 30 replicas, it will dispersed into two _alt* record sets.

Design:

  1. auto failover should be enabled by default, CX can explicitly disable it.
  2. auto failover happens when all clients that passed by CX failed because of any failoverable issue.
  3. If multiple auto failover client are detected, .net provider pick up one from them randomly.

}
else
{
throw new ArgumentNullException(nameof(_endpoints));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we should throw for ArgumentNullException for _endpoints here. Null/empty _connectionStrings could also be the cause for ArgumentNullException here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about throw new ArgumentNullException(nameof(_clients));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks more like an InvalidOperationException if both connectrings and endpoints are not provided.

// Only remove the client if it is not in the user passed connection string or endpoints, as well as not in returned SRV records.
private bool IsEligibleToRemove(IEnumerable<string> srvEndpointHosts, ConfigurationClientWrapper client)
{
if (_connectionStrings != null && _connectionStrings.Any(c => GetHostFromConnectionString(c).Equals(client.Endpoint.Host, StringComparison.OrdinalIgnoreCase)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would parse the endpoint from each connection string everytime we invoke IsEligibleToRemove.
How about storing the list of endpoints instead of connection strings? Looks like we only use the first connection string anyway. Maybe we can store that one connection string separately and extract endpoint from all other connection strings before storing in _endpoint list.
That also makes it simpler to check isUsingConnectionString. We can avoid the check for _connectionStrings.Any() and just check if the single connection string is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this, could you please tale a look again?

endpoint = _endpoints.First();
}

var lookup = new SrvLookupClient(logger);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to create a new SrvLookupClient every time we get auto failover clients? Can we create once and reuse it?

Copy link

@RichardChen1122 RichardChen1122 Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this is also what I want to optimize, the udpClient and tcpClient in it should be reusable. Besides this, I use cache in the SrvLookupClient, if new SrvLookupClient every time, using cache is meanless.


return resultRecords;
}
catch (Exception e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't catch base exception in any case. Only known exceptions should be caught and handled.
Also, InternalQueryAsync is the only method that can throw, so it would be better to wrap only that line of code in its own try/catch block so that its easy to understand where the exception is coming from.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to catch any exceptions here. Dns lookup error is nothing different from other errors, just throw is fine

/// <summary>
/// Flag to indicate whether discover the replicas automatically.
/// </summary>
public bool AutoDiscoverReplica { get; set; } = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public bool AutoDiscoverReplica { get; set; } = true;
public bool EnableReplicaDiscovery { get; set; } = true;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this comment

My suggestion should be ReplicaDiscoveryEnabled in order to avoid a verb as a property name.

}
else
{
throw new ArgumentNullException(nameof(_clients));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_clients is not an argument to this constructor.

IEnumerable<SrvRecord> results = await _srvLookupClient.Value.QueryAsync(endpoint.DnsSafeHost, cancellationToken).ConfigureAwait(false);

// shuffle the results to ensure hosts can be picked randomly.
IEnumerable<string> srvTargetHosts = results.Select(r => $"{r.Target.Value.Trim('.')}").Shuffle().ToList();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind shuffling on the non-transformed enumerable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's that trim doing, just removing a preceding '.' ? If so, may be clearer to be explicit about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RichardChen820

Why was this resolved? Did other comments get resolved without being considered?

@RichardChen820 RichardChen820 force-pushed the user/junbchen/srvDnsClient branch 2 times, most recently from f92fa5a to b6a344d Compare December 4, 2023 09:25
@RichardChen820 RichardChen820 force-pushed the user/junbchen/srvDnsClient branch from b6a344d to bc94a95 Compare December 4, 2023 09:26
NOTICE Show resolved Hide resolved
@RichardChen820 RichardChen820 force-pushed the user/junbchen/srvDnsClient branch from e209865 to cec2f6a Compare December 6, 2023 15:12
@RichardChen820 RichardChen820 force-pushed the user/junbchen/srvDnsClient branch from 0f8a5d2 to 684a374 Compare December 7, 2023 09:32
@RichardChen820 RichardChen820 force-pushed the user/junbchen/srvDnsClient branch from 684a374 to fdaa51f Compare December 7, 2023 09:37
{
await RefreshFallbackClients(cts.Token).ConfigureAwait(false);
}
catch (OperationCanceledException e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't log a warning on cancellation due to dispose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

results = results.Concat(records);

// If we get no record from _alt{i} SRV, we have reached the end of _alt* list
if (records.Count() == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can concact empty. The check can be above to avoid it.

_ = DiscoverFallbackClients();
}

List<ConfigurationClient> clients = new List<ConfigurationClient>(_clients.Where(c => c.BackoffEndTime <= now).Select(c => c.Client));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why copy?

Copy link
Contributor Author

@RichardChen820 RichardChen820 Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure thread safety, the _dynamicClients could be updated asynchronously

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we don't have to worry about this, updated.

_ = DiscoverFallbackClients();
}

List<ConfigurationClient> clients = new List<ConfigurationClient>(_clients.Select(c => c.Client));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that creates copy

@RichardChen820 RichardChen820 force-pushed the user/junbchen/srvDnsClient branch from a141380 to 312ad19 Compare December 8, 2023 09:21

namespace Microsoft.Extensions.Configuration.AzureAppConfiguration
{
internal interface IConfigurationClientManager
{
IEnumerable<ConfigurationClient> GetAvailableClients(DateTimeOffset time);
IEnumerable<ConfigurationClient> GetAvailableClients();

IEnumerable<ConfigurationClient> GetAllClients();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two are odd. Separate method for a subset of data is not great for an interface.
In this case we can always return all and have a property in ConfigurationClient that the caller can filter on instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConfigurationClient type is not defined in this project, it's from .net SDK, we can't add a property to it, we can only extend it or wrap it.

Then it seems would be a little bit overkill since the callers only care about the clients, they don't have to worry about how the clients are provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective , changing this seems doesn't make it much better.
https://github.com/Azure/AppConfiguration-DotnetProvider/pull/502/files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason we need two different subsets of ConfigurationClients (Available vs All)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetAllClients is a new method since 7.0.0 for time-based retry on start-up. When the app starts, there is no longer a fixed number of retries for each client, but as long as the time allows (default is 100s), we will continue to try all clients until a certain client can work or the time runs out. Therefore, it is necessary to obtain all Clients. Of course, we will not keep trying all Clients without limit. Whenever a round of attempts is made and all Clients are invalid, it will back off for a certain period of time and then start trying the next round.

GetAvailableClients is for getting the available clients for refreshing scenario, each client has unique backoff time which is determined by its failure attempts. Hence, we would always get the clients whose back off time is due.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have uniform strategy to use client instances in all cases (startup and refresh). Beside on startup, backoff isn't specified (no one has failed yet). I would assume that at startup GetAvailableClients always return all clients. Am I missing something?

cc @jimmyca15

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we should move the backoff tracking out of ConfigurationClientManager and use ConfigurationClientManager just as factory.
This can be done in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I would merge this one first, go out in 7.1.0-preview first.

@RichardChen820 RichardChen820 merged commit 06f237f into preview Dec 12, 2023
2 checks passed
amerjusupovic pushed a commit that referenced this pull request Feb 21, 2024
* Automatic http failover base on Srv Dns

* update failover condition

* Logging messages for auto failover

* Not use generic method

* Remove useless constructor

* Failover to TCP when response is truncated

* Error handling of TCP failover

* Resolve comments and add thrid party notice

* Fix the tests

* Fix the tess

* Fix the test

* Update to use DnsClient.NET

* Remove unecessary code

* Resolve conflict

* Stop once it has been auto failovered

* Update GetAvaliableClient method

* Resolve comments

* Resolve comments

* Internalize getting dynamic client in ConfigurationClientManager

* Fix format

* Fix format

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Not await RefreshFallbackClients when startup

* Revert non-await RefreshFallbackClients

* Validate domain of Srv endpoints (#489)

* Validate domain of Srv endpoints

* Resolve comments

* Resolve comments

* Create object in using

* Add the origin host to the srv lookup result

* Validate endpoint by domain name (#499)

* Validate endpoint by domain name

* Resolve comments

* Remove *-test

* Fix formatResolve conflict

* Use await using for IAsyncDisposible

* Replace IAsyncEumerable with ValueTask<IEnumerable>

* Resolve comments

* Remove cancallation token

* Resolve comments
amerjusupovic added a commit that referenced this pull request Feb 22, 2024
* Automatic http failover base on Srv Dns (#442)

* Automatic http failover base on Srv Dns

* update failover condition

* Logging messages for auto failover

* Not use generic method

* Remove useless constructor

* Failover to TCP when response is truncated

* Error handling of TCP failover

* Resolve comments and add thrid party notice

* Fix the tests

* Fix the tess

* Fix the test

* Update to use DnsClient.NET

* Remove unecessary code

* Resolve conflict

* Stop once it has been auto failovered

* Update GetAvaliableClient method

* Resolve comments

* Resolve comments

* Internalize getting dynamic client in ConfigurationClientManager

* Fix format

* Fix format

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Resolve comments

* Not await RefreshFallbackClients when startup

* Revert non-await RefreshFallbackClients

* Validate domain of Srv endpoints (#489)

* Validate domain of Srv endpoints

* Resolve comments

* Resolve comments

* Create object in using

* Add the origin host to the srv lookup result

* Validate endpoint by domain name (#499)

* Validate endpoint by domain name

* Resolve comments

* Remove *-test

* Fix formatResolve conflict

* Use await using for IAsyncDisposible

* Replace IAsyncEumerable with ValueTask<IEnumerable>

* Resolve comments

* Remove cancallation token

* Resolve comments

* Refactor and move backoff logic to AzureAppConfigurationProvider (#508)

* move backoff logic to provider, can't access backoffendtime from manager for dynamic clients

* add condition before calling getclients in refreshasync, add refreshclients method

* fix logic to check if backoff ended for client

* fix indents

* check client backoff before refresh, fix tests in progress

* remove tolist

* comment changes, fix refresh tests

* reduce minimum backoff when testing

* fix tests, fix typo with failover logic

* update comments in tests

* resolve PR comments

* resolve small comments

* fix tests backoff duration

* fix tests backoff duration

* fix error from merging main branch in

* add back fix for attempts backoff calculation

---------

Co-authored-by: Richard chen <99175581+RichardChen820@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants