Per-partition automatic failover #41029
base: main
…zure-sdk-for-java into PerPartitionAutomaticFailover
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…rPartitionAutomaticFailover
API change check: APIView has identified API-level changes in this PR and created the following API reviews.
…rPartitionAutomaticFailover
# Conflicts:
# sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/GlobalPartitionEndpointManagerForPerPartitionCircuitBreakerTests.java
# sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/SessionRetryOptionsTests.java
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/perPartitionCircuitBreaker/GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/perPartitionCircuitBreaker/LocationSpecificHealthContextTransitionHandler.java
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java
…array of array of nos.
…rPartitionAutomaticFailover
# Conflicts:
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
…rPartitionAutomaticFailover
# Conflicts:
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
…rPartitionAutomaticFailover
# Conflicts:
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosClientBuilder.java
# sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
…rPartitionAutomaticFailover
…rPartitionAutomaticFailover
@@ -193,7 +193,7 @@ public Mono<ShouldRetryResult> shouldRetry(Exception exception) {
 Duration timeout;
 boolean forceRefreshAddressCache;
 if (isNonRetryableException(exception)) {
-logger.debug("Operation will NOT be retried. Current attempt {}, Exception: ", this.attemptCount,
+logger.warn("Operation will NOT be retried. Current attempt {}, Exception: ", this.attemptCount,
Comment for self: revert warn to debug prior to merge.
@@ -209,7 +209,9 @@ public Mono<ShouldRetryResult> shouldRetry(Exception exception) {
 this.attemptCount,
 exception);
+

 return Mono.just(ShouldRetryResult.noRetry(
+exceptionToThrow = logAndWrapExceptionWithLastRetryWithException(exception);
The idea is to wrap 410s that are non-retriable within the region as 503s.
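As a rough illustration of the wrapping idea in the comment above, the sketch below converts an exception carrying status 410 (Gone) into one carrying status 503 (Service Unavailable) while keeping the original as the cause. The class `GoneWrappingSketch`, the `StatusException` type, and the method `wrapGoneAsServiceUnavailable` are hypothetical names made up for this example; they are not the SDK's exception types, and this is not the body of `logAndWrapExceptionWithLastRetryWithException`.

```java
final class GoneWrappingSketch {
    static final int GONE = 410;
    static final int SERVICE_UNAVAILABLE = 503;

    /** Hypothetical exception type that carries an HTTP-style status code. */
    static final class StatusException extends RuntimeException {
        final int statusCode;

        StatusException(int statusCode, String message, Throwable cause) {
            super(message, cause);
            this.statusCode = statusCode;
        }
    }

    /**
     * Returns a 503 exception wrapping the given 410 exception.
     * Any other status code is returned unchanged.
     */
    static StatusException wrapGoneAsServiceUnavailable(StatusException e) {
        if (e.statusCode != GONE) {
            return e;
        }
        // Keep the original 410 as the cause so diagnostics are not lost.
        return new StatusException(
            SERVICE_UNAVAILABLE,
            "Gone (410) could not be retried within the region",
            e);
    }
}
```

Keeping the 410 as the cause means a caller that only inspects the outer status code sees 503, while the original failure remains available for diagnostics.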
Background
Refer to issue #43143.
Important classes introduced
GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover
This class stores failover information for a partition key range and collection resource id combination. The failover information is stored using a ConcurrentHashMap where the key is of type PartitionKeyRangeWrapper and the value is of type PartitionLevelFailoverInfo. Dependent classes use this information to override the available region for the partition key range and collection resource id combination encapsulated in PartitionKeyRangeWrapper.
Important Sequence diagrams
Reacting to relevant error codes
Adding location override for a data-plane request routed to relevant partition key range
Important Activity Diagrams
Marking a partition key range as Unavailable for a given location
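The bookkeeping described for GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SDK implementation: `RangeKey` and `FailoverInfo` are hypothetical stand-ins for PartitionKeyRangeWrapper and PartitionLevelFailoverInfo, regions are plain strings, and the "move to the next region that has not failed" policy is assumed for the example.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PerPartitionFailoverSketch {

    /** Hypothetical stand-in for PartitionKeyRangeWrapper. */
    static final class RangeKey {
        final String partitionKeyRangeId;
        final String collectionResourceId;

        RangeKey(String partitionKeyRangeId, String collectionResourceId) {
            this.partitionKeyRangeId = partitionKeyRangeId;
            this.collectionResourceId = collectionResourceId;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof RangeKey)) return false;
            RangeKey other = (RangeKey) o;
            return partitionKeyRangeId.equals(other.partitionKeyRangeId)
                && collectionResourceId.equals(other.collectionResourceId);
        }

        @Override public int hashCode() {
            return Objects.hash(partitionKeyRangeId, collectionResourceId);
        }
    }

    /** Hypothetical stand-in for PartitionLevelFailoverInfo. */
    static final class FailoverInfo {
        private final Set<String> failedRegions = ConcurrentHashMap.newKeySet();
        private volatile String current;

        FailoverInfo(String initial) { this.current = initial; }

        /** Records the failure and moves to the first region that has not failed. */
        synchronized boolean tryMoveNext(List<String> regions, String failed) {
            failedRegions.add(failed);
            if (!failed.equals(current)) return true; // already moved off the failed region
            for (String region : regions) {
                if (!failedRegions.contains(region)) { current = region; return true; }
            }
            return false; // every region has failed for this partition
        }

        String current() { return current; }
    }

    private final ConcurrentHashMap<RangeKey, FailoverInfo> overrides = new ConcurrentHashMap<>();
    private final List<String> regions;

    PerPartitionFailoverSketch(List<String> regions) { this.regions = List.copyOf(regions); }

    /** Marks a region unavailable for one partition; returns false if no region is left. */
    boolean markUnavailable(RangeKey key, String failedRegion) {
        FailoverInfo info = overrides.computeIfAbsent(key, k -> new FailoverInfo(failedRegion));
        boolean moved = info.tryMoveNext(regions, failedRegion);
        if (!moved) overrides.remove(key); // drop the override so default routing applies again
        return moved;
    }

    /** Region override for a request routed to this partition, if one is recorded. */
    Optional<String> overrideFor(RangeKey key) {
        FailoverInfo info = overrides.get(key);
        return info == null ? Optional.empty() : Optional.of(info.current());
    }
}
```

With regions NorthCentralUs, WestUs2, and CentralUs, marking NorthCentralUs unavailable for a key makes `overrideFor` return WestUs2 for that key only; other partition key ranges keep their default routing.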
Client-level settings
To enable per-partition automatic failover from the perspective of the client instance, use the below system property.
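The property name is not reproduced in this text, so the sketch below uses a clearly labeled placeholder. It only shows the general mechanics of setting a JVM system property programmatically, presumably before the client instance is built; `PROPERTY_NAME` and the class `EnableFailoverPropertySketch` are hypothetical and must be replaced with the real values.

```java
final class EnableFailoverPropertySketch {
    // Placeholder only: substitute the actual system property name for this feature.
    static final String PROPERTY_NAME = "example.placeholder.perPartitionAutomaticFailoverEnabled";

    /** Sets the placeholder property to "true" and reports the value the JVM now sees. */
    static boolean enable() {
        System.setProperty(PROPERTY_NAME, "true");
        return Boolean.parseBoolean(System.getProperty(PROPERTY_NAME));
    }
}
```

The same effect can be had without code by passing `-D<property-name>=true` on the JVM command line.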
Testing outcomes
In all cases, an upsert workload is run for some arbitrary duration. This workload is run against a Strong consistency single-write account with 3 regions: NorthCentralUs, WestUs2, CentralUs.
Inject quorum loss into a server partition in the primary region as a warm client is running the workload.
In this scenario, complete quorum loss is injected into a server partition in NorthCentralUs. The upsert operation against the affected server partition is failed over to WestUs2 from which a success is received.
Inject quorum loss into a server partition in the primary region after which a cold client starts running the workload.
Inject quorum loss into a server partition and master in the primary region after which a cold client starts running the workload.
In this scenario, a server partition and master partition in NorthCentralUs are injected with quorum loss. Post this, a workload is started. From the diagnostics, a success response is obtained from WestUs2.
Inject quorum loss into a server partition in the primary region and secondary region where a warm client is running the workload.
In this scenario, quorum loss is injected into a server partition belonging to the same partition set in NorthCentralUs and WestUs2 while the workload is running. A success response is obtained from CentralUs.
Inject quorum loss into a server partition in the primary region and secondary region after which a cold client starts running the workload.