-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[service-bus] Error handling needs work to be consistent and reliable #15610
Comments
I'm taking a first stab at this for beta 1 (October release). The big things:
As a first beta release bar I've got a stress test running, with network outages every minute for 5 seconds. This gives us some decent recovery testing, while also getting a good ballpark on our longevity. There are a few spots that are being string-matched rather than doing a type/interface check. We need to work to eliminate these. |
Some of this work has been incorporated by not requiring the user to intervene. We still report the error but the next call to Receiver.ReceiveMessages() will repair it. We still want to add in some classification, as that will help customers if they want to decide to make that next call or not,. |
Note, another interesting discussion was here: When applicable, we should return a structured error. The equivalent of a 404 here would be useful. |
My team is working on a servicebus integration, and running into a problem differentiating recoverable and unrecoverable errors. Specifically we're writing a function to ensure a topic is defined, calling resp, err := client.GetTopic(ctx, "topic", &admin.GetTopicOptions{})
if err != nil {
// err could be an unrecoverable auth/network/etc problem
// or a recoverable 404 because the topic is not defined
} There's an internal It'd be great if y'all could export something like |
@ryepup, I'll need to look at it a bit more. The admin client could use some more work to get it consistent with how the rest of the Azure SDK's http based clients work, which should involve exporting a standard error type to look at the HTTP resp and status code, directly. |
@richardpark-msft including a status code would be great |
Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want. The big changes: * Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client. * amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before. * MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). * Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!) Bug fixes: * Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery. * Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation. Fixes #15170 (Connection/network recovery) Fixes #15610 (Error handling needs work to be consistent and reliable) Fixes #15642 (Unify all the various retry mechanisms) These were fixed by more or less sidestepping retryableRPC() mechanism: Fixes #15644 (amqp-common's retryableRPC method returns string errors) Fixes #16088 (amqp-common's link.RetryableRPC() retries an entity does not exist)
Thanks for the fixes, definitely will solve my problem! Any idea when the changes from #16831 will get a release? |
Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want. The big changes: * Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client. * amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before. * MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). * Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!) Bug fixes: * Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery. * Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation. Fixes Azure#15170 (Connection/network recovery) Fixes Azure#15610 (Error handling needs work to be consistent and reliable) Fixes Azure#15642 (Unify all the various retry mechanisms) These were fixed by more or less sidestepping retryableRPC() mechanism: Fixes Azure#15644 (amqp-common's retryableRPC method returns string errors) Fixes Azure#16088 (amqp-common's link.RetryableRPC() retries an entity does not exist)
* generated code for azcerts * add readme and doc.go files * formatting, added BeginCreateCertificate, although not a poller yet * starting GetCertificate * adding GetCertificate endpoint * changed to a poller using GetCertificate, should be with GetCertificateOperation * adding GetCertificateOperation * added several more methods * adding list certs method * ListCertificates implementation, but test is not working yet * adding CreateIssuer method * adding GetIssuer * adding issuer pager * adding delete issuer method * doc updates and UpdateIssuer method * Adding contacts crud operations * adding Get and Update CertificatePolicy * adding UpdateCertificateProperties * Added BeginRecoverDeletedCert method * restore certificate backup * removing the poller.go file * adding list deleted certs func * last test * adding certificates to eng/config.json * removing generated from two public models * adding ci.yml file * trigger ci run * adding license file, format header, and running go fmt * adding examples and readme * adding delete and cancel operation funcs * error handling on examples * updating broken link * adding a changelog * changing Tags to map[string]string * updating comments, pagers * formatting * Only fetch provision application oid via API if not supplied (#16878) Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com> * More resilient updating of ExpiringResource (#16789) * Sync eng/common directory with azure-sdk-tools for PR 2585 (#16876) * updating to target test-proxy that only gets filename from body of /start request * update recording.go to send x-recording-file in body of /start request Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com> * Sync eng/common directory with azure-sdk-tools for PR 2596 (#16894) - Fix bug in Add-ReleaseLease helper script - Enable better local logging for handling token - Enable strict mode to help catch issues like this in the future * Sync eng/common directory with azure-sdk-tools for PR 2578 (#16895) * excluding duplicated env:USER in image tag * NIT * removing default repo name Co-authored-by: Albert Cheng <albertcheng@microsoft.com> * renaming * Changes for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN (#16851) * add support for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN * Update CHANGELOG.md (#16903) * Update CHANGELOG.md * Update version * Update version.go * Increment version for internal releases (#16905) Increment package version after release of internal * Sync eng/common directory with azure-sdk-tools for PR 2581 (#16910) * Generate token for aad app. * Remove testing mode * Update Generate-AddToken.ps1 * Update Generate-AddToken.ps1 * Generate token inside of opensource API call scripts * Add resource * Address feedback * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Ben Broderick Phillips <ben@benbp.net> * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Ben Broderick Phillips <ben@benbp.net> * Remove the printout * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Ben Broderick Phillips <ben@benbp.net> * Added PS script strict mode * Update Get-AADIdentityFromGithubUser.ps1 * Update Get-AADIdentityFromGithubUser.ps1 Co-authored-by: sima-zhu <sizhu@microsoft.com> Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com> Co-authored-by: Ben Broderick Phillips <ben@benbp.net> Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> * Syncing eng/common (#16904) Co-authored-by: Albert Cheng <albertcheng@microsoft.com> * [Release] sdk/resourcemanager/storage/armstorage/0.4.0 generation from spec commit: d830271b241897bad300f9275ff7b27d502aa6c5 (#16913) * [Tables] updating recordings for latest breaking change (#16906) * updating recordings for latest breaking change * go mod tidy * reverting eng changes to docker tool instead of proxy direct * running docker dump twice * Update azidentity recordings (#16912) * [KeyVault] Update recordings (#16917) * updating azsecrets and azkeys * updating azkeys recordings * [azservicebus] Fixing several reliability/recovery bugs (#16831) Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want. The big changes: * Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client. * amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before. * MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). * Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!) Bug fixes: * Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery. * Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation. Fixes #15170 (Connection/network recovery) Fixes #15610 (Error handling needs work to be consistent and reliable) Fixes #15642 (Unify all the various retry mechanisms) These were fixed by more or less sidestepping retryableRPC() mechanism: Fixes #15644 (amqp-common's retryableRPC method returns string errors) Fixes #16088 (amqp-common's link.RetryableRPC() retries an entity does not exist) * fix incidentally added path addition (#16918) Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com> * sdk generation pipeline script (#16782) * compatible with old SDK generation and new codegen automation * remove greedy suffix on link check (#16922) Co-authored-by: Dane Walton <dawalton@microsoft.com> * remove (#16919) * Use Payload() for reading response bodies (#16911) Poller.FinalResponse() wasn't restoring the response body after reading. Removed duplicate definitions from GetJSON and bodyDownloadPolicy. * [azservicebus] Explicitly reference gin-gonic until nhooyr.io/websocket is updated #16927 * return *Client * removing LRO and pager interfaces * all tests passing again * updating and fixing a couple bugs, newest autorest version * updating embedded models * formatting * updating to latest internal * adding to doc.go Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com> Co-authored-by: Charles Lowell <10964656+chlowell@users.noreply.github.com> Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com> Co-authored-by: Albert Cheng <albertcheng@microsoft.com> Co-authored-by: Christopher Scott <chriss@microsoft.com> Co-authored-by: sima-zhu <sizhu@microsoft.com> Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com> Co-authored-by: Ben Broderick Phillips <ben@benbp.net> Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> Co-authored-by: Jiahui Peng <46921893+Alancere@users.noreply.github.com> Co-authored-by: Richard Park <51494936+richardpark-msft@users.noreply.github.com> Co-authored-by: chunyu3 <chunyu@microsoft.com> Co-authored-by: Dane Walton <dawalton@microsoft.com> Co-authored-by: Joel Hendrix <jhendrix@microsoft.com>
* generated code for azcerts * add readme and doc.go files * formatting, added BeginCreateCertificate, although not a poller yet * starting GetCertificate * adding GetCertificate endpoint * changed to a poller using GetCertificate, should be with GetCertificateOperation * adding GetCertificateOperation * added several more methods * adding list certs method * ListCertificates implementation, but test is not working yet * adding CreateIssuer method * adding GetIssuer * adding issuer pager * adding delete issuer method * doc updates and UpdateIssuer method * Adding contacts crud operations * adding Get and Update CertificatePolicy * adding UpdateCertificateProperties * Added BeginRecoverDeletedCert method * restore certificate backup * removing the poller.go file * adding list deleted certs func * last test * adding certificates to eng/config.json * removing generated from two public models * adding ci.yml file * trigger ci run * adding license file, format header, and running go fmt * adding examples and readme * adding delete and cancel operation funcs * error handling on examples * updating broken link * adding a changelog * changing Tags to map[string]string * updating comments, pagers * formatting * Only fetch provision application oid via API if not supplied (Azure#16878) Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com> * More resilient updating of ExpiringResource (Azure#16789) * Sync eng/common directory with azure-sdk-tools for PR 2585 (Azure#16876) * updating to target test-proxy that only gets filename from body of /start request * update recording.go to send x-recording-file in body of /start request Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com> * Sync eng/common directory with azure-sdk-tools for PR 2596 (Azure#16894) - Fix bug in Add-ReleaseLease helper script - Enable better local logging for handling token - Enable strict mode to help catch issues like this in the future * Sync eng/common directory with azure-sdk-tools for PR 2578 (Azure#16895) * excluding duplicated env:USER in image tag * NIT * removing default repo name Co-authored-by: Albert Cheng <albertcheng@microsoft.com> * renaming * Changes for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN (Azure#16851) * add support for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN * Update CHANGELOG.md (Azure#16903) * Update CHANGELOG.md * Update version * Update version.go * Increment version for internal releases (Azure#16905) Increment package version after release of internal * Sync eng/common directory with azure-sdk-tools for PR 2581 (Azure#16910) * Generate token for aad app. * Remove testing mode * Update Generate-AddToken.ps1 * Update Generate-AddToken.ps1 * Generate token inside of opensource API call scripts * Add resource * Address feedback * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Ben Broderick Phillips <ben@benbp.net> * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Ben Broderick Phillips <ben@benbp.net> * Remove the printout * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> * Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1 Co-authored-by: Ben Broderick Phillips <ben@benbp.net> * Added PS script strict mode * Update Get-AADIdentityFromGithubUser.ps1 * Update Get-AADIdentityFromGithubUser.ps1 Co-authored-by: sima-zhu <sizhu@microsoft.com> Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com> Co-authored-by: Ben Broderick Phillips <ben@benbp.net> Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> * Syncing eng/common (Azure#16904) Co-authored-by: Albert Cheng <albertcheng@microsoft.com> * [Release] sdk/resourcemanager/storage/armstorage/0.4.0 generation from spec commit: d830271b241897bad300f9275ff7b27d502aa6c5 (Azure#16913) * [Tables] updating recordings for latest breaking change (Azure#16906) * updating recordings for latest breaking change * go mod tidy * reverting eng changes to docker tool instead of proxy direct * running docker dump twice * Update azidentity recordings (Azure#16912) * [KeyVault] Update recordings (Azure#16917) * updating azsecrets and azkeys * updating azkeys recordings * [azservicebus] Fixing several reliability/recovery bugs (Azure#16831) Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want. The big changes: * Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client. * amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before. * MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). * Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!) Bug fixes: * Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery. * Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation. Fixes Azure#15170 (Connection/network recovery) Fixes Azure#15610 (Error handling needs work to be consistent and reliable) Fixes Azure#15642 (Unify all the various retry mechanisms) These were fixed by more or less sidestepping retryableRPC() mechanism: Fixes Azure#15644 (amqp-common's retryableRPC method returns string errors) Fixes Azure#16088 (amqp-common's link.RetryableRPC() retries an entity does not exist) * fix incidentally added path addition (Azure#16918) Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com> * sdk generation pipeline script (Azure#16782) * compatible with old SDK generation and new codegen automation * remove greedy suffix on link check (Azure#16922) Co-authored-by: Dane Walton <dawalton@microsoft.com> * remove (Azure#16919) * Use Payload() for reading response bodies (Azure#16911) Poller.FinalResponse() wasn't restoring the response body after reading. Removed duplicate definitions from GetJSON and bodyDownloadPolicy. * [azservicebus] Explicitly reference gin-gonic until nhooyr.io/websocket is updated Azure#16927 * return *Client * removing LRO and pager interfaces * all tests passing again * updating and fixing a couple bugs, newest autorest version * updating embedded models * formatting * updating to latest internal * adding to doc.go Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com> Co-authored-by: Charles Lowell <10964656+chlowell@users.noreply.github.com> Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com> Co-authored-by: Albert Cheng <albertcheng@microsoft.com> Co-authored-by: Christopher Scott <chriss@microsoft.com> Co-authored-by: sima-zhu <sizhu@microsoft.com> Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com> Co-authored-by: Ben Broderick Phillips <ben@benbp.net> Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> Co-authored-by: Jiahui Peng <46921893+Alancere@users.noreply.github.com> Co-authored-by: Richard Park <51494936+richardpark-msft@users.noreply.github.com> Co-authored-by: chunyu3 <chunyu@microsoft.com> Co-authored-by: Dane Walton <dawalton@microsoft.com> Co-authored-by: Joel Hendrix <jhendrix@microsoft.com>
As part of a recent refactor it's become more obvious that the error handling in Service Bus needs to be made more consistent. There are some good classification functions in the older code (which needs to be brought forward) and basically we just need to have enough information to say:
This affects the Receiver and Sender a bit more since the Processor doesn't really make any distinctions about what's retryable vs not.
Tagging @jhendrixMSFT on this as we've had discussion about how we might provide some interfaces or similar to make the classification easier for client code.
The text was updated successfully, but these errors were encountered: