Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[service-bus] Error handling needs work to be consistent and reliable #15610

Closed
richardpark-msft opened this issue Sep 22, 2021 · 7 comments · Fixed by #16831
Closed

[service-bus] Error handling needs work to be consistent and reliable #15610

richardpark-msft opened this issue Sep 22, 2021 · 7 comments · Fixed by #16831
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. feature-request This issue requires a new behavior in the product in order be resolved. Service Bus

Comments

@richardpark-msft
Copy link
Member

As part of a recent refactor it's become more obvious that the error handling in Service Bus needs to be made more consistent. There are some good classification functions in the older code (which needs to be brought forward) and basically we just need to have enough information to say:

  • This link is dead and needs to restart
  • There is a retryable issue here (transient) and we should just try it again

This affects the Receiver and Sender a bit more since the Processor doesn't really make any distinctions about what's retryable vs not.

Tagging @jhendrixMSFT on this as we've had discussion about how we might provide some interfaces or similar to make the classification easier for client code.

@richardpark-msft richardpark-msft added Service Bus Client This issue points to a problem in the data-plane of the library. labels Sep 22, 2021
@richardpark-msft richardpark-msft added this to the [2021] October milestone Sep 22, 2021
@RickWinter RickWinter added the feature-request This issue requires a new behavior in the product in order be resolved. label Sep 23, 2021
@richardpark-msft
Copy link
Member Author

richardpark-msft commented Oct 3, 2021

I'm taking a first stab at this for beta 1 (October release).

The big things:

  • The processor type brings in built-in infinite retrying, which should be more durable than the previous APIs.
  • Our error classification has been improved - much thanks to the go-shuttle project (and @serbrech). I've added in some further checking and also allow for link and connection level recovery (we fallback to connection more than I'd like, but it's a first go).

As a first beta release bar I've got a stress test running, with network outages every minute for 5 seconds. This gives us some decent recovery testing, while also getting a good ballpark on our longevity.

There are a few spots that are being string-matched rather than doing a type/interface check. We need to work to eliminate these.

@richardpark-msft
Copy link
Member Author

Some of this work has been incorporated by not requiring the user to intervene. We still report the error but the next call to Receiver.ReceiveMessages() will repair it.

We still want to add in some classification, as that will help customers if they want to decide to make that next call or not,.

@richardpark-msft
Copy link
Member Author

Note, another interesting discussion was here:
Azure/azure-service-bus-go#96

When applicable, we should return a structured error. The equivalent of a 404 here would be useful.

@ryepup
Copy link

ryepup commented Jan 13, 2022

My team is working on a servicebus integration, and running into a problem differentiating recoverable and unrecoverable errors.

Specifically we're writing a function to ensure a topic is defined, calling GetTopic to see if it exists, and a CreateTopic if needed. We're not seeing an official way to differentiate between recoverable "topic not found" errors and unrecoverable network errors.

resp, err := client.GetTopic(ctx, "topic", &admin.GetTopicOptions{})
if err != nil {
  // err could be an unrecoverable auth/network/etc problem
  // or a recoverable 404 because the topic is not defined
}

There's an internal atom.NotFound(err) helper to differentiate these error cases, but that's not exported. Currently we're using string comparisons on err.Error(), but that feels brittle; the specific error messages are not part of the exported API and we'd like to only rely on documented behavior. We have a similar situation with GetSubscription/CreateSubscription.

It'd be great if y'all could export something like feedEmptyError, NotFound(err), or any other mechanism to enable this logic.

@richardpark-msft
Copy link
Member Author

@ryepup, I'll need to look at it a bit more. The admin client could use some more work to get it consistent with how the rest of the Azure SDK's http based clients work, which should involve exporting a standard error type to look at the HTTP resp and status code, directly.

@ryepup
Copy link

ryepup commented Jan 18, 2022

@richardpark-msft including a status code would be great

richardpark-msft added a commit that referenced this issue Jan 25, 2022
Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want.

The big changes:
* Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client.
* amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before.
* MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). 
* Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!)

Bug fixes:
* Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery.
* Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation.

Fixes #15170 (Connection/network recovery)
Fixes #15610 (Error handling needs work to be consistent and reliable)
Fixes #15642 (Unify all the various retry mechanisms)

These were fixed by more or less sidestepping retryableRPC() mechanism:
Fixes #15644 (amqp-common's retryableRPC method returns string errors)
Fixes #16088 (amqp-common's link.RetryableRPC() retries an entity does not exist)
@ryepup
Copy link

ryepup commented Jan 27, 2022

Thanks for the fixes, definitely will solve my problem!

Any idea when the changes from #16831 will get a release?

sadasant pushed a commit to sadasant/azure-sdk-for-go that referenced this issue Jan 27, 2022
Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want.

The big changes:
* Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client.
* amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before.
* MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). 
* Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!)

Bug fixes:
* Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery.
* Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation.

Fixes Azure#15170 (Connection/network recovery)
Fixes Azure#15610 (Error handling needs work to be consistent and reliable)
Fixes Azure#15642 (Unify all the various retry mechanisms)

These were fixed by more or less sidestepping retryableRPC() mechanism:
Fixes Azure#15644 (amqp-common's retryableRPC method returns string errors)
Fixes Azure#16088 (amqp-common's link.RetryableRPC() retries an entity does not exist)
seankane-msft added a commit that referenced this issue Jan 28, 2022
* generated code for azcerts

* add readme and doc.go files

* formatting, added BeginCreateCertificate, although not a poller yet

* starting GetCertificate

* adding GetCertificate endpoint

* changed to a poller using GetCertificate, should be with GetCertificateOperation

* adding GetCertificateOperation

* added several more methods

* adding list certs method

* ListCertificates implementation, but test is not working yet

* adding CreateIssuer method

* adding GetIssuer

* adding issuer pager

* adding delete issuer method

* doc updates and UpdateIssuer method

* Adding contacts crud operations

* adding Get and Update CertificatePolicy

* adding UpdateCertificateProperties

* Added BeginRecoverDeletedCert method

* restore certificate backup

* removing the poller.go file

* adding list deleted certs func

* last test

* adding certificates to eng/config.json

* removing generated from two public models

* adding ci.yml file

* trigger ci run

* adding license file, format header, and running go fmt

* adding examples and readme

* adding delete and cancel operation funcs

* error handling on examples

* updating broken link

* adding a changelog

* changing Tags to map[string]string

* updating comments, pagers

* formatting

* Only fetch provision application oid via API if not supplied (#16878)

Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com>

* More resilient updating of ExpiringResource (#16789)

* Sync eng/common directory with azure-sdk-tools for PR 2585 (#16876)

* updating to target test-proxy that only gets filename from body of /start request

* update recording.go to send x-recording-file in body of /start request

Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com>

* Sync eng/common directory with azure-sdk-tools for PR 2596 (#16894)

- Fix bug in Add-ReleaseLease helper script
- Enable better local logging for handling token
- Enable strict mode to help catch issues like this in the future

* Sync eng/common directory with azure-sdk-tools for PR 2578 (#16895)

* excluding duplicated env:USER in image tag

* NIT

* removing default repo name

Co-authored-by: Albert Cheng <albertcheng@microsoft.com>

* renaming

* Changes for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN (#16851)

* add support for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN

* Update CHANGELOG.md (#16903)

* Update CHANGELOG.md

* Update version

* Update version.go

* Increment version for internal releases (#16905)

Increment package version after release of internal

* Sync eng/common directory with azure-sdk-tools for PR 2581 (#16910)

* Generate token for aad app.

* Remove testing mode

* Update Generate-AddToken.ps1

* Update Generate-AddToken.ps1

* Generate token inside of opensource API call scripts

* Add resource

* Address feedback

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Ben Broderick Phillips <ben@benbp.net>

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Ben Broderick Phillips <ben@benbp.net>

* Remove the printout

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Ben Broderick Phillips <ben@benbp.net>

* Added PS script strict mode

* Update Get-AADIdentityFromGithubUser.ps1

* Update Get-AADIdentityFromGithubUser.ps1

Co-authored-by: sima-zhu <sizhu@microsoft.com>
Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com>
Co-authored-by: Ben Broderick Phillips <ben@benbp.net>
Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>

* Syncing eng/common (#16904)

Co-authored-by: Albert Cheng <albertcheng@microsoft.com>

* [Release] sdk/resourcemanager/storage/armstorage/0.4.0 generation from spec commit: d830271b241897bad300f9275ff7b27d502aa6c5 (#16913)

* [Tables] updating recordings for latest breaking change (#16906)

* updating recordings for latest breaking change

* go mod tidy

* reverting eng changes to docker tool instead of proxy direct

* running docker dump twice

* Update azidentity recordings (#16912)

* [KeyVault] Update recordings (#16917)

* updating azsecrets and azkeys

* updating azkeys recordings

* [azservicebus] Fixing several reliability/recovery bugs (#16831)

Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want.

The big changes:
* Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client.
* amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before.
* MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). 
* Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!)

Bug fixes:
* Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery.
* Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation.

Fixes #15170 (Connection/network recovery)
Fixes #15610 (Error handling needs work to be consistent and reliable)
Fixes #15642 (Unify all the various retry mechanisms)

These were fixed by more or less sidestepping retryableRPC() mechanism:
Fixes #15644 (amqp-common's retryableRPC method returns string errors)
Fixes #16088 (amqp-common's link.RetryableRPC() retries an entity does not exist)

* fix incidentally added path addition (#16918)

Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com>

* sdk generation pipeline script (#16782)

* compatible with old SDK generation and new codegen automation

* remove greedy suffix on link check (#16922)

Co-authored-by: Dane Walton <dawalton@microsoft.com>

* remove (#16919)

* Use Payload() for reading response bodies (#16911)

Poller.FinalResponse() wasn't restoring the response body after reading.
Removed duplicate definitions from GetJSON and bodyDownloadPolicy.

* [azservicebus] Explicitly reference gin-gonic until nhooyr.io/websocket is updated #16927

* return *Client

* removing LRO and pager interfaces

* all tests passing again

* updating and fixing a couple bugs, newest autorest version

* updating embedded models

* formatting

* updating to latest internal

* adding to doc.go

Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com>
Co-authored-by: Charles Lowell <10964656+chlowell@users.noreply.github.com>
Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com>
Co-authored-by: Albert Cheng <albertcheng@microsoft.com>
Co-authored-by: Christopher Scott <chriss@microsoft.com>
Co-authored-by: sima-zhu <sizhu@microsoft.com>
Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com>
Co-authored-by: Ben Broderick Phillips <ben@benbp.net>
Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>
Co-authored-by: Jiahui Peng <46921893+Alancere@users.noreply.github.com>
Co-authored-by: Richard Park <51494936+richardpark-msft@users.noreply.github.com>
Co-authored-by: chunyu3 <chunyu@microsoft.com>
Co-authored-by: Dane Walton <dawalton@microsoft.com>
Co-authored-by: Joel Hendrix <jhendrix@microsoft.com>
benbp added a commit to benbp/azure-sdk-for-go that referenced this issue Jan 28, 2022
* generated code for azcerts

* add readme and doc.go files

* formatting, added BeginCreateCertificate, although not a poller yet

* starting GetCertificate

* adding GetCertificate endpoint

* changed to a poller using GetCertificate, should be with GetCertificateOperation

* adding GetCertificateOperation

* added several more methods

* adding list certs method

* ListCertificates implementation, but test is not working yet

* adding CreateIssuer method

* adding GetIssuer

* adding issuer pager

* adding delete issuer method

* doc updates and UpdateIssuer method

* Adding contacts crud operations

* adding Get and Update CertificatePolicy

* adding UpdateCertificateProperties

* Added BeginRecoverDeletedCert method

* restore certificate backup

* removing the poller.go file

* adding list deleted certs func

* last test

* adding certificates to eng/config.json

* removing generated from two public models

* adding ci.yml file

* trigger ci run

* adding license file, format header, and running go fmt

* adding examples and readme

* adding delete and cancel operation funcs

* error handling on examples

* updating broken link

* adding a changelog

* changing Tags to map[string]string

* updating comments, pagers

* formatting

* Only fetch provision application oid via API if not supplied (Azure#16878)

Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com>

* More resilient updating of ExpiringResource (Azure#16789)

* Sync eng/common directory with azure-sdk-tools for PR 2585 (Azure#16876)

* updating to target test-proxy that only gets filename from body of /start request

* update recording.go to send x-recording-file in body of /start request

Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com>

* Sync eng/common directory with azure-sdk-tools for PR 2596 (Azure#16894)

- Fix bug in Add-ReleaseLease helper script
- Enable better local logging for handling token
- Enable strict mode to help catch issues like this in the future

* Sync eng/common directory with azure-sdk-tools for PR 2578 (Azure#16895)

* excluding duplicated env:USER in image tag

* NIT

* removing default repo name

Co-authored-by: Albert Cheng <albertcheng@microsoft.com>

* renaming

* Changes for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN (Azure#16851)

* add support for AZURE_CLIENT_SEND_CERTIFICATE_CHAIN

* Update CHANGELOG.md (Azure#16903)

* Update CHANGELOG.md

* Update version

* Update version.go

* Increment version for internal releases (Azure#16905)

Increment package version after release of internal

* Sync eng/common directory with azure-sdk-tools for PR 2581 (Azure#16910)

* Generate token for aad app.

* Remove testing mode

* Update Generate-AddToken.ps1

* Update Generate-AddToken.ps1

* Generate token inside of opensource API call scripts

* Add resource

* Address feedback

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Ben Broderick Phillips <ben@benbp.net>

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Ben Broderick Phillips <ben@benbp.net>

* Remove the printout

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>

* Update eng/common/scripts/Get-AADIdentityFromGithubUser.ps1

Co-authored-by: Ben Broderick Phillips <ben@benbp.net>

* Added PS script strict mode

* Update Get-AADIdentityFromGithubUser.ps1

* Update Get-AADIdentityFromGithubUser.ps1

Co-authored-by: sima-zhu <sizhu@microsoft.com>
Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com>
Co-authored-by: Ben Broderick Phillips <ben@benbp.net>
Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>

* Syncing eng/common (Azure#16904)

Co-authored-by: Albert Cheng <albertcheng@microsoft.com>

* [Release] sdk/resourcemanager/storage/armstorage/0.4.0 generation from spec commit: d830271b241897bad300f9275ff7b27d502aa6c5 (Azure#16913)

* [Tables] updating recordings for latest breaking change (Azure#16906)

* updating recordings for latest breaking change

* go mod tidy

* reverting eng changes to docker tool instead of proxy direct

* running docker dump twice

* Update azidentity recordings (Azure#16912)

* [KeyVault] Update recordings (Azure#16917)

* updating azsecrets and azkeys

* updating azkeys recordings

* [azservicebus] Fixing several reliability/recovery bugs (Azure#16831)

Fixing several reliability and recovery bugs that were affecting sending and receiving. This PR finally exports retry options so callers can configure how long and how many retries they want.

The big changes:
* Retry configuration options are now exposed through the ClientOptions for both the Admin client and the Service Bus client.
* amqpLinks.go has it's own Retry mechanism, based on the azcore retry algorithm. This does exponential backoff, jitter and is configurable using Client.RetryOptions. This retry mechanism cleans up several areas that were doing the BackoffRetrier before.
* MgmtClient is gone, and we just use rpcLink now. There were some coordination issues that just weren't working properly, so the rpc link code has been moved into azservicebus for now (it was always an internal type and remains so). 
* Sender/Receiver both use the better amqplinks.Retry() and their code has been structured to make that easier to read (hopefully!)

Bug fixes:
* Fixed a bug where the proper error wasn't being broadcast when the connection failed in rpc, preventing proper connection recovery.
* Fixed a bug where links wouldn't properly recreate if the connection recovery happened in the claim negotiation.

Fixes Azure#15170 (Connection/network recovery)
Fixes Azure#15610 (Error handling needs work to be consistent and reliable)
Fixes Azure#15642 (Unify all the various retry mechanisms)

These were fixed by more or less sidestepping retryableRPC() mechanism:
Fixes Azure#15644 (amqp-common's retryableRPC method returns string errors)
Fixes Azure#16088 (amqp-common's link.RetryableRPC() retries an entity does not exist)

* fix incidentally added path addition (Azure#16918)

Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com>

* sdk generation pipeline script (Azure#16782)

* compatible with old SDK generation and new codegen automation

* remove greedy suffix on link check (Azure#16922)

Co-authored-by: Dane Walton <dawalton@microsoft.com>

* remove (Azure#16919)

* Use Payload() for reading response bodies (Azure#16911)

Poller.FinalResponse() wasn't restoring the response body after reading.
Removed duplicate definitions from GetJSON and bodyDownloadPolicy.

* [azservicebus] Explicitly reference gin-gonic until nhooyr.io/websocket is updated Azure#16927

* return *Client

* removing LRO and pager interfaces

* all tests passing again

* updating and fixing a couple bugs, newest autorest version

* updating embedded models

* formatting

* updating to latest internal

* adding to doc.go

Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Ben Broderick Phillips <bebroder@microsoft.com>
Co-authored-by: Charles Lowell <10964656+chlowell@users.noreply.github.com>
Co-authored-by: scbedd <45376673+scbedd@users.noreply.github.com>
Co-authored-by: Albert Cheng <albertcheng@microsoft.com>
Co-authored-by: Christopher Scott <chriss@microsoft.com>
Co-authored-by: sima-zhu <sizhu@microsoft.com>
Co-authored-by: Sima Zhu <48036328+sima-zhu@users.noreply.github.com>
Co-authored-by: Ben Broderick Phillips <ben@benbp.net>
Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>
Co-authored-by: Jiahui Peng <46921893+Alancere@users.noreply.github.com>
Co-authored-by: Richard Park <51494936+richardpark-msft@users.noreply.github.com>
Co-authored-by: chunyu3 <chunyu@microsoft.com>
Co-authored-by: Dane Walton <dawalton@microsoft.com>
Co-authored-by: Joel Hendrix <jhendrix@microsoft.com>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. feature-request This issue requires a new behavior in the product in order be resolved. Service Bus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants