This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

servicebus.Queue.Receive returns "context canceled" #156

Closed
glebteterin opened this issue Jan 16, 2020 · 11 comments

@glebteterin
Contributor

Every now and then servicebus.Queue.Receive returns a "context canceled" error in production. I suspect that every time it happens we lose a queue message.
I can't reproduce it locally, and I'm not sure whether it's an SDK bug or an issue in our implementation.

Any ideas what it could be?

Package version: github.com/Azure/azure-service-bus-go v0.9.1

Example code below:

// q is *servicebus.Queue

for {
    err = q.Receive(context.Background(), servicebus.HandlerFunc(func(ctx context.Context, msg *servicebus.Message) error {
        log.Println("MESSAGE RECEIVED", msg.ID)
        return msg.Complete(context.Background())
    }))

    if err != nil {
        if err == context.DeadlineExceeded {
            log.Println("queue timeout")
        } else {
            log.Println("queue receiver returned an error", err) // error is "context canceled"
        }
        continue
    }
}
@NadeemAfana

I am also experiencing the same issue, but with github.com/Azure/azure-service-bus-go v0.10.0. Almost every 2 days I get the same "context canceled" error.

@eaardal

eaardal commented Jan 31, 2020

Also experiencing this on version github.com/Azure/azure-service-bus-go v0.10.0. Did not have this issue before upgrading from github.com/Azure/azure-service-bus-go v0.1.1-0.20181113211747-01395671aecd.

@14leeyuchieh

14leeyuchieh commented Nov 18, 2020

FYI, for anyone who finds the unexpected modification of the context affecting their deployable: consider passing a child context of the original context to Azure Service Bus instead. Then, when the child context exits, it won't affect the parent context. If the receive calls in several goroutines need to be cancelled together, cancelling the parent context will take care of that. Using the example above,

parentCtx, parentCancel := context.WithCancel(context.Background())

for {
    if parentCtx.Err() == context.Canceled {
        break
    }
    func() {
        // context.WithDeadline expects a time.Time; WithTimeout with a duration is what we want here (10 minutes is just an example value)
        childCtx, childCancel := context.WithTimeout(parentCtx, 10*time.Minute)
        defer childCancel()
        err = q.Receive(childCtx, servicebus.HandlerFunc(func(ctx context.Context, msg *servicebus.Message) error {
            log.Println("MESSAGE RECEIVED", msg.ID)
            return msg.Complete(context.Background())
        }))

        if err != nil {
            if err == context.DeadlineExceeded {
                log.Println("queue timeout")
            } else if err == context.Canceled {
                log.Println("queue receiver returned an error. Continuing", err) // error is "context canceled"
            }
        }
    }()
}

...
// cancel all goroutines/functions receiving from azure service bus (e.g. SIGTERM/SIGINT)
parentCancel()

@zihengCat

Same problem with github.com/Azure/azure-service-bus-go v0.10.7.
Does anyone know the reason?

@serbrech
Member

serbrech commented Feb 18, 2021

A Service Bus handler should never return an error. If you error out, it closes the connection and the context, which is probably not what you want.

All message handlers share the same context unless you specifically pass a child context to your handler.
I suspect the cases described in this issue are ones where everything uses the same context: an error in one handler closes the connection and its associated context, impacting all other in-flight message handling.

This is what @14leeyuchieh explained above.
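
A minimal sketch of that advice, assuming the v0.10.x API (process stands in for your own handler logic): the handler settles the message itself and always returns nil, so a processing failure never tears down the shared connection and context.

import (
    "context"
    "log"

    servicebus "github.com/Azure/azure-service-bus-go"
)

// receiveLoop never lets the handler itself return an error: a failed message is
// abandoned (or dead-lettered) instead, so the shared connection and context stay alive.
func receiveLoop(ctx context.Context, q *servicebus.Queue, process func(*servicebus.Message) error) error {
    return q.Receive(ctx, servicebus.HandlerFunc(func(ctx context.Context, msg *servicebus.Message) error {
        if err := process(msg); err != nil {
            log.Println("processing failed, abandoning message:", err)
            return msg.Abandon(ctx) // or msg.DeadLetter(ctx, err) for poison messages
        }
        return msg.Complete(ctx)
    }))
}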

@14leeyuchieh

14leeyuchieh commented Feb 25, 2021

Btw, there's actually one more case that my code did not work for: a network fault where the box loses internet access. I noticed that in that case the servicebus client hangs indefinitely, even after internet access is restored. I didn't get a chance to figure out exactly why, but closing the client and instantiating a new one fixed it for me.

I'll try to provide an example later if I have time, but otherwise just leaving the note here for other folks.
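
For reference, a minimal sketch of that recovery step, assuming the v0.10.x API (ns is the existing *servicebus.Namespace, q the current *servicebus.Queue, and the 30-second close timeout is just an illustrative value):

// Drop the (possibly wedged) receiver and build a fresh one from the same namespace.
closeCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
_ = q.Close(closeCtx) // best effort; the old AMQP link may already be dead

q, err = ns.NewQueue(queueName)
if err != nil {
    log.Fatalf("failed to recreate queue %s: %v", queueName, err)
}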

@serbrech
Member

There are a couple of PRs that got merged about restoring connections. If you manage to get a reliable test, make sure to validate it with the latest version :)

@rokf

rokf commented Apr 30, 2021

@14leeyuchieh hey, we're having the same issue on a project at work and I have a few questions.

With "the same issue" I mean these context.Canceled errors after two hours and few minutes (usually 5-10) and that the client sometimes stops receiving/processing new messages after a while.

  1. Will Queue.Receive with a new context.Context really continue to process messages after it fails with a context.Canceled error? Currently, we Close the Queue and DeadLetter instances when we receive the context.Canceled error from any of their Receive and ReceiveOne methods, then we create them anew (NewNamespace, NewQueue and NewDeadLetter). According to your example above this is not necessary (overkill).
  2. What timeout value are you suggesting for the context that'll be passed to the Queue.Receive method, and why? Is there a specific reason to pass a context with a timeout instead of a regular context to this method, since it can run "forever" (until there's an error) in a separate goroutine?
  3. Have you perhaps found a solution for handling network faults? Is there a way to find out whether the receivers are stuck because of a network fault/outage, other than, for example, periodically sending a message to the queue using a separate (fresh and working) client instance?
  4. How are you handling DeadLetter messages? Are you using the DeadLetter.ReceiveOne method in a loop, similar to what you've provided in the example above? Again, what timeout value are you suggesting for the context that's passed to the DeadLetter.ReceiveOne method?

We're still using v0.10.10, since we're having some other troubles with v0.10.11.

@serbrech what's the state of those connection restoration pull requests? Have they all been merged, and are they part of the latest version, v0.10.11? Are any of them already part of v0.10.10?

Some facts

Using a common context.Context without a timeout for both Queue.Receive and DeadLetter.ReceiveOne causes a failure in both methods at once. In our case, after roughly two hours of runtime, the DeadLetter.ReceiveOne method gave up with the following error:

link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'MASKED' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'sb://MASKED.servicebus.windows.net/MASKED/$deadletterqueue'.. TrackingId:MASKED, SystemTracker:gateway7, Timestamp:MASKED, Info: map[]}

I've MASKED out some potentially sensitive data in the above snippet.

At the same time, the Queue.Receive method stopped with context.Canceled.

After changing to a child context with a 10 minute timeout for the DeadLetter.ReceiveOne call, and leaving the parent context used by Queue.Receive without a timeout, they now both only OCCASIONALLY fail with context.Canceled. If they fail, they still fail at once - I guess the cause now lies in the parent context used by the Queue.Receive method.

This OCCASIONALLY is important, since before the change to a child context for the DeadLetter.ReceiveOne method they consistently failed every two hours and a few minutes, with the same error pair (context canceled and RenewToken).

In fact, in the latest test that I've been running yesterday & today, they only failed the first few times (1x, 2x) and then stabilised for some reason.
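
For clarity, a sketch of the layout described above, assuming q, dl and handler are the *servicebus.Queue, *servicebus.DeadLetter and servicebus.Handler from our setup:

parentCtx, parentCancel := context.WithCancel(context.Background())
defer parentCancel()

// Queue.Receive runs on the long-lived parent context (no timeout).
go func() {
    if err := q.Receive(parentCtx, handler); err != nil {
        log.Println("Queue.Receive failed:", err)
    }
}()

// DeadLetter.ReceiveOne gets a fresh 10 minute child context per attempt, so an
// expired or stuck dead-letter link cannot take the parent context down with it.
for parentCtx.Err() == nil {
    func() {
        childCtx, childCancel := context.WithTimeout(parentCtx, 10*time.Minute)
        defer childCancel()
        if err := dl.ReceiveOne(childCtx, handler); err != nil && err != context.DeadlineExceeded {
            log.Println("DeadLetter.ReceiveOne failed:", err)
        }
    }()
}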

Here are a few stripped down and annotated logs from my example. See the timestamps. I've started three queue consumers/clients at the same time. Two of them failed only once after being started, one failed twice, and that was it. After that they kept receiving. The DeadLetter.ReceiveOne method call restarts every 10 minutes because of the previously mentioned timeout.

{"level":"info","time":"2021-04-29T19:07:52Z","message":"Consumers started"}

FIRST QUEUE:

{"level":"error","error":"context canceled","time":"2021-04-29T21:18:00Z","message":"Queue.Receive failed"}
{"level":"error","error":"context canceled","time":"2021-04-29T21:18:00Z","message":"DeadLetter.ReceiveOne failed"}

SECOND QUEUE:

{"level":"error","error":"context canceled","time":"2021-04-29T21:18:00Z","message":"Queue.Receive failed"}
{"level":"error","error":"context canceled","time":"2021-04-29T21:18:00Z","message":"DeadLetter.ReceiveOne failed"}

THIRD QUEUE:

{"level":"error","error":"context canceled","time":"2021-04-29T21:18:00Z","message":"Queue.Receive failed"}
{"level":"error","error":"context canceled","time":"2021-04-29T21:18:00Z","message":"DeadLetter.ReceiveOne failed"}

THIRD QUEUE, SECOND TIME:

{"level":"error","error":"context canceled","time":"2021-04-29T23:29:04Z","message":"Queue.Receive failed"}
{"level":"error","error":"context canceled","time":"2021-04-29T23:29:04Z","message":"DeadLetter.ReceiveOne failed"}

@lohmanndouglas

lohmanndouglas commented May 5, 2021

@rokf, I am having the same problems in both versions v0.10.10 and v0.10.11

  1. I could not make it work like the examples above; just restarting the receive with a new context did not work for me. Now I am trying to get the queue again, but without recreating the namespace. If I had to recreate the namespace it would be a mess in my application :(.

This issue is impacting my work and I am stuck on it.

EDITED:
Just getting the queue again solves the problem. It is not necessary to recreate the namespace in the application.

My code is like this:

  • sb.ns is a *servicebus.Namespace, initialized once when the container starts
  • sb.queueName is a string constant with the queue name
  • refreshQueue is a helper function that calls ns.NewQueue(queueName)

for {
	err = queue.Receive(
		context.Background(),
		servicebus.HandlerFunc(func(ctx context.Context, message *servicebus.Message) error {
			log.Debug("consumer received msg")
			// handler logic here
			return message.Complete(ctx)
		}))
	if err != nil {
		if err == context.DeadlineExceeded {
			log.Fatal("consumer: ", err)
		} else {
			log.Error("consumer: ", err) // error is "context canceled"
			queue, err = refreshQueue(sb.ns, sb.queueName)
			if err != nil {
				log.Fatalf("failed to refresh queue named %s", sb.queueName)
			}
		}
		continue
	}
}
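
refreshQueue itself isn't shown above; a minimal sketch consistent with how it's used, assuming the v0.10.x API:

// Hypothetical helper matching the snippet above: ask the existing namespace
// for a fresh *servicebus.Queue instead of recreating the namespace itself.
func refreshQueue(ns *servicebus.Namespace, queueName string) (*servicebus.Queue, error) {
	return ns.NewQueue(queueName)
}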

@rokf

rokf commented May 21, 2021

Here's an interesting observation with v0.10.10 (I haven't tested this with other versions). Let's say that there are 4 queue receivers in an executable. All the queues reside in a single namespace on Azure Service Bus - all Namespace instances use the same connection string.

If they use a common Namespace instance then the Receive call fails with the context canceled error (almost) every two hours.

If each of them has its own Namespace instance then they don't fail nearly as frequently anymore - a few errors in the beginning, then only occasional ones, sometimes with no error for a whole day.

Any idea why they behave like this?
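
For reference, a rough sketch of the second layout (one Namespace per receiver), assuming connStr, queueNames and handler are ours and everything else is the v0.10.x API:

// One Namespace (and therefore one AMQP connection) per queue receiver,
// instead of a single Namespace shared by all receivers.
for _, name := range queueNames {
    ns, err := servicebus.NewNamespace(servicebus.NamespaceWithConnectionString(connStr))
    if err != nil {
        log.Fatal(err)
    }
    q, err := ns.NewQueue(name)
    if err != nil {
        log.Fatal(err)
    }
    go func(q *servicebus.Queue) {
        if err := q.Receive(context.Background(), handler); err != nil {
            log.Println("receive stopped:", err)
        }
    }(q)
}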

@richardpark-msft
Member

Hi all,

We've moved development of this package to the azure-sdk-for-go repo.

We've tried to really simplify how error handling works, and we're still working on making errors more actionable (Azure/azure-sdk-for-go#15610). Note that while that issue isn't "done", it's not because we haven't started or done work in the area.

If you'd like to give it a shot, there's a migration guide to ease the transition: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/migrationguide.md

I've run this particular scenario quite a bit against the new library, since a regular stress-testing regimen is now part of the development cycle. I created a scenario that is the same as what you have above, and I'm not seeing any errors. That doesn't mean the errors don't exist, but this kind of bug is no longer easy to replicate.
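
For anyone migrating, a rough sketch of what the receive loop from the top of this issue could look like with the new azservicebus package (not lifted from the migration guide; it assumes a v1.x client and that connStr and queueName are yours):

import (
    "context"
    "log"

    "github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

func receive(connStr, queueName string) error {
    client, err := azservicebus.NewClientFromConnectionString(connStr, nil)
    if err != nil {
        return err
    }
    defer client.Close(context.Background())

    receiver, err := client.NewReceiverForQueue(queueName, nil)
    if err != nil {
        return err
    }
    defer receiver.Close(context.Background())

    for {
        // Messages are pulled explicitly; there is no handler callback to return errors from.
        messages, err := receiver.ReceiveMessages(context.Background(), 10, nil)
        if err != nil {
            return err
        }
        for _, msg := range messages {
            log.Println("MESSAGE RECEIVED", msg.MessageID)
            if err := receiver.CompleteMessage(context.Background(), msg, nil); err != nil {
                log.Println("failed to complete message:", err)
            }
        }
    }
}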
