Pub/Sub: buffered messages delivered multiple times ignoring acknowledgment deadline #9252
@qvik-olli Thanks for the detailed report! This sounds as if the requests to modify the message acknowledgement deadline do not arrive in time, and the backend re-sends the message. Can you check whether the subscription's ACK deadline is indeed set to 300 seconds (5 minutes), and not just to e.g. the reported 30 seconds because of a typo or something?

Technical details: the messages in the buffer are not lease-managed, and if they need to wait there for longer than the acknowledgment deadline, the backend assumes they got lost and re-sends them. Processing the first message takes ~40 seconds, which exceeds the default subscription ACK deadline of 10 seconds. However, if the deadline is indeed correctly set to 300 seconds (5 minutes), the reported behavior is surprising and would require a deeper look.
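For reference, a quick way to double-check the configured deadline is to fetch the subscription and print it (a minimal sketch assuming the 1.x-era client API; the project and subscription names are placeholders):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# "my-project" / "my-subscription" are placeholders - substitute your own.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

subscription = subscriber.get_subscription(subscription_path)
# Should print 300 if the deadline really is set to 5 minutes.
print(subscription.ack_deadline_seconds)
```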
Thank you @plamut for your quick response. I'm sure about the ack deadline; I double-checked that it is 300 seconds (5 minutes), so you can forget what I said about 30 seconds. I did some further testing, and there seems to be some dynamic timeout that changes depending on how long it takes to process a batch of messages. This timeout seems to ignore the ack deadline. I did some tests with a larger number of messages, using the script from my first message, with the message handler set to sleep for 1 second when a message arrived.
So when I started the subscriber and sent the first 60 messages, I got 145 duplicates. When I sent another 60 messages, I got just one duplicate. After that, I sent 100 messages and got a high number of duplicates again.
Interesting, I will try to run it today myself and see if I can reproduce it. I do not see anything apparent in the client itself, but I have not looked too deeply yet.

Update: publishing two messages in the same batch reproduces the issue. This can be achieved by rapidly publishing them one after another, i.e. without waiting on each individual publish future, which results in both messages being sent to the backend in the same publish batch.
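For illustration, publishing two messages in the same batch can look like this (a sketch; the project and topic names are placeholders):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Publish back to back *without* blocking on the first future, so the
# client's batching puts both messages into a single publish request.
futures = [publisher.publish(topic_path, b"message-1"),
           publisher.publish(topic_path, b"message-2")]

# Wait only after both publishes have been issued.
for future in futures:
    future.result()
```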
After experimenting with various things, I was not able to pinpoint the exact source of the issue (in the client at least). It appears that if the delay in the message handler is considerably shorter than 10 seconds, the issue is not reproducible, and the probability of a duplicate message increases as the delay approaches 10 seconds.

I also tried disabling the automatic modifications of the ACK deadlines, just in case the related client logic is flawed:

```diff
diff --git a/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py b/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py
index 2b257482930..d6b1d2c54d0 100644
--- a/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py
+++ b/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py
@@ -152,7 +156,10 @@ class Dispatcher(object):
         """
         ack_ids = [item.ack_id for item in items]
         seconds = [item.seconds for item in items]
-
+        #################
+        _LOGGER.debug(f"\x1b[33m(DISABLED) Modifying ACK deadline to {seconds}, ACK IDs: {ack_ids}\x1b[0m")
+        return
+        ######################
         request = types.StreamingPullRequest(
             modify_deadline_ack_ids=ack_ids, modify_deadline_seconds=seconds
         )
```

The result was the same, which makes me think it might be something with the subscription on the backend (the ACK deadline is set to 300 seconds), and that the deadline expected by the server is actually around 10 seconds, in my case at least.
I got a similar impression. The actual server-side deadline does not seem to be static: if it were really, say, 10 seconds, a sleep delay of 8 seconds should almost never reproduce the issue, while a delay of 12 seconds should always reproduce it - but the outcome of the test script was not deterministic. Labeling as a backend issue until further evidence.
I don't think this is a backend issue. Looking at some logs, it appears that the server is delivering the message based on its understanding of the ack deadline, with no unexpected duplicates. I think there are two things at play here: the client's automatic modifyAckDeadline calls, and the ACK deadline set when the stream itself is established.
Indeed, although the issue was reproducible even with these calls disabled. However, it appears that the subscription's ACK deadline is overridden even before message pulling starts, i.e. when the stream is established. This key piece of info is missing from the client debug logs and needs to be added (along with the fix for the observed behavior). Thanks for checking the backend logs, @kamalaboulhosn!
@qvik-olli We have just released a new version of the PubSub client (1.0.1), which includes the fix for this issue. The ACK deadline for the messages should now more closely match the deadline set on the subscription, resulting in fewer duplicate re-deliveries.
…on (#9360) Pulling the messages with the streaming pull should work with the default pubsub.subscriber role. This commit removes the call to fetch a subscription, and replaces the subscription's ACK deadline with a fixed deadline of 60 seconds. That *will* re-introduce issue #9252, but at least in a less severe manner.
@qvik-olli Unfortunately, it turned out that the fix for this caused a different, more severe issue (#9339), and thus had to be reverted in PubSub 1.0.2 (released ~15 minutes ago). Fetching a subscription to read its ACK deadline requires additional permissions, and that broke some users' applications. For the time being, the default stream ACK deadline is set to 60 seconds. Some messages can still time out a bit too soon and be re-delivered, but the time window is now wider, giving the user code more time to process messages whose handling exceeds the old deadline. A different fix will have to be made, although that will likely require some more substantial changes to the streaming pull core logic.
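For context, the stream-wide deadline is carried by the initial request of the streaming pull; conceptually it looks something like this (a simplified sketch using the generated request type, not the client's exact internals; the subscription path is a placeholder):

```python
from google.cloud.pubsub_v1 import types

# The first StreamingPullRequest on a new stream sets the ACK deadline
# for all messages delivered on that stream; with the revert this is a
# fixed 60 seconds instead of the subscription's configured deadline.
initial_request = types.StreamingPullRequest(
    subscription="projects/my-project/subscriptions/my-subscription",
    stream_ack_deadline_seconds=60,
)
```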
Okay, I see. Increasing the default stream ACK deadline to 60 seconds is helpful, but I do hope that the issue will get fixed. Our systems were experiencing this issue because we were controlling concurrency by setting `max_messages` to a small value. I'm sharing this snippet in case someone else is affected by this issue. Creating a custom scheduler to control concurrency:

```python
import concurrent.futures

from google.cloud import pubsub
from google.cloud.pubsub_v1.subscriber.scheduler import ThreadScheduler

subscriber = pubsub.SubscriberClient()
# Limit callback concurrency to 5 threads (the default pool has 10).
executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
scheduler = ThreadScheduler(executor=executor)
subscriber.subscribe(subscription_name, callback, scheduler=scheduler)
```
For CPU-intensive computations in callbacks, providing a shallower thread pool for invoking callbacks is indeed one way of limiting the concurrency (the default thread pool size is 10, FWIW). Nevertheless, I have reopened this issue until it gets a proper fix, which will probably require auto-leasing (i.e. auto-extending the ACK deadlines of) all received messages, and also decoupling the tracking of the current load from the number of actively leased messages (the load is used to decide how many messages to dispatch to the callbacks, and when to pause/resume the message stream).
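For reference, the load check the client performs is conceptually a ratio against the flow control limits; a simplified sketch (not the library's actual code, and the names are illustrative):

```python
def current_load(leased_messages, leased_bytes, flow_control):
    """Relative load; a value >= 1.0 means a flow control limit is
    reached and the message stream should be paused."""
    return max(
        leased_messages / flow_control.max_messages,
        leased_bytes / flow_control.max_bytes,
    )
```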
@plamut Is this issue resolved in the later versions? I am still facing duplication of messages: when I set the ack_deadline to the max of 600 sec, I sometimes get a duplicate msg within 10 sec of the previous one.
@saitej09 How often does that happen? By design, some duplication is actually expected; IIRC something like 0.1% of messages being re-sent is considered normal, and applications must be prepared to handle this. Are you perhaps seeing much higher re-delivery rates?

Re: 10 seconds - that's the initial ACK deadline used when a new streaming pull is started.
@plamut I have set the deadline to 600 sec by default in the subscription settings, and I am not setting any value in the subscriber.subscribe() call, so will it be reset by any chance? If not, then Pub/Sub resending the msg before 600 sec cannot be attributed to network loss etc., as I am not updating the deadline. I am not calling any method (modify_ack_deadline) to extend the deadline currently. More than 1% of the messages are being duplicated, but the behavior is very random. Does it have any correlation with flow control?
@saitej09 The ack deadline is not a guarantee that messages will not be redelivered prior to the deadline's expiration, so it is possible that, no matter what, messages will be redelivered before the 600 seconds pass.

@plamut Do we always start with an ack deadline of 10 seconds? If so, this is something we might want to consider changing. I think we did, or were going to, introduce a notion of a "minimum ack deadline", in the same way we have a "maximum ack deadline", which would allow one to specify the minimum amount of time sent back with modAckDeadline, including the initial one.
@saitej09 The library sets the initial deadline internally; initially it uses the minimum ACK deadline, not the 600 seconds. This is to optimize for throughput: if the subscriber crashed and was restarted, a long initial deadline would mean messages are not re-delivered for a full 10 minutes. The ACK deadline is then gradually adjusted based on the history, i.e. how much time the subscriber needed to process past messages.
@kamalaboulhosn Yes, we always start with the minimum deadline (when there is no histogram data yet) and then gradually adjust it. We could change that to use a different value, although IIRC it is desirable to start with a short deadline for potentially better throughput on average?
@plamut I think what we want to do is have a minimum and a maximum and only adjust within that range. We never want to go below the minimum specified, even if we have histogram data. We basically want the user to have control over this if they want it.
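A sketch of the proposed clamping, with illustrative constant names (MIN_ACK_DEADLINE here stands for the hypothetical user-settable property discussed below, not an existing library API):

```python
MIN_ACK_DEADLINE = 10   # seconds; hypothetical user-settable minimum
MAX_ACK_DEADLINE = 600  # seconds; the backend's upper limit

def next_ack_deadline(estimated_processing_time):
    # Adjust within [min, max] only: never drop below the user's
    # minimum, even when histogram data suggests a shorter deadline.
    return max(MIN_ACK_DEADLINE, min(estimated_processing_time, MAX_ACK_DEADLINE))
```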
@kamalaboulhosn We already do that, but if I understood correctly, the user wants to control that minimum themselves, which is currently not possible.
@plamut Right, so I think we need a property in the client library for MIN_ACK_DEADLINE so that it can be controlled by the user, in the same way we have one for the maximum.
Sounds right, I'll create a feature request issue in the new repo.
I noticed a problem where buffered messages in the Pub/Sub library are delivered multiple times, ignoring the acknowledgment deadline. I have a subscription with the acknowledgment deadline set to 5 minutes, but messages are delivered multiple times to the same subscriber after about 30 seconds.
My code example starts a subscriber client with `max_messages` set to 1. When two messages are sent to the subscription, the second message is always duplicated. The same behavior also happens with higher `max_messages` values. For example, using the default value of 100 for `max_messages`, when 150 messages are sent to the subscription, some of the messages are duplicated if the processing of the first 100 messages takes more than 30 seconds.

I understand that with a large backlog of small messages, the messages can get redelivered (Dealing with large backlogs of small messages); however, shouldn't the messages be re-delivered only after the ack_deadline is exceeded?
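For context, limiting the subscriber to one in-flight message looks roughly like this (a sketch based on the report, with placeholder names and a sleep standing in for the real handler):

```python
import time

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("PROJECT_ID", "SUBSCRIPTION_NAME")

# Dispatch at most one message to the callback at a time.
flow_control = pubsub_v1.types.FlowControl(max_messages=1)

def callback(message):
    time.sleep(40)  # simulate the ~40 s of processing from the report
    message.ack()

future = subscriber.subscribe(subscription_path, callback, flow_control=flow_control)
future.result()  # block the main thread while messages are processed
```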
Environment details
Steps to reproduce

Run `python subscriber.py -p PROJECT_ID SUBSCRIPTION_NAME`.

Code example
Example subscriber code here: https://gist.github.com/qvik-olli/0bfd4ace2d06def1675a76fbc20493e5
Logs
From the subscriber script:
With debug logging from the pubsub library: