"Retrying due to 504 Deadline Exceeded, sleeping 0.0s ..." frequently in GKE when publishing data #126
@dboshardy If I understand correctly, you pull some data from an external stream and then package and publish it as PubSub messages, and the deadline exceeded messages appear in the logs because, presumably, publishing takes too long? It might be that some of the messages from particular datasets are "large" and publishing them takes a long time, especially on a slower network. To test this theory, could you perhaps try changing the client's publish flow control, similarly to the sample in the Client docs?

```python
publisher_options = pubsub_v1.types.PublisherOptions(
    flow_control=pubsub_v1.types.PublishFlowControl(
        message_limit=100,
        byte_limit=10 * 1024 * 1024,
        limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
    )
)
publisher_client = pubsub_v1.PublisherClient(publisher_options=publisher_options)
```

Another user reported something similar recently (issue #72), and reducing the amount of data published at once fixed the issue for them. They implemented their own publish throttling, but there is a new built-in publish flow control feature that can achieve the same. If this doesn't work, it would be helpful to provide a reproducible code sample, if possible, and a more detailed log (DEBUG level) to see what's going on. Thanks! |
I have a similar problem, too. I'm using flask and pubsub on GKE; the pubsub messages are received properly, but the error messages always appear. My publisher's and subscriber's code is almost the same as the official docs, although I only set the FlowControl in the subscriber. I think we can exclude gunicorn, because I have tried using the flask development server and the error message still appears. Additionally, it can't be a timeout problem, because for testing I used small datasets and acked the message immediately after the subscriber received it. |
Indeed |
@dboshardy The purpose of the PublishFlowControl settings is to prevent too many messages from being published at the same time, which can lead to network timeouts if the bandwidth is not sufficient. Instead, publish flow control can be used to queue messages before they are actually sent to the server, avoiding these timeouts. Think of it as automatic publish throttling. The settings dictate how many messages are let through before a configured limit is reached (message count and/or total message size). If that threshold is hit, the configured limit_exceeded_behavior kicks in; with BLOCK, further publish() calls wait until there is capacity again.
Does that make it more clear? To make the feature more understandable, would it help if we added a short section with an example to the publishing docs? |
And that internal queue size is determined by the messages and bytes parameters? Ok that helps a lot. Thank you! I'll do some tweaking to those settings. Thanks for your help. |
That's correct. If either of these thresholds is exceeded, the messages that overflow are temporarily put in a waiting queue (if the behavior is set to BLOCK). |
Sorry to bother, but with BLOCK you say there is a separate queue, yet I'm seeing that the for loop in which I read the data and call publish() does stop reading while it waits for the max_messages to go back down. Am I misunderstanding? |
@dboshardy That is correct - if an overflow is detected, calls to publish() block until the flow controller has capacity again. The way it works is that inside publish() the message is first passed through the flow controller, and if adding it would exceed the configured limits, the call waits before the message is handed over for actual sending. No worries about "bothering", I'm always glad to help. :) |
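For illustration, a minimal sketch of that blocking behavior (the project and topic names below are placeholders); with a deliberately tiny message_limit, each publish() call waits inside the flow controller until the previously submitted message has actually been sent:

```python
from google.cloud import pubsub_v1

publisher_options = pubsub_v1.types.PublisherOptions(
    flow_control=pubsub_v1.types.PublishFlowControl(
        message_limit=1,  # deliberately tiny so the blocking is easy to observe
        limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
    )
)
publisher = pubsub_v1.PublisherClient(publisher_options=publisher_options)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

for i in range(10):
    # Once the limit is exceeded, this call blocks until the previous message
    # has left the flow controller (i.e. has been sent or has failed).
    future = publisher.publish(topic_path, data=f"message {i}".encode())

print(future.result())  # block until the last message is published
```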
Would it stand to reason that publish() could instead queue the overflowing messages internally and still return a future immediately, rather than blocking the caller? |
If I understand correctly, you would like publish() to return a future immediately even when the flow control limits are hit, with the overflowing messages queued internally instead of blocking the caller? In principle that's possible, but it would likely require refactoring the sequencers (a component that orders the messages by ordering keys, if necessary), as futures are actually created by the sequencer. Sequencers implement some publishing logic on their own (e.g. message batching, ordering...) and placing the flow controller there cleanly might not be entirely straightforward. @pradn Do you think that could be done with a reasonable effort, or would the sequencer logic have to be refactored substantially? Edit: IIRC the risk of the internal queue growing without bound (and eventually running out of memory) was actually the primary reason that we opted for the blocking behavior rather than just silently queuing everything internally. |
Makes sense. But certainly offering that as an option, with the necessary warnings and settings configurability, would benefit some. |
Could be, yes, so I will check with Product whether this is something we want to add (and whether it is consistent with what the PubSub clients for other supported languages do). |
The future we return on publish() comes directly from the network/transport gRPC layer. So, to be able to immediately return a future on publish, we'd need to make a new future that wraps the process of making the network request. I think this can exist before the sequencer layer, so we shouldn't need to modify the sequencers too much. I worry this will overcomplicate the code, though; the ordered sequencer code, especially, is complex as it is. It should be straightforward to wrap the publishing code to manage the publishing and return a future immediately: a message-to-be-published is added to a queue, another thread actually does the blocking publish, and whenever a publish finishes, a condition variable signals that the next message in the queue can be published. This works easily with blocking publish flow control. I would prefer not to add this to the client library unless there's significant demand. The publish flow control feature is new, so we are happy to hear feedback and suggestions. |
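To make that wrapper idea concrete, here is a rough, hypothetical sketch (not part of the library): publish requests go onto an internal queue and a single worker thread performs the blocking publishes, so callers get a future back immediately. A plain queue.Queue stands in for the explicit condition variable, and the unbounded queue illustrates exactly the memory-growth concern mentioned above.

```python
import queue
import threading
from concurrent import futures

from google.cloud import pubsub_v1


class NonBlockingPublisher:
    """Hypothetical wrapper so publish() never blocks the caller on flow control."""

    def __init__(self, publisher: pubsub_v1.PublisherClient):
        self._publisher = publisher
        self._queue = queue.Queue()  # unbounded - this is the OOM risk
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, topic, data, **attrs):
        # Hand back a wrapper future right away; the real publish happens later.
        wrapper = futures.Future()
        self._queue.put((wrapper, topic, data, attrs))
        return wrapper

    def _drain(self):
        while True:
            wrapper, topic, data, attrs = self._queue.get()
            try:
                # This may block on flow control, but only this worker thread
                # waits - the caller already has its future.
                message_id = self._publisher.publish(topic, data, **attrs).result()
                wrapper.set_result(message_id)
            except Exception as exc:
                wrapper.set_exception(exc)
```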
Perhaps tied to this issue as we were seeing this before as well, but some instances seem to fail silently when the publisher client goes to publish the first time. The last log we see is a call to the metadata endpoint to authorize the request. |
Could it be that the auth request gets lost every now and then? Or maybe establishing a stream sometimes fails in a weird way? |
I wondered that as well, but this failure happens consistently with certain data sources and not others. And I am able to run the process locally, and it works fine for the data sources that fail in GKE; other data sources work fine consistently. I tried going back to the default behavior of the publisher client (not specifying any batch or flow settings) and the behavior persists; it just happens sooner than it did before, i.e. before a batch has accumulated. Is there a way to surface more logging in the publisher client? |
Out of the box I would set the default logging level to DEBUG, if you haven't done that yet, as there are several places in the client and its dependencies that log useful details at that level. P.S.: Going offline now, since it's almost midnight here. :) |
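For reference, one minimal way to turn that on, assuming the standard logging module isn't already configured elsewhere in the app:

```python
import logging

# Set the root logger to DEBUG before creating the publisher client, so the
# client's internal messages (including the "Retrying due to ..." lines from
# the retry machinery in google.api_core) become visible.
logging.basicConfig(level=logging.DEBUG)

# DEBUG output from unrelated libraries can be noisy; optionally quiet them.
logging.getLogger("urllib3").setLevel(logging.INFO)
```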
We're already using that log level, but not getting anything out of the logs in this case. Thanks for the help! |
Looks like we're getting a segfault in the grpc python lib similar to this issue: grpc/grpc#13327. Adding a delay of even 0.001s prevents this segfault, but obviously is not ideal. |
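For completeness, a hypothetical illustration of that workaround (project and topic names are placeholders): inserting a tiny pause between publish() calls. It reportedly avoids the crash, but it caps throughput and does not address the underlying grpc problem.

```python
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

for record in [b"record-1", b"record-2"]:  # stand-in for the streamed data
    publisher.publish(topic_path, data=record)
    time.sleep(0.001)  # even a 1 ms delay was reportedly enough
```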
@dboshardy If that's the case, it would be best to report it to the grpc repository directly, since the segfault happens in the grpc layer. I also got an answer on the proposed publish flow controller change, and it will probably not be implemented, I'm afraid, at least not in the foreseeable future, as adding it would deviate from the implementations in the PubSub clients for other languages (plus the aforementioned OOM concerns). |
@plamut Thanks for all your help! |
Using a python gunicorn/flask service, running in a pod on GKE, to pull streaming data and then publish it to Pub/Sub, I receive errors regardless of batch size settings (including the default).
Using both the pubsub and pubsub_v1 PublisherClient, with certain sets of data and default or larger batch sizes (>1000), I get repeated "Retrying due to 504 Deadline Exceeded, sleeping 0.0s ..." messages, until ultimately I receive the following error in the added `done_callback`. Or, in some cases, no logs at all; the gunicorn worker just reboots silently. With smaller batch sizes in the low hundreds, I typically see no indication of an error, but the worker still silently fails and reboots, and does so in under 60s.
This occurs regardless of whether I'm running many publish workers or just one.
The GKE cluster is on a shared private VPC and is using workload identity.
This does seem to occur only with specific datasets, but there is no clear reason why those datasets would cause an error like this.
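For context, a minimal sketch of the publishing pattern described above (project, topic, and the data source are placeholders), with a done callback attached to each publish future so asynchronous failures are at least logged:

```python
import logging

from google.cloud import pubsub_v1

logger = logging.getLogger(__name__)

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names


def done_callback(future):
    try:
        message_id = future.result()  # re-raises the publish error, if any
        logger.debug("Published message %s", message_id)
    except Exception:
        logger.exception("Publishing failed")


for record in [b"record-1", b"record-2"]:  # stand-in for the streamed data
    future = publisher.publish(topic_path, data=record)
    future.add_done_callback(done_callback)
```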
This is the output of `pip3 freeze`: