UserWarning: Enforce ordering consumer reached retry limit in production #141
Comments
Yes, they are indeed running in a tight loop as you suggest, so if they are "too idle" and the other messages have not yet made it through, they'd quickly spin-lock into the retry limit. It's possible I might have to drop enforce_ordering's strict mode (the one you're using here) and just have it provide slight mode, where it merely makes sure connect runs before receive, or somehow provide more logic so that it knows when things before it in the queue are still being processed. I'm not really sure how to proceed; it's definitely not doing what you want in this case, and increasing the retry limit will just lead to more spin-locking (though ironically, higher load would probably solve the problem). Let me think about it a bit.
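For reference, a sketch of the two modes being contrasted here, assuming the Channels 0.x-era API in which `enforce_ordering` lives in `channels.sessions` and slight mode is selected with `slight=True`; both details are assumptions about that era's API, not stated in this thread:

```python
# Sketch only: strict vs. slight ordering as exposed by the (assumed)
# Channels 0.x-era decorator API.
from channels.sessions import enforce_ordering


@enforce_ordering  # strict: receive N+1 is not consumed until receive N has finished
def ws_receive_strict(message):
    message.reply_channel.send({"text": message["text"]})


@enforce_ordering(slight=True)  # slight: only guarantees connect ran before any receive
def ws_receive_slight(message):
    message.reply_channel.send({"text": message["text"]})
```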
One of the supposed benefits of WebSockets over HTTP is strict message ordering, so I'd love to see us find a way to maintain a mainline path to support this. Otherwise it feels like an implementation detail of Django Channels ended up breaking a typical WebSocket benefit. Let me now throw out a couple of approaches to begin the brainstorming...
I agree, in-order processing is important for WebSocket and any other similar protocol, and I want to keep it a thing.
The fundamental problem to solve is how to avoid re-injecting messages back onto an active channel while another worker is processing something from the same reply channel. Thinking about it, it's perhaps possible that this could be done with a new channel per reply channel that unordered messages get shunted onto by the decorator if they're out of order, and then the decorator around the running consumer pulls things from that channel back onto the original one. It means a lot of shoving things to and fro, but it gets it done, I think. The other, more radical, option is to change the very way […]
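Below is a self-contained, in-memory sketch of the "shunt onto a per-reply-channel wait channel, then re-inject" idea described above. The queue objects, message field names, and the `ordered` decorator are illustrative assumptions, not the actual Channels implementation:

```python
# In-memory model of the idea: out-of-order messages are parked on a
# per-reply-channel wait queue, and the decorator wrapped around the running
# consumer pushes them back onto the main channel once it finishes.
from collections import defaultdict, deque

main_channel = deque()                  # stands in for the shared task channel
wait_channels = defaultdict(deque)      # one wait channel per reply channel
next_expected = defaultdict(lambda: 1)  # next in-order message per reply channel


def ordered(consumer):
    """Decorator sketch enforcing per-reply-channel ordering without spinning."""
    def wrapper(message):
        reply = message["reply_channel"]
        if message["order"] != next_expected[reply]:
            # Out of order: park it instead of re-raising ConsumeLater in a loop.
            wait_channels[reply].append(message)
            return
        consumer(message)
        next_expected[reply] += 1
        # Shove anything parked for this reply channel back onto the main
        # channel; still-early messages will simply be parked again next pass.
        while wait_channels[reply]:
            main_channel.append(wait_channels[reply].popleft())
    return wrapper
```

The cost is an extra round of sends per out-of-order message, which is the "shoving things to and fro" mentioned above.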
This actually seems like a great solution to me. You avoid unnecessary spin locks, and the messages become available for pickup by channel workers on the main channel at the exact moment they can be serviced, since you are guaranteed the previous blocking message has now finished.

You'd want to make sure this mechanism could be used for an arbitrary number of unordered messages (e.g. while message 1 is being processed, messages 2 and 3 come in; both are picked up by channel workers and inserted into the out-of-order channel keyed by the reply channel, and then when message 1 completes processing, both of these messages are taken from the out-of-order channel and placed back on the main channel in FIFO order).

I assume you have a mechanism to also handle the potential race condition where worker 2 adds a message to the out-of-order channel at the exact same time worker 1 checks for any out-of-order messages to insert back into the main channel. Otherwise, if worker 1 completes before worker 2 adds the message, you could have a message in the out-of-order channel but no way to pull it back out to the main channel, since there are no longer any workers processing messages associated with that reply channel.

Regarding the more radical approach: preventing messages from being processed in parallel is still also important for my scenarios, so I wouldn't want a solution to ordering to come at the cost of now allowing parallel message processing.
Well, I haven't written it yet, so I don't yet; this is certainly a risk, and the problem is that there's no event to hang a later check of that delayed-task channel on either, so I suspect the naive implementation would lose the message in this scenario (though it would get it back once another message came in and the decorator got a chance to look at the channel again). I do suspect there's no perfect solution here without either proper distributed locking or pinning all message processing for one WebSocket to exactly one thread (which is how most other systems achieve this right now). I suspect the best we could do is to run the "mark next message as available for running" check strictly before the "empty the wait queue back onto the main channel" code, so that the failure case is that the next worker begins before the current one exits (but after the business logic has run), rather than the other way around, where, as you say, you risk losing messages forever.
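Continuing the illustrative in-memory sketch from above (same assumed names, still not the real implementation), the two orderings being compared look like this:

```python
# Safe order: advance the "next expected" marker BEFORE draining the wait
# queue, so the worst case is that the next worker starts while this one is
# still unwinding, never a permanently stranded message.
def finish_safe(reply):
    next_expected[reply] += 1                    # 1) mark next message runnable
    while wait_channels[reply]:                  # 2) then drain parked messages
        main_channel.append(wait_channels[reply].popleft())


# Risky order: drain first, then mark runnable. Another worker can park a
# message after the drain has already run, leaving it stranded on the wait
# queue with nobody left to drain it until more traffic for that reply
# channel arrives.
def finish_risky(reply):
    while wait_channels[reply]:                  # 1) drain
        main_channel.append(wait_channels[reply].popleft())
    next_expected[reply] += 1                    # 2) mark runnable
```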
Oh ya, the strict ordering you suggest seems fine then, since the message wouldn't be lost.
The new approach addressing this issue is now merged, so it will be available in the next release.
I'm unfortunately seeing the error from this issue's title (the "Enforce ordering consumer reached retry limit" UserWarning) in production.
As I mentioned before, I believe I have correctly decorated all protocol consumers; here is my consumer code:
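(The consumer code referred to here is not preserved in this copy. For context, a minimal sketch of strictly ordered consumers from that era of Channels, assuming `enforce_ordering` is imported from `channels.sessions`; the function names and bodies are placeholders, not the reporter's code.)

```python
# Illustrative only, not the reporter's actual consumers.
from channels.sessions import enforce_ordering


@enforce_ordering  # strict ordering on every protocol consumer
def ws_connect(message):
    pass  # connection/session setup would go here


@enforce_ordering
def ws_receive(message):
    # Placeholder business logic: echo the frame back to the client.
    message.reply_channel.send({"text": message["text"]})


@enforce_ordering
def ws_disconnect(message):
    pass  # cleanup would go here
```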
I'm sending socket messages to the server at a fairly slow rate now (one socket message sent every 500ms). I'm currently running 3 channel workers against Daphne. There are currently no other socket connections on the server, just the one user / socket connection. The server in general is pretty idle.
My best guess at what may be happening is that my server may in fact be too idle. When a message comes in, one channel worker picks it up; then the next message comes in and another channel worker tries to pick it up, but since the first one is not done it raises ConsumeLater. And since there are no other users and therefore nothing else to pick up besides that same task, it simply spins very quickly through its retries. Is this possible? Are the channel workers running in a tight loop, constantly picking up tasks and in this case raising ConsumeLater? Or is there some delay before they will pick up a subsequent task? Or is there maybe some mechanism by which they only pull subsequent tasks when another worker has finished its task? If it's just a tight loop, then it seems like this issue would happen whenever the rate of incoming messages exceeds the processing rate and there are more idle workers than socket connections to process.
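A toy model of that theory (not Channels itself; the retry limit and message shape are made up for illustration): with idle workers and strict ordering, the lone waiting message is re-picked immediately each time and exhausts its retries long before the in-flight message finishes.

```python
# Toy simulation of the suspected tight loop; numbers are illustrative only.
from collections import deque

RETRY_LIMIT = 10                             # assumed limit, not the real value
completed_order = 0                          # message 1 is still in flight on worker 1
queue = deque([{"order": 2, "retries": 0}])  # message 2, waiting its turn

while queue:
    msg = queue.popleft()                    # an otherwise-idle worker grabs it instantly
    if msg["order"] == completed_order + 1:
        break                                # would run normally (never happens in this toy)
    msg["retries"] += 1
    if msg["retries"] >= RETRY_LIMIT:
        print("retry limit reached; message 2 is dropped")
        break
    queue.append(msg)                        # ConsumeLater: straight back onto the queue
```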
The dropping of the message is unfortunately breaking my collaborative editing scenario, since a message from the client is being dropped and never processed. I'm planning on building re-send logic into the client, along with message IDs so the server doesn't process the same message twice, but I was expecting that to be needed only in exceptional cases (server connection died, poor latency, etc.), not in a mainline scenario like this.
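A minimal sketch of the planned client re-send plus server-side dedup, assuming a client-generated message ID field (the `msg_id` name and payload shape are hypothetical); real code would scope and expire the seen-ID set per connection:

```python
# Server-side dedup sketch for the re-send plan described above.
# Hypothetical payload shape: {"msg_id": "...", "text": "..."}.
seen_ids = set()  # in practice: per connection, bounded or expiring


def handle_client_message(payload, process):
    """Run `process` at most once per msg_id, ignoring re-sent duplicates."""
    msg_id = payload["msg_id"]
    if msg_id in seen_ids:
        return        # duplicate from a client re-send; already handled
    seen_ids.add(msg_id)
    process(payload)
```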
Thoughts on how to resolve?