Periodic IO stalls/pauses #1494
I should also add that this hypothesis isn't supported by other application-level metrics; we have a pretty good idea of max notification sizes, and unless we have some inconveniently-missing data points, there's no indication that any unusually large payloads were in play at the time of these stalls.
Thanks a lot for the elaborate analysis. Looking at the log lines, the stall doesn't seem to be related to I/O since calls to
Are you able to deploy a customized Lettuce version to your systems so we can continue the investigation? It would be helpful to identify which method causes the delay (…). It would also make sense to check on which thread the debug log stems from. Being on the EventLoop thread causes a direct write in netty. Being on a different thread enqueues the action to be invoked by an EventLoop thread.
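For illustration, the kind of per-write instrumentation being discussed might look roughly like this; the helper, its name, and its placement are purely illustrative and not actual Lettuce internals:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only. Per write, record which thread issued it, whether that thread
// is the channel's event loop, and how long the write+flush took to complete.
final class WriteTimingProbe {

    private static final Logger log = LoggerFactory.getLogger(WriteTimingProbe.class);

    static ChannelFuture timedWriteAndFlush(Channel channel, Object command) {
        long startNanos = System.nanoTime();
        String callingThread = Thread.currentThread().getName();
        boolean calledFromEventLoop = channel.eventLoop().inEventLoop();

        return channel.writeAndFlush(command).addListener(future -> log.debug(
                "writeAndFlush took {} µs (called from {}, inEventLoop={}, completed on {})",
                (System.nanoTime() - startNanos) / 1_000,
                callingThread,
                calledFromEventLoop,
                Thread.currentThread().getName()));
    }
}
```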
Thanks for the quick reply!
Sure; I think I can just tweak our log message format to include the calling thread. Did you have something else in mind?
Potentially yes! Do you have an estimate for when that build could be ready? If you expect it to be ready quickly, I can roll together the custom build, the thread dump logic, and the logging changes into a single deployment. Cheers!
I'd prefer that you add log lines to your fork of Lettuce, as that allows faster iteration/feedback cycles.
Okay; I think I'm a little confused about next steps then. Is the idea that you will produce a branch with the specific information you're looking for, or are you asking me to instrument everything in my own branch?
I'm sorry—I re-parsed your message and it's clear that you're asking me to put together the build. I can certainly do that; can you clarify what, specifically, you're hoping to learn? Just timings/threads around |
Yes, basically to narrow down where this comes from.
Understood. Will report back soon. It may take a couple days because America. |
Don't mean to press you for a detailed code review, but as an "are we on the same page?" check, does 6.0.x...jon-signal:debug_timing_logging look reasonable to you?
Looks good. Note that activation should happen only once per TCP connection.
I've captured logs/thread dumps around a few instances of command timeouts. There's a LOT in here and I'll need a little time to sift through it all and extract the important pieces, but I wanted to share an update and be clear that things are moving forward.
The logs I've captured so far are less clear-cut than the ones from previous captures, and I'm going to adjust our strategy a bit. One thing I HAVE noticed from the data we've collected so far is that …

Analysis is still very much in progress (and a shift in data collection strategy may be in order on my part), but I wanted to share some early findings.
I'm continuing to capture and analyze new data. It's still too soon to say for sure, but it looks like the nature of this issue has changed since I deployed our custom build of Lettuce. Notably, it's derived from Lettuce 6.0.1 instead of Lettuce 6.0.0; that was mostly an oversight on my part, but now I'm wondering if something in #1462 or #1476 wound up making a significant difference after all. It may also be that there are multiple phenomena at work, and this is just weird sampling/luck on my part. That said, what I'm seeing now is that, rather than all channels grinding to a halt simultaneously, only one will. And, significantly, this time, a stall begins exactly when we see a group of TCP retransmissions from Lettuce to the Redis server and resolves almost exactly 3 seconds (our command timeout) later.
The packets that are getting retransmitted are both …

While this stall is happening, Lettuce's logs show Lettuce happily continuing to write and flush commands, but receiving no data back. As soon as the stall resolves, Lettuce reports that it's receiving data again (and has a backlog of commands on the stack). So in this most recent batch of samples, it looks like Lettuce is mostly doing the right thing and the problem is happening deeper in the stack. At the same time, it's fishy to me that data starts flowing again as soon as the first command times out (not that I can think of anything Lettuce itself could be doing to produce the observed effect at the TCP level). I'll certainly acknowledge the possibility that these stalls are more frequent than the ones I'm observing, and that it's just a coincidence that the stalls long enough to cause a command timeout last almost exactly 3s. I'm going to continue to capture samples to see if this pattern holds.
Wow, that's getting pretty weird. Can you correlate anything like keep-alive or send/receive buffer effects with the timeouts? At this point, maybe even something holds a monitor somewhere and the I/O thread isn't able to continue.
I've just captured two more snapshots that show almost exactly the same behavior: we retransmit a TCP packet to the Redis server, then the connection essentially goes dead for three seconds. One of the snapshots happened issuing the same kind of
Not yet, but I'll see what I can find.
I think we can test and refute that hypothesis with the information we already have (unless I'm misunderstanding which thread you mean). According to the Lettuce debug logs, the thread that handles the frozen channel ( |
Thanks, I missed that the thread continues to progress on other channels. If you see TCP retransmissions, then this sounds a bit like packet loss.
I realize this may have come across as snarky, which wasn't my intent! I just mean that I checked the logs after you asked and didn't need to capture new data. I don't think the behavior of the IO thread was obvious from the prior discussion. Please pardon the poor phrasing on my part!
I agree, though the consistent three-second timing remains suspicious to me. I'll see what I can learn on the AWS side of things.
Ah—an important clue: the first four default TCP retransmission timeouts (with exponential backoff) are 200ms, 400ms, 800ms, and 1600ms. That means that four retransmissions take almost exactly three seconds. I'll try moving our timeout to 3.5 seconds to see if or how that changes things.
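As a quick check on that arithmetic, a throwaway sketch (the 200ms starting value is an assumption about the platform's minimum RTO):

```java
// Throwaway arithmetic check: cumulative time elapsed at each retransmission, assuming a
// 200 ms initial retransmission timeout that doubles on every attempt.
public class RtoBackoffCheck {

    public static void main(String[] args) {
        long rtoMillis = 200; // assumed initial/minimum RTO
        long elapsedMillis = 0;

        for (int attempt = 1; attempt <= 4; attempt++) {
            elapsedMillis += rtoMillis;
            System.out.printf("retransmission %d fires at ~%d ms%n", attempt, elapsedMillis);
            rtoMillis *= 2; // exponential backoff
        }
        // Prints 200, 600, 1400, 3000 -- the fourth retransmission lands right around a 3 s command timeout.
    }
}
```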
No worries, I understand that getting to the root cause can be a pretty extensive undertaking. Thanks for keeping us posted. TIL that retransmission windows are so close together. I somehow expected multiple seconds (something above 10 or 30 seconds).
I'm continuing to work with AWS on this issue, but don't have anything to report from that thread yet. In thinking through what could be happening, I tried to more closely align some things in the Lettuce logs with our packet captures and noticed something curious. Here's another representative stall:
I filtered the logs to see what the thread handling that connection was doing at the time:
Even though none of those lines are related to the connection in question (on port 51434), there's a suspicious stall between 18:01:17,020 and 18:01:17,753, and that's right when traffic from the Redis host seems to wake back up abruptly. Checking our application logs, we see a batch of 46 timeouts starting at 18:01:17,017 and ending at 18:01:17,756. I've attached a slightly longer excerpt from the Lettuce logs that shows the backlog getting cleared out.

One thing that's curious to me is that, even though we can see packets arriving during that pause in the logs, Lettuce doesn't notice them for several hundred milliseconds, and then the packet logs and Lettuce logs seem to (temporarily?) lose agreement about what's happening when. My wager is that the IO thread is getting tied up populating stack traces for the …

One thing I think this does tell us is that we're probably not overrunning the TCP receive buffer, because we see packets arriving before Lettuce logs that it has processed them (unless there's some weird packet purgatory where something is taking packets from the buffer and holding them for a significant amount of time before Lettuce processes them). It still doesn't explain why the stall happened in the first place, though. There may also be some new hints in the attached thread log that jump out to an experienced reader but aren't obvious to me.

Anyhow, I don't think there are any earth-shattering revelations in here, and I'll continue to work with the AWS folks. Still, I wanted to share updates as I have them. Cheers!
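As a very rough way to poke at that wager, timing a batch of exception constructions on one thread gives a baseline (a plain RuntimeException stands in for the real exception type; numbers vary a lot with stack depth and JVM):

```java
// Back-of-the-envelope check only: time the construction of a batch of exceptions, since
// Throwable's constructor calls fillInStackTrace(). Real costs depend heavily on stack depth.
public class StackTraceCostCheck {

    public static void main(String[] args) {
        int batchSize = 46; // matches the batch of timeouts seen in the logs above
        long startNanos = System.nanoTime();

        for (int i = 0; i < batchSize; i++) {
            new RuntimeException("simulated command timeout");
        }

        System.out.printf("constructing %d exceptions took %.3f ms%n",
                batchSize, (System.nanoTime() - startNanos) / 1e6);
    }
}
```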
To address this directly, I haven't seen any signs of buffer "events" anywhere; if we were overrunning our buffers, I'd expect to see zero-window/window-full packets, but those never seem to happen before a stall. Sometimes we'll wind up with a zero-window packet after a stall when we seem to get a gigantic wave of deferred traffic, but nothing that seems to predict a stall before it happens.
Is there any update, specifically anything to do for Lettuce?
Good morning! No update yet, but I'm actively gathering more information and hope to have an update to share soon. In the meantime, the status is:
Thanks for your patience!
Since we haven't heard any update on this ticket, I'm going to close it. If you would like us to look at this issue, please provide additional information and we will re-open the issue. |
I understand the decision to close this, but please accept my assurance that the investigation is both active and ongoing on our end. I'll post an update as soon as I have one. |
We've observed behavior in our production systems where all Lettuce operations will (seemingly without provocation) grind to a halt, then resume just as abruptly. We've directly measured "stalls" between 3 and 5 seconds, and can infer the existence of stalls shorter than 3s and longer than 10s from other metrics. Stalls appear to happen on the order of once every host-week (which can add up quickly when there are lots of hosts in play). This generally results in a bunch of `RedisCommandTimeoutException`s and a pileup of pending requests.

We've observed this behavior with both Lettuce 5.3.4 and 6.0.0 under Ubuntu 18.04. We have not yet tried 5.3.5 or 6.0.1, but (unless this is somehow a TCP NODELAY issue) I haven't spotted anything in the changelog that seems relevant.
Our setup is that we have tens of application servers, each of which has a default `ClientResources` instance shared between three `RedisClusterClient` instances. We have a default command timeout of 3 seconds. Our application servers have 8 physical cores, and so my understanding is that our IO thread pools have 8 threads. Our application servers live in AWS EC2 and communicate with three Redis clusters hosted by AWS ElastiCache.

Between all three clusters, application instances issue somewhere between 5,000 and 10,000 Redis commands per second.
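For reference, that setup looks roughly like the sketch below; the hostnames and the exact option wiring are placeholders rather than our real configuration:

```java
import java.time.Duration;

import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

// Placeholder setup sketch: one shared ClientResources (and therefore one shared IO thread
// pool) used by three cluster clients, each with a 3 s default command timeout.
public class RedisSetupSketch {

    public static void main(String[] args) {
        ClientResources sharedResources = DefaultClientResources.create();

        RedisClusterClient clusterA = RedisClusterClient.create(sharedResources, "redis://cluster-a.example.com:6379");
        RedisClusterClient clusterB = RedisClusterClient.create(sharedResources, "redis://cluster-b.example.com:6379");
        RedisClusterClient clusterC = RedisClusterClient.create(sharedResources, "redis://cluster-c.example.com:6379");

        ClusterClientOptions options = ClusterClientOptions.builder()
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(3))) // default command timeout
                .build();

        clusterA.setOptions(options);
        clusterB.setOptions(options);
        clusterC.setOptions(options);
    }
}
```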
We've instrumented instances by logging from `io.lettuce.core.protocol` at `DEBUG` and by observing TCP traffic. When we experience a stall, the symptoms are:

A representative log line when attempting to write a command during a stall:
…which will resolve several seconds later:
We do not see any reads reported by Lettuce during a stall, though we do see them before and after.
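For completeness, the `io.lettuce.core.protocol` capture boils down to a per-logger level override. A minimal programmatic sketch, assuming Logback as the SLF4J backend (an assumption on my part; in practice this would normally live in the logging config):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

// Assumes Logback as the SLF4J backend. Raises only the Lettuce protocol package to DEBUG
// so that command write/flush/read activity shows up without enabling DEBUG globally.
public final class ProtocolDebugLogging {

    public static void enable() {
        Logger protocolLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core.protocol");
        protocolLogger.setLevel(Level.DEBUG);
    }
}
```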
I have not yet found anything in our application logs, Lettuce's logs, or TCP events that predicts when one of these stalls will occur. We will often see TCP retransmissions or "zero window" packets during or after a stall, but have not yet seen a consistent pattern prior to the stall. We have not observed any reconnections or cluster topology changes in our own logs, and to the best of my knowledge, that's supported by Lettuce's own logging because channel IDs remain consistent before and after a stall.
I do usually see an increase in allocated direct memory in the minute or two before one of these stalls, but we also see increases in allocated direct memory that do not precede a stall.
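For context, one way to sample direct-buffer usage from the JVM is shown below; this is illustrative and not necessarily how our metrics pipeline collects the number, and Netty can also allocate direct memory that this pool does not track:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Illustrative sample of JVM direct-buffer usage via the standard BufferPoolMXBean.
public class DirectMemorySample {

    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("direct buffers: count=%d, used=%d bytes%n",
                        pool.getCount(), pool.getMemoryUsed());
            }
        }
    }
}
```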
I'm aware that similar issues have been reported in the past, but have generally been closed due to a lack of debug information. My next step is to capture some thread dumps when these stalls occur. In the meantime, I think we can address some hypotheses with reasonable (but not perfect) confidence:
- We use an `ExecutorService` to avoid blocking Lettuce/Netty IO threads (sketched below)
- We use `sync()` commands for virtually everything

I expect to have a thread dump in the next few days. In the meantime, while I can't provide the raw debug data in its entirety, I'd be happy to answer any questions about our debug data and go digging for potentially-relevant excerpts that I can share here.
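To make that first bullet concrete, here's a sketch of the pattern; the pool size and names are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

// Sketch only: blocking sync() calls run on an application-owned executor so they never
// tie up a Lettuce/Netty IO thread. Pool size and names are illustrative.
public class SyncOnAppExecutor {

    private final ExecutorService appExecutor = Executors.newFixedThreadPool(16);
    private final RedisAdvancedClusterCommands<String, String> commands;

    public SyncOnAppExecutor(StatefulRedisClusterConnection<String, String> connection) {
        this.commands = connection.sync();
    }

    public Future<String> get(String key) {
        // The sync() GET blocks one of our pool threads (for up to the command timeout),
        // never an event-loop thread.
        return appExecutor.submit(() -> commands.get(key));
    }
}
```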