Deadlock spotted in v0.0.80 #108
Hi @GiedriusS, thank you for providing the dump. I will try to find out the cause ASAP.
Hi @GiedriusS, it looks like 4
rueidis should not switch to a replica proactively.
Hi @GiedriusS, I am still trying hard to find the cause, but haven't found it yet. Could you also provide some information about your setup and how you generate the heavy load?
Yeah, this happens when the network usage gets saturated, so the connection to the Redis servers is probably lost at that point. I'll try to make a reproducible case tomorrow or when time permits. It should probably be enough to make lots of calls to Redis and then introduce network errors to reproduce this 🤔
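For reference, the kind of load generator being described might look roughly like the sketch below. The import path and the NewClient/B()/Do builder API are assumptions based on current rueidis releases; the v0.0.8x API may have differed, and the address, worker counts, and key names are placeholders.

```go
// Rough sketch of a load generator along the lines described above — assumed
// current rueidis API, placeholder address; combine with injected network
// faults (e.g. dropping traffic to the Redis nodes) to approximate the setup.
package main

import (
	"context"
	"log"
	"strconv"
	"sync"

	"github.com/redis/rueidis"
)

func main() {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"127.0.0.1:6379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	var wg sync.WaitGroup
	for w := 0; w < 64; w++ { // many concurrent workers to keep the connections saturated
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				key := "k" + strconv.Itoa(w) + ":" + strconv.Itoa(i)
				// Errors are expected once network faults are injected.
				if err := client.Do(ctx, client.B().Set().Key(key).Value("v").Build()).Error(); err != nil {
					log.Printf("worker %d: %v", w, err)
				}
			}
		}(w)
	}
	wg.Wait()
}
```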
fix: shutdown deadlock of pipe when its ring is full (#108)
Hi @GiedriusS, thank you for your goroutine dump. I did find the deadlock from it. There are some stack traces in the dump that helped me find it; the first one is:
It indicated that the
Since the underlying connection had already been closed after
Normally, it won't get stuck here after
However, from your goroutine dump, the injected
This indicated that the ring was full. Releasing the ring was exactly what the pipe was about to do next, but it was stuck waiting for
PR #113 fixed this deadlock by releasing the ring right after
There is also a new test case that reproduces the deadlock condition. The old code won't pass this test:
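To make the shape of that deadlock concrete, here is a minimal, self-contained sketch of the pattern being described — hypothetical types and names, not rueidis's actual pipe/ring code: a background writer blocks on a full bounded queue while Close waits for the writer, and the fix amounts to releasing the queue before waiting.

```go
// Minimal sketch of the deadlock shape described above — hypothetical types,
// not rueidis's actual pipe/ring code. A background writer blocks on a full
// bounded queue (standing in for the ring) while Close waits for the writer;
// the fix corresponds to releasing the queue before waiting.
package main

import (
	"fmt"
	"sync"
)

type pipe struct {
	queue chan string   // stands in for the command ring
	stop  chan struct{} // signals the writer that the connection is gone
	wg    sync.WaitGroup
}

func newPipe() *pipe {
	p := &pipe{queue: make(chan string, 2), stop: make(chan struct{})}
	p.wg.Add(1)
	go func() { // background writer keeps queueing commands
		defer p.wg.Done()
		for i := 0; ; i++ {
			select {
			case p.queue <- fmt.Sprintf("cmd-%d", i): // blocks once the queue is full
			case <-p.stop:
				return
			}
		}
	}()
	return p
}

// closeDeadlocks mirrors the buggy ordering: wait for the writer while the
// full queue still pins it. With a full queue this never returns.
func (p *pipe) closeDeadlocks() {
	p.wg.Wait() // writer is stuck sending into a full queue -> deadlock
}

// closeFixed mirrors the idea of the fix: release the queue (drain it and
// signal shutdown) so the blocked writer can exit, then wait for it.
func (p *pipe) closeFixed() {
	close(p.stop) // let the writer observe the shutdown
	go func() {
		for range p.queue { // drain whatever is still queued
		}
	}()
	p.wg.Wait()    // now returns, because the writer is no longer pinned
	close(p.queue) // stop the drainer
	fmt.Println("shut down cleanly")
}

func main() {
	newPipe().closeFixed()
}
```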
Hi @GiedriusS, the fix has been released in v0.0.82. Please check whether the same issue or another issue occurs.
Thank you for your detailed report and the amazing library! Let me update it and give it a try. Sorry for not helping you out much with the code, I'm a bit preoccupied with other projects :(
Couldn't reproduce it so far with v0.0.82, even though there were occurrences of really high load. I'll close this in a few days if nothing comes up. Thank you again 💪
Hi @GiedriusS, thank you very much for testing rueidis! Please feel free to report your findings. ❤️
Yeah, it's not reproducible anymore. Thanks for your fix ❤️
48 goroutines (the count from the graphic) are hanging here:
Hanging here, I think:
Hi @GiedriusS, thank you for continuing to try it. While I am still investigating your latest dump, could you also try v0.0.83? There is a goroutine leak fix in v0.0.83.
Hey @GiedriusS, thank you for your goroutine dump.
From the dump, the above goroutines were actually all hanging on
Hi @GiedriusS, While I am still investigating how will
Let me try and get back to you 👍
Hi @GiedriusS, I have fixed the leak. It turns out that this was a naive bug in the common lru cache's GetEntry:

```go
func (c *lru) GetEntry(key, cmd string) (entry *entry) {
	c.mu.RLock()
	store, ok := c.store[key]
	if ok {
		entry, ok = store.cache[cmd]
	}
	c.mu.RUnlock()
	if entry != nil {
		return entry
	}
	c.mu.Lock()
	if store == nil { // <-- this check is the bug: the store might already have been evicted
		if store, ok = c.store[key]; !ok {
			store = newStore()
			c.store[key] = store
		}
	}
	if entry, ok = store.cache[cmd]; !ok {
		entry = newEntry()
		store.cache[cmd] = entry
	}
	c.mu.Unlock()
	return entry
}
```

It tries to shortcut the lookup with the cheap read lock first. You probably have noticed the problem marked in the snippet: after upgrading to the write lock, the code only checks whether the store pointer obtained under the read lock is nil, but that store might already have been evicted from c.store in the meantime, so new entries can end up in a store that is no longer reachable from the cache. The fix has been released in v0.0.85. Thank you for being patient, and please let me know if you find further problems.
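For comparison, a corrected version of that double-checked locking pattern might look like the sketch below — hypothetical minimal types, not the actual v0.0.85 change. The essential difference is that the slow path re-reads c.store[key] under the write lock instead of trusting the pointer obtained under the read lock.

```go
// Sketch of the corrected double-checked locking pattern, with hypothetical
// minimal types standing in for rueidis's cache internals.
package main

import "sync"

type entry struct{}

type keyStore struct{ cache map[string]*entry }

type lru struct {
	mu    sync.RWMutex
	store map[string]*keyStore
}

func newStore() *keyStore { return &keyStore{cache: make(map[string]*entry)} }
func newEntry() *entry    { return &entry{} }

func (c *lru) GetEntry(key, cmd string) *entry {
	// Fast path: the cheap read lock.
	c.mu.RLock()
	var e *entry
	if s, ok := c.store[key]; ok {
		e = s.cache[cmd]
	}
	c.mu.RUnlock()
	if e != nil {
		return e
	}

	// Slow path: take the write lock and redo the lookup from scratch,
	// because anything observed under the read lock may be stale by now
	// (the store could have been evicted in between).
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.store[key]
	if !ok {
		s = newStore()
		c.store[key] = s
	}
	e, ok = s.cache[cmd]
	if !ok {
		e = newEntry()
		s.cache[cmd] = e
	}
	return e
}

func main() {
	c := &lru{store: make(map[string]*keyStore)}
	_ = c.GetEntry("key", "GET")
}
```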
Coming back to this - cannot reproduce this anymore. Thx for your work 🍻
I am trying Rueidis before updating my PR here: thanos-io/thanos#5593
I have noticed that under heavy load, some goroutines get stuck (deadlock) in Rueidis code:
I would expect the goroutine count to go down, but it stays the same.
Here's the goroutine dump from pprof:
rueidis_dump.txt
My guess is that under heavy load it tries to switch to a replica in the Redis Cluster and then fails to time out some operation somewhere? I don't know how to reproduce this ATM, but maybe you'll be able to get some ideas from the dump?
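For anyone following along, a full-stack goroutine dump like the attached one can be captured from any Go service with net/http/pprof enabled. This is a generic sketch using only the standard library, not anything specific to Thanos or rueidis; the listen address is a placeholder.

```go
// Expose the standard pprof endpoints and fetch a full goroutine dump.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Then capture the dump with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > rueidis_dump.txt
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```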