Deadlock spotted in v0.0.80 #108
Hi @GiedriusS, thank you for providing the dump. I will try to find out the cause ASAP.
Hi @GiedriusS, it looks like 4
rueidis should not switch to a replica proactively.
Hi @GiedriusS, I am still trying hard to find the cause, but haven't found it yet. Could you also provide some information about your setup and how you generate the heavy load?
Yeah, this happens when the network usage gets saturated, so the connection to the Redis servers is probably lost at that point. I'll try to make a reproducible case tomorrow or when time permits. It should probably be enough to make lots of calls to Redis and then introduce network errors to reproduce this 🤔
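For reference, the kind of load generator being described might look roughly like the sketch below. The import path and the NewClient/B()/Do builder API are assumptions based on current rueidis releases; the v0.0.8x API may have differed, and the address, worker counts, and key names are placeholders.

```go
// Rough sketch of a load generator along the lines described above — assumed
// current rueidis API, placeholder address; combine with injected network
// faults (e.g. dropping traffic to the Redis nodes) to approximate the setup.
package main

import (
	"context"
	"log"
	"strconv"
	"sync"

	"github.com/redis/rueidis"
)

func main() {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"127.0.0.1:6379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()
	var wg sync.WaitGroup
	for w := 0; w < 64; w++ { // many concurrent workers to keep the connections saturated
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				key := "k" + strconv.Itoa(w) + ":" + strconv.Itoa(i)
				// Errors are expected once network faults are injected.
				if err := client.Do(ctx, client.B().Set().Key(key).Value("v").Build()).Error(); err != nil {
					log.Printf("worker %d: %v", w, err)
				}
			}
		}(w)
	}
	wg.Wait()
}
```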
fix: shutdown deadlock of pipe when its ring is full (#108)
Hi @GiedriusS, thank you for your goroutine dump. I did find the deadlock from it. There are some stack traces in the dump that helped me find it; the first one is:
It indicated that the
Since the underlying connection had already been closed after
Normally, it won't get stuck here after
However, from your goroutine dump, the injected
This indicated that the ring was full. Releasing the ring was exactly what the pipe was about to do next, but it was stuck waiting for
PR #113 fixed this deadlock by releasing the ring right after
There is also a new test case that reproduces the deadlock condition. The old code won't pass this test:
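To make the shape of that deadlock concrete, here is a minimal, self-contained sketch of the pattern being described — hypothetical types and names, not rueidis's actual pipe/ring code: a background writer blocks on a full bounded queue while Close waits for the writer, and the fix amounts to releasing the queue before waiting.

```go
// Minimal sketch of the deadlock shape described above — hypothetical types,
// not rueidis's actual pipe/ring code. A background writer blocks on a full
// bounded queue (standing in for the ring) while Close waits for the writer;
// the fix corresponds to releasing the queue before waiting.
package main

import (
	"fmt"
	"sync"
)

type pipe struct {
	queue chan string   // stands in for the command ring
	stop  chan struct{} // signals the writer that the connection is gone
	wg    sync.WaitGroup
}

func newPipe() *pipe {
	p := &pipe{queue: make(chan string, 2), stop: make(chan struct{})}
	p.wg.Add(1)
	go func() { // background writer keeps queueing commands
		defer p.wg.Done()
		for i := 0; ; i++ {
			select {
			case p.queue <- fmt.Sprintf("cmd-%d", i): // blocks once the queue is full
			case <-p.stop:
				return
			}
		}
	}()
	return p
}

// closeDeadlocks mirrors the buggy ordering: wait for the writer while the
// full queue still pins it. With a full queue this never returns.
func (p *pipe) closeDeadlocks() {
	p.wg.Wait() // writer is stuck sending into a full queue -> deadlock
}

// closeFixed mirrors the idea of the fix: release the queue (drain it and
// signal shutdown) so the blocked writer can exit, then wait for it.
func (p *pipe) closeFixed() {
	close(p.stop) // let the writer observe the shutdown
	go func() {
		for range p.queue { // drain whatever is still queued
		}
	}()
	p.wg.Wait()    // now returns, because the writer is no longer pinned
	close(p.queue) // stop the drainer
	fmt.Println("shut down cleanly")
}

func main() {
	newPipe().closeFixed()
}
```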
Hi @GiedriusS, the fix has been released in v0.0.82. Please check whether the same issue or another issue occurs.
Thank you for your detailed report and the amazing library! Let me update it and give it a try. Sorry for not helping you out much with the code, I'm a bit preoccupied with other projects :(
Couldn't reproduce it so far with v0.0.82, even though there were occurrences of really high load. I'll close this in a few days if nothing comes up. Thank you again 💪
Hi @GiedriusS, thank you very much for testing rueidis! Please feel free to report your findings. ❤️
Yeah, it's not reproducible anymore. Thanks for your fix ❤️
48 goroutines (the count from the graphic) are hanging here:
Hanging here, I think:
Hi @GiedriusS, thank you for continuing to try it. While I am still investigating your latest dump, could you also try v0.0.83? There is a goroutine leak fix in v0.0.83.
Hey @GiedriusS, thank you for your goroutine dump.
From the dump, the above goroutines were actually all hanging on
Hi @GiedriusS, While I am still investigating how will
Let me try and get back to you 👍
Hi @GiedriusS, I have fixed the leak. It turns out that this was a naive bug in the common lru cache's GetEntry:

```go
func (c *lru) GetEntry(key, cmd string) (entry *entry) {
	c.mu.RLock()
	store, ok := c.store[key]
	if ok {
		entry, ok = store.cache[cmd]
	}
	c.mu.RUnlock()
	if entry != nil {
		return entry
	}
	c.mu.Lock()
	if store == nil { // <-- this check is the bug: the store might already have been evicted
		if store, ok = c.store[key]; !ok {
			store = newStore()
			c.store[key] = store
		}
	}
	if entry, ok = store.cache[cmd]; !ok {
		entry = newEntry()
		store.cache[cmd] = entry
	}
	c.mu.Unlock()
	return entry
}
```

It tries to shortcut the lookup with the cheap read lock first. You probably have noticed the problem marked in the snippet: after upgrading to the write lock, the code only checks whether the store pointer obtained under the read lock is nil, but that store might already have been evicted from c.store in the meantime, so new entries can end up in a store that is no longer reachable from the cache. The fix has been released in v0.0.85. Thank you for being patient, and please let me know if you find further problems.
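For comparison, a corrected version of that double-checked locking pattern might look like the sketch below — hypothetical minimal types, not the actual v0.0.85 change. The essential difference is that the slow path re-reads c.store[key] under the write lock instead of trusting the pointer obtained under the read lock.

```go
// Sketch of the corrected double-checked locking pattern, with hypothetical
// minimal types standing in for rueidis's cache internals.
package main

import "sync"

type entry struct{}

type keyStore struct{ cache map[string]*entry }

type lru struct {
	mu    sync.RWMutex
	store map[string]*keyStore
}

func newStore() *keyStore { return &keyStore{cache: make(map[string]*entry)} }
func newEntry() *entry    { return &entry{} }

func (c *lru) GetEntry(key, cmd string) *entry {
	// Fast path: the cheap read lock.
	c.mu.RLock()
	var e *entry
	if s, ok := c.store[key]; ok {
		e = s.cache[cmd]
	}
	c.mu.RUnlock()
	if e != nil {
		return e
	}

	// Slow path: take the write lock and redo the lookup from scratch,
	// because anything observed under the read lock may be stale by now
	// (the store could have been evicted in between).
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.store[key]
	if !ok {
		s = newStore()
		c.store[key] = s
	}
	e, ok = s.cache[cmd]
	if !ok {
		e = newEntry()
		s.cache[cmd] = e
	}
	return e
}

func main() {
	c := &lru{store: make(map[string]*keyStore)}
	_ = c.GetEntry("key", "GET")
}
```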
Coming back to this - cannot reproduce this anymore. Thx for your work 🍻
I am trying Rueidis before updating my PR here: thanos-io/thanos#5593
I have noticed that under heavy load, some goroutines get stuck (deadlock) in Rueidis code:
I would expect the goroutine count to go down, but it stays the same.
Here's the goroutine dump from pprof:
rueidis_dump.txt
My guess is that under heavy load it tries to switch to a replica in the Redis Cluster and then fails to time out some operation somewhere? I don't know how to reproduce this ATM, but maybe you'll be able to get some ideas from the dump?
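For anyone following along, a full-stack goroutine dump like the attached one can be captured from any Go service with net/http/pprof enabled. This is a generic sketch using only the standard library, not anything specific to Thanos or rueidis; the listen address is a placeholder.

```go
// Expose the standard pprof endpoints and fetch a full goroutine dump.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Then capture the dump with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > rueidis_dump.txt
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```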