Eliminate redundant dial mutex causing unbounded connection queue contention #3088

LINKIWI · 2024-08-10T14:14:39Z

The immediate symptom this PR addresses: during periods of transient server connectivity errors, go-redis commands time out after 60s or more, even though the socket read/write timeouts are 3s and the context timeout on the commands is 1s. We have root-caused this bug to lock contention in the connection pool's lazy dialer.

There is a mutual exclusion lock in DialHook, which allows only one server dial to occur at a time. In the event of server connectivity errors, this causes unbounded connection queueing under highly concurrent workloads.

Consider, for example, a concurrent workload with the default dial timeout of 5s, and an unresponsive server endpoint. During this period, all dials are timing out.

1. The connection pool is empty, or all connections are currently occupied by in-flight I/O.
2. N commands are executed concurrently, all of which need to acquire new connections.
3. All N commands attempt to add a connection to the pool, which lazily calls DialHook.
4. The first connection attempt acquires the mutex, and times out after 5s.
5. The second connection attempt is blocked on mutex acquisition, and after acquiring the lock, itself also times out after 5s. The total wall clock time that the second command was blocked is now 10s.
...
This results in a cascading failure mode where, under this scenario, individual commands can occupy multiple minutes of wall clock time due to lock contention.

DialHook itself does not mutate any state in the hooksMixin. I believe the mutex is redundant, and can be eliminated. The original commit that introduced the lock attempts to fix a race condition encountered in AddHook. The mutex that guards chain should be sufficient for this purpose.

We have validated that this change fixes the unbounded queueing and prevents the system from entering a prolonged state of serving no useful throughput during these periods.

I have also added a unit test to capture the regression. Without this patch, the unit test correctly fails; Ping takes successively longer on each invocation (1s, 2s, 3s, etc.).


LINKIWI · 2024-08-24T21:25:37Z

@monkey92t Are you the right person to review this? Thanks.

monkey92t · 2024-08-28T03:07:12Z

> @monkey92t Are you the right person to review this? Thanks.

Sorry, I don't have the relevant permission. ping @ofekshenawa ?


LINKIWI added 2 commits August 10, 2024 07:12

Eliminate redundant dial mutex causing unbounded connection queue contention

fedd45f

Dialer connection timeouts unit test

8975e1e

LINKIWI mentioned this pull request Aug 10, 2024

Bound connection pool background dials to configured dial timeout #3089

Open

iphpweb approved these changes Nov 5, 2024


Merge branch 'master' into redundant-dial-mutex

7532edf

ofekshenawa approved these changes Nov 13, 2024


ofekshenawa merged commit 080e051 into redis:master Nov 20, 2024
10 checks passed

LINKIWI mentioned this pull request Jan 11, 2025

Reinstate read-only lock on hooks access in dialHook to fix data race #3225

Merged

ndyakov mentioned this pull request Feb 21, 2025

release: 9.7.1 patch #3278

Merged
