Eliminate redundant dial mutex causing unbounded connection queue contention #3088

Merged: 3 commits, Nov 20, 2024

Contributor

@LINKIWI LINKIWI commented Aug 10, 2024

The immediate symptom this PR attempts to address: during periods of transient server connectivity errors, go-redis commands take 60s or more to time out, even though the socket read/write timeouts are 3s and the context timeout on the commands is 1s. We have traced the root cause to lock contention in the connection pool's lazy dialer.

DialHook holds a mutual exclusion lock that allows only one server dial at a time. When the server is unreachable, this causes unbounded connection queueing under highly concurrent workloads.

Consider, for example, a concurrent workload with the default dial timeout of 5s, and an unresponsive server endpoint. During this period, all dials are timing out.

  1. The connection pool is empty, or all connections are currently occupied by in-flight I/O.
  2. N commands are executed concurrently, all of which need to acquire new connections.
  3. All N commands attempt to add a connection to the pool, which lazily calls DialHook.
  4. The first connection attempt acquires the mutex, and times out after 5s.
  5. The second connection attempt is blocked on mutex acquisition, and after acquiring the lock, itself also times out after 5s. The total wall clock time that the second command was blocked is now 10s.
  6. ...
  7. This cascades: in this scenario, individual commands can take multiple minutes of wall clock time because of lock contention.

DialHook itself does not mutate any state in the hooksMixin. I believe the mutex is redundant, and can be eliminated. The original commit that introduced the lock attempts to fix a race condition encountered in AddHook. The mutex that guards chain should be sufficient for this purpose.

We have validated that this change fixes the unbounded queueing and prevents the system from entering a prolonged state of serving no useful throughput during these periods.

I have also added a unit test to capture the regression. Without this patch, the unit test fails as expected; Ping takes successively longer on each invocation (1s, 2s, 3s, etc.).

Contributor Author

LINKIWI commented Aug 24, 2024

@monkey92t Are you the right person to review this? Thanks.

@monkey92t
Collaborator

> @monkey92t Are you the right person to review this? Thanks.

Sorry, I don't have the relevant permission. ping @ofekshenawa ?

@ofekshenawa ofekshenawa merged commit 080e051 into redis:master Nov 20, 2024
10 checks passed
ndyakov pushed a commit that referenced this pull request Feb 17, 2025
…tention (#3088)

* Eliminate redundant dial mutex causing unbounded connection queue contention

* Dialer connection timeouts unit test

---------

Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
@ndyakov ndyakov mentioned this pull request Feb 21, 2025
ndyakov added a commit that referenced this pull request Feb 21, 2025
* Add guidance on unstable RESP3 support for RediSearch commands to README (#3177)

* Add UnstableResp3 to docs

* Add RawVal and RawResult to wordlist

* Explain more about SetVal

* Add UnstableResp to wordlist

* Eliminate redundant dial mutex causing unbounded connection queue contention (#3088)

* Eliminate redundant dial mutex causing unbounded connection queue contention

* Dialer connection timeouts unit test

---------

Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>

* SortByWithCount FTSearchOptions fix (#3201)

* SortByWithCount FTSearchOptions fix

* FTSearch test fix

* Another FTSearch test fix

* Another FTSearch test fix

---------

Co-authored-by: Christopher Golling <Chris.Golling@aexp.com>

* Fix race condition in clusterNodes.Addrs() (#3219)

Resolve a race condition in the clusterNodes.Addrs() method.
Previously, the method returned a reference to a string slice, creating
the potential for concurrent reads by the caller while the slice was
being modified by the garbage collection process.

Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>

* chore: fix some comments (#3226)

Signed-off-by: zhuhaicity <zhuhai@52it.net>
Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>

* fix(aggregate, search): ft.aggregate bugfixes (#3263)

* fix: rearange args for ft.aggregate

apply should be before any groupby or sortby

* improve test

* wip: add scorer and addscores

* enable all tests

* fix ftsearch with count test

* make linter happy

* Addscores is available in later redisearch releases.

For safety state it is available in redis ce 8

* load an apply seem to break scorer and addscores

* fix: add unstableresp3 to cluster client (#3266)

* fix: add unstableresp3 to cluster client

* propagate unstableresp3

* proper test that will ignore error, but fail if client panics

* add separate test for clusterclient constructor

* fix: flaky ClientKillByFilter test (#3268)

* Reinstate read-only lock on hooks access in dialHook (#3225)

* use limit when limitoffset is zero (#3275)

* remove redis 8 comments

* update package versions

* use latest golangci-lint

* fix(search&aggregate):fix error overwrite and typo  #3220 (#3224)

* fix (#3220)

* LOAD has NO AS param(https://redis.io/docs/latest/commands/ft.aggregate/)

* fix typo: WITHCOUT -> WITHCOUNT

* fix (#3220):

    * Compatible with known RediSearch issue in test

* fix (#3220)

    * fixed the calculation bug of the count of load params

* test should not include special condition

* return errors when they occur

---------

Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>

* Recognize byte slice for key argument in cluster client hash slot computation (#3049)

Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>

---------

Signed-off-by: zhuhaicity <zhuhai@52it.net>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
Co-authored-by: LINKIWI <LINKIWI@users.noreply.github.com>
Co-authored-by: Cgol9 <chris.golling@verizon.net>
Co-authored-by: Christopher Golling <Chris.Golling@aexp.com>
Co-authored-by: Shawn Wang <62313353+shawnwgit@users.noreply.github.com>
Co-authored-by: ZhuHaiCheng <zhuhai@52it.net>
Co-authored-by: herodot <54836727+bitsark@users.noreply.github.com>
Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>