Eliminate redundant dial mutex causing unbounded connection queue contention #3088
Merged
@monkey92t Are you the right person to review this? Thanks.
Sorry, I don't have the relevant permission. Ping @ofekshenawa?
iphpweb approved these changes Nov 5, 2024
ofekshenawa approved these changes Nov 13, 2024
ndyakov pushed a commit that referenced this pull request Feb 17, 2025
…tention (#3088)
* Eliminate redundant dial mutex causing unbounded connection queue contention
* Dialer connection timeouts unit test
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
Merged
ndyakov added a commit that referenced this pull request Feb 21, 2025
* Add guidance on unstable RESP3 support for RediSearch commands to README (#3177) * Add UnstableResp3 to docs * Add RawVal and RawResult to wordlist * Explain more about SetVal * Add UnstableResp to wordlist * Eliminate redundant dial mutex causing unbounded connection queue contention (#3088) * Eliminate redundant dial mutex causing unbounded connection queue contention * Dialer connection timeouts unit test --------- Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com> * SortByWithCount FTSearchOptions fix (#3201) * SortByWithCount FTSearchOptions fix * FTSearch test fix * Another FTSearch test fix * Another FTSearch test fix --------- Co-authored-by: Christopher Golling <Chris.Golling@aexp.com> * Fix race condition in clusterNodes.Addrs() (#3219) Resolve a race condition in the clusterNodes.Addrs() method. Previously, the method returned a reference to a string slice, creating the potential for concurrent reads by the caller while the slice was being modified by the garbage collection process. Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com> * chore: fix some comments (#3226) Signed-off-by: zhuhaicity <zhuhai@52it.net> Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com> * fix(aggregate, search): ft.aggregate bugfixes (#3263) * fix: rearange args for ft.aggregate apply should be before any groupby or sortby * improve test * wip: add scorer and addscores * enable all tests * fix ftsearch with count test * make linter happy * Addscores is available in later redisearch releases. 
For safety state it is available in redis ce 8 * load an apply seem to break scorer and addscores * fix: add unstableresp3 to cluster client (#3266) * fix: add unstableresp3 to cluster client * propagate unstableresp3 * proper test that will ignore error, but fail if client panics * add separate test for clusterclient constructor * fix: flaky ClientKillByFilter test (#3268) * Reinstate read-only lock on hooks access in dialHook (#3225) * use limit when limitoffset is zero (#3275) * remove redis 8 comments * update package versions * use latest golangci-lint * fix(search&aggregate):fix error overwrite and typo #3220 (#3224) * fix (#3220) * LOAD has NO AS param(https://redis.io/docs/latest/commands/ft.aggregate/) * fix typo: WITHCOUT -> WITHCOUNT * fix (#3220): * Compatible with known RediSearch issue in test * fix (#3220) * fixed the calculation bug of the count of load params * test should not include special condition * return errors when they occur --------- Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com> Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com> * Recognize byte slice for key argument in cluster client hash slot computation (#3049) Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com> Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com> --------- Signed-off-by: zhuhaicity <zhuhai@52it.net> Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com> Co-authored-by: LINKIWI <LINKIWI@users.noreply.github.com> Co-authored-by: Cgol9 <chris.golling@verizon.net> Co-authored-by: Christopher Golling <Chris.Golling@aexp.com> Co-authored-by: Shawn Wang <62313353+shawnwgit@users.noreply.github.com> Co-authored-by: ZhuHaiCheng <zhuhai@52it.net> Co-authored-by: herodot <54836727+bitsark@users.noreply.github.com> Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>
The immediate symptom this PR attempts to address: during periods of transient server connectivity errors, go-redis commands take 60s or more to time out, even though the socket read/write timeouts are 3s and the context timeout on the commands is 1s. We have root-caused this bug to lock contention in the connection pool's lazy dialer.
There is a mutual exclusion lock in `DialHook`, which allows only one server dial to occur at a time. In the event of server connectivity errors, this causes unbounded connection queueing under highly concurrent workloads.

Consider, for example, a concurrent workload with the default dial timeout of 5s and an unresponsive server endpoint. During this period, all dials are timing out. Because only one dial can proceed at a time, each pending dial must wait for every dial ahead of it to time out in `DialHook`.

`DialHook` itself does not mutate any state in the `hooksMixin`. I believe the mutex is redundant and can be eliminated. The original commit that introduced the lock attempts to fix a race condition encountered in `AddHook`. The mutex that guards `chain` should be sufficient for this purpose.

We have validated that this change fixes the unbounded queueing and prevents the system from entering a prolonged state of serving no useful throughput during these periods.
I have also added a unit test to capture the regression. Without this patch, the unit test correctly fails; `Ping` takes successively longer on each invocation (1s, 2s, 3s, etc.).