Redis cache: Failed to connect to all nodes of the cluster #37041
Comments
/cc @cescoffier (redis), @gsmet (redis), @gwenneg (cache), @machi1990 (redis) |
\CC @Ladicek |
I would like to help contribute to this issue. I'm able to reproduce the issue on my system, but I'm unsure where to start. Any tips or code pointers would help, as I'm familiar with Java but new to Quarkus and Vert.x. |
We got a report for something similar when there is a DNS issue. It looks like the connections are not released after an error. I believe the issue is not in Quarkus but in the Vert.x Redis client. @Ladicek should know more, as he recently looked at this code.
I think that investigating what happens when a failure occurs in the Vert.x Redis client code would be a great first step. There should be exceptionHandlers, and I suspect that they are not releasing the connection.
The Vert.x Redis client code is in https://github.com/vert-x3/vertx-redis-client/tree/4.4. Select the 4.4 branch - it's the one used in Quarkus (a forward port should be possible once we find the issue). Build it using
<dependency>
  <groupId>io.vertx</groupId>
  <artifactId>vertx-redis-client</artifactId>
  <version>4.4.6-SNAPSHOT</version> <!-- verify it's what you built -->
</dependency> |
We are having the same issue, getting the |
I agree the problem likely exists in the Vert.x Redis client, but I guess it's more like an exception handler is missing. But I didn't have time to take a proper look yet. If someone beats me to it, that's cool :-) |
I am unsure if DNS is involved because we are able to reproduce the issue locally with Docker compose. |
Debugging the reproducer, what I've found is that we seem to have an exception being swallowed (https://github.com/vert-x3/vertx-redis-client/blob/fdd8de224f22a74553774cca1e1e5fcfca24bc85/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L170)... so the error that happens in the reproducer during the JMeter test is an exception of And for that specific error, just setting the configs:
Would solve the max simultaneous requests that the service can handle... |
If you also run https://github.com/rakyll/hey with the following parameters:
You will see the same thing happening... so tweaking the concurrency |
We should be able to collect all the failures and then fail the promise with an exception that has all the collected failures as suppressed exceptions I think? |
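To make the suggestion above concrete, here is a minimal sketch in plain Java of collecting per-node failures and attaching them as suppressed exceptions. The class and method names are hypothetical and this is not code from the Vert.x Redis client; it only illustrates the `Throwable.addSuppressed` mechanism.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper, not part of the Vert.x Redis client.
final class SuppressedFailures {

    // Builds one exception that carries every collected failure as a suppressed exception.
    static RuntimeException combine(String message, List<Throwable> failures) {
        RuntimeException combined = new RuntimeException(message);
        for (Throwable failure : failures) {
            combined.addSuppressed(failure);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Made-up failures standing in for per-node connection errors.
        List<Throwable> failures = List.of(
                new IOException("node-1: connection refused"),
                new IllegalStateException("node-2: pool exhausted"));

        RuntimeException e = combine("Failed to connect to all nodes of the cluster", failures);

        // Each underlying failure stays reachable from the single top-level exception.
        for (Throwable suppressed : e.getSuppressed()) {
            System.out.println(suppressed);
        }
    }
}
```

The promise would then be failed with the combined exception, so the caller sees one error but can still inspect every underlying cause.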
Can you also confirm that the connection pool doesn't recover after it's exhausted? |
In my tests the connection pool always recovered. One thing I noticed is that the JMeter test is kind of unbounded: on my machine it takes forever to recover afterwards, since I get "too many open files" exceptions everywhere. So after I kill the load test it takes a while for my computer to come back, hahaha, but once it is back everything works again... |
Based on the analysis, it seems to be a problem in the Vert.x Redis client. @Ladicek can you open an issue there? |
Based on the analysis, this does not seem to be a problem in the Vert.x Redis client -- barring the unhelpful exception, which really should carry the underlying exceptions (as suppressed exceptions IMHO).
The original reproducer has 10_000 concurrent users, so it either needs a connection pool of size 10_000 (most likely a bad idea), or a connection pool queue that can hold 10_000 queued requests. If I add this line to the configuration of the reproducer, everything works smoothly:
quarkus.redis.max-pool-waiting=10000
I will submit a PR to the Vert.x Redis client to expose the underlying exceptions properly, but that's all I can think of. Well, maybe we should document the queuing mechanism: the Vert.x Redis client documentation mentions it briefly (https://vertx.io/docs/vertx-redis-client/java/#_connection_pooling), but it seems the Quarkus Redis client documentation doesn't mention it at all. |
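For readers wondering why the wait queue size matters, here is a toy model in plain Java of a pool with a bounded wait queue. It is a sketch only: the class is hypothetical, it is not how the Vert.x pool is implemented, it assumes the worst case where no connection is released while the burst arrives, and the pool size 6 and queue size 24 are illustrative assumptions.

```java
// Toy model of a connection pool with a bounded wait queue (hypothetical, not Vert.x code).
final class BoundedPoolModel {
    private final int maxPoolSize;
    private final int maxPoolWaiting;
    private int active;
    private int waiting;

    BoundedPoolModel(int maxPoolSize, int maxPoolWaiting) {
        this.maxPoolSize = maxPoolSize;
        this.maxPoolWaiting = maxPoolWaiting;
    }

    // Returns false when both the pool and the wait queue are full.
    synchronized boolean tryAcquire() {
        if (active < maxPoolSize) {
            active++;          // a connection is available right away
            return true;
        }
        if (waiting < maxPoolWaiting) {
            waiting++;         // the request is queued until a connection frees up
            return true;
        }
        return false;          // nowhere to put the request: it fails
    }

    // Counts rejected requests for a burst where nothing is released in the meantime.
    static int rejected(int maxPoolSize, int maxPoolWaiting, int requests) {
        BoundedPoolModel pool = new BoundedPoolModel(maxPoolSize, maxPoolWaiting);
        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            if (!pool.tryAcquire()) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(rejected(6, 24, 10_000));     // prints 9970
        System.out.println(rejected(6, 10_000, 10_000)); // prints 0
    }
}
```

In this model a burst of 10_000 requests only fits once the queue itself can hold roughly that many, which matches the effect of raising quarkus.redis.max-pool-waiting in the reproducer.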
OK, suppressed exceptions don't work all that well, as they tend to produce huge stack traces. So I think I'll go with just expanding the error message to include all underlying error messages. In the Quarkus log, that will look like:
That should be good enough, I guess. |
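As a rough illustration of the "expand the error message" approach, here is a sketch in plain Java that folds the underlying error messages into a single message. The class name and the line format are made up for this example and are not necessarily what the actual PR produces.

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; the output format is an assumption for illustration only.
final class FailureSummary {

    // Appends one line per underlying failure to the top-level message.
    static String describe(String prefix, List<Throwable> failures) {
        String details = failures.stream()
                .map(failure -> "- " + failure) // Throwable.toString(): class name plus message
                .collect(Collectors.joining("\n"));
        return prefix + "\n" + details;
    }

    public static void main(String[] args) {
        // Made-up failures standing in for per-node connection errors.
        List<Throwable> failures = List.of(
                new IOException("node-1: connection refused"),
                new IllegalStateException("node-2: pool exhausted"));
        System.out.println(describe("Failed to connect to all nodes of the cluster", failures));
    }
}
```

Compared with suppressed exceptions, this keeps the log to one short line per underlying failure instead of a full stack trace for each.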
PRs to Vert.x:
PR to Quarkus: I can't see anything else we could do here. |
Thanks for the help! @cescoffier @Ladicek @luneo7 |
I think there is an issue with reconnects here; let me know if I should file a new issue. Manual reproduction that consistently reproduces the failure to reconnect:
My theory is that when all endpoints of the cluster are down, the slots / endpoints being saved are incorrect and getSlots is always called with index >= endpoints.size. |
Yes, the Vert.x Redis client intentionally doesn't implement reconnect on error, see https://vertx.io/docs/vertx-redis-client/java/#_implementing_reconnect_on_error
We should probably implement something like that in Quarkus. Please file a feature request. |
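For anyone picking up that feature request, here is a generic sketch in plain Java of reconnecting with exponential backoff. It deliberately avoids the Vert.x API: the `connect` callback, the method names, and the backoff formula (10 ms doubling per retry, capped at 2^10) are all assumptions for illustration.

```java
import java.util.function.BooleanSupplier;

// Generic retry sketch with hypothetical names; not Vert.x or Quarkus code.
final class ReconnectSketch {

    // Exponential backoff: 10 ms, 20 ms, 40 ms, ... capped at 10 * 2^10 ms.
    static long backoffMillis(int retry) {
        return (long) (Math.pow(2, Math.min(retry, 10)) * 10);
    }

    // Calls connect until it succeeds or maxRetries is exceeded.
    // Returns the number of failed attempts before success, or -1 if it gave up.
    static int connectWithRetry(BooleanSupplier connect, int maxRetries) {
        for (int retry = 0; retry <= maxRetries; retry++) {
            if (connect.getAsBoolean()) {
                return retry;
            }
            if (retry < maxRetries) {
                try {
                    Thread.sleep(backoffMillis(retry));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated connection that only succeeds on the third attempt.
        int failedAttempts = connectWithRetry(() -> ++attempts[0] >= 3, 5);
        System.out.println("failed attempts before success: " + failedAttempts);
    }
}
```

The blocking sleep keeps the sketch self-contained; in an event-loop setting the delay would have to be scheduled with a timer rather than by blocking the thread.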
Describe the bug
The Redis cache implementation fails to work in cluster mode. Under load, it throws the error "Failed to connect to all nodes of the cluster". This error originates in the Vert.x Redis client here: https://github.com/vert-x3/vertx-redis-client/blob/4.4.6/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L202
Expected behavior
Cluster mode works.
Actual behavior
Cluster mode throws an exception without stacktrace.
Setting quarkus.redis.max-pool-size to a higher value seems to postpone the error, but it will still eventually fail.
How to Reproduce?
docker-compose up
./mvnw quarkus:dev
After about 30 seconds of load testing, you should see
Output of uname -a or ver
No response
Output of java -version
openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment Temurin-17.0.7+7 (build 17.0.7+7)
OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)
Quarkus version or git rev
3.5.1
Build tool (ie. output of mvnw --version or gradlew --version)
Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)
Additional information
No response