
Redis cache: Failed to connect to all nodes of the cluster #37041

Closed
bartm-dvb opened this issue Nov 13, 2023 · 19 comments · Fixed by #37267

@bartm-dvb
Contributor

Describe the bug

The Redis cache implementation fails to work in cluster mode. Under load, it throws an error: Failed to connect to all nodes of the cluster. The error originates in the Vert.x Redis client here: https://github.com/vert-x3/vertx-redis-client/blob/4.4.6/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L202

Expected behavior

Cluster mode works.

Actual behavior

Cluster mode throws an exception without a stack trace.

ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-2) HTTP Request to /cat-fact failed, error id: 1cd9b164-ba80-4118-b675-ff5cfdfd93eb-869: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster

Setting quarkus.redis.max-pool-size to a higher value seems to postpone the error, but it will still eventually fail.
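
For reference, a minimal application.properties sketch of that workaround (the value is illustrative only; it postpones the error rather than fixing it):

# illustrative value, not a recommendation
quarkus.redis.max-pool-size=64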

How to Reproduce?

  1. Clone the repository: https://github.com/bartm-dvb/quarkus-redis-bug
  2. Start Redis in cluster mode with docker-compose up
  3. Start the application in dev mode: ./mvnw quarkus:dev
  4. Start the load test with JMeter; the configuration file is in the repository.

After about 30 seconds of load testing, you should see

ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-2) HTTP Request to /cat-fact failed, error id: 1cd9b164-ba80-4118-b675-ff5cfdfd93eb-869: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster

Output of uname -a or ver

No response

Output of java -version

openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment Temurin-17.0.7+7 (build 17.0.7+7)
OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)

Quarkus version or git rev

3.5.1

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)

Additional information

No response

@bartm-dvb added the kind/bug (Something isn't working) label on Nov 13, 2023

quarkus-bot bot commented Nov 13, 2023

/cc @cescoffier (redis), @gsmet (redis), @gwenneg (cache), @machi1990 (redis)

@cescoffier
Member

/cc @Ladicek

@sfali16

sfali16 commented Nov 16, 2023

I would like to help contribute to this issue. I'm able to reproduce it on my system, but I'm unsure where to start.
From what I gathered, the load test causes the connections to fail, and once the connection objects are "corrupted" (have encountered exceptions), they never recover.
I see a similar pattern if I shut down the Redis cluster and restart it: the exceptions continue.

Any tips or code pointers would help as I'm familiar with java but new to quarkus and vertx.

@cescoffier ^

@cescoffier
Member

We got a report for something similar when there is a DNS issue. It looks like the connections are not released after an error. I believe the issue is not in Quarkus but in the Vert.x Redis client. @Ladicek should know more, as he recently looked at this code.

I think that investigating what happens when a failure occurs in the Vert.x Redis client code would be a great first step. There should be exceptionHandlers, and I suspect they are not releasing the connection.

The Vert.x Redis client code is in https://github.com/vert-x3/vertx-redis-client/tree/4.4. Select the 4.4 branch - it's the one used in Quarkus (a forward port should be possible once we find the issue). Build it using mvn clean install -DskipTests. Then override the version in your project by adding the dependency:

<dependency>
  <groupId>io.vertx</groupId>
  <artifactId>vertx-redis-client</artifactId>
  <version>4.4.6-SNAPSHOT</version> <!-- verify it's what you built -->
</dependency>

@luneo7
Contributor

luneo7 commented Nov 17, 2023

We are having the same issue, getting the io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster error log. In our case we have a custom DNS name (CNAME) that points to an AWS configuration endpoint, and we are running in cluster mode there as well. I don't know if the DNS round trip is involved: we use our custom DNS name so we can easily construct the Redis endpoint, and Terraform takes care of creating the CNAME to the AWS configuration endpoint, so the Vert.x Redis client needs to resolve the CNAME first and then the configuration endpoint. One interesting thing is that, looking at the cluster metrics, the connection count never dropped to zero, and after we restarted the service we saw the connection count jump, so it also points to a load increase: as if, when there is a sudden spike in client utilization, the client fails to get new connections while the existing connections remain open.

@Ladicek
Contributor

Ladicek commented Nov 20, 2023

I agree the problem likely exists in the Vert.x Redis client, but I guess it's more like an exception handler is missing. But I didn't have time to take a proper look yet. If someone beats me to it, that's cool :-)

@stefvanderweldevolksbanknl
Contributor

We are having the same issue, getting the io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster error log. In our case we have a custom DNS name (CNAME) that points to an AWS configuration endpoint, and we are running in cluster mode there as well. I don't know if the DNS round trip is involved: we use our custom DNS name so we can easily construct the Redis endpoint, and Terraform takes care of creating the CNAME to the AWS configuration endpoint, so the Vert.x Redis client needs to resolve the CNAME first and then the configuration endpoint. One interesting thing is that, looking at the cluster metrics, the connection count never dropped to zero, and after we restarted the service we saw the connection count jump, so it also points to a load increase: as if, when there is a sudden spike in client utilization, the client fails to get new connections while the existing connections remain open.

I am unsure if DNS is involved because we are able to reproduce the issue locally with Docker compose.

@luneo7
Contributor

luneo7 commented Nov 20, 2023

Debugging the reproducer, what I've found is that an exception is being swallowed (https://github.com/vert-x3/vertx-redis-client/blob/fdd8de224f22a74553774cca1e1e5fcfca24bc85/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L170). The error that actually happens in the reproducer during the JMeter test is a Connection pool reached max wait queue size of 24 exception, but the onFailure handler there doesn't log it; it just sets failed to true, calls connectionComplete, and ends up returning onConnected.handle(Future.failedFuture("Failed to connect to all nodes of the cluster")). So the cause is that there aren't enough pool resources to handle the load, but we never see the actual exception, just the static Failed to connect to all nodes of the cluster error.

(screenshot of the debugging session attached in the original issue, dated 2023-11-20)

And for that specific error, tuning these configs:

quarkus.redis.max-pool-size: number of connections
quarkus.redis.max-pool-waiting: maximum waiting queue size

would raise the number of simultaneous requests the service can handle.
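
A minimal application.properties sketch of those two settings (the values are illustrative only and would need to be sized to the expected concurrency):

# illustrative values only
quarkus.redis.max-pool-size=32
quarkus.redis.max-pool-waiting=1000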

@luneo7
Contributor

luneo7 commented Nov 20, 2023

If you also run https://github.com/rakyll/hey with the following parameters:

hey -c 100 -z 30s -m GET http://localhost:8080/cat-fact

You will see the same thing happening. By tweaking the concurrency (-c 100) you can see when it starts to happen, so it looks like a pool exhaustion issue.

@Ladicek
Contributor

Ladicek commented Nov 21, 2023

We should be able to collect all the failures and then fail the promise with an exception that has all the collected failures as suppressed exceptions, I think?
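
As a rough sketch of that idea (this is not the actual Vert.x client code; the class and method names are made up for illustration):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class ClusterConnectFailures {
    // thread-safe, because node connections complete on different event-loop threads
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();

    void onNodeConnectFailed(Throwable cause) {
        failures.add(cause);
    }

    // one failure that still carries every per-node cause as a suppressed exception
    Exception aggregate() {
        Exception result = new Exception("Failed to connect to all nodes of the cluster");
        failures.forEach(result::addSuppressed);
        return result;
    }
}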

@bartm-dvb
Contributor Author

If you also run https://github.com/rakyll/hey with the following parameters:

hey -c 100 -z 30s -m GET http://localhost:8080/cat-fact

You will see the same thing happening. By tweaking the concurrency (-c 100) you can see when it starts to happen, so it looks like a pool exhaustion issue.

Can you also confirm that the connection pool doesn't recover after it's exhausted?

@luneo7
Contributor

luneo7 commented Nov 21, 2023

In my tests the connection pool always recovered. One thing I noted is that the JMeter test is kind of unbounded; on my machine it takes forever to recover afterwards since I get a "too many open files" exception everywhere, so after I kill the load test it takes a while for my computer to come back, but once it is back everything works again.

@cescoffier
Member

Based on the analysis, it seems to be a problem in the Vert.x Redis client. @Ladicek can you open an issue there?

@Ladicek
Contributor

Ladicek commented Nov 22, 2023

Based on the analysis, this does not seem to be a problem in the Vert.x Redis client -- barring the unhelpful exception, which really should carry the underlying exceptions (as suppressed exceptions, IMHO).

The original reproducer has 10_000 concurrent users, so it either needs a connection pool of size 10_000 (most likely a bad idea), or a connection pool queue that can hold 10_000 queued requests. If I add this line to the configuration of the reproducer, everything works smoothly:

quarkus.redis.max-pool-waiting=10000

I will submit a PR to the Vert.x Redis client to expose the underlying exceptions properly, but that's all I can think of.

Well, maybe we should document the queuing mechanism; the Vert.x Redis client documentation mentions it briefly (https://vertx.io/docs/vertx-redis-client/java/#_connection_pooling), but it seems the Quarkus Redis client documentation doesn't mention it at all.

@Ladicek
Contributor

Ladicek commented Nov 22, 2023

OK, suppressed exceptions don't work all that well, as they tend to produce huge stack traces. So I think I'll go with just expanding the error message to include all underlying error messages. In the Quarkus log, that will look like:

2023-11-22 12:52:54,676 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 4d97804c-4cd0-42fc-b175-42eec8800bbd-5308: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster
- io.vertx.core.http.ConnectionPoolTooBusyException: Connection pool reached max wait queue size of 24

That should be good enough, I guess.

@Ladicek
Contributor

Ladicek commented Nov 22, 2023

PRs to Vert.x:

PR to Quarkus:

I can't see anything else we could do here.

quarkus-bot added this to the 3.7 - main milestone on Nov 22, 2023
@bartm-dvb
Contributor Author

Thanks for the help! @cescoffier @Ladicek @luneo7

@sfali16

sfali16 commented Nov 25, 2023

I think there is an issue with reconnects here; let me know if I should file a new issue:

Manual reproduction that consistently reproduces the failure to reconnect:

  1. Start Quarkus. Start Redis (docker compose up using the docker-compose file provided in this issue). Access http://localhost:8080/cat-fact.
  2. Stop Redis (docker compose down).
  3. Access /cat-fact again and see the following error in the Quarkus log:
     2023-11-17 22:27:36,551 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 736e1d37-7d84-43fb-8e37-2d4d34f4eda6-11: io.vertx.core.impl.NoStackTraceThrowable: Cannot connect to any of the provided endpoints
  4. Start Redis again and access /cat-fact: the above error repeats over and over until Quarkus is restarted. Note that if steps 2 and 4 are done without step 3 (the endpoint is never hit while Redis is down), the error never occurs.

My theory is that when all endpoints of the cluster are down, the slots / endpoints being saved are incorrect and getSlots is always called with index >= endpoints.size.

@Ladicek
Contributor

Ladicek commented Nov 27, 2023

Yes, the Vert.x Redis client intentionally doesn't implement reconnect on error; see https://vertx.io/docs/vertx-redis-client/java/#_implementing_reconnect_on_error. We should probably implement something like that in Quarkus. Please file a feature request.
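
A minimal sketch of such a reconnect, loosely following the pattern from the linked Vert.x documentation (the endpoint, back-off delay, and class name are placeholders, not an actual Quarkus API):

import io.vertx.core.Future;
import io.vertx.core.Vertx;
import io.vertx.redis.client.Redis;
import io.vertx.redis.client.RedisConnection;

class ReconnectingRedis {
    private final Vertx vertx;
    private volatile RedisConnection connection;

    ReconnectingRedis(Vertx vertx) {
        this.vertx = vertx;
    }

    // connect, and schedule a fresh attempt whenever the current connection fails
    Future<RedisConnection> connect() {
        return Redis.createClient(vertx, "redis://localhost:7000") // placeholder endpoint
                .connect()
                .onSuccess(conn -> {
                    connection = conn;
                    conn.exceptionHandler(err -> vertx.setTimer(2_000, id -> connect()));
                });
    }
}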
