
Redis cache: Failed to connect to all nodes of the cluster #37041

Closed
bartm-dvb opened this issue Nov 13, 2023 · 19 comments · Fixed by #37267

@bartm-dvb
Contributor

Describe the bug

The Redis cache implementation fails to work in cluster mode. Under load, it throws an error: Failed to connect to all nodes of the cluster. The error originates in the Vert.x Redis client here: https://github.com/vert-x3/vertx-redis-client/blob/4.4.6/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L202

Expected behavior

Cluster mode works.

Actual behavior

Cluster mode throws an exception without a stack trace.

ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-2) HTTP Request to /cat-fact failed, error id: 1cd9b164-ba80-4118-b675-ff5cfdfd93eb-869: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster

Setting quarkus.redis.max-pool-size to a higher value seems to postpone the error, but it will still eventually fail.
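
For reference, a minimal application.properties sketch of that workaround (the value is illustrative only; it postpones the error rather than fixing it):

# illustrative value, not a recommendation
quarkus.redis.max-pool-size=64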

How to Reproduce?

  1. Clone the repository: https://github.com/bartm-dvb/quarkus-redis-bug
  2. Start Redis in cluster mode with docker-compose up
  3. Start the application in dev mode: ./mvnw quarkus:dev
  4. Start the load test with JMeter; the configuration file is in the repository.

After about 30 seconds of load testing, you should see

ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-2) HTTP Request to /cat-fact failed, error id: 1cd9b164-ba80-4118-b675-ff5cfdfd93eb-869: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster

Output of uname -a or ver

No response

Output of java -version

openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment Temurin-17.0.7+7 (build 17.0.7+7)
OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)

Quarkus version or git rev

3.5.1

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)

Additional information

No response

@bartm-dvb added the kind/bug (Something isn't working) label on Nov 13, 2023

quarkus-bot bot commented Nov 13, 2023

/cc @cescoffier (redis), @gsmet (redis), @gwenneg (cache), @machi1990 (redis)

@cescoffier
Member

/cc @Ladicek

@sfali16

sfali16 commented Nov 16, 2023

I would like to help contribute to this issue. I'm able to reproduce it on my system, but I'm unsure where to start.
From what I gathered, the load test causes the connections to fail, and once the connection objects are "corrupted" (have encountered exceptions), they never recover.
I see a similar pattern if I shut down the Redis cluster and restart it: the exceptions continue.

Any tips or code pointers would help as I'm familiar with java but new to quarkus and vertx.

@cescoffier ^

@cescoffier
Member

We got a report for something similar when there is a DNS issue. It looks like the connections are not released after an error. I believe the issue is not in Quarkus but in the Vert.x Redis client. @Ladicek should know more, as he recently looked at this code.

I think that investigating what happens when a failure occurs in the Vert.x Redis client code would be a great first step. There should be exceptionHandlers, and I suspect they are not releasing the connection.

The Vert.x Redis client code is in https://github.com/vert-x3/vertx-redis-client/tree/4.4. Select the 4.4 branch - it's the one used in Quarkus (a forward port should be possible once we find the issue). Build it using mvn clean install -DskipTests. Then override the version in your project by adding the dependency:

<dependency>
  <groupId>io.vertx</groupId>
  <artifactId>vertx-redis-client</artifactId>
  <version>4.4.6-SNAPSHOT</version> <!-- verify it's what you built -->
</dependency>

@luneo7
Contributor

luneo7 commented Nov 17, 2023

We are having the same issue, getting the io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster error log. In our case we have a custom DNS name (CNAME) that points to an AWS configuration endpoint, and we are running in cluster mode there as well. I don't know if the DNS round trip is involved: we use our custom DNS name so we can easily construct the Redis endpoint, and Terraform takes care of creating the CNAME to the AWS configuration endpoint, so the Vert.x Redis client needs to resolve the CNAME first and then the configuration endpoint. One interesting thing is that, looking at the cluster metrics, the connection count never dropped to zero, and after we restarted the service we saw the connection count jump, so it also points to a load increase: as if, when there is a sudden spike in client utilization, the client fails to get new connections while the existing connections remain open.

@Ladicek
Contributor

Ladicek commented Nov 20, 2023

I agree the problem likely exists in the Vert.x Redis client, but I guess it's more like an exception handler is missing. But I didn't have time to take a proper look yet. If someone beats me to it, that's cool :-)

@stefvanderweldevolksbanknl
Contributor

We are having the same issue, getting the io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster error log. In our case we have a custom DNS name (CNAME) that points to an AWS configuration endpoint, and we are running in cluster mode there as well. I don't know if the DNS round trip is involved: we use our custom DNS name so we can easily construct the Redis endpoint, and Terraform takes care of creating the CNAME to the AWS configuration endpoint, so the Vert.x Redis client needs to resolve the CNAME first and then the configuration endpoint. One interesting thing is that, looking at the cluster metrics, the connection count never dropped to zero, and after we restarted the service we saw the connection count jump, so it also points to a load increase: as if, when there is a sudden spike in client utilization, the client fails to get new connections while the existing connections remain open.

I am unsure if DNS is involved because we are able to reproduce the issue locally with Docker compose.

@luneo7
Contributor

luneo7 commented Nov 20, 2023

Debugging the reproducer, what I've found is that an exception is being swallowed (https://github.com/vert-x3/vertx-redis-client/blob/fdd8de224f22a74553774cca1e1e5fcfca24bc85/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L170). The error that actually happens in the reproducer during the JMeter test is a Connection pool reached max wait queue size of 24 exception, but the onFailure handler there doesn't log it; it just sets failed to true, calls connectionComplete, and ends up returning onConnected.handle(Future.failedFuture("Failed to connect to all nodes of the cluster")). So the cause is that there aren't enough pool resources to handle the load, but we never see the actual exception, just the static Failed to connect to all nodes of the cluster error.

(screenshot of the debugging session attached in the original issue, dated 2023-11-20)

And for that specific error, tuning these configs:

quarkus.redis.max-pool-size: number of connections
quarkus.redis.max-pool-waiting: maximum waiting queue size

would raise the number of simultaneous requests the service can handle.
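
A minimal application.properties sketch of those two settings (the values are illustrative only and would need to be sized to the expected concurrency):

# illustrative values only
quarkus.redis.max-pool-size=32
quarkus.redis.max-pool-waiting=1000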

@luneo7
Contributor

luneo7 commented Nov 20, 2023

If you also run https://github.com/rakyll/hey with the following parameters:

hey -c 100 -z 30s -m GET http://localhost:8080/cat-fact

You will see the same thing happening. By tweaking the concurrency (-c 100) you can see when it starts to happen, so it looks like a pool exhaustion issue.

@Ladicek
Contributor

Ladicek commented Nov 21, 2023

We should be able to collect all the failures and then fail the promise with an exception that has all the collected failures as suppressed exceptions, I think?
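
As a rough sketch of that idea (this is not the actual Vert.x client code; the class and method names are made up for illustration):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class ClusterConnectFailures {
    // thread-safe, because node connections complete on different event-loop threads
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();

    void onNodeConnectFailed(Throwable cause) {
        failures.add(cause);
    }

    // one failure that still carries every per-node cause as a suppressed exception
    Exception aggregate() {
        Exception result = new Exception("Failed to connect to all nodes of the cluster");
        failures.forEach(result::addSuppressed);
        return result;
    }
}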

@bartm-dvb
Contributor Author

If you also run https://github.com/rakyll/hey with the following parameters:

hey -c 100 -z 30s -m GET http://localhost:8080/cat-fact

You will see the same thing happening. By tweaking the concurrency (-c 100) you can see when it starts to happen, so it looks like a pool exhaustion issue.

Can you also confirm that the connection pool doesn't recover after it's exhausted?

@luneo7
Contributor

luneo7 commented Nov 21, 2023

In my tests the connection pool always recovered. One thing I noted is that the JMeter test is kind of unbounded; on my machine it takes forever to recover afterwards since I get a "too many open files" exception everywhere, so after I kill the load test it takes a while for my computer to come back, but once it is back everything works again.

@cescoffier
Member

Based on the analysis, it seems to be a problem in the Vert.x Redis client. @Ladicek can you open an issue there?

@Ladicek
Contributor

Ladicek commented Nov 22, 2023

Based on the analysis, this does not seem to be a problem in the Vert.x Redis client -- barring the unhelpful exception, which really should carry the underlying exceptions (as suppressed exceptions, IMHO).

The original reproducer has 10_000 concurrent users, so it either needs a connection pool of size 10_000 (most likely a bad idea), or a connection pool queue that can hold 10_000 queued requests. If I add this line to the configuration of the reproducer, everything works smoothly:

quarkus.redis.max-pool-waiting=10000

I will submit a PR to the Vert.x Redis client to expose the underlying exceptions properly, but that's all I can think of.

Well, maybe we should document the queuing mechanism; the Vert.x Redis client documentation mentions it briefly (https://vertx.io/docs/vertx-redis-client/java/#_connection_pooling), but it seems the Quarkus Redis client documentation doesn't mention it at all.

@Ladicek
Contributor

Ladicek commented Nov 22, 2023

OK, suppressed exceptions don't work all that well, as they tend to produce huge stack traces. So I think I'll go with just expanding the error message to include all underlying error messages. In the Quarkus log, that will look like:

2023-11-22 12:52:54,676 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 4d97804c-4cd0-42fc-b175-42eec8800bbd-5308: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster
- io.vertx.core.http.ConnectionPoolTooBusyException: Connection pool reached max wait queue size of 24

That should be good enough, I guess.

@Ladicek
Contributor

Ladicek commented Nov 22, 2023

PRs to Vert.x:

PR to Quarkus:

I can't see anything else we could do here.

quarkus-bot added this to the 3.7 - main milestone on Nov 22, 2023
@bartm-dvb
Contributor Author

Thanks for the help! @cescoffier @Ladicek @luneo7

@sfali16

sfali16 commented Nov 25, 2023

I think there is an issue with reconnects here; let me know if I should file a new issue:

Manual reproduction that consistently reproduces the failure to reconnect:

  1. Start Quarkus. Start Redis (docker compose up using the docker-compose file provided in this issue). Access http://localhost:8080/cat-fact.
  2. Stop Redis (docker compose down).
  3. Access /cat-fact again and see the following error in the Quarkus log:
     2023-11-17 22:27:36,551 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 736e1d37-7d84-43fb-8e37-2d4d34f4eda6-11: io.vertx.core.impl.NoStackTraceThrowable: Cannot connect to any of the provided endpoints
  4. Start Redis again and access /cat-fact: the above error repeats over and over until Quarkus is restarted. Note that if steps 2 and 4 are done without step 3 (the endpoint is never hit while Redis is down), the error never occurs.

My theory is that when all endpoints of the cluster are down, the slots / endpoints being saved are incorrect and getSlots is always called with index >= endpoints.size.

@Ladicek
Contributor

Ladicek commented Nov 27, 2023

Yes, the Vert.x Redis client intentionally doesn't implement reconnect on error; see https://vertx.io/docs/vertx-redis-client/java/#_implementing_reconnect_on_error. We should probably implement something like that in Quarkus. Please file a feature request.
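
A minimal sketch of such a reconnect, loosely following the pattern from the linked Vert.x documentation (the endpoint, back-off delay, and class name are placeholders, not an actual Quarkus API):

import io.vertx.core.Future;
import io.vertx.core.Vertx;
import io.vertx.redis.client.Redis;
import io.vertx.redis.client.RedisConnection;

class ReconnectingRedis {
    private final Vertx vertx;
    private volatile RedisConnection connection;

    ReconnectingRedis(Vertx vertx) {
        this.vertx = vertx;
    }

    // connect, and schedule a fresh attempt whenever the current connection fails
    Future<RedisConnection> connect() {
        return Redis.createClient(vertx, "redis://localhost:7000") // placeholder endpoint
                .connect()
                .onSuccess(conn -> {
                    connection = conn;
                    conn.exceptionHandler(err -> vertx.setTimer(2_000, id -> connect()));
                });
    }
}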
