Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Redis] Connection not released after DNS issue #36853

Closed
lucaspouzac opened this issue Nov 3, 2023 · 17 comments
Closed

[Redis] Connection not released after DNS issue #36853

lucaspouzac opened this issue Nov 3, 2023 · 17 comments
Labels
area/redis kind/bug Something isn't working

Comments

@lucaspouzac
Copy link

Describe the bug

When a connection is closed prematurely, it is not always released. We used redis-cli with Uni for reactive programming.

At the time of the problem, we have several log traces of this type :

logger : io.vertx.redis.client.impl.RedisConnectionManager
errorType: "io.netty.channel.StacklessClosedChannelException"
message: "Unhandled Error"
stacktrace : io.netty.channel.StacklessClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
threadName: "vert.x-eventloop-thread-2"

quarkus redis configuration

  redis:
    cache:
      max-pool-size: 6
      max-pool-waiting: 1000
      reconnect-attempts: -1
      pool-cleaner-interval: 5S
      tcp:
        keep-alive: true
        connection-timeout: 100MS
        read-idle-timeout: 1S
        write-idle-timeout: 1S

redis_pool_active is full and after redis_pool_queue_size is full.
We use redis 7.2.2

image
image

Expected behavior

The redis connection must be correctly release after an error

Actual behavior

No response

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

21

Quarkus version or git rev

3.5.0

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

@lucaspouzac lucaspouzac added the kind/bug Something isn't working label Nov 3, 2023
Copy link

quarkus-bot bot commented Nov 3, 2023

/cc @cescoffier (redis), @gsmet (redis), @machi1990 (redis)

@cescoffier
Copy link
Member

@Ladicek didn't you fix the connection pool recently?

@Ladicek
Copy link
Contributor

Ladicek commented Nov 3, 2023

I fixed the Redis clustered client to actually use a connection pool instead of creating new connections all the time. This seems different.

@lucaspouzac Do you have any indication of what the error might be? Otherwise it's gonna be a blind search.

@lucaspouzac
Copy link
Author

@Ladicek 2 seconds before, there is a thread blocked on an http call. Maybe related?

errorMessage: "Thread blocked"
errorType: "io.vertx.core.VertxException"
loggerName: "io.vertx.core.impl.BlockedThreadChecker"
message: "Thread Thread[#83,vert.x-eventloop-thread-2,5,main] has been blocked for 5480 ms, time limit is 2000 ms"
stackTrace: "io.vertx.core.VertxException: Thread blocked
	at java.base@21.0.1/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
	at java.base@21.0.1/java.net.Inet4AddressImpl.lookupAllHostAddr(Inet4AddressImpl.java:43)
	at java.base@21.0.1/java.net.InetAddress$PlatformResolver.lookupByName(InetAddress.java:1211)
	at java.base@21.0.1/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1828)
	at java.base@21.0.1/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:1139)
	at java.base@21.0.1/java.net.InetAddress.getAllByName0(InetAddress.java:1818)
	at java.base@21.0.1/java.net.InetAddress.getAllByName(InetAddress.java:1688)
	at java.base@21.0.1/java.net.InetAddress.getByName(InetAddress.java:1568)

The native compilation is used with mandrel 23.1.
RedisStandalone is used (only one redis installation)

@cescoffier
Copy link
Member

The stack trace comes from a slow DNS resolution. Do you have the same in JVM mode?

@lucaspouzac
Copy link
Author

I have not tried to deploy this application in JVM mode. For other spring boot applications in JVM mode, in the same k8s cluster and same namespace, it does not seem to be a problem observed.

We have another quarkus application in native compilation which does not seem to have any problems. This has not yet been migrated to quarkus 3 (still in 2.8). No redis, only http calls

@lucaspouzac
Copy link
Author

After checking, I found some DNS resolution problems in the cluster. These are sporadic problems, because we have very few logs observed over 1 week.

These cases do not cause visible problems because a retry is performed correctly and connection is correctly released.

I suspect that in certain cases of connection errors in Redis, the connection is not correctly released on the connection pool side.
This generates stacking at the connection pool level.

@cescoffier
Copy link
Member

We may not release the connection after a DNS failure (that would obviously be a bug).

@cescoffier cescoffier self-assigned this Nov 12, 2023
@cescoffier cescoffier changed the title [Redis] Problem with release connection with reactive programming [Redis] Connection not released after DNS issue Nov 12, 2023
@cescoffier cescoffier removed their assignment Nov 22, 2023
@cescoffier
Copy link
Member

We got another issue reported where the connection are not released in case of an exception - See #37041

@Ladicek
Copy link
Contributor

Ladicek commented Nov 22, 2023

Looking again at

logger : io.vertx.redis.client.impl.RedisConnectionManager
errorType: "io.netty.channel.StacklessClosedChannelException"
message: "Unhandled Error"
stacktrace : io.netty.channel.StacklessClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
threadName: "vert.x-eventloop-thread-2"

This message seems to be logged by this handler: https://github.com/vert-x3/vertx-redis-client/blob/4.4/src/main/java/io/vertx/redis/client/impl/RedisConnectionManager.java#L46, which is installed as the connection's exception handler here: https://github.com/vert-x3/vertx-redis-client/blob/4.4/src/main/java/io/vertx/redis/client/impl/RedisConnectionManager.java#L208

The connection's exception handler is called on several places. All of them but one pass an IllegalStateException with message Result is already complete. The only caller that can possibly pass another exception is https://github.com/vert-x3/vertx-redis-client/blob/4.4/src/main/java/io/vertx/redis/client/impl/RedisStandaloneConnection.java#L459

Unless I'm seriously totally absolutely wrong and mistaken, that's where the log message in this issue comes from. However, that's the end of the fail() method, and the fail() method is only called:

  • from the protocol parser (RESPParser), which is installed as a handler on the connection's NetSocket
  • from the connection's NetSocket itself, because it's installed as the NetSocket's exception handler
  • from the send / batch methods, because it's used as a failure callback for NetSocket.write() calls

This means that:

  1. at the time the error is logged, we're way past DNS, we've established a connection to Redis, we've installed a protocol handler on the NetSocket, there might even have been some communication;
  2. at the moment when the exception is logged, the connection should already have been removed from the connection pool (evict() is called a few lines above).

Item 1 seems like an exceptional, but expected situation -- errors do occur in the wild. From the "stacktrace" in the log message, the error seems to be a "closed channel" and it occured when writing [a command to Redis, probably?].

Item 2 is where it gets weird. If my analysis above is correct and the log message indeed comes from the RedisStandaloneConnection.fail() method, the connection should not stay in the pool. At the moment, I can't see a reason why it does.

@lucaspouzac
Copy link
Author

Is it possible that using the mutiny part for the reactive part could induce this behavior?

@cescoffier
Copy link
Member

Hard to say. In general it actually does the opposite and capture more failures. However, we may have forgotten an exception handler (so a failure not propagated downstream, but reported separately)

@lucaspouzac
Copy link
Author

Ok, moreover we see in the monitoring graph that the number of connections increases up to 22, while the configured max is 6 (threshold not exceeded during the period when it works)

@cescoffier
Copy link
Member

that looks like a pool issue in native. How do you get the number of connections? Metrics?

@lucaspouzac
Copy link
Author

Yes, with redis metrics by prometheus : https://quarkus.io/guides/redis-reference#enable-metrics

@lucaspouzac
Copy link
Author

We deployed a version without native compilation, and we see the same problem.
This happened after 3 days, without significant load at that time

JVM : registry.access.redhat.com/ubi8/openjdk-17:1.17
quarkus : 3.6.5

@lucaspouzac
Copy link
Author

Since quarkus 3.7.x, the problem is not reproducible.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redis kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants