Deadlock when connection try to disconnect #133
Comments
Hi, so I'm wondering, based on looking at the JDK11 bug history, if this may not be related to some past issues with JDK11 and SSL connections hitting a deadlock scenario where you have this unique combination of the socket shutting down: https://bugs.openjdk.org/browse/JDK-8207004 Can you please share what JDK vendor you're using? Would it be possible to upgrade to 11.0.20 and see whether this problem goes away, or better yet move to JDK17 as JDK11 is being EoL in October 2024? |
@Naros thanks, looks like it, indeed we use this
|
@Naros unfortunately the issue continues to happen with OpenJDK 17 ...
|
stack trace involving shyiko :
|
Hi @aadant, can you describe the steps on how to reproduce it reliably? |
It happens in the context of debezium embedded |
…on-try-to-disconnect DBZ-7570/#133: add workaround using SO_LINGER with 0 timeout
Hi, bringing this topic up again because we have encountered it in our company. Given: AWS Aurora 3, AWS MSK, Confluent Kafka Connect 7.6.1, Debezium 2.6.1. Problems started after MySQL upgrade from version 5.7 to 8 i.e. Aurora 2 -> 3. In the logs we see the message
About reliably reproducing the bug. Based on what I described above, namely that the hang happens on half-close, I'm not sure how it can be reproduced. In our case, Aurora DB sometimes has such extreme loads that it just dies for a few minutes, at which point Debezium stops getting heartbeats in response and tries to restart BinaryLogClient. In the case of a read socket close, an SSLException occurs:
it is ignored and proceeds to a write socket close, where the endless waiting takes place, because in our case the server is in a coma and not responding. How to reproduce this, I don't know. P.S. Recommendation from me: please add more logs, at least with DEBUG level. |
@comrada Debezium 2.6.1 contains 0.29.1 of the binlog client. The issue should be fixed in 0.29.2. Could you please try it in your deployment and see whether it helps with the problem? |
Hi @comrada , does the 0.29.2 work for you? I'm facing a similar issue after upgrading to Aurora v3. |
@jpechane no, 0.29.2 does not fix this problem, because as I wrote in my comment, the infinite wait in my case happens while trying to close the socket for writing, and the 0.29.2 fix sets the SO_LINGER property after this call. |
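To illustrate the ordering point in the comment above, here is a minimal sketch using a plain `java.net.Socket` rather than the TLS socket the library actually uses. The class and method names (`LingerClose`, `abortiveClose`) are hypothetical and this is not the library's code. It assumes the goal is to enable SO_LINGER with a 0 timeout before any call that could wait on the peer, so that `close()` resets the connection instead of attempting a graceful shutdown.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class LingerClose {

    // Hypothetical helper, not taken from the binlog client.
    // SO_LINGER is enabled with a 0 timeout first, and only then is the
    // socket closed, so close() does not wait for the peer to acknowledge.
    static void abortiveClose(Socket socket) throws IOException {
        socket.setSoLinger(true, 0);
        socket.close();
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Local throwaway server so the example runs without a database.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Socket client = new Socket(loopback, server.getLocalPort());
            try (Socket accepted = server.accept()) {
                abortiveClose(client);
                System.out.println("client closed: " + client.isClosed());
            }
        }
    }
}
```

An abortive close discards any unsent data, which is usually acceptable when the connection is being torn down because the server stopped responding. Whether the same ordering is enough for an `SSLSocket` is an assumption this sketch does not verify.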
Hi @comrada We're also on Aurora v3 and running into the same deadlock issue, although I'm not sure if the disconnect happens because of aurora dying sometimes like you mentioned in your case. For this change you mention though, closing the socket without linger is generally discouraged from what I've read online. So I guess this should be directed towards the people that implemented this initially to address this issue. But I wanted to ask if this can lead to any data loss? Has this fix been working fine for you ? |
I also experienced the same issue after upgrading from Aurora 2 -> 3, and the thread dump status was very similar. In my opinion, it is a bug in open jdk that occurs under certain conditions in SSL connections. It worked after I applied "database.ssl.mode" : disabled. Originally, preferred was being used as the default value, but this option does not seem to use SSL in the Aurora 2 version and seems to use SSL in the Aurora 3 version. I hope my experience helps you |
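For a Debezium embedded setup like the one mentioned earlier in the thread, the workaround above might be expressed roughly as follows. This is only a sketch: the class and method names (`SslModeConfig`, `connectorProps`) are hypothetical, and only the `database.ssl.mode` key and its `disabled` value come from the comment. Every other connector property is omitted.

```java
import java.util.Properties;

public class SslModeConfig {

    // Hypothetical helper that builds only the SSL-related part of a
    // connector configuration; a real deployment needs many more keys.
    static Properties connectorProps() {
        Properties props = new Properties();
        // Turns TLS off for the MySQL connection, which avoids the
        // SSL close path where the hang was observed.
        props.setProperty("database.ssl.mode", "disabled");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(connectorProps().getProperty("database.ssl.mode"));
    }
}
```

With this setting the replication traffic is sent unencrypted, so it is only a fit where the network path is already trusted.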
+1 Just experienced this today, even with |
Also saw a Debezium deadlock (caused by this library) in my org, which caused a small prod outage before monitoring flagged it, and we think it's related. I see the authors of this repo have actually made a subsequent change (looks like @comrada's change): Which looks like it will be part of the 0.30.0 release? |
Looks like this might have already made its way into debezium itself: here which is actually now a few versions ahead of the last release in this repo. |
Back with updates. In Debezium 2.7.2 the version of this library being used is 0.29.2, which still doesn't include the change from comrada. However, the latest Debezium 3.0.0 uses 0.31.0 of this library, and his change is included. I ended up pulling the class files out of the Docker images and decompiling them to find out. |
I suspect the TLS version of the connection between Debezium and MySQL database. When using Aurora 2 (MySQL 5.7), the connection was using TLSv1.2. After the upgrade to Aurora 3 (MySQL 8), the connection was changed to use TLSv1.3 as the default (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Security.html) Our temporary fix is to update the tls_version to TLSv1.2 DB Cluster parameter group then the issue disappears. I haven't re-tested the bug with the latest Debezium release yet. |
Hey 👋 I am experiencing the same problem. Has anyone tested the Debezium 3.0.0 release, or is the best current workaround to disable SSL using
message in the log file
waiting forever
Given the code, it looks like this affects master ([0.28.3])