Idle Connections are not being cleaned up under certain conditions. #3020
Comments
Hey @wellwelwel, I'm just wondering what the next steps are in moving from PR to release. I did attempt to deploy from my GitHub clone via npm, but our CI failed saying that the lru-cache module was not found, so I am assuming there may be a build step on your side to produce a working package. If you're able to provide me with an npm-installable package, I will be happy to deploy it to our staging infra across the 200+ lambdas that use the mysql2 library. We can update all the lambdas relatively easily and monitor for any observations. Thanks
I'd like to take a closer look before merging (if @sidorares doesn't do it first).
To better understand the lru-cache conflicts, see #2988 and its tracks 🙋🏻‍♂️
Thanks @wellwelwel, yes, please take a closer look. In regards to #2988, I have created a branch with this change and the connection cleanup change, and I am currently deploying it to our SIT environment, where I can leave it running for a few hours to see if there are any regressions. I'll update here soon.
My updates so far: I am using AWS CDK to manage my infrastructure, so by making this version change and deploying, I have redeployed hundreds of lambda functions that now use the latest code, the latest code being the fix for this issue plus the patch I applied from #2988. A full deployment like this took close to 2 hours to fully roll out. Below are my observations.
During this time, I expected the number of database connections to increase as more lambda instances run during the deployment, and then to reduce back to the running average.
Screenshot: Connections During Deployment
The above screenshot shows the pattern I expected, and also shows that the number of connections reduced back to the expected amount.
I have also attached a screenshot of the Aborted Clients metric, showing that during the deployment window the number of aborted clients jumped around as connections were being made from new lambdas and connections were being aborted by lambdas being terminated.
Screenshot: Aborted Clients during deployment
So far I am not observing any issues. I selected the sum of Errors for 500 lambdas over the past 12 hours, and I see no difference in the volume of errors before vs after the release.
I am still monitoring, as I also have a metric for the Socket Timeout error, but as it's intermittent I will collect data for several hours and see if we are still experiencing the errors. Will post back soon.
Hey @wellwelwel, as far as regressions are concerned, there's nothing on our side statistically that shows any increase in errors or any significant change in the number of connections to our RDS, which tells me that the newly deployed code is working as expected. Here's a screenshot of the errors we were tracking for the Socket Timeout error. As you can see, after I released the patch to our staging environment around 8:35pm, the Socket Timeout error seems to have stopped.
@wellwelwel I think we still need to resolve the issue where the connection cleanup process doesn't start to tick unless maxIdle is lower than connectionLimit. My suggestion is that we just remove the if condition and start the timeout ticks regardless of the value, so we are able to update all connections based on my logic change in the code at Line 30 in dee0c08.
Thoughts?
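One possible reading of this suggestion, sketched as a toy model rather than the library's code: the sweep runs on every tick whatever the relationship between maxIdle and connectionLimit, and closes any free connection that has been idle longer than idleTimeout. The function name `sweepIdle` and the `{ threadId, lastActiveTime }` shape are assumptions made for illustration only.

```javascript
// Toy model of an unconditional idle sweep.
// Names and data shapes are assumptions, not mysql2 internals.
// `free` is an array of { threadId, lastActiveTime } entries.
function sweepIdle(free, now, config) {
  const kept = [];
  const closed = [];
  for (const conn of free) {
    const idleFor = now - conn.lastActiveTime;
    // Close anything idle for longer than idleTimeout; keep the rest.
    if (idleFor > config.idleTimeout) closed.push(conn);
    else kept.push(conn);
  }
  return { kept, closed };
}

const now = 100000;
const free = [
  { threadId: 11, lastActiveTime: now - 45000 },
  { threadId: 12, lastActiveTime: now - 31000 },
  { threadId: 13, lastActiveTime: now - 5000 },
];

// maxIdle equals connectionLimit here, the case where cleanup reportedly never starts.
const result = sweepIdle(free, now, { maxIdle: 3, connectionLimit: 3, idleTimeout: 30000 });
console.log(result.closed.map((c) => c.threadId)); // [ 11, 12 ]
console.log(result.kept.map((c) => c.threadId)); // [ 13 ]
```

This model ignores maxIdle entirely. Whether maxIdle should still act as a floor for the number of retained idle connections is a design decision for the maintainers.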
Setup
Description:
Hey, I have been noticing some unusual connection errors and have traced them back to the connection cleanup process within the Pool class. Below I have outlined my setup, the steps I took to find the issue, and how to replicate the problem using an existing test case.
We are getting a fairly sizeable percentage of requests that fail due to socket timeout or disconnection issues, which typically show as PROTOCOL_CONNECTION_LOST, as shown below:
The configuration we had set up was { maxIdle: 3, connectionLimit: 3, idleTimeout: 30000 }. After digging around I found the following line of code, which implies that maxIdle MUST be less than connectionLimit. In this configuration the connection cleanup process is therefore not enabled, meaning that once a connection is opened, it will never be removed.
node-mysql2/lib/pool.js, Line 30 in dee0c08
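The rule described above can be modelled in isolation. This is a minimal sketch, not the library's code: `isIdleCleanupEnabled` is a hypothetical helper that only mirrors the condition as stated in this issue, namely that cleanup starts only when maxIdle is strictly less than connectionLimit.

```javascript
// Hypothetical helper: mirrors the rule described in this issue,
// not the actual implementation in lib/pool.js.
function isIdleCleanupEnabled(config) {
  return config.maxIdle < config.connectionLimit;
}

const original = { maxIdle: 3, connectionLimit: 3, idleTimeout: 30000 };
const adjusted = { maxIdle: 1, connectionLimit: 3, idleTimeout: 30000 };

console.log(isIdleCleanupEnabled(original)); // false
console.log(isIdleCleanupEnabled(adjusted)); // true
```

Under this reading, setting maxIdle equal to connectionLimit silently disables idle cleanup, even when idleTimeout is set.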
So, with excitement, I then updated one of our production lambdas to use the config { maxIdle: 1, connectionLimit: 3, idleTimeout: 30000 }, in the hope that activating the connection cleanup process would resolve the connection errors we were seeing. I went to bed hoping to wake up to no red blobs on our dashboard. To my surprise, the connection errors continued, following a similar pattern to the previous nights, as if the change had no effect.
I looked at some of the logs and noticed that the Timer object / reference was present on the pool properties once I dropped maxIdle to a lower number than connectionLimit.
Before:
After:
So I was certain the timer was now running, yet I was still seeing the same connection errors 😠. I added some more logs to track how long each connection's thread id lasted and what the time delays were between queries.
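The kind of logging described above can be sketched as follows. This is an illustrative assumption, not the logging actually used: `createThreadTracker` is a hypothetical helper that records, per thread id, how long the connection has been alive and how long it sat idle since the previous query.

```javascript
// Hypothetical logging helper; not part of mysql2.
function createThreadTracker() {
  const seen = new Map(); // threadId -> { firstSeen, lastUsed }
  return {
    // Call once per query with the connection's thread id and a timestamp in ms.
    record(threadId, now = Date.now()) {
      const entry = seen.get(threadId);
      if (!entry) {
        seen.set(threadId, { firstSeen: now, lastUsed: now });
        return { threadId, ageMs: 0, gapMs: 0 };
      }
      const sample = {
        threadId,
        ageMs: now - entry.firstSeen, // how long this thread id has existed
        gapMs: now - entry.lastUsed, // idle time since the previous query
      };
      entry.lastUsed = now;
      return sample;
    },
  };
}

const tracker = createThreadTracker();
tracker.record(42, 1000);
const second = tracker.record(42, 61000);
console.log(second); // { threadId: 42, ageMs: 60000, gapMs: 60000 }
const third = tracker.record(42, 61500);
console.log(third); // { threadId: 42, ageMs: 60500, gapMs: 500 }
```

In a real handler the thread id would presumably be read from the connection (for example `connection.threadId`) just before each query. A gapMs larger than the configured idleTimeout on a connection that is still being reused would suggest the connection was not cleaned up.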
So at this point I knew there was a possibility that the bug is within the node-mysql2 library. I cloned the repository and set up a dockerised MySQL sandbox to be able to execute the integration test suite, which thankfully was a piece of cake to get up and running. I ran the full suite to validate that everything works as expected, and then I took a look at the test/integration/test-pool-release-idle-connection.test.cjs test file, as that was closest to what I was doing.
https://github.com/sidorares/node-mysql2/blob/dee0c0813854658baf5f73055e5de36a1a78e7bf/test/integration/test-pool-release-idle-connection.test.cjs
I set up my MySQL instance, ensured that there were no connections from anything else, and executed the test file as is, without making any changes.
MySQL before the tests:
Executed the tests via the following command:
The test PASSES as expected 🎉
Within the MySQL console, I was refreshing the process list every second. I could see 3 connections created, as expected, and I saw those same three connections drop off after around 7-8 seconds, which again is what I expect given the setTimeout within this test scenario.
After the test was completed, MySQL was back down to the initial two base connections:
OK, here's where it seems to go wrong, and I think this could be the cause of the connection socket timeouts that I am experiencing. When I change the configuration to {connectionLimit: 3, maxIdle: 2}, the test just hangs and never finishes (even after 10/15+ minutes).
When checking MySQL, it seems as though there are connections outstanding that have not been closed.
I'm pretty certain that the cause of this bug is somewhere near this code. However, I don't have a huge amount of experience with this library, so I am looking for some support here, from replicating the bug to offering a solution that won't cause any regressions.
Thanks