-
Notifications
You must be signed in to change notification settings - Fork 506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Queries that fail due to bad connections are not automatically retried (problem with Issue #275 changes) #586
Comments
Issue #275 and related sql.go Issue 20807 could be solved with error wrapping. that wasn't an option back when these issues were first raised, but it is now. Of course that requires coordinated changes in both the sql and mssql modules. Absent a complete fix, could we at least make the behavior configurable? Users could choose which they prefer: either Go's standard connection pool management with automatic retries or detailed connection error messages. I can think of a couple of mechanisms for configuring the behavior. One would be a function similar to the existing SetLogger function. Call it EnableErrorDetail or SetErrorPolicy or something along those lines. Another would be to add more driver names to differentiate the behavior, similar to the way "mssql" and "sqlserver" currently yield different behavior. I don't like this idea, but it is an option. |
P.S. I did look into the new Raw function in Go's sql module that allows access to the underlying driver connection, hoping that it might provide a mechanism mssql could use to expose a detailed "last error" message from the driver connection. It would work, except that Raw does not grant access to connections that have been closed, which are exactly the connections from which we would want a detailed last error. |
I see that the database/sql/driver package in Go 1.15 will include a new Validator interface that appears to specifically target the problem raised in Issue #275:
Are there plans to implement the new Validator interface in go-mssqldb and, in doing so, correct the bug introduced by the work-around for Issue #275? |
Would there be value in a 2 part fix?
|
Linked PR adds IsValid to the connection. This should help. Then we can update the errors returned to be something more useful. |
Supporting the Validator interface is good, but it doesn't change the retry behavior and won't help with this issue. I was hoping that it would when I mentioned it last year, but now that I have seen its implementation within the database/sql package, I understand that it does not. But first, a demonstration. Here is a test program. It creates a pool of five connections, fills the pool with five simultaneous queries, then runs a series of 11 total queries to demonstrate the error/retry behavior. The first three queries are normal, successful queries. The next eight queries are run after SQL Server is restarted, which closes the socket on which SQL Server is listening and thus makes all five connections in the pool bad. To successfully demonstrate this program, you must manually restart the SQL Server service during the 6-second pause built in to the program.
When you run this program using commit 3516239 of go-msqldb (the commit immediately preceding kardianos' Validator commit), you get these results:
Notice that five queries fail, one per bad connection in the pool, before queries begin to succeed again. If database/sql's automatic retry logic hadn't been defeated by the changes for #275, all queries would have succeeded. Next is the same program, but this time run with kardianos' Validator commit 333f2e3:
Externally, the behavior is exactly the same. Five queries, one per bad connection in the pool, must fail before the pool is emptied and queries begin to succeed again. Under the covers, of course, the behavior is slightly different. Without the Validator interface, the bad connections are returned to the pool and then removed the next time they are used. In this program, that happens during queries 6, 7, and 8 when go-smsqldb's ResetSession method is called: Lines 14 to 17 in b597672
Whereas with the Validator interface, the bad connections are removed before they are returned to the pool. In this program, that happens during queries 1 through 5 when database/sql's putConn calls go-mssqldb's IsValid function via the validateConnection function:
The above code is the only place in database/sql where the Validator interface is called. That is why implementing Validator doesn't change the external behavior. The error is reset locally within putConn's scope to remove the bad connection before it is returned to the pool, but the original error is returned to QueryContext, which continues to check for driver.ErrBadConn in its auto-retry logic. Therefore, the auto-retry logic does not work:
In the end, implementing the Validator interface doesn't change much for go-mssqldb. Externally it makes no difference. Internally, the bad connections are discarded slightly earlier in the process, but that doesn't have any material effect. Effectively, the Validator interface formalizes what go-mssqldb already implemented in response to issue #275. That's a good thing, but it doesn't help those who want auto-retry. So what do we do for users who want to use database/sql's standard auto-retry logic? I propose adding an ErrorHandler interface to go-mssqldb. The interface would have a single function named HandleError:
The inputs give the implementer information about what went wrong and where it went wrong. The output is the error that should be returned to database/sql. This allows the implementer to log the error, change it, or take any other action appropriate to their application. A simple implementation to restore database/sql's default auto-retry logic would be:
If the user supplies an implementation of HandleError, checkBadConn would call it just before returning. If the user does not supply an implementation, checkBadConn would return as it normally does now. I believe this would maintain go-mssql's existing behavior for backwards compatibility while allowing those who wish to fine tune the error behavior to do so. Is this something the maintainers would accept? |
This is a great write up.
|
Thank you
There are a lot of nice things in the database/sql package, but there is room for improvement too. It seems that the designers wanted to shield end users from the complexity of pool management. That's a good thing in general. Getting a connection pool to work correctly and efficiently is not an easy task, so relieving the end user of that responsibility makes the package much easier to use. But at the same time, it causes issues when the built-in behavior doesn't meet your needs. Do you want more detailed error messages like d1rewolf did in issue #275? It can't be done, unless the driver maintainers create a work-around the way go-mssqldb's maintainers did. Do you want to control what queries get retried (e.g. auto-retry read-only queries but not queries that write data)? Can't be done. Do you want to auto-retry queries but still log errors that were auto retried (this could help alert operators to degradation before it becomes severe enough to overcome the retry logic)? Can't be done. Do you want to flush the pool to get rid of bad/suspect connections? Can't be done, unless you close the pool altogether and create a new one, with the issues that entails. Do you want to instruct the pool to use a fresh, new connection rather than re-using one from the pool? Can't be done. The last point particularly gets in the way of taking control of retry logic yourself. I could try to write my own retry logic for use with go-msqldb that emulates the built-in logic of database/sql. That built-in logic tries the query maxBadConnRetries times (2 times, right now) with connections that are either pulled from the pool or created new as needed, and after the second failure tries once more with a guaranteed new connection:
However, end users don't have access to the cachedOrNewConn and alwaysNewConn flags. We have to take what the pool gives us. If the entire pool is polluted and the auto-retry logic is bypassed, as it is with go-mssqldb, then we have to make at least pool-size attempts before we have a chance of getting a new connection. My test program demonstrates this. Of course, database/sql is well established and used in countless programs. Altering its behavior in a non-backwards compatible way is out of the question. Any changes in it would need to maintain the existing behavior as default. That brings us to:
I don't want to make a breaking change in this driver either, which is why I am proposing an optional interface that allows the default behavior to be augmented rather than changed. Here are a few more details about my thinking regarding the interface signature:
I think the combination of all of those gives the implementer enough information to make conditional decisions on how to handle particular errors. I am open to suggestions for more or fewer arguments. In my personal use case I would trace and log the error with a trace ID that is already included in all my database queries, and then instruct database/sql to automatically retry queries that failed due to a bad connection. The queries I deal with are read-only. |
To make my proposal more concrete and give you something more easy to evaluate, I have implemented a proof of concept in commit 9cbf4b8 of my fork https://github.com/fineol/go-mssqldb, using the existing SetLogger feature as a guide for the sake of consistency. The implementation is not complete. For example, it does not yet pass the contexts and statement. Nor are there any unit tests. But it illustrates my idea and demonstrates the functionality. Running the test program from #586 (comment) with this fork, you get the same results, because the change is designed to be backwards compatible:
But by adding to the test program the single line Here is the updated test program:
And here are the results:
All eight queries after the restart now complete successfully, because the errors were converted back to driver.ErrBadConn by the applied ErrorHandler and thus swallowed by database/sql's auto-retry logic. The bad connections were flushed from the pool during the execution of queries 1 through 3. What do you think? |
I think the ability to tell go-mssql to revert to the "old" behavior is useful and perhaps it should go beyond adding a way for app developers to use it. Driver-level behavior is often buried in application-level logic and I suspect there are end users running applications whose code they don't control which would also like this behavior. |
I can see that adding a connection string parameter to enable auto-retry could be useful in scenarios that you describe, shueybubbles. However, making it a connection string parameter implies that the setting can vary by connection. I would still want to keep the SetErrorHandler function for use by developers who want finer grained control. That then brings up the question: which takes precedence? Does a connection string parameter override SetErrorHandler or does SetErrorHandler override the connection string parameter? I lean towards SetErrorHandler overriding the connection string parameter. If a developer has gone to the trouble of explicitly specifying an ErrorHandler, there was likely a good reason for it and not something we would want an end user overriding without knowing the implications. |
I agree that code should override config. Implementation-wise, it would probably just mean that the default error handler provided by go-mssql would switch between new and old behavior based on the connection string. If the app provides some other error handler the property in the connection string wouldn't be checked, though conceivably there could be a way for an app's error handler to fall back to the default error handler in some cases. |
We could also include the value of a connection string auto-retry parameter as an additional argument to the HandleError function of the ErrorHandler interface. That way a developer who implements a custom ErrorHandler could vary its behavior based on user configuration the same way you propose the default handler vary based on configuration. |
Commit a47dcd0 of my fork https://github.com/fineol/go-mssqldb passes contexts to checkBadConn and HandleError. With this the developer can implement a custom ErrorHandler that can make use of information included in the query contexts. My updated test program to demonstrate this feature is:
If you comment out the line that sets the custom ErrorHandler (
But if you enable the custom ErrorHandler, all eight test queries succeed and the underlying errors are logged with tracing information (the query number, in this case):
With this additional information, you can see where the bad connections are flushed. The auto-retry logic in database/sql makes two attempts per query with cached connections from the pool, discarding each as they fail, and then switches to a new connection that succeeds. It repeats this until all five bad connections are flushed, finishing the job in query 3. |
Hello, I don't think we should require a specific handler to have the connection pool work as it should. The case that's not handled by the We don't want On a side note, I've tested the following query, which closes the current connection (because of a severity above 20): RAISERROR ('goodbye', 20, 42) WITH LOG There are 2 problems here:
|
I agree with @tc-hib , we should not require a special handler. I think there are four points we can address:
If all agree this would fix the issue, and we can empirically test (let me know if you need help compiling Go from source and applying the std lib Go CL I'm about to send). |
Please see https://go-review.googlesource.com/c/go/+/333949 for a std lib change that wouldn't land until Go1.18 if accepted. See (1) in the above list. @tc-hib Can you verify that this would resolve this problem (assuming we implement (2))? |
I'll have a look as soon as I can. Meanwhile, I want to make sure we don't forget the cases when we should not return For example, imagine you run a stored procedure that does several things, and the communication is lost in the middle of it. Returning I don't know enough about the driver, but I suppose we should only return |
I don't object to wrapping the errors and using errors.Is within database/sql. I suggested that as a solution in my comment a year ago when I first opened this issue:
It didn't get any traction then, which I assumed was either because their wasn't a desire to change this driver's bad connection behavior yet again or because database/sql's retry logic was felt to be harmful and should be deliberately avoided. Those are both valid concerns (backwards compatibility and inadvertently retrying a query that ought not be automatically retried), which is why I proposed the separate ErrorHandler interface to let developers pick the behavior needed for their applications without changing the existing behavior. However, I think both of those concerns may be addressed by kardianos' third and fourth points. I think tc-hib's idea about only invoking the retry logic when first starting a query has merit as well, and this may already be effectively in place. For example, I don't believe that retry occurs when Nexting through a returned recordset. @kardianos, would point three be something like adding an optional "BadConnAction" parameter to the DSN that accepts values "Fail" and "Retry"? If so, I'd like to see that implemented now without waiting for 1.18 (assuming your proposal for database/sql is accepted). I've already looked at the code and don't think it would be hard to do. |
Yes. We should only return ErrBadConn when we want to retry during some type of dialing or initialization problem. Most any other situation we should return the actual error.
That's roughly point (3) in my list. I would ideally like to come up with a testcase either unit test or motivating example that is easy to walk someone through, and easily reproducible. |
Are there situations where an error class >= 20 doesn't result in the server closing the connection but the client should treat the connection as bad anyway? |
Here is a unit test that demonstrates the problem using a single connection. It is modeled after the existing TestDisconnect1 and TestDisconnect2 tests in queries_test.go:
Here are the results of running it as go-msqldb exists today:
Here are the results running it with a version of go-msqldb modified to accept and use a BadConnAction=Retry connection string parameter (along the lines of kardianos' point three):
|
Here is an alternative unit test that uses tc-hib's idea of raising a fatal error in SQL Server. This may be a bit simpler to understand for someone who isn't familiar with the internal dialerInterrupt used in go-mssqldb unit tests:
Results with current version of go-mssqldb:
Results with modified version that auto-retries:
|
@fineol I think this is a great unit test to add. Want to submit a PR? Ideally, if the auto-retry fix is turned off, then it would return a |
@shueybubbles asked:
Microsoft's documentation for raiserror states that 20 or higher always terminates the client connection:
However, Microsoft's documentation for error severities is less adamant, stating only that in most cases the connection "may" terminate:
Therefore, it isn't clear if SQL Server always closes the connection. But if you inspect the detailed log messages in my last comment, you can see that go-mssqldb does get additional error information that it isn't currently returning to the caller, and that error information looks to be fairly conclusive about the connection being bad:
I suspect that information can be relied on, but if there is any doubt, I think it is safe to mark the connection as bad at go-mssqldb's end if it detects an error severity 20 or higher. |
@kardianos: Yes, I can submit a PR. Two questions though:
|
|
@fineol FWIW the ADO code seems to manage broken state based on fatal errors here: https://github.com/dotnet/SqlClient/blob/98ea349b0d37d2c5b80f0ebc70e3fe1d66832f31/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs#L3911 |
…me to test it now)
I'd like to show with code how I thought this could have been fixed:
So we might only have to use it two ways: either to only mark the connection as bad, or to return We need CL 333949 to have actual errors when we lose connections between two queries, or when the server is down. (I've had a look at the code before #275: ErrBadConn already was returned only when sending queries, sorry for having stated the obvious :) |
I like tc-hib's proposal. It is good to separate the bad connection status from eligibility to be retried, even though the two combine in the end for the final retry decision. It is also good to mark the connection as bad when SQL Server responds with a fatal server error. I pulled tc-hib's 05022b9 and integrated my unit tests. Both now pass, but that is partly due to the fact the proposal by itself effectively reverts the changes requested in #275, which tc-hib acknowledges with "We need CL 333949...". Therefore, I don't think we want to merge it until we also cover kardianos' point 3 (making the retry behavior optional). I also note that with tc-hib's proposal, my TestDisconnect4 no longer is an effective test of breaking a connection between queries, because tc-hib's proposal now (properly, I believe) marks the connection as bad before it finishes the raiserror query. Therefore, we really need both unit tests: 3 to test a connection going bad between queries, and 4 to test a connection going bad during a query. I have done some work on kardianos' point 3 and will continue, combining it with tc-hib's proposal unless anyone objects. Then I can move on to points 1 and 2. |
Thanks. I've updated https://github.com/denisenkom/go-mssqldb/compare/master...tc-hib:fixing-586?expand=1
This is only a draft, but I think we need these changes in one way or another. Edit: I've tried this with CL 333949, seems to work well. |
Isn't Maybe this could be handled with a specific optional interface instead. func isBadConn(err error) bool {
if err == driver.ErrBadConn {
return true
}
if err2, ok := err.(driver.BadConner); ok {
return err2.BadConn()
}
return false
} |
This comment has been minimized.
This comment has been minimized.
@tc-hib The solution is this line in the
So if the error was of type
If the error is not |
This is a great solution. I didn't know it existed. |
Is it possible / likely to have a partially committed query and then get a raiserror 20 and abort the connection? Could this cause side effects? |
I think so, since any number of statements or transactions can be processed for one query. IIRC the documentation only says you should ignore the resultset, and you can't be sure the server is still able to process new queries. There's a common use case for raiserror with severity 20: when you write a script which is not supposed to run if some condition is not matched. But it is especially useful when running a batch separated with |
AFAIK any severity 20+ error means the connection should be discarded, which is what the ADO driver does. |
I added unit tests and a DSN parameter that allows configuration of auto-retry behavior on top of tc-hib's commit 405258f. You can see my changes on my fork at commit 5ad7cee. Tc-hib's changes failed the existing unit tests TestShortTimeout and TestQueryTimeout because the changes wrapped the original errors in msql.Error errors:
I think, based on the fields and functions of mssql.Error and the fact that there is also an existing StreamError struct, that Error was meant exclusively for errors returned by SQL Server. Therefore, we don't want to overload it and return it for all errors when a connection is bad. Instead, I propose creating a new RetryableError for the exclusive purpose of conveying information about errors that can be retried. RetryableError implements errors.Is, returning true when compared to driver.ErrBadConn to enable database/sql's retry logic. It also wraps the original error so that the caller can get the root cause if the automatic retry logic fails. RetryableError is only used within checkBadConn, where I believe it simplifies and clarifies the logic. It also obviates the need to include a badConn field in mssql.Error, which may be out of place there, because SQL Server errors don't generally have anything to do with connection state. I think we are getting pretty close to where we want to be, but we still need to:
What are everyone's thoughts? Also, is it appropriate to carry on this level of discussion in the issue comments? If not, is GitHub's discussion feature the appropriate venue (in which case someone would have to enable it in this repository)? |
I think these types of discussions become more useful in a PR than in the Issues forum; I've not yet had much participation in a github Discussions forum to know if that model is any better. Getting the driver to the point where its default is to follow the data/sql retry convention for #586 is desirable. It's basically table stakes for go-mssqldb to behave with a similar pattern as other drivers, isn't it? I would expect frameworks like Dapr value this kind of predictability, and using this model has been requested by some teams like Microsoft like AKS. That said, SQL Server app developers that have moved over to our cloud offerings have found the simplistic retry models in previous drivers to be insufficient, leading to the new ADO driver implementing a fully customizable retry model per https://docs.microsoft.com/en-us/sql/connect/ado-net/configurable-retry-logic?view=sql-server-ver15 Eventually the go data/sql core should consider adopting something similar, as there's no one-size-fits-all retry model that can fit all applications across all environments. Even within one application, it's likely that connections to one database should follow a different retry model than connections to another database simply because of network topology and performance differences between the databases. |
I agree, and I like your solution.
The discussion board would be useful for questions such as #663, #659, #646... |
I issued a pull request as you can see above. It has failed the AppVeyor tests for TestDisconnect5 because:
I had hoped to guarantee that the connection was broken as expected by checking in detail, but I see now that I will have to accept a less-detailed broken connection verification in order to support the testing environment. I will push an update when I get a chance. |
* This is how I would have tried to fix #586 (I don't have time to test it now) * Wrap sql errors in the serverError * Make ServerError and wrap sql errors in it * To try CL 333949 wrap ErrBadConn in mssql.Error * Forgot to test connectionGood in the (probably useless) case of mssql.Error * Correctly handle other cases (other types of errors, with connectionGood unset before calling checkBadConn) * Reordering of switch cases for readability * Add dsn param to disable retry of bad conn queries * Unit tests for auto retry of broken connections * Add RetryableError type * Allow disabling of bad-connection retry logic * Base default retry policy on Go version * Improve broken connection tests * Update disconnect test to recognize net.OpError * Break all dialers' connections in disconnect tests * Document expected errors in TestDisconnect5 * Make badStreamPanic panic with StreamError * Add test for logic of checkBadConn function * Add unit tests for custom error types * Test bad connection rejection Co-authored-by: Thomas Combeléran <55949036+tc-hib@users.noreply.github.com>
Describe the bug
Go's connection pool includes logic to discard bad connections and retry queries that failed due to a bad connection. This logic does not work with the mssql driver, because mssql does not return the standard driver.ErrBadConn error.
To Reproduce
Unfortunately, this bug cannot be reproduced with a simple code snippet, because it involves simulating transient network errors. A general guide to reproducing it is:
Expected behavior
Further technical details
SQL Server version: any
Operating system: any
Table schema: any
Additional context
I read through Issue #275 and understand the desire for detailed error messages. I also understand the limitations of Go's connection pooling and am disappointed that Go's Issue #20807 languished unresolved after hopeful initial attention.
However, the changes made to address #275 violate the contract with Go's connection pool, break Go's widely advertised automatic retry logic, and interfere with the pool's connection management (even though bad connections will eventually be purged). I believe that these costs are not worth the benefit of the detailed message.
Are the mssql maintainers amenable to rethinking Issue #275 and discussions of alternate approaches?
For handy reference, here is the code for sql.go's QueryContext. You can see that by not returning driver.ErrBadConn, mssql causes the connection pool logic to break out of the loop and return an error without making any retry attempts:
Similar retry logic exists throughout sql.go, and it is all broken by returning specific errors rather than driver.ErrBadConn.
The text was updated successfully, but these errors were encountered: