
Queries that fail due to bad connections are not automatically retried (problem with Issue #275 changes) #586

Closed
fineol opened this issue Jul 1, 2020 · 42 comments · Fixed by #681


@fineol (Contributor) commented Jul 1, 2020

Describe the bug
Go's connection pool includes logic to discard bad connections and retry queries that failed due to a bad connection. This logic does not work with the mssql driver, because mssql does not return the standard driver.ErrBadConn error.

To Reproduce
Unfortunately, this bug cannot be reproduced with a simple code snippet, because it involves simulating transient network errors. A general guide to reproducing it is:

  1. Start a debug session
  2. Set a breakpoint on checkBadConn in mssql.go
  3. Establish a database connection and execute a query via sql.go's DB.QueryContext
  4. Observe that the query completes successfully
  5. Disable your network
  6. Repeat the query
  7. Step through, but not out of, checkBadConn and observe that it returns a detailed error rather than the standard driver.ErrBadConn error
  8. Re-enable your network
  9. Step out of checkBadConn up to DB.QueryContext
  10. Observe that the logic to discard the bad connection and retry with another is bypassed. The error propagates immediately to the caller

Expected behavior

  1. mssql immediately returns driver.ErrBadConn when it encounters a bad connection, not later when the connection is used again
  2. Go's connection pool logic is then able to immediately discard the bad connection and retry with a new connection

Further technical details

SQL Server version: any
Operating system: any
Table schema: any

Additional context
I read through Issue #275 and understand the desire for detailed error messages. I also understand the limitations of Go's connection pooling and am disappointed that Go's Issue #20807 languished unresolved after hopeful initial attention.

However, the changes made to address #275 violate the contract with Go's connection pool, break Go's widely advertised automatic retry logic, and interfere with the pool's connection management (even though bad connections will eventually be purged). I believe that these costs are not worth the benefit of the detailed message.

Are the mssql maintainers amenable to rethinking Issue #275 and discussions of alternate approaches?

For handy reference, here is the code for sql.go's QueryContext. You can see that by not returning driver.ErrBadConn, mssql causes the connection pool logic to break out of the loop and return an error without making any retry attempts:

func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

Similar retry logic exists throughout sql.go, and it is all broken by returning specific errors rather than driver.ErrBadConn.

@fineol (Contributor, Author) commented Jul 3, 2020

Issue #275 and the related Go issue #20807 could be solved with error wrapping. That wasn't an option back when these issues were first raised, but it is now. Of course, that requires coordinated changes in both database/sql and go-mssqldb.

Absent a complete fix, could we at least make the behavior configurable? Users could choose which they prefer: either Go's standard connection pool management with automatic retries or detailed connection error messages.

I can think of a couple of mechanisms for configuring the behavior. One would be a function similar to the existing SetLogger function. Call it EnableErrorDetail or SetErrorPolicy or something along those lines.

Another would be to add more driver names to differentiate the behavior, similar to the way "mssql" and "sqlserver" currently yield different behavior. I don't like this idea, but it is an option.

@fineol (Contributor, Author) commented Jul 3, 2020

P.S. I did look into the new Raw function in Go's sql module that allows access to the underlying driver connection, hoping that it might provide a mechanism mssql could use to expose a detailed "last error" message from the driver connection. It would work, except that Raw does not grant access to connections that have been closed, which are exactly the connections from which we would want a detailed last error.

@fineol (Contributor, Author) commented Jul 15, 2020

I see that the database/sql/driver package in Go 1.15 will include a new Validator interface that appears to specifically target the problem raised in Issue #275:

Validator may be implemented by Conn to allow drivers to signal if a connection is valid or if it should be discarded.

If implemented, drivers may return the underlying error from queries, even if the connection should be discarded by the connection pool.

Are there plans to implement the new Validator interface in go-mssqldb and, in doing so, correct the bug introduced by the work-around for Issue #275?

@shueybubbles (Contributor) commented

Would there be value in a 2 part fix?

  1. for Go 1.15+ builds, add Validator support
  2. For pre-1.15, add a connection string property to tell the driver to use the Go contract. If that property is set, it will revert to the pre-275 behavior.

@kardianos (Collaborator) commented

Linked PR adds IsValid to the connection. This should help. Then we can update the errors returned to be something more useful.

@fineol (Contributor, Author) commented Jul 8, 2021

Supporting the Validator interface is good, but it doesn't change the retry behavior and won't help with this issue. I was hoping that it would when I mentioned it last year, but now that I have seen its implementation within the database/sql package, I understand that it does not.

But first, a demonstration. Here is a test program. It creates a pool of five connections, fills the pool with five simultaneous queries, then runs a series of 11 total queries to demonstrate the error/retry behavior.

The first three queries are normal, successful queries.

The next eight queries are run after SQL Server is restarted, which closes the socket on which SQL Server is listening and thus makes all five connections in the pool bad.

To successfully demonstrate this program, you must manually restart the SQL Server service during the 6-second pause built in to the program.

package main

import (
	"database/sql"
	"fmt"
	"sync"
	"time"

	_ "github.com/denisenkom/go-mssqldb"
)

func main() {

	fmt.Println("Opening connection pool")

	db, err := sql.Open("sqlserver", "server=192.168.32.129;port=1433;user id=conntest;password=XXX;database=conntest;app name=conntester")

	if err != nil {
		fmt.Printf("Error opening database connection: %v\n", err)
		return
	}

	db.SetMaxIdleConns(5)
	db.SetMaxOpenConns(5)
	db.SetConnMaxLifetime(0)

	fmt.Println("Pool size set to 5 with infinite lifetime")

	var wg sync.WaitGroup

	for i := 1; i < 6; i++ {
		wg.Add(1)
		fmt.Printf("Creating pool connection %d\n", i)
		go prime(&wg, db, i)
	}
	wg.Wait()

	fmt.Println("Pool filled")

	fmt.Println("Executing 3 queries to demonstrate successful pool setup")
	for i := 1; i < 4; i++ {
		query(i, db)
	}

	fmt.Println("Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.")
	time.Sleep(6 * time.Second)

	fmt.Println("Executing 8 more queries to demonstrate 5 must fail to drain pool of bad connections")
	for i := 1; i < 9; i++ {
		query(i, db)
	}
}

func prime(wg *sync.WaitGroup, db *sql.DB, i int) {

	defer wg.Done()

	r, err := db.Query("waitfor delay '00:00:01'; select getdate();")

	if err != nil {
		fmt.Printf("Error creating pool connection %d: %v\n", i, err)
		return
	}

	r.Close()
}

func query(i int, db *sql.DB) {

	r, err := db.Query("declare @delay varchar(100);select @delay = '00:00:0' + left(cast(rand()/4 as varchar(10)),5); waitfor delay @delay; select getdate();")

	if err != nil {
		fmt.Printf("Query %d: Error executing database query: %v\n", i, err)
		return
	}

	defer r.Close()

	for r.Next() {
		var now time.Time
		if err := r.Scan(&now); err != nil {
			fmt.Printf("Query %d: Error scanning results: %v\n", i, err)
			break
		}
		fmt.Printf("Query %d: %v\n", i, now)
	}
}

When you run this program using commit 3516239 of go-mssqldb (the commit immediately preceding kardianos' Validator commit), you get these results:

Opening connection pool
Pool size set to 5 with infinite lifetime
Creating pool connection 1
Creating pool connection 2
Creating pool connection 3
Creating pool connection 4
Creating pool connection 5
Pool filled
Executing 3 queries to demonstrate successful pool setup
Query 1: 2021-07-08 11:25:28.467 +0000 UTC
Query 2: 2021-07-08 11:25:28.59 +0000 UTC
Query 3: 2021-07-08 11:25:28.73 +0000 UTC
Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.
Executing 8 more queries to demonstrate 5 must fail to drain pool of bad connections
Query 1: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51228->192.168.32.129:1433: write: connection reset by peer
Query 2: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51234->192.168.32.129:1433: write: connection reset by peer
Query 3: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51236->192.168.32.129:1433: write: connection reset by peer
Query 4: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51232->192.168.32.129:1433: write: connection reset by peer
Query 5: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51230->192.168.32.129:1433: write: connection reset by peer
Query 6: 2021-07-08 11:25:41.297 +0000 UTC
Query 7: 2021-07-08 11:25:43.92 +0000 UTC

Notice that five queries fail, one per bad connection in the pool, before queries begin to succeed again. If database/sql's automatic retry logic hadn't been defeated by the changes for #275, all queries would have succeeded.

Next is the same program, but this time run with kardianos' Validator commit 333f2e3:

Opening connection pool
Pool size set to 5 with infinite lifetime
Creating pool connection 1
Creating pool connection 2
Creating pool connection 3
Creating pool connection 4
Creating pool connection 5
Pool filled
Executing 3 queries to demonstrate successful pool setup
Query 1: 2021-07-08 11:42:05.983 +0000 UTC
Query 2: 2021-07-08 11:42:06.06 +0000 UTC
Query 3: 2021-07-08 11:42:06.28 +0000 UTC
Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.
Executing 8 more queries to demonstrate 5 must fail to drain pool of bad connections
Query 1: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51270->192.168.32.129:1433: write: connection reset by peer
Query 2: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51274->192.168.32.129:1433: write: connection reset by peer
Query 3: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51272->192.168.32.129:1433: write: connection reset by peer
Query 4: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51266->192.168.32.129:1433: write: connection reset by peer
Query 5: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:51268->192.168.32.129:1433: write: connection reset by peer
Query 6: 2021-07-08 11:42:28.327 +0000 UTC
Query 7: 2021-07-08 11:42:28.36 +0000 UTC

Externally, the behavior is exactly the same. Five queries, one per bad connection in the pool, must fail before the pool is emptied and queries begin to succeed again.

Under the covers, of course, the behavior is slightly different. Without the Validator interface, the bad connections are returned to the pool and then removed the next time they are used. In this program, that happens during queries 6, 7, and 8 when go-mssqldb's ResetSession method is called:

func (c *Conn) ResetSession(ctx context.Context) error {
	if !c.connectionGood {
		return driver.ErrBadConn
	}
	// ...
}

Whereas with the Validator interface, the bad connections are removed before they are returned to the pool. In this program, that happens during queries 1 through 5 when database/sql's putConn calls go-mssqldb's IsValid function via the validateConnection function:

https://github.com/golang/go/blob/2ca44fe2213b53ccaf6f555c11858c6e36490624/src/database/sql/sql.go#L1414-L1419

func (db *DB) putConn(dc *driverConn, err error, resetSession bool) {
	if err != driver.ErrBadConn {
		if !dc.validateConnection(resetSession) {
			err = driver.ErrBadConn
		}
	}
	// ...
}

The above code is the only place in database/sql where the Validator interface is called. That is why implementing Validator doesn't change the external behavior. The error is reset locally within putConn's scope to remove the bad connection before it is returned to the pool, but the original error is returned to QueryContext, which continues to check for driver.ErrBadConn in its auto-retry logic. Therefore, the auto-retry logic does not work:

https://github.com/golang/go/blob/2ca44fe2213b53ccaf6f555c11858c6e36490624/src/database/sql/sql.go#L1670-L1683

func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

In the end, implementing the Validator interface doesn't change much for go-mssqldb. Externally it makes no difference. Internally, the bad connections are discarded slightly earlier in the process, but that doesn't have any material effect. Effectively, the Validator interface formalizes what go-mssqldb already implemented in response to issue #275. That's a good thing, but it doesn't help those who want auto-retry.

So what do we do for users who want to use database/sql's standard auto-retry logic? I propose adding an ErrorHandler interface to go-mssqldb. The interface would have a single function named HandleError:

HandleError(queryCtx context.Context, transactionCtx context.Context, stmt string, connectionGood bool, err error) error

The inputs give the implementer information about what went wrong and where it went wrong. The output is the error that should be returned to database/sql. This allows the implementer to log the error, change it, or take any other action appropriate to their application. A simple implementation to restore database/sql's default auto-retry logic would be:

func (h AutoRetryErrorHandler) HandleError(queryCtx context.Context, transactionCtx context.Context,
	stmt string, connectionGood bool, err error) error {

	if !connectionGood {
		return driver.ErrBadConn
	}
	return err
}

If the user supplies an implementation of HandleError, checkBadConn would call it just before returning. If the user does not supply an implementation, checkBadConn would return as it normally does now.

I believe this would maintain go-mssqldb's existing behavior for backwards compatibility while allowing those who wish to fine-tune the error behavior to do so.

Is this something the maintainers would accept?

@kardianos (Collaborator) commented

This is a great write up.

  1. This should be addressed at the database/sql level for sure.
  2. Let me think about your proposal for this repo.

@fineol (Contributor, Author) commented Jul 8, 2021

This is a great write up.

Thank you

  1. This should be addressed at the database/sql level for sure.

There are a lot of nice things in the database/sql package, but there is room for improvement too.

It seems that the designers wanted to shield end users from the complexity of pool management. That's a good thing in general. Getting a connection pool to work correctly and efficiently is not an easy task, so relieving the end user of that responsibility makes the package much easier to use. But at the same time, it causes issues when the built-in behavior doesn't meet your needs.

Do you want more detailed error messages like d1rewolf did in issue #275? It can't be done, unless the driver maintainers create a work-around the way go-mssqldb's maintainers did.

Do you want to control what queries get retried (e.g. auto-retry read-only queries but not queries that write data)? Can't be done.

Do you want to auto-retry queries but still log errors that were auto retried (this could help alert operators to degradation before it becomes severe enough to overcome the retry logic)? Can't be done.

Do you want to flush the pool to get rid of bad/suspect connections? Can't be done, unless you close the pool altogether and create a new one, with the issues that entails.

Do you want to instruct the pool to use a fresh, new connection rather than re-using one from the pool? Can't be done.

The last point particularly gets in the way of taking control of retry logic yourself. I could try to write my own retry logic for use with go-mssqldb that emulates the built-in logic of database/sql. That built-in logic tries the query maxBadConnRetries times (currently 2) with connections that are either pulled from the pool or created new as needed, and after the second failure tries once more with a guaranteed new connection:

https://github.com/golang/go/blob/2ca44fe2213b53ccaf6f555c11858c6e36490624/src/database/sql/sql.go#L1670-L1683

func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

However, end users don't have access to the cachedOrNewConn and alwaysNewConn flags. We have to take what the pool gives us. If the entire pool is polluted and the auto-retry logic is bypassed, as it is with go-mssqldb, then we have to make at least pool-size attempts before we have a chance of getting a new connection. My test program demonstrates this.

Of course, database/sql is well established and used in countless programs. Altering its behavior in a non-backwards compatible way is out of the question. Any changes in it would need to maintain the existing behavior as default.

That brings us to:

  1. Let me think about your proposal for this repo.

I don't want to make a breaking change in this driver either, which is why I am proposing an optional interface that allows the default behavior to be augmented rather than changed.

Here are a few more details about my thinking regarding the interface signature:

  1. The two context arguments allow the handler access to any information the caller included with the query. Such information could include a trace id or correlation id for tracing or logging, a flag that indicates the class of query (e.g. read-only, writable), or even a query-specific function (likely as another interface itself) that knows how to handle errors for that specific query. These are totally flexible, limited only by the needs and imagination of the designer. I included both the query and transaction contexts, because I think either of them could be non-nil for any given query.
  2. The stmt argument gives the actual SQL being executed.
  3. The connectionGood argument gives insight into how the driver itself classifies the error and state of the connection.
  4. The err argument contains the details of the error.

I think the combination of all of those gives the implementer enough information to make conditional decisions on how to handle particular errors. I am open to suggestions for more or fewer arguments.

In my personal use case I would trace and log the error with a trace ID that is already included in all my database queries, and then instruct database/sql to automatically retry queries that failed due to a bad connection. The queries I deal with are read-only.

@fineol (Contributor, Author) commented Jul 9, 2021

To make my proposal more concrete and give you something easier to evaluate, I have implemented a proof of concept in commit 9cbf4b8 of my fork https://github.com/fineol/go-mssqldb, using the existing SetLogger feature as a guide for the sake of consistency.

The implementation is not complete. For example, it does not yet pass the contexts and statement. Nor are there any unit tests. But it illustrates my idea and demonstrates the functionality.

Running the test program from #586 (comment) with this fork, you get the same results, because the change is designed to be backwards compatible:

Opening connection pool
Pool size set to 5 with infinite lifetime
Creating pool connection 1
Creating pool connection 2
Creating pool connection 3
Creating pool connection 4
Creating pool connection 5
Pool filled
Executing 3 queries to demonstrate successful pool setup
Query 1: 2021-07-09 12:18:31.473 +0000 UTC
Query 2: 2021-07-09 12:18:31.583 +0000 UTC
Query 3: 2021-07-09 12:18:31.817 +0000 UTC
Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.
Executing 8 more queries to demonstrate 5 must fail to drain pool of bad connections
Query 1: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:54314->192.168.32.129:1433: write: connection reset by peer
Query 2: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:54320->192.168.32.129:1433: write: connection reset by peer
Query 3: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:54316->192.168.32.129:1433: write: connection reset by peer
Query 4: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:54318->192.168.32.129:1433: write: connection reset by peer
Query 5: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:54322->192.168.32.129:1433: write: connection reset by peer
Query 6: 2021-07-09 12:18:56.317 +0000 UTC
Query 7: 2021-07-09 12:18:56.443 +0000 UTC
Query 8: 2021-07-09 12:18:56.55 +0000 UTC

But by adding to the test program the single line gomssql.SetErrorHandler(gomssql.AutoRetryErrorHandler{}), I override the default error handling behavior with an ErrorHandler that re-enables database/sql's default auto-retry logic. This AutoRetryErrorHandler is included in go-mssqldb for the convenience of those who want auto-retry but don't want to be bothered with the potential complexity/danger of customizing error handling.

Here is the updated test program:

package main

import (
	"database/sql"
	"fmt"
	"sync"
	"time"

	gomssql "github.com/denisenkom/go-mssqldb"
)

func main() {

	fmt.Println("Opening connection pool")

	gomssql.SetErrorHandler(gomssql.AutoRetryErrorHandler{})

	db, err := sql.Open("sqlserver", "server=192.168.32.129;port=1433;user id=conntest;password=XXX;database=conntest;app name=conntester")

	if err != nil {
		fmt.Printf("Error opening database connection: %v\n", err)
		return
	}

	db.SetMaxIdleConns(5)
	db.SetMaxOpenConns(5)
	db.SetConnMaxLifetime(0)

	fmt.Println("Pool size set to 5 with infinite lifetime")

	var wg sync.WaitGroup

	for i := 1; i < 6; i++ {
		wg.Add(1)
		fmt.Printf("Creating pool connection %d\n", i)
		go prime(&wg, db, i)
	}
	wg.Wait()

	fmt.Println("Pool filled")

	fmt.Println("Executing 3 queries to demonstrate successful pool setup")
	for i := 1; i < 4; i++ {
		query(i, db)
	}

	fmt.Println("Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.")
	time.Sleep(6 * time.Second)

	fmt.Println("Executing 8 more queries to demonstrate 5 must fail to drain pool of bad connections")
	for i := 1; i < 9; i++ {
		query(i, db)
	}
}

func prime(wg *sync.WaitGroup, db *sql.DB, i int) {

	defer wg.Done()

	r, err := db.Query("waitfor delay '00:00:01'; select getdate();")

	if err != nil {
		fmt.Printf("Error creating pool connection %d: %v\n", i, err)
		return
	}

	r.Close()
}

func query(i int, db *sql.DB) {

	r, err := db.Query("declare @delay varchar(100);select @delay = '00:00:0' + left(cast(rand()/4 as varchar(10)),5); waitfor delay @delay; select getdate();")

	if err != nil {
		fmt.Printf("Query %d: Error executing database query: %v\n", i, err)
		return
	}

	defer r.Close()

	for r.Next() {
		var now time.Time
		if err := r.Scan(&now); err != nil {
			fmt.Printf("Query %d: Error scanning results: %v\n", i, err)
			break
		}
		fmt.Printf("Query %d: %v\n", i, now)
	}
}

And here are the results:

Opening connection pool
Pool size set to 5 with infinite lifetime
Creating pool connection 1
Creating pool connection 2
Creating pool connection 3
Creating pool connection 4
Creating pool connection 5
Pool filled
Executing 3 queries to demonstrate successful pool setup
Query 1: 2021-07-09 12:19:40.007 +0000 UTC
Query 2: 2021-07-09 12:19:40.13 +0000 UTC
Query 3: 2021-07-09 12:19:40.333 +0000 UTC
Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.
Executing 8 more queries to demonstrate 5 must fail to drain pool of bad connections
Query 1: 2021-07-09 12:19:52.16 +0000 UTC
Query 2: 2021-07-09 12:19:54.757 +0000 UTC
Query 3: 2021-07-09 12:19:56.13 +0000 UTC
Query 4: 2021-07-09 12:19:56.32 +0000 UTC
Query 5: 2021-07-09 12:19:56.507 +0000 UTC
Query 6: 2021-07-09 12:19:56.677 +0000 UTC
Query 7: 2021-07-09 12:19:56.85 +0000 UTC
Query 8: 2021-07-09 12:19:57.02 +0000 UTC

All eight queries after the restart now complete successfully, because the errors were converted back to driver.ErrBadConn by the applied ErrorHandler and thus swallowed by database/sql's auto-retry logic. The bad connections were flushed from the pool during the execution of queries 1 through 3.

What do you think?

@shueybubbles (Contributor) commented

I think the ability to tell go-mssqldb to revert to the "old" behavior is useful, and perhaps it should go beyond adding a way for app developers to use it. Driver-level behavior is often buried in application-level logic, and I suspect there are end users running applications whose code they don't control who would also like this behavior.
For other programming languages and driver models, SQL Server has typically relied on connection string parameters to inform driver behavior. Connection strings are often stored in config so they can be modified without changing application code. With the Go module pattern it's not clear to me that an end user could upgrade the go-mssqldb driver without also upgrading the app, but adding this option as a connection string parameter might be helpful.

@fineol (Contributor, Author) commented Jul 9, 2021

I can see that adding a connection string parameter to enable auto-retry could be useful in scenarios that you describe, shueybubbles. However, making it a connection string parameter implies that the setting can vary by connection.

I would still want to keep the SetErrorHandler function for use by developers who want finer grained control. That then brings up the question: which takes precedence? Does a connection string parameter override SetErrorHandler or does SetErrorHandler override the connection string parameter?

I lean towards SetErrorHandler overriding the connection string parameter. If a developer has gone to the trouble of explicitly specifying an ErrorHandler, there was likely a good reason for it and not something we would want an end user overriding without knowing the implications.

@shueybubbles (Contributor) commented

I agree that code should override config. Implementation-wise, it would probably just mean that the default error handler provided by go-mssql would switch between new and old behavior based on the connection string. If the app provides some other error handler the property in the connection string wouldn't be checked, though conceivably there could be a way for an app's error handler to fall back to the default error handler in some cases.

@fineol (Contributor, Author) commented Jul 9, 2021

We could also include the value of a connection string auto-retry parameter as an additional argument to the HandleError function of the ErrorHandler interface. That way a developer who implements a custom ErrorHandler could vary its behavior based on user configuration, the same way you propose the default handler would.

@fineol (Contributor, Author) commented Jul 9, 2021

Commit a47dcd0 of my fork https://github.com/fineol/go-mssqldb passes contexts to checkBadConn and HandleError. With this the developer can implement a custom ErrorHandler that can make use of information included in the query contexts.

My updated test program to demonstrate this feature is:

package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"fmt"
	"sync"
	"time"

	gomssql "github.com/denisenkom/go-mssqldb"
)

type TraceIDKey struct{}

type LoggingRetryHandler struct{}

func (l LoggingRetryHandler) HandleError(
	queryCtx context.Context,
	transactionCtx context.Context,
	stmt string,
	connectionGood bool,
	err error) error {

	if !connectionGood {
		if traceID, ok := queryCtx.Value(TraceIDKey{}).(int); ok {
			fmt.Printf("Retrying query %d due to error \"%v\"\n", traceID, err)
		}
		return driver.ErrBadConn
	}
	return err
}

func main() {

	fmt.Println("Opening connection pool")

	gomssql.SetErrorHandler(LoggingRetryHandler{})

	db, err := sql.Open("sqlserver", "server=192.168.32.129;port=1433;user id=conntest;password=XXX;database=conntest;app name=conntester")

	if err != nil {
		fmt.Printf("Error opening database connection: %v\n", err)
		return
	}

	db.SetMaxIdleConns(5)
	db.SetMaxOpenConns(5)
	db.SetConnMaxLifetime(0)

	fmt.Println("Pool size set to 5 with infinite lifetime")

	var wg sync.WaitGroup

	for i := 1; i < 6; i++ {
		wg.Add(1)
		fmt.Printf("Creating pool connection %d\n", i)
		go prime(&wg, db, i)
	}
	wg.Wait()

	fmt.Println("Pool filled")

	fmt.Println("Executing 3 queries to demonstrate successful pool setup")
	for i := -3; i < 0; i++ {
		query(i, db)
	}

	fmt.Println("Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.")
	time.Sleep(6 * time.Second)

	fmt.Println("Executing 8 more queries to demonstrate handling of bad connections in pool")
	for i := 1; i < 9; i++ {
		query(i, db)
	}
}

func prime(wg *sync.WaitGroup, db *sql.DB, i int) {

	defer wg.Done()

	r, err := db.Query("waitfor delay '00:00:01'; select getdate();")

	if err != nil {
		fmt.Printf("Error creating pool connection %d: %v\n", i, err)
		return
	}

	r.Close()
}

func query(i int, db *sql.DB) {

	ctx := context.WithValue(context.Background(), TraceIDKey{}, i)

	r, err := db.QueryContext(ctx, "declare @delay varchar(100);select @delay = '00:00:0' + left(cast(rand()/4 as varchar(10)),5); waitfor delay @delay; select getdate();")

	if err != nil {
		fmt.Printf("Query %d: Error executing database query: %v\n", i, err)
		return
	}

	defer r.Close()

	for r.Next() {
		var now time.Time
		if err := r.Scan(&now); err != nil {
			fmt.Printf("Query %d: Error scanning results: %v\n", i, err)
			break
		}
		fmt.Printf("Query %d: %v\n", i, now)
	}
	if err = r.Err(); err != nil {
		fmt.Printf("Query %d: Error getting next row: %v\n", i, err)
	}
}

If you comment out the line that sets the custom ErrorHandler (gomssql.SetErrorHandler(LoggingRetryHandler{})), you get go-mssqldb's standard behavior:

Opening connection pool
Pool size set to 5 with infinite lifetime
Creating pool connection 1
Creating pool connection 2
Creating pool connection 3
Creating pool connection 4
Creating pool connection 5
Pool filled
Executing 3 queries to demonstrate successful pool setup
Query -3: 2021-07-09 19:14:58.1 +0000 UTC
Query -2: 2021-07-09 19:14:58.13 +0000 UTC
Query -1: 2021-07-09 19:14:58.3 +0000 UTC
Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.
Executing 8 more queries to demonstrate handling of bad connections in pool
Query 1: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:56150->192.168.32.129:1433: write: connection reset by peer
Query 2: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:56148->192.168.32.129:1433: write: connection reset by peer
Query 3: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:56152->192.168.32.129:1433: write: connection reset by peer
Query 4: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:56146->192.168.32.129:1433: write: connection reset by peer
Query 5: Error executing database query: failed to send SQL Batch: write tcp 192.168.32.130:56144->192.168.32.129:1433: write: connection reset by peer
Query 6: 2021-07-09 19:15:18.193 +0000 UTC
Query 7: 2021-07-09 19:15:18.413 +0000 UTC
Query 8: 2021-07-09 19:15:18.6 +0000 UTC

But if you enable the custom ErrorHandler, all eight test queries succeed and the underlying errors are logged with tracing information (the query number, in this case):

Opening connection pool
Pool size set to 5 with infinite lifetime
Creating pool connection 1
Creating pool connection 2
Creating pool connection 3
Creating pool connection 4
Creating pool connection 5
Pool filled
Executing 3 queries to demonstrate successful pool setup
Query -3: 2021-07-09 19:16:35.7 +0000 UTC
Query -2: 2021-07-09 19:16:35.933 +0000 UTC
Query -1: 2021-07-09 19:16:36.12 +0000 UTC
Pausing for 6 seconds while you restart SQL Server. This will pollute all 5 connections in the pool.
Executing 8 more queries to demonstrate handling of bad connections in pool
Retrying query 1 due to error "failed to send SQL Batch: write tcp 192.168.32.130:56172->192.168.32.129:1433: write: connection reset by peer"
Retrying query 1 due to error "failed to send SQL Batch: write tcp 192.168.32.130:56168->192.168.32.129:1433: write: connection reset by peer"
Query 1: 2021-07-09 19:16:42.403 +0000 UTC
Retrying query 2 due to error "failed to send SQL Batch: write tcp 192.168.32.130:56174->192.168.32.129:1433: write: connection reset by peer"
Retrying query 2 due to error "failed to send SQL Batch: write tcp 192.168.32.130:56166->192.168.32.129:1433: write: connection reset by peer"
Query 2: 2021-07-09 19:16:42.48 +0000 UTC
Retrying query 3 due to error "failed to send SQL Batch: write tcp 192.168.32.130:56170->192.168.32.129:1433: write: connection reset by peer"
Query 3: 2021-07-09 19:16:42.56 +0000 UTC
Query 4: 2021-07-09 19:16:42.62 +0000 UTC
Query 5: 2021-07-09 19:16:42.683 +0000 UTC
Query 6: 2021-07-09 19:16:42.747 +0000 UTC
Query 7: 2021-07-09 19:16:42.81 +0000 UTC
Query 8: 2021-07-09 19:16:42.857 +0000 UTC

With this additional information, you can see where the bad connections are flushed. The auto-retry logic in database/sql makes two attempts per query with cached connections from the pool, discarding each as they fail, and then switches to a new connection that succeeds. It repeats this until all five bad connections are flushed, finishing the job in query 3.

@tc-hib
Contributor

tc-hib commented Jul 10, 2021

Hello,

I don't think we should require a specific handler to have the connection pool work as it should.
Standard behavior should be available by default.

The case that's not handled by the Validator interface should probably be handled by only returning ErrBadConn at the very beginning of the query, when we find out the connection is lost before we could do anything.

We don't want ErrBadConn to be returned when something might have happened on the server; we might even have T-SQL errors to return along with the communication error.

On a side note, I've tested the following query, which closes the current connection (because of a severity above 20):

RAISERROR ('goodbye', 20, 42) WITH LOG

There are 2 problems here:

  • The error is "SQL Server had internal error" instead of the 3 T-SQL errors.
  • It does not unset connectionGood, so it has the same outcome as restarting the server.

@kardianos
Collaborator

I agree with @tc-hib , we should not require a special handler.

I think there are four points we can address:

  1. Adjust database/sql to use errors.Is(err, driver.ErrBadConn) rather than a direct == comparison.
  2. Add an Unwrap() error to mssql.Error, where if the connection is bad, return it. In fact, we should do this right away.
  3. Allow a setting in the DSN and msdsn.Config to return driver.ErrBadConn when we get a bad connection, rather than the actual error, so the connection pool in older versions of Go will retry.
  4. Inspect and better understand what errors will result in a "bad connection".

If all agree, this should fix the issue, and we can test it empirically (let me know if you need help compiling Go from source and applying the std lib Go CL I'm about to send).

@kardianos
Collaborator

Please see https://go-review.googlesource.com/c/go/+/333949 for a std lib change that wouldn't land until Go1.18 if accepted. See (1) in the above list.

@tc-hib Can you verify that this would resolve this problem (assuming we implement (2))?

@tc-hib
Contributor

tc-hib commented Jul 12, 2021

I'll have a look as soon as I can.

Meanwhile, I want to make sure we don't forget the cases when we should not return ErrBadConn at all.
Maybe I didn't understand how things work but, to me, the Validator interface is complementary to ErrBadConn in that it avoids retrying a query that failed in the middle of its execution.
When Validator can do the job, we usually do not want ErrBadConn.

For example, imagine you run a stored procedure that does several things, and the communication is lost in the middle of it. Returning ErrBadConn would be harmful.

I don't know enough about the driver, but I suppose we should only return ErrBadConn when trying to send a query, shouldn't we?

@fineol
Contributor Author

fineol commented Jul 12, 2021

I don't object to wrapping the errors and using errors.Is within database/sql. I suggested that as a solution in my comment a year ago when I first opened this issue:

Issue #275 and the related sql.go Issue 20807 could be solved with error wrapping. That wasn't an option back when these issues were first raised, but it is now. Of course, that requires coordinated changes in both the sql and mssql modules.

It didn't get any traction then, which I assumed was either because there wasn't a desire to change this driver's bad connection behavior yet again, or because database/sql's retry logic was felt to be harmful and should be deliberately avoided. Those are both valid concerns (backwards compatibility and inadvertently retrying a query that ought not be automatically retried), which is why I proposed the separate ErrorHandler interface to let developers pick the behavior needed for their applications without changing the existing behavior.

However, I think both of those concerns may be addressed by kardianos' third and fourth points. tc-hib's idea of only invoking the retry logic when first starting a query has merit as well, and may already be effectively in place; for example, I don't believe a retry occurs when calling Next through a returned recordset.

@kardianos, would point three be something like adding an optional "BadConnAction" parameter to the DSN that accepts values "Fail" and "Retry"? If so, I'd like to see that implemented now without waiting for 1.18 (assuming your proposal for database/sql is accepted). I've already looked at the code and don't think it would be hard to do.

@kardianos
Collaborator

I don't know enough about the driver, but I suppose we should only return ErrBadConn when trying to send a query, shouldn't we?

Yes. We should only return ErrBadConn when we want to retry during some type of dialing or initialization problem. Most any other situation we should return the actual error.

would point three be something like adding an optional "BadConnAction" parameter to the DSN that accepts values "Fail" and "Retry"?

That's roughly point (3) in my list.


I would ideally like to come up with a testcase either unit test or motivating example that is easy to walk someone through, and easily reproducible.

@shueybubbles
Contributor

Are there situations where an error class >= 20 doesn't result in the server closing the connection but the client should treat the connection as bad anyway?

@fineol
Contributor Author

fineol commented Jul 13, 2021

Here is a unit test that demonstrates the problem using a single connection. It is modeled after the existing TestDisconnect1 and TestDisconnect2 tests in queries_test.go:

// TestDisconnect3 tests that a bad connection is removed from the pool and
// replaced with a new connection when a subsequent query executes
func TestDisconnect3(t *testing.T) {
	if testing.Short() {
		t.Skip("short")
	}
	checkConnStr(t)
	SetLogger(testLogger{t})

	// Revert to the normal dialer after the test is done.
	normalCreateDialer := createDialer
	defer func() {
		createDialer = normalCreateDialer
	}()

	var di *dialerInterrupt
	createDialer = func(p *connectParams) Dialer {
		nd := netDialer{&net.Dialer{Timeout: p.dial_timeout, KeepAlive: p.keepAlive}}
		di = &dialerInterrupt{nd: nd}
		return di
	}

	// Open a connection pool that holds a single connection to make sure all
	// queries run on the same connection.
	db, err := sql.Open("sqlserver", makeConnStr(t).String())
	if err != nil {
		t.Log(err)
	}
	defer db.Close()
	db.SetMaxIdleConns(1)
	db.SetMaxOpenConns(1)

	// Demonstrate connection is good the first time we use it
	var resultBefore string
	row := db.QueryRow("select 'before disconnect'")
	if err = row.Scan(&resultBefore); err != nil || resultBefore != "before disconnect" {
		t.Fatalf("Got result = '%s', err = %v, wanted result = 'before disconnect, err = <nil>", resultBefore, err)
	}

	// Break the connection
	di.Interrupt(true)
	di.Interrupt(false)

	// Broken connection should cause the next query to initially fail, but then
	// database/sql should discard the bad connection, automatically retry the query,
	// and ultimately succeed.
	var resultAfter string
	row = db.QueryRow("select 'after disconnect'")
	if err = row.Scan(&resultAfter); err != nil || resultAfter != "after disconnect" {
		t.Fatalf("Got result = '%s', err = %v, wanted result = 'after disconnect, err = <nil>", resultAfter, err)
	}
}

Here are the results of running it as go-mssqldb exists today:

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestDisconnect3$ github.com/denisenkom/go-mssqldb

--- FAIL: TestDisconnect3 (0.02s)
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenLoginAck
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenEnvChange
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenDone
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got DONE or DONEPROC status=0
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: select 'before disconnect'
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenColMetadata
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenRow
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenDone
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got DONE or DONEPROC status=16
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: (1 row(s) affected)
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: select 'after disconnect'
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: Failed to send SqlBatch with disconnect
    /home/ctest/go/src/go-mssqldb/queries_test.go:2283: Got result = '', err = failed to send SQL Batch: disconnect, wanted result = 'after disconnect, err = <nil>
FAIL
FAIL	github.com/denisenkom/go-mssqldb	0.030s
FAIL

Here are the results of running it with a version of go-mssqldb modified to accept and use a BadConnAction=Retry connection string parameter (along the lines of kardianos' point three):

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestDisconnect3$ github.com/denisenkom/go-mssqldb

ok  	github.com/denisenkom/go-mssqldb	0.065s

@fineol
Contributor Author

fineol commented Jul 13, 2021

Here is an alternative unit test that uses tc-hib's idea of raising a fatal error in SQL Server. This may be a bit simpler to understand for someone who isn't familiar with the internal dialerInterrupt used in go-mssqldb unit tests:

// TestDisconnect4 tests that a bad connection is removed from the pool and
// replaced with a new connection when a subsequent query executes
func TestDisconnect4(t *testing.T) {
	if testing.Short() {
		t.Skip("short")
	}
	checkConnStr(t)
	SetLogger(testLogger{t})

	// Open a connection pool that holds a single connection to make sure all
	// queries run on the same connection.
	db, err := sql.Open("sqlserver", makeConnStr(t).String())
	if err != nil {
		t.Log(err)
	}
	defer db.Close()
	db.SetMaxIdleConns(1)
	db.SetMaxOpenConns(1)

	// Demonstrate connection is good the first time we use it
	var resultBefore string
	row := db.QueryRow("select 'before disconnect'")
	if err = row.Scan(&resultBefore); err != nil || resultBefore != "before disconnect" {
		t.Fatalf("Got result = '%s', err = %v, wanted result = 'before disconnect, err = <nil>", resultBefore, err)
	}

	// Break the connection by raising a fatal error in SQL Server. When you raise an
	// error of severity 20 or higher, SQL Server closes the client connection.
	res, err := db.Exec("raiserror('goodbye', 20, 1) with log")
	if err == nil || err.Error() != "SQL Server had internal error" {
		t.Fatalf("Failed to break connection to SQL database. Got err = %v, wanted error = 'SQL Server had internal error'", err)
	}
	t.Log(res)

	// Broken connection should cause the next query to initially fail, but then
	// database/sql should discard the bad connection, automatically retry the query,
	// and ultimately succeed.
	var resultAfter string
	row = db.QueryRow("select 'after disconnect'")
	if err = row.Scan(&resultAfter); err != nil || resultAfter != "after disconnect" {
		t.Fatalf("Got result = '%s', err = %v, wanted result = 'after disconnect, err = <nil>", resultAfter, err)
	}
}

Results with current version of go-mssqldb:

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestDisconnect4$ github.com/denisenkom/go-mssqldb

--- FAIL: TestDisconnect4 (0.02s)
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenLoginAck
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenEnvChange
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenDone
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got DONE or DONEPROC status=0
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: select 'before disconnect'
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenColMetadata
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenRow
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenDone
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got DONE or DONEPROC status=16
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: (1 row(s) affected)
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: raiserror('goodbye', 20, 1) with log
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenEnvChange
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenError
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got ERROR 50000 goodbye
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: goodbye
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenError
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got ERROR 2745 Process ID 55 has raised user error 50000, severity 20. SQL Server is terminating this process.
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: Process ID 55 has raised user error 50000, severity 20. SQL Server is terminating this process.
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenError
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got ERROR 596 Cannot continue the execution because the session is in the kill state.
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: Cannot continue the execution because the session is in the kill state.
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got token tokenDone
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: got DONE or DONEPROC status=258
    /home/ctest/go/src/go-mssqldb/queries_test.go:2318: <nil>
    /home/ctest/go/src/go-mssqldb/tds_test.go:210: select 'after disconnect'
    /home/ctest/go/src/go-mssqldb/tds_test.go:206: Failed to send SqlBatch with write tcp 192.168.32.130:60214->192.168.32.129:1433: write: connection reset by peer
    /home/ctest/go/src/go-mssqldb/queries_test.go:2326: Got result = '', err = failed to send SQL Batch: write tcp 192.168.32.130:60214->192.168.32.129:1433: write: connection reset by peer, wanted result = 'after disconnect, err = <nil>
FAIL
FAIL	github.com/denisenkom/go-mssqldb	0.030s
FAIL

Results with modified version that auto-retries:

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestDisconnect4$ github.com/denisenkom/go-mssqldb

ok  	github.com/denisenkom/go-mssqldb	0.060s

@kardianos
Collaborator

@fineol I think this is a great unit test to add. Want to submit a PR?

Ideally, if the auto-retry fix is turned off, then it would return a mssql.Error that has a new Unwrap function that will return the driver.ErrBadConn if applicable. When turned on, driver.ErrBadConn would be directly returned.

@fineol
Contributor Author

fineol commented Jul 13, 2021

@shueybubbles asked:

Are there situations where an error class >= 20 doesn't result in the server closing the connection but the client should treat the connection as bad anyway?

Microsoft's documentation for raiserror states that 20 or higher always terminates the client connection:

Caution

Severity levels from 20 through 25 are considered fatal. If a fatal severity level is encountered, the client connection is terminated after receiving the message, and the error is logged in the error and application logs.

However, Microsoft's documentation for error severities is less adamant, stating only that in most cases the connection "may" terminate:

20-24 | Indicate system problems and are fatal errors, which means that the Database Engine task that is executing a statement or batch is no longer running. The task records information about what occurred and then terminates. In most cases, the application connection to the instance of the Database Engine may also terminate. If this happens, depending on the problem, the application might not be able to reconnect.

Therefore, it isn't clear if SQL Server always closes the connection. But if you inspect the detailed log messages in my last comment, you can see that go-mssqldb does get additional error information that it isn't currently returning to the caller, and that error information looks to be fairly conclusive about the connection being bad:

 got ERROR 2745 Process ID 55 has raised user error 50000, severity 20. SQL Server is terminating this process.
 got ERROR 596 Cannot continue the execution because the session is in the kill state.

I suspect that information can be relied on, but if there is any doubt, I think it is safe to mark the connection as bad on go-mssqldb's end whenever it detects an error of severity 20 or higher.
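The conservative rule above is simple to encode. A tiny sketch (only the severity-20 threshold comes from the documentation; the constant and function names are illustrative, not part of go-mssqldb):

```go
package main

import "fmt"

// Severities 20-25 are documented as fatal: the Database Engine task
// terminates and the client connection is usually closed. A driver
// applying the conservative rule marks its side of the connection bad
// whenever an error at or above this severity arrives, regardless of
// whether the server actually closed the socket.
const minFatalSeverity = 20

func shouldMarkConnBad(severity int) bool {
	return severity >= minFatalSeverity
}

func main() {
	fmt.Println(shouldMarkConnBad(16)) // false: ordinary statement error, session stays usable
	fmt.Println(shouldMarkConnBad(20)) // true: session is terminating
	fmt.Println(shouldMarkConnBad(24)) // true
}
```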

@fineol
Contributor Author

fineol commented Jul 13, 2021

@kardianos: Yes, I can submit a PR. Two questions though:

  1. Which version of the unit test do you prefer, 3 (with the dialer interrupt) or 4 (with the raiserror)?
  2. We wouldn't want to merge a unit test that fails. Will you hold off on merging until we have a fix, or are you asking for a PR that includes the wrapping fix as well, not just the unit test?

@kardianos
Collaborator

  1. Either one works for me. If you had to ask me, I'd use raiserror so both client and server know the connection drops.
  2. If you feel comfortable proposing a fix in the PR, that's great. Otherwise I (or someone else) can propose the fix.

@shueybubbles
Contributor

@fineol FWIW the ADO code seems to manage broken state based on fatal errors here: https://github.com/dotnet/SqlClient/blob/98ea349b0d37d2c5b80f0ebc70e3fe1d66832f31/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParserStateObject.cs#L3911
And they have their own bugs with this type of issue, like dotnet/SqlClient#971 (comment)

tc-hib added a commit to tc-hib/go-mssqldb that referenced this issue Jul 13, 2021
@tc-hib
Contributor

tc-hib commented Jul 13, 2021

I'd like to show with code how I thought this could have been fixed:
https://github.com/denisenkom/go-mssqldb/compare/master...tc-hib:fixing-586?expand=1

checkBadConn seems to be called everywhere.

So we might only have to use it two ways: either to only mark the connection as bad, or to return ErrBadConn as well, depending on the call site.
In that branch I also added a case for fatal server errors (a flag in the DONE token), which seems to correspond to errors with severity 20 and above. I'm not sure we should add a case for the ERROR token itself.

We need CL 333949 to have actual errors when we lose connections between two queries, or when the server is down.
If this is more important than the retry logic for some projects, we could have an option to always return the actual errors.

(I've had a look at the code from before #275: ErrBadConn was already returned only when sending queries. Sorry for stating the obvious :)

@fineol
Contributor Author

fineol commented Jul 13, 2021

I like tc-hib's proposal. It is good to separate the bad connection status from eligibility to be retried, even though the two combine in the end for the final retry decision. It is also good to mark the connection as bad when SQL Server responds with a fatal server error.

I pulled tc-hib's 05022b9 and integrated my unit tests. Both now pass, but that is partly because the proposal by itself effectively reverts the changes requested in #275, which tc-hib acknowledges with "We need CL 333949...". Therefore, I don't think we want to merge it until we also cover kardianos' point 3 (making the retry behavior optional).

I also note that with tc-hib's proposal, my TestDisconnect4 is no longer an effective test of breaking a connection between queries, because the proposal now (properly, I believe) marks the connection as bad before it finishes the raiserror query. Therefore, we really need both unit tests: 3 to test a connection going bad between queries, and 4 to test a connection going bad during a query.

I have done some work on kardianos' point 3 and will continue, combining it with tc-hib's proposal unless anyone objects. Then I can move on to points 1 and 2.

@tc-hib
Contributor

tc-hib commented Jul 14, 2021

Thanks.

I've updated https://github.com/denisenkom/go-mssqldb/compare/master...tc-hib:fixing-586?expand=1
Changes:

  • Wrap the mssql.Error in the fatal server error
  • Don't systematically return ErrBadConn instead of io.EOF
  • Wrap ErrBadConn in mssql.Error to try CL 333949

This is only a draft, but I think we need these changes in one way or another.

Edit: I've tried this with CL 333949, seems to work well.

@tc-hib
Contributor

tc-hib commented Jul 14, 2021

Please see https://go-review.googlesource.com/c/go/+/333949 for a std lib change that wouldn't land until Go1.18 if accepted. See (1) in the above list.

Isn't Unwrap usually used the other way around?
What if the actual error already wraps an error?

Maybe this could be handled with a specific optional interface instead.
It would provide a method that returns true if the connection pool may drop the connection and try another one.

func isBadConn(err error) bool {
    if err == driver.ErrBadConn {
        return true
    }
    // driver.BadConner is the proposed optional interface; it does not
    // exist in database/sql/driver today.
    if err2, ok := err.(driver.BadConner); ok {
        return err2.BadConn()
    }
    return false
}

@kardianos
Collaborator

@tc-hib The solution is this line in the errors.Is documentation:

An error is considered to match a target if it is equal to that target or if it implements a method Is(error) bool such that Is(target) returns true.

An error type might provide an Is method so it can be treated as equivalent to an existing error. For example, if MyError defines

func (m MyError) Is(target error) bool { return target == fs.ErrExist }
then Is(MyError{}, fs.ErrExist) returns true. See syscall.Errno.Is for an example in the standard library.

So if the error was of type mssql.Error we can set the BadConn flag on it to true and then create a new func:

func (e Error) Is(err error) bool {
    return e.BadConn && err == driver.ErrBadConn
}

If the error is not mssql.Error we can wrap it in a custom wrapper that also implements Is and passes through the inner error in the error message. This could be used for things like net.Dial errors if we wanted to.

@tc-hib
Contributor

tc-hib commented Jul 14, 2021

This is a great solution. I didn't know it existed.
Maybe your CL should include a comment about it.

@kardianos
Collaborator

Is it possible / likely to have a partially committed query and then get a raiserror 20 and abort the connection? Could this cause side effects?

@tc-hib
Contributor

tc-hib commented Jul 14, 2021

I think so, since any number of statements or transactions can be processed for one query.
I suppose a single statement would be all or nothing. (except for a few things such as IDENTITY columns which can't be rolled back)
Are you thinking about letting the pool retry a query after such an error?

IIRC the documentation only says you should ignore the resultset, and you can't be sure the server is still able to process new queries.
Could we reproduce such real errors, e.g. with a full disk?

There's a common use case for raiserror with severity 20: a script that is not supposed to run if some condition isn't met. But that is mostly useful when running a batch separated with GO in SSMS; I don't think any program should do that.

@shueybubbles
Contributor

AFAIK any severity 20+ error means the connection should be discarded, which is what the ADO driver does.
Other cases where the driver will “break” the connection are fatal/non-recoverable situations pre-identified in the driver, e.g. executing a query on a closed connection, an unexpected response or behavior from the server, or client-side delays in processing an attention signal. In some of these cases the server-side connection may or may not be closed, but the driver will mark the parser as broken, which signals the pooling layer to drop the TCP connection so it does not get reused, and the outer SqlConnection gets a new or pooled DB connection to work with when appropriate.

@fineol
Contributor Author

fineol commented Jul 19, 2021

I added unit tests and a DSN parameter that allows configuration of auto-retry behavior on top of tc-hib's commit 405258f. You can see my changes on my fork at commit 5ad7cee.

tc-hib's changes failed the existing unit tests TestShortTimeout and TestQueryTimeout because they wrapped the original errors in mssql.Error errors:

--- FAIL: TestShortTimeout (2.02s)
    ...
    queries_test.go:566: failure not a timeout, failed with mssql: read tcp 192.168.32.130:48322->192.168.32.129:1433: i/o timeout
--- FAIL: TestQueryTimeout (0.24s)
    ...
    queries_test.go:1965: ExecContext expected to fail with DeadlineExceeded but it returned mssql: context deadline exceeded

I think, based on the fields and functions of mssql.Error and the fact that there is also an existing StreamError struct, that Error was meant exclusively for errors returned by SQL Server. Therefore, we don't want to overload it and return it for all errors when a connection is bad.

Instead, I propose creating a new RetryableError for the exclusive purpose of conveying information about errors that can be retried. RetryableError implements the Is method recognized by errors.Is, returning true when the target is driver.ErrBadConn, which enables database/sql's retry logic. It also wraps the original error so that the caller can get at the root cause if the automatic retry logic fails. RetryableError is only used within checkBadConn, where I believe it simplifies and clarifies the logic. It also obviates the need for a badConn field in mssql.Error, which would be out of place there, because SQL Server errors generally have nothing to do with connection state.

I think we are getting pretty close to where we want to be, but we still need to:

  1. Decide what the default retry logic should be going forward. Do we want it to be "no retry" in the style desired in #275 ("bad connection" is too generic an error to be useful)? Or do we want it to be "auto retry" in the style desired in this issue, #586 (queries that fail due to bad connections are not automatically retried)? Keep in mind that "auto retry", although the standard and expected behavior within database/sql, is not the current behavior of go-mssqldb.
  2. Create a version that works with or without kardianos' proposed change to database/sql (presumably by creating two versions of checkBadConn in separate files gated by build tags/constraints).

What are everyone's thoughts?

Also, is it appropriate to carry on this level of discussion in the issue comments? If not, is GitHub's discussion feature the appropriate venue (in which case someone would have to enable it in this repository)?

@shueybubbles
Contributor

I think these types of discussions become more useful in a PR than in the Issues forum; I've not yet had much participation in a github Discussions forum to know if that model is any better.
At least with a PR we can point to particular lines of code with questions. As long as the ultimate merge is a squash it shouldn't matter if it goes through a ton of iterations.

Getting the driver to the point where its default is to follow the database/sql retry convention for #586 is desirable. It's basically table stakes for go-mssqldb to behave with a pattern similar to other drivers, isn't it? I would expect frameworks like Dapr to value this kind of predictability, and this model has been requested by some teams at Microsoft, such as AKS.

That said, SQL Server app developers that have moved over to our cloud offerings have found the simplistic retry models in previous drivers to be insufficient, leading to the new ADO driver implementing a fully customizable retry model per https://docs.microsoft.com/en-us/sql/connect/ado-net/configurable-retry-logic?view=sql-server-ver15

Eventually the Go database/sql core should consider adopting something similar, as there's no one-size-fits-all retry model that fits all applications across all environments. Even within one application, connections to one database will likely need a different retry model than connections to another, simply because of network topology and performance differences between the databases.

@tc-hib
Contributor

tc-hib commented Jul 19, 2021

I think, based on the fields and functions of mssql.Error and the fact that there is also an existing StreamError struct, that Error was meant exclusively for errors returned by SQL Server. Therefore, we don't want to overload it and return it for all errors when a connection is bad.

Instead, I propose creating a new RetryableError [...]

I agree, and I like your solution.

Also, is it appropriate to carry on this level of discussion in the issue comments? If not, is GitHub's discussion feature the appropriate venue (in which case someone would have to enable it in this repository)?

The discussion board would be useful for questions such as #663, #659, #646...
But I think our discussion belongs here, or under a PR. Later, if someone wonders why something was done in the code, this might provide an answer.

@fineol
Contributor Author

fineol commented Jul 20, 2021

I issued a pull request, as you can see above. It failed the AppVeyor tests for TestDisconnect5 because:

  1. The test uses errors.As, which is not supported in some of the older versions of Go that the AppVeyor tests are using.
  2. The actual error being returned by SQL Server in response to the raiserror varies on different machines.

I had hoped to guarantee that the connection was broken as expected by checking in detail, but I see now that I will have to accept a less-detailed broken connection verification in order to support the testing environment. I will push an update when I get a chance.

kardianos pushed a commit that referenced this issue Jul 28, 2021
* This is how I would have tried to fix #586 (I don't have time to test it now)

* Wrap sql errors in the serverError

* Make ServerError and wrap sql errors in it

* To try CL 333949 wrap ErrBadConn in mssql.Error

* Forgot to test connectionGood in the (probably useless) case of mssql.Error

* Correctly handle other cases (other types of errors, with connectionGood unset before calling checkBadConn)

* Reordering of switch cases for readability

* Add dsn param to disable retry of bad conn queries

* Unit tests for auto retry of broken connections

* Add RetryableError type

* Allow disabling of bad-connection retry logic

* Base default retry policy on Go version

* Improve broken connection tests

* Update disconnect test to recognize net.OpError

* Break all dialers' connections in disconnect tests

* Document expected errors in TestDisconnect5

* Make badStreamPanic panic with StreamError

* Add test for logic of checkBadConn function

* Add unit tests for custom error types

* Test bad connection rejection

Co-authored-by: Thomas Combeléran <55949036+tc-hib@users.noreply.github.com>