-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark connections as bad on MariaDB shutdown #451
Conversation
The unit tests pass on 1.6.2
|
I think this issue is caused by the fact that when the MySQL server is shutdown, because the Go client is not reading from the socket in a loop, it doesn't catch the shutdown message. Because the socket is not read from until after the next query is written to the socket, and the write doesn't fail because the socket is in CLOSED_WAIT state, it takes until the next read to detect the sequence number is wrong, because we read the server closed message. |
I've actually modified this PR to detect the MariaDB connection was killed error code and restart the query if it was found. Output is now:
|
Hi @tomponline, thanks for digging into this! The query has to be reissued anyway if the connection dies - and this error case is probably very, very uncommon. If you wrote the PR mainly because the error message bugs you, could you use mysql.SetLogger to filter these messages with a custom logger? Also, this PR reads too Maria specific for my taste. The higher error codes are not the same between all MySQL dialects and I'd like to avoid the potential conflicts that could arise. If I misread hat you're doing and this is not just about the logging, please drop me a comment and we'll talk about how we should proceed. One change you'd definitely have to do is adhering to alphabetic ordering in the AUTHORS file 😀 (it's an undocumented convention). Thanks! |
Hi @arnehormann thanks! I will update the AUTHORS file as you have requested. The problem is this error causes user queries to fail (they are not retried by the driver) even if the query is started sometime after the MySQL has been restarted. So its not that the query is interrupted by the server restart, I would expect that to be passed up to the userland. Its that a dead connection is used on subsequent new queries, even though the server has been restored for sometime (minutes/hours) since the last query was run on that connection. Its not uncommon if you restart a server, and in our case, it means the next user to issue a query is guaranteed to fail. I'm not sure if this happens with MySQL too or whether it is a MariaDB only issue, but it means that if you restart the server, once the server is up and running all applications will fail on their next query before the client reconnects. This means we would need to add error detection for connection closed on every query on every app, whereas I understood the Golang sql library was supposed to reconnect on connection failure (that doesn't affect a running query). |
@arnehormann I couldnt find any reference to error code 1927 in MySQL docs, so am assuming this is a MariaDB specific error/behavior. What seems to happen is that a message response packet is sent to the client connection when the server is shutdown, but because no queries are currently being run against the server, the client doesn't read the response until the next query is issued. For some reason, the socket write for the new query doesn't fail, even though the socket is closed, again this might be because the socket hasn't been read from since the server disconnected. Because the socket write succeeds, when the response to the query is read back, it instead gets the connection killed message waiting on the socket, which appears to be sent with a sequence number of 0. Originally, this caused an out of sequence error, but it didn't cause the socket to be closed. My original attempt at a fix was to detect the special case sequence number of 0 and return driver.ErrBadConn which caused the Go driver to reconnect and re-run the query. However I 'improved' it by specifically handling the connection killed error, in case a genuine sequence error caused 0 to be received out of order. |
@tomponline ok, I got it. Can you write a test? I'm thinking of a custom net.Dialer running in a separate goroutine which cuts or pauses the connection and can be controlled from the test. Also, this is related to #295 and #302: applications should definitely provide this kind of error handling. The driver is acting against the spec as is (and apparently at least the postgres driver, too). |
@arnehormann cool thanks, yes will give it a go to add a test. I've read the other issues and they are definitely related, although from my tests using a slightly modified test script, it doesn't look as though this error will occur if the connection isn't idle (i.e a query is in progress) when the server is shutdown, as a different server response is sent, or if the server dies hard then the connection is ended by reading EOF. package main
import _ "github.com/tomponline/mysql"
import "database/sql"
import "fmt"
import "time"
func main() {
db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:3306)/mysql")
if err != nil {
fmt.Println("Open error: ", err)
return
}
err = db.Ping()
if err != nil {
fmt.Println("Ping error: ", err)
return
}
fmt.Println("Ping OK")
var resp string
for {
fmt.Println("starting query...")
if err := db.QueryRow("SELECT user FROM mysql.user WHERE SLEEP(5) = 0 limit 1").Scan(&resp); err != nil {
fmt.Println("App MySQL error: ", err)
time.Sleep(10 * time.Second)
continue
}
fmt.Println("Finished query: ", resp)
time.Sleep(10 * time.Second)
}
} This test script simulates a long-running query with a result set. If you shutdown the server whilst the query is running, then the error is correctly passed to the executing application. However if you restart the server when no query is running, the next query to be executed is transparently reconnected to the server. :) |
Here is an example of the output when restarting the server when the connection is idle, the next query that is run transparently reconnects and runs the query OK:
Here is an example of restarting the server during a query being run and the error being passed to the app:
Here is an example of killing the server hard during a query running:
Here the application isn't aware of the server socket dying and is reconnected transparently, but because the server is down, the connection is refused so the app does still see an error. This is the same behavior as the current head:
So whilst not to spec, its the current behavior of the library and not related to this change. |
@tomponline when I wrote related I mean that spec-conforming apps have to account for connection breakage and have to handle it themselves. But that won't be the case for all the apps out there. And I'd like a |
@@ -49,8 +49,15 @@ func (mc *mysqlConn) readPacket() ([]byte, error) { | |||
if data[3] > mc.sequence { | |||
return nil, ErrPktSyncMul | |||
} | |||
return nil, ErrPktSync | |||
|
|||
//When MariaDB server is shutdown connection killed packet is sent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put a space between `// and the comment.
Slightly changed comment: "The MariaDB server sends a packet with sequence number 0 when the server is shutdown."
@@ -525,6 +532,13 @@ func (mc *mysqlConn) handleErrorPacket(data []byte) error { | |||
pos = 9 | |||
} | |||
|
|||
//If error code is for Connection was killed, then return bad connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a space here and put "Connection was killed" in quotes. Maybe also replace "then return bad connection" with "then mark connection as bad"
@arnehormann i'm wondering if Side note I have a test case on a branch with @tomponline changes, but those changes break this test. The only way to fix that test is to change the sequence id to non- |
//https://mariadb.com/kb/en/mariadb/mariadb-error-codes/ | ||
if errno == 1927 { | ||
errLog.Print("Error ", errno, ": ", string(data[pos:])) | ||
return driver.ErrBadConn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not safe, because we can't know this error was sent before sending query or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my question is assuming this came after a query is sent. is ErrPktSync
a recoverable error, i.e. can i still use the connection after I get back that error from the driver ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I can't get what is your question.
Clearly, you cannot use the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what is the difference between returning ErrBadConn
and ErrPktSync
? The latter seem to be unrecoverable and ErrBadConn
is the right way for the driver to signal to go that the connection is in a bad state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can close this connection (mc.close()
to make next call returns ErrBadConn
.
Please continue discussion in #631 |
Description
Fix to address query when first query is run after MySQL is restarted.
Addresses issue #449
Checklist