Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark connections as bad on MariaDB shutdown #451

Closed
wants to merge 8 commits into from

Conversation

tomponline
Copy link

@tomponline tomponline commented Apr 29, 2016

Description

Fix to address query when first query is run after MySQL is restarted.
Addresses issue #449

Checklist

  • Code compiles correctly
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary
  • Added myself / the copyright holder to the AUTHORS file

@tomponline
Copy link
Author

The unit tests pass on 1.6.2

[root@tpel6 mysql]# go test -v ./...
=== RUN   TestEmptyQuery
--- PASS: TestEmptyQuery (0.00s)
=== RUN   TestCRUD
--- PASS: TestCRUD (0.05s)
=== RUN   TestMultiQuery
--- PASS: TestMultiQuery (0.01s)
=== RUN   TestInt
--- PASS: TestInt (0.21s)
=== RUN   TestFloat32
--- PASS: TestFloat32 (0.07s)
=== RUN   TestFloat64
--- PASS: TestFloat64 (0.05s)
=== RUN   TestFloat64Placeholder
--- PASS: TestFloat64Placeholder (0.08s)
=== RUN   TestString
--- PASS: TestString (0.13s)
=== RUN   TestDateTime
--- PASS: TestDateTime (0.02s)
=== RUN   TestTimestampMicros
--- PASS: TestTimestampMicros (0.00s)
=== RUN   TestNULL
--- PASS: TestNULL (0.03s)
=== RUN   TestUint64
--- PASS: TestUint64 (0.00s)
=== RUN   TestLongData
--- PASS: TestLongData (0.25s)
=== RUN   TestLoadData
--- PASS: TestLoadData (0.06s)
=== RUN   TestFoundRows
--- PASS: TestFoundRows (0.10s)
=== RUN   TestStrict
--- PASS: TestStrict (0.05s)
=== RUN   TestTLS
--- SKIP: TestTLS (0.00s)
    driver_test.go:1230: server does not support TLS
=== RUN   TestReuseClosedConnection
[mysql] 2016/04/29 13:08:34 statement.go:46: invalid connection
--- PASS: TestReuseClosedConnection (0.00s)
=== RUN   TestCharset
--- PASS: TestCharset (0.01s)
=== RUN   TestFailingCharset
--- PASS: TestFailingCharset (0.00s)
=== RUN   TestCollation
--- PASS: TestCollation (0.02s)
=== RUN   TestColumnsWithAlias
--- PASS: TestColumnsWithAlias (0.00s)
=== RUN   TestRawBytesResultExceedsBuffer
--- PASS: TestRawBytesResultExceedsBuffer (0.00s)
=== RUN   TestTimezoneConversion
--- PASS: TestTimezoneConversion (0.17s)
=== RUN   TestRowsClose
--- PASS: TestRowsClose (0.00s)
=== RUN   TestCloseStmtBeforeRows
--- PASS: TestCloseStmtBeforeRows (0.00s)
=== RUN   TestStmtMultiRows
--- PASS: TestStmtMultiRows (0.01s)
=== RUN   TestPreparedManyCols
--- PASS: TestPreparedManyCols (0.02s)
=== RUN   TestConcurrent
--- SKIP: TestConcurrent (0.00s)
    driver_test.go:1664: MYSQL_TEST_CONCURRENT env var not set
=== RUN   TestCustomDial
--- PASS: TestCustomDial (0.00s)
=== RUN   TestSQLInjection
--- PASS: TestSQLInjection (0.16s)
=== RUN   TestInsertRetrieveEscapedData
--- PASS: TestInsertRetrieveEscapedData (0.08s)
=== RUN   TestUnixSocketAuthFail
--- PASS: TestUnixSocketAuthFail (0.00s)
    driver_test.go:1838: socket: /var/lib/mysql/mysql.sock
    driver_test.go:1838: socket: /var/lib/mysql/mysql.sock
    driver_test.go:1838: socket: /var/lib/mysql/mysql.sock
=== RUN   TestDSNParser
--- PASS: TestDSNParser (0.00s)
=== RUN   TestDSNParserInvalid
--- PASS: TestDSNParserInvalid (0.00s)
=== RUN   TestDSNReformat
--- PASS: TestDSNReformat (0.00s)
=== RUN   TestDSNWithCustomTLS
--- PASS: TestDSNWithCustomTLS (0.00s)
=== RUN   TestDSNWithCustomTLSQueryEscape
--- PASS: TestDSNWithCustomTLSQueryEscape (0.00s)
=== RUN   TestDSNUnsafeCollation
--- PASS: TestDSNUnsafeCollation (0.00s)
=== RUN   TestErrorsSetLogger
--- PASS: TestErrorsSetLogger (0.00s)
=== RUN   TestErrorsStrictIgnoreNotes
--- PASS: TestErrorsStrictIgnoreNotes (0.00s)
=== RUN   TestScanNullTime
--- PASS: TestScanNullTime (0.00s)
=== RUN   TestLengthEncodedInteger
--- PASS: TestLengthEncodedInteger (0.00s)
=== RUN   TestOldPass
--- PASS: TestOldPass (0.00s)
=== RUN   TestFormatBinaryDateTime
--- PASS: TestFormatBinaryDateTime (0.00s)
=== RUN   TestEscapeBackslash
--- PASS: TestEscapeBackslash (0.00s)
=== RUN   TestEscapeQuotes
--- PASS: TestEscapeQuotes (0.00s)
PASS
ok      github.com/tomponline/mysql 1.599s

@tomponline
Copy link
Author

I think this issue is caused by the fact that when the MySQL server is shutdown, because the Go client is not reading from the socket in a loop, it doesn't catch the shutdown message. Because the socket is not read from until after the next query is written to the socket, and the write doesn't fail because the socket is in CLOSED_WAIT state, it takes until the next read to detect the sequence number is wrong, because we read the server closed message.

@tomponline
Copy link
Author

tomponline commented Apr 29, 2016

I've actually modified this PR to detect the MariaDB connection was killed error code and restart the query if it was found.

Output is now:

[root@tpel6 ~]# go run db.go 
Ping OK
Conn ID:  88
[mysql] 2016/04/29 13:42:39 packets.go:538: Connection was killed
[mysql] 2016/04/29 13:42:39 packets.go:131: write tcp 127.0.0.1:42188->127.0.0.1:3306: write: broken pipe
Conn ID:  4

@arnehormann
Copy link
Member

Hi @tomponline, thanks for digging into this!

The query has to be reissued anyway if the connection dies - and this error case is probably very, very uncommon. If you wrote the PR mainly because the error message bugs you, could you use mysql.SetLogger to filter these messages with a custom logger?

Also, this PR reads too Maria specific for my taste. The higher error codes are not the same between all MySQL dialects and I'd like to avoid the potential conflicts that could arise.

If I misread hat you're doing and this is not just about the logging, please drop me a comment and we'll talk about how we should proceed. One change you'd definitely have to do is adhering to alphabetic ordering in the AUTHORS file 😀 (it's an undocumented convention).

Thanks!

@tomponline
Copy link
Author

Hi @arnehormann thanks! I will update the AUTHORS file as you have requested.

The problem is this error causes user queries to fail (they are not retried by the driver) even if the query is started sometime after the MySQL has been restarted.

So its not that the query is interrupted by the server restart, I would expect that to be passed up to the userland.

Its that a dead connection is used on subsequent new queries, even though the server has been restored for sometime (minutes/hours) since the last query was run on that connection.

Its not uncommon if you restart a server, and in our case, it means the next user to issue a query is guaranteed to fail.

I'm not sure if this happens with MySQL too or whether it is a MariaDB only issue, but it means that if you restart the server, once the server is up and running all applications will fail on their next query before the client reconnects.

This means we would need to add error detection for connection closed on every query on every app, whereas I understood the Golang sql library was supposed to reconnect on connection failure (that doesn't affect a running query).

@tomponline
Copy link
Author

tomponline commented May 3, 2016

@arnehormann I couldnt find any reference to error code 1927 in MySQL docs, so am assuming this is a MariaDB specific error/behavior.

What seems to happen is that a message response packet is sent to the client connection when the server is shutdown, but because no queries are currently being run against the server, the client doesn't read the response until the next query is issued.

For some reason, the socket write for the new query doesn't fail, even though the socket is closed, again this might be because the socket hasn't been read from since the server disconnected.

Because the socket write succeeds, when the response to the query is read back, it instead gets the connection killed message waiting on the socket, which appears to be sent with a sequence number of 0.

Originally, this caused an out of sequence error, but it didn't cause the socket to be closed.

My original attempt at a fix was to detect the special case sequence number of 0 and return driver.ErrBadConn which caused the Go driver to reconnect and re-run the query.

However I 'improved' it by specifically handling the connection killed error, in case a genuine sequence error caused 0 to be received out of order.

@arnehormann
Copy link
Member

@tomponline ok, I got it.

Can you write a test? I'm thinking of a custom net.Dialer running in a separate goroutine which cuts or pauses the connection and can be controlled from the test.

Also, this is related to #295 and #302: applications should definitely provide this kind of error handling. The driver is acting against the spec as is (and apparently at least the postgres driver, too).

@tomponline
Copy link
Author

@arnehormann cool thanks, yes will give it a go to add a test.

I've read the other issues and they are definitely related, although from my tests using a slightly modified test script, it doesn't look as though this error will occur if the connection isn't idle (i.e a query is in progress) when the server is shutdown, as a different server response is sent, or if the server dies hard then the connection is ended by reading EOF.

package main

import _ "github.com/tomponline/mysql"
import "database/sql"
import "fmt"
import "time"

func main() {

    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:3306)/mysql")

    if err != nil {
        fmt.Println("Open error: ", err)
        return
    }

    err = db.Ping()

    if err != nil {
        fmt.Println("Ping error: ", err)
        return
    }
    fmt.Println("Ping OK")

    var resp string

    for {
        fmt.Println("starting query...")
        if err := db.QueryRow("SELECT user FROM mysql.user WHERE SLEEP(5) = 0 limit 1").Scan(&resp); err != nil {
            fmt.Println("App MySQL error: ", err)
            time.Sleep(10 * time.Second)
            continue
        }
        fmt.Println("Finished query: ", resp)
        time.Sleep(10 * time.Second)
    }

}

This test script simulates a long-running query with a result set.

If you shutdown the server whilst the query is running, then the error is correctly passed to the executing application.

However if you restart the server when no query is running, the next query to be executed is transparently reconnected to the server. :)

@tomponline
Copy link
Author

Here is an example of the output when restarting the server when the connection is idle, the next query that is run transparently reconnects and runs the query OK:

[root@tpel6 ~]# go run db.go 
Ping OK
starting query...
Finished query:  root
starting query...
[mysql] 2016/05/09 21:38:51 packets.go:538: Error 1927: Connection was killed
[mysql] 2016/05/09 21:38:51 packets.go:131: write tcp 127.0.0.1:51297->127.0.0.1:3306: write: broken pipe
Finished query:  root

Here is an example of restarting the server during a query being run and the error being passed to the app:

[root@tpel6 ~]# go run db.go 
Ping OK
starting query...
App MySQL error:  Error 1053: Server shutdown in progress
starting query...
[mysql] 2016/05/09 21:37:55 packets.go:33: unexpected EOF
[mysql] 2016/05/09 21:37:55 packets.go:131: write tcp 127.0.0.1:51059->127.0.0.1:3306: write: broken pipe
Finished query:  root

Here is an example of killing the server hard during a query running:

[root@tpel6 ~]# go run db.go 
Ping OK
starting query...
[mysql] 2016/05/09 21:39:57 packets.go:33: unexpected EOF
App MySQL error:  dial tcp 127.0.0.1:3306: getsockopt: connection refused
starting query...
Finished query:  root
starting query...

Here the application isn't aware of the server socket dying and is reconnected transparently, but because the server is down, the connection is refused so the app does still see an error.

This is the same behavior as the current head:

[root@tpel6 ~]# go run db.go 
Ping OK
starting query...
[mysql] 2016/05/09 21:40:49 packets.go:33: unexpected EOF
App MySQL error:  dial tcp 127.0.0.1:3306: getsockopt: connection refused
starting query...
App MySQL error:  dial tcp 127.0.0.1:3306: getsockopt: connection refused
^Csignal: interrupt

So whilst not to spec, its the current behavior of the library and not related to this change.

@arnehormann
Copy link
Member

@tomponline when I wrote related I mean that spec-conforming apps have to account for connection breakage and have to handle it themselves. But that won't be the case for all the apps out there.

And I'd like a func TestConnectionLoss(t *testing.T) with a custom RegisterDial - and DeregisterDial after the test is run. We should add this to the test suites so we can detect regressions. If you stop the query in transmission - maybe after the first one or two Write calls in the Dialer after the connection was established - it should interrupt something in flight, right?

@julienschmidt julienschmidt added this to the v1.3 milestone Oct 25, 2016
@@ -49,8 +49,15 @@ func (mc *mysqlConn) readPacket() ([]byte, error) {
if data[3] > mc.sequence {
return nil, ErrPktSyncMul
}
return nil, ErrPktSync

//When MariaDB server is shutdown connection killed packet is sent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put a space between `// and the comment.
Slightly changed comment: "The MariaDB server sends a packet with sequence number 0 when the server is shutdown."

@@ -525,6 +532,13 @@ func (mc *mysqlConn) handleErrorPacket(data []byte) error {
pos = 9
}

//If error code is for Connection was killed, then return bad connection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a space here and put "Connection was killed" in quotes. Maybe also replace "then return bad connection" with "then mark connection as bad"

@julienschmidt julienschmidt changed the title fixes reconnect problem when mysql is restarted Mark connections as bad on MariaDB shutdown Oct 25, 2016
@julienschmidt julienschmidt modified the milestones: v1.4, v1.3 Nov 2, 2016
@jvshahid
Copy link

@arnehormann i'm wondering if ErrPktSync should be replaced with driver.ErrBadConn ? May be I misunderstood the docs, but I cannot think of a way in which a golang program can recover from this error. I read in the docs a few ways in which you can get into this state, but assuming the program did get into that bad state and ErrPktSync is returned, the connection will return to the connection pool and some random query will run on that connection at a later time.

Side note I have a test case on a branch with @tomponline changes, but those changes break this test. The only way to fix that test is to change the sequence id to non-0 value which feels hacky to me. What do you think ?

//https://mariadb.com/kb/en/mariadb/mariadb-error-codes/
if errno == 1927 {
errLog.Print("Error ", errno, ": ", string(data[pos:]))
return driver.ErrBadConn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not safe, because we can't know this error was sent before sending query or not.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my question is assuming this came after a query is sent. is ErrPktSync a recoverable error, i.e. can i still use the connection after I get back that error from the driver ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't get what is your question.
Clearly, you cannot use the connection.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is the difference between returning ErrBadConn and ErrPktSync ? The latter seem to be unrecoverable and ErrBadConn is the right way for the driver to signal to go that the connection is in a bad state

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can close this connection (mc.close() to make next call returns ErrBadConn.

@julienschmidt
Copy link
Member

Please continue discussion in #631

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants