-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow retry on netErrors in safe situations #871
Conversation
a46ee34
to
89cb5db
Compare
In some situations the sql package does not retry a pq operation when it should. One of the situations is lib#870. When a postgresql-server is restarted and after the restart is finished an operation is triggered on the same db handle it failed with an broken pipe error in some circumstances. The sql package does not retry the operation and instead fail because the pq driver does not return driver.ErrBadConn for network errors. The driver must not return ErrBadConn when the server might have already executed the operation. This would cause that sql package is retrying it and the operation would be run multiple times by the postgresql server. In some situations it's safe to return ErrBadConn on network errors. This is the case when it's ensured that the server did not receive the message that triggers the operation. This commit introduces a netErrorNoWrite error. This error should be used when network operations panic when it's safe to retry the operation. When errRecover() receives this error it returns ErrBadConn() and marks the connection as bad. A mustSendRetryable() function is introduced that wraps a netOpError in an netErrorNoWrite when panicing. mustSendRetryable() is called in situations when the send that triggers the operation failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
CI jobs fails during tool installation:
|
@mjibson could you please provide some feedback regarding the change when you have time? thanks! |
func (cn *conn) mustSendRetryable(m *writeBuf) { | ||
err := cn.send(m) | ||
if err != nil { | ||
if _, ok := err.(*net.OpError); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced this makes something safe to retry. If you send something to a server and get a network error, it is still possible for the server to have received and processed the result, but the error happens during return to the client. In that case it's not safe to retry. Do I understand the circumstances where this can happen correctly? If so, this is closer to a "maybe it worked?" error, which means the client has to check if it is safe to retry or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review, @mjibson and @cbandy!
If you send something to a server and get a network error, it is still possible for the server to
have received and processed the result, but the error happens during return to the client.
My networking knowledge is a bit rusty but from my understanding this can not happen:
When send()
is called the data is only written to the kernel's send buffer and then the call returns. The send()
call does not block until the peer acknowledged receiving the data.
Therefore it can not happen that cn.send(m)
returns an error because something went wrong during returning an ACK to the sender.
Also errors that are returned by send()
only indicate local errors.
Quote from Linux manpage of the send()
syscall:
No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.
http://man7.org/linux/man-pages/man2/send.2.html
(Golang uses internally the write()
syscall to write the data to the fd of the socket. It's the same then using the send()
syscall without additional flags.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a lot of assumptions that care about how the linux kernel and go function and I'm not at all comfortable changing this until there are tests that clearly demonstrate these assumptions. Also, what about if the client and or server are on windows, mac, or some version of linux or go that doesn't happen to function this way? I'm still not convinced that it's always safe to retry if there's an error here. Distributed systems indeed have a "maybe this failed?" error because of this exact problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mjibson ok, I can make the code more safe by narrowing the situation when a send()
error is identified as retryable by:
- Also check if
Conn.Write()
returned that 0 bytes were transferred when it returned a net.OpError error. - Narrow down the error types to:
syscall.EPIPE
: "The local end has been shut down on a connection oriented socket.".syscall.ENOTSOCK
: "The file descriptor sockfd does not refer to a socket.".syscall.EINVAL
: "Invalid argument passed."syscall.ECONNRESET
: "Connection reset by peer."syscall.EBADF
: "sockfd is not a valid open file descriptor."
- Narrow down the error type to
syscall.EPIPE
only. The PR would then address specifically the problem described in db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870.
What do you think?
I would favor doing 1 and 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know what exactly the EPIPE error means actually happened and what didn't happen. Can it change between OS or version of linux? Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths? These questions make me very wary to do even 1 and 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know what exactly the EPIPE error means actually happened and what didn't happen.
You can look it up in the documentation for the related syscalls:
Posix: https://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
The documentation in the e.g. linux or freebsd manpages are more clear and they implement the Posix standard, see e.g.:
https://www.freebsd.org/cgi/man.cgi?write(2) or http://man7.org/linux/man-pages/man2/write.2.html
See for Windows: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile
Google can also help with understanding what EPIPE means:
https://www.google.com/search?q=what+does+%22broken+pipe%22+mean&oq=what+does+%22broken+pipe%22+mean
Can it change between OS or version of linux?
The syscall API of operating systems is stable. It does not change in downwards incompatible manner. This would break tons of applications for the effected operation system.
Unix-like operating systems implement the POSIX API which specify this behavior. This includes OSX.
Windows Winsock API is not POSIX compliant.
But in this case it behaves the same, the WriteFile()
returns ERROR_BROKEN_PIPE
when the filedescriptor is closed. ERROR_BROKEN_PIPE
is mapped to the same error then EPIPE
by golang
See:
https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile
https://golang.org/src/syscall/zerrors_windows.go#L120
Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths?
Yes
See the implementation of Golangs conn.Write()
and the parent methods that it calls.
https://golang.org/src/net/net.go?s=6460:6503#L175
Unix: https://golang.org/src/internal/poll/fd_unix.go#L254
Windows: https://golang.org/src/internal/poll/fd_windows.go#657
fd.Write()
calls syscall.Write()
in a loop until all bytes were written out.
If it fails in the first loop iteration, it only returns an error nothing was sent out in this case.
If it fails on subsequent loop iterations it returns with the error how many bytes were send out.
It is possible that a query string contains multiple SQL statements. Depending on the server implementation, it might be possible that the server executes the SQL statements as soon as it receives a termination character. If a net.OpError happened afterwards we can not send the whole query string again to the server because the already executed statements would be executed again. Add an additional check in mustSendRetryable() to only return a netErrorNoWrite error if the number of bytes that were send out is 0.
It looks like someone in the MySQL driver proposed having an extra go routine that checks for closed connections: go-sql-driver/mysql#529 (comment) Not sure if that's something worth considering but it would remove the issues of retrying when the client can't be sure if the server accepted the request. |
In my own service, I'll likely be using sql.DB's The current approach in this PR (only marking the error as retriable if 0-bytes were written) is the same approach used in pgx: https://github.com/jackc/pgconn/blob/master/pgconn.go#L860 I understand the concern that there may be cases where Write to a network socket will erroneously report 0 bytes written on error. Would adding a tests for our expected behavior (for example, a test that creates a tcp connection, a writer that attempts a large write, a reader that does a small, incomplete read and then closes the connection) on particular platforms and go versions and then using conditional compilation to only enable the retry behavior on those platforms be an acceptable solution? Changes to Go or the OS could still introduce issues, but this would perhaps give us some early warning. No pressure either way, just wanted to check in on this since I hit it recently. |
Seeing the Github article today about the same issue in the mysql driver : https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/ Would it make sense to implement the same check for |
|
Any update on this? Just wondering if this is planned to be merged or how we should be dealing with this because it's a real problem |
I'm not convinced of the safety of this PR and do not intend to merge it. As described in my comments above, I believe that a network error could occur after a message has been sent to the server, in which case it is not guaranteed that it is safe for the client to auto retry. In that case, the client must check itself if a retry is safe. I could absolutely be wrong, and the error messages coming back here could be used to verify that 0 bytes have been sent, at which point it's clearly safe to retry. However I'm not certain about the guarantees of the error messages and that those guarantees are the same across all OSs running Go. Currently you'll have to solve this yourself in the application layer or merge this PR locally and hope it works. |
The issue is solved. |
I'm seeing "broken pipe" errors randomly on one of our servers The issue seems to be the db driver conns become stale when tcp connections were disconnected. It happens more often when the DB is behind a proxy. This has been reported and fixed in the lib/pq in v1.9+ lib/pq#871
In some situations the sql package does not retry a pq operation when it
should.
One of the situations is #870.
When a postgresql-server is restarted and after the restart is finished an
operation is triggered on the same db handle, it failed with an broken
pipe error in some circumstances.
The sql package does not retry the operation and instead fail because
the pq driver does not return driver.ErrBadConn for network errors.
The driver must not return ErrBadConn when the server might have already
executed the operation. This would cause that sql package is retrying it
and the operation would be run multiple times by the postgresql server.
In some situations it's safe to return ErrBadConn on network errors.
This is the case when it's ensured that the server did not receive the
message that triggers the operation.
This commit introduces a netErrorNoWrite error. This error should be
used when network operations panic when it's safe to retry the
operation.
When errRecover() receives this error it returns ErrBadConn() and marks
the connection as bad.
A mustSendRetryable() function is introduced that wraps a netOpError in
an netErrorNoWrite when panicing.
mustSendRetryable() is called in situations when the send that triggers
the operation failed.