Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow retry on netErrors in safe situations #871

Closed
wants to merge 2 commits into from

Conversation

fho
Copy link

@fho fho commented May 28, 2019

In some situations the sql package does not retry a pq operation when it
should.
One of the situations is #870.
When a postgresql-server is restarted and after the restart is finished an
operation is triggered on the same db handle, it failed with an broken
pipe error in some circumstances.

The sql package does not retry the operation and instead fail because
the pq driver does not return driver.ErrBadConn for network errors.
The driver must not return ErrBadConn when the server might have already
executed the operation. This would cause that sql package is retrying it
and the operation would be run multiple times by the postgresql server.
In some situations it's safe to return ErrBadConn on network errors.
This is the case when it's ensured that the server did not receive the
message that triggers the operation.

This commit introduces a netErrorNoWrite error. This error should be
used when network operations panic when it's safe to retry the
operation.
When errRecover() receives this error it returns ErrBadConn() and marks
the connection as bad.
A mustSendRetryable() function is introduced that wraps a netOpError in
an netErrorNoWrite when panicing.
mustSendRetryable() is called in situations when the send that triggers
the operation failed.

@fho fho force-pushed the retryableNetErrors branch 2 times, most recently from a46ee34 to 89cb5db Compare May 28, 2019 15:28
@fho fho marked this pull request as ready for review May 28, 2019 15:29
In some situations the sql package does not retry a pq operation when it
should.
One of the situations is lib#870. When a
postgresql-server is restarted and after the restart is finished an
operation is triggered on the same db handle it failed with an broken
pipe error in some circumstances.

The sql package does not retry the operation and instead fail because
the pq driver does not return driver.ErrBadConn for network errors.
The driver must not return ErrBadConn when the server might have already
executed the operation. This would cause that sql package is retrying it
and the operation would be run multiple times by the postgresql server.
In some situations it's safe to return ErrBadConn on network errors.
This is the case when it's ensured that the server did not receive the
message that triggers the operation.

This commit introduces a netErrorNoWrite error. This error should be
used when network operations panic when it's safe to retry the
operation.
When errRecover() receives this error it returns ErrBadConn() and marks
the connection as bad.
A mustSendRetryable() function is introduced that wraps a netOpError in
an netErrorNoWrite when panicing.
mustSendRetryable() is called in situations when the send that triggers
the operation failed.
@fho fho force-pushed the retryableNetErrors branch from 89cb5db to 682ce78 Compare May 28, 2019 15:35
Copy link

@sirkjohannsen sirkjohannsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fho
Copy link
Author

fho commented May 29, 2019

CI jobs fails during tool installation:

E: Package 'postgresql-9.0' has no installation candidate
E: Unable to locate package postgresql-server-dev-9.0
E: Couldn't find any package by glob 'postgresql-server-dev-9.0'
E: Couldn't find any package by regex 'postgresql-server-dev-9.0'
E: Unable to locate package postgresql-contrib-9.0
E: Couldn't find any package by glob 'postgresql-contrib-9.0'
E: Couldn't find any package by regex 'postgresql-contrib-9.0'
fatal: cannot change to '/home/travis/gopath/src/honnef.co/go/tools/': No such file or directory
The command "./.travis.sh megacheck_install" failed and exited with 128 during .

@fho
Copy link
Author

fho commented Jun 4, 2019

@mjibson could you please provide some feedback regarding the change when you have time? thanks!

func (cn *conn) mustSendRetryable(m *writeBuf) {
err := cn.send(m)
if err != nil {
if _, ok := err.(*net.OpError); ok {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this makes something safe to retry. If you send something to a server and get a network error, it is still possible for the server to have received and processed the result, but the error happens during return to the client. In that case it's not safe to retry. Do I understand the circumstances where this can happen correctly? If so, this is closer to a "maybe it worked?" error, which means the client has to check if it is safe to retry or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review, @mjibson and @cbandy!

If you send something to a server and get a network error, it is still possible for the server to
have received and processed the result, but the error happens during return to the client.

My networking knowledge is a bit rusty but from my understanding this can not happen:

When send() is called the data is only written to the kernel's send buffer and then the call returns. The send() call does not block until the peer acknowledged receiving the data.
Therefore it can not happen that cn.send(m) returns an error because something went wrong during returning an ACK to the sender.

Also errors that are returned by send() only indicate local errors.

Quote from Linux manpage of the send() syscall:

No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.

http://man7.org/linux/man-pages/man2/send.2.html

(Golang uses internally the write() syscall to write the data to the fd of the socket. It's the same then using the send() syscall without additional flags.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of assumptions that care about how the linux kernel and go function and I'm not at all comfortable changing this until there are tests that clearly demonstrate these assumptions. Also, what about if the client and or server are on windows, mac, or some version of linux or go that doesn't happen to function this way? I'm still not convinced that it's always safe to retry if there's an error here. Distributed systems indeed have a "maybe this failed?" error because of this exact problem.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjibson ok, I can make the code more safe by narrowing the situation when a send() error is identified as retryable by:

  1. Also check if Conn.Write() returned that 0 bytes were transferred when it returned a net.OpError error.
  2. Narrow down the error types to:
  • syscall.EPIPE: "The local end has been shut down on a connection oriented socket.".
  • syscall.ENOTSOCK: "The file descriptor sockfd does not refer to a socket.".
  • syscall.EINVAL: "Invalid argument passed."
  • syscall.ECONNRESET: "Connection reset by peer."
  • syscall.EBADF: "sockfd is not a valid open file descriptor."
  1. Narrow down the error type to syscall.EPIPE only. The PR would then address specifically the problem described in db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870.

What do you think?
I would favor doing 1 and 2.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know what exactly the EPIPE error means actually happened and what didn't happen. Can it change between OS or version of linux? Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths? These questions make me very wary to do even 1 and 3.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know what exactly the EPIPE error means actually happened and what didn't happen.

You can look it up in the documentation for the related syscalls:

Posix: https://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
The documentation in the e.g. linux or freebsd manpages are more clear and they implement the Posix standard, see e.g.:
https://www.freebsd.org/cgi/man.cgi?write(2) or http://man7.org/linux/man-pages/man2/write.2.html

See for Windows: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile

Google can also help with understanding what EPIPE means:
https://www.google.com/search?q=what+does+%22broken+pipe%22+mean&oq=what+does+%22broken+pipe%22+mean

Can it change between OS or version of linux?

The syscall API of operating systems is stable. It does not change in downwards incompatible manner. This would break tons of applications for the effected operation system.
Unix-like operating systems implement the POSIX API which specify this behavior. This includes OSX.

Windows Winsock API is not POSIX compliant.
But in this case it behaves the same, the WriteFile() returns ERROR_BROKEN_PIPE when the filedescriptor is closed. ERROR_BROKEN_PIPE is mapped to the same error then EPIPE by golang

See:
https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile
https://golang.org/src/syscall/zerrors_windows.go#L120

Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths?

Yes
See the implementation of Golangs conn.Write() and the parent methods that it calls.
https://golang.org/src/net/net.go?s=6460:6503#L175
Unix: https://golang.org/src/internal/poll/fd_unix.go#L254
Windows: https://golang.org/src/internal/poll/fd_windows.go#657

fd.Write() calls syscall.Write() in a loop until all bytes were written out.
If it fails in the first loop iteration, it only returns an error nothing was sent out in this case.
If it fails on subsequent loop iterations it returns with the error how many bytes were send out.

conn.go Show resolved Hide resolved
It is possible that a query string contains multiple SQL statements.
Depending on the server implementation, it might be possible that the
server executes the SQL statements as soon as it receives a
termination character.
If a net.OpError happened afterwards we can not send the whole
query string again to the server because the already executed statements
would be executed again.

Add an additional check in mustSendRetryable() to only return a
netErrorNoWrite error if the number of bytes that were send out is 0.
@aeneasr
Copy link

aeneasr commented Oct 14, 2019

It looks like someone in the MySQL driver proposed having an extra go routine that checks for closed connections:

go-sql-driver/mysql#529 (comment)

Not sure if that's something worth considering but it would remove the issues of retrying when the client can't be sure if the server accepted the request.

@danielsdeleo danielsdeleo mentioned this pull request Feb 12, 2020
4 tasks
@stevendanna
Copy link

stevendanna commented Feb 26, 2020

In my own service, I'll likely be using sql.DB's SetMaxConnLifetime and higher-level retries to get around the main place I hit this, but I was wondering if there is still maintainer interest in moving this forward?

The current approach in this PR (only marking the error as retriable if 0-bytes were written) is the same approach used in pgx:

https://github.com/jackc/pgconn/blob/master/pgconn.go#L860
https://github.com/jackc/pgconn/blob/master/errors.go#L160-L162
https://github.com/jackc/pgx/blob/master/stdlib/sql.go#L282-L284

I understand the concern that there may be cases where Write to a network socket will erroneously report 0 bytes written on error. Would adding a tests for our expected behavior (for example, a test that creates a tcp connection, a writer that attempts a large write, a reader that does a small, incomplete read and then closes the connection) on particular platforms and go versions and then using conditional compilation to only enable the retry behavior on those platforms be an acceptable solution? Changes to Go or the OS could still introduce issues, but this would perhaps give us some early warning.

No pressure either way, just wanted to check in on this since I hit it recently.

@gnuletik
Copy link

Seeing the Github article today about the same issue in the mysql driver : https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

Would it make sense to implement the same check for lib/pq ?
go-sql-driver/mysql#934

@cbandy
Copy link
Contributor

cbandy commented May 21, 2020

Seeing the Github article today about the same issue in the mysql driver : https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

Would it make sense to implement the same check for lib/pq ?
go-sql-driver/mysql#934

  1. That solution (and perhaps any solution) is still dependent on the OS. In their master branch the list of supported systems is longer than the PR but far from complete. The fallback is no-op: https://github.com/go-sql-driver/mysql/blob/master/conncheck_dummy.go

  2. The linked article states that this approach (non-blocking read before write) is safe for the MySQL protocol because the server only sends data in response to requests from the client. This is not the case for the PostgreSQL protocol in which the server can send unprompted messages to the client. It may be possible to manage that but the implementation won't be as simple as the MySQL pull request.

    https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-ASYNC

@hibrid
Copy link

hibrid commented Jul 13, 2020

Any update on this? Just wondering if this is planned to be merged or how we should be dealing with this because it's a real problem

@maddyblue
Copy link
Collaborator

I'm not convinced of the safety of this PR and do not intend to merge it. As described in my comments above, I believe that a network error could occur after a message has been sent to the server, in which case it is not guaranteed that it is safe for the client to auto retry. In that case, the client must check itself if a retry is safe. I could absolutely be wrong, and the error messages coming back here could be used to verify that 0 bytes have been sent, at which point it's clearly safe to retry. However I'm not certain about the guarantees of the error messages and that those guarantees are the same across all OSs running Go. Currently you'll have to solve this yourself in the application layer or merge this PR locally and hope it works.

@fho
Copy link
Author

fho commented Nov 23, 2020

The issue is solved.
The same solution from the PR #1013 was merged.

@fho fho closed this Nov 23, 2020
kanaksinghal pushed a commit to kanaksinghal/gorm that referenced this pull request Jul 22, 2021
I'm seeing "broken pipe" errors randomly on one of our servers
The issue seems to be the db driver conns become stale when tcp connections were disconnected.

It happens more often when the DB is behind a proxy.

This has been reported and fixed in the lib/pq in v1.9+

lib/pq#871
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants