allow retry on netErrors in safe situations #871

fho · 2019-05-28T15:25:02Z

In some situations the sql package does not retry a pq operation when it
should.
One of the situations is #870.
When a postgresql-server is restarted and after the restart is finished an
operation is triggered on the same db handle, it failed with an broken
pipe error in some circumstances.

The sql package does not retry the operation and instead fail because
the pq driver does not return driver.ErrBadConn for network errors.
The driver must not return ErrBadConn when the server might have already
executed the operation. This would cause that sql package is retrying it
and the operation would be run multiple times by the postgresql server.
In some situations it's safe to return ErrBadConn on network errors.
This is the case when it's ensured that the server did not receive the
message that triggers the operation.

This commit introduces a netErrorNoWrite error. This error should be
used when network operations panic when it's safe to retry the
operation.
When errRecover() receives this error it returns ErrBadConn() and marks
the connection as bad.
A mustSendRetryable() function is introduced that wraps a netOpError in
an netErrorNoWrite when panicing.
mustSendRetryable() is called in situations when the send that triggers
the operation failed.

In some situations the sql package does not retry a pq operation when it should. One of the situations is lib#870. When a postgresql-server is restarted and after the restart is finished an operation is triggered on the same db handle it failed with an broken pipe error in some circumstances. The sql package does not retry the operation and instead fail because the pq driver does not return driver.ErrBadConn for network errors. The driver must not return ErrBadConn when the server might have already executed the operation. This would cause that sql package is retrying it and the operation would be run multiple times by the postgresql server. In some situations it's safe to return ErrBadConn on network errors. This is the case when it's ensured that the server did not receive the message that triggers the operation. This commit introduces a netErrorNoWrite error. This error should be used when network operations panic when it's safe to retry the operation. When errRecover() receives this error it returns ErrBadConn() and marks the connection as bad. A mustSendRetryable() function is introduced that wraps a netOpError in an netErrorNoWrite when panicing. mustSendRetryable() is called in situations when the send that triggers the operation failed.

sirkjohannsen

LGTM

fho · 2019-05-29T08:44:52Z

CI jobs fails during tool installation:

E: Package 'postgresql-9.0' has no installation candidate
E: Unable to locate package postgresql-server-dev-9.0
E: Couldn't find any package by glob 'postgresql-server-dev-9.0'
E: Couldn't find any package by regex 'postgresql-server-dev-9.0'
E: Unable to locate package postgresql-contrib-9.0
E: Couldn't find any package by glob 'postgresql-contrib-9.0'
E: Couldn't find any package by regex 'postgresql-contrib-9.0'

fatal: cannot change to '/home/travis/gopath/src/honnef.co/go/tools/': No such file or directory
The command "./.travis.sh megacheck_install" failed and exited with 128 during .

fho · 2019-06-04T07:29:23Z

@mjibson could you please provide some feedback regarding the change when you have time? thanks!

maddyblue · 2019-06-04T17:59:21Z

conn.go

+func (cn *conn) mustSendRetryable(m *writeBuf) {
+	err := cn.send(m)
+	if err != nil {
+		if _, ok := err.(*net.OpError); ok {


I'm not convinced this makes something safe to retry. If you send something to a server and get a network error, it is still possible for the server to have received and processed the result, but the error happens during return to the client. In that case it's not safe to retry. Do I understand the circumstances where this can happen correctly? If so, this is closer to a "maybe it worked?" error, which means the client has to check if it is safe to retry or not.

Thanks for your review, @mjibson and @cbandy!

If you send something to a server and get a network error, it is still possible for the server to
have received and processed the result, but the error happens during return to the client.

My networking knowledge is a bit rusty but from my understanding this can not happen:

When send() is called the data is only written to the kernel's send buffer and then the call returns. The send() call does not block until the peer acknowledged receiving the data.
Therefore it can not happen that cn.send(m) returns an error because something went wrong during returning an ACK to the sender.

Also errors that are returned by send() only indicate local errors.

Quote from Linux manpage of the send() syscall:

No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.

http://man7.org/linux/man-pages/man2/send.2.html

(Golang uses internally the write() syscall to write the data to the fd of the socket. It's the same then using the send() syscall without additional flags.)

That's a lot of assumptions that care about how the linux kernel and go function and I'm not at all comfortable changing this until there are tests that clearly demonstrate these assumptions. Also, what about if the client and or server are on windows, mac, or some version of linux or go that doesn't happen to function this way? I'm still not convinced that it's always safe to retry if there's an error here. Distributed systems indeed have a "maybe this failed?" error because of this exact problem.

@mjibson ok, I can make the code more safe by narrowing the situation when a send() error is identified as retryable by:

Also check if Conn.Write() returned that 0 bytes were transferred when it returned a net.OpError error.

Narrow down the error types to:

syscall.EPIPE: "The local end has been shut down on a connection oriented socket.".

syscall.ENOTSOCK: "The file descriptor sockfd does not refer to a socket.".

syscall.EINVAL: "Invalid argument passed."

syscall.ECONNRESET: "Connection reset by peer."

syscall.EBADF: "sockfd is not a valid open file descriptor."

Narrow down the error type to syscall.EPIPE only. The PR would then address specifically the problem described in db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870.

What do you think?
I would favor doing 1 and 2.

I don't really know what exactly the EPIPE error means actually happened and what didn't happen. Can it change between OS or version of linux? Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths? These questions make me very wary to do even 1 and 3.

I don't really know what exactly the EPIPE error means actually happened and what didn't happen.

You can look it up in the documentation for the related syscalls:

Posix: https://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
The documentation in the e.g. linux or freebsd manpages are more clear and they implement the Posix standard, see e.g.:
https://www.freebsd.org/cgi/man.cgi?write(2) or http://man7.org/linux/man-pages/man2/write.2.html

See for Windows: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile

Google can also help with understanding what EPIPE means:
https://www.google.com/search?q=what+does+%22broken+pipe%22+mean&oq=what+does+%22broken+pipe%22+mean

Can it change between OS or version of linux?

The syscall API of operating systems is stable. It does not change in downwards incompatible manner. This would break tons of applications for the effected operation system.
Unix-like operating systems implement the POSIX API which specify this behavior. This includes OSX.

Windows Winsock API is not POSIX compliant.
But in this case it behaves the same, the WriteFile() returns ERROR_BROKEN_PIPE when the filedescriptor is closed. ERROR_BROKEN_PIPE is mapped to the same error then EPIPE by golang

See:
https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile
https://golang.org/src/syscall/zerrors_windows.go#L120

Is the bytes written count always non-zero if any bytes were sent even if there's an error return for all paths?

Yes
See the implementation of Golangs conn.Write() and the parent methods that it calls.
https://golang.org/src/net/net.go?s=6460:6503#L175
Unix: https://golang.org/src/internal/poll/fd_unix.go#L254
Windows: https://golang.org/src/internal/poll/fd_windows.go#657

fd.Write() calls syscall.Write() in a loop until all bytes were written out.
If it fails in the first loop iteration, it only returns an error nothing was sent out in this case.
If it fails on subsequent loop iterations it returns with the error how many bytes were send out.

conn.go

It is possible that a query string contains multiple SQL statements. Depending on the server implementation, it might be possible that the server executes the SQL statements as soon as it receives a termination character. If a net.OpError happened afterwards we can not send the whole query string again to the server because the already executed statements would be executed again. Add an additional check in mustSendRetryable() to only return a netErrorNoWrite error if the number of bytes that were send out is 0.

aeneasr · 2019-10-14T09:38:06Z

It looks like someone in the MySQL driver proposed having an extra go routine that checks for closed connections:

go-sql-driver/mysql#529 (comment)

Not sure if that's something worth considering but it would remove the issues of retrying when the client can't be sure if the server accepted the request.

stevendanna · 2020-02-26T13:54:59Z

In my own service, I'll likely be using sql.DB's SetMaxConnLifetime and higher-level retries to get around the main place I hit this, but I was wondering if there is still maintainer interest in moving this forward?

The current approach in this PR (only marking the error as retriable if 0-bytes were written) is the same approach used in pgx:

https://github.com/jackc/pgconn/blob/master/pgconn.go#L860
https://github.com/jackc/pgconn/blob/master/errors.go#L160-L162
https://github.com/jackc/pgx/blob/master/stdlib/sql.go#L282-L284

I understand the concern that there may be cases where Write to a network socket will erroneously report 0 bytes written on error. Would adding a tests for our expected behavior (for example, a test that creates a tcp connection, a writer that attempts a large write, a reader that does a small, incomplete read and then closes the connection) on particular platforms and go versions and then using conditional compilation to only enable the retry behavior on those platforms be an acceptable solution? Changes to Go or the OS could still introduce issues, but this would perhaps give us some early warning.

No pressure either way, just wanted to check in on this since I hit it recently.

gnuletik · 2020-05-21T17:00:07Z

Seeing the Github article today about the same issue in the mysql driver : https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

Would it make sense to implement the same check for lib/pq ?
go-sql-driver/mysql#934

cbandy · 2020-05-21T20:59:03Z

Seeing the Github article today about the same issue in the mysql driver : https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

Would it make sense to implement the same check for lib/pq ?
go-sql-driver/mysql#934

That solution (and perhaps any solution) is still dependent on the OS. In their master branch the list of supported systems is longer than the PR but far from complete. The fallback is no-op: https://github.com/go-sql-driver/mysql/blob/master/conncheck_dummy.go
The linked article states that this approach (non-blocking read before write) is safe for the MySQL protocol because the server only sends data in response to requests from the client. This is not the case for the PostgreSQL protocol in which the server can send unprompted messages to the client. It may be possible to manage that but the implementation won't be as simple as the MySQL pull request.

https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-ASYNC

hibrid · 2020-07-13T17:20:22Z

Any update on this? Just wondering if this is planned to be merged or how we should be dealing with this because it's a real problem

maddyblue · 2020-07-13T17:29:24Z

I'm not convinced of the safety of this PR and do not intend to merge it. As described in my comments above, I believe that a network error could occur after a message has been sent to the server, in which case it is not guaranteed that it is safe for the client to auto retry. In that case, the client must check itself if a retry is safe. I could absolutely be wrong, and the error messages coming back here could be used to verify that 0 bytes have been sent, at which point it's clearly safe to retry. However I'm not certain about the guarantees of the error messages and that those guarantees are the same across all OSs running Go. Currently you'll have to solve this yourself in the application layer or merge this PR locally and hope it works.

fho · 2020-11-23T17:28:23Z

The issue is solved.
The same solution from the PR #1013 was merged.

I'm seeing "broken pipe" errors randomly on one of our servers The issue seems to be the db driver conns become stale when tcp connections were disconnected. It happens more often when the DB is behind a proxy. This has been reported and fixed in the lib/pq in v1.9+ lib/pq#871

fho force-pushed the retryableNetErrors branch 2 times, most recently from a46ee34 to 89cb5db Compare May 28, 2019 15:28

fho marked this pull request as ready for review May 28, 2019 15:29

fho force-pushed the retryableNetErrors branch from 89cb5db to 682ce78 Compare May 28, 2019 15:35

sirkjohannsen approved these changes May 29, 2019

View reviewed changes

maddyblue reviewed Jun 4, 2019

View reviewed changes

fho commented Jun 5, 2019

View reviewed changes

conn.go Show resolved Hide resolved

danielsdeleo mentioned this pull request Feb 12, 2020

Dan/bump pq chef/automate#2858

Merged

4 tasks

fho mentioned this pull request Apr 18, 2020

db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870

Closed

This was referenced May 21, 2020

At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

Open

goroutines and tcp errors #723

Closed

aeneasr mentioned this pull request Jul 10, 2020

Refresh token flow breaks up during db server downtime ory/hydra#1928

Closed

easokol mentioned this pull request Nov 10, 2020

Sometimes postgres write broken pipe 500 odahu/odahu-flow#328

Open

fho closed this Nov 23, 2020

kanaksinghal mentioned this pull request Jul 22, 2021

update lib/pq from 1.1.1 to 1.10.2 (latest) jinzhu/gorm#17

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow retry on netErrors in safe situations #871

allow retry on netErrors in safe situations #871

fho commented May 28, 2019 •

edited

Loading

sirkjohannsen left a comment

fho commented May 29, 2019

fho commented Jun 4, 2019

maddyblue Jun 4, 2019

cbandy Jun 5, 2019

fho Jun 5, 2019

maddyblue Jun 5, 2019

fho Jun 6, 2019

maddyblue Jun 6, 2019

fho Jun 7, 2019

aeneasr commented Oct 14, 2019

stevendanna commented Feb 26, 2020 •

edited

Loading

gnuletik commented May 21, 2020

cbandy commented May 21, 2020

hibrid commented Jul 13, 2020

maddyblue commented Jul 13, 2020

fho commented Nov 23, 2020

allow retry on netErrors in safe situations #871

allow retry on netErrors in safe situations #871

Conversation

fho commented May 28, 2019 • edited Loading

sirkjohannsen left a comment

Choose a reason for hiding this comment

fho commented May 29, 2019

fho commented Jun 4, 2019

maddyblue Jun 4, 2019

Choose a reason for hiding this comment

cbandy Jun 5, 2019

Choose a reason for hiding this comment

fho Jun 5, 2019

Choose a reason for hiding this comment

maddyblue Jun 5, 2019

Choose a reason for hiding this comment

fho Jun 6, 2019

Choose a reason for hiding this comment

maddyblue Jun 6, 2019

Choose a reason for hiding this comment

fho Jun 7, 2019

Choose a reason for hiding this comment

aeneasr commented Oct 14, 2019

stevendanna commented Feb 26, 2020 • edited Loading

gnuletik commented May 21, 2020

cbandy commented May 21, 2020

hibrid commented Jul 13, 2020

maddyblue commented Jul 13, 2020

fho commented Nov 23, 2020

fho commented May 28, 2019 •

edited

Loading

stevendanna commented Feb 26, 2020 •

edited

Loading