NewConnPipe can wait indefinitely on peer #96
This code needs to be refactored. Frankly, the SP layer negotiation needs to be done in a different goroutine.
This resolves #339 by setting an r/w timeout for pipe reads/writes (if present). This allows listening sockets to recover from bad incoming connections, and lets goroutines suffering from broken connections recover as well.
The pipe handshake needs to happen before accept is called. This means refactoring things elsewhere. A similar situation exists for the dialer side, I think (unconfirmed).
Please don't get me wrong, I was in no way criticizing the way Accept is implemented. I meant to go in a different direction, since this is a very low-level networking problem.
Hence my thinking was along the lines of how mangos is already using timeouts to avoid getting blocked by hard protocol state. Soft state with timeouts is used in many IETF protocols, and I believe that is also the reason that golang's net.Conn exposes read/write deadlines. By not also exposing them in mangos, the two tcp-based transports can get blocked. What do you think about the use of timeouts? The PR which I closed tried to apply the timeout to all transports. Would there be a way to pass a timeout only to the 2 transports that need them?
Oh, but I was criticizing the way Accept is implemented. It's my own mistake, and it needs to be better. In theory the same problem would impact IPC. It doesn't impact websocket because that uses an entirely different negotiation mechanism. (In fact, as I think about it, I might be able to leverage work I've done there -- but maybe not -- it will take a little investigation to determine that.) Mangos is designed to use blocking operations, and the fact that a low level transport pipe can block during read or write should not impact the overall socket. We do have some timeouts in a few places, but mostly we try to avoid imparting those to the transports themselves. I still feel pretty strongly about that. The solution here isn't timeouts. It's to make the fact that these operations can take a long time harmless. (We may still have timeouts, but that would mostly be about reclaiming sockets and preventing denial of service. The timeouts should not be a built-in requirement for the state machines to advance.)
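For illustration only, here is a minimal sketch of that direction, assuming an invented negotiate helper and a plain net.Listener rather than the actual mangos types: the accept loop hands the blocking protocol handshake to a goroutine, so a single dead peer can no longer wedge Accept.

```go
package sketch

import (
	"io"
	"net"
)

// negotiate is a stand-in for the SP-layer handshake: it exchanges a small
// fixed-size header with the peer. The header layout is invented for this sketch.
func negotiate(c net.Conn) error {
	hdr := []byte{0, 'S', 'P', 0, 0, 0, 0, 0}
	if _, err := c.Write(hdr); err != nil {
		return err
	}
	var peer [8]byte
	_, err := io.ReadFull(c, peer[:])
	return err
}

// acceptLoop accepts connections and hands the blocking handshake off to a
// goroutine, so a peer that stalls mid-handshake cannot block the listener.
func acceptLoop(l net.Listener, ready chan<- net.Conn) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			if err := negotiate(c); err != nil {
				c.Close() // a bad peer costs only this goroutine
				return
			}
			ready <- c
		}(conn)
	}
}
```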
Your answer is related to a different problem than I was describing, so I will explain again. The problem here is not that a r/w operation may take a long (blocking) time. It is rather that we cannot assume that the network is reliable. What happens regularly is that one side of a two-sided TCP connection goes away in the middle of a conversation. The side currently involved in a blocking operation has no way of recovering from that. Each 'blocking' operation that is not completed adds one resource leak to the system. Moving that blocking operation somewhere else does not change the fact that the operation is now blocked forever.
You are right, it also affects IPC/Unix, so it applies to 3 transports. I understand that you would like to refactor the Accept() path. In addition to Accept(), there is another problem (and maybe this should be in a separate issue): when a connection aborts, the outside-visible effects may be harmless. However, goroutines that are blocked on IO never recover, and each of them is a resource leak. I understand your objection to timeouts and for some time thought about re-using mangos' Option{Read,Send}Deadline for the transport connections. There seems to be no trivial way around the fact that a connection-oriented transport is a leaky abstraction which implies getting stuck on read or write. Hence my question: would it be possible to use a timeout similar to #341, as it deals with both issues (any kind of stuck read/write), is simpler to implement and test, and less intrusive?
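To make the resource-leak point concrete, here is a rough sketch (not mangos code; recvLoop and its buffer size are invented) of the usual per-connection receive loop. If the peer vanishes without a FIN or RST ever reaching this host, Read never returns, and the goroutine, its buffer and the connection stay pinned forever:

```go
package sketch

import "net"

// recvLoop is a typical per-connection receive loop. With no read deadline set,
// conn.Read on a silently dead peer blocks indefinitely, so the deferred Close
// and the goroutine's memory are never reclaimed.
func recvLoop(conn net.Conn, msgs chan<- []byte) {
	defer conn.Close()
	buf := make([]byte, 64*1024)
	for {
		n, err := conn.Read(buf) // may block forever on a dead peer
		if err != nil {
			return // only reached if the kernel ever reports an error
		}
		msg := make([]byte, n)
		copy(msg, buf[:n])
		msgs <- msg
	}
}
```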
This allows connection-oriented protocols to recover from bad network conditions by allowing a limit to be set on the time spent in network I/O (#339). Without the option, bad connections remain hung forever.
Problem not fully solved even with timeouts
We ran into problems after using the above-suggested timeouts in the manner implemented by PR #344. This time the tlstcp connection got stuck in another place. The server and the client got stuck in Accept and Dial, respectively:
There were lots of stuck clients with messages similar to the above (same location). The server was stuck in the tlstcp Accept:

// /opt/TeamCity/buildAgent/work/34f7a3c3b5ec8423/server/syntropy3/src/go/src/nanomsg.org/go-mangos/transport/tlstcp/tlstcp.go:172
func (l *listener) Accept() (mangos.Pipe, error) {
	tconn, err := l.listener.AcceptTCP()
	if err != nil {
		return nil, err
	}
	if err = l.opts.configTCP(tconn); err != nil {
		tconn.Close()
		return nil, err
	}
	conn := tls.Server(tconn, l.config)
	if err = conn.Handshake(); err != nil { // <=== STUCK HERE
	// ...

The client was stuck in the corresponding tlstcp client handshake:

func (d *dialer) Dial() (_ mangos.Pipe, err error) {
	var (
		addr   *net.TCPAddr
		config *tls.Config
	)
	if addr, err = mangos.ResolveTCPAddr(d.addr); err != nil {
		return nil, err
	}
	tconn, err := net.DialTCP("tcp", nil, addr)
	if err != nil {
		return nil, err
	}
	if err = d.opts.configTCP(tconn); err != nil {
		tconn.Close()
		return nil, err
	}
	if v, ok := d.opts[mangos.OptionTLSConfig]; ok {
		config = v.(*tls.Config)
	}
	conn := tls.Client(tconn, config)
	if err = conn.Handshake(); err != nil { // <== STUCK HERE

Hence the problem is not limited to Accept() only. In addition, the …
This allows connection-oriented protocols to recover from bad network conditions by allowing a limit to be set on the time spent in network I/O. As a special case, it allows clients and servers to recover from a bad connection during the initial TLS handshake.
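One common way to bound a TLS handshake, sketched here under the assumption of a configurable handshakeTimeout and not necessarily the way the PR implements it, is to arm a deadline on the connection before calling Handshake and clear it once the handshake succeeds:

```go
package sketch

import (
	"crypto/tls"
	"net"
	"time"
)

// handshakeTimeout is an invented knob for this sketch; the real option name
// and plumbing in mangos may differ.
const handshakeTimeout = 10 * time.Second

// tlsServerHandshake bounds the server-side TLS handshake so a peer that goes
// silent mid-handshake produces a timeout error instead of blocking Accept.
func tlsServerHandshake(tconn net.Conn, cfg *tls.Config) (*tls.Conn, error) {
	conn := tls.Server(tconn, cfg)
	if err := tconn.SetDeadline(time.Now().Add(handshakeTimeout)); err != nil {
		conn.Close()
		return nil, err
	}
	if err := conn.Handshake(); err != nil {
		conn.Close()
		return nil, err
	}
	// Clear the deadline so normal traffic on the pipe is not affected.
	if err := tconn.SetDeadline(time.Time{}); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}
```

Clearing the deadline afterwards matters, since otherwise ordinary reads and writes on the established pipe would start failing with spurious timeouts.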
We are encountering a lock-up on a tlstcp REP listener if a single incoming TLS/TCP connection is bad (ends prematurely). This could be a problem with other transports, too.
Problem description
We are running mangos v1, which we pull from git:nanomsg.org/go-mangos. Occasionally it happens that tlstcp.go:Accept() fails. The failure has to do with mangos.NewConnPipe(). The symptom in each case follows the same pattern as the gdb trace below:
The call is stuck in Accept(). NewConnPipe was stuck in the handshake; the p.handshake was stuck in a binary read, and the binary read was stuck in io.ReadFull. According to the io documentation, ReadFull will block until it has read exactly len(d.buf) bytes. This would block indefinitely if the TLS connection does not have any more data in the pipeline (which is confirmed by the calls to net.(*netFD).Read and runtime.netpollblock in the trace).
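For reference, the blocking pattern described above boils down to something like the following simplified sketch (not the real NewConnPipe handshake format): binary.Read on a fixed-size header ends up in io.ReadFull, which only returns once every byte has arrived or the connection reports an error.

```go
package sketch

import (
	"encoding/binary"
	"net"
)

// readHeader mimics the handshake read: a fixed-size header is read with
// binary.Read, which fills the value via io.ReadFull under the hood. If the
// peer never sends the remaining bytes while the TCP connection stays
// ESTABLISHED, this call never returns.
func readHeader(conn net.Conn) ([8]byte, error) {
	var hdr [8]byte
	err := binary.Read(conn, binary.BigEndian, &hdr) // blocks until 8 bytes or error
	return hdr, err
}
```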
The net.Conn interface has a SetReadDeadline call which prevents hanging on read. (There is a combined SetDeadline which sets both read/write timeouts.) Currently the mangos v1 code does not call SetDeadline or SetReadDeadline. The tls.Conn which is used by the tlstcp module implements net.Conn, but no r/w deadline is set on it either. The result is that the PipeListener.Accept() of the tlstcp module now becomes blocking, since it reads and writes on the new connection before returning. It will hang in this place (core.go:687 in the above gdb trace).
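For illustration, the deadline mechanism looks roughly like this; it is a generic net.Conn sketch rather than a patch to mangos, and it relies on os.ErrDeadlineExceeded (Go 1.15+) to recognize the timeout:

```go
package sketch

import (
	"errors"
	"io"
	"net"
	"os"
	"time"
)

// readWithDeadline arms a read deadline, attempts the fixed-size read, and
// turns a dead peer into a timeout error instead of an indefinite block.
func readWithDeadline(conn net.Conn, buf []byte, d time.Duration) error {
	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return err
	}
	if _, err := io.ReadFull(conn, buf); err != nil {
		if errors.Is(err, os.ErrDeadlineExceeded) {
			// The peer never delivered the bytes within d; close the
			// connection so the caller and its goroutine can be reclaimed.
			conn.Close()
		}
		return err
	}
	// Disarm the deadline so later reads are not affected.
	return conn.SetReadDeadline(time.Time{})
}
```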
We are able to confirm this by looking at the host's connections. The tcp/tls listener is on 10.55.220.81:20083:
The incoming listener queue is full (we use the default 128 for somaxconn):
We were able (via lsof -p) to identify the one hanging connection as 10.55.220.81:20083->172.25.1.9:57146 (ESTABLISHED).
For this socket, the receive/send queues are empty on the receiving host:
On the sending host (172.25.1.9), the connection with source port 57146 was long gone.
Conclusion and proposed fix
The PipeListener.Accept is unlike net.Listener's Accept, since it both reads and writes something on the connection before it returns.
In the case of a broken connection, this will hang, since socket r/w deadlines are not set; there is currently no way to set them.
One way would be to see if OptionRecvDeadline and/or OptionSendDeadline are set by the caller, and to reuse those for the transport socket. If not present, this could use some "sane" fallback (e.g. 1 second) to avoid blocking indefinitely.
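A rough sketch of what that reuse could look like is below; the opts map, the literal "RECV-DEADLINE" key and the 1-second fallback are placeholders for illustration, not the actual mangos v1 plumbing:

```go
package sketch

import (
	"net"
	"time"
)

// deadlineFor reuses a caller-supplied deadline option if present, otherwise
// falls back to a "sane" default so a broken peer cannot block the pipe
// handshake forever.
func deadlineFor(opts map[string]interface{}, key string) time.Duration {
	if v, ok := opts[key]; ok {
		if d, ok := v.(time.Duration); ok && d > 0 {
			return d
		}
	}
	return time.Second // fallback when the caller did not set the option
}

// applyHandshakeDeadline arms a combined read/write deadline on the freshly
// accepted (or dialed) connection before NewConnPipe-style negotiation runs.
func applyHandshakeDeadline(conn net.Conn, opts map[string]interface{}) error {
	// "RECV-DEADLINE" is meant to mirror mangos' OptionRecvDeadline key here.
	return conn.SetDeadline(time.Now().Add(deadlineFor(opts, "RECV-DEADLINE")))
}
```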