NewConnPipe can wait indefinitely on peer #96
This code needs to be refactored. Frankly, the SP layer negotiation needs to be done in a different goroutine.
This resolves #339 by setting an r/w timeout for pipe reads/writes (if present). This allows listening sockets to recover from bad incoming connections, and lets goroutines suffering from broken connections recover as well.
The pipe handshake needs to happen before accept is called. This means refactoring things elsewhere. A similar situation exists for the dialer side, I think (unconfirmed).
Please don't get me wrong, I was in no way criticizing the way Accept is implemented. I meant to go in a different direction, since this is a very low-level networking problem.
Hence my thinking was along the lines of how mangos is already using timeouts to avoid getting blocked by hard protocol state. Soft state with timeouts is used in many IETF protocols, and I believe that is also the reason that golang's net.Conn exposes read/write deadlines. By not also exposing them in mangos, the two tcp-based transports can get blocked. What do you think about the use of timeouts? The PR which I closed tried to apply the timeout to all transports. Would there be a way to pass a timeout only to the 2 transports that need them?
Oh, but I was criticizing the way Accept is implemented. It's my own mistake, and it needs to be better. In theory the same problem would impact IPC. It doesn't impact websocket because that uses an entirely different negotiation mechanism. (In fact, as I think about it, I might be able to leverage work I've done there -- but maybe not -- it will take a little investigation to determine that.) Mangos is designed to use blocking operations, and the fact that a low level transport pipe can block during read or write should not impact the overall socket. We do have some timeouts in a few places, but mostly we try to avoid imparting those to the transports themselves. I still feel pretty strongly about that. The solution here isn't timeouts. It's to make the fact that these operations can take a long time harmless. (We may still have timeouts, but that would mostly be about reclaiming sockets and preventing denial of service. The timeouts should not be a built-in requirement for the state machines to advance.)
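For illustration only, here is a minimal sketch of that direction, assuming an invented negotiate helper and a plain net.Listener rather than the actual mangos types: the accept loop hands the blocking protocol handshake to a goroutine, so a single dead peer can no longer wedge Accept.

```go
package sketch

import (
	"io"
	"net"
)

// negotiate is a stand-in for the SP-layer handshake: it exchanges a small
// fixed-size header with the peer. The header layout is invented for this sketch.
func negotiate(c net.Conn) error {
	hdr := []byte{0, 'S', 'P', 0, 0, 0, 0, 0}
	if _, err := c.Write(hdr); err != nil {
		return err
	}
	var peer [8]byte
	_, err := io.ReadFull(c, peer[:])
	return err
}

// acceptLoop accepts connections and hands the blocking handshake off to a
// goroutine, so a peer that stalls mid-handshake cannot block the listener.
func acceptLoop(l net.Listener, ready chan<- net.Conn) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			if err := negotiate(c); err != nil {
				c.Close() // a bad peer costs only this goroutine
				return
			}
			ready <- c
		}(conn)
	}
}
```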
Your answer is related to a different problem than I was describing, so I will explain again. The problem here is not that a r/w operation may take a long (blocking) time. It is rather that we cannot assume that the network is reliable. What happens regularly is that one side of a two-sided TCP connection goes away in the middle of a conversation. The side currently involved in a blocking operation has no way of recovering from that. Each 'blocking' operation that is not completed adds one resource leak to the system. Moving that blocking operation somewhere else does not change the fact that the operation is now blocked forever.
You are right, it also affects IPC/Unix, so it applies to 3 transports. I understand that you would like to refactor the Accept() path. In addition to Accept(), there is another problem (and maybe this should be in a separate issue): when a connection aborts, the outside-visible effects may be harmless. However, goroutines that are blocked on IO never recover, and each of them is a resource leak. I understand your objection to timeouts and for some time thought about re-using mangos' Option{Read,Send}Deadline for the transport connections. There seems to be no trivial way around the fact that a connection-oriented transport is a leaky abstraction which implies getting stuck on read or write. Hence my question: would it be possible to use a timeout similar to #341, as it deals with both issues (any kind of stuck read/write), is simpler to implement and test, and less intrusive?
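To make the resource-leak point concrete, here is a rough sketch (not mangos code; recvLoop and its buffer size are invented) of the usual per-connection receive loop. If the peer vanishes without a FIN or RST ever reaching this host, Read never returns, and the goroutine, its buffer and the connection stay pinned forever:

```go
package sketch

import "net"

// recvLoop is a typical per-connection receive loop. With no read deadline set,
// conn.Read on a silently dead peer blocks indefinitely, so the deferred Close
// and the goroutine's memory are never reclaimed.
func recvLoop(conn net.Conn, msgs chan<- []byte) {
	defer conn.Close()
	buf := make([]byte, 64*1024)
	for {
		n, err := conn.Read(buf) // may block forever on a dead peer
		if err != nil {
			return // only reached if the kernel ever reports an error
		}
		msg := make([]byte, n)
		copy(msg, buf[:n])
		msgs <- msg
	}
}
```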
This allows connection-oriented protocols to recover from bad network conditions by allowing a limit to be set on the time spent in network I/O (#339). Without the option, bad connections remain hung forever.
Problem not fully solved even with timeouts
We ran into problems after using the above-suggested timeouts in the manner implemented by PR #344. This time the tlstcp connection got stuck in another place. The server and the client got stuck in Accept and Dial, respectively:
There were lots of stuck clients with messages similar to the above (same location). The server was stuck in the tlstcp Accept:

// /opt/TeamCity/buildAgent/work/34f7a3c3b5ec8423/server/syntropy3/src/go/src/nanomsg.org/go-mangos/transport/tlstcp/tlstcp.go:172
func (l *listener) Accept() (mangos.Pipe, error) {
	tconn, err := l.listener.AcceptTCP()
	if err != nil {
		return nil, err
	}
	if err = l.opts.configTCP(tconn); err != nil {
		tconn.Close()
		return nil, err
	}
	conn := tls.Server(tconn, l.config)
	if err = conn.Handshake(); err != nil { // <=== STUCK HERE
	// ...

The client was stuck in the corresponding tlstcp client handshake:

func (d *dialer) Dial() (_ mangos.Pipe, err error) {
	var (
		addr   *net.TCPAddr
		config *tls.Config
	)
	if addr, err = mangos.ResolveTCPAddr(d.addr); err != nil {
		return nil, err
	}
	tconn, err := net.DialTCP("tcp", nil, addr)
	if err != nil {
		return nil, err
	}
	if err = d.opts.configTCP(tconn); err != nil {
		tconn.Close()
		return nil, err
	}
	if v, ok := d.opts[mangos.OptionTLSConfig]; ok {
		config = v.(*tls.Config)
	}
	conn := tls.Client(tconn, config)
	if err = conn.Handshake(); err != nil { // <== STUCK HERE

Hence the problem is not limited to Accept() only. In addition, the …
This allows connection-oriented protocols to recover from bad network conditions by allowing a limit to be set on the time spent in network I/O. As a special case, it allows clients and servers to recover from a bad connection during the initial TLS handshake.
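One common way to bound a TLS handshake, sketched here under the assumption of a configurable handshakeTimeout and not necessarily the way the PR implements it, is to arm a deadline on the connection before calling Handshake and clear it once the handshake succeeds:

```go
package sketch

import (
	"crypto/tls"
	"net"
	"time"
)

// handshakeTimeout is an invented knob for this sketch; the real option name
// and plumbing in mangos may differ.
const handshakeTimeout = 10 * time.Second

// tlsServerHandshake bounds the server-side TLS handshake so a peer that goes
// silent mid-handshake produces a timeout error instead of blocking Accept.
func tlsServerHandshake(tconn net.Conn, cfg *tls.Config) (*tls.Conn, error) {
	conn := tls.Server(tconn, cfg)
	if err := tconn.SetDeadline(time.Now().Add(handshakeTimeout)); err != nil {
		conn.Close()
		return nil, err
	}
	if err := conn.Handshake(); err != nil {
		conn.Close()
		return nil, err
	}
	// Clear the deadline so normal traffic on the pipe is not affected.
	if err := tconn.SetDeadline(time.Time{}); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}
```

Clearing the deadline afterwards matters, since otherwise ordinary reads and writes on the established pipe would start failing with spurious timeouts.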
We are encountering a lock-up on a tlstcp REP listener if a single incoming TLS/TCP connection is bad (ends prematurely). This could be a problem with other transports, too.
Problem description
We are running mangos v1, which we pull from git:nanomsg.org/go-mangos. Occasionally it happens that tlstcp.go:Accept() fails. The failure has to do with mangos.NewConnPipe(). The symptom in each case follows the same pattern as the gdb trace below:
The call is stuck in Accept(). NewConnPipe was stuck in the handshake; the p.handshake was stuck in a binary read, and the binary read was stuck in io.ReadFull. According to the io documentation, ReadFull will block until it has read exactly len(d.buf) bytes. This would block indefinitely if the TLS connection does not have any more data in the pipeline (which is confirmed by the calls to net.(*netFD).Read and runtime.netpollblock in the trace).
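For reference, the blocking pattern described above boils down to something like the following simplified sketch (not the real NewConnPipe handshake format): binary.Read on a fixed-size header ends up in io.ReadFull, which only returns once every byte has arrived or the connection reports an error.

```go
package sketch

import (
	"encoding/binary"
	"net"
)

// readHeader mimics the handshake read: a fixed-size header is read with
// binary.Read, which fills the value via io.ReadFull under the hood. If the
// peer never sends the remaining bytes while the TCP connection stays
// ESTABLISHED, this call never returns.
func readHeader(conn net.Conn) ([8]byte, error) {
	var hdr [8]byte
	err := binary.Read(conn, binary.BigEndian, &hdr) // blocks until 8 bytes or error
	return hdr, err
}
```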
The net.Conn interface has a SetReadDeadline call which prevents hanging on read. (There is a combined SetDeadline which sets both read/write timeouts.) Currently the mangos v1 code does not call SetDeadline or SetReadDeadline. The tls.Conn which is used by the tlstcp module implements net.Conn, but no r/w deadline is set on it either. The result is that the PipeListener.Accept() of the tlstcp module now becomes blocking, since it reads and writes on the new connection before returning. It will hang in this place (core.go:687 in the above gdb trace).
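For illustration, the deadline mechanism looks roughly like this; it is a generic net.Conn sketch rather than a patch to mangos, and it relies on os.ErrDeadlineExceeded (Go 1.15+) to recognize the timeout:

```go
package sketch

import (
	"errors"
	"io"
	"net"
	"os"
	"time"
)

// readWithDeadline arms a read deadline, attempts the fixed-size read, and
// turns a dead peer into a timeout error instead of an indefinite block.
func readWithDeadline(conn net.Conn, buf []byte, d time.Duration) error {
	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return err
	}
	if _, err := io.ReadFull(conn, buf); err != nil {
		if errors.Is(err, os.ErrDeadlineExceeded) {
			// The peer never delivered the bytes within d; close the
			// connection so the caller and its goroutine can be reclaimed.
			conn.Close()
		}
		return err
	}
	// Disarm the deadline so later reads are not affected.
	return conn.SetReadDeadline(time.Time{})
}
```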
We are able to confirm this by looking at the host's connections. The tcp/tls listener is on 10.55.220.81:20083:
The incoming listener queue is full (we use the default 128 for somaxconn):
We were able (via lsof -p) to identify the one hanging connection as 10.55.220.81:20083->172.25.1.9:57146 (ESTABLISHED).
For this socket, the receive/send queues are empty on the receiving host:
On the sending host (172.25.1.9), the connection with source port 57146 was long gone.
Conclusion and proposed fix
The PipeListener.Accept is unlike net.Listener's Accept, since it both reads and writes something on the connection before it returns.
In the case of a broken connection, this will hang, since socket r/w deadlines are not set; there is currently no way to set them.
One way would be to see if OptionRecvDeadline and/or OptionSendDeadline are set by the caller, and to reuse those for the transport socket. If not present, this could use some "sane" fallback (e.g. 1 second) to avoid blocking indefinitely.
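A rough sketch of what that reuse could look like is below; the opts map, the literal "RECV-DEADLINE" key and the 1-second fallback are placeholders for illustration, not the actual mangos v1 plumbing:

```go
package sketch

import (
	"net"
	"time"
)

// deadlineFor reuses a caller-supplied deadline option if present, otherwise
// falls back to a "sane" default so a broken peer cannot block the pipe
// handshake forever.
func deadlineFor(opts map[string]interface{}, key string) time.Duration {
	if v, ok := opts[key]; ok {
		if d, ok := v.(time.Duration); ok && d > 0 {
			return d
		}
	}
	return time.Second // fallback when the caller did not set the option
}

// applyHandshakeDeadline arms a combined read/write deadline on the freshly
// accepted (or dialed) connection before NewConnPipe-style negotiation runs.
func applyHandshakeDeadline(conn net.Conn, opts map[string]interface{}) error {
	// "RECV-DEADLINE" is meant to mirror mangos' OptionRecvDeadline key here.
	return conn.SetDeadline(time.Now().Add(deadlineFor(opts, "RECV-DEADLINE")))
}
```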