Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncSocket is failing to send data under backpressure (Mac,Linux) #15003

Closed
snej opened this issue Jul 16, 2020 · 11 comments · Fixed by #15012
Closed

AsyncSocket is failing to send data under backpressure (Mac,Linux) #15003

snej opened this issue Jul 16, 2020 · 11 comments · Fixed by #15012

Comments

@snej
Copy link

snej commented Jul 16, 2020

AsyncSocket.send loses data in the presence of backpressure, i.e. when the receiver reads data more slowly than the sender sends it. At some point soon after the OS's send buffer fills up, some number of pending send calls are ignored: their data is never written to the actual socket, and their Futures are not completed.

This is quite reproducible on macOS 10.14 and 10.15; I don't have convenient access to other platforms to test on.
Occurs in Nim 1.2.2 as well as latest devel.

Example

# AsyncSocketBug.nim
# Jens Alfke (@snej) -- 16 July 2020
# Demonstrates data loss by Nim's AsyncSocket.
# Just run it, and it will raise an assertion failure within a minute.

import asyncdispatch, asyncnet, strformat, strutils, sugar

const FrameSize = 9999   # Exact size not important, but larger sizes fail quicker

proc runServer() {.async.} =
    # Server side:
    var server = newAsyncSocket()
    server.bindAddr(Port(9001))
    server.listen()
    let client = await server.accept()
    echo "Server got client connection"
    var lastN = 0
    while true:
        let frame = await client.recv(FrameSize)
        assert frame.len == FrameSize
        let n = frame[0..<6].parseInt()
        echo "RCVD #", n, ":  ", frame[0..80], "..."
        if n != lastN + 1:
            echo &"******** ERROR: Server received #{n}, but last was #{lastN}!"
        assert n == lastN + 1
        lastN = n
        await sleepAsync 100


proc main() {.async.} =
    asyncCheck runServer()

    # Client side:
    let socket = newAsyncSocket()
    await socket.connect("localhost", Port(9001))
    echo "Client socket connected"

    var sentCount = 0
    var completedCount = 0

    while true:
        sentCount += 1
        let n = sentCount

        var message = &"{n:06} This is message #{n} of ∞. Please stay tuned for more. "
        #echo ">>> ", message
        while message.len < FrameSize:
            message = message & message
        let frame = message[0..<FrameSize]

        capture n:
            socket.send(frame).addCallback proc(f: Future[void]) =
                # Callback when the send completes:
                assert not f.failed
                echo "SENT #", n
                if n != completedCount + 1:
                    echo &"******** ERROR: Client completed #{n}, but last completed was #{completedCount}!"
                # If this assert is enabled, it will trigger earlier than the server-side assert above:
                #assert count == completedCount + 1
                completedCount = n
        await sleepAsync 1

waitFor main()

Current Output

Client socket connected
Server got client connection
SENT #1
RCVD #1:  000001 This is message #1 of ∞. Please stay tuned for more. 000001 This is mess...
SENT #2
SENT #3
SENT #4
SENT #5
SENT #6
SENT #7
SENT #8

.....

SENT #111
SENT #2074
******** ERROR: Client completed #2074, but last completed was #111!
SENT #2075
SENT #2076

.....

SENT #6312
SENT #6313
RCVD #108:  000108 This is message #108 of ∞. Please stay tuned for more. 000108 This is me...
RCVD #109:  000109 This is message #109 of ∞. Please stay tuned for more. 000109 This is me...
RCVD #110:  000110 This is message #110 of ∞. Please stay tuned for more. 000110 This is me...
RCVD #111:  000111 This is message #111 of ∞. Please stay tuned for more. 000111 This is me...
RCVD #2074:  002074 This is message #2074 of ∞. Please stay tuned for more. 002074 This is m...
******** ERROR: Server received #2074, but last was #111!
fatal.nim(49)            sysFatal
asyncfutures.nim(431)    asyncCheckCallback
Error: unhandled exception: AsyncSocketBug.nim(25, 16) `n == lastN + 1`  [AssertionDefect]

Note that messages 112…2073 were dropped on the floor: neither completed nor sent over the socket.

Expected Output

Should run forever without any assertion failures and without writing any "ERROR" lines.

Additional Information

This completely breaks the WebSocket-based protocol implementation I'm working on, in which one side typically sends large numbers of small messages. The bug did not show up as long as I was running my tests on one machine (over the loopback interface), because the receiver process was able to keep up with the sender (about 15 megabytes/sec); but it showed up very quickly as soon as I ran any tests on two machines, where WiFi bandwidth+latency was a bottleneck.

$ nim -v
Nim Compiler Version 1.3.5 [MacOSX: amd64]
Compiled at 2020-07-16
Copyright (c) 2006-2020 by Andreas Rumpf

active boot switches: -d:release

(Was pulled and built this morning.)

@snej
Copy link
Author

snej commented Jul 16, 2020

Originally reported on https://forum.nim-lang.org/t/6548

@snej snej changed the title AsyncSocket is failing to send data under backpressure (macOS) AsyncSocket is failing to send data under backpressure (Mac,Linux) Jul 17, 2020
@snej
Copy link
Author

snej commented Jul 17, 2020

Two forum users have now reproduced the same bug on Linux (see above thread.)

@qqtop
Copy link

qqtop commented Jul 17, 2020

I have the same issue as reported above , but if you change the code like so:

import strformat, strutils, sugar
import asyncdispatch, asyncnet

let FrameSize = 12000   # Exact size not important, but larger sizes fail quicker

template `<>` (a, b: untyped): untyped =
            not (a == b)

proc runServer() {.async.} =
    # Server side:
    var server = newAsyncSocket()
    server.bindAddr(Port(9001))
    server.listen()
    let client = await server.accept()
    echo "Server got client connection"
    var lastN = 0
    while true:
        let frame = await client.recv(FrameSize)
        assert frame.len == FrameSize
        let n = frame[0..<6].parseInt()
        echo "RCVD #", n, ":  ", frame[0..80], "..."
        inc(lastN)
        if n <> lastN:
            echo &"******** ERROR: Server received #{n}, but last was #{lastN}!"
        assert n == lastN
        lastN = n
        await sleepAsync 1


proc main() {.async.} =
    asyncCheck runServer()

    # Client side:
    let socket = newAsyncSocket()
    await socket.connect("localhost", Port(9001))
    echo "Client socket connected"

    var sentCount = 0
    var completedCount = 0

    while true:
        inc sentCount
        let n = sentCount

        var message = &"{n:06} This is message #{n} of ∞. Please stay tuned for more. "
        #echo ">>> ", message
        while message.len < FrameSize:
            message = message & message
        let frame = message[0..<FrameSize]

        capture n:
            socket.send(frame).addCallback proc(f: Future[void]) =
                # Callback when the send completes:
                assert not f.failed
                echo "SENT #", n
                inc(completedCount)
                if n <> completedCount:
                    echo &"******** ERROR: Client completed #{n}, but last completed was #{completedCount}!"
                # If this assert is enabled, it will trigger earlier than the server-side assert above:
                #assert count == completedCount + 1
                completedCount = n
        await sleepAsync 2

waitFor main()

It seems to work as expected compiled with --gc:boehm or --gc:MarkandSweep .
I let it run to over 1,200,000 messages before I killed it and the memory usage stayed stable.

Compiled with --gc:arc it ran ok but memory usage moved towards 100%

The main change is not to use != and use inc to increment and not + 1 as well as
changing the await in main to read : await sleepAsync 2

All this on debian based parrotOS .

@ghost
Copy link

ghost commented Jul 17, 2020

@qqtop for async you need orc, not arc :)

@qqtop
Copy link

qqtop commented Jul 17, 2020

@Yardanico obviously you are perfectly right and my test right now
shows that with --gc:orc the memory usage also is moving up towards 100% , whereas with boehm and MarkandSweep the
memory usage remained stable.

@ghost
Copy link

ghost commented Jul 17, 2020

@Yardanico obviously you are perfectly right and my test right now
shows that with --gc:orc the memory usage also is moving up towards 100% , whereas with boehm and MarkandSweep the
memory usage remained stable.

Well yes, orc (assuming you're on devel) still kinda leaks memory with async

@snej
Copy link
Author

snej commented Jul 17, 2020

@qqtop:

The main change is not to use != and use inc to increment and not + 1

That's just syntactic; it doesn't affect the meaning of the code

as well as changing the await in main to read : await sleepAsync 2

But you also changed the await in runServer to await sleepAsync 1! That makes the reader side run faster than the sender, which means it's no longer testing the same situation. The bug is triggered when the reader is slower, or at least lags the sender enough to cause backpressure. So I'm not surprised that the test now doesn't fail.

If I take your modified test and simply change sleepAsync 1 back to sleepAsync 100, it fails again.

@qqtop
Copy link

qqtop commented Jul 18, 2020

I changed the code based on previous advise when I ran into exactly the same issue .
Basically a little latency (as little as 1 ) in favour of the server may give the system enough
time to execute correctly, aside of any issues with the gc used.
You also could try to add more more clients or run server and client on different machines and see what happens.
Maybe it is a subtle bug in async or an issue in the linux kernel , but this will require more testing .

@dom96
Copy link
Contributor

dom96 commented Jul 18, 2020

Just tested it on Windows: doesn't reproduce. Reproduces for me on WSL though.

Edit: Fun bug. Thank you for the repro. Still looking into the cause.

Edit2: Fixed #15012

dom96 added a commit that referenced this issue Jul 18, 2020
Fixes #15003.

This is a serious bug which occurs when data cannot be read/sent
immediately and there are a bunch of other read/write events
pending. What happens is that the new events are dropped which
results in the case of the reported bug resulted in some data not
being sent (!).
dom96 added a commit that referenced this issue Jul 18, 2020
Fixes #15003.

This is a serious bug which occurs when data cannot be read/sent
immediately and there are a bunch of other read/write events
pending. What happens is that the new events are dropped which
results in the case of the reported bug resulted in some data not
being sent (!).
Araq pushed a commit that referenced this issue Jul 19, 2020
Fixes #15003.

This is a serious bug which occurs when data cannot be read/sent
immediately and there are a bunch of other read/write events
pending. What happens is that the new events are dropped which
results in the case of the reported bug resulted in some data not
being sent (!).
@snej
Copy link
Author

snej commented Jul 20, 2020

Ah, EAGAIN strikes again... >_<
Thanks a million for the fix, @dom96 !

@snej
Copy link
Author

snej commented Jul 20, 2020

✔️ Verified the fix on macOS 10.15.

narimiran pushed a commit that referenced this issue Jul 21, 2020
Fixes #15003.

This is a serious bug which occurs when data cannot be read/sent
immediately and there are a bunch of other read/write events
pending. What happens is that the new events are dropped which
results in the case of the reported bug resulted in some data not
being sent (!).

(cherry picked from commit 1e3a0ef)
narimiran pushed a commit that referenced this issue Jul 21, 2020
Fixes #15003.

This is a serious bug which occurs when data cannot be read/sent
immediately and there are a bunch of other read/write events
pending. What happens is that the new events are dropped which
results in the case of the reported bug resulted in some data not
being sent (!).

(cherry picked from commit 1e3a0ef)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants