transport: *http2Server.sendQuotaPool doesn't reclaim quota for canceled requests #632

Closed
rhysh opened this issue Apr 9, 2016 · 6 comments · Fixed by #633 or #637
Comments

rhysh (Contributor) commented Apr 9, 2016

I have a gRPC API that streams data to the client. When a single client (*grpc.ClientConn) has issued and canceled several requests to this API, the server stops sending data to the client. The server and client are both written in Go, running with go version go1.6 darwin/amd64 and grpc-go at 7834b97. This appears to be due to incorrect flow control, possibly on the server.

The server included below serves the routeguide.ListFeatures API, responding to each request with a stream of 100 responses of around 1kB each. The client below uses a single *grpc.ClientConn to make a sequence of 1000 requests. It reads the first two streamed responses and then cancels the request.

I'd expect the client to reliably read the first two responses from any number of sequential RPC calls.

Instead, the responses from the server stall. The client receives data for the first 160 or so requests (the exact number is sensitive to network latency), but then the server stops sending data to the client. In my test program, this is indicated by log lines like i=164 j=0 instead of e.g. i=144 j=2. The server is stalled while waiting for the transport's sendQuotaPool:

        // Wait until the transport has some quota to send the data.
        tq, err := wait(s.ctx, t.shutdownChan, t.sendQuotaPool.acquire())

I've logged the values of sq and tq below those lines, and indeed the value of tq starts at 64kB, increases to nearly 1MB, and then slowly decreases to 48B, after which the server is unable to send additional data to the client's TCP connection.

It seems that the bytes of quota reserved for sending data in response to RPC calls are either not correctly refunded on the server when a request is canceled, or not correctly reimbursed by the client.

Creating a new *grpc.ClientConn with a new TCP connection works around this issue, but with the buggy flow control we can only reliably use each TCP connection for O(1) RPC requests.
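
For reference, a minimal sketch of that workaround (the callOnce helper is hypothetical and not part of the reproducer; it assumes the same imports as client.go below): each RPC gets its own connection, so whatever quota the transport fails to reclaim is discarded along with the connection.

// workaround sketch (hypothetical helper, not part of the reproducer)
func callOnce(ctx context.Context, addr string) error {
    cc, err := grpc.Dial(addr, grpc.WithInsecure())
    if err != nil {
        return err
    }
    // Closing the connection discards any send quota the transport leaked.
    defer cc.Close()

    cl := routeguide.NewRouteGuideClient(cc)
    fire, err := cl.ListFeatures(ctx, &routeguide.Rectangle{})
    if err != nil {
        return err
    }
    _, err = fire.Recv() // read what we need; ctx cancellation cleans up the rest
    return err
}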

// server.go
package main

import (
    "flag"
    "log"
    "net"
    "strings"

    "google.golang.org/grpc"
    "google.golang.org/grpc/examples/route_guide/routeguide"
)

func main() {
    addr := flag.String("addr", "127.0.0.1:8080", "Listen address")
    flag.Parse()

    s := grpc.NewServer()
    routeguide.RegisterRouteGuideServer(s, new(guide))

    lis, err := net.Listen("tcp", *addr)
    if err != nil {
        log.Fatalf("Listen err=%q", err)
    }

    err = s.Serve(lis)
    if err != nil {
        log.Fatalf("Serve err=%q", err)
    }
}

type guide struct {
    routeguide.RouteGuideServer
}

func (f *guide) ListFeatures(req *routeguide.Rectangle, srv routeguide.RouteGuide_ListFeaturesServer) error {
    // Stream 100 responses of around 1kB each.
    for i := 0; i < 100; i++ {
        err := srv.Send(&routeguide.Feature{
            Name: strings.Repeat("a", 1<<10),
        })
        if err != nil {
            return err
        }
    }
    return nil
}
// client.go
package main

import (
    "flag"
    "io"
    "log"
    "time"

    "golang.org/x/net/context"
    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/examples/route_guide/routeguide"
)

func main() {
    addr := flag.String("addr", "127.0.0.1:8080", "Server address")
    flag.Parse()

    ctx := context.Background()

    cc, err := grpc.Dial(*addr, grpc.WithInsecure())
    if err != nil {
        log.Fatalf("Dial err=%q", err)
    }
    cl := routeguide.NewRouteGuideClient(cc)

    // Issue 1000 sequential RPCs over the single *grpc.ClientConn.
    for i := 0; i < 1000; i++ {
        ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
        fire, err := cl.ListFeatures(ctx, &routeguide.Rectangle{})
        if err != nil {
            log.Fatalf("ListFeatures err=%q", err)
        }

        // Read only the first two streamed responses.
        j := 0
    loop:
        for ; j < 2; j++ {
            _, err := fire.Recv()
            if err != nil {
                if err == io.EOF {
                    break loop
                }
                switch grpc.Code(err) {
                case codes.DeadlineExceeded:
                    break loop
                default:
                    log.Fatalf("Recv err=%q", err)
                }
            }
        }
        log.Printf("i=%d j=%d", i, j)

        // Cancel the RPC; flip to false to rely on the 1s deadline instead.
        if true {
            cancel()
        }

        <-ctx.Done()
    }
}
iamqizhao (Contributor) commented:

Thanks for reporting. I have a fix and will polish it and send it out tomorrow.

rhysh (Contributor, Author) commented Apr 12, 2016

Thanks for the quick turnaround on this. The situation is improved somewhat, but this issue is still present. The transport's window shrinks to zero more slowly than before, but it still shrinks and requests still stall. Is there more than one leak?

With #633, the reproducer processes can exchange around 2400–2600 RPCs on my laptop before stalling. Without it, they can exchange only 140–160.

When I adjust my reproducer to send fewer response messages (one per request) so the requests are not canceled, the client and server can exchange 1,000,000 RPCs with no hiccups.
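
The adjusted reproducer is not shown in the thread; one reading of the change, reusing the names from the reproducer above, is to send a single response per request on the server and drain the stream to io.EOF on the client instead of canceling:

// server.go, adjusted: one response per request
func (f *guide) ListFeatures(req *routeguide.Rectangle, srv routeguide.RouteGuide_ListFeaturesServer) error {
    return srv.Send(&routeguide.Feature{Name: strings.Repeat("a", 1<<10)})
}

// client.go, adjusted: replace the labeled read loop with a drain to io.EOF
for {
    _, err := fire.Recv()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatalf("Recv err=%q", err)
    }
}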

rhysh (Contributor, Author) commented Apr 12, 2016

I added logging to google.golang.org/grpc/transport.(*http2Client).handleData, which indicates that the amount of data ready for the application to read (t.fc.pendingData) grows as messages arrive and does not shrink when streams are closed. It grows to around 3/4 of a MB when communication stalls. This contrasts with t.fc.pendingUpdate, which grows to around 1/4 of a MB and then shrinks, likely as a window update is sent.

It looks like the receiver's pendingData, the receiver's pendingUpdate, and the sender's transport window tq sum to a roughly constant value of 1MB once communication is established.

The receiver seems to be collecting, in its pendingData pool, a pile of byte-tokens for data it will never read. Is that what's depriving the sender's tq window of the byte-tokens it needs to continue communicating?

    // pendingData is the overall data which have been received but not been
    // consumed by applications.
    pendingData uint32
    // The amount of data the application has consumed but grpc has not sent
    // window update for them. Used to reduce window update frequency.
    pendingUpdate uint32
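
To make that hypothesis concrete, here is a toy model of the accounting those fields describe. It is a sketch, not the transport's inFlow type, and the limit/4 window-update threshold is an assumption: bytes that arrive for canceled streams are never read by the application, so they accumulate in pendingData, no window update is generated for them, and the sender's transport window is never refilled.

// flowmodel.go: a toy model of receive-side flow-control accounting (sketch only)
package main

import "fmt"

type inFlow struct {
    limit         uint32 // connection-level window, roughly 1MB once established
    pendingData   uint32 // received but not yet consumed by the application
    pendingUpdate uint32 // consumed, but no window update sent yet
}

// onData records bytes arriving from the sender.
func (f *inFlow) onData(n uint32) { f.pendingData += n }

// onRead records the application consuming bytes and returns the size of the
// window update to send, if the pending amount has crossed the threshold.
func (f *inFlow) onRead(n uint32) uint32 {
    f.pendingData -= n
    f.pendingUpdate += n
    if f.pendingUpdate >= f.limit/4 { // assumed threshold
        w := f.pendingUpdate
        f.pendingUpdate = 0
        return w
    }
    return 0
}

func main() {
    f := &inFlow{limit: 1 << 20}
    senderWindow := f.limit
    for senderWindow > 0 {
        // Each canceled stream delivers roughly 100kB of responses...
        n := uint32(100 << 10)
        if n > senderWindow {
            n = senderWindow
        }
        f.onData(n)
        senderWindow -= n
        // ...but the application reads only the first two ~1kB messages.
        read := uint32(2 << 10)
        if read > n {
            read = n
        }
        senderWindow += f.onRead(read)
    }
    // pendingData ends up near the full window while pendingUpdate stays small,
    // so no update is sent and the sender stalls.
    fmt.Printf("pendingData=%d pendingUpdate=%d senderWindow=%d\n",
        f.pendingData, f.pendingUpdate, senderWindow)
}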

iamqizhao (Contributor) commented:

I'm not sure how you logged it. I did observe t.fc.pendingData drop (the max is about 16k). But yes, I do observe that the problem persists.

iamqizhao reopened this Apr 12, 2016
iamqizhao (Contributor) commented:

I think I found the problem. The fix will be out tomorrow.

rhysh (Contributor, Author) commented Apr 14, 2016

It looks like 963ee99 fixes the problem for me. Thanks!

iamqizhao pushed a commit that referenced this issue Dec 1, 2016
Remove the add and cancel methods of quotaPool. Their use is not required, and
leads to logical races when used concurrently from multiple goroutines. Rename
the reset method to add.

The typical way that a goroutine claims quota is to call the add method and
then to select on the channel returned by the acquire method. If two
goroutines are both trying to claim quota from a single quotaPool, the second
call to the add method can happen before the first attempt to read from the
channel. When that happens the second goroutine to attempt the channel read
will end up waiting for a very long time, in spite of its efforts to prepare
the channel for reading.

The quotaPool will always behave correctly when any positive quota is on the
channel rather than stored in the struct field. In the opposite case, when
positive quota is only in the struct field and not on the channel, users of
the quotaPool can fail to access available quota. Err on the side of storing
any positive quota in the channel.

This includes a reproducer for #632, which fails on many runs with this
package at v1.0.4. The frequency of the test failures depends on how stressed
the server is, since it's now effectively checking for weird interleavings of
goroutines. It passes reliably with these changes to the transport package.

The behavior described in #734 (an RPC with a streaming response hangs
unexpectedly) matches what I've seen in my programs, and what I see in the
test case added here. If it's a logical flow control bug, this change may well
fix it.

Updates #632
Updates #734
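
A minimal sketch of a quota pool that follows the rule the commit message describes, keeping any positive balance on the channel so a goroutine selecting on acquire() always sees it. This is an illustration, not the actual grpc-go transport code.

// quotapool_sketch.go: illustration of the "positive quota lives on the channel" rule
package main

import (
    "fmt"
    "sync"
)

type quotaPool struct {
    c chan int // carries the entire positive balance, capacity 1

    mu    sync.Mutex
    quota int // holds only a zero or negative balance between adds
}

func newQuotaPool(q int) *quotaPool {
    qp := &quotaPool{c: make(chan int, 1)}
    if q > 0 {
        qp.c <- q
    } else {
        qp.quota = q
    }
    return qp
}

// add credits v to the pool (v may be negative), folds in anything already
// buffered on the channel, and pushes the balance back out if it is positive.
func (qp *quotaPool) add(v int) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    select {
    case n := <-qp.c:
        qp.quota += n
    default:
    }
    qp.quota += v
    if qp.quota > 0 {
        qp.c <- qp.quota
        qp.quota = 0
    }
}

// acquire returns the channel a sender selects on to claim all available quota.
func (qp *quotaPool) acquire() <-chan int {
    return qp.c
}

func main() {
    qp := newQuotaPool(64 * 1024) // e.g. an initial 64kB window
    got := <-qp.acquire()         // claim everything available
    sent := 1024                  // suppose only 1kB of it was actually sent
    qp.add(got - sent)            // return the unused portion
    fmt.Println(<-qp.acquire())   // prints 64512
}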