transport: *http2Server.sendQuotaPool doesn't reclaim quota for canceled requests #632

Closed
rhysh opened this issue Apr 9, 2016 · 6 comments · Fixed by #633 or #637
Comments

rhysh (Contributor) commented Apr 9, 2016

I have a gRPC API that streams data to the client. When a single client (*grpc.ClientConn) has issued and canceled several requests to this API, the server stops sending data to the client. The server and client are both written in Go, running with go version go1.6 darwin/amd64 and grpc-go at 7834b97. This appears to be due to incorrect flow control, possibly on the server.

The server included below serves the routeguide.ListFeatures API, responding to each request with a stream of 100 responses of around 1kB each. The client below uses a single *grpc.ClientConn to make a sequence of 1000 requests. It reads the first two streamed responses and then cancels the request.

I'd expect the client to reliably read the first two responses from any number of sequential RPC calls.

Instead, the responses from the server stall. The client receives data for the first 160 or so requests (the exact number is sensitive to network latency), but then the server stops sending data to the client. In my test program, this is indicated by log lines like i=164 j=0 instead of e.g. i=144 j=2. The server is stalled while waiting for the transport's sendQuotaPool:

        // Wait until the transport has some quota to send the data.
        tq, err := wait(s.ctx, t.shutdownChan, t.sendQuotaPool.acquire())

I've logged the values of sq and tq below those lines, and indeed the value of tq starts at 64kB, increases to nearly 1MB, and then slowly decreases to 48B, after which the server is unable to send additional data to the client's TCP connection.

It seems that the bytes of quota reserved for sending data in response to RPC calls are either not correctly refunded on the server when a request is canceled, or not correctly reimbursed by the client.

Creating a new *grpc.ClientConn with a new TCP connection works around this issue, but with the buggy flow control we can only reliably use each TCP connection for O(1) RPC requests.
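
For reference, a minimal sketch of that workaround (the callOnce helper is hypothetical and not part of the reproducer; it assumes the same imports as client.go below): each RPC gets its own connection, so whatever quota the transport fails to reclaim is discarded along with the connection.

// workaround sketch (hypothetical helper, not part of the reproducer)
func callOnce(ctx context.Context, addr string) error {
    cc, err := grpc.Dial(addr, grpc.WithInsecure())
    if err != nil {
        return err
    }
    // Closing the connection discards any send quota the transport leaked.
    defer cc.Close()

    cl := routeguide.NewRouteGuideClient(cc)
    fire, err := cl.ListFeatures(ctx, &routeguide.Rectangle{})
    if err != nil {
        return err
    }
    _, err = fire.Recv() // read what we need; ctx cancellation cleans up the rest
    return err
}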

// server.go
package main

import (
    "flag"
    "log"
    "net"
    "strings"

    "google.golang.org/grpc"
    "google.golang.org/grpc/examples/route_guide/routeguide"
)

func main() {
    addr := flag.String("addr", "127.0.0.1:8080", "Listen address")
    flag.Parse()

    s := grpc.NewServer()
    routeguide.RegisterRouteGuideServer(s, new(guide))

    lis, err := net.Listen("tcp", *addr)
    if err != nil {
        log.Fatalf("Listen err=%q", err)
    }

    err = s.Serve(lis)
    if err != nil {
        log.Fatalf("Serve err=%q", err)
    }
}

type guide struct {
    routeguide.RouteGuideServer
}

func (f *guide) ListFeatures(req *routeguide.Rectangle, srv routeguide.RouteGuide_ListFeaturesServer) error {
    // Stream 100 responses of around 1kB each.
    for i := 0; i < 100; i++ {
        err := srv.Send(&routeguide.Feature{
            Name: strings.Repeat("a", 1<<10),
        })
        if err != nil {
            return err
        }
    }
    return nil
}
// client.go
package main

import (
    "flag"
    "io"
    "log"
    "time"

    "golang.org/x/net/context"
    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/examples/route_guide/routeguide"
)

func main() {
    addr := flag.String("addr", "127.0.0.1:8080", "Server address")
    flag.Parse()

    ctx := context.Background()

    cc, err := grpc.Dial(*addr, grpc.WithInsecure())
    if err != nil {
        log.Fatalf("Dial err=%q", err)
    }
    cl := routeguide.NewRouteGuideClient(cc)

    // Issue 1000 sequential RPCs over the single *grpc.ClientConn.
    for i := 0; i < 1000; i++ {
        ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
        fire, err := cl.ListFeatures(ctx, &routeguide.Rectangle{})
        if err != nil {
            log.Fatalf("ListFeatures err=%q", err)
        }

        // Read only the first two streamed responses.
        j := 0
    loop:
        for ; j < 2; j++ {
            _, err := fire.Recv()
            if err != nil {
                if err == io.EOF {
                    break loop
                }
                switch grpc.Code(err) {
                case codes.DeadlineExceeded:
                    break loop
                default:
                    log.Fatalf("Recv err=%q", err)
                }
            }
        }
        log.Printf("i=%d j=%d", i, j)

        // Cancel the RPC; flip to false to rely on the 1s deadline instead.
        if true {
            cancel()
        }

        <-ctx.Done()
    }
}
iamqizhao (Contributor) commented:

Thanks for reporting. I have a fix and will polish it and send it out tomorrow.

rhysh (Contributor, Author) commented Apr 12, 2016

Thanks for the quick turnaround on this. The situation is improved somewhat, but this issue is still present. The transport's window shrinks to zero more slowly than before, but it still shrinks and requests still stall. Is there more than one leak?

With #633, the reproducer processes can exchange around 2400–2600 RPCs on my laptop before stalling. Without it, they can exchange only 140–160.

When I adjust my reproducer to send fewer response messages (one per request) so the requests are not canceled, the client and server can exchange 1,000,000 RPCs with no hiccups.
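
The adjusted reproducer is not shown in the thread; one reading of the change, reusing the names from the reproducer above, is to send a single response per request on the server and drain the stream to io.EOF on the client instead of canceling:

// server.go, adjusted: one response per request
func (f *guide) ListFeatures(req *routeguide.Rectangle, srv routeguide.RouteGuide_ListFeaturesServer) error {
    return srv.Send(&routeguide.Feature{Name: strings.Repeat("a", 1<<10)})
}

// client.go, adjusted: replace the labeled read loop with a drain to io.EOF
for {
    _, err := fire.Recv()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatalf("Recv err=%q", err)
    }
}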

rhysh (Contributor, Author) commented Apr 12, 2016

I added logging to google.golang.org/grpc/transport.(*http2Client).handleData, which indicates that the amount of data ready for the application to read (t.fc.pendingData) grows as messages arrive and does not shrink when streams are closed. It grows to around 3/4 of a MB when communication stalls. This contrasts with t.fc.pendingUpdate, which grows to around 1/4 of a MB and then shrinks, likely as a window update is sent.

It looks like the receiver's pendingData, the receiver's pendingUpdate, and the sender's transport window tq sum to a roughly constant value of 1MB once communication is established.

The receiver seems to be collecting, in its pendingData pool, a pile of byte-tokens for data it will never read. Is that what's depriving the sender's tq window of the byte-tokens it needs to continue communicating?

    // pendingData is the overall data which have been received but not been
    // consumed by applications.
    pendingData uint32
    // The amount of data the application has consumed but grpc has not sent
    // window update for them. Used to reduce window update frequency.
    pendingUpdate uint32
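
To make that hypothesis concrete, here is a toy model of the accounting those fields describe. It is a sketch, not the transport's inFlow type, and the limit/4 window-update threshold is an assumption: bytes that arrive for canceled streams are never read by the application, so they accumulate in pendingData, no window update is generated for them, and the sender's transport window is never refilled.

// flowmodel.go: a toy model of receive-side flow-control accounting (sketch only)
package main

import "fmt"

type inFlow struct {
    limit         uint32 // connection-level window, roughly 1MB once established
    pendingData   uint32 // received but not yet consumed by the application
    pendingUpdate uint32 // consumed, but no window update sent yet
}

// onData records bytes arriving from the sender.
func (f *inFlow) onData(n uint32) { f.pendingData += n }

// onRead records the application consuming bytes and returns the size of the
// window update to send, if the pending amount has crossed the threshold.
func (f *inFlow) onRead(n uint32) uint32 {
    f.pendingData -= n
    f.pendingUpdate += n
    if f.pendingUpdate >= f.limit/4 { // assumed threshold
        w := f.pendingUpdate
        f.pendingUpdate = 0
        return w
    }
    return 0
}

func main() {
    f := &inFlow{limit: 1 << 20}
    senderWindow := f.limit
    for senderWindow > 0 {
        // Each canceled stream delivers roughly 100kB of responses...
        n := uint32(100 << 10)
        if n > senderWindow {
            n = senderWindow
        }
        f.onData(n)
        senderWindow -= n
        // ...but the application reads only the first two ~1kB messages.
        read := uint32(2 << 10)
        if read > n {
            read = n
        }
        senderWindow += f.onRead(read)
    }
    // pendingData ends up near the full window while pendingUpdate stays small,
    // so no update is sent and the sender stalls.
    fmt.Printf("pendingData=%d pendingUpdate=%d senderWindow=%d\n",
        f.pendingData, f.pendingUpdate, senderWindow)
}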

iamqizhao (Contributor) commented:

I'm not sure how you logged it. I did observe t.fc.pendingData drop (the max is about 16k). But yes, I do observe that the problem persists.

iamqizhao reopened this Apr 12, 2016
iamqizhao (Contributor) commented:

I think I found the problem. The fix will be out tomorrow.

rhysh (Contributor, Author) commented Apr 14, 2016

It looks like 963ee99 fixes the problem for me. Thanks!

iamqizhao pushed a commit that referenced this issue Dec 1, 2016
Remove the add and cancel methods of quotaPool. Their use is not required, and
leads to logical races when used concurrently from multiple goroutines. Rename
the reset method to add.

The typical way that a goroutine claims quota is to call the add method and
then to select on the channel returned by the acquire method. If two
goroutines are both trying to claim quota from a single quotaPool, the second
call to the add method can happen before the first attempt to read from the
channel. When that happens the second goroutine to attempt the channel read
will end up waiting for a very long time, in spite of its efforts to prepare
the channel for reading.

The quotaPool will always behave correctly when any positive quota is on the
channel rather than stored in the struct field. In the opposite case, when
positive quota is only in the struct field and not on the channel, users of
the quotaPool can fail to access available quota. Err on the side of storing
any positive quota in the channel.

This includes a reproducer for #632, which fails on many runs with this
package at v1.0.4. The frequency of the test failures depends on how stressed
the server is, since it's now effectively checking for weird interleavings of
goroutines. It passes reliably with these changes to the transport package.

The behavior described in #734 (an RPC with a streaming response hangs
unexpectedly) matches what I've seen in my programs, and what I see in the
test case added here. If it's a logical flow control bug, this change may well
fix it.

Updates #632
Updates #734
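
A minimal sketch of a quota pool that follows the rule the commit message describes, keeping any positive balance on the channel so a goroutine selecting on acquire() always sees it. This is an illustration, not the actual grpc-go transport code.

// quotapool_sketch.go: illustration of the "positive quota lives on the channel" rule
package main

import (
    "fmt"
    "sync"
)

type quotaPool struct {
    c chan int // carries the entire positive balance, capacity 1

    mu    sync.Mutex
    quota int // holds only a zero or negative balance between adds
}

func newQuotaPool(q int) *quotaPool {
    qp := &quotaPool{c: make(chan int, 1)}
    if q > 0 {
        qp.c <- q
    } else {
        qp.quota = q
    }
    return qp
}

// add credits v to the pool (v may be negative), folds in anything already
// buffered on the channel, and pushes the balance back out if it is positive.
func (qp *quotaPool) add(v int) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    select {
    case n := <-qp.c:
        qp.quota += n
    default:
    }
    qp.quota += v
    if qp.quota > 0 {
        qp.c <- qp.quota
        qp.quota = 0
    }
}

// acquire returns the channel a sender selects on to claim all available quota.
func (qp *quotaPool) acquire() <-chan int {
    return qp.c
}

func main() {
    qp := newQuotaPool(64 * 1024) // e.g. an initial 64kB window
    got := <-qp.acquire()         // claim everything available
    sent := 1024                  // suppose only 1kB of it was actually sent
    qp.add(got - sent)            // return the unused portion
    fmt.Println(<-qp.acquire())   // prints 64512
}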