Pre-allocate response body using Content-Length #1565

tysonmote · 2022-01-20T00:59:22Z

This saves us allocations when the response body is greater than 512 bytes. For large responses, like multi-megabyte Lambda Invoke responses, this can be a very significant number of allocations and slice copies because ioutil.ReadAll / io.ReadAll start with 512 bytes and follow append's growth rules:

I saw this when profiling some code that makes heavy use of the Lambda Invoke method:

This saves us allocations when the response body is greater than 512 bytes. For large responses, like multi-megabyte Lambda Invoke responses, this can be a very significant number of allocations and slice copies because ioutil.ReadAll / io.ReadAll start with 512 bytes and follow append's growth rules: * https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/io/io.go;drc=dc289d3dcb59f80b9e23c7e8f237628359d21d92;l=627 * https://cs.opensource.google/go/go/%20/master:src/runtime/slice.go;l=166?q=growslice

tysonmote · 2022-01-20T01:00:40Z

I had trouble getting tests to run locally (I'm not a Java person and I couldn't figure out the right incantation of configuration/magic), so I'm hoping the build will surface any issues. :)

jasdel · 2022-01-20T17:37:02Z

Thanks for putting this PR together @tysonmote, and adding the concept of pre-allocating buffer for byte slice and string payloads. We'll get this PR reviewed and submit feedback. I've also kicked off the GitHub CI tasks for testing.

Generating the API clients (and protocol test) is currently the best way to validate the codegen module's behavior.

jasdel · 2022-01-20T19:41:47Z

Also, out of curiosity was this flamegraph on allocated objects, or time?

tysonmote · 2022-01-20T20:01:21Z

The flamegraph is CPU time. This is from a worker that calls Invoke thousands of times per second and then forwards results to another plain-vanilla HTTP API, so it's a bit of a stress test of the AWS SDK request path. In the cases where the Invoke response is large, this causes an outsized amount of work being done to grow and copy the io.ReadAll slice's underlying array. This also causes a lot of GC work, as well, to clean up all those intermediate arrays. To iteratively grow a slice from 512 bytes to 1 MB is roughly 34 array allocation and copy operations.

jasdel · 2022-01-20T21:08:16Z

Updated PR to limit contentLength parameter to non-streaming blob or string deserialization functions, and added benchmark for byte slice, and string payloads. Comparing impact of change. Benchmarks used 12MB payloads, which are far larger than most of these operations will use, but exercised the impact proportionally.

Amazon S3 (`service/s3`) GetBucketPolicy benchmark

name                old time/op    new time/op    delta
GetBucketPolicy-12    12.7ms ± 2%     8.7ms ± 2%  -31.60%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
GetBucketPolicy-12    83.4MB ± 0%    50.4MB ± 0%  -39.60%  (p=0.000 n=10+10)

name                old allocs/op  new allocs/op  delta
GetBucketPolicy-12       256 ± 0%       216 ± 0%  -15.49%  (p=0.000 n=10+10)

Amazon EventBridge Schema Registry (`service/schemas`) GetCodeBindingSource benchmarks:

name                     old time/op    new time/op    delta
GetCodeBindingSource-12    10.8ms ± 3%     7.4ms ± 4%  -31.79%  (p=0.000 n=10+10)

name                     old alloc/op   new alloc/op   delta
GetCodeBindingSource-12    70.8MB ± 0%    37.8MB ± 0%  -46.64%  (p=0.000 n=10+10)

name                     old allocs/op  new allocs/op  delta
GetCodeBindingSource-12       241 ± 0%       200 ± 0%  -17.01%  (p=0.000 n=10+10)

…d limit downcast to bytes usage

S3 GetBucketPolicy benchmark: name old time/op new time/op delta GetBucketPolicy-12 12.7ms ± 2% 8.7ms ± 2% -31.60% (p=0.000 n=9+9) name old alloc/op new alloc/op delta GetBucketPolicy-12 83.4MB ± 0% 50.4MB ± 0% -39.60% (p=0.000 n=10+10) name old allocs/op new allocs/op delta GetBucketPolicy-12 256 ± 0% 216 ± 0% -15.49% (p=0.000 n=10+10) Schemas GetCodeBindingSource benchmarks: name old time/op new time/op delta GetCodeBindingSource-12 10.8ms ± 3% 7.4ms ± 4% -31.79% (p=0.000 n=10+10) name old alloc/op new alloc/op delta GetCodeBindingSource-12 70.8MB ± 0% 37.8MB ± 0% -46.64% (p=0.000 n=10+10) name old allocs/op new allocs/op delta GetCodeBindingSource-12 241 ± 0% 200 ± 0% -17.01% (p=0.000 n=10+10)

Updates SDK API client deserialization to pre-allocate byte slice and string response payloads. This saves us allocations when the response body is greater than 512 bytes. For large responses, like multi-megabyte Lambda Invoke responses, this can be a very significant number of allocations and slice copies because ioutil.ReadAll / io.ReadAll start with 512 bytes and follow append's growth rules: * https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/io/io.go;drc=dc289d3dcb59f80b9e23c7e8f237628359d21d92;l=627 * https://cs.opensource.google/go/go/%20/master:src/runtime/slice.go;l=166?q=growslice S3 GetBucketPolicy benchmark: name old time/op new time/op delta GetBucketPolicy-12 12.7ms ± 2% 8.7ms ± 2% -31.60% (p=0.000 n=9+9) name old alloc/op new alloc/op delta GetBucketPolicy-12 83.4MB ± 0% 50.4MB ± 0% -39.60% (p=0.000 n=10+10) name old allocs/op new allocs/op delta GetBucketPolicy-12 256 ± 0% 216 ± 0% -15.49% (p=0.000 n=10+10) Schemas GetCodeBindingSource benchmarks: name old time/op new time/op delta GetCodeBindingSource-12 10.8ms ± 3% 7.4ms ± 4% -31.79% (p=0.000 n=10+10) name old alloc/op new alloc/op delta GetCodeBindingSource-12 70.8MB ± 0% 37.8MB ± 0% -46.64% (p=0.000 n=10+10) name old allocs/op new allocs/op delta GetCodeBindingSource-12 241 ± 0% 200 ± 0% -17.01% (p=0.000 n=10+10) * Add changelog Co-authored-by: Jason Del Ponte <961963+jasdel@users.noreply.github.com>

tysonmote added 2 commits January 19, 2022 16:51

make generate format

1928414

Nicer formatting

3b50ea4

jasdel added 4 commits January 20, 2022 13:52

Update code generation to only add contentLength param when needed an…

b2eedf4

…d limit downcast to bytes usage

regenerate api clients

67873f2

Add changelog

7b1d966

jasdel force-pushed the respect-content-length branch from 200c3de to 7b1d966 Compare January 20, 2022 22:00

jasdel requested a review from skmcgrail January 20, 2022 22:00

jasdel approved these changes Jan 21, 2022

View reviewed changes

skmcgrail approved these changes Jan 22, 2022

View reviewed changes

jasdel merged commit a1bf7dd into aws:main Jan 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-allocate response body using Content-Length #1565

Pre-allocate response body using Content-Length #1565

tysonmote commented Jan 20, 2022 •

edited

Loading

tysonmote commented Jan 20, 2022

jasdel commented Jan 20, 2022

jasdel commented Jan 20, 2022

tysonmote commented Jan 20, 2022 •

edited

Loading

jasdel commented Jan 20, 2022

Pre-allocate response body using Content-Length #1565

Pre-allocate response body using Content-Length #1565

Conversation

tysonmote commented Jan 20, 2022 • edited Loading

tysonmote commented Jan 20, 2022

jasdel commented Jan 20, 2022

jasdel commented Jan 20, 2022

tysonmote commented Jan 20, 2022 • edited Loading

jasdel commented Jan 20, 2022

Amazon S3 (service/s3) GetBucketPolicy benchmark

Amazon EventBridge Schema Registry (service/schemas) GetCodeBindingSource benchmarks:

tysonmote commented Jan 20, 2022 •

edited

Loading

tysonmote commented Jan 20, 2022 •

edited

Loading

Amazon S3 (`service/s3`) GetBucketPolicy benchmark

Amazon EventBridge Schema Registry (`service/schemas`) GetCodeBindingSource benchmarks: