jar: reduce redundant allocations and copies (2x-3x speedup) #57

aktau · 2022-02-23T23:13:22Z

The io.ReadAll() call is responsible for a bit more than half the allocations observed in a very highly scaled up instance of this [1]. On the local testdata, these commits had quite the positive effect. I ran go test inside the jar/ subdirectory too, and it passed.

Benchmarks:

$ for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
    git checkout $commit
    go build
    hyperfine -i './log4jscanner jar/testdata'
    rm log4jscanner
  done

HEAD is now at https://github.com/aktau/log4jscanner/commit/9d3aceeff48dc332d314d938ef6f3ed7837dfc1e Document commit that removed InitialContext
  Time (mean ± σ):      1.754 s ±  0.260 s    [User: 1.711 s, System: 1.264 s]
  Range (min … max):    1.411 s …  2.182 s    10 runs

HEAD is now at https://github.com/aktau/log4jscanner/commit/2932093ea7873a0df400ec12980ed6206b120ad7 jar: close the zip.File reader before recursing
  Time (mean ± σ):      1.761 s ±  0.312 s    [User: 1.778 s, System: 1.124 s]
  Range (min … max):    1.271 s …  2.239 s    10 runs

HEAD is now at https://github.com/aktau/log4jscanner/commit/f6c1ebcfb928f724f174e9c8ccd818534840a10a jar: prefer io.ReadFull over io.ReadAll
  Time (mean ± σ):     444.8 ms ±  95.3 ms    [User: 353.5 ms, System: 122.3 ms]
  Range (min … max):   337.2 ms … 683.2 ms    10 runs

HEAD is now at 5a28952 jar: reuse buffers for nested .jar's
  Time (mean ± σ):     398.0 ms ±  44.7 ms    [User: 330.6 ms, System: 89.9 ms]
  Range (min … max):   339.7 ms … 471.1 ms    10 runs

[1]:

(pprof) list checkJAR
Total: 213.06GB
         ...
         .     9.89MB    334:           f, err := zf.Open()
         .          .    335:           if err != nil {
         .          .    336:                   return fmt.Errorf("open file %s: %v", p, err)
         .          .    337:           }
         .          .    338:           defer f.Close()
         .   169.96GB    339:           data, err := io.ReadAll(f)
         .          .    340:           if err != nil {
         .          .    341:                   return fmt.Errorf("read file %s: %v", p, err)
         .          .    342:           }
         .          .    343:           br := bytes.NewReader(data)
         .     3.40GB    344:           r2, err := zip.NewReader(br, br.Size())
         ...

(pprof) list ReadAll
Total: 213.06GB
ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go
  113.40GB   169.96GB (flat, cum) 79.77% of Total
         .          .    638:func ReadAll(r Reader) ([]byte, error) {
         .          .    639:   b := make([]byte, 0, 512)
         .          .    640:   for {
         .          .    641:           if len(b) == cap(b) {
         .          .    642:                   // Add more capacity (let append pick how much).
  113.40GB   113.40GB    643:                   b = append(b, 0)[:len(b)]
         .          .    644:           }
         .    56.56GB    645:           n, err := r.Read(b[len(b):cap(b)])
         .          .    646:           b = b[:len(b)+n]
         .          .    647:           if err != nil {
         .          .    648:                   if err == EOF {
         .          .    649:                           err = nil
         .          .    650:                   }

google-cla · 2022-02-23T23:13:25Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

aktau · 2022-02-23T23:14:46Z

I'm not sure why this wants me to sign a CLA. This is from aktau@google.com.

UPDATE: I added my corp email to my GH account now. I wonder how I can make this check run again.

ericchiang · 2022-02-28T18:18:43Z

Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks

$ git checkout upstream/main
$ go test -run=NONE -bench=. ./... >> old.txt
$ git checkout aktau/main
$ go test -run=NONE -bench=. ./... > new.txt
$ benchstat old.txt new.txt 
name             old time/op    new time/op    delta
Parse-8            16.7µs ± 0%    16.8µs ± 0%   ~     (p=1.000 n=1+1)
ParseParallel-8    4.97µs ± 0%    4.97µs ± 0%   ~     (p=1.000 n=1+1)

name             old alloc/op   new alloc/op   delta
Parse-8            4.84kB ± 0%    4.84kB ± 0%   ~     (p=1.000 n=1+1)
ParseParallel-8    4.92kB ± 0%    4.93kB ± 0%   ~     (p=1.000 n=1+1)

name             old allocs/op  new allocs/op  delta
Parse-8              31.0 ± 0%      31.0 ± 0%   ~     (all equal)
ParseParallel-8      31.0 ± 0%      31.0 ± 0%   ~     (all equal)

aktau · 2022-02-28T21:33:00Z

Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks

Indeed. Even with 10 runs per test we get this:

$ git checkout main ; for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do git checkout $commit && ( cd ./jar && go test -run=NONE -bench=. -count=10 | tee ~/tmp/$commit.bench ) ; done
$ benchstat 9d3aceeff48dc332d314d938ef6f3ed7837dfc1e.bench 61ade2d03f307bde6b98ed513f68f01afb9421c9.bench
name             old time/op    new time/op    delta
Parse-4            38.5µs ± 9%    39.5µs ± 9%     ~     (p=0.408 n=7+9)
ParseParallel-4    36.4µs ±23%    31.6µs ±12%  -13.42%  (p=0.035 n=10+9)

name             old alloc/op   new alloc/op   delta
Parse-4            4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.671 n=8+10)
ParseParallel-4    4.94kB ± 0%    4.93kB ± 0%     ~     (p=0.381 n=10+10)

name             old allocs/op  new allocs/op  delta
Parse-4              33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel-4      33.0 ± 0%      33.0 ± 0%     ~     (all equal)

However, due to the fact that we see consistent runtime improvements on /log4jscanner jar/testdata, that might be an indication that the benchmarks are not representative of either the data inside testdata, or the data we see in production.

Indeed, given the codepath I'm changing, I'd expect only nested jars to benefit (and it seems like this is fairly common in production).

To verify, I ran all the jars in testdata individually:

$ git checkout main
TMPDIR="$HOME/tmp/tmpdir"
mkdir "$TMPDIR" || true
for file in jar/testdata/* ; do
  RTMPDIR="$TMPDIR/$(basename $file)"
  mkdir "$RTMPDIR" || true
  ln -fv "$PWD/$file" "$RTMPDIR"
done
for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
  git checkout $commit
  go build
  dirs="$(echo $TMPDIR/* | tr ' ' ',')"
  hyperfine --ignore-failure --warmup 1 --parameter-list dir "$dirs" './log4jscanner {dir}'
  rm log4jscanner
done
 ...
Summary
  './log4jscanner /home/aktau/tmp/tmpdir/400mb.jar' ran
    1.12 ± 1.12 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar.patched'
    1.13 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar.patched'
    1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/notarealjar.jar'
    1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydir.zip'
    1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar.patched'
    1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.signed.jar'
    1.17 ± 1.18 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corruptjar.go'
    1.18 ± 1.15 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar.patched'
    1.18 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar.patched'
    1.18 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class-executable'
    1.19 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar.patched'
    1.20 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt.jar'
    1.21 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar.patched'
    1.23 ± 1.44 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar.patched'
    1.24 ± 1.39 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar.patched'
    1.24 ± 1.22 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.signed.jar'
    1.24 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar.patched'
    1.25 ± 1.25 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.jar'
    1.25 ± 1.30 times faster than './log4jscanner /home/aktau/tmp/tmpdir/generate.sh'
    1.28 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld-executable'
    1.34 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar.jar'
    1.38 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt_jar_in_jar.jar'
    1.41 ± 1.35 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar'
    1.43 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_with_invalid_jar.jar'
    1.48 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar'
    1.52 ± 1.52 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.jar'
    1.52 ± 1.42 times faster than './log4jscanner /home/aktau/tmp/tmpdir/zipbombs'
    1.55 ± 1.50 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar_in_jar.jar'
    1.59 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.0-beta9.jar'
    1.62 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar'
    1.63 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/selenium-api-3.141.59.jar'
    1.76 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar'
    1.78 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/similarbutnotvuln.jar'
    1.86 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.2.jar'
    1.86 ± 1.63 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar'
    1.94 ± 2.04 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar'
    1.97 ± 1.73 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar'
    2.01 ± 1.72 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.16.0.jar'
    2.76 ± 2.28 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar'
    3.33 ± 2.80 times faster than './log4jscanner /home/aktau/tmp/tmpdir/shadow-6.1.0.jar'
    7.64 ± 7.06 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar'
    9.34 ± 7.40 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar'
   13.78 ± 10.56 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydirs.zip'
  326.03 ± 262.82 times faster than './log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar'

Which is somewhat curious. I'd have expected 400mb.jar to not be the fastest thing to run. But individually re-running it confirms it.

Most of these invocations take so little time that it's hard to show statistically significant improvements. So we'll just focus on the four slowest ones (next comment).

Otherwise, the code path at the end of checkFile (for nested jars) is not excercised. In production, we see a decent number of these.

Package archive/zip has a pool [1] for flate buffers. Calling `Close` on the result of `zip.(*File).Open` returns the buffer to the pool. We should do this as soon as we are able (when we no longer need the file), especially before recursing. I did not modify the `defer f.Close()` in the code paths above because: 1. The `return` follows closely after (no recursion) in their case. And: 2. They actually use the file for longer. [1]: https://cs.opensource.google/go/go/+/master:src/archive/zip/register.go;l=69;drc=1b09d430678d4a6f73b2443463d11f75851aba8a.

This io.ReadAll() call is responsible for a bit more than half the allocations observed in a very highly scaled up instance of this [1]. io.ReadAll does repeated allocations [1], starting with a 512 byte buffer. We know the size of the file, so allocate it all at once. A future commit should look at a further optimization: using a buffer pool. That wasn't done in this CL because it requires dealing with the recursive call. [1]: ``` (pprof) list checkJAR Total: 213.06GB ... . 9.89MB 334: f, err := zf.Open() . . 335: if err != nil { . . 336: return fmt.Errorf("open file %s: %v", p, err) . . 337: } . . 338: defer f.Close() . 169.96GB 339: data, err := io.ReadAll(f) . . 340: if err != nil { . . 341: return fmt.Errorf("read file %s: %v", p, err) . . 342: } . . 343: br := bytes.NewReader(data) . 3.40GB 344: r2, err := zip.NewReader(br, br.Size()) ... (pprof) list ReadAll Total: 213.06GB ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go 113.40GB 169.96GB (flat, cum) 79.77% of Total . . 638:func ReadAll(r Reader) ([]byte, error) { . . 639: b := make([]byte, 0, 512) . . 640: for { . . 641: if len(b) == cap(b) { . . 642: // Add more capacity (let append pick how much). 113.40GB 113.40GB 643: b = append(b, 0)[:len(b)] . . 644: } . 56.56GB 645: n, err := r.Read(b[len(b):cap(b)]) . . 646: b = b[:len(b)+n] . . 647: if err != nil { . . 648: if err == EOF { . . 649: err = nil . . 650: } ``` [1]: https://cs.opensource.google/go/go/+/master:src/io/io.go;l=638;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb

This commit is a further optimization of allocation behaviour (see parent). As before, this read is responsible for a bit more than half the bytes allocated observed in a very highly scaled up instance of this [1]. To sanity check this, I added some printf statements to the tests. It appears to work (we clearly see buffer reuse). ``` $ go test Need realloc desired=419430572 len=4096 cap=4096, extending with 419426476 Reallocated desired=419430572 len=419430572 cap=419438592 Buffer reused desired=12691 len=419430572 cap=419438592 Need realloc desired=10951 len=4096 cap=4096, extending with 6855 Reallocated desired=10951 len=10951 cap=12288 Need realloc desired=12691 len=4096 cap=4096, extending with 8595 Reallocated desired=12691 len=12691 cap=13568 Need realloc desired=17631 len=12691 cap=13568, extending with 4940 Reallocated desired=17631 len=17631 cap=21760 Buffer reused desired=16072 len=17631 cap=21760 Buffer reused desired=15537 len=16072 cap=21760 Need realloc desired=17631 len=10951 cap=12288, extending with 6680 Reallocated desired=17631 len=17631 cap=20480 Buffer reused desired=14121 len=17631 cap=20480 Buffer reused desired=16072 len=15537 cap=21760 Buffer reused desired=17631 len=16072 cap=21760 Buffer reused desired=48 len=17631 cap=21760 Buffer reused desired=16072 len=48 cap=21760 Buffer reused desired=48 len=16072 cap=21760 Buffer reused desired=12691 len=48 cap=21760 Buffer reused desired=48 len=12691 cap=21760 Need realloc desired=42374 len=48 cap=21760, extending with 42326 Reallocated desired=42374 len=42374 cap=49152 Buffer reused desired=15537 len=42374 cap=49152 Buffer reused desired=17631 len=14121 cap=20480 Buffer reused desired=440 len=17631 cap=20480 Buffer reused desired=440 len=15537 cap=49152 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Buffer reused desired=440 len=4096 cap=4096 Need realloc desired=17631 len=4096 cap=4096, extending with 13535 Reallocated desired=17631 len=17631 cap=18432 Buffer reused desired=15537 len=17631 cap=18432 Need realloc desired=17631 len=4096 cap=4096, extending with 13535 Reallocated desired=17631 len=17631 cap=18432 Buffer reused desired=17631 len=17631 cap=18432 Buffer reused desired=48 len=17631 cap=18432 Buffer reused desired=12691 len=48 cap=18432 Buffer reused desired=10951 len=12691 cap=18432 Buffer reused desired=12691 len=15537 cap=18432 Buffer reused desired=12691 len=12691 cap=18432 Buffer reused desired=48 len=12691 cap=18432 Need realloc desired=17631 len=4096 cap=4096, extending with 13535 Reallocated desired=17631 len=17631 cap=18432 Buffer reused desired=15537 len=17631 cap=18432 Need realloc desired=17631 len=4096 cap=4096, extending with 13535 Reallocated desired=17631 len=17631 cap=18432 Buffer reused desired=17631 len=15537 cap=18432 Buffer reused desired=48 len=17631 cap=18432 Buffer reused desired=12691 len=48 cap=18432 Buffer reused desired=10951 len=12691 cap=18432 Need realloc desired=12691 len=4096 cap=4096, extending with 8595 Reallocated desired=12691 len=12691 cap=13568 Buffer reused desired=12691 len=12691 cap=13568 Buffer reused desired=48 len=12691 cap=13568 Need realloc desired=16252 len=4096 cap=4096, extending with 12156 Reallocated desired=16252 len=16252 cap=16384 Buffer reused desired=14113 len=16252 cap=16384 Need realloc desired=16252 len=4096 cap=4096, extending with 12156 Reallocated desired=16252 len=16252 cap=16384 Buffer reused desired=16252 len=16252 cap=16384 Buffer reused desired=48 len=16252 cap=16384 Buffer reused desired=12691 len=48 cap=16384 Buffer reused desired=10951 len=12691 cap=16384 Buffer reused desired=12691 len=14113 cap=16384 Buffer reused desired=12691 len=12691 cap=16384 Buffer reused desired=48 len=12691 cap=16384 Need realloc desired=18688 len=48 cap=16384, extending with 18640 Reallocated desired=18688 len=18688 cap=21760 Buffer reused desired=17631 len=18688 cap=21760 PASS ok github.com/google/log4jscanner/jar 21.074s ``` Benchmarks using hyperfine(1) ``` $ for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do git checkout $commit go build hyperfine -i './log4jscanner jar/testdata' rm log4jscanner done ``` HEAD is now at 9d3acee Document commit that removed InitialContext Time (mean ± σ): 1.754 s ± 0.260 s [User: 1.711 s, System: 1.264 s] Range (min … max): 1.411 s … 2.182 s 10 runs HEAD is now at 2932093 jar: close the zip.File reader before recursing Time (mean ± σ): 1.761 s ± 0.312 s [User: 1.778 s, System: 1.124 s] Range (min … max): 1.271 s … 2.239 s 10 runs HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll Time (mean ± σ): 444.8 ms ± 95.3 ms [User: 353.5 ms, System: 122.3 ms] Range (min … max): 337.2 ms … 683.2 ms 10 runs HEAD is now at 5a28952 jar: reuse buffers for nested .jar's Time (mean ± σ): 398.0 ms ± 44.7 ms [User: 330.6 ms, System: 89.9 ms] Range (min … max): 339.7 ms … 471.1 ms 10 runs It looks like re-using buffers has only a small positive effect on speed, switching from `io.ReadAll` to `io.ReadFull` has the most impact. Still, I believe re-using allocations may really help in our production context production context. [1]: ``` (pprof) list checkJAR Total: 213.06GB ... . 9.89MB 334: f, err := zf.Open() . . 335: if err != nil { . . 336: return fmt.Errorf("open file %s: %v", p, err) . . 337: } . . 338: defer f.Close() . 169.96GB 339: data, err := io.ReadAll(f) . . 340: if err != nil { . . 341: return fmt.Errorf("read file %s: %v", p, err) . . 342: } . . 343: br := bytes.NewReader(data) . 3.40GB 344: r2, err := zip.NewReader(br, br.Size()) ... (pprof) list ReadAll Total: 213.06GB ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go 113.40GB 169.96GB (flat, cum) 79.77% of Total . . 638:func ReadAll(r Reader) ([]byte, error) { . . 639: b := make([]byte, 0, 512) . . 640: for { . . 641: if len(b) == cap(b) { . . 642: // Add more capacity (let append pick how much). 113.40GB 113.40GB 643: b = append(b, 0)[:len(b)] . . 644: } . 56.56GB 645: n, err := r.Read(b[len(b):cap(b)]) . . 646: b = b[:len(b)+n] . . 647: if err != nil { . . 648: if err == EOF { . . 649: err = nil . . 650: } ``` jar: use make instead of append

aktau · 2022-02-28T22:01:56Z

(Actually, let's just focus on the slowest one. I've verified that the others don't change in speed over the 4 last commits.)

The results for 400mb_jar_in_jar, which is the only case where runtime is noticeably changed:

Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):      1.484 s ±  0.326 s    [User: 1.448 s, System: 0.999 s]
  Range (min … max):    1.119 s …  2.087 s    10 runs
 
Previous HEAD position was 9d3acee Document commit that removed InitialContext
HEAD is now at 2932093 jar: close the zip.File reader before recursing
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):      1.542 s ±  0.174 s    [User: 1.387 s, System: 1.076 s]
  Range (min … max):    1.245 s …  1.739 s    10 runs
 
Previous HEAD position was 2932093 jar: close the zip.File reader before recursing
HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):     441.6 ms ±  85.7 ms    [User: 355.1 ms, System: 112.7 ms]
  Range (min … max):   318.3 ms … 578.1 ms    10 runs
 
Previous HEAD position was f6c1ebc jar: prefer io.ReadFull over io.ReadAll
HEAD is now at 61ade2d jar: reuse buffers for nested .jar's
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):     340.7 ms ±  57.9 ms    [User: 269.7 ms, System: 78.8 ms]
  Range (min … max):   298.8 ms … 481.4 ms    10 runs

Si I added a benchmark to test 400mb_jar_in_jar specifically, and the results now do show an improvement (first versus last commit):

benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 37376ef3d7fcb898fd5e192d12a2afe4718ffc33.bench
name                                  old time/op    new time/op    delta
Parse/400mb_jar_in_jar.jar-4             1.26s ±27%     0.27s ±10%  -78.51%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       38.0µs ± 9%    43.4µs ±40%  +14.47%  (p=0.006 n=9+9)
ParseParallel/400mb_jar_in_jar.jar-4     1.40s ±26%     0.20s ±32%  -85.39%  (p=0.000 n=10+9)
ParseParallel/safe1.jar-4               33.0µs ± 8%    61.4µs ±11%  +86.03%  (p=0.000 n=9+9)

name                                  old alloc/op   new alloc/op   delta
Parse/400mb_jar_in_jar.jar-4            2.35GB ± 0%   0.05GB ±108%  -97.86%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.810 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4    2.35GB ± 0%   0.02GB ±336%  -99.18%  (p=0.000 n=10+8)
ParseParallel/safe1.jar-4               4.93kB ± 0%    4.93kB ± 0%   -0.08%  (p=0.032 n=9+9)

name                                  old allocs/op  new allocs/op  delta
Parse/400mb_jar_in_jar.jar-4               102 ±11%        28 ± 8%  -72.43%  (p=0.000 n=10+10)
Parse/safe1.jar-4                         33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel/400mb_jar_in_jar.jar-4       180 ± 3%        39 ± 9%  -78.53%  (p=0.000 n=10+8)
ParseParallel/safe1.jar-4                 33.0 ± 0%      33.0 ± 0%     ~     (all equal)

And versus just the io.ReadAll -> io.ReadFull CL:

benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 4b23cd3db5400b191c5310c07415848bc0a2bb6d.bench
name                                  old time/op    new time/op    delta
Parse/400mb_jar_in_jar.jar-4             1.26s ±27%     0.30s ±17%  -76.18%  (p=0.000 n=10+9)
Parse/safe1.jar-4                       38.0µs ± 9%    40.3µs ± 8%   +6.15%  (p=0.028 n=9+10)
ParseParallel/400mb_jar_in_jar.jar-4     1.40s ±26%     0.35s ±45%  -74.81%  (p=0.000 n=10+9)
ParseParallel/safe1.jar-4               33.0µs ± 8%    32.2µs ±15%     ~     (p=0.243 n=9+10)

name                                  old alloc/op   new alloc/op   delta
Parse/400mb_jar_in_jar.jar-4            2.35GB ± 0%    0.42GB ± 0%  -82.14%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.463 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4    2.35GB ± 0%    0.42GB ± 0%  -82.14%  (p=0.000 n=10+10)
ParseParallel/safe1.jar-4               4.93kB ± 0%    4.93kB ± 0%     ~     (p=0.159 n=9+10)

name                                  old allocs/op  new allocs/op  delta
Parse/400mb_jar_in_jar.jar-4               102 ±11%        30 ± 0%  -70.67%  (p=0.000 n=10+6)
Parse/safe1.jar-4                         33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel/400mb_jar_in_jar.jar-4       180 ± 3%        52 ±17%  -71.10%  (p=0.000 n=10+10)
ParseParallel/safe1.jar-4                 33.0 ± 0%      33.0 ± 0%     ~     (all equal)

Indicating that the regression in speed of the safe1.jar benchmark may be due to the garbage collector dealing with allocations from the other benchmarks [1].

[1]: Notice how some runs after the sync.Pool introduction allocate ~nothing:

BenchmarkParse/400mb_jar_in_jar.jar-4                  4         299414355 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  5         249801048 ns/op           11801 B/op         26 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         269324904 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         262044230 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         259753494 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         277134852 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         279076490 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  5         263421624 ns/op        83908649 B/op         28 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         264917986 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         286430956 ns/op        104883246 B/op        30 allocs/op

The pool is not cleared between benchmark runs.

jar/jar.go

This uses a sync.Pool wrapper (called pool.Dynamic) that prevents the pool from holding on to very large buffers indefinitely, while still amortizing the cost of allocation. The policy appears to give good results both with the pre-existing tests and the specific tests added for the pool. This is useful because the library is also used from long-running server contexts, where it would be unfortunate to pin very large buffers for too long. See golang/go#23199. Example algorithm run (from the test): ``` num allocs value target capacity 1 1 100000 100000.000000 100000 2 1 1 52048.000000 100000 3 1 1 28072.000000 100000 4 1 1 16084.000000 100000 5 1 1 10090.000000 100000 6 1 1 7093.000000 100000 7 2 10 5594.500000 4096 8 2 1 4845.250000 4096 9 2 1 4470.625000 4096 10 2 1 4283.312500 4096 11 2 1 4189.656250 4096 12 2 1 4142.828125 4096 13 2 1 4119.414062 4096 14 2 1 4107.707031 4096 15 2 12 4101.853516 4096 16 2 1 4098.926758 4096 17 2 1 4097.463379 4096 18 2 1 4096.731689 4096 19 2 1 4096.365845 4096 20 2 1 4096.182922 4096 21 2 1 4096.091461 4096 22 2 1 4096.045731 4096 23 2 1000 4096.022865 4096 24 2 100 4096.011433 4096 25 3 10000 10000.000000 10000 26 4 100000 100000.000000 100000 27 4 1 52048.000000 100000 28 4 100000 100000.000000 100000 29 4 1 52048.000000 100000 30 4 50000 51024.000000 100000 31 4 1 27560.000000 100000 32 4 1 15828.000000 100000 33 4 25000 25000.000000 100000 34 4 1 14548.000000 100000 35 4 1 9322.000000 100000 36 5 1 6709.000000 4096 37 6 100000 100000.000000 100000 38 6 1 52048.000000 100000 39 6 1 28072.000000 100000 40 6 1 16084.000000 100000 41 6 1 10090.000000 100000 42 6 1 7093.000000 100000 43 7 1 5594.500000 4096 44 7 1 4845.250000 4096 45 7 1 4470.625000 4096 46 7 1 4283.312500 4096 47 7 100 4189.656250 4096 48 7 100 4142.828125 4096 49 7 100 4119.414062 4096 50 7 1 4107.707031 4096 51 7 1 4101.853516 4096 52 7 1 4098.926758 4096 53 7 1 4097.463379 4096 54 7 1 4096.731689 4096 55 7 100 4096.365845 4096 56 7 200 4096.182922 4096 57 7 300 4096.091461 4096 58 7 100 4096.045731 4096 59 7 50 4096.022865 4096 60 7 50 4096.011433 4096 61 7 50 4096.005716 4096 62 7 50 4096.002858 4096 63 7 50 4096.001429 4096 64 7 1 4096.000715 4096 65 7 1 4096.000357 4096 66 7 1 4096.000179 4096 67 7 1 4096.000089 4096 68 8 100000000 100000000.000000 100000000 69 8 1000000 50500000.000000 100000000 70 8 100000 25300000.000000 100000000 71 8 10000 12655000.000000 100000000 72 8 1000 6329548.000000 100000000 73 9 100 3166822.000000 4096 74 9 10 1585459.000000 4096 75 9 1 794777.500000 4096 76 9 1 399436.750000 4096 77 9 500 201766.375000 4096 78 9 2020 102931.187500 4096 79 9 400 53513.593750 4096 80 9 3984 28804.796875 4096 81 9 5 16450.398438 4096 82 9 200 10273.199219 4096 83 9 500 7184.599609 4096 84 10 40000 40000.000000 40000 85 10 35000 37500.000000 40000 86 11 45000 45000.000000 45000 87 11 42000 43500.000000 45000 88 11 38000 40750.000000 45000 89 11 38000 39375.000000 45000 90 11 39000 39187.500000 45000 91 11 41000 41000.000000 45000 92 11 42000 42000.000000 45000 93 11 42000 42000.000000 45000 94 11 2000 23048.000000 45000 95 11 4000 13572.000000 45000 96 11 3949 8834.000000 45000 97 11 2011 6465.000000 45000 98 11 4096 5280.500000 45000 99 11 33 4688.250000 45000 100 11 0 4392.125000 45000 101 12 4938 4938.000000 4938 102 12 1 4517.000000 4938 103 12 1 4306.500000 4938 104 12 1200 4201.250000 4938 105 12 2400 4148.625000 4938 106 12 1200 4122.312500 4938 107 12 200 4109.156250 4938 108 12 400 4102.578125 4938 109 12 600 4099.289062 4938 110 12 700 4097.644531 4938 111 12 100 4096.822266 4938 112 12 400 4096.411133 4938 113 12 500 4096.205566 4938 114 12 700 4096.102783 4938 115 12 600 4096.051392 4938 116 12 900 4096.025696 4938 117 12 1000 4096.012848 4938 118 12 1100 4096.006424 4938 119 12 1200 4096.003212 4938 120 12 1000 4096.001606 4938 ``` Benchmarks also show that the pool does retain the buffer, as performance is not worsened over the previous commit: ``` $ git checkout main TMPDIR="$HOME/tmp/tmpdir" mkdir "$TMPDIR" || true for file in jar/testdata/* ; do RTMPDIR="$TMPDIR/$(basename $file)" mkdir "$RTMPDIR" || true ln -fv "$PWD/$file" "$RTMPDIR" done for commit in $(git log --pretty=oneline | head -5 | awk '{print $1}' | tac) ; do git checkout $commit go build hyperfine --ignore-failure --warmup 1 "./log4jscanner $TMPDIR/400mb_jar_in_jar.jar" rm log4jscanner done HEAD is now at 48d70bf jar: add benchmarks with 400mb_jar_in_jar.jar Time (mean ± σ): 2.026 s ± 0.324 s [User: 2.363 s, System: 1.269 s] Range (min … max): 1.651 s … 2.749 s 10 runs HEAD is now at bf524fa jar: close the zip.File reader before recursing Time (mean ± σ): 1.908 s ± 0.297 s [User: 2.084 s, System: 1.218 s] Range (min … max): 1.502 s … 2.567 s 10 runs HEAD is now at 4b23cd3 jar: prefer io.ReadFull over io.ReadAll Time (mean ± σ): 445.9 ms ± 51.2 ms [User: 401.7 ms, System: 79.9 ms] Range (min … max): 386.3 ms … 566.1 ms 10 runs HEAD is now at 37376ef jar: reuse buffers for nested .jar's Time (mean ± σ): 464.5 ms ± 41.8 ms [User: 420.5 ms, System: 93.7 ms] Range (min … max): 409.2 ms … 545.5 ms 10 runs HEAD is now at c17a81b jar: do not keep large buffers unnecessarily Time (mean ± σ): 436.1 ms ± 26.2 ms [User: 409.5 ms, System: 77.6 ms] Range (min … max): 390.2 ms … 472.7 ms 10 runs ```

ericchiang · 2022-03-18T17:39:51Z

Okay, taking a step back...

A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term. Particularly since our internal response has wound down, and we have fewer cycles to work on this project :) We already hit performance oddities with this parser in the past, and I don't want to have to hunt down weird behavior or bugs.

What's the motivation here for this performance work? Are you hitting a bottle neck with some use case? Or is this strictly, just trying to improve the performance for its own sake? If it's the latter, I'm inclined to leave this as is.

see above comment

aktau · 2022-03-18T18:51:01Z

A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term.

Removing comments, it's a tiny amount of LoC. At first I had it as a simple if-check on the "put" operation. This results in the same thing but I thought it looked more ergonomic with Get/Put. Most of the code is in the test.

Particularly since our internal response has wound down, and we have fewer cycles to work on this project :)

True, but this could well improve throughput for other users, too. For us internally, it was a bottleneck. See below.

We already hit performance oddities with this parser in the past, ...

Yes, performance seemed noticeably worse after the switchover to this version, though I wasn't able to pin down immediately why even when comparing profiles.

...and I don't want to have to hunt down weird behavior or bugs.

All new code carries the risk of bugs, that's true. If there's another test you'd like that provides stronger guarantees (also for future contributions), I could see whether I could do that.

Otherwise, I'll just leave this code here for others to potentially pull in.

What's the motivation here for this performance work? Are you hitting a bottle neck with some use case? Or is this strictly, just trying to improve the performance for its own sake? If it's the latter, I'm inclined to leave this as is.

In the last run we did internally, this was a noticeable bottleneck. This codepath was responsible for >50% of all allocations. Allocations by themselves are very costly in Go, but the more processors (GOMAXPROCS) given to a Go program, the more of a bottleneck this becomes as there are some global datastructures involved. For us, this was very large.

You're right that the response has wound down. But I assume others can benefit from this all the same.

ericchiang · 2022-03-18T18:57:44Z

Okay thanks for the reply! This lgtm then

ericchiang

please squash your commits and I'll merge

ericchiang · 2022-03-18T18:59:32Z

Actually, I see that the commits make sense on their own. Will merge as is

aktau force-pushed the main branch 2 times, most recently from b85d917 to 61ade2d Compare February 25, 2022 13:28

aktau added 4 commits February 28, 2022 22:46

jar: add benchmarks with 400mb_jar_in_jar.jar

48d70bf

Otherwise, the code path at the end of checkFile (for nested jars) is not excercised. In production, we see a decent number of these.

aktau force-pushed the main branch from 61ade2d to 37376ef Compare February 28, 2022 22:02

ericchiang previously approved these changes Mar 1, 2022

View reviewed changes

jar/jar.go Outdated Show resolved Hide resolved

jar/jar.go Outdated Show resolved Hide resolved

ericchiang approved these changes Mar 18, 2022

View reviewed changes

ericchiang merged commit fb3bde4 into google:main Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jar: reduce redundant allocations and copies (2x-3x speedup) #57

jar: reduce redundant allocations and copies (2x-3x speedup) #57

aktau commented Feb 23, 2022 •

edited

Loading

google-cla bot commented Feb 23, 2022

aktau commented Feb 23, 2022 •

edited

Loading

ericchiang commented Feb 28, 2022

aktau commented Feb 28, 2022

aktau commented Feb 28, 2022

ericchiang commented Mar 18, 2022

aktau commented Mar 18, 2022

ericchiang commented Mar 18, 2022

ericchiang left a comment

ericchiang commented Mar 18, 2022

jar: reduce redundant allocations and copies (2x-3x speedup) #57

jar: reduce redundant allocations and copies (2x-3x speedup) #57

Conversation

aktau commented Feb 23, 2022 • edited Loading

google-cla bot commented Feb 23, 2022

aktau commented Feb 23, 2022 • edited Loading

ericchiang commented Feb 28, 2022

aktau commented Feb 28, 2022

aktau commented Feb 28, 2022

ericchiang commented Mar 18, 2022

aktau commented Mar 18, 2022

ericchiang commented Mar 18, 2022

ericchiang left a comment

Choose a reason for hiding this comment

ericchiang commented Mar 18, 2022

aktau commented Feb 23, 2022 •

edited

Loading

aktau commented Feb 23, 2022 •

edited

Loading