This repository has been archived by the owner on Jun 30, 2023. It is now read-only.

jar: reduce redundant allocations and copies (2x-3x speedup) #57

Merged
merged 5 commits on Mar 18, 2022

Conversation

aktau
Contributor

@aktau aktau commented Feb 23, 2022

The io.ReadAll() call is responsible for a bit more than half the allocations observed in a very highly scaled up instance of this [1]. On the local testdata, these commits had quite the positive effect. I ran go test inside the jar/ subdirectory too, and it passed.

Benchmarks:

$ for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
    git checkout $commit
    go build
    hyperfine -i './log4jscanner jar/testdata'
    rm log4jscanner
  done

HEAD is now at 9d3acee Document commit that removed InitialContext
  Time (mean ± σ):      1.754 s ±  0.260 s    [User: 1.711 s, System: 1.264 s]
  Range (min … max):    1.411 s …  2.182 s    10 runs

HEAD is now at 2932093 jar: close the zip.File reader before recursing
  Time (mean ± σ):      1.761 s ±  0.312 s    [User: 1.778 s, System: 1.124 s]
  Range (min … max):    1.271 s …  2.239 s    10 runs

HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll
  Time (mean ± σ):     444.8 ms ±  95.3 ms    [User: 353.5 ms, System: 122.3 ms]
  Range (min … max):   337.2 ms … 683.2 ms    10 runs

HEAD is now at 5a28952 jar: reuse buffers for nested .jar's
  Time (mean ± σ):     398.0 ms ±  44.7 ms    [User: 330.6 ms, System: 89.9 ms]
  Range (min … max):   339.7 ms … 471.1 ms    10 runs

[1]:

(pprof) list checkJAR
Total: 213.06GB
         ...
         .     9.89MB    334:           f, err := zf.Open()
         .          .    335:           if err != nil {
         .          .    336:                   return fmt.Errorf("open file %s: %v", p, err)
         .          .    337:           }
         .          .    338:           defer f.Close()
         .   169.96GB    339:           data, err := io.ReadAll(f)
         .          .    340:           if err != nil {
         .          .    341:                   return fmt.Errorf("read file %s: %v", p, err)
         .          .    342:           }
         .          .    343:           br := bytes.NewReader(data)
         .     3.40GB    344:           r2, err := zip.NewReader(br, br.Size())
         ...

(pprof) list ReadAll
Total: 213.06GB
ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go
  113.40GB   169.96GB (flat, cum) 79.77% of Total
         .          .    638:func ReadAll(r Reader) ([]byte, error) {
         .          .    639:   b := make([]byte, 0, 512)
         .          .    640:   for {
         .          .    641:           if len(b) == cap(b) {
         .          .    642:                   // Add more capacity (let append pick how much).
  113.40GB   113.40GB    643:                   b = append(b, 0)[:len(b)]
         .          .    644:           }
         .    56.56GB    645:           n, err := r.Read(b[len(b):cap(b)])
         .          .    646:           b = b[:len(b)+n]
         .          .    647:           if err != nil {
         .          .    648:                   if err == EOF {
         .          .    649:                           err = nil
         .          .    650:                   }

@google-cla

google-cla bot commented Feb 23, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

@aktau
Contributor Author

aktau commented Feb 23, 2022

I'm not sure why this wants me to sign a CLA. This is from aktau@google.com.

UPDATE: I added my corp email to my GH account now. I wonder how I can make this check run again.

@aktau aktau force-pushed the main branch 2 times, most recently from b85d917 to 61ade2d on February 25, 2022
@ericchiang
Member

Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks

$ git checkout upstream/main
$ go test -run=NONE -bench=. ./... >> old.txt
$ git checkout aktau/main
$ go test -run=NONE -bench=. ./... > new.txt
$ benchstat old.txt new.txt 
name             old time/op    new time/op    delta
Parse-8            16.7µs ± 0%    16.8µs ± 0%   ~     (p=1.000 n=1+1)
ParseParallel-8    4.97µs ± 0%    4.97µs ± 0%   ~     (p=1.000 n=1+1)

name             old alloc/op   new alloc/op   delta
Parse-8            4.84kB ± 0%    4.84kB ± 0%   ~     (p=1.000 n=1+1)
ParseParallel-8    4.92kB ± 0%    4.93kB ± 0%   ~     (p=1.000 n=1+1)

name             old allocs/op  new allocs/op  delta
Parse-8              31.0 ± 0%      31.0 ± 0%   ~     (all equal)
ParseParallel-8      31.0 ± 0%      31.0 ± 0%   ~     (all equal)

@aktau
Contributor Author

aktau commented Feb 28, 2022

Maybe I'm holding it wrong, but I'm not seeing the improvement on our benchmarks

Indeed. Even with 10 runs per test we get this:

$ git checkout main ; for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do git checkout $commit && ( cd ./jar && go test -run=NONE -bench=. -count=10 | tee ~/tmp/$commit.bench ) ; done
$ benchstat 9d3aceeff48dc332d314d938ef6f3ed7837dfc1e.bench 61ade2d03f307bde6b98ed513f68f01afb9421c9.bench
name             old time/op    new time/op    delta
Parse-4            38.5µs ± 9%    39.5µs ± 9%     ~     (p=0.408 n=7+9)
ParseParallel-4    36.4µs ±23%    31.6µs ±12%  -13.42%  (p=0.035 n=10+9)

name             old alloc/op   new alloc/op   delta
Parse-4            4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.671 n=8+10)
ParseParallel-4    4.94kB ± 0%    4.93kB ± 0%     ~     (p=0.381 n=10+10)

name             old allocs/op  new allocs/op  delta
Parse-4              33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel-4      33.0 ± 0%      33.0 ± 0%     ~     (all equal)

However, since we do see consistent runtime improvements on `./log4jscanner jar/testdata`, the benchmarks may not be representative of the data inside testdata, nor of the data we see in production.

Indeed, given the codepath I'm changing, I'd expect only nested jars to benefit (and it seems like this is fairly common in production).

To verify, I ran all the jars in testdata individually:

$ git checkout main
TMPDIR="$HOME/tmp/tmpdir"
mkdir "$TMPDIR" || true
for file in jar/testdata/* ; do
  RTMPDIR="$TMPDIR/$(basename $file)"
  mkdir "$RTMPDIR" || true
  ln -fv "$PWD/$file" "$RTMPDIR"
done
for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
  git checkout $commit
  go build
  dirs="$(echo $TMPDIR/* | tr ' ' ',')"
  hyperfine --ignore-failure --warmup 1 --parameter-list dir "$dirs" './log4jscanner {dir}'
  rm log4jscanner
done
 ...
Summary
  './log4jscanner /home/aktau/tmp/tmpdir/400mb.jar' ran
    1.12 ± 1.12 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar.patched'
    1.13 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar.patched'
    1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/notarealjar.jar'
    1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydir.zip'
    1.15 ± 1.17 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar.patched'
    1.15 ± 1.16 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.signed.jar'
    1.17 ± 1.18 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corruptjar.go'
    1.18 ± 1.15 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar.patched'
    1.18 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar.patched'
    1.18 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class-executable'
    1.19 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar.patched'
    1.20 ± 1.21 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt.jar'
    1.21 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar.patched'
    1.23 ± 1.44 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar.patched'
    1.24 ± 1.39 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar.patched'
    1.24 ± 1.22 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.signed.jar'
    1.24 ± 1.26 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar.patched'
    1.25 ± 1.25 times faster than './log4jscanner /home/aktau/tmp/tmpdir/safe1.jar'
    1.25 ± 1.30 times faster than './log4jscanner /home/aktau/tmp/tmpdir/generate.sh'
    1.28 ± 1.24 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld-executable'
    1.34 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar.jar'
    1.38 ± 1.31 times faster than './log4jscanner /home/aktau/tmp/tmpdir/corrupt_jar_in_jar.jar'
    1.41 ± 1.35 times faster than './log4jscanner /home/aktau/tmp/tmpdir/vuln-class.jar'
    1.43 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_with_invalid_jar.jar'
    1.48 ± 1.38 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_with_invalid_jar.jar'
    1.52 ± 1.52 times faster than './log4jscanner /home/aktau/tmp/tmpdir/helloworld.jar'
    1.52 ± 1.42 times faster than './log4jscanner /home/aktau/tmp/tmpdir/zipbombs'
    1.55 ± 1.50 times faster than './log4jscanner /home/aktau/tmp/tmpdir/good_jar_in_jar_in_jar.jar'
    1.59 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.0-beta9.jar'
    1.62 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar.jar'
    1.63 ± 1.49 times faster than './log4jscanner /home/aktau/tmp/tmpdir/selenium-api-3.141.59.jar'
    1.76 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.1.jar'
    1.78 ± 1.55 times faster than './log4jscanner /home/aktau/tmp/tmpdir/similarbutnotvuln.jar'
    1.86 ± 1.61 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.2.jar'
    1.86 ± 1.63 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.12.1.jar'
    1.94 ± 2.04 times faster than './log4jscanner /home/aktau/tmp/tmpdir/bad_jar_in_jar_in_jar.jar'
    1.97 ± 1.73 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.14.0.jar'
    2.01 ± 1.72 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.16.0.jar'
    2.76 ± 2.28 times faster than './log4jscanner /home/aktau/tmp/tmpdir/log4j-core-2.15.0.jar'
    3.33 ± 2.80 times faster than './log4jscanner /home/aktau/tmp/tmpdir/shadow-6.1.0.jar'
    7.64 ± 7.06 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.jar'
    9.34 ± 7.40 times faster than './log4jscanner /home/aktau/tmp/tmpdir/arara.signed.jar'
   13.78 ± 10.56 times faster than './log4jscanner /home/aktau/tmp/tmpdir/emptydirs.zip'
  326.03 ± 262.82 times faster than './log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar'

Which is somewhat curious: I'd not have expected 400mb.jar to be the fastest case to run. But re-running it individually confirms the result.

Most of these invocations take so little time that it's hard to show statistically significant improvements. So we'll just focus on the four slowest ones (next comment).

Otherwise, the code path at the end of checkFile (for nested jars) is
not exercised. In production, we see a decent number of these.
Package archive/zip has a pool [1] for flate buffers. Calling `Close` on
the result of `zip.(*File).Open` returns the buffer to the pool. We
should do this as soon as we are able (when we no longer need the file),
especially before recursing.

I did not modify the `defer f.Close()` in the code paths above because:

 1. The `return` follows closely after (no recursion) in their case.
    And:
 2. They actually use the file for longer.

[1]: https://cs.opensource.google/go/go/+/master:src/archive/zip/register.go;l=69;drc=1b09d430678d4a6f73b2443463d11f75851aba8a.
This io.ReadAll() call is responsible for a bit more than half the
allocations observed in a very highly scaled up instance of this [1].

io.ReadAll does repeated allocations [1], starting with a 512 byte
buffer. We know the size of the file, so allocate it all at once.

A future commit should look at a further optimization: using a buffer
pool. That wasn't done in this CL because it requires dealing with the
recursive call.

[1]:

```
(pprof) list checkJAR
Total: 213.06GB
         ...
         .     9.89MB    334:           f, err := zf.Open()
         .          .    335:           if err != nil {
         .          .    336:                   return fmt.Errorf("open file %s: %v", p, err)
         .          .    337:           }
         .          .    338:           defer f.Close()
         .   169.96GB    339:           data, err := io.ReadAll(f)
         .          .    340:           if err != nil {
         .          .    341:                   return fmt.Errorf("read file %s: %v", p, err)
         .          .    342:           }
         .          .    343:           br := bytes.NewReader(data)
         .     3.40GB    344:           r2, err := zip.NewReader(br, br.Size())
         ...

(pprof) list ReadAll
Total: 213.06GB
ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go
  113.40GB   169.96GB (flat, cum) 79.77% of Total
         .          .    638:func ReadAll(r Reader) ([]byte, error) {
         .          .    639:   b := make([]byte, 0, 512)
         .          .    640:   for {
         .          .    641:           if len(b) == cap(b) {
         .          .    642:                   // Add more capacity (let append pick how much).
  113.40GB   113.40GB    643:                   b = append(b, 0)[:len(b)]
         .          .    644:           }
         .    56.56GB    645:           n, err := r.Read(b[len(b):cap(b)])
         .          .    646:           b = b[:len(b)+n]
         .          .    647:           if err != nil {
         .          .    648:                   if err == EOF {
         .          .    649:                           err = nil
         .          .    650:                   }
```

[1]: https://cs.opensource.google/go/go/+/master:src/io/io.go;l=638;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb
This commit is a further optimization of allocation behaviour (see
parent).

As before, this read is responsible for a bit more than half of the bytes
allocated, as observed in a very highly scaled up instance of this [1].

To sanity check this, I added some printf statements to the tests. It
appears to work (we clearly see buffer reuse).

```
$ go test
Need realloc    desired=419430572 len=4096 cap=4096, extending with 419426476
Reallocated     desired=419430572 len=419430572 cap=419438592
Buffer reused   desired=12691 len=419430572 cap=419438592
Need realloc    desired=10951 len=4096 cap=4096, extending with 6855
Reallocated     desired=10951 len=10951 cap=12288
Need realloc    desired=12691 len=4096 cap=4096, extending with 8595
Reallocated     desired=12691 len=12691 cap=13568
Need realloc    desired=17631 len=12691 cap=13568, extending with 4940
Reallocated     desired=17631 len=17631 cap=21760
Buffer reused   desired=16072 len=17631 cap=21760
Buffer reused   desired=15537 len=16072 cap=21760
Need realloc    desired=17631 len=10951 cap=12288, extending with 6680
Reallocated     desired=17631 len=17631 cap=20480
Buffer reused   desired=14121 len=17631 cap=20480
Buffer reused   desired=16072 len=15537 cap=21760
Buffer reused   desired=17631 len=16072 cap=21760
Buffer reused   desired=48 len=17631 cap=21760
Buffer reused   desired=16072 len=48 cap=21760
Buffer reused   desired=48 len=16072 cap=21760
Buffer reused   desired=12691 len=48 cap=21760
Buffer reused   desired=48 len=12691 cap=21760
Need realloc    desired=42374 len=48 cap=21760, extending with 42326
Reallocated     desired=42374 len=42374 cap=49152
Buffer reused   desired=15537 len=42374 cap=49152
Buffer reused   desired=17631 len=14121 cap=20480
Buffer reused   desired=440 len=17631 cap=20480
Buffer reused   desired=440 len=15537 cap=49152
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Buffer reused   desired=440 len=4096 cap=4096
Need realloc    desired=17631 len=4096 cap=4096, extending with 13535
Reallocated     desired=17631 len=17631 cap=18432
Buffer reused   desired=15537 len=17631 cap=18432
Need realloc    desired=17631 len=4096 cap=4096, extending with 13535
Reallocated     desired=17631 len=17631 cap=18432
Buffer reused   desired=17631 len=17631 cap=18432
Buffer reused   desired=48 len=17631 cap=18432
Buffer reused   desired=12691 len=48 cap=18432
Buffer reused   desired=10951 len=12691 cap=18432
Buffer reused   desired=12691 len=15537 cap=18432
Buffer reused   desired=12691 len=12691 cap=18432
Buffer reused   desired=48 len=12691 cap=18432
Need realloc    desired=17631 len=4096 cap=4096, extending with 13535
Reallocated     desired=17631 len=17631 cap=18432
Buffer reused   desired=15537 len=17631 cap=18432
Need realloc    desired=17631 len=4096 cap=4096, extending with 13535
Reallocated     desired=17631 len=17631 cap=18432
Buffer reused   desired=17631 len=15537 cap=18432
Buffer reused   desired=48 len=17631 cap=18432
Buffer reused   desired=12691 len=48 cap=18432
Buffer reused   desired=10951 len=12691 cap=18432
Need realloc    desired=12691 len=4096 cap=4096, extending with 8595
Reallocated     desired=12691 len=12691 cap=13568
Buffer reused   desired=12691 len=12691 cap=13568
Buffer reused   desired=48 len=12691 cap=13568
Need realloc    desired=16252 len=4096 cap=4096, extending with 12156
Reallocated     desired=16252 len=16252 cap=16384
Buffer reused   desired=14113 len=16252 cap=16384
Need realloc    desired=16252 len=4096 cap=4096, extending with 12156
Reallocated     desired=16252 len=16252 cap=16384
Buffer reused   desired=16252 len=16252 cap=16384
Buffer reused   desired=48 len=16252 cap=16384
Buffer reused   desired=12691 len=48 cap=16384
Buffer reused   desired=10951 len=12691 cap=16384
Buffer reused   desired=12691 len=14113 cap=16384
Buffer reused   desired=12691 len=12691 cap=16384
Buffer reused   desired=48 len=12691 cap=16384
Need realloc    desired=18688 len=48 cap=16384, extending with 18640
Reallocated     desired=18688 len=18688 cap=21760
Buffer reused   desired=17631 len=18688 cap=21760
PASS
ok      github.com/google/log4jscanner/jar      21.074s
```

Benchmarks using hyperfine(1)

```
$ for commit in $(git log --pretty=oneline | head -4 | awk '{print $1}' | tac) ; do
    git checkout $commit
    go build
    hyperfine -i './log4jscanner jar/testdata'
    rm log4jscanner
  done
```

HEAD is now at 9d3acee Document commit that removed InitialContext
  Time (mean ± σ):      1.754 s ±  0.260 s    [User: 1.711 s, System: 1.264 s]
  Range (min … max):    1.411 s …  2.182 s    10 runs

HEAD is now at 2932093 jar: close the zip.File reader before recursing
  Time (mean ± σ):      1.761 s ±  0.312 s    [User: 1.778 s, System: 1.124 s]
  Range (min … max):    1.271 s …  2.239 s    10 runs

HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll
  Time (mean ± σ):     444.8 ms ±  95.3 ms    [User: 353.5 ms, System: 122.3 ms]
  Range (min … max):   337.2 ms … 683.2 ms    10 runs

HEAD is now at 5a28952 jar: reuse buffers for nested .jar's
  Time (mean ± σ):     398.0 ms ±  44.7 ms    [User: 330.6 ms, System: 89.9 ms]
  Range (min … max):   339.7 ms … 471.1 ms    10 runs

It looks like re-using buffers has only a small positive effect on
speed; switching from `io.ReadAll` to `io.ReadFull` has the most impact.
Still, I believe re-using allocations may really help in our production
context.

[1]:

```
(pprof) list checkJAR
Total: 213.06GB
         ...
         .     9.89MB    334:           f, err := zf.Open()
         .          .    335:           if err != nil {
         .          .    336:                   return fmt.Errorf("open file %s: %v", p, err)
         .          .    337:           }
         .          .    338:           defer f.Close()
         .   169.96GB    339:           data, err := io.ReadAll(f)
         .          .    340:           if err != nil {
         .          .    341:                   return fmt.Errorf("read file %s: %v", p, err)
         .          .    342:           }
         .          .    343:           br := bytes.NewReader(data)
         .     3.40GB    344:           r2, err := zip.NewReader(br, br.Size())
         ...

(pprof) list ReadAll
Total: 213.06GB
ROUTINE ======================== io.ReadAll in third_party/go/gc/src/io/io.go
  113.40GB   169.96GB (flat, cum) 79.77% of Total
         .          .    638:func ReadAll(r Reader) ([]byte, error) {
         .          .    639:   b := make([]byte, 0, 512)
         .          .    640:   for {
         .          .    641:           if len(b) == cap(b) {
         .          .    642:                   // Add more capacity (let append pick how much).
  113.40GB   113.40GB    643:                   b = append(b, 0)[:len(b)]
         .          .    644:           }
         .    56.56GB    645:           n, err := r.Read(b[len(b):cap(b)])
         .          .    646:           b = b[:len(b)+n]
         .          .    647:           if err != nil {
         .          .    648:                   if err == EOF {
         .          .    649:                           err = nil
         .          .    650:                   }
```

jar: use make instead of append
@aktau
Contributor Author

aktau commented Feb 28, 2022

(Actually, let's just focus on the slowest one. I've verified that the others don't change in speed over the 4 last commits.)

The results for 400mb_jar_in_jar, which is the only case where runtime is noticeably changed:

Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):      1.484 s ±  0.326 s    [User: 1.448 s, System: 0.999 s]
  Range (min … max):    1.119 s …  2.087 s    10 runs
 
Previous HEAD position was 9d3acee Document commit that removed InitialContext
HEAD is now at 2932093 jar: close the zip.File reader before recursing
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):      1.542 s ±  0.174 s    [User: 1.387 s, System: 1.076 s]
  Range (min … max):    1.245 s …  1.739 s    10 runs
 
Previous HEAD position was 2932093 jar: close the zip.File reader before recursing
HEAD is now at f6c1ebc jar: prefer io.ReadFull over io.ReadAll
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):     441.6 ms ±  85.7 ms    [User: 355.1 ms, System: 112.7 ms]
  Range (min … max):   318.3 ms … 578.1 ms    10 runs
 
Previous HEAD position was f6c1ebc jar: prefer io.ReadFull over io.ReadAll
HEAD is now at 61ade2d jar: reuse buffers for nested .jar's
Benchmark 1: ./log4jscanner /home/aktau/tmp/tmpdir/400mb_jar_in_jar.jar
  Time (mean ± σ):     340.7 ms ±  57.9 ms    [User: 269.7 ms, System: 78.8 ms]
  Range (min … max):   298.8 ms … 481.4 ms    10 runs

So I added a benchmark to test 400mb_jar_in_jar specifically, and the results now do show an improvement (first versus last commit):

benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 37376ef3d7fcb898fd5e192d12a2afe4718ffc33.bench
name                                  old time/op    new time/op    delta
Parse/400mb_jar_in_jar.jar-4             1.26s ±27%     0.27s ±10%  -78.51%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       38.0µs ± 9%    43.4µs ±40%  +14.47%  (p=0.006 n=9+9)
ParseParallel/400mb_jar_in_jar.jar-4     1.40s ±26%     0.20s ±32%  -85.39%  (p=0.000 n=10+9)
ParseParallel/safe1.jar-4               33.0µs ± 8%    61.4µs ±11%  +86.03%  (p=0.000 n=9+9)

name                                  old alloc/op   new alloc/op   delta
Parse/400mb_jar_in_jar.jar-4            2.35GB ± 0%   0.05GB ±108%  -97.86%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.810 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4    2.35GB ± 0%   0.02GB ±336%  -99.18%  (p=0.000 n=10+8)
ParseParallel/safe1.jar-4               4.93kB ± 0%    4.93kB ± 0%   -0.08%  (p=0.032 n=9+9)

name                                  old allocs/op  new allocs/op  delta
Parse/400mb_jar_in_jar.jar-4               102 ±11%        28 ± 8%  -72.43%  (p=0.000 n=10+10)
Parse/safe1.jar-4                         33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel/400mb_jar_in_jar.jar-4       180 ± 3%        39 ± 9%  -78.53%  (p=0.000 n=10+8)
ParseParallel/safe1.jar-4                 33.0 ± 0%      33.0 ± 0%     ~     (all equal)

And versus just the io.ReadAll -> io.ReadFull CL:

benchstat 48d70bfb7e99a469832f21854710f7cddb3e17fd.bench 4b23cd3db5400b191c5310c07415848bc0a2bb6d.bench
name                                  old time/op    new time/op    delta
Parse/400mb_jar_in_jar.jar-4             1.26s ±27%     0.30s ±17%  -76.18%  (p=0.000 n=10+9)
Parse/safe1.jar-4                       38.0µs ± 9%    40.3µs ± 8%   +6.15%  (p=0.028 n=9+10)
ParseParallel/400mb_jar_in_jar.jar-4     1.40s ±26%     0.35s ±45%  -74.81%  (p=0.000 n=10+9)
ParseParallel/safe1.jar-4               33.0µs ± 8%    32.2µs ±15%     ~     (p=0.243 n=9+10)

name                                  old alloc/op   new alloc/op   delta
Parse/400mb_jar_in_jar.jar-4            2.35GB ± 0%    0.42GB ± 0%  -82.14%  (p=0.000 n=10+10)
Parse/safe1.jar-4                       4.92kB ± 0%    4.92kB ± 0%     ~     (p=0.463 n=10+10)
ParseParallel/400mb_jar_in_jar.jar-4    2.35GB ± 0%    0.42GB ± 0%  -82.14%  (p=0.000 n=10+10)
ParseParallel/safe1.jar-4               4.93kB ± 0%    4.93kB ± 0%     ~     (p=0.159 n=9+10)

name                                  old allocs/op  new allocs/op  delta
Parse/400mb_jar_in_jar.jar-4               102 ±11%        30 ± 0%  -70.67%  (p=0.000 n=10+6)
Parse/safe1.jar-4                         33.0 ± 0%      33.0 ± 0%     ~     (all equal)
ParseParallel/400mb_jar_in_jar.jar-4       180 ± 3%        52 ±17%  -71.10%  (p=0.000 n=10+10)
ParseParallel/safe1.jar-4                 33.0 ± 0%      33.0 ± 0%     ~     (all equal)

This indicates that the regression in speed of the safe1.jar benchmark may be due to the garbage collector dealing with allocations from the other benchmarks [1].

[1]: Notice how some runs after the sync.Pool introduction allocate ~nothing:

BenchmarkParse/400mb_jar_in_jar.jar-4                  4         299414355 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  5         249801048 ns/op           11801 B/op         26 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         269324904 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         262044230 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         259753494 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         277134852 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         279076490 ns/op        104883246 B/op        30 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  5         263421624 ns/op        83908649 B/op         28 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         264917986 ns/op           12186 B/op         27 allocs/op
BenchmarkParse/400mb_jar_in_jar.jar-4                  4         286430956 ns/op        104883246 B/op        30 allocs/op

The pool is not cleared between benchmark runs.

ericchiang previously approved these changes Mar 1, 2022
(Two resolved review threads on jar/jar.go, now outdated.)
This uses a sync.Pool wrapper (called pool.Dynamic) that prevents the
pool from holding on to very large buffers indefinitely, while still
amortizing the cost of allocation. The policy appears to give good
results both with the pre-existing tests and the specific tests added
for the pool.

This is useful because the library is also used from long-running server
contexts, where it would be unfortunate to pin very large buffers for
too long. See golang/go#23199.

Example algorithm run (from the test):

```
num  allocs   value      target            capacity
1    1        100000     100000.000000     100000
2    1        1          52048.000000      100000
3    1        1          28072.000000      100000
4    1        1          16084.000000      100000
5    1        1          10090.000000      100000
6    1        1          7093.000000       100000
7    2        10         5594.500000       4096
8    2        1          4845.250000       4096
9    2        1          4470.625000       4096
10   2        1          4283.312500       4096
11   2        1          4189.656250       4096
12   2        1          4142.828125       4096
13   2        1          4119.414062       4096
14   2        1          4107.707031       4096
15   2        12         4101.853516       4096
16   2        1          4098.926758       4096
17   2        1          4097.463379       4096
18   2        1          4096.731689       4096
19   2        1          4096.365845       4096
20   2        1          4096.182922       4096
21   2        1          4096.091461       4096
22   2        1          4096.045731       4096
23   2        1000       4096.022865       4096
24   2        100        4096.011433       4096
25   3        10000      10000.000000      10000
26   4        100000     100000.000000     100000
27   4        1          52048.000000      100000
28   4        100000     100000.000000     100000
29   4        1          52048.000000      100000
30   4        50000      51024.000000      100000
31   4        1          27560.000000      100000
32   4        1          15828.000000      100000
33   4        25000      25000.000000      100000
34   4        1          14548.000000      100000
35   4        1          9322.000000       100000
36   5        1          6709.000000       4096
37   6        100000     100000.000000     100000
38   6        1          52048.000000      100000
39   6        1          28072.000000      100000
40   6        1          16084.000000      100000
41   6        1          10090.000000      100000
42   6        1          7093.000000       100000
43   7        1          5594.500000       4096
44   7        1          4845.250000       4096
45   7        1          4470.625000       4096
46   7        1          4283.312500       4096
47   7        100        4189.656250       4096
48   7        100        4142.828125       4096
49   7        100        4119.414062       4096
50   7        1          4107.707031       4096
51   7        1          4101.853516       4096
52   7        1          4098.926758       4096
53   7        1          4097.463379       4096
54   7        1          4096.731689       4096
55   7        100        4096.365845       4096
56   7        200        4096.182922       4096
57   7        300        4096.091461       4096
58   7        100        4096.045731       4096
59   7        50         4096.022865       4096
60   7        50         4096.011433       4096
61   7        50         4096.005716       4096
62   7        50         4096.002858       4096
63   7        50         4096.001429       4096
64   7        1          4096.000715       4096
65   7        1          4096.000357       4096
66   7        1          4096.000179       4096
67   7        1          4096.000089       4096
68   8        100000000  100000000.000000  100000000
69   8        1000000    50500000.000000   100000000
70   8        100000     25300000.000000   100000000
71   8        10000      12655000.000000   100000000
72   8        1000       6329548.000000    100000000
73   9        100        3166822.000000    4096
74   9        10         1585459.000000    4096
75   9        1          794777.500000     4096
76   9        1          399436.750000     4096
77   9        500        201766.375000     4096
78   9        2020       102931.187500     4096
79   9        400        53513.593750      4096
80   9        3984       28804.796875      4096
81   9        5          16450.398438      4096
82   9        200        10273.199219      4096
83   9        500        7184.599609       4096
84   10       40000      40000.000000      40000
85   10       35000      37500.000000      40000
86   11       45000      45000.000000      45000
87   11       42000      43500.000000      45000
88   11       38000      40750.000000      45000
89   11       38000      39375.000000      45000
90   11       39000      39187.500000      45000
91   11       41000      41000.000000      45000
92   11       42000      42000.000000      45000
93   11       42000      42000.000000      45000
94   11       2000       23048.000000      45000
95   11       4000       13572.000000      45000
96   11       3949       8834.000000       45000
97   11       2011       6465.000000       45000
98   11       4096       5280.500000       45000
99   11       33         4688.250000       45000
100  11       0          4392.125000       45000
101  12       4938       4938.000000       4938
102  12       1          4517.000000       4938
103  12       1          4306.500000       4938
104  12       1200       4201.250000       4938
105  12       2400       4148.625000       4938
106  12       1200       4122.312500       4938
107  12       200        4109.156250       4938
108  12       400        4102.578125       4938
109  12       600        4099.289062       4938
110  12       700        4097.644531       4938
111  12       100        4096.822266       4938
112  12       400        4096.411133       4938
113  12       500        4096.205566       4938
114  12       700        4096.102783       4938
115  12       600        4096.051392       4938
116  12       900        4096.025696       4938
117  12       1000       4096.012848       4938
118  12       1100       4096.006424       4938
119  12       1200       4096.003212       4938
120  12       1000       4096.001606       4938
```

Benchmarks also show that the pool still retains and reuses buffers, as
performance does not regress relative to the previous commit:

```
$ git checkout main
TMPDIR="$HOME/tmp/tmpdir"
mkdir "$TMPDIR" || true
for file in jar/testdata/* ; do
  RTMPDIR="$TMPDIR/$(basename $file)"
  mkdir "$RTMPDIR" || true
  ln -fv "$PWD/$file" "$RTMPDIR"
done
for commit in $(git log --pretty=oneline | head -5 | awk '{print $1}' | tac) ; do
  git checkout $commit
  go build
  hyperfine --ignore-failure --warmup 1 "./log4jscanner $TMPDIR/400mb_jar_in_jar.jar"
  rm log4jscanner
done
HEAD is now at 48d70bf jar: add benchmarks with 400mb_jar_in_jar.jar
  Time (mean ± σ):      2.026 s ±  0.324 s    [User: 2.363 s, System: 1.269 s]
  Range (min … max):    1.651 s …  2.749 s    10 runs

HEAD is now at bf524fa jar: close the zip.File reader before recursing
  Time (mean ± σ):      1.908 s ±  0.297 s    [User: 2.084 s, System: 1.218 s]
  Range (min … max):    1.502 s …  2.567 s    10 runs

HEAD is now at 4b23cd3 jar: prefer io.ReadFull over io.ReadAll
  Time (mean ± σ):     445.9 ms ±  51.2 ms    [User: 401.7 ms, System: 79.9 ms]
  Range (min … max):   386.3 ms … 566.1 ms    10 runs

HEAD is now at 37376ef jar: reuse buffers for nested .jar's
  Time (mean ± σ):     464.5 ms ±  41.8 ms    [User: 420.5 ms, System: 93.7 ms]
  Range (min … max):   409.2 ms … 545.5 ms    10 runs

HEAD is now at c17a81b jar: do not keep large buffers unnecessarily
  Time (mean ± σ):     436.1 ms ±  26.2 ms    [User: 409.5 ms, System: 77.6 ms]
  Range (min … max):   390.2 ms … 472.7 ms    10 runs
```
@ericchiang
Member

Okay, taking a step back...

A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term, particularly since our internal response has wound down and we have fewer cycles to work on this project :) We already hit performance oddities with this parser in the past, and I don't want to have to hunt down weird behavior or bugs.

What's the motivation here for this performance work? Are you hitting a bottleneck with some use case? Or is this strictly trying to improve performance for its own sake? If it's the latter, I'm inclined to leave this as is.

@ericchiang ericchiang dismissed their stale review March 18, 2022 17:40

see above comment

@aktau
Contributor Author

aktau commented Mar 18, 2022

> A custom sync.Pool wrapper isn't something that I'm thrilled to maintain long term.

Comments aside, it's a tiny amount of code. At first I had it as a simple if-check on the "put" operation. This results in the same thing, but I thought it looked more ergonomic with Get/Put. Most of the code is in the test.

> Particularly since our internal response has wound down, and we have fewer cycles to work on this project :)

True, but this could well improve throughput for other users, too. For us internally, it was a bottleneck. See below.

> We already hit performance oddities with this parser in the past, ...

Yes, performance seemed noticeably worse after the switchover to this version, though I wasn't able to immediately pin down why, even when comparing profiles.

> ...and I don't want to have to hunt down weird behavior or bugs.

All new code carries the risk of bugs, that's true. If there's another test you'd like that provides stronger guarantees (also for future contributions), I could see whether I could do that.

Otherwise, I'll just leave this code here for others to potentially pull in.

> What's the motivation here for this performance work? Are you hitting a bottle neck with some use case? Or is this strictly, just trying to improve the performance for its own sake? If it's the latter, I'm inclined to leave this as is.

In the last run we did internally, this was a noticeable bottleneck: this codepath was responsible for >50% of all allocations. Allocations by themselves are costly in Go, but the more processors (GOMAXPROCS) a Go program is given, the more of a bottleneck they become, as some global data structures are involved. For us, that number was very large.

You're right that the response has wound down. But I assume others can benefit from this all the same.

@ericchiang
Member

Okay, thanks for the reply! This LGTM then.

Member

@ericchiang ericchiang left a comment


please squash your commits and I'll merge

@ericchiang
Member

Actually, I see that the commits make sense on their own. Will merge as is.

@ericchiang ericchiang merged commit fb3bde4 into google:main Mar 18, 2022