[Perf -89%] Benchstone.BenchI.IniArray.Test #2267

Open
performanceautofiler bot opened this issue Oct 13, 2020 · 3 comments
Comments

performanceautofiler bot commented Oct 13, 2020

Run Information

| Architecture | x64 |
|------------- |---- |
| OS | Windows 10.0.18362 |
| Baseline | bdd25a2eb910d87847c0fcc47bfe31b4dc1a576e |
| Compare | b6f791d1984f998384267b8b6683bd02f64747f2 |

Regressions in Benchstone.BenchI.IniArray

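| Benchmark | Baseline | Test | Test/Base | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
|---------- |---------:|-----:|----------:|------------ |----------- |--------- |------------- |------------ |
| Test | 106.03 ms | 56.83 ms | 0.54 | | | | | |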

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Benchstone.BenchI.IniArray*'

Histogram

Benchstone.BenchI.IniArray.Test

[ 54921211.809 ;  62850262.120) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[ 62850262.120 ;  70779312.431) | 
[ 70779312.431 ;  78708362.742) | 
[ 78708362.742 ;  86637413.053) | 
[ 86637413.053 ;  94566463.364) | 
[ 94566463.364 ; 102495513.676) | 
[102495513.676 ; 105634995.133) | 
[105634995.133 ; 113564045.444) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@kunalspathak
Collaborator

This benchmark is heavily affected by code alignment.

Here is the generated code before dotnet/runtime#42909 went in. The method was aligned to a 16B boundary, but the loop body happened to fall entirely within a single 32B-aligned chunk, so it could be fetched in a single request. The highlighted part is the code that falls approximately within one 32B fetch window.

image

However, after dotnet/runtime#42909, method bodies started being aligned to a 32B boundary. This shifted the loop body so that it now straddles a 32B boundary, and the processor has to make two requests to fetch the entire loop body. The highlighted part below is one 32B chunk. As seen, only half of the loop body fits in that chunk, so the processor needs a second request to fetch the rest.

image

I also collected VTune profiler data, and it matches my finding above.

Before 32B change:

image

After 32B change:

image

cc: @AndyAyersMS, @adamsitnik

@kunalspathak
Collaborator

I verified that the 32B loop alignment changes I am working on improve the performance of this benchmark.

Before:

|   Method |     Mean |    Error |   StdDev |   Median |      Min |      Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|------:|------:|----------:|
| IniArray | 82.12 ms | 1.607 ms | 1.850 ms | 82.00 ms | 79.77 ms | 85.32 ms |     - |     - |     - |      68 B |
Assembly code
G_M33241_IG01:              ;; offset=0000H
 00007ff8`85876de0        4883EC28             sub      rsp, 40
                                                ;; bbWeight=1    PerfScore 0.25
G_M33241_IG02:              ;; offset=0004H
 00007ff8`85876de4        48B908285E85F87F0000 mov      rcx, 0x7FF8855E2808
 00007ff8`85876dee        BA10000000           mov      edx, 16
 00007ff8`85876df3        E85891355F           call     CORINFO_HELP_NEWARR_1_VC
 00007ff8`85876df8        33D2                 xor      edx, edx
                                                ;; bbWeight=1    PerfScore 1.75
G_M33241_IG03:              ;; offset=001AH
 00007ff8`85876dfa        33C9                 xor      ecx, ecx
                                                ;; bbWeight=4    PerfScore 1.00
G_M33241_IG04:              ;; offset=001CH
 00007ff8`85876dfc        4C63C1               movsxd   r8, ecx
 00007ff8`85876dff        6642C74440102000     mov      word  ptr [rax+2*r8+16], 32
; =========================== 32B boundary ===========================
 00007ff8`85876e07        FFC1                 inc      ecx
 00007ff8`85876e09        83F910               cmp      ecx, 16
 00007ff8`85876e0c        7CEE                 jl       SHORT G_M33241_IG04
                                                ;; bbWeight=16    PerfScore 44.00
G_M33241_IG05:              ;; offset=002EH
 00007ff8`85876e0e        FFC2                 inc      edx
 00007ff8`85876e10        81FA80969800         cmp      edx, 0x989680
 00007ff8`85876e16        7CE2                 jl       SHORT G_M33241_IG03
                                                ;; bbWeight=4    PerfScore 6.00
G_M33241_IG06:              ;; offset=0038H
 00007ff8`85876e18        4883C428             add      rsp, 40
 00007ff8`85876e1c        C3                   ret
                                                ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 61, prolog size 4, PerfScore 60.35, instruction count 16 (MethodHash=3ba87e26) for method Benchstone.BenchI.IniArray:Test():System.Char[]:this
; ============================================================

After:

|   Method |     Mean |    Error |   StdDev |   Median |      Min |      Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|------:|------:|----------:|
| IniArray | 43.01 ms | 0.364 ms | 0.341 ms | 43.07 ms | 42.47 ms | 43.73 ms |     - |     - |     - |      64 B |
Assembly code
G_M33241_IG01:              ;; offset=0000H
 00007ff8`a13b7100        4883EC28             sub      rsp, 40
                                                ;; bbWeight=1    PerfScore 0.25
G_M33241_IG02:              ;; offset=0004H
 00007ff8`a13b7104        48B9082812A1F87F0000 mov      rcx, 0x7FF8A1122808
 00007ff8`a13b710e        BA10000000           mov      edx, 16
 00007ff8`a13b7113        E8388E365F           call     CORINFO_HELP_NEWARR_1_VC
 00007ff8`a13b7118        33D2                 xor      edx, edx
                                                ;; bbWeight=1    PerfScore 1.75
G_M33241_IG03:              ;; offset=001AH
 00007ff8`a13b711a        33C9                 xor      ecx, ecx
 00007ff8`a13b711c        0F1F4000             align
; =========================== 32B boundary ===========================
 00007ff8`a13b7120                             align
 00007ff8`a13b7120                             align
                                                ;; bbWeight=4    PerfScore 4.00
G_M33241_IG04:              ;; offset=0020H
 00007ff8`a13b7120        4C63C1               movsxd   r8, ecx
 00007ff8`a13b7123        6642C74440102000     mov      word  ptr [rax+2*r8+16], 32
 00007ff8`a13b712b        FFC1                 inc      ecx
 00007ff8`a13b712d        83F910               cmp      ecx, 16
 00007ff8`a13b7130        7CEE                 jl       SHORT G_M33241_IG04
                                                ;; bbWeight=16    PerfScore 44.00
G_M33241_IG05:              ;; offset=0032H
 00007ff8`a13b7132        FFC2                 inc      edx
 00007ff8`a13b7134        81FA80969800         cmp      edx, 0x989680
 00007ff8`a13b713a        7CDE                 jl       SHORT G_M33241_IG03
                                                ;; bbWeight=4    PerfScore 6.00
G_M33241_IG06:              ;; offset=003CH
 00007ff8`a13b713c        4883C428             add      rsp, 40
; =========================== 32B boundary ===========================
 00007ff8`a13b7140        C3                   ret
                                                ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 65, prolog size 4, PerfScore 66.45, instruction count 19 (MethodHash=3ba87e26) for method Benchstone.BenchI.IniArray:Test():System.Char[]:this
; ============================================================
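
The boundary arithmetic behind the two listings above can be checked with a short Python sketch. It only counts how many 32B-aligned chunks the inner loop (G_M33241_IG04) touches, using the instruction addresses printed in the listings; the helper names `chunks_spanned` and `padding_to_align` are made up for illustration, and how a real processor fetches and decodes these bytes depends on the microarchitecture.

```python
CHUNK = 32  # bytes; the boundary size discussed in this thread


def chunks_spanned(start: int, size: int, chunk: int = CHUNK) -> int:
    """Count the chunk-aligned blocks touched by the byte range [start, start + size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1


def padding_to_align(addr: int, chunk: int = CHUNK) -> int:
    """Bytes of padding needed so that addr lands on the next chunk boundary."""
    return -addr % chunk


# Inner loop range: address of the first IG04 instruction up to the first IG05 instruction.
before_start, before_end = 0x7FF885876DFC, 0x7FF885876E0E  # "Before" listing
after_start, after_end = 0x7FF8A13B7120, 0x7FF8A13B7132    # "After" listing

loop_size = before_end - before_start
before_chunks = chunks_spanned(before_start, loop_size)
after_chunks = chunks_spanned(after_start, after_end - after_start)

# Address where the loop head would sit in the "After" listing without the align padding.
pad = padding_to_align(0x7FF8A13B711C)

print(f"loop size: {loop_size} bytes")
print(f"before: {before_chunks} chunk(s), after: {after_chunks} chunk(s)")
print(f"padding inserted before the loop: {pad} bytes")
```

With these addresses the 18-byte loop touches two 32B chunks in the "Before" listing and one in the "After" listing, and the 4 bytes of padding account for the growth in total code size from 61 to 65 bytes.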

@adamsitnik
Collaborator

@kunalspathak thank you for the great analysis! The before/after diff looks very promising (perf x2)!
