[Perf -89%] Benchstone.BenchI.IniArray.Test #2267
Comments
This benchmark is heavily affected by code alignment.

Here is the generated code before dotnet/runtime#42909 went in. The method is aligned to a 16B boundary, but the loop body happened to fall entirely within a single 32B window, so it could be fetched in one request. The highlighted part is the range of code that falls approximately within one 32B fetch.

After dotnet/runtime#42909, method bodies started being aligned to a 32B boundary. This shifted the loop body so that it straddles a 32B boundary, and the processor now needs two requests to fetch the entire loop. The highlighted part below is one 32B chunk: only half of the loop body fits in it, so the processor has to issue another request to get the next chunk.

I also collected VTune profiler data, and it matches the finding above.

Before 32B change: [image]
After 32B change: [image]

cc: @AndyAyersMS, @adamsitnik
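The fetch-window reasoning above can be checked with a little arithmetic. The Python sketch below is illustrative only: it assumes the simplified model used in this comment (code is fetched in aligned 32-byte chunks), and `fetch_windows` is a hypothetical helper, not part of any JIT or profiler tooling. The two addresses and the 18-byte loop size are read off the inner loop (`G_M33241_IG04`) in the two listings posted later in this thread.

```python
FETCH_WINDOW = 32  # assumed fetch granularity in bytes (the "32B boundary" model)


def fetch_windows(start: int, size: int, window: int = FETCH_WINDOW) -> int:
    """Return how many aligned `window`-byte chunks the range [start, start+size) touches."""
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1


# Inner loop size: movsxd (3) + mov (8) + inc (2) + cmp (3) + jl (2) = 18 bytes.
LOOP_SIZE = 3 + 8 + 2 + 3 + 2

# Loop head addresses taken from the listings in this thread.
unaligned_loop = 0x7FF885876DFC  # loop starts 4 bytes before a 32B boundary
aligned_loop = 0x7FF8A13B7120    # loop starts exactly on a 32B boundary

print(fetch_windows(unaligned_loop, LOOP_SIZE))  # loop straddles a boundary
print(fetch_windows(aligned_loop, LOOP_SIZE))    # loop fits in one window
```

Under this model the unaligned loop touches two 32B windows and the aligned loop touches one, which is the difference the comment describes. Real front-end behavior depends on the microarchitecture, so treat the count as a rough indicator only.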
I verified that the 32B loop alignment changes I am working on improve the performance of this benchmark.
Before:
Assembly code
G_M33241_IG01: ;; offset=0000H
00007ff8`85876de0 4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M33241_IG02: ;; offset=0004H
00007ff8`85876de4 48B908285E85F87F0000 mov rcx, 0x7FF8855E2808
00007ff8`85876dee BA10000000 mov edx, 16
00007ff8`85876df3 E85891355F call CORINFO_HELP_NEWARR_1_VC
00007ff8`85876df8 33D2 xor edx, edx
;; bbWeight=1 PerfScore 1.75
G_M33241_IG03: ;; offset=001AH
00007ff8`85876dfa 33C9 xor ecx, ecx
;; bbWeight=4 PerfScore 1.00
G_M33241_IG04: ;; offset=001CH
00007ff8`85876dfc 4C63C1 movsxd r8, ecx
00007ff8`85876dff 6642C74440102000 mov word ptr [rax+2*r8+16], 32
; =========================== 32B boundary ===========================
00007ff8`85876e07 FFC1 inc ecx
00007ff8`85876e09 83F910 cmp ecx, 16
00007ff8`85876e0c 7CEE jl SHORT G_M33241_IG04
;; bbWeight=16 PerfScore 44.00
G_M33241_IG05: ;; offset=002EH
00007ff8`85876e0e FFC2 inc edx
00007ff8`85876e10 81FA80969800 cmp edx, 0x989680
00007ff8`85876e16 7CE2 jl SHORT G_M33241_IG03
;; bbWeight=4 PerfScore 6.00
G_M33241_IG06: ;; offset=0038H
00007ff8`85876e18 4883C428 add rsp, 40
00007ff8`85876e1c C3 ret
;; bbWeight=1 PerfScore 1.25
; Total bytes of code 61, prolog size 4, PerfScore 60.35, instruction count 16 (MethodHash=3ba87e26) for method Benchstone.BenchI.IniArray:Test():System.Char[]:this
; ============================================================
After:
Assembly code
G_M33241_IG01: ;; offset=0000H
00007ff8`a13b7100 4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M33241_IG02: ;; offset=0004H
00007ff8`a13b7104 48B9082812A1F87F0000 mov rcx, 0x7FF8A1122808
00007ff8`a13b710e BA10000000 mov edx, 16
00007ff8`a13b7113 E8388E365F call CORINFO_HELP_NEWARR_1_VC
00007ff8`a13b7118 33D2 xor edx, edx
;; bbWeight=1 PerfScore 1.75
G_M33241_IG03: ;; offset=001AH
00007ff8`a13b711a 33C9 xor ecx, ecx
00007ff8`a13b711c 0F1F4000 align
; =========================== 32B boundary ===========================
00007ff8`a13b7120 align
00007ff8`a13b7120 align
;; bbWeight=4 PerfScore 4.00
G_M33241_IG04: ;; offset=0020H
00007ff8`a13b7120 4C63C1 movsxd r8, ecx
00007ff8`a13b7123 6642C74440102000 mov word ptr [rax+2*r8+16], 32
00007ff8`a13b712b FFC1 inc ecx
00007ff8`a13b712d 83F910 cmp ecx, 16
00007ff8`a13b7130 7CEE jl SHORT G_M33241_IG04
;; bbWeight=16 PerfScore 44.00
G_M33241_IG05: ;; offset=0032H
00007ff8`a13b7132 FFC2 inc edx
00007ff8`a13b7134 81FA80969800 cmp edx, 0x989680
00007ff8`a13b713a 7CDE jl SHORT G_M33241_IG03
;; bbWeight=4 PerfScore 6.00
G_M33241_IG06: ;; offset=003CH
00007ff8`a13b713c 4883C428 add rsp, 40
; =========================== 32B boundary ===========================
00007ff8`a13b7140 C3 ret
;; bbWeight=1 PerfScore 1.25
; Total bytes of code 65, prolog size 4, PerfScore 66.45, instruction count 19 (MethodHash=3ba87e26) for method Benchstone.BenchI.IniArray:Test():System.Char[]:this
; ============================================================
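The `align` padding in the "After" listing can be reproduced by hand. The sketch below is a rough illustration, not the JIT's actual logic: `padding_to_align` is a hypothetical helper, and the only inputs are addresses and byte counts visible in the two listings.

```python
def padding_to_align(addr: int, alignment: int = 32) -> int:
    """Bytes of padding needed so that `addr` lands on the next `alignment` boundary."""
    return (-addr) % alignment


# In the "After" listing, `xor ecx, ecx` ends at ...711c, so without padding
# the inner loop (G_M33241_IG04) would start there.
unpadded_loop_head = 0x7FF8A13B711C

pad = padding_to_align(unpadded_loop_head)
print(pad)                              # size of the NOP (0F1F4000) emitted for `align`
print(hex(unpadded_loop_head + pad))    # resulting loop head address

# The padding also accounts for the growth in total code size between listings.
BEFORE_TOTAL_BYTES = 61
AFTER_TOTAL_BYTES = 65
print(BEFORE_TOTAL_BYTES + pad == AFTER_TOTAL_BYTES)
```

The computed padding is 4 bytes, which matches the 4-byte `0F1F4000` NOP before the boundary and the change in total code size from 61 to 65 bytes.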
@kunalspathak thank you for a great analysis! The before/after diff looks very promising (perf x2)!
Run Information
Regressions in Benchstone.BenchI.IniArray
Historical Data in Reporting System
Repro
Histogram
Benchstone.BenchI.IniArray.Test
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository