Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279

neonene · 2021-09-06T15:27:18Z

BPO	45116
Nosy	@malemburg, @gvanrossum, @rhettinger, @pfmoore, @vstinner, @tjguk, @markshannon, @zware, @zooba, @animalize, @pablogsal, @brandtbucher, @neonene, @erlend-aasland, @Fidget-Spinner
PRs	bpo-45116: Add the Py_ALWAYS_INLINE macro #28390 bpo-45116: Py_DEBUG ignores Py_ALWAYS_INLINE #28419 bpo-45116: Use Py_ALWAYS_INLINE in object.h #28427 [3.10] bpo-45116: Shrink interpreter by targetted revert of #25244 #28475 bpo-45116 Fix another performance regression on Windows #28630 bpo-45116 Add a warm-up for PGO training on Windows #28631 bpo-45116: Make --inlinestat option available in build.bat for diagnostic logs #31436 [3.10] bpo-45116: Fix inlining regressions on Windows Release build #31459 gh-89279: In ceval.c, redefine some macros for speed #32387
Files	310rc1_confirm_overhead.patch ceval_310rc1_patched.c b98e-no-inline-in-all.diff b98e-no-inline-in-eval.diff b98e-no-inline-in-the-others.diff pyproject_inlinestat.patch x64_28d2.log x64_b98e.log 310rc2_benchmarks.txt 310a7_vs_310rc2_bench.txt PR28475_inline.log PR28475_vs_310rc2_vs_310a7.txt PR28475_skip1test_bench.txt 310rc2patched_vs_310rc2notrace.txt switch-case_unarranged_bench.txt ceval_PR29565_split_func.c

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-09-06.15:27:18.216>
labels = ['interpreter-core', '3.10', 'performance', 'expert-C-API', '3.11', 'OS-windows']
title = 'Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)'
updated_at = <Date 2022-04-08.11:04:12.476>
user = 'https://github.com/neonene'

bugs.python.org fields:

activity = <Date 2022-04-08.11:04:12.476>
actor = 'steve.dower'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core', 'Windows', 'C API']
creation = <Date 2021-09-06.15:27:18.216>
creator = 'neonene'
dependencies = []
files = ['50263', '50264', '50271', '50272', '50273', '50274', '50275', '50276', '50280', '50286', '50291', '50293', '50296', '50315', '50363', '50452']
hgrepos = []
issue_num = 45116
keywords = ['patch']
message_count = 82.0
messages = ['401143', '401152', '401154', '401182', '401183', '401319', '401329', '401346', '401364', '401623', '401624', '401628', '401743', '401964', '401970', '401972', '402025', '402040', '402043', '402044', '402063', '402064', '402065', '402067', '402068', '402071', '402090', '402091', '402092', '402098', '402099', '402117', '402135', '402143', '402189', '402190', '402217', '402229', '402230', '402287', '402289', '402296', '402307', '402308', '402320', '402480', '402856', '402857', '402858', '402864', '402867', '402871', '402878', '402886', '402891', '402893', '402928', '402930', '402954', '403403', '403409', '403430', '403432', '403464', '403559', '403587', '403609', '404089', '406354', '406386', '406407', '406416', '406471', '406474', '406479', '406487', '406613', '407188', '415378', '416911', '416950', '416977']
nosy_count = 15.0
nosy_names = ['lemburg', 'gvanrossum', 'rhettinger', 'paul.moore', 'vstinner', 'tim.golden', 'Mark.Shannon', 'zach.ware', 'steve.dower', 'malin', 'pablogsal', 'brandtbucher', 'neonene', 'erlendaasland', 'kj']
pr_nums = ['28390', '28419', '28427', '28475', '28630', '28631', '31436', '31459', '32387']
priority = None
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue45116'
versions = ['Python 3.10', 'Python 3.11']

neonene · 2021-09-06T15:27:17Z

pyperformance on Windows shows some gap between 3.10a7 and 3.10b1.
The following are the ratios compared with 3.10a7 (the higher the slower).

-------------------------------------------------
Windows x64 | PGO release official-binary
----------------+--------------------------------
20210405 |
3.10a7 | 1.00 1.24 1.00 (PGO?)
20210408-07:58 |
b98eba5 | 0.98
20210408-10:22 |

PR25244 | 1.04
20210503 |
3.10b1 | 1.07 1.21 1.07
-------------------------------------------------
Windows x86 | PGO release official-binary
----------------+--------------------------------
20210405 |
3.10a7 | 1.00 1.25 1.27 (release?)
20210408-07:58 |
b98eba5 | 1.00
20210408-10:22 |
PR25244 | 1.11
20210503 |
3.10b1 | 1.14 1.28 1.29

Since PR25244 (28d28e0),
_PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
At least the functions below have become un-inlined there at all.

(1) _Py_DECREF() (from Py_DECREF,Py_CLEAR,Py_SETREF)
(2) _Py_XDECREF() (from Py_XDECREF,SETLOCAL)
(3) _Py_IS_TYPE() (from PyXXX_CheckExact)
(4) _Py_atomic_load_32bit_impl() (from CHECK_EVAL_BREAKER)

I tried in vain other linker options like thread-safe-profiling, agressive-code-generation, /OPT:NOREF.
3.10a7 can inline them in the eval-loop even if profiling only test_array.py.

I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

-----------------------------------------------------------------
Windows x64 | PGO patched overhead in eval-loop
----------------+------------------------------------------------
3.10a7 | 1.00
20210802 |
3.10rc1 | 1.09 1.05 4% (slow 43, fast 5, same 10)
20210831-20:42 |
863154c | 0.95 0.90 5% (slow 48, fast 3, same 7)
(3.11a0+) |
-----------------------------------------------------------------
Windows x86 | PGO patched overhead in eval-loop
----------------+------------------------------------------------
3.10a7 | 1.00
20210802 |
3.10rc1 | 1.15 1.13 2% (slow 29, fast 14, same 15)
20210831-20:42 |
863154c | 1.05 1.02 3% (slow 44, fast 7, same 7)
(3.11a0+) |

vstinner · 2021-09-06T16:35:18Z

Rather than defining again functions as macro, you should consider using __forceinline function attribute: see bpo-45094.

rhettinger · 2021-09-06T16:43:20Z

Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

vstinner · 2021-09-06T21:49:38Z

Raymond:

Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

These functions were not converted recently to static inline function. For example, Py_INCREF() was already a static inline function in Python 3.9. I don't think that any decision should be taken before performances have been analyzed in depth.

I'm not convinced that there is really a performance regression. I'm not sure how benchmarks are run on Windows.

neonene:

I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

I don't understand how to read the table.

Usually, I expect a comparison between a reference build and a patch build, but here you seem to use 3.10a7 as the reference to compare results. I'm not sure that geometric means can be compared this way.

vstinner · 2021-09-06T21:52:54Z

Since PR25244 (28d28e0),
_PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
At least the functions below have become un-inlined there at all.

I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

The compiler is free to inline or not depending on the pressure on registers and stack memory. I'm not sure that inlining always make the code faster, since inlining have many side effects.

I don't know well MSC compiler.

neonene · 2021-09-07T18:13:05Z

@vstinner: __forceinline suggestion

Since PR25244 (mentioned above), it seems link.exe has got to get stuck on python310.dll.
Before the PR, it took 10x~ longer to link than without __forceinline function.
I can confirm with _Py_DECREF() and _Py_XDECREF() and one training-job (the more fucntions forced/jobs used, the slower to link).
Have you tried __forceinline on PGO ?

I don't understand how to read the table.

Overhead field is the output of pyperf command, not subtraction (the answers are the same just luckily).

ex) 3.10rc1x86 PGO:
PGO : pyperf compare_to 3.10a7 left
patched : pyperf compare_to 3.10a7 right
overhead : pyperf compare_to right left
are
1.15x slower (slower 52, faster 4, not significant 2)
1.13x slower (slower 50, faster 4, not significant 4)
1.02x slower (slower 29, faster 14, not significant 15)

I'm not sure if PGO builds are reproducible,

MSVC does not produce the same code. Inlining (all or nothing) might be a quite special case in the hottest section.
I suspect the profiler doesn't work well only for _PyEval_EvalFrameDefault(), including branch/align optimization.
So my posted macro or inlining is just for a mesureing, not the solution.

vstinner · 2021-09-07T19:15:36Z

I don't understand if adding __forceinline on the mentioned static inline functions has an impact on the performance of a PGO build on Windows.

zooba · 2021-09-07T21:57:07Z

I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

I'm 99% sure it's a tracing PGO rather than sampling, so provided everything runs in the same order and follows the same codepaths, wall-clock timings shouldn't have a significant effect.

Without looking at the generated code, it may be more effective to try and force the functions called from the macro to never be inlined. The optimiser may be missing that the calls are uncommon, and trying to inline them first, then deciding that the whole merged function is too big to inline in places where it would be useful.

There's also no particular reason we need to use these tests as the profile, it's just that nobody has taken the time to come up with anything better. I'd rather see us invest time in generating a good profile rather than trying to hand-optimise inlining. (Though the test suite is good in many ways, because it makes sure all the extension modules are covered. We definitely want as close to full code coverage as we can get, at least for happy paths, but may need more eval loop in the profile if that's what needs help.)

animalize · 2021-09-08T09:48:25Z

This article briefly introduces the inlining decisions in MSVC.
https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/

neonene · 2021-09-11T02:15:36Z

Thanks for all suggestions. I focused on my bisected commit and the previous.

I run pyperformance with 4 functions never inlined in the sections below.

   _Py_DECREF()
   _Py_XDECREF()
   _Py_IS_TYPE()
   _Py_atomic_load_32bit_impl()

are

(1) never inlined in _PyEval_EvalFrameDefault().
(2) never inlined in the other funcitons.
(3) never inlined in all functions.

slow downs [4-funcs never inlined section]
--------------------------------------------------------------
Windows x64 PGO (44job) (*) (1) (2) (3)
rebuild none eval others all
--------------------------------------------------------------
b98eba5 (4 funcs inlined in eval) 1.00 1.05 1.09 1.14
PR25244 ( not inlined in eval) 1.06 1.07 1.18 1.17

pyperf compare_to upper lower:
(*) 1.06x slower (slower 45, faster 4, not not significant 9)
(1) 1.02x slower (slower 33, faster 13, not not significant 12)
(2) 1.08x slower (slower 48, faster 6, not not significant 4)
(3) 1.03x slower (slower 39, faster 6, not not significant 13)

--------------------------------------------------------------
Windows x86 PGO (44job) (*) (1) (2) (3)
rebuild none eval others all
--------------------------------------------------------------
b98eba5 (4 funcs inlined in eval) 1.00 1.03 1.06 1.15
PR25244 ( not inlined in eval) 1.13 1.13 1.22 1.24

pyperf compare_to upper lower:
(*) 1.13x slower (slower 54, faster 2, not not significant 2)
(1) 1.10x slower (slower 47, faster 3, not not significant 8)
(2) 1.14x slower (slower 54, faster 1, not not significant 3)
(3) 1.08x slower (slower 43, faster 3, not not significant 12)

In both x64 and x86, it looks column (2) and (*) has similar gaps.
So, I would like to simply focus on the eval-loop.

I built PGO with "/d2inlinestats" and "/d2inlinelogfull:_PyEval_EvalFrameDefault" according to the blog.

I posted logs. As for PR25244, the logsize is 3x smaller than the previous and pgo rejects the 4 funcs above. I will look into it later.

Collecting:

Before the PR, it took 10x~ longer to link than without __forceinline function.

Current build is 10x~ shorter than before to link.
Before the PR, __forceinline had no impact to me.

animalize · 2021-09-11T02:45:23Z

MSVC 2019 has a /Ob3 option:
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion

From the experience of another project, I conjecture /Ob3 increase the "global budget" mentioned in the blog.
I used /Ob3 for the 3.10 branch, and there seems to be no significant performance change. If you have time, neonene welcome to verify this.

neonene · 2021-09-11T03:30:27Z

According to:
https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-160

PGO seems to override /Ob3.
Around this issue, I posted benchmark on bpo-44381.
On python building, /Ob3 works for only non-pgo-supported dlls like,

_ctypes_test
_freeze_importlib
_msi
_testbuffer
_testcapi
_testconsole
_testembed
_testimportmultiple
_testinternalcapi
_testmultiphase
_uuid
liblzma
pylauncher
pyshellext
pywlauncher
sqlite3
venvlauncher
venvwlauncher
winsound

I use this option in _msvccompiler.py for my pyd.
I will try and report when PGO with /Ob3 makes difference in the log.

neonene · 2021-09-13T23:37:47Z

With msvc 16.10.3 and 16.11.2 (latest),
PR25244 told me the amount of code in _PyEval_EvalFrameDefault() is over the limit of PGO.
In the old version of _PyEval_EvalFrameDefault (b98eba5), the same issue can be caused adding any-code anywhere with more than 20 expressions/statements. For example, at the top/middle/end of the function, repeating "if (0) {}" 10times, or "if (0) {19 statements}". As for python3.9.7, more than 800 expressions/statements.

Here is just a workaround for 3.10rc2 on windows.
==================================================

--- Python/ceval.c
+++ Python/ceval.c
@@ -1306,9 +1306 @@
-#define DISPATCH() \
-    { \
-        if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
-            goto tracing_dispatch; \
-        } \
-        f->f_lasti = INSTR_OFFSET(); \
-        NEXTOPARG(); \
-        DISPATCH_GOTO(); \
-    }
+#define DISPATCH() goto tracing_dispatch
@@ -1782,4 +1774,9 @@
     tracing_dispatch:
     {
+        if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
+            f->f_lasti = INSTR_OFFSET();
+            NEXTOPARG();
+            DISPATCH_GOTO();
+        }
         int instr_prev = f->f_lasti;
         f->f_lasti = INSTR_OFFSET();

==================================================

This patch becomes ineffective just adding one expression to DISPATCH macro as below

   #define DISPATCH() {if (1) goto tracing_dispatch;}

And this approach is not sufficient for 3.11 with bigger eval-func.
I don't know a cl/link option to lift such restriction of function size.

3.10rc2 x86 pgo : 1.00
patched : 1.09x faster (slower 5, faster 48, not significant 5)

3.10rc2 x64 pgo : 1.00 (roughly the same speed as official bin)
patched : 1.07x faster (slower 5, faster 47, not significant 6)
patched(/Ob3) : 1.07x faster (slower 7, faster 45, not significant 6)

x64 results are posted.

Fixing inlining rejection also made __forceinline buildable with normal processing time and memory usage.

neonene · 2021-09-16T15:44:51Z

I reported this issue to developercommunity of microsoft.

https://developercommunity.visualstudio.com/t/1531987

rhettinger · 2021-09-16T16:26:53Z

These should be changed back to macros where inlining is guaranteed.

zooba · 2021-09-16T16:42:55Z

I agree with Raymond. Let's stop throwing more code at this until we've figured out what's going on and revert the change for now.

zooba · 2022-04-08T11:04:12Z

__assume(0) should be replaced with other function, inside the eval switch-case or in the inlined paths of callees. This is critical with PGO.

Out of interest, have you done other experiments confirming this? The reference linked is talking about compiler throughput (i.e. how long it takes to compile), and while it hints that using __assume(0) may interfere with other optimisations, that isn't supported with any detail or analysis in the post.

neonene · 2022-04-10T23:59:41Z

have you done other experiments confirming this?

My benchmark results are left in faster-cpython/ideas#321 (comment).

__assume(0) is problematic only where the substitute function is inlined.

Correction of my previous post:

MSVC's stuck can be avoided in many ways, ...
unless tp_dealloc does not create a inlined callsite

-unless
+if

gvanrossum · 2022-04-11T19:26:47Z

For Py_UNREACHABLE() then maybe we should just remove these two lines from, pymacro.h?

#elif defined(_MSC_VER)
#  define Py_UNREACHABLE() __assume(0)

Then the code will fall back to

#else
#  define Py_UNREACHABLE() \
    Py_FatalError("Unreachable C code path reached")
#endif

gvanrossum · 2022-04-12T01:31:05Z

__assume(0) is problematic only where the substitute function is inlined.

Can you elaborate? What is the "substitute function"? The macro definition is

#  define Py_UNREACHABLE() __assume(0)

so there is no inlined function. Are you referring to the code containing the call to Py_UNREACHABLE()? That wouldn't affect the ceval.c main loop in _PyEval_EvalFrameDefault because that function is definitely to large to be inlined. :-)

What am I missing?

neonene · 2022-04-13T00:13:20Z

Sorry for the lack of explanation.

I encountered a measurable slowdown several months ago when Py_RETURN_RICHCOMPARE macro is inlined in the eval-loop.
However, that may be x86 only.

If I understand correctly, x86 official binaries are non-PGO builds. Then, a Py_FatalError() only for TARGET(CACHE) branch would be enough for now.

When I change the current version as below:

Substitute void Py_UNREACHABLE(void) {} or Py_FatalError() for __assume(0) in pymacro.h
Make PyObject_RichCompare() called through a function pointer, adding
this above _PyEval_EvalFrameDefault().

static const richcmpfunc PyObject_RichCompare_PTR = PyObject_RichCompare;
#define PyObject_RichCompare PyObject_RichCompare_PTR

Then, PGO decides to inline PyObject_RichCompare(), based on its profile.
This seems to affect "Function Layout optimization" even if it is not inlined. (under verification)

    PyObject_RichCompare (pgu decision)
      _PyThreadState_GET (pgu decision)
        _PyRuntimeState_GetThreadState (pgu decision)
      _PyErr_Occurred (pgu decision)
      -_PyErr_BadInternalCall (pgo hard reject)
      _Py_EnterRecursiveCall (pgu decision)
        _Py_MakeRecCheck (pgu decision)
        -_Py_CheckRecursiveCall (pgo hard reject)
      do_richcompare (pgu decision)
        PyType_IsSubtype (pgu decision)
          -type_is_subtype_base_chain (pgo hard reject)
        Py_DECREF (pgu decision)
          _Py_Dealloc (pgu decision)
        long_richcompare (pgu decision)
          ...
>>>>>     Py_UNREACHABLE (pgu decision)  // or _Py_FatalErrorFunc (pgo hard reject)

As for other place (_Py_FatalErrorFunc() never gets inlined anywhere):

    dict_get (pgu decision)
      -_PyArg_CheckPositional (pgo hard reject)
      dict_get_impl (pgu decision)
        unicode_get_hash (pgu decision)
        PyObject_Hash (pgu decision)
          _Py_HashPointer (pgu decision)
            _Py_HashPointerRaw (pgu decision)
          -PyType_Ready (pgo hard reject)
          -PyObject_HashNotImplemented (pgo hard reject)
        _Py_dict_lookup (pgu decision)
          -unicodekeys_lookup_unicode (pgo hard reject)
          -unicodekeys_lookup_generic (pgo hard reject)
          dictkeys_generic_lookup (pgu decision)
            dictkeys_get_index (pgu decision)
            Py_INCREF (pgu decision)
            -PyObject_RichCompareBool (pgo hard reject)
            Py_DECREF (pgu decision)
              _Py_Dealloc (pgu decision)
>>>>>       -Py_UNREACHABLE (pgo hard reject)  // (no harm)

    Py_DECREF (force inline)
      -_Py_Dealloc (initial scan: soft depth exceeded)
    -_Py_Specialize_BinaryOp (pgo hard reject)
>>>  Py_UNREACHABLE (pgu decision)            // @TARGET(CACHE) inlined
    _PyFrame_SetStackPointer (pgu decision)
    -trace_function_entry (pgo hard reject)
    _PyFrame_GetStackPointer (pgu decision)
    PyDTrace_FUNCTION_ENTRY_ENABLED (pgu decision)
    -dtrace_function_entry (pgo hard reject)
    PyDTrace_LINE_ENABLED (pgu decision)
    -maybe_dtrace_line (pgo hard reject)
    _PyFrame_SetStackPointer (pgu decision)
    -maybe_call_line_trace (pgo hard reject)
    _PyFrame_GetStackPointer (pgu decision)
    -_PyInterpreterFrame_GetLine (pgo hard reject)
    -fprintf (initial scan: parameter mismatch, varargs, not eligible)
    -_PyErr_SetString (pgo hard reject)
>>> -Py_UNREACHABLE (pgo hard reject)         // out of switch (no harm)

Benchmark after removal of TARGET(CACHE) branch:

Py_UNREACHABLE at long_richcompare()	x64 PGO	x86 PGO
__assume(0)	1.00	1.00
Py_FatalError	1.02x *slower*	1.03x~ faster
void foo(void) {}	1.02x *slower*	1.04x~ faster

__assume(0) works well in the hot section on x64.

EDIT: The gap on x86 can be increased depending on the amount of optimization.

>pyperf compare_to  assume64  fatalerr64

Slower (25):
- dulwich_log: 195 ms +- 3 ms -> 231 ms +- 50 ms: 1.19x slower
- xml_etree_process: 98.3 ms +- 1.7 ms -> 109 ms +- 1 ms: 1.11x slower
- pickle_pure_python: 540 us +- 7 us -> 592 us +- 41 us: 1.10x slower
- logging_silent: 163 ns +- 1 ns -> 177 ns +- 3 ns: 1.09x slower
- xml_etree_generate: 138 ms +- 4 ms -> 148 ms +- 3 ms: 1.07x slower
- unpickle_pure_python: 348 us +- 9 us -> 371 us +- 5 us: 1.07x slower
- meteor_contest: 152 ms +- 1 ms -> 162 ms +- 2 ms: 1.06x slower
- hexiom: 10.8 ms +- 0.1 ms -> 11.4 ms +- 0.2 ms: 1.06x slower
- deltablue: 6.78 ms +- 0.23 ms -> 7.14 ms +- 0.09 ms: 1.05x slower
- scimark_fft: 491 ms +- 13 ms -> 517 ms +- 12 ms: 1.05x slower
- fannkuch: 627 ms +- 14 ms -> 660 ms +- 20 ms: 1.05x slower
- crypto_pyaes: 127 ms +- 3 ms -> 134 ms +- 4 ms: 1.05x slower
- sqlalchemy_imperative: 44.8 ms +- 2.3 ms -> 47.1 ms +- 5.3 ms: 1.05x slower
- nqueens: 164 ms +- 1 ms -> 172 ms +- 3 ms: 1.05x slower
- float: 138 ms +- 2 ms -> 144 ms +- 3 ms: 1.05x slower
- logging_format: 20.4 us +- 0.4 us -> 21.3 us +- 0.9 us: 1.04x slower
- nbody: 179 ms +- 6 ms -> 185 ms +- 7 ms: 1.04x slower
- django_template: 85.4 ms +- 1.1 ms -> 88.6 ms +- 2.3 ms: 1.04x slower
- go: 240 ms +- 4 ms -> 249 ms +- 7 ms: 1.03x slower
- regex_compile: 231 ms +- 2 ms -> 238 ms +- 3 ms: 1.03x slower
- json_dumps: 21.1 ms +- 0.2 ms -> 21.6 ms +- 0.2 ms: 1.02x slower
- logging_simple: 18.9 us +- 0.3 us -> 19.3 us +- 0.4 us: 1.02x slower
- sqlalchemy_declarative: 277 ms +- 8 ms -> 283 ms +- 8 ms: 1.02x slower
- scimark_monte_carlo: 104 ms +- 2 ms -> 106 ms +- 2 ms: 1.02x slower
- pathlib: 183 ms +- 12 ms -> 187 ms +- 7 ms: 1.02x slower

Faster (6):
- pickle_dict: 44.7 us +- 0.5 us -> 41.9 us +- 0.9 us: 1.07x faster
- scimark_sor: 215 ms +- 4 ms -> 203 ms +- 3 ms: 1.06x faster
- telco: 11.3 ms +- 0.1 ms -> 10.8 ms +- 0.1 ms: 1.04x faster
- unpickle: 23.0 us +- 0.3 us -> 22.2 us +- 0.2 us: 1.04x faster
- pickle_list: 6.75 us +- 0.17 us -> 6.60 us +- 0.19 us: 1.02x faster
- python_startup: 13.8 ms +- 1.2 ms -> 13.5 ms +- 0.1 ms: 1.02x faster

Benchmark hidden because not significant (27): 2to3, chameleon, chaos, json_loads, mako, pickle, pid
igits, pyflate, python_startup_no_site, raytrace, regex_dna, regex_effbot, regex_v8, richards, scima
rk_lu, scimark_sparse_mat_mult, spectral_norm, sqlite_synth, sympy_expand, sympy_integrate, sympy_su
m, sympy_str, tornado_http, unpack_sequence, unpickle_list, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.02x slower

>pyperf compare_to  assume64  emptyfunc64

Slower (23):
- nbody: 179 ms +- 6 ms -> 191 ms +- 4 ms: 1.07x slower
- xml_etree_process: 98.3 ms +- 1.7 ms -> 105 ms +- 1 ms: 1.07x slower
- meteor_contest: 152 ms +- 1 ms -> 163 ms +- 7 ms: 1.07x slower
- scimark_fft: 491 ms +- 13 ms -> 523 ms +- 11 ms: 1.07x slower
- xml_etree_generate: 138 ms +- 4 ms -> 146 ms +- 3 ms: 1.06x slower
- fannkuch: 627 ms +- 14 ms -> 662 ms +- 15 ms: 1.06x slower
- spectral_norm: 179 ms +- 2 ms -> 189 ms +- 3 ms: 1.05x slower
- unpickle_pure_python: 348 us +- 9 us -> 366 us +- 5 us: 1.05x slower
- telco: 11.3 ms +- 0.1 ms -> 11.8 ms +- 0.2 ms: 1.05x slower
- scimark_lu: 153 ms +- 3 ms -> 160 ms +- 2 ms: 1.04x slower
- django_template: 85.4 ms +- 1.1 ms -> 89.0 ms +- 1.2 ms: 1.04x slower
- python_startup: 13.8 ms +- 1.2 ms -> 14.3 ms +- 1.8 ms: 1.04x slower
- nqueens: 164 ms +- 1 ms -> 170 ms +- 4 ms: 1.04x slower
- json_dumps: 21.1 ms +- 0.2 ms -> 21.9 ms +- 0.2 ms: 1.04x slower
- deltablue: 6.78 ms +- 0.23 ms -> 7.01 ms +- 0.10 ms: 1.03x slower
- crypto_pyaes: 127 ms +- 3 ms -> 131 ms +- 3 ms: 1.03x slower
- unpack_sequence: 69.1 ns +- 3.0 ns -> 71.3 ns +- 6.4 ns: 1.03x slower
- hexiom: 10.8 ms +- 0.1 ms -> 11.1 ms +- 0.2 ms: 1.03x slower
- sympy_sum: 299 ms +- 4 ms -> 308 ms +- 6 ms: 1.03x slower
- chameleon: 14.7 ms +- 0.2 ms -> 15.1 ms +- 0.8 ms: 1.03x slower
- sqlalchemy_declarative: 277 ms +- 8 ms -> 284 ms +- 12 ms: 1.02x slower
- raytrace: 543 ms +- 6 ms -> 556 ms +- 8 ms: 1.02x slower
- scimark_sparse_mat_mult: 6.86 ms +- 0.35 ms -> 7.00 ms +- 0.30 ms: 1.02x slower

Faster (4):
- scimark_sor: 215 ms +- 4 ms -> 202 ms +- 2 ms: 1.06x faster
- sqlite_synth: 5.44 us +- 0.13 us -> 5.21 us +- 0.02 us: 1.05x faster
- pickle_list: 6.75 us +- 0.17 us -> 6.53 us +- 0.05 us: 1.03x faster
- pickle_dict: 44.7 us +- 0.5 us -> 43.7 us +- 1.5 us: 1.02x faster

Benchmark hidden because not significant (31): 2to3, chaos, dulwich_log, float, go, json_loads, logg
ing_format, logging_silent, logging_simple, mako, pathlib, pickle, pickle_pure_python, pidigits, pyf
late, python_startup_no_site, regex_compile, regex_dna, regex_effbot, regex_v8, richards, scimark_mo
nte_carlo, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpickle,
unpickle_list, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.02x slower

>pyperf compare_to  assume86  fatalerr86

Slower (6):
- sqlalchemy_imperative: 40.2 ms +- 2.2 ms -> 42.6 ms +- 2.9 ms: 1.06x slower
- unpickle_list: 8.10 us +- 0.10 us -> 8.47 us +- 0.18 us: 1.05x slower
- pickle: 21.1 us +- 0.6 us -> 21.8 us +- 0.2 us: 1.04x slower
- scimark_monte_carlo: 148 ms +- 5 ms -> 153 ms +- 5 ms: 1.03x slower
- sqlalchemy_declarative: 261 ms +- 5 ms -> 269 ms +- 14 ms: 1.03x slower
- sqlite_synth: 5.57 us +- 0.03 us -> 5.71 us +- 0.03 us: 1.03x slower

Faster (28):
- unpack_sequence: 126 ns +- 11 ns -> 95.3 ns +- 5.7 ns: 1.32x faster
- logging_silent: 225 ns +- 5 ns -> 191 ns +- 10 ns: 1.18x faster
- spectral_norm: 279 ms +- 6 ms -> 244 ms +- 4 ms: 1.14x faster
- hexiom: 14.3 ms +- 0.1 ms -> 12.6 ms +- 0.1 ms: 1.14x faster
- scimark_lu: 230 ms +- 6 ms -> 205 ms +- 5 ms: 1.12x faster
- richards: 101 ms +- 1 ms -> 90.4 ms +- 4.0 ms: 1.12x faster
- scimark_sparse_mat_mult: 10.7 ms +- 0.3 ms -> 9.64 ms +- 0.27 ms: 1.11x faster
- deltablue: 8.37 ms +- 0.10 ms -> 7.65 ms +- 0.31 ms: 1.09x faster
- xml_etree_process: 129 ms +- 1 ms -> 120 ms +- 2 ms: 1.07x faster
- scimark_sor: 275 ms +- 6 ms -> 257 ms +- 11 ms: 1.07x faster
- float: 175 ms +- 2 ms -> 164 ms +- 2 ms: 1.07x faster
- go: 308 ms +- 7 ms -> 288 ms +- 9 ms: 1.07x faster
- xml_etree_generate: 176 ms +- 2 ms -> 165 ms +- 4 ms: 1.06x faster
- pickle_pure_python: 652 us +- 11 us -> 621 us +- 16 us: 1.05x faster
- raytrace: 695 ms +- 5 ms -> 663 ms +- 6 ms: 1.05x faster
- pathlib: 189 ms +- 6 ms -> 180 ms +- 5 ms: 1.04x faster
- pyflate: 963 ms +- 21 ms -> 923 ms +- 27 ms: 1.04x faster
- nbody: 247 ms +- 8 ms -> 237 ms +- 7 ms: 1.04x faster
- xml_etree_parse: 272 ms +- 4 ms -> 264 ms +- 7 ms: 1.03x faster
- regex_compile: 272 ms +- 3 ms -> 263 ms +- 2 ms: 1.03x faster
- mako: 22.0 ms +- 0.3 ms -> 21.3 ms +- 0.1 ms: 1.03x faster
- chameleon: 19.9 ms +- 0.2 ms -> 19.4 ms +- 0.4 ms: 1.03x faster
- chaos: 174 ms +- 2 ms -> 170 ms +- 2 ms: 1.03x faster
- xml_etree_iterparse: 175 ms +- 3 ms -> 170 ms +- 4 ms: 1.03x faster
- nqueens: 203 ms +- 1 ms -> 198 ms +- 2 ms: 1.03x faster
- unpickle_pure_python: 456 us +- 6 us -> 445 us +- 16 us: 1.02x faster
- unpickle: 25.9 us +- 0.4 us -> 25.3 us +- 0.4 us: 1.02x faster
- json_loads: 47.1 us +- 0.4 us -> 46.2 us +- 0.5 us: 1.02x faster

Benchmark hidden because not significant (24): 2to3, crypto_pyaes, django_template, dulwich_log, fan
nkuch, json_dumps, logging_format, logging_simple, meteor_contest, pickle_dict, pickle_list, pidigit
s, python_startup, python_startup_no_site, regex_dna, regex_effbot, regex_v8, scimark_fft, sympy_exp
and, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http

Geometric mean: 1.03x faster

>pyperf compare_to  assume86  emptyfunc86

Slower (3):
- pickle: 21.1 us +- 0.6 us -> 22.2 us +- 0.5 us: 1.05x slower
- python_startup: 13.2 ms +- 0.2 ms -> 13.8 ms +- 2.5 ms: 1.04x slower
- regex_dna: 243 ms +- 1 ms -> 250 ms +- 6 ms: 1.03x slower

Faster (38):
- unpack_sequence: 126 ns +- 11 ns -> 94.1 ns +- 5.3 ns: 1.34x faster
- hexiom: 14.3 ms +- 0.1 ms -> 12.5 ms +- 0.2 ms: 1.15x faster
- deltablue: 8.37 ms +- 0.10 ms -> 7.43 ms +- 0.08 ms: 1.13x faster
- spectral_norm: 279 ms +- 6 ms -> 249 ms +- 2 ms: 1.12x faster
- logging_silent: 225 ns +- 5 ns -> 202 ns +- 7 ns: 1.11x faster
- xml_etree_process: 129 ms +- 1 ms -> 118 ms +- 1 ms: 1.09x faster
- unpickle_pure_python: 456 us +- 6 us -> 418 us +- 12 us: 1.09x faster
- nqueens: 203 ms +- 1 ms -> 187 ms +- 2 ms: 1.09x faster
- scimark_sparse_mat_mult: 10.7 ms +- 0.3 ms -> 9.85 ms +- 0.33 ms: 1.09x faster
- scimark_lu: 230 ms +- 6 ms -> 213 ms +- 4 ms: 1.08x faster
- xml_etree_generate: 176 ms +- 2 ms -> 163 ms +- 3 ms: 1.08x faster
- richards: 101 ms +- 1 ms -> 93.7 ms +- 3.9 ms: 1.08x faster
- float: 175 ms +- 2 ms -> 163 ms +- 3 ms: 1.08x faster
- scimark_sor: 275 ms +- 6 ms -> 257 ms +- 6 ms: 1.07x faster
- unpickle_list: 8.10 us +- 0.10 us -> 7.60 us +- 0.23 us: 1.07x faster
- scimark_monte_carlo: 148 ms +- 5 ms -> 139 ms +- 5 ms: 1.06x faster
- go: 308 ms +- 7 ms -> 290 ms +- 8 ms: 1.06x faster
- raytrace: 695 ms +- 5 ms -> 661 ms +- 7 ms: 1.05x faster
- crypto_pyaes: 175 ms +- 1 ms -> 166 ms +- 2 ms: 1.05x faster
- nbody: 247 ms +- 8 ms -> 235 ms +- 8 ms: 1.05x faster
- pyflate: 963 ms +- 21 ms -> 917 ms +- 15 ms: 1.05x faster
- xml_etree_iterparse: 175 ms +- 3 ms -> 167 ms +- 6 ms: 1.04x faster
- meteor_contest: 178 ms +- 2 ms -> 171 ms +- 5 ms: 1.04x faster
- sympy_sum: 298 ms +- 11 ms -> 287 ms +- 2 ms: 1.04x faster
- pickle_pure_python: 652 us +- 11 us -> 628 us +- 16 us: 1.04x faster
- scimark_fft: 724 ms +- 10 ms -> 697 ms +- 15 ms: 1.04x faster
- regex_compile: 272 ms +- 3 ms -> 263 ms +- 7 ms: 1.03x faster
- unpickle: 25.9 us +- 0.4 us -> 25.1 us +- 0.3 us: 1.03x faster
- chameleon: 19.9 ms +- 0.2 ms -> 19.3 ms +- 0.4 ms: 1.03x faster
- sympy_expand: 912 ms +- 21 ms -> 885 ms +- 7 ms: 1.03x faster
- mako: 22.0 ms +- 0.3 ms -> 21.4 ms +- 0.2 ms: 1.03x faster
- chaos: 174 ms +- 2 ms -> 169 ms +- 2 ms: 1.03x faster
- django_template: 93.7 ms +- 1.6 ms -> 91.1 ms +- 1.4 ms: 1.03x faster
- fannkuch: 907 ms +- 19 ms -> 885 ms +- 12 ms: 1.02x faster
- sympy_str: 542 ms +- 4 ms -> 529 ms +- 10 ms: 1.02x faster
- xml_etree_parse: 272 ms +- 4 ms -> 266 ms +- 9 ms: 1.02x faster
- json_dumps: 23.8 ms +- 0.2 ms -> 23.3 ms +- 0.3 ms: 1.02x faster
- regex_v8: 38.7 ms +- 0.4 ms -> 38.0 ms +- 0.4 ms: 1.02x faster

Benchmark hidden because not significant (17): 2to3, dulwich_log, json_loads, logging_format, loggin
g_simple, pathlib, pickle_dict, pickle_list, pidigits, python_startup_no_site, regex_effbot, sqlalch
emy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_integrate, telco, tornado_http

Geometric mean: 1.04x faster

neonene · 2022-04-13T01:53:29Z

Are you referring to the code containing the call to Py_UNREACHABLE()? That wouldn't affect the ceval.c main loop in _PyEval_EvalFrameDefault because that function is definitely to large to be inlined. :-)

Here is MSVC's inlining decision on current Python3.10:
https://bugs.python.org/file50291/PR28475_inline.log

Weird, but faster than when only tiny functions are inlined.
In the log, Py_DECREF(static) is expanded until Py_Dealloc(extern) stops its recursion. That looks to me too expansive.

gvanrossum · 2022-04-13T02:44:05Z

It looks like we may be looking at different builds? I'm only looking at x64 builds for 3.11.

If I understand correctly, x86 official binaries are non-PGO builds.

@zooba Is that so?

zooba · 2022-04-13T12:11:55Z

If I understand correctly, x86 official binaries are non-PGO builds.

Yeah, this is correct. We're more likely to deprecate and drop the 32-bit binaries before we make any major effort to optimise them - they run under an emulation layer in the OS (practically all supported OS installs are 64-bit native), so aren't really going to be recommended for people who care about performance anyway.

Macros Py_DECREF, Py_XDECREF, Py_IS_TYPE, _Py_atomic_load_32bit_impl and _Py_DECREF_SPECIALIZED are redefined as macros that completely replace the inline functions of the same name. These three came out in the top four of functions that (in MSVC) somehow weren't inlined. Co-authored-by: Victor Stinner <vstinner@python.org> Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>

neonene · 2022-04-23T06:45:05Z

I think this issue can be closed. (I can't after migration)

Most of my experiences are invalid after Guido's #91718 corrected the quirks of MSVC.
Another reasonable fix would be a good test which makes specialized sections hotter.

Thanks.

Fidget-Spinner · 2022-04-23T09:09:22Z

Closing as requested by OP. Thanks for your investigations @neonene ! Thanks to Guido too for the fix.

gvanrossum · 2022-04-23T14:38:00Z

Thank you @neonene for your gentle pushes and encouragement and help to get this fixed!

vstinner · 2022-04-25T15:39:59Z

@neonene:

Most of my experiences are invalid after Guido's #91718 corrected the quirks of MSVC.

Do you mean that this merged change 2f233fc is now useless?

gvanrossum · 2022-04-25T16:28:42Z

No they are complementary.

neonene · 2022-04-25T20:59:39Z

Do you mean that this merged change 2f233fc is now useless?

No. What I said is about the optimization, not the (force) inlining. And what I suggested before have been already fixed by f8dc618 (and 2f233fc):

tp_* or cfunc pointer in the eval-loop can inline multiple callees without conflict.
Moving LOAD_FAST out of switch according to the scores below has no advantage now.

TOP3 entries with current 44 tests
case 124  132522464  // LOAD_FAST
case 100   48956231  // LOAD_CONST
case  45   48318813  // LOAD_FAST__LOAD_FAST

vstinner · 2022-04-25T22:03:25Z

What I understand is that PGO build of Python 3.11 on Windows will be faster thanks to these changes, and the Windows python.org binaries only use PGO for 64-bit, not for 32-bit.

neonene · 2022-04-25T22:33:04Z

You can read a bit more posts and links because you have changed this thread's title several times.

vstinner · 2022-04-26T08:02:02Z

Can someone please try to write a summary of this long and complex issue? It seems like different but related topics have been discussed and it's hard to get an overview. I'm confused between sometimes someone said that a change fixed the fix and then wrote that no, it's not really fixing the issue.

gvanrossum · 2022-04-28T04:18:05Z

Let me give it a quick try.

Originally, @neonene observed a Windows-specific performance regression in 3.10 between the a7 and b1 release. This was eventually shown to be caused by the function _PyEval_EvalFrameDefault getting so long that the MSVC LTO gave up on inlining many things there. IIUC in 3.10 this was eventually fixed by making the function a bit smaller ([3.10] bpo-45116: Shrink interpreter by targetted revert of #25244 #28475).
Of course, the same issue was then observed in the main branch (3.11). We then went back and forth trying various approaches to fix it. This wasn't easy because (a) the code kept changing (because the "Faster CPython" team was very active -- mostly growing the function), and (b) we had no good hardware or strategy to run reliable benchmarks. The latter problem spawned Measure Windows performance (and improve if lacking) faster-cpython/ideas#321.
Eventually I settled on a fix which consisted of turning a few inline functions back into macros, but only in ceval.c, and not in debug mode (and one only for MSVC). This was PR Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279, commit 2f233fc.
Somewhat relatedly, I also figured out how to get MSVC to generate slightly faster switch code: if you switch on a one-byte value and all 256 cases exist, it skips a memory load. This was PR Improve performance for switch in ceval.c when using MSVC #91719, commit f8dc618.
Finally, I figured out how to get stable benchmark numbers (see Measure Windows performance (and improve if lacking) faster-cpython/ideas#321 (comment) and following comments) and showed that the macrofied inline functions gave us 10% performance back and the improved switch code gave 3%.

That's it.

vstinner · 2022-04-28T07:00:13Z

Thanks for the summary. I would add that marking performance critical function with __forceinline (Py_ALWAYS_INLINE) was tested, but it didn't work.

neonene mannequin added performance Performance or resource usage 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows topic-C-API labels Sep 6, 2021

vstinner changed the title ~~Performance regression 3.10b1 and later on Windows~~ Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build Sep 13, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

arhadthedev mentioned this issue Apr 11, 2022

Some after-migration messages are semi-attributed to mannequins #91437

Closed

markshannon mentioned this issue Apr 13, 2022

gh-91404: Improve re performance #91495

Merged

gvanrossum mentioned this issue Apr 13, 2022

gh-89279: In ceval.c, redefine some macros for speed #32387

Merged

gvanrossum added a commit to gvanrossum/cpython that referenced this issue Apr 19, 2022

Add comment referring to pythonGH-89279 for is_method()

c43ec56

gvanrossum mentioned this issue Apr 20, 2022

Improve performance for switch in ceval.c when using MSVC #91719

Closed

Fidget-Spinner closed this as completed Apr 23, 2022

gvanrossum mentioned this issue Apr 23, 2022

Measure Windows performance (and improve if lacking) faster-cpython/ideas#321

Closed

2bndy5 mentioned this issue Mar 11, 2023

add nox config; update CI jbms/sphinx-immaterial#230

Merged

eduardo-elizondo mentioned this issue Apr 10, 2023

gh-84436: Implement Immortal Objects #19474

Merged

Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279

Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279

Comments

neonene mannequin commented Sep 6, 2021

neonene mannequin commented Sep 6, 2021

vstinner commented Sep 6, 2021

rhettinger commented Sep 6, 2021

vstinner commented Sep 6, 2021

vstinner commented Sep 6, 2021

neonene mannequin commented Sep 7, 2021

vstinner commented Sep 7, 2021

zooba commented Sep 7, 2021

animalize mannequin commented Sep 8, 2021

neonene mannequin commented Sep 11, 2021

animalize mannequin commented Sep 11, 2021

neonene mannequin commented Sep 11, 2021

neonene mannequin commented Sep 13, 2021

neonene mannequin commented Sep 16, 2021

rhettinger commented Sep 16, 2021

zooba commented Sep 16, 2021

zooba commented Apr 8, 2022

neonene commented Apr 10, 2022 • edited Loading

gvanrossum commented Apr 11, 2022 • edited Loading

gvanrossum commented Apr 12, 2022

neonene commented Apr 13, 2022 • edited Loading

neonene commented Apr 13, 2022 • edited Loading

gvanrossum commented Apr 13, 2022 • edited Loading

zooba commented Apr 13, 2022

neonene commented Apr 23, 2022 • edited Loading

Fidget-Spinner commented Apr 23, 2022

gvanrossum commented Apr 23, 2022

vstinner commented Apr 25, 2022

gvanrossum commented Apr 25, 2022

neonene commented Apr 25, 2022 • edited Loading

vstinner commented Apr 25, 2022

neonene commented Apr 25, 2022

vstinner commented Apr 26, 2022

gvanrossum commented Apr 28, 2022

vstinner commented Apr 28, 2022

neonene commented Apr 10, 2022 •

edited

Loading

gvanrossum commented Apr 11, 2022 •

edited

Loading

neonene commented Apr 13, 2022 •

edited

Loading

neonene commented Apr 13, 2022 •

edited

Loading

gvanrossum commented Apr 13, 2022 •

edited

Loading

neonene commented Apr 23, 2022 •

edited

Loading

neonene commented Apr 25, 2022 •

edited

Loading