-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279
Comments
pyperformance on Windows shows some gap between 3.10a7 and 3.10b1. -------------------------------------------------
Since PR25244 (28d28e0), (1) _Py_DECREF() (from Py_DECREF,Py_CLEAR,Py_SETREF) I tried in vain other linker options like thread-safe-profiling, agressive-code-generation, /OPT:NOREF. I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them. ----------------------------------------------------------------- |
Rather than defining again functions as macro, you should consider using __forceinline function attribute: see bpo-45094. |
Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules. |
Raymond:
These functions were not converted recently to static inline function. For example, Py_INCREF() was already a static inline function in Python 3.9. I don't think that any decision should be taken before performances have been analyzed in depth. I'm not convinced that there is really a performance regression. I'm not sure how benchmarks are run on Windows. neonene:
I don't understand how to read the table. Usually, I expect a comparison between a reference build and a patch build, but here you seem to use 3.10a7 as the reference to compare results. I'm not sure that geometric means can be compared this way. |
I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs). The compiler is free to inline or not depending on the pressure on registers and stack memory. I'm not sure that inlining always make the code faster, since inlining have many side effects. I don't know well MSC compiler. |
@vstinner: __forceinline suggestion Since PR25244 (mentioned above), it seems link.exe has got to get stuck on python310.dll.
Overhead field is the output of pyperf command, not subtraction (the answers are the same just luckily). ex) 3.10rc1x86 PGO:
MSVC does not produce the same code. Inlining (all or nothing) might be a quite special case in the hottest section. |
I don't understand if adding __forceinline on the mentioned static inline functions has an impact on the performance of a PGO build on Windows. |
I'm 99% sure it's a tracing PGO rather than sampling, so provided everything runs in the same order and follows the same codepaths, wall-clock timings shouldn't have a significant effect. Without looking at the generated code, it may be more effective to try and force the functions called from the macro to never be inlined. The optimiser may be missing that the calls are uncommon, and trying to inline them first, then deciding that the whole merged function is too big to inline in places where it would be useful. There's also no particular reason we need to use these tests as the profile, it's just that nobody has taken the time to come up with anything better. I'd rather see us invest time in generating a good profile rather than trying to hand-optimise inlining. (Though the test suite is good in many ways, because it makes sure all the extension modules are covered. We definitely want as close to full code coverage as we can get, at least for happy paths, but may need more eval loop in the profile if that's what needs help.) |
This article briefly introduces the inlining decisions in MSVC. |
Thanks for all suggestions. I focused on my bisected commit and the previous. I run pyperformance with 4 functions never inlined in the sections below. _Py_DECREF()
_Py_XDECREF()
_Py_IS_TYPE()
_Py_atomic_load_32bit_impl() are (1) never inlined in _PyEval_EvalFrameDefault(). slow downs [4-funcs never inlined section] pyperf compare_to upper lower: -------------------------------------------------------------- pyperf compare_to upper lower: In both x64 and x86, it looks column (2) and (*) has similar gaps. I built PGO with "/d2inlinestats" and "/d2inlinelogfull:_PyEval_EvalFrameDefault" according to the blog. I posted logs. As for PR25244, the logsize is 3x smaller than the previous and pgo rejects the 4 funcs above. I will look into it later. Collecting:
Current build is 10x~ shorter than before to link. |
MSVC 2019 has a /Ob3 option: From the experience of another project, I conjecture /Ob3 increase the "global budget" mentioned in the blog. |
According to: PGO seems to override /Ob3. _ctypes_test I use this option in _msvccompiler.py for my pyd. |
With msvc 16.10.3 and 16.11.2 (latest), Here is just a workaround for 3.10rc2 on windows. --- Python/ceval.c
+++ Python/ceval.c
@@ -1306,9 +1306 @@
-#define DISPATCH() \
- { \
- if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
- goto tracing_dispatch; \
- } \
- f->f_lasti = INSTR_OFFSET(); \
- NEXTOPARG(); \
- DISPATCH_GOTO(); \
- }
+#define DISPATCH() goto tracing_dispatch
@@ -1782,4 +1774,9 @@
tracing_dispatch:
{
+ if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
+ f->f_lasti = INSTR_OFFSET();
+ NEXTOPARG();
+ DISPATCH_GOTO();
+ }
int instr_prev = f->f_lasti;
f->f_lasti = INSTR_OFFSET(); ================================================== This patch becomes ineffective just adding one expression to DISPATCH macro as below #define DISPATCH() {if (1) goto tracing_dispatch;} And this approach is not sufficient for 3.11 with bigger eval-func. 3.10rc2 x86 pgo : 1.00 3.10rc2 x64 pgo : 1.00 (roughly the same speed as official bin) x64 results are posted. Fixing inlining rejection also made __forceinline buildable with normal processing time and memory usage. |
I reported this issue to developercommunity of microsoft. |
These should be changed back to macros where inlining is guaranteed. |
I agree with Raymond. Let's stop throwing more code at this until we've figured out what's going on and revert the change for now. |
Out of interest, have you done other experiments confirming this? The reference linked is talking about compiler throughput (i.e. how long it takes to compile), and while it hints that using __assume(0) may interfere with other optimisations, that isn't supported with any detail or analysis in the post. |
My benchmark results are left in faster-cpython/ideas#321 (comment). __assume(0) is problematic only where the substitute function is inlined. Correction of my previous post:
-unless |
For
Then the code will fall back to
|
Can you elaborate? What is the "substitute function"? The macro definition is
so there is no inlined function. Are you referring to the code containing the call to What am I missing? |
Sorry for the lack of explanation. I encountered a measurable slowdown several months ago when If I understand correctly, x86 official binaries are non-PGO builds. Then, a When I change the current version as below:
Then, PGO decides to inline
As for other place (
Benchmark after removal of
EDIT: The gap on x86 can be increased depending on the amount of optimization.
|
Here is MSVC's inlining decision on current Python3.10: Weird, but faster than when only tiny functions are inlined. |
It looks like we may be looking at different builds? I'm only looking at x64 builds for 3.11.
@zooba Is that so? |
Yeah, this is correct. We're more likely to deprecate and drop the 32-bit binaries before we make any major effort to optimise them - they run under an emulation layer in the OS (practically all supported OS installs are 64-bit native), so aren't really going to be recommended for people who care about performance anyway. |
Macros Py_DECREF, Py_XDECREF, Py_IS_TYPE, _Py_atomic_load_32bit_impl and _Py_DECREF_SPECIALIZED are redefined as macros that completely replace the inline functions of the same name. These three came out in the top four of functions that (in MSVC) somehow weren't inlined. Co-authored-by: Victor Stinner <vstinner@python.org> Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
I think this issue can be closed. (I can't after migration) Most of my experiences are invalid after Guido's #91718 corrected the quirks of MSVC. Thanks. |
Closing as requested by OP. Thanks for your investigations @neonene ! Thanks to Guido too for the fix. |
Thank you @neonene for your gentle pushes and encouragement and help to get this fixed! |
No they are complementary. |
No. What I said is about the optimization, not the (force) inlining. And what I suggested before have been already fixed by f8dc618 (and 2f233fc):
|
What I understand is that PGO build of Python 3.11 on Windows will be faster thanks to these changes, and the Windows python.org binaries only use PGO for 64-bit, not for 32-bit. |
You can read a bit more posts and links because you have changed this thread's title several times. |
Can someone please try to write a summary of this long and complex issue? It seems like different but related topics have been discussed and it's hard to get an overview. I'm confused between sometimes someone said that a change fixed the fix and then wrote that no, it's not really fixing the issue. |
Let me give it a quick try.
That's it. |
Thanks for the summary. I would add that marking performance critical function with |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: