[BOLT] [3.12] Python 3.12.7 --enable bolt option not working #124948

Open
ptr1337 opened this issue Oct 3, 2024 · 17 comments · Fixed by #128572
Labels
build The build process and cross-build type-bug An unexpected behavior, bug, or error

Comments

@ptr1337

ptr1337 commented Oct 3, 2024

Bug report

Bug description:

Hi all,

After updating llvm-bolt to 19.1.0, it is no longer possible to build with the --enable-bolt option.
The following can be found in the log:

BOLT-INFO: 0 out of 5950 functions in the binary (0.0%) have non-empty execution profile
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 7119
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 5874
BOLT-INSTRUMENTER: Number of function descriptors: 5874
BOLT-INSTRUMENTER: Number of branch counters: 78400
BOLT-INSTRUMENTER: Number of ST leaf node counters: 40936
BOLT-INSTRUMENTER: Number of direct call counters: 0
BOLT-INSTRUMENTER: Total number of counters: 119336
BOLT-INSTRUMENTER: Total size of counters: 954688 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 133862 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 7642344 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /tmp/pkg/src/Python-3.12.7/libpython3.12.so.1.0.bolt
BOLT-INFO: 65850 instructions were shortened
BOLT-INFO: removed 134 empty blocks
BOLT-INFO: UCE removed 1155 blocks and 71042 bytes of code
BOLT-INFO: padding code to 0xe00000 to accommodate hot text
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0xf90930
BOLT-INFO: clear procedure is 0xf8c380
BOLT-INFO: patched build-id (flipped last bit)
BOLT-INFO: setting __bolt_runtime_start to 0xf908e0
BOLT-INFO: setting __bolt_runtime_fini to 0xf90930
BOLT-INFO: setting __hot_start to 0x800000
BOLT-INFO: setting __hot_end to 0xdf7438
BOLT-ERROR: unable to get new address corresponding to input address 0x1a6e11 in function sre_ucs1_match/1. Consider adding this function to --skip-funcs=...
make[1]: *** [Makefile:856: profile-bolt-stamp] Error 1
make[1]: Leaving directory '/tmp/pkg/src/Python-3.12.7'
make: *** [Makefile:885: bolt-opt] Error 2
==> ERROR: A failure occurred in build().

So it appears to fail during instrumentation, and BOLT suggests adding --skip-funcs=sre_ucs1_match/*.

CPython versions tested on:

3.12

Operating systems tested on:

Linux

@ptr1337 ptr1337 added the type-bug An unexpected behavior, bug, or error label Oct 3, 2024
@picnixz picnixz added the build The build process and cross-build label Oct 3, 2024
@Eclips4
Member

Eclips4 commented Oct 3, 2024

cc @corona10

@corona10 corona10 self-assigned this Oct 4, 2024
@ZIZUN

ZIZUN commented Nov 2, 2024

I will take a look

@ZIZUN

ZIZUN commented Nov 3, 2024

leesm@leesm-ubuntu:~/Workspace/cpython$ ./configure --enable-bolt
leesm@leesm-ubuntu:~/Workspace/cpython$ make
->
BOLT-INFO: basic block reordering modified layout of 2699 functions (68.73% of profiled, 45.13% of total)
BOLT-INFO: UCE removed 36 blocks and 0 bytes of code
BOLT-INFO: splitting separates 1220980 hot bytes from 780031 cold bytes (61.02% of split functions is hot).
BOLT-INFO: 43 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

         18740817903 : executed forward branches
          7826248652 : taken forward branches
          7349242935 : executed backward branches
          2852718872 : taken backward branches
          4004518103 : executed unconditional branches
          3353239032 : all function calls
          1262678534 : indirect calls
           137784052 : PLT calls
        259717127425 : executed instructions
         77083459173 : executed load instructions
         43910218984 : executed store instructions
          4541628185 : taken jump table branches
                   0 : taken unknown indirect branches
         30094578941 : total branches
         14683485627 : taken branches
         15411093314 : non-taken conditional branches
         10678967524 : taken conditional branches
         26090060838 : all conditional branches

         21694777552 : executed forward branches (+15.8%)
          1679657684 : taken forward branches (-78.5%)
          4863157010 : executed backward branches (-33.8%)
          2286864783 : taken backward branches (-19.8%)
          1866969944 : executed unconditional branches (-53.4%)
          2800009758 : all function calls (-16.5%)
           673537882 : indirect calls (-46.7%)
           137784052 : PLT calls (=)
        256484195370 : executed instructions (-1.2%)
         76738590906 : executed load instructions (-0.4%)
         43910090648 : executed store instructions (-0.0%)
          4541628185 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
         28424904506 : total branches (-5.5%)
          5833492411 : taken branches (-60.3%)
         22591412095 : non-taken conditional branches (+46.6%)
          3966522467 : taken conditional branches (-62.9%)
         26557934562 : all conditional branches (+1.8%)

BOLT-INFO: SCTC: patched 60 tail calls (57 forward) tail calls (3 backward) from a total of 60 while removing 3 double jumps and removing 50 basic blocks totalling 250 bytes of code. CTCs total execution count is 36614663 and the number of times CTCs are taken is 25417844
BOLT-INFO: FOP optimized 1 redundant load(s) and 0 unused store(s)
BOLT-INFO: Frequency of redundant loads is 15661399 and frequency of unused stores is 0
BOLT-INFO: Frequency of loads changed to use a register is 15661399 and frequency of loads changed to use an immediate is 0
BOLT-INFO: FOP deleted 1 load(s) (dyn count: 15661399) and 0 store(s)
BOLT-INFO: FRAME ANALYSIS: 2248 function(s) were not optimized.
BOLT-INFO: FRAME ANALYSIS: 1999 function(s) (88.8% dyn cov) could not have its frame indices restored.
BOLT-INFO: Shrink wrapping moved 14 spills inserting load/stores and 0 spills inserting push/pops
BOLT-INFO: Shrink wrapping reduced 117436164 store executions (0.0% total instructions executed, 0.3% store instructions)
BOLT-INFO: Shrink wrapping failed at reducing 0 store executions (0.0% total instructions executed, 0.0% store instructions)
BOLT-INFO: Allocation combiner: 22 empty spaces coalesced (dyn count: 84363257).
BOLT-INFO: patched build-id (flipped last bit)
BOLT-INFO: setting _end to 0xcf093c
BOLT-INFO: setting _end to 0xcf093c
BOLT-INFO: setting __hot_start to 0xa00000
BOLT-INFO: setting __hot_end to 0xb7d512
touch profile-bolt-stamp
make[1]: Leaving directory '/home/leesm/Workspace/cpython'

leesm@leesm-ubuntu:~/Workspace/cpython$ ./python
Python 3.12.7 (main, Nov  2 2024, 17:57:45) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
leesm@leesm-ubuntu:~/Workspace/cpython$ llvm-bolt --version
LLVM (http://llvm.org/):
  LLVM version 19.1.0
  Optimized build with assertions.
BOLT revision <unknown>

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_32 - AArch64 (little endian ILP32)
    aarch64_be - AArch64 (big endian)
    arm64      - ARM64 (little endian)
    arm64_32   - ARM64 (little endian ILP32)
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64

I tested this on my setup with CPython 3.12 and LLVM-BOLT 19.1.0 in an Ubuntu 20.04 environment and did not encounter any problems. The --enable-bolt option worked as expected. It appears the issue may be caused by something else in your environment. @ptr1337

@ptr1337
Author

ptr1337 commented Dec 23, 2024

Now that Python 3.13 has been pushed to Arch Linux stable, I'm still not able to BOLT it; the same errors occur.
Since the llvm-bolt package is not in the Arch Linux repository, you can reproduce it with the following Docker container:

git clone https://gitlab.archlinux.org/archlinux/packaging/packages/python.git
cd python
# enable --enable-bolt option
docker run --name dockerbuilder -e EXPORT_PKG=1 -e SYNC_DATABASE=1 -v $PWD:/pkg cachyos/docker-makepkg && docker rm dockerbuilder

@ms178

ms178 commented Dec 23, 2024

I can confirm this issue using Clang-20git (c660b281b60085cbe40d73d692badd43d7708d20) on CachyOS with Python 3.13.1:

BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: c660b281b60085cbe40d73d692badd43d7708d20
BOLT-INFO: first alloc address is 0x200000
BOLT-INFO: creating new program header table at address 0xa00000, offset 0x800000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling lite mode
BOLT-INFO: 0 out of 6 functions in the binary (0.0%) have non-empty execution profile
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 2
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 4
BOLT-INSTRUMENTER: Number of function descriptors: 4
BOLT-INSTRUMENTER: Number of branch counters: 1
BOLT-INSTRUMENTER: Number of ST leaf node counters: 4
BOLT-INSTRUMENTER: Number of direct call counters: 0
BOLT-INSTRUMENTER: Total number of counters: 5
BOLT-INSTRUMENTER: Total size of counters: 40 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 47 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 356 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /tmp/makepkg/python/src/Python-3.13.1/python.bolt
BOLT-INFO: padding code to 0xe00000 to accommodate hot text
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0xe06890
BOLT-INFO: clear procedure is 0xe02360
BOLT-INFO: setting __bolt_runtime_start to 0xe06850
BOLT-INFO: setting __bolt_runtime_fini to 0xe06890
BOLT-INFO: setting __hot_start to 0xc00000
BOLT-INFO: setting __hot_end to 0xc00134
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: c660b281b60085cbe40d73d692badd43d7708d20
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0xc00000, offset 0xc00000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 1081 relocations
BOLT-WARNING: 26 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 0 out of 7729 functions in the binary (0.0%) have non-empty execution profile
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 1611
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 7663
BOLT-INSTRUMENTER: Number of function descriptors: 7663
BOLT-INSTRUMENTER: Number of branch counters: 87823
BOLT-INSTRUMENTER: Number of ST leaf node counters: 43203
BOLT-INSTRUMENTER: Number of direct call counters: 0
BOLT-INSTRUMENTER: Total number of counters: 131026
BOLT-INSTRUMENTER: Total size of counters: 1048208 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 166433 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 8686344 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /tmp/makepkg/python/src/Python-3.13.1/libpython3.13.so.1.0.bolt
BOLT-INFO: 67316 instructions were shortened
BOLT-INFO: removed 59 empty blocks
BOLT-INFO: UCE removed 491 blocks and 30006 bytes of code
BOLT-INFO: padding code to 0x1600000 to accommodate hot text
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0x17d38a0
BOLT-INFO: clear procedure is 0x17cf370
BOLT-INFO: setting __bolt_runtime_start to 0x17d3860
BOLT-INFO: setting __bolt_runtime_fini to 0x17d38a0
BOLT-INFO: setting __hot_start to 0xe00000
BOLT-INFO: setting __hot_end to 0x148ed34
BOLT-ERROR: unable to get new address corresponding to input address 0x476f7a in function _PyEval_EvalFrameDefault. Consider adding this function to --skip-funcs=...

And if I do as advised by BOLT, I see another function that errors out instead:

BOLT-ERROR: unable to get new address corresponding to input address 0x4fa143 in function sre_ucs1_match/1(*2). Consider adding this function to --skip-funcs=...
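
Because skipping one function only surfaces the next, it can help to collect the reported names from the build log automatically. The following is a minimal sketch, not part of the CPython build: the helper name `failing_functions` is hypothetical, it assumes the error line keeps the wording shown in the logs above, and stripping the trailing `(*2)` marker is an assumption based on the `--skip-funcs` names used in this thread.

```python
import re

# Matches the BOLT-ERROR line quoted in this thread and captures the function name.
_BOLT_ERROR = re.compile(
    r"BOLT-ERROR: unable to get new address corresponding to input address "
    r"0x[0-9a-fA-F]+ in function (\S+?)\. Consider adding"
)


def failing_functions(log_text: str) -> list[str]:
    """Return the function names BOLT complained about, in order, without duplicates."""
    names: list[str] = []
    for match in _BOLT_ERROR.finditer(log_text):
        # Assumption: a trailing "(*N)" is a BOLT annotation, not part of the name.
        name = re.sub(r"\(\*\d+\)$", "", match.group(1))
        if name not in names:
            names.append(name)
    return names


# The two error lines quoted above, used as sample input.
LOG = (
    "BOLT-ERROR: unable to get new address corresponding to input address 0x476f7a "
    "in function _PyEval_EvalFrameDefault. Consider adding this function to --skip-funcs=...\n"
    "BOLT-ERROR: unable to get new address corresponding to input address 0x4fa143 "
    "in function sre_ucs1_match/1(*2). Consider adding this function to --skip-funcs=...\n"
)

if __name__ == "__main__":
    print("--skip-funcs=" + ",".join(failing_functions(LOG)))
```

In practice each build stops at the first error, so the log from every attempt would have to be fed in before the list is complete.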

@ptr1337
Author

ptr1337 commented Dec 27, 2024

Also, another idea:
Instead of relying on instrumentation (which works in most environments), we could also add an option for an alternative.
@lseman provided such an option a while ago in the following patch: https://termbin.com/zksh

@ptr1337
Author

ptr1337 commented Dec 27, 2024

Adding this to the BOLT commands:
--skip-funcs=sre_ucs1_match/1,_PyEval_EvalFrameDefault.localalias/1
appears to fix it.

@indygreg
Contributor

The "unable to get new address corresponding to input address" error was added in LLVM 19.1 by llvm/llvm-project#89681. It can trigger when computed gotos are used in PIC-compiled code. I'm not sure about the original report, but _PyEval_EvalFrameDefault uses computed gotos, which makes it incompatible with BOLT as of LLVM 19.1. (And older LLVM versions may handle this incorrectly, leading to buggy behavior.)

I think a proper fix here is to add any functions with computed gotos to the BOLT exclusion list. Presumably a future LLVM release will gain the ability to perform these dynamic relocations, so we'll [eventually] want some form of LLVM version sniffing to control the behavior.
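
As a rough illustration of the version sniffing idea, here is a minimal sketch that parses `llvm-bolt --version` output like the sample earlier in this thread. The helper names are hypothetical, the 19.1.0 threshold is taken from the comment above, and no upper bound is applied because no fixed LLVM release is known at this point in the thread.

```python
import re


def parse_llvm_version(version_output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from `llvm-bolt --version` output."""
    match = re.search(r"LLVM version (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no 'LLVM version X.Y.Z' line found")
    return tuple(int(part) for part in match.groups())


def skip_funcs_flag(version_output: str, funcs: list[str]) -> str:
    """Return a -skip-funcs flag when the detected BOLT needs it, else ''."""
    # Assumption: 19.1.0 is the first affected release; a future release that can
    # handle these relocations would need an upper bound added here.
    if funcs and parse_llvm_version(version_output) >= (19, 1, 0):
        return "-skip-funcs=" + ",".join(funcs)
    return ""


# Sample text in the shape printed by `llvm-bolt --version` earlier in this thread.
SAMPLE = "LLVM (http://llvm.org/):\n  LLVM version 19.1.0\n  Optimized build with assertions.\n"

if __name__ == "__main__":
    print(skip_funcs_flag(SAMPLE, ["_PyEval_EvalFrameDefault"]))
```

A real implementation would presumably live in configure rather than in Python; the sketch only shows the decision logic.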

@ms178

ms178 commented Dec 31, 2024

@indygreg This might be the relevant LLVM PR for gaining that functionality: llvm/llvm-project#120267

@indygreg
Contributor

indygreg commented Jan 1, 2025

Nice find, @ms178! I agree that PR looks promising. Hopefully it makes LLVM 20.

While I'm here, python-build-standalone is working around the issue with a patch at astral-sh/python-build-standalone#463. I needed to add -skip-funcs=_PyEval_EvalFrameDefault,sre_ucs1_match/1,sre_ucs2_match/1,sre_ucs4_match/1 to both BOLT_INSTRUMENT_FLAGS and BOLT_APPLY_FLAGS to get things to work.
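
For anyone scripting the same workaround, here is a minimal sketch of appending the skip list to both variables. The helper `add_skip_funcs` is hypothetical, the starting flag strings are placeholders rather than CPython's real defaults, and the code assumes the flag strings contain no quoted whitespace.

```python
def add_skip_funcs(flags: str, funcs: list[str]) -> str:
    """Append -skip-funcs to a BOLT flag string, merging with any existing one."""
    kept: list[str] = []
    existing: list[str] = []
    for token in flags.split():
        stripped = token.lstrip("-")
        if stripped.startswith("skip-funcs="):
            # Fold an existing skip list into the merged one.
            existing.extend(stripped[len("skip-funcs="):].split(","))
        else:
            kept.append(token)
    # De-duplicate while keeping order; drop empty entries.
    merged = [name for name in dict.fromkeys(existing + funcs) if name]
    kept.append("-skip-funcs=" + ",".join(merged))
    return " ".join(kept)


# Function names taken from the comment above.
SKIP = [
    "_PyEval_EvalFrameDefault",
    "sre_ucs1_match/1",
    "sre_ucs2_match/1",
    "sre_ucs4_match/1",
]

# Placeholder starting values; the real defaults come from CPython's configure.
env = {
    "BOLT_INSTRUMENT_FLAGS": add_skip_funcs("", SKIP),
    "BOLT_APPLY_FLAGS": add_skip_funcs("-update-debug-sections", SKIP),
}

if __name__ == "__main__":
    for name, value in env.items():
        print(f"{name}={value}")
```

The resulting values could then be passed to `./configure --enable-bolt` through the environment.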

@ms178

ms178 commented Jan 1, 2025

@indygreg While at it, I'd suggest modernizing the BOLT flags a bit: -reorder-functions=cdsort and -split-strategy=cdsplit are now the state of the art according to a recent LLVM presentation from the BOLT devs (https://llvm.org/devmtg/2024-03/slides/practical-use-of-bolt.pdf - slide 33 onwards). That would need a bit of testing though.

@indygreg
Contributor

indygreg commented Jan 1, 2025

Good idea!

FWIW -reorder-functions=hfsort+ is transparently rewritten to cdsort:

      if (option == bolt::ReorderFunctions::RT_HFSORT_PLUS) {
        errs() << "BOLT-WARNING: '-reorder-functions=hfsort+' is deprecated,"
               << " please use '-reorder-functions=cdsort' instead\n";
        ReorderFunctions = bolt::ReorderFunctions::RT_CDSORT;
      }

But -split-strategy defaults to profile2.
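
If a build script wants to avoid the deprecation warning, the same rewrite can be applied to the flag string up front. This is a minimal sketch with a hypothetical helper name; it mirrors only the hfsort+ to cdsort mapping quoted above and deliberately leaves -split-strategy alone, since switching that would need testing.

```python
def modernize_bolt_flags(flags: str) -> str:
    """Rewrite the deprecated -reorder-functions=hfsort+ to cdsort; keep the rest."""
    tokens = []
    for token in flags.split():
        name, sep, value = token.partition("=")
        if name.lstrip("-") == "reorder-functions" and value == "hfsort+":
            token = f"{name}{sep}cdsort"
        tokens.append(token)
    return " ".join(tokens)


if __name__ == "__main__":
    # Illustrative flag string, not necessarily CPython's defaults.
    print(modernize_bolt_flags("-reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions"))
```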

@liusy58

liusy58 commented Jan 3, 2025

strategy=cdsplit is fully supported only on X86, not on AArch64, but I am currently working on it.

@zanieb
Contributor

zanieb commented Jan 4, 2025

> strategy=cdsplit is only fully supported in X86 but not on AArch64, but I am now working on it recently~

It's a bit off-topic here, but could you expand on this? Would it fail at build or runtime on aarch64?

@liusy58

liusy58 commented Jan 6, 2025

> > strategy=cdsplit is only fully supported in X86 but not on AArch64, but I am now working on it recently~
>
> It's a bit off-topic here, but could you expand on this? Would it fail at build or runtime on aarch64?

Is there something like a Discord for CPython? It might be more convenient to chat there.

@zanieb
Contributor

zanieb commented Jan 6, 2025

My Discord username is zanieb — you're welcome to reach out there. I'm on the Astral Discord and the Python Discord (though I'm not sure what channel would be used in the latter). We can also talk in #128514 which is dedicated to this topic — that may be best for visibility.

@liusy58

liusy58 commented Jan 6, 2025

OK, it seems that I am not getting notifications from GitHub, so I often miss important messages.

9 participants