
Optimizing BOLT flags #128514

Open
Tracked by #101525
zanieb opened this issue Jan 5, 2025 · 15 comments

zanieb (Contributor) commented Jan 5, 2025

Feature or enhancement

This is a tracking issue for discussion on determining the optimal flags for BOLT to improve performance.

Tuning the flags is mentioned in #101525, but it doesn't feel like a blocker for stabilization.

Linked PRs

zanieb added the type-feature (A feature request or enhancement), performance (Performance or resource usage), and build (The build process and cross-build) labels on Jan 5, 2025
zanieb (Contributor, Author) commented Jan 5, 2025

There was a talk in March 2024 at the LLVM Performance Workshop; I can't find a copy of the talk online, but the slides are available at https://llvm.org/devmtg/2024-03/slides/practical-use-of-bolt.pdf

It includes the following suggestions:

  • Function splitting: -split-functions, -split-strategy=cdsplit
  • Function reordering: -reorder-functions=cdsort
  • Block reordering: -reorder-blocks=ext-tsp
  • Use THP pages for hot text: -hugify
  • PLT optimization: -plt
  • More aggressive ICF: -icf
  • Indirect Call Promotion: -indirect-call-promotion

We're currently using:

cpython/configure.ac, lines 2199 to 2212 at b60044b:

-reorder-blocks=ext-tsp
-reorder-functions=cdsort
-split-functions
-icf=1
-inline-all
-split-eh
-reorder-functions-use-hot-size
-peepholes=none
-jump-tables=aggressive
-inline-ap
-indirect-call-promotion=all
-dyno-stats
-use-gnu-stack
-frame-opt=hot

This suggests we should explore -split-strategy=cdsplit, -hugify, and -plt.
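As a sketch (not what configure does today), the candidate flags could be tried by hand on a relocatable build; the binary path, the profile file name, and the -plt value below are assumptions on my part:

```sh
# Sketch only: manually re-optimize an existing binary with the candidate flags.
# ./python is assumed to be linked with --emit-relocs; python.fdata is a merged BOLT profile.
llvm-bolt ./python -o ./python.bolt \
    -data=python.fdata \
    -reorder-blocks=ext-tsp \
    -reorder-functions=cdsort \
    -split-functions \
    -split-strategy=cdsplit \
    -hugify \
    -plt=all   # value assumed; the slides only mention "-plt"
```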

zanieb (Contributor, Author) commented Jan 5, 2025

There's some commentary in #124948 (comment)

python-build-standalone recently added -hugify and -split-strategy=cdsplit (astral-sh/python-build-standalone#462), though the performance benefits were not validated.

My intent is to do some benchmarking for each flag.

liusy58 commented Jan 7, 2025

Feel free to ask me anything. I'm currently working on BOLT, and I also want to contribute to Python!

liusy58 commented Jan 7, 2025

By the way, how do you collect profiles? By instrumentation or with perf?

liusy58 commented Jan 7, 2025

I strongly recommend adding --split-all-cold.

zanieb (Contributor, Author) commented Jan 7, 2025

I was going to collect benchmarks with https://github.com/python/pyperformance (i.e., not instrumentation) on my Linux machine.

I think I can also post branches and ask the faster-cpython team to run benchmarks: https://github.com/faster-cpython/benchmarking-public

I have a few commits ready.
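As a sketch of that comparison workflow, assuming two local builds at placeholder paths:

```sh
# Sketch: compare a baseline build against a BOLT-flag variant with pyperformance.
# ./python-baseline and ./python-variant are placeholder paths, not real artifacts from this issue.
python -m pip install pyperformance
pyperformance run --python=./python-baseline -o baseline.json
pyperformance run --python=./python-variant -o variant.json
python -m pyperf compare_to baseline.json variant.json --table
```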

liusy58 commented Jan 8, 2025

Profiles are key to BOLT. You are on x86, right? I remember cdsplit is not supported on AArch64.

zanieb (Contributor, Author) commented Jan 8, 2025

I have machines with both architectures.

Are you suggesting an alternative approach to measuring the effect?

liusy58 commented Jan 8, 2025

Yeah, maybe AArch64 can gain more performance.

zanieb (Contributor, Author) commented Jan 10, 2025

As an update, I set up an x86-64 bare-metal machine with LLVM 19 and am running benchmarks for the flags I described above. I'm not including LTO in the baseline; should I?

zanieb (Contributor, Author) commented Jan 11, 2025

corona10 (Member) commented:

> By the way, how do you collect profiles? By instrumentation or with perf?

FYI, we are getting the BOLTed binary through instrumentation, not perf, when we actually build.
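For context, a rough sketch of that instrumentation-based flow; the file names, the training command, and the BOLT_APPLY_FLAGS variable are illustrative stand-ins for what configure's --enable-bolt automates:

```sh
# Illustrative instrumentation-based BOLT flow (placeholder paths and workload).
llvm-bolt ./python -instrument -o ./python.instrumented \
    -instrumentation-file=/tmp/python.fdata -instrumentation-file-append-pid
./python.instrumented -m test --pgo            # training workload writes per-PID raw profiles
merge-fdata /tmp/python.fdata.* > merged.fdata # combine the raw profiles
# BOLT_APPLY_FLAGS stands for the flag list quoted from configure.ac above.
llvm-bolt ./python -o ./python.bolt -data=merged.fdata ${BOLT_APPLY_FLAGS}
```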

corona10 (Member) commented:

> As an update, I set up an x86-64 bare-metal machine with LLVM 19 and am running benchmarks for the flags I described above. I'm not including LTO in the baseline; should I?

I believe we don't have to, as long as the only difference from the baseline is the flag :)

corona10 (Member) commented:

(I am adding myself as assignee to catch up)

zanieb (Contributor, Author) commented Jan 16, 2025

A second round of benchmarks with more samples comes out a little different: https://gist.github.com/zanieb/8614bcb40b0db24dd678f2983146fb43

The effect depends on the workload.
