Very slow build times on Windows with optimization flags in Nanobind #791

pthom · 2024-11-19T09:47:39Z

pthom
Nov 19, 2024

Hi,

I noticed extremely slow build times on Windows (more than 3 hours!) when using any optimization switch (/Os, /O1, or /O2). However, builds are much faster when using /Od (no optimization), completing in under 2 minutes.

This issue seems to be specific to Windows, as similar slowdowns were not observed on Linux or macOS.

For context, my bindings were generated using litgen, a tool I authored, which I mentioned in this previous PR. I ran into this issue while porting imgui_bundle from Pybind11 to Nanobind.

I’ve conducted a detailed analysis and set up a minimal reproduction repository, which I will describe below.

Reproduction Repository

To investigate this issue, I created a minimal reproduction repository: nano_study_link. Its purpose is to analyze build times and library sizes for medium-sized projects (~8,300 lines of binding code) generated by litgen for both Pybind11 and Nanobind.

You can find detailed results in its GitHub Actions workflows.

Results Overview

Here are the build times and library sizes for both Pybind11 and Nanobind across platforms:

When using Pybind11

Platform	Lib Size (MB)	Options	Time
linux	4.9	default	3m 21s
macos	4.0	default	3m 17s
windows	3.2	default	3m 02s

When using Nanobind

Platform	Lib Size (MB)	Options	Time
linux	2.6	default (will use -Os)	1m 37s
macos	2.2	default (will use -Os)	1m 5s
linux	2.8	-O3	1m 43s
macos	2.3	-O3	1m 1s
windows	5.3	no optim	1m 59s
windows	2.2	default optim (/Os cf nanobind cmake)	2h 28m (!)
windows	2.2	/O1 (close to /Os)	2h 18m (!)
windows	2.2	/Os (optim size)	2h 22m (!)

Quick Analysis (on this particular project)

On Linux and macOS

Nanobind consistently outperforms Pybind11 in build times and produces smaller binaries
Optimization switches (-O3 vs -Os) have a minor impact on build times and binary sizes

On Windows

On Windows, build times with Nanobind are prohibitively slow under optimization switches (/Os, /O1, /O2), taking over 2 hours, compared to under 2 minutes with /Od (no optimization).
Library sizes are significantly smaller with optimization switches, but the difference may depend on the codebase (see analysis below with a more comprehensive project).
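To put the Windows figures in perspective, the durations from the Nanobind table above can be converted to seconds and compared with the /Od build. This is only a small sketch: the helper name `to_seconds` is hypothetical, and the durations are copied as-is from the table.

```python
import re

def to_seconds(duration):
    """Convert strings like '2h 28m' or '1m 59s' to a number of seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", duration))

# Windows build times with Nanobind, as reported in the table
baseline = to_seconds("1m 59s")  # /Od, no optimization
optimized = {
    "default (/Os)": "2h 28m",
    "/O1": "2h 18m",
    "/Os": "2h 22m",
}

for label, duration in optimized.items():
    slowdown = to_seconds(duration) / baseline
    print(f"{label}: about {slowdown:.0f}x slower than /Od")
```

On these numbers, every optimized Windows build is roughly 70 to 75 times slower than the unoptimized one.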

Reproduction repository summary

Here is a summarized version of the repository structure:

nano_study_link/
├── CMakeLists.txt          # CMakelists with switches for optims 
├── imgui/                  # imgui source code
│         ├── imconfig.h  
│         ├── imgui.cpp
│         ├── imgui.h
│         └── ...
├── nanobind/               # nanobind submodule
├── pybind11/               # pybind11 submodule
├── nanobind_imgui.cpp      # imgui bindings with nanobind (generated by litgen) 8300 lines
├── pybind11_imgui.cpp      # imgui bindings with pybind11 (generated by litgen) 8300 lines
├── pyproject.toml
├──.github/
        └── workflows/ # CI workflows (to analyze build times and library size)
           ├── pip_nixes_nano.yml              # Linux and macOS (Nanobind default)
           ├── pip_nixes_nano_o3.yml           # Linux and macOS with -O3
           ├── pip_nixes_pybind.yml            # Linux and macOS (Pybind11 default)
           ├── pip_win_nano_optim.yml          # Windows with Nanobind default (/Os)
           ├── pip_win_nano_optim_disabled.yml # Windows with /Od (no optimization)
           ├── pip_win_nano_optim_o1.yml       # Windows with /O1 optimization
           ├── pip_win_nano_optim_o2.yml       # Windows with /O2 optimization
           ├── pip_win_nano_optim_size.yml     # Windows with /Os optimization
           └── pip_win_pybind.yml              # Windows with Pybind11 default

A More Comprehensive Test on a Large Library

The tables below are based on imgui_bundle, a project that builds wheels for Dear ImGui along with 19 additional libraries.

This project is currently transitioning from Pybind11 to Nanobind, and the figures below show the build times and library sizes for both tools. For Windows, I forcefully disabled optimizations (/Od) to allow a fair comparison of build times and sizes.

Results Overview

Library	OS	Build Time (min)	Combined wheels Size (MB) (python 3.11->3.13)
Pybind11	macOS (latest)	50.7	63.3
Pybind11	Ubuntu (latest)	90.0	179.0
Pybind11	Windows (latest)	29.4	118.0
Nanobind	macOS (latest)	13.1	58.3
Nanobind	Ubuntu (latest)	61.0	165.0
Nanobind	Windows (latest)	22.2	124.0

Total Size Comparison

Library	Binary Wheels (MB)	Source Distribution (MB)
Nanobind	347.3	39.6
Pybind11	360.3	39.5

Sources

Wheels with nanobind
Wheels with pybind11

Summary of Findings

Build Times:
Nanobind significantly outperforms Pybind11 in build times across all platforms. On macOS, the build time is reduced from ~50 minutes to ~13 minutes, while on Windows, it drops from ~29 minutes to ~22 minutes.
Library Sizes:
Library sizes are comparable between Nanobind and Pybind11, even on Windows, where the dramatic size difference observed with ImGui is mitigated when combined with other libraries in the bundle.

Details about Wheels Action

Python Versions: 3.11, 3.12, and 3.13
macOS: Intel builds (no ARM support on GitHub Actions)
Ubuntu: x64 builds only (no ARM or 32-bit support)
Windows: x64 builds only (no ARM or 32-bit support), compiled with /Od (no optimization)

Summary

This part addresses nanobind only:

Observations:

On Windows, build times with optimization switches (/Os, /O1, /O2) may be prohibitively slow on large projects, taking several hours. In contrast, using /Od (no optimization) results in much faster builds (~2 minutes), albeit with larger binary sizes.
On Linux and macOS, testing across 20 different libraries showed fast build times with both -Os and -O3, and only minor differences in performance or binary size. However, slowdowns may still occur on different codebases.

What could be done:

Addressing multiple platforms and scenarios is hard (I know :-).

Perhaps, a note in the documentation could inform developers about the potential for extremely slow build times on Windows when using /Os or other optimization switches. For example:

"When using MSVC on Windows, builds with /Os optimization may experience significantly slower compile times compared to /Od (no optimization). Developers facing such issues can temporarily disable size optimization in their builds by adjusting the CMake configuration. This can be achieved via:
target_compile_options(your_module PRIVATE $<$<CONFIG:Release>:/Od>)

Alternatively, Nanobind’s CMake configuration could include an optional toggle to allow developers to switch between /Os (default) and /Od for faster builds when needed.

A draft could be:

function(nanobind_opt_size name)
    if (MSVC)
        # Allow users to disable /Os with an option
        if (NOT NB_DISABLE_OPTIM)
            target_compile_options(${name} PRIVATE $<${NB_OPT_SIZE}:$<$<COMPILE_LANGUAGE:CXX>:/Os>>)
        else()
            target_compile_options(${name} PRIVATE $<${NB_OPT_SIZE}:$<$<COMPILE_LANGUAGE:CXX>:/Od>>)
        endif()
    else()
        target_compile_options(${name} PRIVATE $<${NB_OPT_SIZE}:$<$<COMPILE_LANGUAGE:CXX>:-Os>>)
    endif()
endfunction()

This implementation is a starting point and would need further testing, especially to ensure compatibility with the existing NOMINSIZE option, which currently controls calls to nanobind_opt_size.

I hope this analysis is helpful! Please let me know if additional details or testing are needed!

wjakob · 2024-11-19T14:03:28Z

wjakob
Nov 19, 2024
Maintainer

That's horrible. pybind11 and nanobind work very similarly at a high level, except that nanobind moves tons of template code that would otherwise be compiled many times into a separate library. If anything, compilation should become much simpler, which is what you are seeing on GCC/macOS.

I am not really sure how to approach this problem because your bindings are so large. I tried two things: I changed func_create in include/nanobind/nb_func.h and deleted the entire body of this function so that it only returns nullptr. The compilation still takes forever. This is actually the main thing that gets compiled over and over again and generates most of the actual binding executable code. So the problem must be somewhere else.

Then I tried deleting the bottom half of the code in nanobind_imgui.cpp and the compilation became fast. I suspect at this point one of two things:

The crazy size of this function triggers some kind of O(n^2) growth in MSVC. Splitting the bindings into multiple functions or multiple compilation units might resolve the issue. This kind of stuff can be tricky to track down, and the reason why it explodes in nanobind and not in pybind11 might not be very obvious without source-level access to the compiler.
There is actually something bad happening in the combination of nanobind, this compiler, and this set of bindings. If that is so, could I ask you to spend some time bisecting it by removing different parts of the binding code to make a smaller reproducer?

0 replies

pthom · 2024-11-19T16:28:33Z

pthom
Nov 19, 2024
Author

Hello,

Many thanks for your answer!

I had initially tried to split the file, and did not see any improvements. But this was because I was not patient enough, and I had given up hope after 30 minutes. Sorry for this.

I gave it another try and observed that:

Splitting the file in two does reduce the compile time from 2h30 to 50 minutes: see workflow "pip_win_nano_parts" in these actions
Splitting the file in 4 reduces the compile time to 4'54" : see workflow "pip_win_nano_parts"

So the lessons are:

there is definitely an O(n^2) effect somewhere in MSVC's optimizer which depends on the size of the functions
luckily, this effect does not occur during link-time optimization
there is likely nothing bad happening in nanobind, since it works fine once the source is sufficiently split
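As a back-of-the-envelope check of the first lesson, one can assume that the optimizer's cost for a function of size n grows like c·n^p, and that splitting produces k equal parts. The total cost then becomes k·c·(n/k)^p, a speedup of k^(p-1). The sketch below solves for p from the timings reported above. It ignores fixed overhead (CI setup, compiling imgui itself, linking) and assumes the split is even, so it is an illustration rather than a measurement; the function name `implied_exponent` is hypothetical.

```python
import math

def implied_exponent(time_before_min, time_after_min, parts):
    """Exponent p such that cost ~ n**p explains the observed speedup
    when one function is split into `parts` equal pieces:
    speedup = parts ** (p - 1)."""
    speedup = time_before_min / time_after_min
    return 1 + math.log(speedup) / math.log(parts)

# Timings from this thread, in minutes: 2h30 = 150, 4'54" = 4.9
p_two_parts = implied_exponent(150, 50, 2)    # one file -> two parts
p_four_parts = implied_exponent(150, 4.9, 4)  # one file -> four parts

print(f"implied exponent, 2 parts: {p_two_parts:.2f}")
print(f"implied exponent, 4 parts: {p_four_parts:.2f}")
```

Under these assumptions both exponents come out above 2, which is consistent with growth that is at least quadratic in function size.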

And on my side, I need to study whether it is feasible to split generated binding files (I have several generated binding files that exceed 4000 lines).

PS: Would you agree that when building the bindings with optimization disabled (but building the rest of the libraries with full optimizations), the performance should likely remain good, since binding functions merely provide a link?

1 reply

wjakob Nov 22, 2024
Maintainer

To follow up on your question here: I think that optimizations of some sort are definitely needed, otherwise perf. of the bindings will suffer.

pthom · 2024-11-19T17:53:03Z

pthom
Nov 19, 2024
Author

Additional detail: it is sufficient to split the long function in several functions while keeping them in the same file (see this action which built in 5 minutes).

Many thanks for your help, and for your continued work on nanobind and pybind!

On my side, I'll study how to get the generator to split its bindings in several functions. It should be enough.
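A minimal sketch of what such a generator-side split could look like, written in Python since the generator is a Python tool. This is not litgen's actual code: `split_binding_function`, its parameters, and the emitted function names are all hypothetical, and it assumes that each binding statement is self-contained (no local variables shared between statements) and that the generated file declares `namespace nb = nanobind;`.

```python
def split_binding_function(statements, max_statements, base_name="register_bindings"):
    """Emit C++ source in which `statements` are spread over several small
    functions plus one dispatcher, instead of a single huge function body."""
    # Group the statements into consecutive chunks of at most max_statements
    chunks = [statements[i:i + max_statements]
              for i in range(0, len(statements), max_statements)]
    lines = []
    for idx, chunk in enumerate(chunks):
        lines.append(f"static void {base_name}_part{idx}(nb::module_& m)")
        lines.append("{")
        lines.extend(f"    {stmt}" for stmt in chunk)
        lines.append("}")
        lines.append("")
    # The dispatcher keeps the single entry point that callers already use
    lines.append(f"void {base_name}(nb::module_& m)")
    lines.append("{")
    lines.extend(f"    {base_name}_part{idx}(m);" for idx in range(len(chunks)))
    lines.append("}")
    return "\n".join(lines)


# Usage with three placeholder statements, at most two per function
example = split_binding_function(
    ['m.def("a", &a);', 'm.def("b", &b);', 'm.def("c", &c);'],
    max_statements=2,
)
print(example)
```

Keeping all parts in the same file matches the observation above that splitting into several functions is enough, without needing multiple compilation units.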

Thanks again

1 reply

wjakob Nov 20, 2024
Maintainer

Hi @pthom,

could I ask you to experiment with one more thing? The compile time is exploding in MSVC because of __forceinline. Specifically, if you change #define NB_INLINE __forceinline to #define NB_INLINE inline in include/nanobind/nb_defs.h, the compilation becomes fast.

Could you compile with and without this flag and report the size of the resulting shared libraries?

Thanks,
Wenzel

pthom · 2024-11-20T10:18:57Z

pthom
Nov 20, 2024
Author

Hi @wjakob,

Good news,

See this action: the reported size is 2.4 MB, very close to the 2.2 MB we get after waiting for 3 hours, and the build time is 3min4s.

(Done with this commit)

Thanks!

Pascal

0 replies

pthom · 2024-11-20T18:45:56Z

pthom
Nov 20, 2024
Author

More results on a larger library (imgui bundle), when disabling __forceinline:

Library	OS	Build Time (min)	Combined wheels Size (MB) (python 3.11->3.13)
Pybind11	macOS (latest)	50.7	63.3
Pybind11	Ubuntu (latest)	90.0	179.0
Pybind11	Windows (latest)	29.4	118.0
Nanobind	macOS (latest)	13.1	58.3
Nanobind	Ubuntu (latest)	61.0	165.0
Nanobind	Windows (latest)	19.1	119.0 (was 124 with /Od)

0 replies

pthom · 2024-11-22T06:28:04Z

pthom
Nov 22, 2024
Author

FYI, my compilation time for bindings using pyodide (under linux) went down from 15 minutes (using pybind11) to 2 minutes (using nanobind). The link step, which was very slow when using pybind11 (12 minutes, using 6 GB of memory), is now much more reasonable.

0 replies

wjakob · 2024-11-22T14:31:09Z

wjakob
Nov 22, 2024
Maintainer

This all sounds great, I am glad that the build performance is now better. It is surprising for me to see that pybind11 wheels are smaller than the nanobind ones, however. Is it possible that the number of wheels in that count is different? (I guess the average wheel size might be a more useful "bloat" measurement)

0 replies

pthom · 2024-11-22T16:06:40Z

pthom
Nov 22, 2024
Author

Is it possible that the number of wheels in that count is different?

It was not different, but it went from containing Python 3.10, 3.11, 3.12 to containing 3.11, 3.12, 3.13. However, a wheel is a zipped file that includes a compiled library together with other artifacts, so this measure is not very precise.

I took some time to extract the compiled library from the imgui bundle wheels, comparing pybind to nanobind. Those wheels were produced one week apart, and the changes in the code were small (and all related to switching from pybind to nanobind).

Here are the raw results (size in bytes)

18274912 dist-macos-latest-nano/_imgui_bundle.cpython-312-darwin.so
24359184 dist-macos-latest-pybind/_imgui_bundle.cpython-312-darwin.so
37598129 dist-ubuntu-latest-nano/_imgui_bundle.cpython-312-x86_64-linux-musl.so
47025297 dist-ubuntu-latest-pybind/_imgui_bundle.cpython-312-x86_64-linux-musl.so
13180416 dist-windows-latest-pybind/_imgui_bundle.cp312-win_amd64.pyd
20549632 dist-windows-latest_nano/_imgui_bundle.cp312-win_amd64.pyd

And the results as tables

macOS

Library	Size in MB
Nano	18.27
Pybind	24.36

Ubuntu

Library	Size in MB
Nano	37.60
Pybind	47.03

Windows

Library	Size in MB
Nano	20.55
Pybind	13.18

=> Honestly, I do not know what explains the difference observed under windows. It might come from an option I inadvertently changed during the migration, or from nanobind, I don't know for sure.

I did check the changes on my side (I checked the diffs before and after this commit), and saw no important changes in the compilation options (apart from switching to nanobind).
The generated bindings code behavior changed a bit however, since it now handles mutable default function parameters in a C++ fashion (it makes sure to reevaluate them at each call even from python, so that it is less "surprising" for the C++ API). However, this did not have an impact on other platforms.
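The tables above can be reproduced from the raw byte counts, and a nano/pybind ratio makes the Windows outlier easier to see. The sketch below assumes 1 MB = 10^6 bytes, which matches the rounded figures in the tables; the function name `nano_vs_pybind` is hypothetical.

```python
# Raw sizes in bytes, copied from the listing above
sizes = {
    "macOS":   {"nano": 18274912, "pybind": 24359184},
    "Ubuntu":  {"nano": 37598129, "pybind": 47025297},
    "Windows": {"nano": 20549632, "pybind": 13180416},
}

def nano_vs_pybind(sizes_by_os):
    """Return {os: (nano_MB, pybind_MB, nano/pybind ratio)},
    using 1 MB = 10**6 bytes and rounding to two decimals."""
    result = {}
    for os_name, s in sizes_by_os.items():
        result[os_name] = (
            round(s["nano"] / 1e6, 2),
            round(s["pybind"] / 1e6, 2),
            round(s["nano"] / s["pybind"], 2),
        )
    return result

for os_name, (nano_mb, pybind_mb, ratio) in nano_vs_pybind(sizes).items():
    print(f"{os_name}: nano {nano_mb} MB, pybind {pybind_mb} MB, ratio {ratio}")
```

On macOS and Ubuntu the nanobind library is smaller than the pybind11 one (ratio below 1), while on Windows it is about one and a half times larger.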

0 replies
