[Python][Packaging] Strip unnecessary symbols from libarrow.so to reduce wheel package size #40749

Closed
raulcd opened this issue Mar 22, 2024 · 7 comments

Comments

@raulcd
Member

raulcd commented Mar 22, 2024

Describe the enhancement requested

There has been some effort to reduce the size of pyarrow, and there are open issues about splitting the pyarrow wheels into pyarrow-core, pyarrow-all, etc.

It seems possible to strip unnecessary symbols via strip --discard-all; the libarrow.so file shrinks from 61 MB to 45 MB.

Another example:

-rwxrwxr-x 1 user group  49M Feb 22 18:55 libarrow.so.1600.0.0

$ strip --strip-unneeded libarrow.so.1600.0.0

-rwxrwxr-x 1 user group  33M Mar 12 15:27 libarrow.so.1600.0.0

This issue is to investigate the possibility of using strip and the possible disadvantages.
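One way to script the size comparison is sketched below. It assumes GNU binutils `strip` is on the PATH and that a local copy of the library is available; the function names are hypothetical and not part of the Arrow build. Each flag is applied to a temporary copy so the original file is left untouched.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Flags mentioned in this issue; all three are GNU strip options.
STRIP_FLAGS = ["--strip-debug", "--discard-all", "--strip-unneeded"]


def saved_fraction(before: int, after: int) -> float:
    """Return the fraction of the original size that stripping removed."""
    if before <= 0:
        raise ValueError("original size must be positive")
    return (before - after) / before


def stripped_size(library: Path, flag: str) -> int:
    """Strip a temporary copy of `library` with `flag` and return its size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / library.name
        shutil.copy2(library, copy)  # never modify the original file
        subprocess.run(["strip", flag, str(copy)], check=True)
        return copy.stat().st_size


def compare(library: Path) -> dict[str, float]:
    """Map each flag to the fraction of bytes it saves on `library`."""
    original = library.stat().st_size
    return {
        flag: saved_fraction(original, stripped_size(library, flag))
        for flag in STRIP_FLAGS
    }
```

Calling `compare(Path("libarrow.so.1600.0.0"))` would then report the relative saving per flag. Size alone does not show what is lost, so the effect on backtraces still has to be checked separately.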

edited to fix typo

Component(s)

Packaging, Python

@kou
Member

kou commented Mar 25, 2024

Can we check whether a backtrace on crash is still available with strip --discard-all/--strip-unneeded? (We may need to use gdb for it.)
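A possible way to run that check non-interactively is sketched below. It assumes gdb is installed and that a crashing reproducer script exists; `repro.py` and the helper names are hypothetical. The gdb options used (`-batch`, `-ex`, `--args`) are standard.

```python
import subprocess


def gdb_backtrace_command(script: str, python: str = "python") -> list[str]:
    """Build a gdb batch invocation that runs `script` and prints a backtrace."""
    return [
        "gdb", "-batch",   # exit once the commands below have run
        "-ex", "run",      # start the program and wait for the crash
        "-ex", "bt",       # print the backtrace of the crashed thread
        "--args", python, script,
    ]


def capture_backtrace(script: str) -> str:
    """Run the reproducer under gdb and return gdb's combined output."""
    result = subprocess.run(
        gdb_backtrace_command(script),
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr
```

Running `capture_backtrace("repro.py")` once per stripped variant of the library gives text outputs that can be compared frame by frame.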

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 9, 2024

Potentially related issue with some useful content (but specifically for the wheels, so maybe mostly relevant for our cython code?): pypa/cibuildwheel#331

That issue mentions that numpy used --strip-all through multibuild, although nowadays the wheel building infrastructure is different (using cibuildwheel), and I don't directly see any mention of that in the numpy sources.
Pandas is using -Wl,--strip-all: https://github.com/pandas-dev/pandas/blob/8da8b54412020a9478d7b6b0fccde9f9bbf3d8ba/pyproject.toml#L151

@pitrou
Member

pitrou commented Apr 9, 2024

I've tried on the reproducer in #38770 and only --strip-debug preserves the full backtrace including non-public functions:

[...]
#7  0x00007ffff310b277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff310b4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff373737b in std::__throw_bad_variant_access(char const*) () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
#10 0x00007ffff3737399 in std::__throw_bad_variant_access(bool) () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
#11 0x00007ffff37960e7 in arrow::compute::internal::(anonymous namespace)::FilterMetaFunction::ExecuteImpl(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const [clone .cold] ()
   from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
#12 0x00007ffff42cc3e0 in arrow::compute::MetaFunction::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
[...]

More aggressive strip options make the backtrace less useful because the symbols of non-public functions are removed:

[...]
#7  0x00007ffff310b277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff310b4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff373737b in ?? () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
#10 0x00007ffff3737399 in ?? () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
#11 0x00007ffff37960e7 in ?? () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
#12 0x00007ffff42cc3e0 in arrow::compute::MetaFunction::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const () from /home/antoine/t/venv-3.10/lib/python3.10/site-packages/pyarrow/libarrow.so.1500
[...]

This is on the PyArrow 15.0.2 wheels. The savings are still significant: from 61MB (original) to 54MB (stripped) for libarrow.so.
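To compare backtraces like the two above without reading them by eye, a small parser can count the frames that gdb could not resolve. This is only a sketch: it assumes gdb's usual `#N  0xADDR in NAME (...)` frame layout, and the function name is hypothetical.

```python
import re

# Matches gdb frame lines such as "#9  0x00007ffff373737b in ?? () from ...".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S.*)$")


def unresolved_frames(backtrace: str) -> list[int]:
    """Return the frame numbers whose function name gdb printed as `??`."""
    missing = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(2).startswith("?? "):
            missing.append(int(match.group(1)))
    return missing
```

Applied to the second backtrace, this would flag frames 9, 10 and 11, which are exactly the non-public functions that remain named with --strip-debug.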

@paleolimbot
Member

For pyarrow, there are also nightly wheel builds, correct? Might it be an option to keep the debug symbols in those (so that there is a route to getting a user-reported stack trace) but strip them from the wheels most users get with a default pip install?

@pitrou
Member

pitrou commented Apr 9, 2024

A user-reported stack trace would be possible with --strip-debug.

@raulcd
Member Author

raulcd commented Jun 7, 2024

For macOS on pandas it seems that -g0 was used:

And the fix

raulcd added a commit to raulcd/arrow that referenced this issue Jun 10, 2024
raulcd added a commit that referenced this issue Jun 11, 2024
… wheels (#42028)

### Rationale for this change

Removing unnecessary symbols for wheels will allow us to reduce the size of the wheels considerably.

### What changes are included in this PR?

Running `strip --strip-debug` on Linux wheels for all *.so files.

### Are these changes tested?

Yes, via Archery.

### Are there any user-facing changes?

No
* GitHub Issue: #40749

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
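The step described in the commit message could look roughly like the sketch below. It is not the code from the pull request: the helper names and the idea of pointing it at an unpacked wheel directory are assumptions. The `*.so` pattern follows the wording above; versioned names such as libarrow.so.1500 would need a broader pattern.

```python
import subprocess
from pathlib import Path


def find_shared_objects(root: Path) -> list[Path]:
    """Return every regular, non-symlink file under `root` ending in `.so`, sorted."""
    return sorted(
        p for p in root.rglob("*.so") if p.is_file() and not p.is_symlink()
    )


def strip_debug(root: Path) -> list[Path]:
    """Run `strip --strip-debug` on each shared object and return the files touched."""
    targets = find_shared_objects(root)
    for target in targets:
        # check=True makes a failing strip abort the build instead of passing silently.
        subprocess.run(["strip", "--strip-debug", str(target)], check=True)
    return targets
```

Symlinks are skipped so that a file reachable under several names is only stripped once.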
@raulcd
Member Author

raulcd commented Jun 11, 2024

Issue resolved by pull request 42028
#42028

@raulcd raulcd closed this as completed Jun 11, 2024

5 participants