Binary build script: Switch from timestamp to hash based .pyc files #1322

Merged: 1 commit merged on May 9, 2022

@edmorley edmorley commented May 9, 2022

When we build Python binaries, the make install step automatically generates .pyc files for the Python stdlib. However:

  • It generates these using the default timestamp invalidation mode, which does not work well with the CNB file timestamp normalisation behaviour.
  • It generates .pycs for all three optimisation levels (standard, -O and -OO), when the vast majority of apps only use the standard mode.
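
The timestamp problem in the first bullet can be illustrated with a minimal sketch. It is not the buildpack's code: `pyc_is_fresh` and `demo` are hypothetical helpers that mimic the mtime/size comparison the import system performs for timestamp-based `.pyc` files, and the fixed mtime is an arbitrary value chosen for illustration.

```python
import os
import py_compile
import struct
import tempfile
from pathlib import Path


def pyc_is_fresh(source, pyc):
    """Mimic the timestamp check: compare the .pyc header with the source's stat data."""
    header = Path(pyc).read_bytes()[:16]
    # Header layout: magic (4 bytes), flags, source mtime, source size (uint32 LE each).
    flags, mtime, size = struct.unpack("<III", header[4:16])
    if flags != 0:
        raise ValueError("not a timestamp-based .pyc")
    stat = os.stat(source)
    return (mtime == (int(stat.st_mtime) & 0xFFFFFFFF)
            and size == (stat.st_size & 0xFFFFFFFF))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text("x = 1\n")
        pyc = py_compile.compile(
            str(src),
            cfile=str(Path(tmp) / "mod.pyc"),
            doraise=True,
            invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
        )
        before = pyc_is_fresh(src, pyc)
        # Simulate timestamp normalisation by resetting the source mtime
        # to a fixed value (arbitrary, for illustration only).
        os.utime(src, (315532801, 315532801))
        after = pyc_is_fresh(src, pyc)
        return before, after


if __name__ == "__main__":
    print(demo())
```

Once the source mtime no longer matches the value recorded in the header, the cached bytecode is treated as stale even though the source contents are unchanged.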

As such, this changes our builds to:

  • Use one of the hash-based pyc invalidation modes to prevent the .pycs from always being treated as outdated and so being regenerated at application boot.
  • Ship only the standard optimisation level pycs (and not the .opt-{1,2}.pyc files), reducing build output by 18MB.

We use the unchecked-hash mode rather than checked-hash since it improves app startup times by ~5%. Skipping the source check is only a problem if manual edits are made to the stdlib, which is not something we support.
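
As a rough sketch of the approach described above (not the actual build script, which is not shown here), the stdlib `compileall` module can regenerate only the standard optimisation level with unchecked-hash invalidation. The helper names `compile_standard_level_only`, `pyc_flags` and `demo` are hypothetical. Per PEP 552, the flags field in the `.pyc` header is `0` for timestamp mode, `3` for checked-hash and `1` for unchecked-hash.

```python
import compileall
import importlib.util
import py_compile
import struct
import tempfile
from pathlib import Path


def compile_standard_level_only(root):
    """Remove -O/-OO bytecode under root, then recompile with unchecked-hash pycs."""
    root = Path(root)
    for pattern in ("*.opt-1.pyc", "*.opt-2.pyc"):
        # Materialise the matches first so we don't delete while iterating.
        for stale in list(root.rglob(pattern)):
            stale.unlink()
    ok = compileall.compile_dir(
        str(root),
        force=True,   # recompile even if an existing .pyc looks up to date
        quiet=1,
        optimize=0,   # standard level only, so no .opt-N.pyc files are written
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )
    return bool(ok)


def pyc_flags(pyc_path):
    """Return the PEP 552 flags field stored in bytes 4-7 of a .pyc header."""
    (flags,) = struct.unpack("<I", Path(pyc_path).read_bytes()[4:8])
    return flags


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text("x = 1\n")
        # Fake a leftover -O bytecode file to show that it gets removed.
        opt1 = Path(importlib.util.cache_from_source(str(src), optimization=1))
        opt1.parent.mkdir()
        opt1.write_bytes(b"")
        ok = compile_standard_level_only(tmp)
        pyc = importlib.util.cache_from_source(str(src), optimization="")
        return ok, opt1.exists(), pyc_flags(pyc)


if __name__ == "__main__":
    print(demo())
```

With unchecked-hash files the interpreter loads the cached bytecode without consulting the source file at all, which is why the mode is only safe when the sources are never edited after the build.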

See:
https://docs.python.org/3/reference/import.html#cached-bytecode-invalidation
https://docs.python.org/3/library/compileall.html
https://peps.python.org/pep-0488/
https://peps.python.org/pep-0552/
https://github.com/python/cpython/blob/v3.10.4/Makefile.pre.in#L1603-L1629

GUS-W-10988998.
GUS-W-10989125.

@edmorley edmorley self-assigned this May 9, 2022
@edmorley edmorley requested a review from a team as a code owner May 9, 2022 08:59
@edmorley edmorley force-pushed the builds-adjust-pycs branch from fcd6d2c to 28df132 Compare May 9, 2022 09:36
@edmorley edmorley force-pushed the builds-adjust-pycs branch from 28df132 to 09f8faf Compare May 9, 2022 09:42
edmorley commented May 9, 2022

The combined build output size reductions from this PR plus #1319, #1320 and #1321 are:

  • Python 3.10:
    • Uncompressed: 207 MB -> 38 MB
    • Compressed: 51 MB -> 13 MB
  • Python 3.9:
    • Uncompressed: 196 MB -> 52 MB
    • Compressed: 48 MB -> 17 MB
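
For reference, the figures above can be turned into percentage reductions with a few lines of arithmetic. The sizes are copied from the list; the helper name `reduction_percent` and the dictionary layout are only illustrative.

```python
def reduction_percent(before_mb, after_mb):
    """Percentage size reduction, rounded to one decimal place."""
    return round(100 * (before_mb - after_mb) / before_mb, 1)


# (before, after) sizes in MB, taken from the list above.
SIZES_MB = {
    ("3.10", "uncompressed"): (207, 38),
    ("3.10", "compressed"): (51, 13),
    ("3.9", "uncompressed"): (196, 52),
    ("3.9", "compressed"): (48, 17),
}

REDUCTIONS = {key: reduction_percent(*sizes) for key, sizes in SIZES_MB.items()}

if __name__ == "__main__":
    for (version, kind), percent in REDUCTIONS.items():
        print(f"Python {version} {kind}: {percent}% smaller")
```

This works out to roughly 82% and 75% for Python 3.10 (uncompressed and compressed), and roughly 73% and 65% for Python 3.9.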

These size reductions reduce:

  • the download/extraction time (and bytes over the wire) of the Python archive during the buildpack build (this particularly helps when using CNBs locally, where users often won't have as fast connections as an EC2 instance)
  • the archiving/compression/upload time of the build directory by the build system at the end of the build (whether that be for the slug for classic builds, or the layer exporting/pushing for CNBs)
  • the app boot time at runtime (since there is a smaller slug / layer to download on the runtime instance)

...plus they also reduce the chance of an app running into slug size limits when using heavier dependencies.

In the future I plan to explore switching the archives to using zstd instead of gzip, for further size/performance wins.

@edmorley edmorley merged commit 38afb77 into main May 9, 2022
@edmorley edmorley deleted the builds-adjust-pycs branch May 9, 2022 11:17
edmorley added a commit that referenced this pull request Apr 18, 2024
As part of the CNB multi-architecture support work, we need to change
the Python runtime archive S3 URLs to include the architecture name.
In addition, for the CNB transition from "stacks" to "targets", it would
be helpful to switch from stack ID references (such as `heroku-22`) in
the URL scheme, to the distro name+version (eg `ubuntu` and `22.04`)
available to CNBs via the CNB targets feature. See:
https://github.com/buildpacks/spec/blob/buildpack/0.10/buildpack.md#targets-1

Rather than duplicate the Python archives on S3 under different
filenames/locations, it makes sense to migrate this buildpack to the new
archive names too, so the same S3 archives can be used by both this
buildpack and the CNB.

Moving to new archive names/URLs also means we can safely regenerate all
existing Python versions to pick up the changes in #1566 (and changes
made in the past, such as #1319, #1320, #1321 and #1322), since we won't
have to worry about overwriting the old archives (which is something
we've typically avoided, since it isn't compatible with the model of
being able to roll back to an older buildpack version to return to prior
behaviour).

Since we're changing the S3 URLs anyway, now is also a good time to make
another change that would otherwise cause churn in the S3 URLs again
(which affects people who pin the buildpack version): switching archive
compression format from gzip to Zstandard (something that we've been
wanting to do for a while).

Zstandard (aka zstd) is a far superior compression format to gzip
(smaller archives and much faster decompression), and is seeing
widespread adoption across multiple ecosystems (eg APT packages,
Docker images, web browsers etc).

See:
https://github.com/facebook/zstd
https://github.com/facebook/zstd/blob/dev/programs/README.md#usage-of-command-line-interface

Our base images already have `zstd` installed (and for Rust for the CNB,
there is the [zstd](https://crates.io/crates/zstd) crate available), so it's an easy switch.

Various compression levels were tested using zstd's benchmarking feature
and in the end the highest level of compression was picked, since:
1. Unlike some other compression algorithms, zstd's decompression speed
   is generally not affected by the compression level.
2. We only have to perform the compression once (when compiling Python).
3. Even at the highest compression ratio, it only takes 20 seconds to
   compress the Python archives compared to the 10 minutes it takes to
   compile Python itself (when using PGO+LTO).

For the Ubuntu 22.04 Python 3.12.3 archive, switching from gzip to zstd
(level 22, with long window mode enabled) results in a 26% reduction in
compressed archive size.
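
The exact command used is not shown in this message. As a hedged sketch, "level 22, with long window mode enabled" could be spelled with the zstd CLI roughly as below. The `zstd_command` and `compress` helpers are hypothetical, and the choice of `-T0` (use all cores) is an assumption rather than something stated above.

```python
import shutil
import subprocess


def zstd_command(source, dest, level=22, long_window=True):
    """Build a zstd CLI invocation for compressing source into dest."""
    cmd = ["zstd", "-T0"]  # -T0: use all available cores (assumption)
    if level > 19:
        # Levels 20-22 are only accepted when --ultra is given.
        cmd.append("--ultra")
    cmd.append(f"-{level}")
    if long_window:
        # Enable long distance matching with zstd's default window size.
        cmd.append("--long")
    cmd += ["-o", str(dest), str(source)]
    return cmd


def compress(source, dest):
    """Run zstd if it is installed; raise otherwise."""
    if shutil.which("zstd") is None:
        raise RuntimeError("zstd CLI not found on PATH")
    subprocess.run(zstd_command(source, dest), check=True)


if __name__ == "__main__":
    print(" ".join(zstd_command("python.tar", "python.tar.zst")))
```

Building the argument list separately from running it keeps the sketch usable on machines where `zstd` is not installed.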

GUS-W-15158299.
GUS-W-15505556.