Store schemas compressed on disk. #2365
Hi @adamchainz, Thanks for the feature request! I'll review this with the team, although I can't make any guarantees as to when/if this will be implemented.
This is an interesting idea. This has been noted previously in a similar scenario with the AWS CLI as well. The AWS SDKs consume the API models from upstream, so changing the way that they are stored and accessed would be a significant feature. One drawback would be the lack of direct human readability of the API models that are currently available in the Python SDK: it would be difficult to see where API changes were introduced between versions of the SDK. For example, removing the documentation strings from the models would cut 20MB off of the size, which might be useful in a CI/CD environment. Do you have specific scenarios of your own where a slimmed-down version would help?
One can use the
This affects me in a couple of ways:
I'm in favor of this feature as well. They could stay uncompressed in the source code here, but be bundled into a zip for the released wheel, and they'd stay programmatically available. The benefits to install time, artifact size, and Lambda in-console editing would be well worth the effort imo.
Hey all, just wanted to chime in real quick to mention that I took some time today to play around with the ideas here. I think @benkehoe's suggestion makes a lot of sense, and I took a crack at implementing support for building wheels that include compressed models instead of the plaintext versions. However, rather than modifying the loader to include an additional possible location that checks within a zip, I decided to update the loader to also look for gzip-compressed models.

In addition to support for loading gzip-compressed models, I've added a script to the `scripts/` directory to compress the model data in a built wheel. Using my branch you should be able to generate and then modify a wheel that includes the compressed models instead:

```
$ python setup.py bdist_wheel
$ ./scripts/compress-wheel-data dist/botocore-*-none-any.whl
```

It'd be great if some of you could test the compressed wheels out, as I do have some concerns around compatibility/performance if we were ever to begin publishing wheels like this instead of the uncompressed version. As for my testing (on an M1 MacBook Pro) I saw the following: install times were marginally in favor of the wheel with compressed models, but it wasn't significant and might have just been margin of error. Comparing the unzipped wheel I saw about a 5x reduction in disk space, going from 66M to 13M.
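The loader change described above can be sketched roughly as follows. This is a simplified illustration of the `.json.gz` fallback idea only, not botocore's actual `Loader` implementation; the function name and file layout are assumptions:

```python
import gzip
import json
import os

def load_model(base_path):
    # Prefer the plaintext model if present; otherwise fall back to a
    # gzip-compressed copy alongside it. This mirrors the idea of the
    # loader checking for ".json.gz" in addition to ".json".
    plain = base_path + ".json"
    compressed = base_path + ".json.gz"
    if os.path.exists(plain):
        with open(plain, "rb") as f:
            return json.load(f)
    if os.path.exists(compressed):
        with gzip.open(compressed, "rt", encoding="utf-8") as f:
            return json.load(f)
    raise FileNotFoundError(f"No model found for {base_path}")
```

Keeping plaintext as the preferred location means a source checkout keeps working unchanged, while a compressed wheel only ships the `.json.gz` files.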
I also tried creating a new
Unfortunately, loading the compressed models is about 10% slower. I'm sure there are different compression algorithms that might produce better results here, but I'm concerned about compatibility if we were to use a less ubiquitous algorithm than gzip.
Do you know what gzip level you used? Python's `gzip` module lets you pick a compression level. Even level 1 would probably provide significant gains given the repetition in JSON.
Thanks for taking a look at this! Can we get a comparison on wheel size and performance between compressing the files individually versus all together? I get the benefit of allowing non-default locations to have them individually, but if there's a big difference for the primary package it could make sense to special-case that as a single zip.
@benkehoe The wheel size wasn't significantly impacted by a single zip vs individual models. As for the decompressed package, I got a slight improvement in favor of a single zip. Getting data on how that affects botocore client load times isn't something I've tested since I haven't implemented it. I do have concerns around the monolithic nature of a single zip and the performance characteristics of random access in the zip.

@adamchainz My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this: higher compression level => slower compression times, smaller files, marginally faster decompress times.
Ah, you are right. My bad.
I ran a sanity check comparing all 3 by doing a minimal open directly on a model in the
The nested zip is the slowest and impacts smaller models pretty significantly. This is only considering loading the
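A self-contained version of that sanity check might look like the following. The file names and document shape are made up, timings will vary by machine, and re-opening the zip per load is a deliberate choice to model cold random access:

```python
import gzip
import json
import os
import tempfile
import timeit
import zipfile

# A synthetic model document to load three different ways.
payload = json.dumps(
    {"shapes": {f"Shape{i}": {"type": "structure"} for i in range(2000)}}
).encode("utf-8")

tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "service-2.json")
gz_path = os.path.join(tmp, "service-2.json.gz")
zip_path = os.path.join(tmp, "data.zip")

with open(plain_path, "wb") as f:
    f.write(payload)
with gzip.open(gz_path, "wb") as f:
    f.write(payload)
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("service-2.json", payload)

def load_plain():
    with open(plain_path, "rb") as f:
        return json.load(f)

def load_gz():
    with gzip.open(gz_path, "rb") as f:
        return json.load(f)

def load_zip():
    # Re-opening the archive each time models cold random access into
    # a monolithic zip, which is the concern raised above.
    with zipfile.ZipFile(zip_path) as z:
        with z.open("service-2.json") as f:
            return json.load(f)

for fn in (load_plain, load_gz, load_zip):
    print(fn.__name__, timeit.timeit(fn, number=100))
```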
Awesome, this all makes sense. The small difference in size (that surprised me a bit), combined with individual files being better on both performance and code simplicity, makes it no contest. Thanks for humoring me and validating it though!
Feels like this is trying to fix similar symptoms as #1543, but in a different way. Though I don't think the two ideas are mutually exclusive; just linking them here.
@gricey432 You're absolutely correct that the two approaches aren't mutually exclusive. When I was doing the initial proof of concept script on my branch I was tempted to add a services filter that could allow the built wheel to only include a subset of services, but didn't quite have time.
Could save roughly 50 megs in Lambda installs by doing this. As it stands, installing botocore/boto3 + telemetry tools + something like pandas usually breaks the bank when deploying to Lambda (even after removing .pyc files and stripping shared objects).
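One way to approximate the services-subset idea today is to prune unused service data from an installed botocore before packaging a Lambda bundle. The keep-list and directory layout below are assumptions to adapt, not a supported botocore workflow:

```python
import shutil
from pathlib import Path

# Names to keep inside botocore's data directory; everything else is
# removed. The exact shared files a given install needs is an
# assumption here -- adjust the list for your own deployment.
KEEP = {"s3", "dynamodb", "sts", "endpoints.json", "partitions.json"}

def trim_botocore_data(data_dir):
    """Delete service model directories/files not in KEEP."""
    for entry in Path(data_dir).iterdir():
        if entry.name in KEEP:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
```

Run against something like `site-packages/botocore/data` in the build staging directory, and test the bundle afterwards, since removing a model any client needs will break it at runtime.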
In the context of boto/botocore#2365: .json.gz loading has been merged (boto/botocore#2628), so we can take advantage of it by gzipping all the JSON in the data folder. All the relevant tests pass, so it seems that we are good to go.
Hey everyone, wanted to provide a quick status update. Starting in 1.32.0, we began compressing select service models (Amazon EC2, Amazon SageMaker, and Amazon QuickSight) in our .whl files distributed on PyPI. With this change, we were able to reduce the size of botocore by 9.4 MB (11%) to a total of 76.1 MB on disk. This was the final step in a series of changes we've made over the last year to validate and enable today's release. With 1.32.1, we've rolled this change out to all service models in our .whl files. This allows us to shrink botocore from 85.5 MB in our last 1.31.x release to 19.9 MB, for a total savings of 77%. We hope this will be an impactful first step towards making botocore less difficult to use in space-constrained environments. Going forward, we have additional areas we're looking to improve and will provide updates as we have them. We'd welcome any feedback you might have in the meantime.
Nice work! This is an update I've been waiting for a long time.

```
❯ for VERSION in 1.31.83 1.31.84 1.31.85 1.32.0 1.32.1; do echo -n "$VERSION --> " && docker run --rm python:3.11-slim bash -c "pip install --disable-pip-version-check --quiet --root-user-action=ignore botocore==$VERSION && du -h -s /usr/local/lib/python3.11/site-packages/botocore"; done
1.31.83 --> 86M /usr/local/lib/python3.11/site-packages/botocore
1.31.84 --> 86M /usr/local/lib/python3.11/site-packages/botocore
1.31.85 --> 86M /usr/local/lib/python3.11/site-packages/botocore
1.32.0 --> 77M /usr/local/lib/python3.11/site-packages/botocore
1.32.1 --> 24M /usr/local/lib/python3.11/site-packages/botocore
```
This is great news! Will this change end up in the CLI as well?
Just a note for people who are excited about the possibilities of smaller Lambda deploy packages: this probably won't help you get under 50 MB, because what you upload to Lambda is typically compressed already. That is, the deploy zip was already shrinking the plaintext models, so models that are gzipped on disk don't make the uploaded archive much smaller.
I'd also like to drop a plug here for boto/boto3#2702: you tell us botocore version 1.32.1 has this change, and then it's work for us to figure out which boto3 version that corresponds to (it's 1.29.1), when they should just be the same.
This drops installed size of botocore from 60 to 24 MiB and should fix deployment. See boto/botocore#2365 (comment).
To further reduce the size, could you remove some parts of the JSON files? For example, I noticed that the If we try that with
More than 70% reduction in size compressed (and 50+% uncompressed)! We can even minify the JSON to further reduce the size, e.g. with
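The minification idea can be sketched like this. Stripping `documentation` keys follows the earlier suggestion in the thread; walking the whole document recursively is an assumption about where those keys appear:

```python
import json

def strip_docs(model):
    # Recursively drop "documentation" keys, which hold the HTML doc
    # strings in botocore's model files (assumption: they can appear at
    # any nesting level).
    if isinstance(model, dict):
        return {k: strip_docs(v) for k, v in model.items() if k != "documentation"}
    if isinstance(model, list):
        return [strip_docs(v) for v in model]
    return model

def dump_minified(model):
    # separators=(",", ":") removes the whitespace json.dumps adds by
    # default, minifying the output.
    return json.dumps(strip_docs(model), separators=(",", ":"))
```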
Is your feature request related to a problem? Please describe.
The `data` directory of a `botocore` install is over 50MB. The JSON inside compresses really well, as we can see from the PyPI packages being just 7MB.

Describe the solution you'd like
It would be good to keep the schemas compressed on disk and only decompress them when reading into memory. This would save disk space, and probably a little time too, since the decompression step is likely to be faster than reading all the bytes from disk.
Python's `zlib` or `zipfile` modules in the standard library could be used.
For an example of a library shipping data in a zip file, see my heroicons package: https://github.com/adamchainz/heroicons/blob/main/src/heroicons/__init__.py
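The zip-in-package pattern used there, reading members out of a bundled archive on demand, looks roughly like this. The archive path and member naming are hypothetical:

```python
import json
import zipfile

def load_schema(archive_path, name):
    # Read one member out of the bundled archive and decode it. In a
    # real package you'd locate the archive next to __init__.py (e.g.
    # via importlib.resources); here it is passed in explicitly.
    with zipfile.ZipFile(archive_path) as zf:
        return json.loads(zf.read(name + ".json"))
```

The archive is opened per call here for simplicity; a real loader would likely keep the `ZipFile` open and cache parsed schemas.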