Store schemas compressed on disk. #2365
Hi @adamchainz, Thanks for the feature request! I'll review this with the team, although I can't make any guarantees as to when/if this will be implemented.
This is an interesting idea. This has been noted previously in a similar scenario with the AWS CLI as well. The AWS SDKs consume the API models from upstream, so changing the way that they are stored and accessed would be a significant feature. One drawback would be the lack of direct human readability of the API models that are currently available in the Python SDK: it would be difficult to see where API changes were introduced between versions of the SDK. For example, removing the documentation strings from the models would cut 20MB off of the size, which might be useful in a CI/CD environment. Do you have specific scenarios of your own where a slimmed-down version would help?
One can use the
This affects me in a couple of ways:
I'm in favor of this feature as well. They could stay uncompressed in the source code here, but be bundled into a zip for the released wheel, and they'd stay programmatically available. The benefits to install time, artifact size, and Lambda in-console editing would be well worth the effort imo.
Hey all, just wanted to chime in real quick to mention that I took some time today to play around with the ideas here. I think @benkehoe's suggestion makes a lot of sense, and I took a crack at implementing support for building wheels that include compressed models instead of the plaintext versions. However, rather than modifying the loader to include an additional possible location that checks within a zip, I decided to update the loader to also look for gzip-compressed models.

In addition to support for loading gzip-compressed models, I've added a script to the `scripts/` directory to compress the model data in a built wheel. Using my branch you should be able to generate and then modify a wheel that includes the compressed models instead:

```
$ python setup.py bdist_wheel
$ ./scripts/compress-wheel-data dist/botocore-*-none-any.whl
```

It'd be great if some of you could test the compressed wheels out, as I do have some concerns around compatibility/performance if we were ever to begin publishing wheels like this instead of the uncompressed version. As for my testing (on an M1 MacBook Pro) I saw the following: install times were marginally in favor of the wheel with compressed models, but it wasn't significant and might have just been margin of error. Comparing the unzipped wheel I saw about a 5x reduction in disk space, going from 66M to 13M.
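The loader change described above can be sketched roughly as follows. This is a simplified illustration of the `.json.gz` fallback idea only, not botocore's actual `Loader` implementation; the function name and file layout are assumptions:

```python
import gzip
import json
import os

def load_model(base_path):
    # Prefer the plaintext model if present; otherwise fall back to a
    # gzip-compressed copy alongside it. This mirrors the idea of the
    # loader checking for ".json.gz" in addition to ".json".
    plain = base_path + ".json"
    compressed = base_path + ".json.gz"
    if os.path.exists(plain):
        with open(plain, "rb") as f:
            return json.load(f)
    if os.path.exists(compressed):
        with gzip.open(compressed, "rt", encoding="utf-8") as f:
            return json.load(f)
    raise FileNotFoundError(f"No model found for {base_path}")
```

Keeping plaintext as the preferred location means a source checkout keeps working unchanged, while a compressed wheel only ships the `.json.gz` files.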
I also tried creating a new
Unfortunately, loading the compressed models is about 10% slower. I'm sure there are different compression algorithms that might produce better results here, but I'm concerned about compatibility if we were to use a less ubiquitous algorithm than gzip.
Do you know what gzip level you used? Python's `gzip` module lets you pick a compression level. Even level 1 would probably provide significant gains given the repetition in JSON.
Thanks for taking a look at this! Can we get a comparison on wheel size and performance between compressing the files individually versus all together? I get the benefit of allowing non-default locations to have them individually, but if there's a big difference for the primary package it could make sense to special-case that as a single zip.
@benkehoe The wheel size wasn't significantly impacted by a single zip vs individual models. As for the decompressed package, I got a slight improvement in favor of a single zip. Getting data on how that affects botocore client load times isn't something I've tested since I haven't implemented it. I do have concerns around the monolithic nature of a single zip and the performance characteristics of random access in the zip.

@adamchainz My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this: higher compression level => slower compression times, smaller files, marginally faster decompress times.
Ah, you are right. My bad.
I ran a sanity check comparing all 3 by doing a minimal open directly on a model in the
The nested zip is the slowest and impacts smaller models pretty significantly. This is only considering loading the
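A self-contained version of that sanity check might look like the following. The file names and document shape are made up, timings will vary by machine, and re-opening the zip per load is a deliberate choice to model cold random access:

```python
import gzip
import json
import os
import tempfile
import timeit
import zipfile

# A synthetic model document to load three different ways.
payload = json.dumps(
    {"shapes": {f"Shape{i}": {"type": "structure"} for i in range(2000)}}
).encode("utf-8")

tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "service-2.json")
gz_path = os.path.join(tmp, "service-2.json.gz")
zip_path = os.path.join(tmp, "data.zip")

with open(plain_path, "wb") as f:
    f.write(payload)
with gzip.open(gz_path, "wb") as f:
    f.write(payload)
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("service-2.json", payload)

def load_plain():
    with open(plain_path, "rb") as f:
        return json.load(f)

def load_gz():
    with gzip.open(gz_path, "rb") as f:
        return json.load(f)

def load_zip():
    # Re-opening the archive each time models cold random access into
    # a monolithic zip, which is the concern raised above.
    with zipfile.ZipFile(zip_path) as z:
        with z.open("service-2.json") as f:
            return json.load(f)

for fn in (load_plain, load_gz, load_zip):
    print(fn.__name__, timeit.timeit(fn, number=100))
```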
Awesome, this all makes sense. The small difference in size (that surprised me a bit), combined with individual files being better on both performance and code simplicity, makes it no contest. Thanks for humoring me and validating it though!
Feels like this is trying to fix similar symptoms as #1543, but in a different way. Though I don't think the two ideas are mutually exclusive; just linking them here.
@gricey432 You're absolutely correct that the two approaches aren't mutually exclusive. When I was doing the initial proof of concept script on my branch I was tempted to add a services filter that could allow the built wheel to only include a subset of services, but didn't quite have time.
Could save roughly 50 megs in Lambda installs by doing this. As it stands, installing botocore/boto3 + telemetry tools + something like pandas usually breaks the bank when deploying to Lambda (even after removing .pyc files and stripping shared objects).
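One way to approximate the services-subset idea today is to prune unused service data from an installed botocore before packaging a Lambda bundle. The keep-list and directory layout below are assumptions to adapt, not a supported botocore workflow:

```python
import shutil
from pathlib import Path

# Names to keep inside botocore's data directory; everything else is
# removed. The exact shared files a given install needs is an
# assumption here -- adjust the list for your own deployment.
KEEP = {"s3", "dynamodb", "sts", "endpoints.json", "partitions.json"}

def trim_botocore_data(data_dir):
    """Delete service model directories/files not in KEEP."""
    for entry in Path(data_dir).iterdir():
        if entry.name in KEEP:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
```

Run against something like `site-packages/botocore/data` in the build staging directory, and test the bundle afterwards, since removing a model any client needs will break it at runtime.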
In the context of boto/botocore#2365: .json.gz loading has been merged (boto/botocore#2628), so we can take advantage of it by gzipping all the JSON in the data folder. All the relevant tests pass, so it seems that we are good to go.
Hey everyone, wanted to provide a quick status update. Starting in 1.32.0, we began compressing select service models (Amazon EC2, Amazon SageMaker, and Amazon QuickSight) in our .whl files distributed on PyPI. With this change, we were able to reduce the size of botocore by 9.4 MB (11%) to a total of 76.1 MB on disk. This was the final step in a series of changes we've made over the last year to validate and enable today's release. With 1.32.1, we've rolled this change out to all service models in our .whl files. This allows us to shrink botocore from 85.5 MB in our last 1.31.x release to 19.9 MB, for a total savings of 77%. We hope this will be an impactful first step towards making botocore less difficult to use in space-constrained environments. Going forward, we have additional areas we're looking to improve and will provide updates as we have them. We'd welcome any feedback you might have in the meantime.
Nice work! This is an update I've been waiting for a long time.

```
❯ for VERSION in 1.31.83 1.31.84 1.31.85 1.32.0 1.32.1; do echo -n "$VERSION --> " && docker run --rm python:3.11-slim bash -c "pip install --disable-pip-version-check --quiet --root-user-action=ignore botocore==$VERSION && du -h -s /usr/local/lib/python3.11/site-packages/botocore"; done
1.31.83 --> 86M /usr/local/lib/python3.11/site-packages/botocore
1.31.84 --> 86M /usr/local/lib/python3.11/site-packages/botocore
1.31.85 --> 86M /usr/local/lib/python3.11/site-packages/botocore
1.32.0 --> 77M /usr/local/lib/python3.11/site-packages/botocore
1.32.1 --> 24M /usr/local/lib/python3.11/site-packages/botocore
```
This is great news! Will this change end up in the CLI as well?
Just a note for people who are excited about the possibilities of smaller Lambda deploy packages: this probably won't help you get under 50 MB, because what you upload to Lambda is typically compressed already. That is, the deploy zip was already shrinking the plaintext models, so models that are gzipped on disk don't make the uploaded archive much smaller.
I'd also like to drop a plug here for boto/boto3#2702: you tell us botocore version 1.32.1 has this change, and then it's work for us to figure out which boto3 version that corresponds to (it's 1.29.1), when they should just be the same.
This drops installed size of botocore from 60 to 24 MiB and should fix deployment. See boto/botocore#2365 (comment).
To further reduce the size, could you remove some parts of the JSON files? For example, I noticed that the If we try that with
More than 70% reduction in size compressed (and 50+% uncompressed)! We can even minify the JSON to further reduce the size, e.g. with
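The minification idea can be sketched like this. Stripping `documentation` keys follows the earlier suggestion in the thread; walking the whole document recursively is an assumption about where those keys appear:

```python
import json

def strip_docs(model):
    # Recursively drop "documentation" keys, which hold the HTML doc
    # strings in botocore's model files (assumption: they can appear at
    # any nesting level).
    if isinstance(model, dict):
        return {k: strip_docs(v) for k, v in model.items() if k != "documentation"}
    if isinstance(model, list):
        return [strip_docs(v) for v in model]
    return model

def dump_minified(model):
    # separators=(",", ":") removes the whitespace json.dumps adds by
    # default, minifying the output.
    return json.dumps(strip_docs(model), separators=(",", ":"))
```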
Is your feature request related to a problem? Please describe.
The `data` directory of a `botocore` install is over 50MB. The JSON inside compresses really well, as we can see from the PyPI packages being just 7MB.

Describe the solution you'd like
It would be good to keep the schemas compressed on disk and only decompress them when reading into memory. This would save disk space, and probably a little time too, since the decompression step is likely to be faster than reading all the bytes from disk.
Python's `zlib` or `zipfile` modules in the standard library could be used.
For an example of a library shipping data in a zip file, see my heroicons package: https://github.com/adamchainz/heroicons/blob/main/src/heroicons/__init__.py
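The zip-in-package pattern used there, reading members out of a bundled archive on demand, looks roughly like this. The archive path and member naming are hypothetical:

```python
import json
import zipfile

def load_schema(archive_path, name):
    # Read one member out of the bundled archive and decode it. In a
    # real package you'd locate the archive next to __init__.py (e.g.
    # via importlib.resources); here it is passed in explicitly.
    with zipfile.ZipFile(archive_path) as zf:
        return json.loads(zf.read(name + ".json"))
```

The archive is opened per call here for simplicity; a real loader would likely keep the `ZipFile` open and cache parsed schemas.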