-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support compression with external dictionary #272
Comments
One of the main scenarios for cache server compression is the ability to store blobs in compressed form and to serve those directly without recompression on each request. If the server supported a single dictionary I think that could still work, as long as the client has the dictionary or knows how to get it. But to do this I think we would need a way to expose this to clients without creating a backwards incompatibility (eg add new Compressor values like ZSTD_DICTIONARY and a way to expose the dictionary). Would that cover your use case? I have a feeling that supporting multiple dictionaries would add a significant amount of complexity (but I could be wrong). |
I'm not familiar with the details, but from the description it sounds like the dictionaries are used to "speed up" compression by providing hints about the likely format of the data. I'd assume that arbitrary data could still be compressed, but that doing so would be slower than compression dictionary-like data. My take on this is that the dictionary can and should be provided via the control plane, and thus doesn't belong in the RE API itself. That is, however you specify your worker shapes and pool sizes, you can also specify a compression dictionary, which will then be used implicitly when data is uploaded or served. The system admin is responsible for making sure that the dictionary is a reasonable match for the expected data. |
My bad for being vague on the original post. It's what I have in the back of my head for the last 2 months or so, so I did not include many details. So let me provide a bit more context. ZSTD / Brotli compression does not work very well with small data. The smaller the amount of data to compress, the more difficult it is to compress well. Most compression algorithms leverage on some "base" data and then write subsequent bytes as some form of "delta" / diff against the "base" to reduce the data size. So for smaller files, there isn't enough data to generate a practical "base", or you spend way too much compute for very little gain. To solve this, you could compute a dictionary from sample data sets, representing your data, and use that dictionary as a base. It will improve compression/decompression speed while improving the compression ratio. Brotli uses a hardcoded dictionary by default, and ZSTD supports the usage of multiple external dictionaries. See http://facebook.github.io/zstd/#small-data and brotli for more info. Given the fact that our API is mostly used with Build Tools. If you zoom out a bit, most of the input/output files for actions in a graph are source code and artifacts derived from source code. This means that most of the artifacts being transferred with our API are small and using the default ZSTD compression might be inefficient. We should investigate the usage of external dictionaries to help improve compression efficiency.
I think that would work as the first iteration. As a monorepo changes over time, it's very likely that the effective dictionary will also change over time. Our API, at minimum, should accommodate for migration use case between different dictionary. Taking a look at the usage of ZSTD's external dictionaries in real life in Meta's Sapling VCS, a PackBlob could store multiple versions of the same files(blob) in compressed form and use the first version of a file as the dictionary to compress the latter versions. That means dictionary usage could be as fine-grained as 1 dictionary per file path. Another example is how Git Servers are optimizing their storage today: git objects when pushed to the server side were often stored as loose (plain, non-compressed) objects. Regularly the server will run git-repack (or git-gc, which runs git-repack) to delta compress the loose objects into a single pack-file. During the repack process, the base of the delta could be re-calculated based on the input objects. If a server wants to optimize its git storage backend over time, git-repack needs to run very regularly, leading to base and delta changes. There are some recent optimizations around this space, but the general concepts are still true today with Github, Gitlab, GitTea, etc... implementations. I think if we start to support dictionary compression, RE API servers would want to implement similar optimization for each tenant using the server.
It depends. If we simply treat dictionaries as blobs, then it should be as simple as calling download on blobs.
It could be if we are looking into a single hard-coded dictionary use case. This means that ideally, (a) this should be part of RE API as a new compressor value and (b) we provide some guidance for how dictionaries could be discovered and exchanged between client and server. FWIW, I don't think we need a new RPC to accommodate for this, simply an agreement on ordering of things would work. As an example: Chrome is adding some specs to support Brotli/Zstd compression for HTTP responses https://chromestatus.com/feature/5124977788977152. And the specification is being worked on here https://github.com/WICG/compression-dictionary-transport |
If I understand correctly, zstandard data compressed with a dictionary can only be decompressed by a receiver that also has the dictionary. Dictionaries are either implicit, or specified by a 32bit dictionary ID. This leads to some obvious questions:
(I am not as familiar with brotli- does it provide a way to specify which dictionary was used for compression?) |
(I typed up something on my mobile browser... misclicked and it's gone... So this is going to be my attempt no. 2) These are great questions which is what I was not sure myself when I first created this issue.
With these assumptions in mind, I was thinking that we could advertise a When the client first runs GetCapabilties RPC from the server, it can obtain the digest and download the DictionaryTree content accordingly. Then for each dictionary ID found in the header of a compressed file, the client could download and cache the dictionary locally before performing decompression. Note that we should leave the server implementations with some freedom to decide how to generate and store the dictionaries. So I won't discuss that further in detail here. If interested, I could take a crack at submitting a PR. |
With Brotli support in mind, it's probably should be |
Both Brotli and ZSTD support speeding up compression with a set of external dictionaries, trained from sample data.
It would be nice if our API could be made aware that (a) there is a dictionary involved in the compression and (b) how to upload/download that dictionary from/to client to/from server.
The text was updated successfully, but these errors were encountered: