-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable zstd:chunked support in containers/image #775
Conversation
Lint is not happy. |
eb34310
to
69ca881
Compare
fe91364
to
66a8934
Compare
could I get some feedbacks on the containers/storage side (especially about the new interfaces)? The current implementation is feature completed, with some parts that can be added later, e.g. cache for host files. One thing to address is how to store the layer metadata. I think it will be better to use a separate file. [Addressed in the last version] |
b8cfe6b
to
264b1a2
Compare
Honestly I think cyphar just went off the rails a bit here - it's likely all the goals could be achieved with a canonical tar representation and hence be compatible with all existing tar parsers. |
That seems hard to do with compressed artifacts (we want to allow for bug fixes or improvements to the compression implementation, and for image creators to use different programming languages = different implementations — also the implementations might not produce reproducible data just because it wasn’t a goal), if we don’t keep the compressed variant around (like I guess Git does?). We don’t currently keep it around, and the disk space impact could be pretty significant in absolute terms, easily tens of gigabytes (I have no idea what it is relative vs. available disk space).
Yeah, this pair of PRs is big enough and invasive enough that it does warrant some thinking whether we could be getting more features for similar amount of complexity. Intuitively I don’t see reproducibility (from pulled and expanded image stores) as that much of an end goal; users shouldn’t be copying images by pull&push in any case, it’s going to be less efficient than a straight registry-to-registry copy no matter what we do. (Reproducibility is a useful property for a build system, which is a more constrained problem, e.g. it can choose to use a reproducible compression implementation or freeze an implementation version in a way the whole ecosystem can’t.) It seems interesting to me to think about the size/speed aspects of the incremental pull, to see if we can improve those. (If it turns out that we need a canonical uncompressed tar format for that, sure, let’s; it just doesn’t look like a natural place to start.) At least my life would be easier if we had only a small number of delta/incremental-update mechanisms, instead of giving the user a menu of 5 and having to worry about the 2^5 possible code paths in the worst case; so a format that allows best-of-all-worlds is a nice dream :) OTOH that’s all entirely theoretical until there’s an idea for how to do that, and interest+time to implement it; and this PR shouldn’t be blocked on an unachievable hypothetical. [In a sense, the uncompressed tar format is reproducible already, due to tar-split. (That‘s the other kind of reproducible, from a pulled image, not from a filesystem during a build process.) I don’t know whether that helps because I’m not sure what we need reproducibility/canonical form for. Just for more layer reuse across separately-built images?] |
Yes; we can really only care about the uncompressed version. Clients would need to verify the uncompressed checksum and not the compressed one. That said, after some reflection I think "canonical tar" could be done after this in theory - they're mostly orthogonal problems.
(looked this up, seems to be 9b04055 )
Basically the same rationale as https://reproducible-builds.org/ - the downside of tar-split is basically that the original tar stream becomes canonical and has to be carried around. |
I've marked all the APIs in containers/storage and containers/image as experimental, to have the freedom of introducing breaking changes until we are happy with it. I am planning to do more incremental improvements later, as well as possibly support stargz/estargz as well for compatibility with clients and registries that don't support zstd yet. The reason why I am pushing to get this merged is that we can finally get users to play with it and get some feedback, and not expect users to deal with re-vendoring containers/storage and containers/image into the container tools. We can still improve the new format and we add tar reproducibility later. I think the tar-split blob can be embedded in the zstd stream, in the same way the TOC metadata is added right now. |
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
add a new custom variant of the zstd compression that permits to retrieve each file separately. The idea is based on CRFS and its stargz format for having seekable and indexable tarballs. One disadvantage of the stargz format is that a custom file is added to the tarball to store the layer metadata, and the metadata file is part of the image itself. Clients that are not aware of the stargz format will propagate the metadata file inside of the containers. The zstd compression supports embeddeding additional data as part of the stream that the zstd decompressor will ignore (skippable frame), so the issue above with CRFS can be solved directly within the zstd compression format. Beside this minor advantage, zstd is much faster and compresses better than gzip, so take this opportunity to push the zstd format further. The zstd compression is supported by the OCI image specs since August 2019: opencontainers/image-spec#788 and has been supported by containers/image since then. Clients that are not aware of the zstd:chunked format, won't notice any difference when handling a blob that uses the variant. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
fedora-32 vfs fails to start, but I guess we can probably drop tests on fedora 32 |
Hi, FYI, I am speaking from a docker/podman user point of view. If I understand correctly this proposal, this is something more or less similar to the eStargz proposal, but using zstd instead, to be able to inject metadata unknown to old client impactless, where currently eStargz would result in having some unexpected files injected inside the running container. Also, if I understand correctly, this allows to share on the client side parts of files which may be found in different and unrelated images, if these files share the same hash, and that is really cool ! While this proposal seems to be a very good improvement from the "pull" side, is it also an improvement for "push" scenario, and the storage on remote registry ? I mean, let's say a file content exists already in some image of a remote registry, if I build a local image uzing this zstd format, which contains a file with the same content, when I will run "podman push", will it push the full ztd tar file, or only the parts that the registry doesn't know yet ? And on registry side, will it be able to serve the "big" zstd file as a collection of smaller concatenatd zstd tar files and thus only store split zstd tar files rather than the "big" ones ? Let me explain why we in Amadeus might be interested by such a proposal (as users). We have many applications which are C++ based. We do use (for now) mainly dynamic linking, and contrary to other companies our teams are very inter-dependant: we have several layer of C++ middleware teams developing diverse library, that are eventually reused by application developers. Our applications are quite big and tend to need many dependencies from various teams, compiled .so shared libraries are often 10MB files, up to sometimes 100+MB, even after being stripped. We have thousands of these libraries in total, big applications might require a lot of them. In production images, we install part of these libraries using "RUN dnf install some-amadeus-package", but the majority of these libraries are installed a simpler way, just untaring some tar.gz file in a new OCI layer. To develop internally, we also use Docker images, so these come with GNU C/C++ based toolchains, which also are very heavy (a single toolchain containing both gcc + LLVM tools can easily weight 1GB in total). While we do offer one "standard SDK" base image where most tools are present, some team make the choice to develop their own image, writing their own Dockerfile. Most of our tools are available as RPM packages, so they naturally write "RUN dnf install some-amadeus-packages && dnf clean all". The fact that we have many teams over dependant on each other and that each set of dependencies in every team is different results that in the end, we have many different SDK/production images in our registries, and layer's storage diff are very rarely shared between common images, despite we have MANY big files in common between all these various images. Anything based on "dnf" is doomed to fail to share layers as "dnf install" will result in having unique timestamsp, but not only (ie we can't just fix it by using "podman build --timestamp"), as "dnf" internally relies on a database, which will contain different package list for every team, thus look different from a filesystem point of view, thus leading to different tar.gz sha256). So on Amadeus side, proposals that would allow to share not only "big" layer's filesystem diff tar.gz, but really individual file content part (so without taking into account metadata) of these tar.gz (or tar.zstd) would definitely help us save a lot of disk space and network bandwidth to download/push just the "new" files. We are concerned by the registry side as well, not only the client side, as we do host our own internal JFrog Artifactory registry, and we regularly receive mails to clean all the docker images we don't use anymore as it fills up pretty quickly. I am curious about this one, and about the mentioned OCIv2 as well ;) Cheers, |
This is exactly one of the requirements of OCIv2! It might require the API changes on the registry side. But if the registry can understand seekable compression formats (e.g. zstd:chunked, eStargz, stargz) and the TOC format, I think we don't even need to change image spec, by peforming PATCH upload to these layers, in chunk-level granularity. Anyway, as mentioned #775 (comment), I think we can care about it in following-up PRs. |
@mtrmac @cgwalters are you fine with the current status of this PR? Or anything against moving this forward? |
In rpm-ostree we have In particular as soon as this is vendored in podman for example, then it becomes effectively frozen in that we need to support the format ~forever right? Whereas if it requires something like I only skimmed the design, and barely any of the code. |
|
As far as scanning for matching files in the Base OS @giuseppe has agreed to make this optional and just for /usr. We can turn it off by default, and then for certain workloads users could turn it on. |
it is already disabled by default in the last version of these patches and it must be enabled in the
it won't be enabled by default yet, it must be explicitly configured (haven't decided yet if through the |
Since this is experiemental, I am going to merge, so we can move forward. |
I randomly came across https://github.com/dragonflyoss/image-service - not sure if you all have seen it. May be worth a "Related projects" section somewhere? |
a bunch of new APIs to enable zstd:chunked support in containers/image.
add a new custom variant of the zstd compression that permits to retrieve each file separately.
The idea is based on CRFS and its stargz format for having seekable and indexable tarballs.
One disadvantage of the stargz format is that a custom file is added to the tarball to store the layer metadata, and the metadata file is part of the image itself. Clients that are not aware of the stargz format will propagate the metadata file inside of the containers.
The zstd compression supports embedding additional data as part of the stream that the zstd decompressor will ignore (skippable frame), so the issue above with CRFS can be solved directly within the zstd compression format.
Beside this minor advantage, zstd is much faster and compresses better than gzip, so take this opportunity to push the zstd format further.
The zstd compression is supported by the OCI image specs since August 2019: opencontainers/image-spec#788 and has been supported by containers/image since then.
Clients that are not aware of the zstd:chunked format, won't notice any difference when handling a blob that uses the variant.
Since access to host files is required for the deduplication, the untar doesn't happen in a separate process running in a chroot, but it happens in original mount namespace. To mitigate possible breakages, openat2(2) is used were available.
More details here: containers/image#1084
Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com