Proposal: Standard Base Image Annotations #821

Closed
imjasonh opened this issue Feb 11, 2021 · 8 comments · Fixed by #822

@imjasonh
Member

I'd like to propose adding two new standard annotations to the spec, to describe information about an image's base image.

(This was previously briefly inquired about in #783)

Motivation

In general, it can be useful to users to be able to answer the question, "what is this image based on", or even for registry hosts to answer "which images are based on [this image]".

Use cases include being able to:

  1. quickly determine images based on vulnerable base images, without resorting to full image scans.
  2. quickly determine whether an image is based on an image for which a newer version is available.
  3. in certain cases, automatically "rebase" higher-up layers on top of updated base image layers.

In the case of (3), this is known not to be universally safe, but in certain constrained use cases, for example Buildpacks, it can be made safe enough that rebasing is possible. Being able to produce many updated images at once using only registry API operations can greatly decrease the time it takes to roll out a fix for a vulnerability.

Proposal

To accomplish this, I'm proposing two new image manifest annotations:

  • org.opencontainers.image.base.digest: digest of the image this image is based on.
  • org.opencontainers.image.base.ref.name: image reference of the image this image is based on, at the time it was built.

If both annotations are provided, the value of base.ref.name is assumed to refer to the same object (image manifest or manifest list) as base.digest refers to at the time the image was built.
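As a minimal sketch of what an annotated manifest could look like, the snippet below adds the two proposed keys to a manifest represented as a plain dict. The helper name `annotate_base` is hypothetical, and the digests and sizes are shortened placeholders rather than real values.

```python
import json

BASE_DIGEST_KEY = "org.opencontainers.image.base.digest"
BASE_REF_NAME_KEY = "org.opencontainers.image.base.ref.name"


def annotate_base(manifest, base_digest, base_ref_name):
    """Return a copy of `manifest` with the two proposed annotations set.

    Hypothetical helper for illustration; not part of any existing tool.
    """
    annotated = dict(manifest)
    annotations = dict(manifest.get("annotations") or {})
    annotations[BASE_DIGEST_KEY] = base_digest
    annotations[BASE_REF_NAME_KEY] = base_ref_name
    annotated["annotations"] = annotations
    return annotated


# Placeholder manifest: digests and sizes are made up and shortened.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": "sha256:c0ffee",
        "size": 1234,
    },
    "layers": [
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": "sha256:111",
            "size": 5678,
        },
    ],
}

annotated = annotate_base(manifest, "sha256:deadb33f", "base-image:v6")
print(json.dumps(annotated["annotations"], indent=2))
```

The helper copies the manifest instead of mutating it, so the original dict can still be compared against the annotated one.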

Examples

1. Identifying Vulnerable Base Images

Say a vulnerability is found in an image, base-image:v6, with digest base-image@sha256:abcdef12. A patch is applied and rolled out, tagged as base-image:v6 with digest base-image@sha256:deadb33f.

Given an image annotated as above (assuming that annotation is correct), I can trivially tell whether it was based on the vulnerable image: it was if its image.base.digest annotation is sha256:abcdef12.
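A minimal sketch of that check, assuming the manifest has already been fetched and parsed into a dict. The function name `based_on_vulnerable_base` is hypothetical, and the digests are the shortened ones from the example above.

```python
BASE_DIGEST_KEY = "org.opencontainers.image.base.digest"


def based_on_vulnerable_base(manifest, vulnerable_digests):
    """Report whether the manifest's base digest annotation is in the vulnerable set.

    Images without the annotation return False: nothing can be concluded
    about them without a full scan.
    """
    annotations = manifest.get("annotations") or {}
    base_digest = annotations.get(BASE_DIGEST_KEY)
    return base_digest is not None and base_digest in vulnerable_digests


vulnerable = {"sha256:abcdef12"}
old_app = {"annotations": {BASE_DIGEST_KEY: "sha256:abcdef12"}}
new_app = {"annotations": {BASE_DIGEST_KEY: "sha256:deadb33f"}}

print(based_on_vulnerable_base(old_app, vulnerable))  # True
print(based_on_vulnerable_base(new_app, vulnerable))  # False
```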

2. Identifying Updated Base Images

Say I have an image, my-app:latest, annotated with:

  • org.opencontainers.image.base.digest: sha256:deadb33f
  • org.opencontainers.image.base.ref.name: base-image:v6

I can at any point query the registry to determine the current digest of base-image:v6, and if it doesn't match the image.base.digest annotation, I know that my image is not based on the current base-image:v6 image. Knowing this, I can rebuild to pick up the latest base image.

For this to be efficient, base.ref.name and base.digest are assumed to refer to the same object, so clients only need to HEAD the base ref and compare the resulting digest, and not GET a manifest list and HEAD each child manifest.
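The comparison can be sketched as follows. The `resolve_digest` callable is an assumption standing in for the registry lookup: in practice it would likely issue a HEAD request for the base ref's manifest and read the digest from the response, but here it is injected so the logic stays independent of any particular client.

```python
BASE_DIGEST_KEY = "org.opencontainers.image.base.digest"
BASE_REF_NAME_KEY = "org.opencontainers.image.base.ref.name"


def base_is_stale(manifest, resolve_digest):
    """Compare the recorded base digest with what the base ref points to now.

    `resolve_digest` is a hypothetical callable mapping an image reference
    to its current digest. Returns None when either annotation is missing,
    since no conclusion can be drawn in that case.
    """
    annotations = manifest.get("annotations") or {}
    base_digest = annotations.get(BASE_DIGEST_KEY)
    base_ref = annotations.get(BASE_REF_NAME_KEY)
    if base_digest is None or base_ref is None:
        return None
    return resolve_digest(base_ref) != base_digest


my_app = {
    "annotations": {
        BASE_DIGEST_KEY: "sha256:deadb33f",
        BASE_REF_NAME_KEY: "base-image:v6",
    }
}

# Stand-in for the registry: the tag has moved to a newer (made-up) digest.
current_tags = {"base-image:v6": "sha256:feedf00d"}
print(base_is_stale(my_app, current_tags.get))  # True
```

Only one lookup per image is needed because both annotations are assumed to name the same object.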

3. Rebasing Updated Base Images

If I know, due to how my image was built, that my app's layers will be compatible with new layers in my base image, I don't even need to rebuild -- I can rebase.

That is, I can take the layers in my-app:latest, remove the base layers it shares with base-image@sha256:deadb33f, and substitute the layers in the current base-image:v6, then push that image back to the registry for further validation and delivery.

This requires being able to identify a "base image seam", that is, which layers in an image belong to the base image, and which belong above the base image.

Rebasing is not guaranteed to produce valid images in all cases. But if the top-most app layers don't assume specific details of the lower base layers, this can be made to be safe.
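A rough sketch of the layer substitution, treating each image as a bottom-first list of layer digests (the order used by the manifest's layers array). The function name and the digests are illustrative only, and a real rebase would also have to rewrite the image config, which is omitted here.

```python
def rebase_layers(image_layers, old_base_layers, new_base_layers):
    """Swap the old base layers at the bottom of an image for new ones.

    All arguments are lists of layer digests, base layer first. The seam is
    the point where the old base image's layers end.
    """
    seam = len(old_base_layers)
    if image_layers[:seam] != old_base_layers:
        raise ValueError("image does not start with the old base image's layers")
    return new_base_layers + image_layers[seam:]


old_base = ["sha256:111", "sha256:000"]
new_base = ["sha256:222", "sha256:333", "sha256:444"]
my_app = ["sha256:111", "sha256:000", "sha256:def", "sha256:abc"]

rebased = rebase_layers(my_app, old_base, new_base)
print(rebased)
```

The prefix check is what the base.digest annotation makes possible: the old base image's layer list can be looked up by digest and compared against the image before anything is replaced.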

More thorough examples and documentation in crane and Buildpacks.

Alternatives Considered

Annotating Multiple Base Images

An image might have multiple layers of base images -- e.g., a tagged whole app image, based on a tagged shared base image containing some OS packages, based on a bare OS image.

+-------------------+ <--- my-app:latest
| sha256:abc        |
+-------------------+
| sha256:def        |
+-------------------+ <--- base-packages:v6
| sha256:123        |
+-------------------+
| sha256:456        |
+-------------------+
| sha256:789        |
+-------------------+ <--- base-os:v6
| sha256:000        |
+-------------------+
| sha256:111        |
+-------------------+

One could therefore annotate the image to describe any number of base images. However, this adds complexity, both to the naming and semantics of the annotations (org.opencontainers.image.base[N].digest?) and in being able to constructively solve any of the motivating use cases listed above.

Instead, I propose only describing one base image seam per image, and if that image itself describes a base image seam of its own, and so on, so be it. In a situation where the base OS image fixes a vulnerability, the intermediate OS-packages base image can address it, producing a new tagged image, which will in turn signal to downstream app images that they are in need of an update.
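To illustrate how a single seam per image still yields the whole chain, here is a small sketch that follows base.ref.name annotations from one image to the next. The `get_manifest` callable is a hypothetical stand-in for a registry fetch, and the in-memory manifests mirror the diagram above. Note that following a ref name resolves whatever the tag points to now, not necessarily what it pointed to at build time.

```python
BASE_REF_NAME_KEY = "org.opencontainers.image.base.ref.name"


def base_chain(ref, get_manifest):
    """Follow base.ref.name annotations and return the list of ancestor refs.

    `get_manifest` is a hypothetical callable mapping a reference to a
    parsed manifest dict. The walk stops at an image with no annotation,
    or if a reference repeats.
    """
    chain = []
    seen = {ref}
    while True:
        annotations = get_manifest(ref).get("annotations") or {}
        base_ref = annotations.get(BASE_REF_NAME_KEY)
        if base_ref is None or base_ref in seen:
            return chain
        chain.append(base_ref)
        seen.add(base_ref)
        ref = base_ref


# Stand-in registry matching the diagram above.
manifests = {
    "my-app:latest": {"annotations": {BASE_REF_NAME_KEY: "base-packages:v6"}},
    "base-packages:v6": {"annotations": {BASE_REF_NAME_KEY: "base-os:v6"}},
    "base-os:v6": {},
}

print(base_chain("my-app:latest", manifests.__getitem__))
# ['base-packages:v6', 'base-os:v6']
```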

Layer Index Annotation

Instead of annotating the image with the digest of the base image, one could annotate an integer index into the list of layers that represents the base layer seam. This is relatively straightforward, and still supports rebase scenarios, but it loses valuable provenance information. In this approach you wouldn't be able to answer "is this image based on a vulnerable base image", or "is this image based on an out-of-date base image".

topLayer Annotation

Instead of annotating the image manifest with the digest of the base image, Buildpacks annotates the image with the digest of the "top layer", that is, the top layer of the base image. Layers above this layer are preserved when rebasing.

The disadvantage of this approach is in handling images containing duplicated layers. Perfectly valid images might include the same layer contents (meaning the same layer digest) multiple times, perhaps non-consecutively, which can make identifying the base layer seam using the "top layer digest" approach ambiguous and harder to specify.
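The ambiguity can be shown with a short sketch. The layer digests below are made up, listed bottom-first, and `seam_candidates` is a hypothetical helper rather than anything Buildpacks provides.

```python
def seam_candidates(image_layers, top_layer_digest):
    """Return every index where the base image could end, given only the
    digest of its top layer."""
    return [i for i, digest in enumerate(image_layers) if digest == top_layer_digest]


# The same layer contents appear twice, non-consecutively.
layers = ["sha256:111", "sha256:000", "sha256:111", "sha256:abc"]

print(seam_candidates(layers, "sha256:111"))  # [0, 2]
print(seam_candidates(layers, "sha256:000"))  # [1]
```

With more than one candidate, a tool cannot tell from the top layer digest alone whether the base image is one layer or three layers deep.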

It's worth noting that a topLayer annotation, and an integer layer index annotation, both have the benefit of being able to describe the base image seam without needing to consult a registry, though neither can communicate base image provenance. Buildpacks expresses base image provenance information through another annotation mechanism.

Layer Descriptor Annotation

Instead of annotating the manifest with base image information, one could annotate the layer descriptors, to describe the base image seam. This would make it simpler to describe multiple base image seams, but as that's a non-goal (see above), this is not necessary, and having to validate that only one layer is annotated as a base image seam adds complexity.


cc @jonjohnsonjr @ekcasey @sclevine

@tianon
Member

tianon commented Feb 17, 2021

In practice, this would be most useful if we took it all the way to the extreme of storing a full "tree" of references (given that a single "base" isn't always accurate), wouldn't it?

For example:

FROM golang:alpine AS build
# ... build the thing statically ...
FROM scratch
COPY --from=build ...

Now imagine a critical musl vulnerability that warrants a rebuild of alpine:xxx, which warrants a rebuild of golang:alpine, which then warrants a rebuild of my image.

In a more complex example, you might have something like this:

FROM foo
COPY --from=bar ...
COPY --from=baz ...

# then, separately
FROM that
COPY --from=buzz ...

@jonjohnsonjr
Contributor

given that a single "base" isn't always accurate

I've discussed this a bit before with @imjasonh and we decided to simplify this a bit by just starting with the single base image use case. I think it's definitely possible to adapt the proposal to suit multiple base images, but the semantics get a bit tricky.

The final image has multiple bases, some of which are ephemeral intermediary images that would never get pushed anywhere (with multi-stage builds), so we couldn't reference them directly, but we could capture their base images, which would allow you to detect when a build needs re-triggering.

For some of these cases, you might be able to just annotate a layer -- "this layer derives (eventually) from this base image".

But for more complex cases, there's not an obvious way to represent this... one annotation per layer wouldn't work if a single layer somehow derived from multiple base images, e.g. imagine flattening an image. We could try to force this by having delimiters and multiple values within an annotation (honestly, not horrible), or we could have a fit-for-purpose separate tree data structure that gets attached to the image to indicate its pedigree, not unlike SBOM or signature use cases.

For single-base images, you get a nice tree structure by default. E.g. you could imagine that I have several images based on ubuntu, and ubuntu might be based on debian (it's not, but imagine it). A rebuild of debian would trigger a rebuild of ubuntu which would trigger a rebuild of my images.

Thinking about it more, a flattened list of base image dependencies sounds like it might actually work. The only change we'd need to make is adding a delimiter for both annotations so that they can support multiple values. I would expect len(base.digest) == len(base.ref) to always be true, which would make correlating the ref and digest trivial, but I'm interested in any cases where that might not be true?
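A sketch of how such delimited values might be parsed and correlated. The comma delimiter is purely an assumption (the thread does not settle on one), the function names are hypothetical, and the refs and digests are placeholders.

```python
BASE_DIGEST_KEY = "org.opencontainers.image.base.digest"
BASE_REF_NAME_KEY = "org.opencontainers.image.base.ref.name"


def split_values(value, delimiter):
    """Split a delimited annotation value, dropping empty entries."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


def parse_bases(annotations, delimiter=","):
    """Pair each base ref with its digest by position.

    Assumes both annotations carry the same number of values; raises
    ValueError otherwise.
    """
    digests = split_values(annotations.get(BASE_DIGEST_KEY, ""), delimiter)
    refs = split_values(annotations.get(BASE_REF_NAME_KEY, ""), delimiter)
    if len(digests) != len(refs):
        raise ValueError("base.digest and base.ref.name have different lengths")
    return list(zip(refs, digests))


annotations = {
    BASE_REF_NAME_KEY: "foo,bar,baz",
    BASE_DIGEST_KEY: "sha256:aaa,sha256:bbb,sha256:ccc",
}
print(parse_bases(annotations))
```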

A list of direct base dependencies would allow me to answer all the questions I'd care about, and kind of ducks the definition of what a "base" image really is, if there's only one. We can say that they're just images that this image depends on at build time.

I doubt that I'd ever want to attempt to automatically rebase a multi-based image, as that feels really unsafe, but you'd still get the benefit of knowing where some stuff came from and whether or not an image is up to date.

@imjasonh WDYT? Anything I'm missing?

@vbatts
Member

vbatts commented May 14, 2021

some of the annotations are flexible to be used in more than just descriptors. This proposal is very specific to descriptors, and may need particular docs of when it does not make sense i.e. if the blob has been "squashed" such that the base image is no longer fetched for a rebuild. But even as i'm typing that, in the case of a "squashed" image, i would like to know the digest that the image was originally built upon.

i think i'm supportive of the use-case, just trying to get a heads up on all the ways this will get abused. :-D

@jonjohnsonjr
Contributor

very specific to descriptors

I think what we settled on in the PR is a single top-level annotation in a manifest, pointing to the base image ref/digest.

The language in the PR seems generic enough to abuse however we'd like :P

@vbatts
Member

vbatts commented May 14, 2021

then the references seem too spindly to me. 🤔

@imjasonh
Member Author

The annotations spec already contains this text suggesting that the annotation keys are intended for manifests and image indexes:

This specification defines the following annotation keys, intended for but not limited to image index and image manifest authors:

We obviously can't stop people from annotating anything however they want, but IMO this text helps forestall bug reports where someone applied the annotation to their toaster or whatever and it caught fire.

@jonjohnsonjr
Contributor

@vbatts can you expand on the spindly-ness?

@vbatts
Member

vbatts commented Jul 30, 2021

Just the clients that pull through an image-index, and go straight to the architecture of their host,
And then handling a list of references.

After thinking about this, there is nothing that will blow up due to this. Fishing with dynamite 🧨
