Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--mount=type=cache id should not default to the target #1706

Closed
myitcv opened this issue Sep 28, 2020 · 8 comments
Closed

--mount=type=cache id should not default to the target #1706

myitcv opened this issue Sep 28, 2020 · 8 comments

Comments

@myitcv
Copy link

myitcv commented Sep 28, 2020

I asked the following question on Docker slack in #buildkit:

If I have two RUN commands that, say, use --mount=type=cache,target=/root/.cache/go-build, do they use the same cache?

It's not entirely clear to me from the wording of the linked page what identifies the cache if id is omitted

@tonistiigi kindly replied with:

id defaults to the target path

My conclusion from this is that if I have two separate Dockerfile's, where both use --mount=type=cache,target=/blah, they will share the same cache, even thought their usage of that could be entirely different/conflicting. This seems somewhat surprising (dangerous?) as a default.

Would it instead be better to require an id for --mount=type=cache? That way, the caller can ensure caches are "re-used" when expected, not by accident.

Somewhat related to #1512 because in that case the volume itself would be the id.

@tonistiigi
Copy link
Member

This is expected. Specific contents of the cache mounts is never guaranteed and you build should always work with any state of files, just with different performance implications. Mostly this is used for package managers or specific language cache. Eg. for go, all projects should share the go build cache as it is safe, same for npm tarballs, apt etc. If you want project specifics then make sure to use ID parameter.

@myitcv
Copy link
Author

myitcv commented Sep 29, 2020

Mostly this is used for package managers or specific language cache

But I think the problem is that the target directory is somewhat arbitrary. For this approach to work (i.e. for people to benefit from shared caches), you would need everyone who builds Go projects to agree on the cache directory being, say, /root/.cache/go-build. Otherwise you will have redundant copies of a build or module cache. And in the worse case caches needlessly being blown away because that's really the only reliable way to work with any state of files.

This is somewhat related to my comments in #1512 (comment) and #1705. Because ultimately you need/want the caller to have to specify the location of caches, not the Dockerfile author. If I have a build that is Go related, then as the caller I should be able to (but not obliged to) pass the location of my local module cache (go env GOMODCACHE) and build cache (go env GOCACHE). If I don't want to/can't supply these values, then I think you need to default to an id-named cache, where the Dockerfile author chooses a suitably namespaced id (e.g. github.com/my/project).

Hence my conclusion that id defaulting to the target directory is not actually that useful, and indeed likely more harmful.

I therefore think id should be required, with no defaulting logic, but there should also be the option for the user to "inject" some volume-based mount that would substitute the id-based mount.

@tonistiigi
Copy link
Member

you would need everyone who builds Go projects to agree on the cache directory being,

No, /root/.cache/go-build is already built into go binary as the default value for the cache location. Same for the go modules, npm, apt, cargo etc. Everyone would need to agree if you would come up for an arbitrary ID to use instead.

Because ultimately you need/want the caller to have to specify the location of caches

Even when #1512 is added it will be a special case for advanced usage. Mostly so that you can transfer your cache or reuse it with a new buildkit instance. All the dockerfiles need to work without passing any specific location.

ID defaults to target path because that is the expected behavior for 95% of the practical valid uses of the cache mounts.

@myitcv
Copy link
Author

myitcv commented Oct 1, 2020

/root/.cache/go-build is already built into go binary as the default value for the cache location

For GOCACHE it is true this is the default for the root user, absent any environment overrides (which the build might establish). The default GOMODCACHE value for the root user, absent any environment overrides, is /root/go/pkg/mod.

But here in lies the problem because this is what the default golang images do:

https://github.com/docker-library/golang/blob/ee2d52a7ad3e077af02313cd4cd87fd39837412c/1.15/buster/Dockerfile#L119

Which means that the actual module cache in effect is /go/pkg/mod. Hence any build operation with the "standard":

RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/root/go/pkg/mod

will not result in anything being written to/read from the cache.

And this is effectively the problem I'm trying to get at.

Hence why I think pushing the responsibility back to the caller is the right thing to do here (which I why I think there is a connection to #1512). The current approach does not allow me to say "oh and where you need a Go build cache use location X". It relies entirely on convention/identical setup between build containers and that (in my limited experience) doesn't feel particularly robust. Instead, if you make this an optional part of the "API" of such a build container, then the user can pass arguments that satisfy that API in much the same way that ARG/--build-arg work, but parameterising host values in these special --mount directives.

ID defaults to target path because that is the expected behavior for 95% of the practical valid uses of the cache mounts.

I agree with the sentiment of this point, the bit I'm not clear on is how accurate this 95% figure is, especially given the example above. The issue being that for the other 5% (which I think is likely much higher) there is a cost to bear in terms of cache misses, and critically no way of fixing the issue.

How about an approach that combines the best of both worlds? i.e. one where the caller can optionally provide an argument to parameterise the cache source; if not provided then it defaults to the target directory.

Incidentally my main use case for #1512 and fixing this issue is CI, where as the caller I need to be able to say where caches should be sourced from (because those caches are themselves cached as part of a CI run).

@tonistiigi
Copy link
Member

will not result in anything being written to/read from the cache.

That is why if you use the official golang image the "standard" is https://github.com/moby/buildkit/blob/master/Dockerfile#L87-L88 instead.

The issue being that for the other 5% (which I think is likely much higher) there is a cost to bear in terms of cache misses, and critically no way of fixing the issue.

Yes, if you want to use IDs it is harder because you need to set it and maintain that different components use the same strings. But you want to make it harder for everybody. With ID you can also set it to a path if you want to share cache with existing Dockerfile that you can't change in a different target path. If you want to share cache with a Dockerfile that you don't control you will always have this problem that the Dockerfiles need to agree that they want to share cache. With custom IDs it would just be a huge mess and almost never work as paths are much more stable than random identifier strings.

The use case to use IDs is to set it to something project-specific if you want to make sure your different projects don't share some cache artifacts. The possibility that you maintain 2 projects that use the same application cache but on a different path is pretty much non-existent.

@myitcv
Copy link
Author

myitcv commented Oct 2, 2020

That is why if you use the official golang image the "standard"

Indeed, but my point is that another Go-based build image will not necessarily use that same approach, and so you end up with multiple modules caches (because they did not override GOPATH, or used a different value).

With custom IDs it would just be a huge mess and almost never work as paths are much more stable than random identifier strings.

I agree with this, you make the point well that the choosing of IDs that can't be validated in any way is hard. The benefit of the use of target path is that it is validated, by virtue of it corresponding to a location on disk. Hence the current approach of defaulting the id to the target path does make sense.

If you want to share cache with a Dockerfile that you don't control you will always have this problem that the Dockerfiles need to agree that they want to share cache

This is exactly the problem, but also that it takes control away from me, the caller because I can't override this.

Which is why I wondered in #1706 (comment) whether we can take a slightly different approach here.

But, and apologies if this was obvious to you before, I think the penny has just dropped for me with respect to #1512. My comment in #1706 (comment) about parameterising the cache source is, I think, exactly what is proposed in #1512. In that issue, given:

RUN --mount=type=cache,target=/root/.cache,id=gocache go build ...

you propose something like:

docker build --cache-mount id=gocache,type=volume,volume=myvolume .

Given our discussion above, I would instead declare my RUN step as follows:

RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/go/pkg/mod  go build ...

Using the current definition of id (i.e. if not specified, defaults to the target path) then am I right in saying I could do this?

docker build --cache-mount id=/root/.cache/go-build,type=volume,volume=myvolume1 --cache-mount id=/go/pkg/mod,type=volume,volume=myvolume2 .

If so then I think this absolutely satisfies my need to be able to "inject" a different cache strategy as the caller.

@tonistiigi
Copy link
Member

Using the current definition of id (i.e. if not specified, defaults to the target path) then am I right in saying I could do this?

Yes, that would still work. All cache mounts have an ID, just the user doesn't need to type it in most cases.

https://github.com/moby/buildkit/blob/master/frontend/dockerfile/dockerfile2llb/convert_runmount.go#L129-L132

@myitcv
Copy link
Author

myitcv commented Oct 2, 2020

Then I think I will close this issue because my question/concern here is effectively covered by #1512 (modulo one further question of whether one can provide an "offset" into a volume).

Thanks very much for engaging to talk this through.

@myitcv myitcv closed this as completed Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants