CI: We should be able to complete a build in 3 hours even without Docker image cache #49278
Comments
I think the worst offenders here are definitely the dist images where we're commonly building entire gcc toolchains for older compat and whatnot. One of the major pros, I think, is that we can test out changes to toolchains before they land. That is, if you update the toolchain in the CentOS image (for whatever reason) it'll get automatically rebuilt and the PR is rejected if it causes a failure. A downside with precompiled binaries is that we typically don't know if they work, so we'd have to run the precompiled build multiple times. That being said I think there's a lot we can do to improve this of course! I'd love to have incremental caching of docker images like we have with Overall it's basically impossible I think for us to get uncached builds to be under 3 hours, mainly because of dist builders (which require building gcc toolchains) and three (!) versions of LLVM. In that sense I see the main takeaway here is that we should improve the caching strategy with docker layers on Travis, which I've always wanted to do for sure! |
Another idea is that this may be a perfect application for Travis's "stages". If we figure out an easy way to share the docker layers between images (aka fast network transfers) then we could split up the docker build stage from the actual rustc build stage. I think that'd far extend our timeout and we'd always comfortably fit within the final time limit. |
Er, scratch the idea of Travis stages, they wouldn't work here. It looks like Travis stages only support parallelism inside one stage, and otherwise stages are sequential. That's not quite what we want here... |
ci: Don't use Travis caches for docker images

This commit moves away from caching on Travis to our own caching on S3 for caching docker layers between builds. Unfortunately the Travis caches have over time had a few critical pain points:

* Caches are only updated for successful builds, meaning that if a build times out or fails in a different location the successfully-created docker image isn't always cached. While this makes sense as a general rule of caches it hurts our use cases.
* Caches are per-branch and builder which means that we don't have a separate cache on each release channel. All our merges go through the `auto` branch which means that they're all sharing the same cache, even those for merging to master/beta. This means that PRs which switch between master/beta will keep rebuilding and having cache misses.
* Caches have historically been invalidated somewhat regularly, a little more aggressively than we'd want (I think).
* We don't always need to update the contents of the cache if the Docker image didn't change at all, and saving off the docker layers can sometimes be quite expensive.

For all these reasons this commit drops the usage of Travis's built-in caching support. Instead our own caching is used by storing blobs to S3. Normally this would be a very risky endeavour but we're basically priming a cache for a cache (docker) so if we get this wrong the failure mode is longer builds, not stale caches. We'll notice that pretty quickly and hopefully fix it!

The logic here is inserted directly into the `src/ci/docker/run.sh` script to download an image based on a shasum of the `Dockerfile` and other assorted files. This blob, if found, is loaded into docker and we record what layers were inserted. After docker finishes the build (hopefully quickly with lots of cache hits) we then see the sha of the final image. If it's one of the layers we loaded then there's no need to update the cache. Otherwise we upload our layers to the global cache, possibly overwriting what we previously just downloaded.

This is hopefully a step towards mitigating #49278 although it doesn't completely fix it as it means we'll still probably have to retry builds that bust the cache.
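For readers who want the two decisions in that commit message spelled out, here is a minimal sketch in Python. It is not the actual `src/ci/docker/run.sh` logic (which is a shell script). The function names `cache_key` and `needs_upload` are hypothetical, the choice of SHA-512 is an assumption, and the S3 download, `docker load` and upload steps are left out entirely.

```python
import hashlib
from pathlib import Path


def cache_key(files):
    """Derive one cache key from the contents of the Dockerfile and related files.

    Assumption: any stable digest would do; SHA-512 is used here for illustration.
    """
    digest = hashlib.sha512()
    # Sort so the key does not depend on the order the files were listed in.
    for path in sorted(str(f) for f in files):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def needs_upload(final_image_id, loaded_layer_ids):
    """Return True when the finished build produced an image we did not just load.

    If the final image is one of the layers loaded from the cache, the cached
    blob is already up to date and the upload can be skipped.
    """
    return final_image_id not in set(loaded_layer_ids)
```

With this shape, editing the `Dockerfile` changes the key and so causes a cache miss, while an unchanged image never triggers a re-upload. The worst case of a wrong key is a slower build, which matches the "cache for a cache" reasoning above.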
Maybe, similarly to #49284, it would be better to upload all intermediate images (all layers) to S3? |
@steffengy seems plausible to me! Right now it's only done on success but I think we could do it on failure as well |
@alexcrichton Was it an intentional design decision to tie caching to such a hash instead of to the $IMAGE itself?
|
@steffengy tying it purely to
I'm not sure which strategy is better, neither is working that great I think :( |
@alexcrichton Yeah, the first one is definitely an issue (the second one is essentially the same). Was using a docker registry instead of S3 for caching discussed before?
Then after each build one would push to e.g. Disclaimer: This is untested and only relies on assumptions and discussions I found regarding this topic. |
@steffengy oh I'd totally be down for using a docker registry as I think it'd for sure solve most problems, I just have no idea how to run one or how we might set that up! Is this something that'd be relatively lightweight to do? |
Setting up a registry is the easier part:
Adjustments to the approach above...Some local tests showed that Resulting Possibilities
@alexcrichton
Let me know what you think. |
@steffengy hm so I may not be fully understanding what this implies, but isn't the registry approach the same as just using a cache location determined by the image/branch name? We're sort of running a "pseudo registry" today with curl and I could be missing something though! |
Also take a look at buildkit: it offers more flexibility over |
@alexcrichton Yeah, it wouldn't really provide much to justify the additional work over tagging by image & branch on S3. @ishitatsuyuki's suggestion of buildkit might be interesting, but seems to require quite a bit of work:
A few more ideas for inspiration:
|
@steffengy I think for now we should probably just switch to Learning the branch isn't the easiest thing in the world right now unfortunately but I imagine we could throw something in there to make it not too bad. |
@alexcrichton Maybe it makes sense to combine that with trying to load from both |
@steffengy perhaps, yeah, but I've found that the curl + docker load can often take quite a while for larger images, and they change rarely enough today anyway that I don't think it'd outweigh the cost |
Triage: Over two years later, I am not sure if this ticket is still relevant, given the amount of changes we've undergone. Maybe it is, maybe it isn't. |
The |
I saw a lot of custom caching tricks in https://github.com/rust-lang/rust/blob/fd815a5091eb4d49cd317f8ad272f17b7a5f550d/src/ci/docker/run.sh, but little use of the Docker Engine's or other container builders' caching features, such as Is a PR in this direction helpful? I'm not sure who to ping, perhaps @hkratz @pietroalbini? |
And, why the custom caching and can we reduce that at some point? |
Our CI builders (except macOS and Windows) use Docker, and we'll cache the Docker repository on Travis. Thanks to the cache, normally the `docker build` command only takes a few seconds to complete. However, when the cache is invalidated for whatever reason, the Docker image will need to be actually built, and this may take a very long time.

Recently this happened with #49246: the Docker image cache of `dist-x86_64-linux alt` became stale and thus needed to be built from scratch. One of the steps involves compiling GCC. The whole `docker build` command thus takes over 40 minutes. Worse, the `alt` builders have assertions enabled, and thus all stage1+ `rustc` invocations are slower than their normal counterpart. Together, it is impossible to complete within 3 hours. Travis will not update the cache unless the build is successful. Therefore, I need to exclude RLS, Rustfmt and Clippy from the distribution, to ensure the job is passing.

I don't think we should entirely rely on Travis's cache for speed. Ideally, the `docker build` command should at most spend 10 minutes, assuming good network speed (~2 MB/s on Travis) and reasonable CPU performance (~2.3 GHz × 4 CPUs on Travis).

In the `dist-x86_64-linux alt` case, if we host the precompiled GCC 4.8.5 for Centos 5, we could have trimmed 32 minutes out of the Docker build time, which allows us to complete the build without removing anything.
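The numbers in this report can be checked with a quick back-of-the-envelope calculation. The sketch below only restates figures quoted above. It treats "over 40 minutes" as exactly 40, which is an assumption, and the helper names are hypothetical.

```python
def remaining_build_minutes(total_minutes, gcc_minutes):
    """Docker build time left once the GCC compile is replaced by a download."""
    return total_minutes - gcc_minutes


def max_download_mb(budget_minutes, bandwidth_mb_per_s):
    """Upper bound on data fetched if the whole budget were spent downloading."""
    return budget_minutes * 60 * bandwidth_mb_per_s


# Figures from the report: ~40 min build, 32 min of it compiling GCC,
# a 10 minute target, and roughly 2 MB/s of network speed on Travis.
remaining = remaining_build_minutes(40, 32)
budget_mb = max_download_mb(10, 2)
```

Under those assumptions about 8 minutes of build time remain, which sits inside the 10 minute target only if fetching the precompiled GCC is cheap. At ~2 MB/s, ten minutes of pure downloading would move at most about 1200 MB.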