-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(dockerfile): Add pip caching for faster build #35026
feat(dockerfile): Add pip caching for faster build #35026
Conversation
8753e0c
to
bceb2fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to learn more about your use case and see how and whether it can help at all.
I am not sure if it changes anything and it adds some assumptions that the image will be built several times on the same machine while also for some reason it will not use the already cached layers.
Could you please show some usage where it actually helps and improve things ? Some benchmarks showing what you've done and before/after would be good.
Quite for sure caching when installing pip
(first part of your change) has no effect.
This is something that is only done once to upgrade to latest version and then (regardless of caching) pip
does not get re-installed again, because it is already in the target version. And (unless you have some other scenario in mind) re-using cache will only actually work when you use the same docker engine. And in this case - unless you change your docker file parameters, it is already handled by docker layer caching. I.e. if you are building your docker image twice in the same docker engine with the same parameters, then caching is done at the "container layer" level. So when you run your build again, on the same docker engine you should not see it trying to install Python again. You should instead see that CACHED layer is used. And this is all without having to mount cache volume.
So I wonder why you would like to cache pip upgrade with cache.
The second case might be a bit more interesting - but there also you will see reinstallation only happening if you change requirements.txt. And yes - in this case it will be a bit faster when you just add one requirement, but this will also work for the same user when rebuilding the image over and over again on the same machine, which is generally not something that you should use airflow's dockerfile for (usually) - usually this can be done by extending the image (FROM apache/airflow
) and adding lines in your own Dockerfile (where you could use indeed --mount-type cache etc.).
I wonder if you could describe your case in more detail and show some benchmark example of how much time is saved for particular scenarios.
I am not totally against it, but I would love to understand your case, because a) maybe you do not gain as much as you think or b) maybe you are doing somethign that prevents you from using docker layer caching or c) maybe you are using the Dockerfile of Airlfow in unintended way.
Looking forward for more detailed explanation of your case.
So caching pip
install makes no sense at all.
I agree, it will be useful in the CI when moby/buildkit#1512 will be implemented, in that case we can combine it with Github cache to persist the mounted cache. This could work now if we replace
This might be useful to someone playing with the requirements file to resolve a conflict, so if there is no potential risk from this change I would be OK with merging it. |
Yep I think this one is not a bad idea if there are no side-effects. I would still love to see the scenarios and ways how it is used. |
The goal here is to try every possible way to improve the build time of Airflow. We are using custom building and in our CI, the build doesnt use the cached layers 😭 So we have an average build time of 8minutes. This is not much but a lot of people are working on our repository and we would like to improve their developer experience as a whole.
Yes, in our CI, the layers are not cached. We are trying to use the buildkit experimental feature :
and in our docker buildx build .... \
--cache-to=type=gha,scope=platform-$ARCH,mode=max \
--cache-from=type=gha,scope=platform-$ARCH" but Github Actions is still not re-using the cached layer 😞
Ok, so it's not a proper benchmark, I just used the Without mounting cache:$ time docker build \
--build-arg PYTHON_BASE_IMAGE=python:3.11-slim-bullseye \
--build-arg AIRFLOW_VERSION=2.6.3 \
--build-arg AIRFLOW_HOME=/usr/local/airflow \
--build-arg AIRFLOW_EXTRAS=async,celery,redis,google_auth,hashicorp,statsd \
--build-arg 'ADDITIONAL_PYTHON_DEPS=apache-airflow-providers-amazon==8.3.1 apache-airflow-providers-celery==3.2.1 apache-airflow-providers-common-sql==1.6.0 apache-airflow-providers-cncf-kubernetes==7.4.2 apache-airflow-providers-datadog==3.3.1 apache-airflow-providers-docker==3.7.1 apache-airflow-providers-ftp==3.4.2 apache-airflow-providers-github==2.3.1 apache-airflow-providers-google==10.3.0 apache-airflow-providers-http==4.4.2 apache-airflow-providers-imap==3.2.2 apache-airflow-providers-mysql==5.1.1 apache-airflow-providers-postgres==5.5.2 apache-airflow-providers-sftp==4.3.1 apache-airflow-providers-slack==7.3.1 apache-airflow-providers-ssh==3.7.1 apache-airflow-providers-tableau==4.2.1 apache-airflow-providers-zendesk==4.3.1 apache-airflow-providers-salesforce==5.4.1' \
--build-arg 'ADDITIONAL_RUNTIME_APT_DEPS=groff gettext git' \
--build-arg AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/constraints-airflow.txt \
--build-arg DOCKER_CONTEXT_FILES=docker-context-files \
--build-arg INSTALL_MYSQL_CLIENT=true \
--build-arg INSTALL_MSSQL_CLIENT=false \
--build-arg ADDITIONAL_DEV_APT_DEPS=pkgconf \
--no-cache
33.74s user 15.55s system 11% cpu 7:01.16 total In the above, all pip requirements have been downloaded before being installed. In my case, I don't have at the moment a good internet connection, so it is even more slower than someone with a good internet connection WITH mounting cache:$ time docker build
...
--no-cache
17.52s user 7.62s system 7% cpu 5:16.20 total The
Sure, mistake from my side, I removed the line in 77c6813
Here is more about optimising the build for people trying to fix constraints, or just when we would like to re-build the image after adding one more requirement (as you said and as @hussein-awala pointed out). It is always nicer when it is really optimised at its fullest in my opinion, but I understand that the trouble might not be worth it. |
Also, I don't have the feeling that your CI is using cache layers even though it should?. As @hussein-awala said, once buildkit will have implemented a way to use |
It's because you were extremely unlucky when checking it. Python just released a new base image 4 hours ago, and our cache will refresh with the new base image after we merge something to main and the build in The way how we are using remote caching is specifically designed to automatically handle such cache refreshes. The whole CI of ourse is designed to automatically upgrade to latest versions of the base images, apt dependencies and Python dependencies as soon as:
This allows us to automtically upgrade and test our software and test suite to "latest", "most secure" and "consistent" set of dependencies (including base dependencies for the image), while our PRs keeps on using the "stable and confirmed to work in the past" set of python dependencies. This way we can detect breaking changes in incoming dependencies. We detected many breaking changes this way "early" - hence Indeed the build you see does not use the cache that corresponds to the latest pyhon version that's why it takes 14 minutes. Because cache is not yet refreshed. But it still works (which is great) while the cache is being rebuilt and refreshed. You can see the same job from 5 hours ago https://github.com/apache/airflow/actions/runs/6565914251/job/17835531939#step:6:1070 which lasted 1 minute and actually used the cache. |
Looking at the discussion @V0lantis @hussein-awala and thinking a bit - yeah, I think it's actually quite fine to use the local cache (even for pip installation) as long as it does not interfere with any of the caching mechanisms of ours and ability to build different images (especially for multiple python versions). As long as the remote cache works as it did (it should), aadding those local caches should not impact the build. And possibly can speed up some local build cases. I have one worry though. Knowing that "cache invalidation is the most difficult thing". I am not 100% sure how it will work - how does the local cache gets allocated when you have differnet arguments passed. Do you know how the local cache is determined in case we change parameters/arguments and use the same Dockerfile? To be honest I am quite worried about the case where base python changes (see my previous comment), about this case:
Then (note Python version change):
And about the case where I am asking all those questions, because if any problem like that happens, someone who will have it, will inevitably open isssue in our repo and will (rightfully) complain and ask us what to do. I do recall cases in the past that The main reason why we have Curious about your thoughts on whether this is something we should worry about ? Maybe cache location be dependent on Python version and PIP_VERSION to mitigate those simply ? Maybe we should have a way to clear the cache when things get broken (other than nuking whole docker installation). And finally possibly we should document it somewhere what to do when it gets broken. WDYT? |
Or changes to content used in any of the layers above pip install. It seems like it could be quite many, eg
YES! Good observation, this is a very possible issue. When doing similar caching with |
There were likely 10s iterations or so of making the right sequence of COPY/RUN there and also - in the main stage the number of layers is minimised. But maybe some of it could be improved still - it's open to contributions :). The whole Dockerfile is now designed with the "incremental rebuild" in mind - it does selectively COPY only those things that are supposed to be used in the next step. I think what Actually final
And then we add startup scripts .. I believe we could add |
Just to complete it - explaining the The
This layer is not invalidated when setup.py/requirements change, so we use it to pre-install the dependencies in main, so that if someone adds new dependency, it will not reinstall everything from scratch.
So any "local" caching of python installation should be done here - across the "airflow-build-image" segment @V0lantis - if you want to pursue it and attempt to do it in the way that we avoid the potential cache invalidation issues I mentioned (we could also add apt-caching as suggested by @hterik - those adding/iterating with new |
Thank you @potiuk for this detailed comment on how the Dockerfile is built. I came up with the following (d952c9c5b1): To invalidate cache between two different python versions, I am using the Result : => [airflow-build-image 15/16] RUN --mount=type=cache,[id=dockerhub.containers.mpi-internal.com/python:3.11-slim-bullseye-arm64,target=/home/airflow/.cache/pip,uid=50000](http://id%3Ddockerhub.containers.mpi-internal.com/python:3.11-slim-bullseye-arm64,target=/home/airflow/.cache/pip,uid=50000) if [[ false == "true" ]]; then bash /scr 7.7s |
Yep. I like this approach a lot. If we make it a little more elaborated, I am perfectly ok with it and solves all my caching problems. There are two things I would like to add:
And of course we need to add documentation to https://github.com/apache/airflow/blob/main/docs/docker-stack/build.rst and https://github.com/apache/airflow/blob/main/docs/docker-stack/build-arg-ref.rst and release notes here: https://github.com/apache/airflow/blob/main/docs/docker-stack/changelog.rst (targetting it for 2.7.3) |
Also re- architecture - I think it is not really needed. Building prod image on a different architecture is unllikely to happen on the same machine (building the image on ARM for X86 is ~ 16 times slower with emulation so it takes literally hours). |
On the other hand: I read this one: moby/buildkit#2598 , and yeah we should leave it in since TARGETARCH ARG is conveniently set by default, so it's perfectly OK to leave it in |
Good. Thanks for the hint |
fbdcf34
to
b14f42e
Compare
I updated the documentation in b14f42e784 but I am not very familiar with changelog updates. Would you be so kind to show me the way? Thank so much otherwise for all your hard work and hints @potiuk 💪 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic. Looks great . Thanks for being responsive. For changelog - I think we just need new line here: https://github.com/apache/airflow/blob/main/docs/docker-stack/changelog.rst#airflow-27
Add it for 2.7.3 please so that this change is cherry-pickable and we will have the last chance to merge it (2.7.3 is just being cut).
UPDATE: I will add it to do it faster.
Remove cache mounting during pip version install because cache isn't necessary here
- Add AIRFLOW_PIP_VERSION for having a cache per pip version - Add PIP_CACHE_EPOCH to be able to invalidate cache easily
b14f42e
to
eafb6be
Compare
Yes, but I don't understand what makes the CI fail right now.
|
Unfold previous step and see the output of docker build comand - it fails to build. Possibly there is an error somewhere there because (independent from that) in the previous step because it's the previous step that should fail. Try it locally |
Hey ! I am passing pip cache directory directly to an env variable ( Wdyt guys? |
Yep.
Yep. Good idea. |
Congrats 🎉 |
The apache#35026 has been cherry-picked to 2.7.3 (with updated changelog) so we need to also update changelog in main.
--------- Co-authored-by: Arthur Volant <arthur.volant@adevinta.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com> (cherry picked from commit 66871a0)
The #35026 has been cherry-picked to 2.7.3 (with updated changelog) so we need to also update changelog in main.
The apache#35026 has been cherry-picked to 2.7.3 (with updated changelog) so we need to also update changelog in main.
Pip caching
For our corporation, we are doing a custom build. I noticed that every time, docker was downloading again and again the python packages and I thought it was a little bit too slow. I might want to enlarge this usage of pip caching for our CI as well.
wdyt?
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.