Remove tarball.WithCompressedCaching flag to resolve OOM Killed error #1722
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here.
@googlebot I signed it!
I just improved my PR by adding a command line flag. Now it is possible to set [...]. I really hope that one of the maintainers can pick this up.
Large images cannot be built, as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers.

Fixes GoogleContainerTools#1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.
b80c225 to 78fe2f5
Rebased over latest master! Will merge and pick up for release.
If the k8s node where the MLFlow builder step is running doesn't have a lot of memory, the builder step will fail if it has to build larger images. For example, building the trainer image for the keras CIFAR10 codeset example resulted in an OOM failure on a node where only 8GB of memory were available.

This is a known kaniko issue [1] and there's a fix available [2] with more recent (>=1.7.0) kaniko versions: disabling the compressed caching via the `--compressed-caching` command line argument.

This commit models a workflow input parameter mapped to this new command line argument. To avoid OOM errors with bigger images, the user may set it in the workflow like so:

```
- name: builder
  image: ghcr.io/stefannica/mlflow-builder:latest
  inputs:
    - name: mlflow-codeset
      codeset:
        name: '{{ inputs.mlflow-codeset }}'
        path: /project
    - name: compressed_caching
      # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
      value: false
```

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722
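As a rough illustration of the mapping described in that commit, a builder step could translate the `compressed_caching` input into the kaniko executor argument along these lines. This is only a sketch: the `builderInputs` type, its fields, and `executorArgs` are hypothetical names invented here, and the actual builder may be implemented quite differently. Only the `--context`, `--destination`, and `--compressed-caching` executor flags are taken from kaniko itself.

```go
package main

import (
	"fmt"
	"strconv"
)

// builderInputs is a hypothetical view of the workflow step inputs.
// It is not a type from kaniko or from the builder image.
type builderInputs struct {
	Context           string // build context, e.g. "dir:///project"
	Destination       string // image reference to push to
	CompressedCaching bool   // value of the compressed_caching input
}

// executorArgs turns the inputs into kaniko executor arguments.
// The boolean is always passed explicitly so the effective setting
// is visible in the step's logs.
func executorArgs(in builderInputs) []string {
	return []string{
		"--context=" + in.Context,
		"--destination=" + in.Destination,
		"--compressed-caching=" + strconv.FormatBool(in.CompressedCaching),
	}
}

func main() {
	args := executorArgs(builderInputs{
		Context:           "dir:///project",
		Destination:       "registry.example.com/trainer:latest", // placeholder registry
		CompressedCaching: false,
	})
	for _, a := range args {
		fmt.Println(a)
	}
}
```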
Description
Large images cannot be built, as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes #1680
Removing the tarball compression may increase the build time for smaller images. Therefore a command line option to disable the compression was chosen for the implementation.
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
I may need some support here with writing a good integration test.
Currently the integration tests are failing due to #1741 (Fix executor Dockerfile, which wasn't building).
See the contribution guide for more details.
Reviewer Notes
Release Notes