mlflow-builder: fix OOM failures during build with bigger images #74
If the k8s node where the MLFlow builder step is running is short on
memory, the builder step will fail when it has to build larger images.
For example, building the trainer image for the keras CIFAR10 codeset
example resulted in an OOM failure on a node where only 8 GB of memory
was available.

This is a known kaniko issue [1] and there's a fix available [2] with
more recent (>=1.7.0) kaniko versions: disabling compressed caching
via the `--compressed-caching` command line argument.

This commit models a workflow input parameter mapped to this new
command line argument. To avoid OOM errors with bigger images, the
user may set this parameter in the workflow.
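
As a rough illustration of the mapping described above, the following Python sketch shows how a boolean input could be translated into the kaniko argument. It is not the code from this commit: the function name `kaniko_args`, the parameter name `compressed_caching`, and the example context and destination values are all hypothetical. Only the `--context`, `--destination` and `--compressed-caching` flags are taken from kaniko itself.

```python
from typing import List


def kaniko_args(context: str, destination: str,
                compressed_caching: bool = True) -> List[str]:
    """Build an argument list for the kaniko executor (hypothetical helper).

    `compressed_caching` stands in for the workflow input parameter; its
    real name in the workflow is not shown here and is assumed.
    """
    args = [
        f"--context={context}",
        f"--destination={destination}",
    ]
    # kaniko >= 1.7.0 accepts --compressed-caching. Passing "false"
    # lowers peak memory use when building large images.
    if not compressed_caching:
        args.append("--compressed-caching=false")
    return args


if __name__ == "__main__":
    # Placeholder values, chosen only for the example.
    print(kaniko_args("dir:///workspace", "registry.example/trainer:latest",
                      compressed_caching=False))
```

The flag is only appended when compressed caching is turned off, so in this sketch builds that leave the input at its default run kaniko exactly as before.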
[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722