Refactor Kustomize manifests for Katib (#1464)

* Refactor kustomize manifests
* Remove file
* Modify README
* Disable actions
* Update Trial images tag to v1beta1-c6c9172
* Fix few comments
* Remove test print
* Remove image pull policy
* Remove TODOs
* Exclude PV from Kubeflow install
* Add image versions to katib-external-db install
* Rename Katib IBM install
* Create patch file for Katib config
* Fix path for MC
* Fix var name
* Change tag in patch file
* Change download mnist for mxnet example
* Change MNIST to FashionMNIST
* Remove comment from actions
kubeflow · Mar 12, 2021 · 070f1cb
1 parent badbbdb
Showing 96 changed files with 407 additions and 1,127 deletions.
diff --git a/Makefile b/Makefile
@@ -23,11 +23,11 @@ vet:
 update:
 	hack/update-gofmt.sh
 
-# Deploy Katib v1beta1 manifests into a k8s cluster
+# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
 deploy:
 	bash scripts/v1beta1/deploy.sh
 
-# Undeploy Katib v1beta1 manifests from a k8s cluster
+# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
 undeploy:
 	bash scripts/v1beta1/undeploy.sh
 

diff --git a/README.md b/README.md
@@ -202,7 +202,7 @@ kubectl create namespace kubeflow
 Clone Kubeflow manifest repository:
 
 ```
-git clone git@github.com:kubeflow/manifests.git
+git clone -b v1.2-branch git@github.com:kubeflow/manifests.git
 Set `MANIFESTS_DIR` to the cloned folder.
 export MANIFESTS_DIR=<cloned-folder>
 ```
@@ -231,7 +231,8 @@ kustomize build . | kubectl apply -f -
 
 ### Katib
 
-Finally, you can install Katib:
+Note that your [kustomize](https://kustomize.io/) version should be >= 3.2.
+To install Katib run:
 
 ```
 git clone git@github.com:kubeflow/katib.git

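The README change above asks for kustomize 3.2 or later. As a minimal sketch of how a script could check that requirement, the Python below compares a version string against that minimum. The helper name `meets_minimum` and the parsing rule are illustrative assumptions and are not part of this commit; the output format of the kustomize CLI differs between releases, so the function only looks for the first `major.minor` pair in whatever text it is given.

```python
import re

# Minimum kustomize version stated in the README change.
MINIMUM = (3, 2)


def meets_minimum(version_text, minimum=MINIMUM):
    """Return True if the first major.minor number in version_text is >= minimum.

    Hypothetical helper: it only searches for the first "<digits>.<digits>"
    pair and does not assume any particular CLI output format.
    """
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    found = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (3, 10) >= (3, 2) is True.
    return found >= minimum


if __name__ == "__main__":
    for text in ("v3.2.0", "3.1.9", "v4.0.5"):
        print(text, meets_minimum(text))
```

Comparing integer tuples rather than strings avoids the usual pitfall where `"3.10"` sorts before `"3.2"`.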
diff --git a/docs/developer-guide.md b/docs/developer-guide.md
@@ -30,6 +30,7 @@ see the following user guides:
 
 - [Go](https://golang.org/) (1.13 or later)
 - [Docker](https://docs.docker.com/) (17.05 or later.)
+- [kustomize](https://kustomize.io/) (3.2 or later)
 
 ## Build from source code
 
@@ -65,16 +66,16 @@ make generate
 
 Below is a list of command-line flags accepted by Katib controller:
 
-| Name                            | Type                        | Default             | Description                                                                                                             |
-| ------------------------------- | --------------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| cert-localfs                    | bool                        | false               | Store the webhook cert in local file system                                                                             |
-| enable-grpc-probe-in-suggestion | bool                        | true                | Enable grpc probe in suggestions                                                                                        |
-| experiment-suggestion-name      | string                      | "default"           | The implementation of suggestion interface in experiment controller                                                     |
-| metrics-addr                    | string                      | ":8080"             | The address the metric endpoint binds to                                                                                |
-| trial-resources                 | []schema.GroupVersionKind   | null                | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)  |
-| webhook-inject-securitycontext  | bool                        | false               | Inject the securityContext of container[0] in the sidecar                                                               |
-| webhook-port                    | int                         | 8443                | The port number to be used for admission webhook server                                                                 |
-| webhook-service-name            | string                      | "katib-controller"  | The service name which will be used in webhook                                                                          |
+| Name                            | Type                      | Default            | Description                                                                                                            |
+| ------------------------------- | ------------------------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------- |
+| cert-localfs                    | bool                      | false              | Store the webhook cert in local file system                                                                            |
+| enable-grpc-probe-in-suggestion | bool                      | true               | Enable grpc probe in suggestions                                                                                       |
+| experiment-suggestion-name      | string                    | "default"          | The implementation of suggestion interface in experiment controller                                                    |
+| metrics-addr                    | string                    | ":8080"            | The address the metric endpoint binds to                                                                               |
+| trial-resources                 | []schema.GroupVersionKind | null               | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
+| webhook-inject-securitycontext  | bool                      | false              | Inject the securityContext of container[0] in the sidecar                                                              |
+| webhook-port                    | int                       | 8443               | The port number to be used for admission webhook server                                                                |
+| webhook-service-name            | string                    | "katib-controller" | The service name which will be used in webhook                                                                         |
 
 ## Workflow design
 

diff --git a/examples/v1beta1/bayesianoptimization-example.yaml b/examples/v1beta1/bayesianoptimization-example.yaml
@@ -56,7 +56,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/cmaes-example.yaml b/examples/v1beta1/cmaes-example.yaml
@@ -53,7 +53,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/custom-metricscollector-example.yaml b/examples/v1beta1/custom-metricscollector-example.yaml
@@ -66,9 +66,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                # TODO (andreyvelich): Add tag to the image.
-                image: docker.io/kubeflowkatib/pytorch-mnist:latest
-                imagePullPolicy: Always
+                image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/pytorch-mnist/mnist.py"

diff --git a/examples/v1beta1/early-stopping/median-stop.yaml b/examples/v1beta1/early-stopping/median-stop.yaml
@@ -53,7 +53,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/file-metricscollector-example.yaml b/examples/v1beta1/file-metricscollector-example.yaml
@@ -53,9 +53,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                # TODO (andreyvelich): Add tag to the image.
-                image: docker.io/kubeflowkatib/pytorch-mnist:latest
-                imagePullPolicy: Always
+                image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/pytorch-mnist/mnist.py"

diff --git a/examples/v1beta1/grid-example.yaml b/examples/v1beta1/grid-example.yaml
@@ -54,7 +54,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/hyperband-example.yaml b/examples/v1beta1/hyperband-example.yaml
@@ -68,7 +68,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/metric-strategy-example.yaml b/examples/v1beta1/metric-strategy-example.yaml
@@ -58,7 +58,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/mxnet-mnist/mnist.py b/examples/v1beta1/mxnet-mnist/mnist.py
@@ -37,40 +37,19 @@
     level=logging.DEBUG)
 
 
-def read_data(label, image):
-    """
-    download and read data into numpy
-    """
-    base_url = 'http://yann.lecun.com/exdb/mnist/'
-    with gzip.open(utils.download_file(base_url+label, os.path.join('data', label))) as flbl:
-        magic, num = struct.unpack(">II", flbl.read(8))
-        label = np.fromstring(flbl.read(), dtype=np.int8)
-    with gzip.open(utils.download_file(base_url+image, os.path.join('data', image)), 'rb') as fimg:
-        magic, num, rows, cols = struct.unpack(">IIII", fimg.read(16))
-        image = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
-    return (label, image)
-
-
-def to4d(img):
+def get_mnist_iter(args, kv):
     """
-    reshape to 4D arrays
+    Create data iterator with NDArrayIter
     """
-    return img.reshape(img.shape[0], 1, 28, 28).astype(np.float32)/255
+    mnist = mx.test_utils.get_mnist()
 
+    # Get MNIST data.
+    train_data = mx.io.NDArrayIter(
+        mnist['train_data'], mnist['train_label'], args.batch_size, shuffle=True)
+    val_data = mx.io.NDArrayIter(
+        mnist['test_data'], mnist['test_label'], args.batch_size)
 
-def get_mnist_iter(args, kv):
-    """
-    create data iterator with NDArrayIter
-    """
-    (train_lbl, train_img) = read_data(
-        'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
-    (val_lbl, val_img) = read_data(
-        't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz')
-    train = mx.io.NDArrayIter(
-        to4d(train_img), train_lbl, args.batch_size, shuffle=True)
-    val = mx.io.NDArrayIter(
-        to4d(val_img), val_lbl, args.batch_size)
-    return (train, val)
+    return (train_data, val_data)
 
 
 if __name__ == '__main__':

diff --git a/examples/v1beta1/nas/darts-example-cpu.yaml b/examples/v1beta1/nas/darts-example-cpu.yaml
@@ -59,8 +59,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-e294a90
-                imagePullPolicy: Always
+                image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-c6c9172
                 command:
                   - python3
                   - run_trial.py

diff --git a/examples/v1beta1/nas/darts-example-gpu.yaml b/examples/v1beta1/nas/darts-example-gpu.yaml
@@ -76,8 +76,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-e294a90
-                imagePullPolicy: Always
+                image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-c6c9172
                 command:
                   - python3
                   - run_trial.py

diff --git a/examples/v1beta1/nas/enas-example-cpu.yaml b/examples/v1beta1/nas/enas-example-cpu.yaml
@@ -139,7 +139,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v1beta1-c6c9172
                 command:
                   - python3
                   - -u

diff --git a/examples/v1beta1/nas/enas-example-gpu.yaml b/examples/v1beta1/nas/enas-example-gpu.yaml
@@ -136,7 +136,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/enas-cnn-cifar10-gpu:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/enas-cnn-cifar10-gpu:v1beta1-c6c9172
                 command:
                   - python3
                   - -u

diff --git a/examples/v1beta1/pytorch-mnist/mnist.py b/examples/v1beta1/pytorch-mnist/mnist.py
@@ -11,12 +11,6 @@
 import torch.nn.functional as F
 import torch.optim as optim
 
-# To fix this issue: https://github.com/pytorch/vision/issues/1938.
-from six.moves import urllib
-opener = urllib.request.build_opener()
-opener.addheaders = [("User-agent", "Mozilla/5.0")]
-urllib.request.install_opener(opener)
-
 WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))
 
 
@@ -138,18 +132,22 @@ def main():
         dist.init_process_group(backend=args.backend)
 
     kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}
+
     train_loader = torch.utils.data.DataLoader(
-        datasets.MNIST("../data", train=True, download=True,
-                       transform=transforms.Compose([
-                           transforms.ToTensor(),
-                           transforms.Normalize((0.1307,), (0.3081,))
-                       ])),
+        datasets.FashionMNIST("./data",
+                              train=True,
+                              download=True,
+                              transform=transforms.Compose([
+                                  transforms.ToTensor()
+                              ])),
         batch_size=args.batch_size, shuffle=True, **kwargs)
+
     test_loader = torch.utils.data.DataLoader(
-        datasets.MNIST("../data", train=False, transform=transforms.Compose([
-            transforms.ToTensor(),
-            transforms.Normalize((0.1307,), (0.3081,))
-        ])),
+        datasets.FashionMNIST("./data",
+                              train=False,
+                              transform=transforms.Compose([
+                                  transforms.ToTensor()
+                              ])),
         batch_size=args.test_batch_size, shuffle=False, **kwargs)
 
     model = Net().to(device)

diff --git a/examples/v1beta1/pytorchjob-example.yaml b/examples/v1beta1/pytorchjob-example.yaml
@@ -45,9 +45,7 @@ spec:
               spec:
                 containers:
                   - name: pytorch
-                    # TODO (andreyvelich): Add tag to the image.
-                    image: docker.io/kubeflowkatib/pytorch-mnist:latest
-                    imagePullPolicy: Always
+                    image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
                     command:
                       - "python3"
                       - "/opt/pytorch-mnist/mnist.py"
@@ -61,9 +59,7 @@ spec:
               spec:
                 containers:
                   - name: pytorch
-                    # TODO (andreyvelich): Add tag to the image.
-                    image: docker.io/kubeflowkatib/pytorch-mnist:latest
-                    imagePullPolicy: Always
+                    image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
                     command:
                       - "python3"
                       - "/opt/pytorch-mnist/mnist.py"

diff --git a/examples/v1beta1/random-example.yaml b/examples/v1beta1/random-example.yaml
@@ -53,7 +53,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/resume-experiment/from-volume-resume.yaml b/examples/v1beta1/resume-experiment/from-volume-resume.yaml
@@ -54,7 +54,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/resume-experiment/never-resume.yaml b/examples/v1beta1/resume-experiment/never-resume.yaml
@@ -56,7 +56,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/tekton/pipeline-run.yaml b/examples/v1beta1/tekton/pipeline-run.yaml
@@ -88,7 +88,7 @@ spec:
                     description: Number of training examples
                 steps:
                   - name: model-training
-                    image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                    image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                     command:
                       - "python3"
                       - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/tfjob-example.yaml b/examples/v1beta1/tfjob-example.yaml
@@ -53,7 +53,6 @@ spec:
                 containers:
                   - name: tensorflow
                     image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
-                    imagePullPolicy: Always
                     command:
                       - "python"
                       - "/var/tf_mnist/mnist_with_summaries.py"

diff --git a/examples/v1beta1/tpe-example.yaml b/examples/v1beta1/tpe-example.yaml
@@ -53,7 +53,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

diff --git a/examples/v1beta1/trial-metadata-substitution.yaml b/examples/v1beta1/trial-metadata-substitution.yaml
@@ -59,7 +59,7 @@ spec:
           spec:
             containers:
               - name: training-container
-                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
+                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
                 command:
                   - "python3"
                   - "/opt/mxnet-mnist/mnist.py"

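Most of the example changes above are the same substitution: an image tag (`v1beta1-e294a90`, or `latest`) is replaced with `v1beta1-c6c9172`. As a rough illustration of that substitution, the sketch below rewrites the tag of an image reference. The helper name `retag` is hypothetical and is not part of this commit, and it assumes a plain `repository[:tag]` reference with no digest.

```python
def retag(image, new_tag):
    """Return image with its tag replaced by new_tag.

    Hypothetical helper; assumes a "repository[:tag]" reference without a
    digest ("@sha256:..."). A colon only counts as a tag separator when it
    appears after the last "/", so a registry port such as
    "localhost:5000/img" is left alone.
    """
    last_slash = image.rfind("/")
    colon = image.rfind(":")
    repository = image[:colon] if colon > last_slash else image
    return f"{repository}:{new_tag}"


if __name__ == "__main__":
    old = "docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90"
    print(retag(old, "v1beta1-c6c9172"))
```

Pinning a commit-based tag in place of `latest` is also what lets the diffs above drop `imagePullPolicy: Always`, since a fixed tag does not need to be re-pulled to pick up changes.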
diff --git a/...a1/katib-controller/katib-controller.yaml → ...ta1/components/controller/controller.yaml b/...a1/katib-controller/katib-controller.yaml → ...ta1/components/controller/controller.yaml
@@ -3,6 +3,7 @@ kind: Deployment
 metadata:
   name: katib-controller
   namespace: kubeflow
+  # TODO (andreyvelich): Modify labels to follow k8s guidelines.
   labels:
     app: katib-controller
 spec:
@@ -21,7 +22,6 @@ spec:
       containers:
         - name: katib-controller
           image: docker.io/kubeflowkatib/katib-controller
-          imagePullPolicy: Always
           command: ["./katib-controller"]
           args:
             - "--webhook-port=8443"