Refactor Kustomize manifests for Katib (#1464)
* Refactor kustomize manifests

* Remove file

* Modify README

* Disable actions

* Update Trial image tags to v1beta1-c6c9172

* Fix a few comments

* Remove test print

* Remove image pull policy

* Remove TODOs

* Exclude PV from Kubeflow install
Add image versions to katib-external-db install

* Rename Katib IBM install

* Create patch file for Katib config

* Fix path for MC

* Fix var name

* Change tag in patch file

* Change MNIST download for MXNet example

* Change MNIST to FashionMNIST

* Remove comment from actions
andreyvelich authored Mar 12, 2021
1 parent badbbdb commit 070f1cb
Showing 96 changed files with 407 additions and 1,127 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -23,11 +23,11 @@ vet:
update:
hack/update-gofmt.sh

# Deploy Katib v1beta1 manifests into a k8s cluster
# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
deploy:
bash scripts/v1beta1/deploy.sh

# Undeploy Katib v1beta1 manifests from a k8s cluster
# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
undeploy:
bash scripts/v1beta1/undeploy.sh

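The `deploy` and `undeploy` targets delegate to `scripts/v1beta1/deploy.sh` and `scripts/v1beta1/undeploy.sh`, whose contents are not part of this diff. As a rough illustration of a Kustomize-based deploy, the Python sketch below builds the same kind of `kustomize build <dir> | kubectl apply -f -` pipeline that the README uses. The manifest directory argument is a placeholder, and using `kubectl delete` for undeploy is an assumption, not necessarily what the scripts do.

```python
import subprocess
from typing import List, Tuple


def kustomize_apply_commands(
    manifest_dir: str, delete: bool = False
) -> Tuple[List[str], List[str]]:
    """Return both halves of `kustomize build DIR | kubectl apply|delete -f -`.

    `manifest_dir` is a hypothetical argument; the real scripts may differ.
    """
    build = ["kustomize", "build", manifest_dir]
    kubectl = ["kubectl", "delete" if delete else "apply", "-f", "-"]
    return build, kubectl


def run_pipeline(manifest_dir: str, delete: bool = False) -> None:
    """Render manifests with kustomize and stream them into kubectl via stdin."""
    build, kubectl = kustomize_apply_commands(manifest_dir, delete)
    rendered = subprocess.run(build, check=True, capture_output=True, text=True)
    subprocess.run(kubectl, input=rendered.stdout, check=True, text=True)
```

Keeping command construction separate from execution makes the pipeline easy to inspect (or print as a dry run) without a cluster.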
5 changes: 3 additions & 2 deletions README.md
@@ -202,7 +202,7 @@ kubectl create namespace kubeflow
Clone Kubeflow manifest repository:

```
git clone git@github.com:kubeflow/manifests.git
git clone -b v1.2-branch git@github.com:kubeflow/manifests.git
Set `MANIFESTS_DIR` to the cloned folder.
export MANIFESTS_DIR=<cloned-folder>
```
@@ -231,7 +231,8 @@ kustomize build . | kubectl apply -f -

### Katib

Finally, you can install Katib:
Note that your [kustomize](https://kustomize.io/) version should be >= 3.2.
To install Katib run:

```
git clone git@github.com:kubeflow/katib.git
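The README and the developer guide now require kustomize 3.2 or later. A minimal sketch of a version gate, assuming the output of `kustomize version` contains a `MAJOR.MINOR[.PATCH]` string somewhere. The exact output format differs between kustomize releases, so the regex below is a heuristic and the helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple

MIN_KUSTOMIZE = (3, 2)

# Matches the first "MAJOR.MINOR[.PATCH]" in the text, with an optional "v".
_VERSION_RE = re.compile(r"v?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_kustomize_version(output: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from `kustomize version` output."""
    match = _VERSION_RE.search(output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def kustomize_is_supported(
    output: str, minimum: Tuple[int, int] = MIN_KUSTOMIZE
) -> bool:
    """True if the parsed (major, minor) is at least `minimum`."""
    version = parse_kustomize_version(output)
    return version is not None and version[:2] >= minimum


def installed_kustomize_output() -> str:
    """Run `kustomize version` and return stdout (requires kustomize on PATH)."""
    result = subprocess.run(
        ["kustomize", "version"], check=True, capture_output=True, text=True
    )
    return result.stdout
```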
21 changes: 11 additions & 10 deletions docs/developer-guide.md
@@ -30,6 +30,7 @@ see the following user guides:

- [Go](https://golang.org/) (1.13 or later)
- [Docker](https://docs.docker.com/) (17.05 or later.)
- [kustomize](https://kustomize.io/) (3.2 or later)

## Build from source code

@@ -65,16 +66,16 @@ make generate

Below is a list of command-line flags accepted by Katib controller:

| Name | Type | Default | Description |
| ------------------------------- | --------------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| cert-localfs | bool | false | Store the webhook cert in local file system |
| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
| webhook-service-name | string | "katib-controller" | The service name which will be used in webhook |
| Name | Type | Default | Description |
| ------------------------------- | ------------------------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| cert-localfs | bool | false | Store the webhook cert in local file system |
| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
| webhook-service-name | string | "katib-controller" | The service name which will be used in webhook |

## Workflow design

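The `trial-resources` flag takes entries of the form `Kind.version.group`. The sketch below parses that format and renders a flag map into container args in the `--name=value` style used by the controller Deployment (`--webhook-port=8443`). Passing one `--trial-resources=` flag per resource is an assumption about how the list is supplied, and `TrialResource`, `parse_trial_resource`, and `controller_args` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass(frozen=True)
class TrialResource:
    kind: str
    version: str
    group: str


def parse_trial_resource(value: str) -> TrialResource:
    """Parse 'Kind.version.group', e.g. 'TFJob.v1.kubeflow.org'.

    Only the first two dots are split on, because API groups such as
    'kubeflow.org' contain dots themselves.
    """
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {value!r}")
    return TrialResource(*parts)


def controller_args(
    flags: Dict[str, Union[bool, int, str]],
    trial_resources: Sequence[str] = (),
) -> List[str]:
    """Render flags as '--name=value' strings (booleans lowercased)."""
    args = []
    for name, value in flags.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args.append(f"--{name}={text}")
    for resource in trial_resources:
        parse_trial_resource(resource)  # reject malformed entries early
        args.append(f"--trial-resources={resource}")
    return args
```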
2 changes: 1 addition & 1 deletion examples/v1beta1/bayesianoptimization-example.yaml
@@ -56,7 +56,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
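Most of the example changes in this commit swap the Trial image tag `v1beta1-e294a90` for `v1beta1-c6c9172`. A bulk edit like that could be scripted roughly as below. This is an illustrative sketch, not the tooling used for the commit, and it only rewrites `docker.io/kubeflowkatib/...` references; the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List

OLD_TAG = "v1beta1-e294a90"
NEW_TAG = "v1beta1-c6c9172"


def bump_katib_image_tags(
    text: str, old_tag: str = OLD_TAG, new_tag: str = NEW_TAG
) -> str:
    """Replace `old_tag` with `new_tag` on docker.io/kubeflowkatib/* images only."""
    pattern = re.compile(
        r"(docker\.io/kubeflowkatib/[\w.-]+):" + re.escape(old_tag) + r"\b"
    )
    return pattern.sub(lambda m: f"{m.group(1)}:{new_tag}", text)


def bump_directory(root: str) -> List[Path]:
    """Rewrite every *.yaml file under `root` in place; return changed files."""
    changed = []
    for path in sorted(Path(root).rglob("*.yaml")):
        original = path.read_text()
        updated = bump_katib_image_tags(original)
        if updated != original:
            path.write_text(updated)
            changed.append(path)
    return changed
```

The trailing `\b` keeps a longer tag that merely starts with the old one from being rewritten, and images from other registries (such as the `gcr.io/kubeflow-ci/...` TFJob image) are left alone.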
2 changes: 1 addition & 1 deletion examples/v1beta1/cmaes-example.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
4 changes: 1 addition & 3 deletions examples/v1beta1/custom-metricscollector-example.yaml
@@ -66,9 +66,7 @@ spec:
spec:
containers:
- name: training-container
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
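Several examples drop `imagePullPolicy: Always` and, where the image was `:latest`, pin a tag instead. When `imagePullPolicy` is omitted, Kubernetes defaults it from the image reference: `Always` for an untagged or `:latest` image, `IfNotPresent` for any other tag or for a digest reference. The helpers below are a sketch of that defaulting rule; the names are hypothetical.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    name = image.split("@", 1)[0]   # drop any digest
    last = name.rsplit("/", 1)[-1]  # ignore registry ports like localhost:5000/
    return last.rsplit(":", 1)[1] if ":" in last else None


def default_pull_policy(image: str) -> str:
    """Policy Kubernetes assigns when imagePullPolicy is omitted."""
    if "@" in image:
        return "IfNotPresent"  # digest references
    tag = image_tag(image)
    if tag is None or tag == "latest":
        return "Always"
    return "IfNotPresent"
```

Under this rule the pinned `pytorch-mnist:v1beta1-c6c9172` images and the `tf-mnist-with-summaries:1.0` image default to `IfNotPresent`, while the untagged `katib-controller` image in the Deployment change still defaults to `Always`.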
2 changes: 1 addition & 1 deletion examples/v1beta1/early-stopping/median-stop.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
4 changes: 1 addition & 3 deletions examples/v1beta1/file-metricscollector-example.yaml
@@ -53,9 +53,7 @@ spec:
spec:
containers:
- name: training-container
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/grid-example.yaml
@@ -54,7 +54,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/hyperband-example.yaml
@@ -68,7 +68,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/metric-strategy-example.yaml
@@ -58,7 +58,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
39 changes: 9 additions & 30 deletions examples/v1beta1/mxnet-mnist/mnist.py
@@ -37,40 +37,19 @@
level=logging.DEBUG)


def read_data(label, image):
"""
download and read data into numpy
"""
base_url = 'http://yann.lecun.com/exdb/mnist/'
with gzip.open(utils.download_file(base_url+label, os.path.join('data', label))) as flbl:
magic, num = struct.unpack(">II", flbl.read(8))
label = np.fromstring(flbl.read(), dtype=np.int8)
with gzip.open(utils.download_file(base_url+image, os.path.join('data', image)), 'rb') as fimg:
magic, num, rows, cols = struct.unpack(">IIII", fimg.read(16))
image = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(label), rows, cols)
return (label, image)


def to4d(img):
def get_mnist_iter(args, kv):
"""
reshape to 4D arrays
Create data iterator with NDArrayIter
"""
return img.reshape(img.shape[0], 1, 28, 28).astype(np.float32)/255
mnist = mx.test_utils.get_mnist()

# Get MNIST data.
train_data = mx.io.NDArrayIter(
mnist['train_data'], mnist['train_label'], args.batch_size, shuffle=True)
val_data = mx.io.NDArrayIter(
mnist['test_data'], mnist['test_label'], args.batch_size)

def get_mnist_iter(args, kv):
"""
create data iterator with NDArrayIter
"""
(train_lbl, train_img) = read_data(
'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
(val_lbl, val_img) = read_data(
't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz')
train = mx.io.NDArrayIter(
to4d(train_img), train_lbl, args.batch_size, shuffle=True)
val = mx.io.NDArrayIter(
to4d(val_img), val_lbl, args.batch_size)
return (train, val)
return (train_data, val_data)


if __name__ == '__main__':
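The MXNet example stops downloading and decoding the raw MNIST IDX files itself and calls `mx.test_utils.get_mnist()` instead. For reference, the removed `read_data` and `to4d` helpers read a big-endian header with `struct.unpack` and scaled pixels to `[0, 1]`. A standard-library sketch of that decoding follows; the function names are hypothetical, the magic numbers 2049 and 2051 are the IDX label and image file markers, and images are returned flattened rather than reshaped to 4D.

```python
import gzip
import struct
from typing import List, Tuple

LABEL_MAGIC = 2049  # 0x00000801
IMAGE_MAGIC = 2051  # 0x00000803


def parse_idx_labels(raw: bytes) -> List[int]:
    """Decode an IDX label file: '>II' header (magic, count), one byte per label."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"not an IDX label file (magic={magic})")
    body = raw[8:8 + count]
    if len(body) != count:
        raise ValueError("truncated label data")
    return list(body)


def parse_idx_images(raw: bytes) -> Tuple[int, int, List[List[float]]]:
    """Decode an IDX image file; scale pixels to [0, 1] as the old to4d() did."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"not an IDX image file (magic={magic})")
    size = rows * cols
    body = raw[16:16 + count * size]
    if len(body) != count * size:
        raise ValueError("truncated image data")
    images = [
        [pixel / 255.0 for pixel in body[i * size:(i + 1) * size]]
        for i in range(count)
    ]
    return rows, cols, images


def read_gzipped(path: str) -> bytes:
    """Read a .gz file such as 'train-labels-idx1-ubyte.gz' into memory."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```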
3 changes: 1 addition & 2 deletions examples/v1beta1/nas/darts-example-cpu.yaml
@@ -59,8 +59,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-e294a90
imagePullPolicy: Always
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-c6c9172
command:
- python3
- run_trial.py
3 changes: 1 addition & 2 deletions examples/v1beta1/nas/darts-example-gpu.yaml
@@ -76,8 +76,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-e294a90
imagePullPolicy: Always
image: docker.io/kubeflowkatib/darts-cnn-cifar10:v1beta1-c6c9172
command:
- python3
- run_trial.py
2 changes: 1 addition & 1 deletion examples/v1beta1/nas/enas-example-cpu.yaml
@@ -139,7 +139,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v1beta1-e294a90
image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v1beta1-c6c9172
command:
- python3
- -u
2 changes: 1 addition & 1 deletion examples/v1beta1/nas/enas-example-gpu.yaml
@@ -136,7 +136,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/enas-cnn-cifar10-gpu:v1beta1-e294a90
image: docker.io/kubeflowkatib/enas-cnn-cifar10-gpu:v1beta1-c6c9172
command:
- python3
- -u
28 changes: 13 additions & 15 deletions examples/v1beta1/pytorch-mnist/mnist.py
@@ -11,12 +11,6 @@
import torch.nn.functional as F
import torch.optim as optim

# To fix this issue: https://github.com/pytorch/vision/issues/1938.
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))


@@ -138,18 +132,22 @@ def main():
dist.init_process_group(backend=args.backend)

kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

train_loader = torch.utils.data.DataLoader(
datasets.MNIST("../data", train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
datasets.FashionMNIST("./data",
train=True,
download=True,
transform=transforms.Compose([
transforms.ToTensor()
])),
batch_size=args.batch_size, shuffle=True, **kwargs)

test_loader = torch.utils.data.DataLoader(
datasets.MNIST("../data", train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
datasets.FashionMNIST("./data",
train=False,
transform=transforms.Compose([
transforms.ToTensor()
])),
batch_size=args.test_batch_size, shuffle=False, **kwargs)

model = Net().to(device)
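The PyTorch example switches from MNIST to FashionMNIST and drops `transforms.Normalize((0.1307,), (0.3081,))`. Those constants are the mean and standard deviation commonly quoted for MNIST, so they would not necessarily match FashionMNIST pixels. If normalization were reintroduced, the statistics could be recomputed with something like the sketch below, which mirrors `ToTensor()` scaling of 8-bit images (divide by 255) followed by `Normalize`'s `(x - mean) / std`. The helper names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def pixel_stats(images: Iterable[Sequence[int]]) -> Tuple[float, float]:
    """Population mean/std of 8-bit grayscale pixels after scaling to [0, 1]."""
    total = 0.0
    total_sq = 0.0
    count = 0
    for image in images:
        for pixel in image:
            x = pixel / 255.0  # same scaling ToTensor() applies to uint8 images
            total += x
            total_sq += x * x
            count += 1
    if count == 0:
        raise ValueError("no pixels")
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)  # clamp rounding noise
    return mean, math.sqrt(variance)


def normalize(image: Sequence[int], mean: float, std: float) -> List[float]:
    """Apply (x - mean) / std to scaled pixels, like Normalize on one channel."""
    return [(pixel / 255.0 - mean) / std for pixel in image]
```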
8 changes: 2 additions & 6 deletions examples/v1beta1/pytorchjob-example.yaml
@@ -45,9 +45,7 @@ spec:
spec:
containers:
- name: pytorch
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
@@ -61,9 +59,7 @@ spec:
spec:
containers:
- name: pytorch
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
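Both the Master and Worker replicas run the same `pytorch-mnist` image, and `mnist.py` reads `WORLD_SIZE` from the environment (defaulting to 1) before calling `dist.init_process_group`, whose default `env://` initialization reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`. A small sketch of that environment handling, assuming the PyTorchJob operator injects `WORLD_SIZE` and `RANK` into each replica; `DistConfig` and `dist_config` are hypothetical.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistConfig(NamedTuple):
    world_size: int
    rank: int
    distributed: bool


def dist_config(env: Optional[Mapping[str, str]] = None) -> DistConfig:
    """Read WORLD_SIZE/RANK, defaulting to a single non-distributed process."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", 1))
    rank = int(env.get("RANK", 0))
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"invalid WORLD_SIZE={world_size} / RANK={rank}")
    return DistConfig(world_size, rank, world_size > 1)
```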
2 changes: 1 addition & 1 deletion examples/v1beta1/random-example.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/resume-experiment/from-volume-resume.yaml
@@ -54,7 +54,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/resume-experiment/never-resume.yaml
@@ -56,7 +56,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/tekton/pipeline-run.yaml
@@ -88,7 +88,7 @@ spec:
description: Number of training examples
steps:
- name: model-training
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
1 change: 0 additions & 1 deletion examples/v1beta1/tfjob-example.yaml
@@ -53,7 +53,6 @@ spec:
containers:
- name: tensorflow
image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
imagePullPolicy: Always
command:
- "python"
- "/var/tf_mnist/mnist_with_summaries.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/tpe-example.yaml
@@ -53,7 +53,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
2 changes: 1 addition & 1 deletion examples/v1beta1/trial-metadata-substitution.yaml
@@ -59,7 +59,7 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-c6c9172
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
Original file line number Diff line number Diff line change
@@ -3,6 +3,7 @@ kind: Deployment
metadata:
name: katib-controller
namespace: kubeflow
# TODO (andreyvelich): Modify labels to follow k8s guidelines.
labels:
app: katib-controller
spec:
@@ -21,7 +22,6 @@ spec:
containers:
- name: katib-controller
image: docker.io/kubeflowkatib/katib-controller
imagePullPolicy: Always
command: ["./katib-controller"]
args:
- "--webhook-port=8443"
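The Deployment gains a TODO to make its labels follow Kubernetes guidelines. One possible direction is the `app.kubernetes.io/*` recommended labels. The sketch below builds them and checks values against the label-value syntax (at most 63 characters, alphanumeric at both ends, with `-`, `_`, `.` allowed in between). The helper names and the component values used with them are illustrative only.

```python
import re
from typing import Dict, Optional

# Label values: empty, or up to 63 chars that begin and end with an
# alphanumeric character, with '-', '_' and '.' allowed in between.
_LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    return len(value) <= 63 and _LABEL_VALUE_RE.fullmatch(value) is not None


def recommended_labels(
    name: str, component: str, part_of: str, version: Optional[str] = None
) -> Dict[str, str]:
    """Build app.kubernetes.io/* labels; raise on values Kubernetes would reject."""
    labels = {
        "app.kubernetes.io/name": name,
        "app.kubernetes.io/component": component,
        "app.kubernetes.io/part-of": part_of,
    }
    if version is not None:
        labels["app.kubernetes.io/version"] = version
    for key, value in labels.items():
        if not is_valid_label_value(value):
            raise ValueError(f"invalid value for {key}: {value!r}")
    return labels
```

Because `spec.selector` of an `apps/v1` Deployment is immutable, keeping the existing `app: katib-controller` selector label and adding the new labels alongside it avoids having to recreate the Deployment.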