This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

A new training service implementation for Kubernetes by Petuum #3022

Merged

merged 45 commits on Nov 23, 2020
Changes from 40 commits

45 commits
4ed72c4
A light pipeline to build, lint, test, release internally
pengwu22 Oct 12, 2020
41bade3
Add AdaptDL
Aug 7, 2020
a89ce3c
Add NFS support
Aug 13, 2020
fea7286
Support show trials in WebUI without requiring metrics reported
pengwu22 Aug 13, 2020
f45728b
Resolve BE-12443 "Dev"
TairuiWang Aug 14, 2020
9afe3fe
Deepcopy the templates. Avoids readding mounts and volumes
Aug 17, 2020
8b78311
Handle trial msg
pengwu22 Aug 20, 2020
cdc3a76
[BE-12469] Support adaptdl signal handling
TairuiWang Aug 21, 2020
a9cfeb9
make checkpoint optional in config file
TairuiWang Sep 1, 2020
1974390
image pulling error handling added
TairuiWang Sep 1, 2020
0db3111
nnictl log
pengwu22 Sep 1, 2020
0177886
fix backward incompatible changes after rebase
pengwu22 Sep 1, 2020
e21e884
fix nnictl tensorboard start allowing optional experiment id
pengwu22 Sep 9, 2020
629804f
`waiting` doesn't exist bug fix
TairuiWang Sep 9, 2020
918cbb8
Resolve BE-12465: add fail msg and resource config
TairuiWang Sep 10, 2020
7ff5fd2
BE-12510: nni-tensorboard issue fixed
TairuiWang Sep 20, 2020
d129e23
AdaptDL-Compatible Python CLI APIs Example
pengwu22 Sep 21, 2020
b24ba6c
Hide KubeConfig for Going Public
pengwu22 Oct 9, 2020
a20b33d
Integrate Bert finetuning model as an example in NNI
ZeyaWang Oct 12, 2020
1e8c461
tensorboard ui and web ui: at the same ip
pengwu22 Oct 12, 2020
4cc7cec
adaptive support (#1)
pengwu22 Oct 12, 2020
2096a9e
Simplify Examples
pengwu22 Oct 22, 2020
2e9b767
Cleanup: General
pengwu22 Oct 22, 2020
565d807
cleanup: webui
pengwu22 Oct 22, 2020
9c76d39
cleanup: sanity
pengwu22 Oct 22, 2020
8b4649d
Simplify SDK Change for OSS (#5)
pengwu22 Oct 22, 2020
d587906
lint
pengwu22 Oct 22, 2020
f736900
skipping like kubeflowtest
pengwu22 Oct 23, 2020
95bb20b
comments: lint
pengwu22 Oct 27, 2020
1ee7de8
comments: message value-write-read
pengwu22 Oct 27, 2020
55c25c0
unit test
pengwu22 Oct 28, 2020
af315ca
unit test
pengwu22 Oct 28, 2020
d7afaab
unit test
pengwu22 Oct 28, 2020
7a73286
doc
pengwu22 Nov 9, 2020
91fd236
toc tree reference
pengwu22 Nov 10, 2020
a0dc313
lint
pengwu22 Nov 10, 2020
6825a5e
toctree
pengwu22 Nov 11, 2020
0024670
intermediate seq
pengwu22 Nov 11, 2020
c85798d
doc fix
pengwu22 Nov 11, 2020
81d6c29
intermediate sequence
pengwu22 Nov 11, 2020
d527a89
import fix
pengwu22 Nov 11, 2020
9ef620f
===
pengwu22 Nov 12, 2020
cdb268e
add adl cifar10 example
hao-howard-zhang Nov 20, 2020
6a32f17
rename tensorboard dir env var
hao-howard-zhang Nov 20, 2020
7dd3ba0
improve adaptdl doc
hao-howard-zhang Nov 20, 2020
3 changes: 3 additions & 0 deletions .gitignore
@@ -38,6 +38,7 @@ build/Release
# Dependency directories
node_modules/
jspm_packages/
**/package-lock.json

# TypeScript v1 declaration files
typings/
@@ -68,6 +69,8 @@ __pycache__
build
*.egg-info
setup.pye
**/__init__.pye
**/.ipynb_checkpoints

# Environments
.env
188 changes: 188 additions & 0 deletions docs/en_US/TrainingService/AdaptDLMode.md
@@ -0,0 +1,188 @@
# Run an Experiment on AdaptDL

NNI now supports running experiments on [AdaptDL](https://github.com/petuum/adaptdl). Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program will run as an AdaptDL job in the Kubernetes cluster.

AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

## Prerequisites for Kubernetes Service

1. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes [on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/), or [on-premises](https://kubernetes.io/docs/setup/) with [cephfs](https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd), or [microk8s with the storage add-on enabled](https://microk8s.io/docs/addons).
2. Install the **AdaptDL scheduler** on your Kubernetes cluster with Helm. Follow this [guideline](https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html) to set up the AdaptDL scheduler.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, the NNI manager uses `$(HOME)/.kube/config` as the kubeconfig file path. You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable (see the example after this list). Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **NVIDIA device plugin for Kubernetes**.
5. (Optional) Prepare an **NFS server** and export a general purpose mount as external storage.
6. Install **NNI** following the install guide [here](../Tutorial/QuickStart.md).
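
For example, a minimal sketch of pointing NNI at a non-default kubeconfig; the file path below is only an illustration:

```bash
# Assumption: the kubeconfig for your AdaptDL cluster lives at this example path.
export KUBECONFIG=$HOME/.kube/adaptdl-cluster.config

# NNI picks up KUBECONFIG from the environment when the experiment is created.
nnictl create --config config_adl.yml
```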

### Verify Prerequisites

```bash
nnictl --version
# Expected: <version_number>
```

```bash
kubectl version
# Expected: the kubectl client version matches the server version.
```

```bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
```

## Run an experiment

Here is a template configuration specification to use AdaptDL as a training service.

```yaml
authorName: default
experimentName: minimal_adl

trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http

tuner:
  builtinTunerName: GridSearch
searchSpacePath: search_space.json

trialConcurrency: 2
maxTrialNum: 2

trial:
  adaptive: false # optional.
  image: <image_tag>
  imagePullSecrets: # optional
    - name: stagingsecret
  codeDir: .
  command: python main.py
  gpuNum: 1
  cpuNum: 1 # optional
  memorySize: 8Gi # optional
  nfs: # optional
    server: 10.20.41.55
    path: /
    containerMountPath: /nfs
  checkpoint: # optional
    storageClass: microk8s-hostpath
    storageSize: 1Gi
```

Config fields not explained below follow the [default specs defined in the NNI doc](https://nni.readthedocs.io/en/latest/Tutorial/ExperimentConfig.html#configuration-spec).

* **trainingServicePlatform**: Choose `adl` to use a Kubernetes cluster with the AdaptDL scheduler.
* **nniManagerIp**: *Required* by the `adl` training service to get the correct info and metrics back from the cluster. This is the IP address of the machine running the NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**: *Recommended* to set to `http`. It collects the trial logs from the cluster back to your machine via HTTP.
* **tuner**: Supports the Tuun tuner and all NNI built-in tuners (except the checkpoint feature of the NNI PBT tuners).
* **trial**: Defines the specs of an `adl` trial (a minimal trial sketch is shown after this list).
    * **adaptive**: (*Optional*) Boolean for the AdaptDL trainer. When `true`, the job is preemptible and adaptive.
    * **image**: Docker image for the trial.
    * **imagePullSecrets**: (*Optional*) If you are using a private registry, you need to provide the secret to successfully pull the image.
    * **codeDir**: The working directory of the container. `.` means the default working directory defined by the image.
    * **command**: The bash command that starts the trial.
    * **gpuNum**: The number of GPUs requested for this trial. It must be a non-negative integer.
    * **cpuNum**: (*Optional*) The number of CPUs requested for this trial. It must be a non-negative integer.
    * **memorySize**: (*Optional*) The amount of memory requested for this trial. It must follow the Kubernetes [default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
    * **nfs**: (*Optional*) Mounts external storage. For more information about using NFS, see the NFS Storage section below.
    * **checkpoint**: (*Optional*) [Storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL internal checkpoints. You may omit it unless you are a dev user.
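
For reference, here is a minimal sketch of the trial program referenced by `command: python main.py` above. It assumes the standard NNI trial API; the `lr` key and the metric values are placeholders, not part of this PR:

```python
import nni

params = nni.get_next_parameter()          # hyperparameters chosen by the tuner
lr = params.get("lr", 0.01)                # "lr" is a hypothetical search-space key

accuracy = 0.0
for epoch in range(10):
    # ... replace with real training that uses `lr` ...
    accuracy = min(1.0, accuracy + 0.05)   # placeholder metric
    nni.report_intermediate_result(accuracy)

nni.report_final_result(accuracy)
```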

### NFS Storage

As you may have noticed in the configuration spec above, an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when a Docker image already contains the code and data a trial needs.

Note that the `adl` training service does NOT mount the NFS on your local dev machine; you may mount it yourself to manage the filesystem and copy data or code (see the sketch after the use cases below). The `adl` training service can then mount it into Kubernetes for every trial, given the proper configuration:

* **server**: NFS server address, e.g., an IP address or domain name.
* **path**: NFS server export path, i.e., the absolute path in the NFS that can be mounted into trials.
* **containerMountPath**: The absolute path in the container at which the NFS **path** above is mounted, so that every trial has access to the NFS. In the trial containers, you can access the NFS with this path.

Use cases:

* If your training trials depend on a large dataset, you may want to download it onto the NFS first and mount it so that it can be shared across multiple trials.
* The storage of a container is ephemeral, and the trial container is deleted after the trial's lifecycle is over. So if you want to export your trained models, you may mount the NFS into the trial to persist them.

In short, trials are not restricted in how they read from or write to the NFS storage, so you may use it flexibly as per your needs.
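
For example, a sketch of mounting the NFS on your local machine to stage data for trials, assuming the server and export path from the config template above and an arbitrary local mount point:

```bash
# Install NFS client tools (Ubuntu) and mount the export locally.
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/nni-nfs
sudo mount -t nfs 10.20.41.55:/ /mnt/nni-nfs

# Stage a dataset; trials will see it under <containerMountPath>/dataset.
cp -r ./dataset /mnt/nni-nfs/dataset
```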


## Monitor via Log Stream

To follow the log stream of a certain trial:

```bash
nnictl log trial --trial_id=<trial_id>
```

or, for a trial under a specific experiment:

```bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
```

Note that *after* a trial is done and its pod has been deleted, its logs can no longer be retrieved via this command. However, you may still be able to access past trial logs using the approach below.


## Monitor via TensorBoard

In the context of NNI, an experiment has multiple trials. For easy comparison across the trials of a model tuning process, we support TensorBoard integration, where each experiment has an independent TensorBoard logging directory and thus its own dashboard.


You can only use TensorBoard while the monitored experiment is running; monitoring stopped experiments is not supported.


In the trial container, you have access to two environment variables:

* `ADAPTDLCTL_TENSORBOARD_LOGDIR`: the TensorBoard logging directory for the current experiment;
* `NNI_TRIAL_JOB_ID`: the `trial` job ID for the current trial.

It is recommended to join them to form the logging directory for the trial, for example in Python:

```python
import os
tensorboard_logdir = os.path.join(
    os.getenv("ADAPTDLCTL_TENSORBOARD_LOGDIR"),
    os.getenv("NNI_TRIAL_JOB_ID")
)
```
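
For instance, a sketch of writing metrics into that directory, assuming a PyTorch trial with `torch` installed (the tag and values are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=tensorboard_logdir)   # directory built above
for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), global_step=step)
writer.close()
```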

If an experiment is stopped, the data logged here (in the directory defined by *the above environment variables*, monitored with the following commands) is lost. To persist the logged data, you can use external storage (e.g., mount an NFS) to export it and view TensorBoard locally.


With the above setup, you can easily monitor the experiment via TensorBoard by running

```bash
nnictl tensorboard start
```

If you have multiple experiments running at the same time, you may use

```bash
nnictl tensorboard start <experiment_id>
```

It will provide the web URL for accessing the TensorBoard dashboard.

Note that you have the flexibility to set a custom local `--port` for TensorBoard.
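
For example (the port number is only an illustration):

```bash
nnictl tensorboard start <experiment_id> --port 6006
```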
3 changes: 2 additions & 1 deletion docs/en_US/TrainingService/Overview.md
@@ -4,7 +4,7 @@

NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.

Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*.
Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [AdaptDL](./AdaptDLMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*.

If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details.

@@ -24,6 +24,7 @@ In case users intend to use large files in their experiment (like large-scaled d
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
|[__PAI__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.|
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
|[__AdaptDL__](./AdaptDLMode.md)|NNI supports running an experiment on [AdaptDL](https://github.com/petuum/adaptdl), called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.|
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
|[__DLTS__](./DLTSMode.md)|NNI supports running experiment using [DLTS](https://github.com/microsoft/DLWorkspace.git), which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.|
|[__AML__](./AMLMode.md)|NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/) , called aml mode.
2 changes: 2 additions & 0 deletions docs/en_US/Tutorial/ExperimentConfig.md
@@ -260,6 +260,8 @@ Specifies the platform to run the experiment, including __local__, __remote__, _

* __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI support kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For detail please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md)

* __adl__ submit trial jobs to [AdaptDL](https://github.com/petuum/adaptdl). NNI supports AdaptDL on a Kubernetes cluster. For details please refer to [AdaptDL Docs](../TrainingService/AdaptDLMode.md)

* TODO: explain frameworkcontroller.

### searchSpacePath
1 change: 1 addition & 0 deletions docs/en_US/Tutorial/InstallationLinux.md
@@ -118,3 +118,4 @@ Due to potential programming changes, the minimum system requirements of NNI may
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
1 change: 1 addition & 0 deletions docs/en_US/Tutorial/QuickStart.md
Original file line number Diff line number Diff line change
@@ -281,3 +281,4 @@ Below is the status of all trials. Specifically:
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
1 change: 1 addition & 0 deletions docs/en_US/training_services.rst
@@ -8,6 +8,7 @@ Introduction to NNI Training Services
OpenPAI<./TrainingService/PaiMode>
OpenPAI Yarn Mode<./TrainingService/PaiYarnMode>
Kubeflow<./TrainingService/KubeflowMode>
AdaptDL<./TrainingService/AdaptDLMode>
FrameworkController<./TrainingService/FrameworkControllerMode>
DLTS<./TrainingService/DLTSMode>
AML<./TrainingService/AMLMode>
21 changes: 21 additions & 0 deletions examples/trials/mnist-pytorch/config_adl.yml
@@ -0,0 +1,21 @@
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10

logCollection: http
trainingServicePlatform: adl

searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize

trial:
  image: {replace_to_your_image_tag}
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
1 change: 1 addition & 0 deletions src/nni_manager/common/datastore.ts
@@ -46,6 +46,7 @@ interface TrialJobInfo {
    id: string;
    sequenceId?: number;
    status: TrialJobStatus;
    message?: string;
    startTime?: number;
    endTime?: number;
    hyperParameters?: string[];
1 change: 1 addition & 0 deletions src/nni_manager/common/manager.ts
@@ -105,6 +105,7 @@ abstract class Manager {
    public abstract getTrialLog(trialJobId: string, logType: LogType): Promise<string>;

    public abstract getTrialJobStatistics(): Promise<TrialJobStatistics[]>;
    public abstract getTrialJobMessage(trialJobId: string): string | undefined;
    public abstract getStatus(): NNIManagerStatus;
}

1 change: 1 addition & 0 deletions src/nni_manager/common/trainingService.ts
@@ -42,6 +42,7 @@ interface TrialJobDetail {
    readonly workingDirectory: string;
    readonly form: TrialJobApplicationForm;
    isEarlyStopped?: boolean;
    message?: string;
}

/**
17 changes: 17 additions & 0 deletions src/nni_manager/config/adl/adaptdl-crd-v1.json
@@ -0,0 +1,17 @@
{
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "kind": "CustomResourceDefinition",
    "metadata": {
        "name": "adaptdljobs.adaptdl.petuum.com"
    },
    "spec": {
        "group": "adaptdl.petuum.com",
        "version": "v1",
        "scope": "Namespaced",
        "names": {
            "plural": "adaptdljobs",
            "singular": "adaptdljob",
            "kind": "AdaptDLJob"
        }
    }
}
19 changes: 19 additions & 0 deletions src/nni_manager/config/adl/adaptdl-nni-configmap-template.json
@@ -0,0 +1,19 @@
{
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
        "name": "<name>",
        "ownerReferences": [
            {
                "apiVersion": "adaptdl.petuum.com/v1",
                "kind": "AdaptDLJob",
                "name": "<adaptdljob_name>",
                "uid": "<adaptdljob_uid>"
            }
        ]
    },
    "data": {
        "run.sh": "<run_script>",
        "cleanup.sh": "<clean_script>"
    }
}
27 changes: 27 additions & 0 deletions src/nni_manager/config/adl/adaptdl-pvc-template.json
@@ -0,0 +1,27 @@
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "<name>",
        "ownerReferences": [
            {
                "apiVersion": "adaptdl.petuum.com/v1",
                "kind": "AdaptDLJob",
                "name": "<adaptdljob_name>",
                "uid": "<adaptdljob_uid>"
            }
        ]
    },
    "spec": {
        "accessModes": [
            "ReadWriteMany"
        ],
        "resources": {
            "requests": {
                "storage": "<storage_size>"
            }
        },
        "storageClassName": "<storage_class>",
        "volumeMode": "Filesystem"
    }
}