diff --git a/README.md b/README.md index 0d4057f7a2..78c118b614 100644 --- a/README.md +++ b/README.md @@ -106,7 +106,7 @@ We encourage researchers and students leverage these projects to accelerate the ## **Install & Verify** -If you choose NNI Windows local mode and you use PowerShell to run script for the first time, you need to **run PowerShell as administrator** with this command first: +If you are using NNI on Windows and use PowerShell to run a script for the first time, you need to **run PowerShell as administrator** with this command first: ```bash Set-ExecutionPolicy -ExecutionPolicy Unrestricted @@ -114,7 +114,7 @@ If you choose NNI Windows local mode and you use PowerShell to run script for th **Install through pip** -* We support Linux, MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 along with Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`. +* We support Linux, MacOS and Windows (local, remote and pai mode) at the current stage. Ubuntu 16.04 or higher, MacOS 10.14.1 and Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`. Linux and MacOS @@ -131,12 +131,12 @@ python -m pip install --upgrade nni Note: * `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges. -* Currently NNI on Windows only support local mode. Anaconda or Miniconda is highly recommended to install NNI on Windows. +* Currently NNI on Windows supports local, remote and pai modes. Anaconda or Miniconda is highly recommended to install NNI on Windows. * If there is any error like `Segmentation fault`, please refer to [FAQ](docs/en_US/FAQ.md) **Install through source code** -* We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows local mode (10.1809) in our current stage.
+* We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows (10.1809) in our current stage. Linux and MacOS @@ -155,12 +155,12 @@ Windows ```bash git clone -b v0.7 https://github.com/Microsoft/nni.git cd nni - powershell ./install.ps1 + powershell .\install.ps1 ``` For the system requirements of NNI, please refer to [Install NNI](docs/en_US/Installation.md) -For NNI Windows local mode, please refer to [NNI Windows local mode](docs/en_US/WindowsLocalMode.md) +For NNI on Windows, please refer to [NNI on Windows](docs/en_US/NniOnWindows.md) **Verify install** @@ -185,7 +185,7 @@ Windows * Run the MNIST example. ```bash - nnictl create --config nni/examples/trials/mnist/config_windows.yml + nnictl create --config nni\examples\trials\mnist\config_windows.yml ``` * Wait for the message `INFO: Successfully started experiment!` in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the `Web UI url`. 
diff --git a/README_zh_CN.md b/README_zh_CN.md index e43d9e83e1..3e27f9d2f1 100644 --- a/README_zh_CN.md +++ b/README_zh_CN.md @@ -47,32 +47,32 @@ NNI (Neural Network Intelligence) 是自动机器学习(AutoML)的工具包 - Tuner(调参器) + Tuner(调参器) - Assessor(评估器) + Assessor(评估器) @@ -94,7 +94,10 @@ NNI (Neural Network Intelligence) 是自动机器学习(AutoML)的工具包 * [OpenPAI](https://github.com/Microsoft/pai):作为开源平台,提供了完整的 AI 模型训练和资源管理能力,能轻松扩展,并支持各种规模的私有部署、云和混合环境。 * [FrameworkController](https://github.com/Microsoft/frameworkcontroller):开源的通用 Kubernetes Pod 控制器,通过单个控制器来编排 Kubernetes 上所有类型的应用。 -* [MMdnn](https://github.com/Microsoft/MMdnn):一个完整、跨框架的解决方案,能够转换、可视化、诊断深度神经网络模型。 MMdnn 中的 "MM" 表示model management(模型管理),而 "dnn" 是 deep neural network(深度神经网络)的缩写。 我们鼓励研究人员和学生利用这些项目来加速 AI 开发和研究。 +* [MMdnn](https://github.com/Microsoft/MMdnn):一个完整、跨框架的解决方案,能够转换、可视化、诊断深度神经网络模型。 MMdnn 中的 "MM" 表示model management(模型管理),而 "dnn" 是 deep neural network(深度神经网络)的缩写。 +* [SPTAG](https://github.com/Microsoft/SPTAG) : Space Partition Tree And Graph (SPTAG) 是用于大规模向量的最近邻搜索场景的开源库。 + +我们鼓励研究人员和学生利用这些项目来加速 AI 开发和研究。 ## **安装和验证** @@ -106,9 +109,9 @@ NNI (Neural Network Intelligence) 是自动机器学习(AutoML)的工具包 **通过 pip 命令安装** -* 当前支持 Linux,MacOS 和 Windows(本机模式),在 Ubuntu 16.04 或更高版本,MacOS 10.14.1 以及 Windows 10.1809 上进行了测试。 在 `python >= 3.5` 的环境中,只需要运行 `pip install` 即可完成安装。 +* 当前支持 Linux,MacOS 和 Windows(本机,远程,OpenPAI 模式),在 Ubuntu 16.04 或更高版本,MacOS 10.14.1 以及 Windows 10.1809 上进行了测试。 在 `python >= 3.5` 的环境中,只需要运行 `pip install` 即可完成安装。 -Linux 和 MacOS +Linux 和 macOS ```bash python3 -m pip install --upgrade nni @@ -123,14 +126,14 @@ python -m pip install --upgrade nni 注意: * 如果需要将 NNI 安装到自己的 home 目录中,可使用 `--user`,这样也不需要任何特殊权限。 -* 当前 NNI 在 Windows 上仅支持本机模式。 强烈推荐使用 Anaconda 在 Windows 上安装 NNI。 +* 目前,Windows 上的 NNI 支持本机,远程和 OpenPAI 模式。 强烈推荐使用 Anaconda 或 Miniconda 在 Windows 上安装 NNI。 * 如果遇到如`Segmentation fault` 这样的任何错误请参考[常见问题](docs/zh_CN/FAQ.md)。 **通过源代码安装** -* 当前支持 Linux(Ubuntu 16.04 或更高版本),MacOS(10.14.1)以及 Windows 
10(1809 版)下的本机模式。 +* 当前支持 Linux(Ubuntu 16.04 或更高版本),MacOS(10.14.1)以及 Windows 10(1809 版)。 -Linux 和 MacOS +Linux 和 macOS * 在 `python >= 3.5` 的环境中运行命令: `git` 和 `wget`,确保安装了这两个组件。 @@ -152,7 +155,7 @@ Windows 参考[安装 NNI](docs/zh_CN/Installation.md) 了解系统需求。 -参考 [NNI Windows 本机模式](docs/zh_CN/WindowsLocalMode.md),了解更多信息。 +Windows 上参考 [Windows 上使用 NNI](docs/zh_CN/NniOnWindows.md)。 **验证安装** @@ -224,11 +227,11 @@ You can use these commands to get more information about the experiment ## **入门** * [安装 NNI](docs/zh_CN/Installation.md) -* [使用命令行工具 nnictl](docs/zh_CN/NNICTLDOC.md) +* [使用命令行工具 nnictl](docs/zh_CN/Nnictl.md) * [使用 NNIBoard](docs/zh_CN/WebUI.md) * [如何定义搜索空间](docs/zh_CN/SearchSpaceSpec.md) * [如何编写 Trial 代码](docs/zh_CN/Trials.md) -* [如何选择 Tuner、搜索算法](docs/zh_CN/Builtin_Tuner.md) +* [如何选择 Tuner、搜索算法](docs/zh_CN/BuiltinTuner.md) * [配置 Experiment](docs/zh_CN/ExperimentConfig.md) * [如何使用 Annotation](docs/zh_CN/Trials.md#nni-python-annotation) @@ -236,12 +239,12 @@ You can use these commands to get more information about the experiment * [在本机运行 Experiment (支持多 GPU 卡)](docs/zh_CN/LocalMode.md) * [在多机上运行 Experiment](docs/zh_CN/RemoteMachineMode.md) -* [在 OpenPAI 上运行 Experiment](docs/zh_CN/PAIMode.md) +* [在 OpenPAI 上运行 Experiment](docs/zh_CN/PaiMode.md) * [在 Kubeflow 上运行 Experiment。](docs/zh_CN/KubeflowMode.md) * [尝试不同的 Tuner](docs/zh_CN/tuners.rst) * [尝试不同的 Assessor](docs/zh_CN/assessors.rst) -* [实现自定义 Tuner](docs/zh_CN/Customize_Tuner.md) -* [实现自定义 Assessor](docs/zh_CN/Customize_Assessor.md) +* [实现自定义 Tuner](docs/zh_CN/CustomizeTuner.md) +* [实现自定义 Assessor](docs/zh_CN/CustomizeAssessor.md) * [使用进化算法为阅读理解任务找到好模型](examples/trials/ga_squad/README_zh_CN.md) ## **贡献** @@ -250,9 +253,9 @@ You can use these commands to get more information about the experiment 推荐新贡献者从标有 **good first issue** 的简单需求开始。 -如要安装 NNI 开发环境,参考: [配置 NNI 开发环境](docs/zh_CN/SetupNNIDeveloperEnvironment.md)。 +如要安装 NNI 开发环境,参考:[配置 NNI 开发环境](docs/zh_CN/SetupNniDeveloperEnvironment.md)。 -在写代码之前,请查看并熟悉 NNI 
代码贡献指南:[贡献](docs/zh_CN/CONTRIBUTING.md)。 +在写代码之前,请查看并熟悉 NNI 代码贡献指南:[贡献](docs/zh_CN/Contributing.md)。 我们正在编写[如何调试](docs/zh_CN/HowToDebug.md) 的页面,欢迎提交建议和问题。 diff --git a/deployment/pypi/Makefile b/deployment/pypi/Makefile index b75cc3212c..4854f24ca1 100644 --- a/deployment/pypi/Makefile +++ b/deployment/pypi/Makefile @@ -20,22 +20,28 @@ ifeq ($(version_ts), true) NNI_VERSION_VALUE := $(NNI_VERSION_VALUE).$(TIME_STAMP) endif NNI_VERSION_TEMPLATE = 999.0.0-developing - +NNI_YARN_TARBALL ?= $(CWD)nni-yarn.tar.gz +NNI_YARN_FOLDER ?= $(CWD)nni-yarn +NNI_YARN := PATH=$(CWD)node-$(OS_SPEC)-x64/bin:$${PATH} $(NNI_YARN_FOLDER)/bin/yarn .PHONY: build build: python3 -m pip install --user --upgrade setuptools wheel - wget https://aka.ms/nni/nodejs-download/$(OS_SPEC) -O $(CWD)node-$(OS_SPEC)-x64.tar.xz + wget -q https://aka.ms/nni/nodejs-download/$(OS_SPEC) -O $(CWD)node-$(OS_SPEC)-x64.tar.xz rm -rf $(CWD)node-$(OS_SPEC)-x64 mkdir $(CWD)node-$(OS_SPEC)-x64 tar xf $(CWD)node-$(OS_SPEC)-x64.tar.xz -C node-$(OS_SPEC)-x64 --strip-components 1 - cd $(CWD)../../src/nni_manager && yarn && yarn build - cd $(CWD)../../src/webui && yarn && yarn build + wget -q https://aka.ms/yarn-download -O $(NNI_YARN_TARBALL) + rm -rf $(NNI_YARN_FOLDER) + mkdir $(NNI_YARN_FOLDER) + tar -xf $(NNI_YARN_TARBALL) -C $(NNI_YARN_FOLDER) --strip-components 1 + cd $(CWD)../../src/nni_manager && $(NNI_YARN) && $(NNI_YARN) build + cd $(CWD)../../src/webui && $(NNI_YARN) && $(NNI_YARN) build rm -rf $(CWD)nni cp -r $(CWD)../../src/nni_manager/dist $(CWD)nni cp -r $(CWD)../../src/webui/build $(CWD)nni/static cp $(CWD)../../src/nni_manager/package.json $(CWD)nni sed -ie 's/$(NNI_VERSION_TEMPLATE)/$(NNI_VERSION_VALUE)/' $(CWD)nni/package.json - cd $(CWD)nni && yarn --prod + cd $(CWD)nni && $(NNI_YARN) --prod cd $(CWD) && sed -ie 's/$(NNI_VERSION_TEMPLATE)/$(NNI_VERSION_VALUE)/' setup.py && python3 setup.py bdist_wheel -p $(WHEEL_SPEC) cd $(CWD) @@ -50,4 +56,4 @@ clean: rm -rf $(CWD)dist rm -rf $(CWD)nni rm -rf 
$(CWD)nni.egg-info - rm -rf $(CWD)node-$(OS_SPEC)-x64 \ No newline at end of file + rm -rf $(CWD)node-$(OS_SPEC)-x64 diff --git a/docs/en_US/AnnotationSpec.md b/docs/en_US/AnnotationSpec.md index a02fc27603..294308eea3 100644 --- a/docs/en_US/AnnotationSpec.md +++ b/docs/en_US/AnnotationSpec.md @@ -36,8 +36,8 @@ There are 10 types to express your search space as follows: * `@nni.variable(nni.choice(option1,option2,...,optionN),name=variable)` Which means the variable value is one of the options, which should be a list The elements of options can themselves be stochastic expressions -* `@nni.variable(nni.randint(upper),name=variable)` - Which means the variable value is a random integer in the range [0, upper). +* `@nni.variable(nni.randint(lower, upper),name=variable)` + Which means the variable value is a value like round(uniform(lower, upper)). For now, the type of the chosen value is float. If you want an integer value, please convert it explicitly. * `@nni.variable(nni.uniform(low, high),name=variable)` Which means the variable value is a value uniformly between low and high. * `@nni.variable(nni.quniform(low, high, q),name=variable)` diff --git a/docs/en_US/BuiltinTuner.md b/docs/en_US/BuiltinTuner.md index a84d159aa2..336eecce91 100644 --- a/docs/en_US/BuiltinTuner.md +++ b/docs/en_US/BuiltinTuner.md @@ -2,7 +2,7 @@ NNI provides state-of-the-art tuning algorithm as our builtin-tuners and makes them easy to use. Below is the brief summary of NNI currently built-in Tuners: -Note: Click the **Tuner's name** to get a detailed description of the algorithm, click the corresponding **Usage** to get the Tuner's installation requirements, suggested scenario and using example. Here is an [article](./Blog/HPOComparison.md) about the comparison of different Tuners on several problems.
+Note: Click the **Tuner's name** to get a detailed description of the algorithm, and click the corresponding **Usage** to get the Tuner's installation requirements, suggested scenarios and a usage example. Here is an [article](./CommunitySharings/HPOComparison.md) comparing different Tuners on several problems. Currently we support the following algorithms: diff --git a/docs/en_US/CommunitySharings/NniPracticeSharing/HpoComparison.md b/docs/en_US/CommunitySharings/HpoComparision.md similarity index 100% rename from docs/en_US/CommunitySharings/NniPracticeSharing/HpoComparison.md rename to docs/en_US/CommunitySharings/HpoComparision.md diff --git a/docs/en_US/CommunitySharings/AutomlPracticeSharing/NasComparison.md b/docs/en_US/CommunitySharings/NasComparision.md similarity index 100% rename from docs/en_US/CommunitySharings/AutomlPracticeSharing/NasComparison.md rename to docs/en_US/CommunitySharings/NasComparision.md diff --git a/docs/en_US/ExperimentConfig.md b/docs/en_US/ExperimentConfig.md index 892fcd1526..8bb9a13e9f 100644 --- a/docs/en_US/ExperimentConfig.md +++ b/docs/en_US/ExperimentConfig.md @@ -399,6 +399,15 @@ machineList: __gpuIndices__ is used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified, multiple GPU indices are seperated by comma(,), such as `1` or `0,1,3`. + * __maxTrialNumPerGpu__ + + __maxTrialNumPerGpu__ is used to specify the maximum number of concurrent trials on a GPU device. + + * __useActiveGpu__ + + __useActiveGpu__ is used to specify whether to use a GPU on which another process is running. By default, NNI uses a GPU only if there is no other active process on it; if __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
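To make the two new fields concrete, here is a hedged sketch of how they might appear in a local-mode experiment config (assuming, as in local mode, that they sit alongside `gpuIndices`; the values are illustrative):

```yaml
trainingServicePlatform: local
localConfig:
  gpuIndices: "0,1,3"    # restrict NNI trials to these GPU devices
  maxTrialNumPerGpu: 2   # allow at most 2 concurrent trials per GPU
  useActiveGpu: true     # schedule onto a GPU even if another process is active
```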
+ + * __machineList__ __machineList__ should be set if __trainingServicePlatform__ is set to remote, or it should be empty. @@ -433,6 +442,14 @@ machineList: __gpuIndices__ is used to specify designated GPU devices for NNI on this remote machine, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified, multiple GPU indices are seperated by comma(,), such as `1` or `0,1,3`. + * __maxTrialNumPerGpu__ + + __maxTrialNumPerGpu__ is used to specify the maximum number of concurrent trials on a GPU device. + + * __useActiveGpu__ + + __useActiveGpu__ is used to specify whether to use a GPU on which another process is running. By default, NNI uses a GPU only if there is no other active process on it; if __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows. + * __kubeflowConfig__: * __operator__ diff --git a/docs/en_US/FAQ.md b/docs/en_US/FAQ.md index 05756fd08b..1b1c9146bf 100644 --- a/docs/en_US/FAQ.md +++ b/docs/en_US/FAQ.md @@ -36,8 +36,8 @@ Unable to open the WebUI may have the following reasons: * If you still can't see the WebUI after you use the server IP, you can check the proxy and the firewall of your machine. Or use the browser on the machine where you start your NNI experiment. * Another reason may be your experiment is failed and NNI may fail to get the experiment infomation. You can check the log of NNImanager in the following directory: ~/nni/experiment/[your_experiment_id] /log/nnimanager.log -### Windows local mode problems -Please refer to [NNI Windows local mode](WindowsLocalMode.md) +### NNI on Windows problems +Please refer to [NNI on Windows](NniOnWindows.md) ### Help us improve Please inquiry the problem in https://github.com/Microsoft/nni/issues to see whether there are other people already reported the problem, create a new one if there are no existing issues been created.
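Analogously for remote mode, the per-machine GPU fields described above would be set inside each `machineList` entry. A hedged sketch (hostnames and credentials are placeholders):

```yaml
machineList:
  - ip: 10.1.1.1
    port: 22
    username: bob
    passwd: bob123
    gpuIndices: "0,1"     # GPUs on this remote machine available to NNI
    maxTrialNumPerGpu: 1  # at most one trial per GPU on this machine
    useActiveGpu: false   # skip GPUs that already have an active process
```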
diff --git a/docs/en_US/GeneralNasInterfaces.md b/docs/en_US/GeneralNasInterfaces.md new file mode 100644 index 0000000000..40e1518bbe --- /dev/null +++ b/docs/en_US/GeneralNasInterfaces.md @@ -0,0 +1,142 @@ +# General Programming Interface for Neural Architecture Search + +Automatic neural architecture search is taking an increasingly important role in finding better models. Recent research has proved the feasibility of automatic NAS and has found models that beat manually designed and tuned ones. Some representative works are [NASNet][2], [ENAS][1], [DARTS][3], [Network Morphism][4], and [Evolution][5], and new innovations keep emerging. However, it takes great effort to implement those algorithms, and it is hard to reuse the code base of one algorithm to implement another. + +To facilitate NAS innovations (e.g., designing/implementing new NAS models, comparing different NAS models side by side), an easy-to-use and flexible programming interface is crucial. + +## Programming interface + + A new programming interface for designing and searching for a model is often demanded in two scenarios. 1) When designing a neural network, the designer may have multiple choices for a layer, sub-model, or connection, and is not sure which one, or which combination, performs best. It would be appealing to have an easy way to express the candidate layers/sub-models they want to try. 2) Researchers working on automatic NAS want a unified way to express the search space of neural architectures, so that unchanged trial code can be adapted to different search algorithms. + + We designed a simple and flexible programming interface based on [NNI annotation](./AnnotationSpec.md). It is elaborated through the examples below. + + ### Example: choose an operator for a layer + +When designing the following model, there might be several choices in the fourth layer that may make this model perform well.
In the script of this model, we can use annotation for the fourth layer as shown in the figure. In this annotation, there are five fields in total: + +![](../img/example_layerchoice.png) + +* __layer_choice__: It is a list of function calls; each function should be defined in the user's script or an imported library. The input arguments of the function should follow the format: `def XXX(inputs, arg2, arg3, ...)`, where `inputs` is a list with two elements. One is the list of `fixed_inputs`, and the other is a list of the chosen inputs from `optional_inputs`. `conv` and `pool` in the figure are examples of function definitions. For the function calls in this list, there is no need to write the first argument (i.e., `input`). Note that only one of the function calls is chosen for this layer. +* __fixed_inputs__: It is a list of variables; a variable could be an output tensor from a previous layer, the `layer_output` of another nni.mutable_layer before this layer, or any other python variable defined before this layer. All the variables in this list will be fed into the chosen function in `layer_choice` (as the first element of the `input` list). +* __optional_inputs__: It is a list of variables; a variable could be an output tensor from a previous layer, the `layer_output` of another nni.mutable_layer before this layer, or any other python variable defined before this layer. Only `optional_input_size` variables will be fed into the chosen function in `layer_choice` (as the second element of the `input` list). +* __optional_input_size__: It indicates how many inputs are chosen from `optional_inputs`. It could be a number or a range. A range [1,3] means it chooses 1, 2, or 3 inputs. +* __layer_output__: The name of the output(s) of this layer; in this case it represents the return value of the function call in `layer_choice`. This will be a variable name that can be used in the following python code or nni.mutable_layer(s).
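The calling convention above can be illustrated with a toy sketch. This is not NNI's real implementation — `conv`, `pool`, and the dispatcher below are hypothetical stand-ins that only demonstrate how the `[fixed_inputs, chosen optional inputs]` list is assembled and passed to the chosen function:

```python
def conv(inputs, ch=128):
    # Hypothetical candidate op: scales the sum of all fed-in values by `ch`.
    fixed, chosen = inputs  # inputs == [fixed_inputs, chosen optional inputs]
    return ch * (sum(fixed) + sum(chosen))

def pool(inputs):
    # Hypothetical candidate op: max over all fed-in values.
    fixed, chosen = inputs
    return max(fixed + chosen)

def mutable_layer(layer_choice, fixed_inputs, optional_inputs,
                  chosen_name, chosen_indices):
    # Toy dispatcher standing in for the compiled annotation: pick one
    # candidate function and feed it the assembled `inputs` list.
    chosen = [optional_inputs[i] for i in chosen_indices]
    return layer_choice[chosen_name]([fixed_inputs, chosen])

ops = {"conv": conv, "pool": pool}
out = mutable_layer(ops, fixed_inputs=[1.0], optional_inputs=[2.0, 3.0, 4.0],
                    chosen_name="pool", chosen_indices=[0, 2])
# `out` is max([1.0] + [2.0, 4.0]) == 4.0
```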
+ +There are two ways to write the annotation for this example. For the upper one, the `input` of the function calls is `[[],[out3]]`. For the bottom one, `input` is `[[out3],[]]`. + +### Example: choose input connections for a layer + +Designing the connections of layers is critical for making a high-performance model. With our provided interface, users can annotate which connections a layer takes as inputs, choosing several from a set of candidate connections. Below is an example which chooses two inputs from three candidate inputs for `concat`. Here `concat` always takes the output of its previous layer using `fixed_inputs`. + +![](../img/example_connectchoice.png) + +### Example: choose both operators and connections + +In this example, we choose one of the three operators and choose two connections for it. As there are multiple variables in `inputs`, we call `concat` at the beginning of the functions. + +![](../img/example_combined.png) + +### Example: [ENAS][1] macro search space + +To illustrate the convenience of the programming interface, we use the interface to implement the trial code of "ENAS + macro search space". The left figure is the macro search space in the ENAS paper. + +![](../img/example_enas.png) + + +## Unified NAS search space specification + +After finishing the trial code through the annotation above, users have implicitly specified the search space of neural architectures in the code. Based on the code, NNI will automatically generate a search space file which can be fed into tuning algorithms. This search space file follows the `json` format below. + +```json +{ + "mutable_1": { + "layer_1": { + "layer_choice": ["conv(ch=128)", "pool", "identity"], + "optional_inputs": ["out1", "out2", "out3"], + "optional_input_size": 2 + }, + "layer_2": { + ...
+ } + } +} +``` + +Accordingly, a specified neural architecture (generated by a tuning algorithm) is expressed as follows: + +```json +{ + "mutable_1": { + "layer_1": { + "chosen_layer": "pool", + "chosen_inputs": ["out1", "out3"] + }, + "layer_2": { + ... + } + } +} +``` + +With the specification of the format of search space and architecture (choice) expression, users are free to implement various (general) tuning algorithms for neural architecture search on NNI. One future work is to provide a general NAS algorithm. + +============================================================= + +## Neural architecture search on NNI + +### Basic flow of experiment execution + +NNI's annotation compiler transforms the annotated trial code into code that can receive an architecture choice and build the corresponding model (i.e., graph). The NAS search space can be seen as a full graph (here, a full graph means enabling all the provided operators and connections to build a graph); the architecture chosen by the tuning algorithm is a subgraph of it. By default, the compiled trial code only builds and executes the subgraph. + +![](../img/nas_on_nni.png) + +The above figure shows how the trial code runs on NNI. `nnictl` processes user trial code to generate a search space file and compiled trial code. The former is fed to the tuner, and the latter is used to run trials. + +[__TODO__] Simple example of NAS on NNI. + +### Weight sharing + +Sharing weights among chosen architectures (i.e., trials) could speed up model search. For example, properly inheriting the weights of completed trials could speed up the convergence of new trials. One-Shot NAS (e.g., ENAS, DARTS) is more aggressive: the training of different architectures (i.e., subgraphs) shares the same copy of the weights in the full graph. + +![](../img/nas_weight_share.png) + +We believe weight sharing (transferring) plays a key role in speeding up NAS, while finding efficient ways of sharing weights is still a hot research topic.
We provide a key-value store for users to store and load weights. Tuners and Trials use a provided KV client lib to access the storage. + +[__TODO__] Example of weight sharing on NNI. + +### Support of One-Shot NAS + +One-Shot NAS is a popular approach to finding a good neural architecture within a limited time and resource budget. Basically, it builds a full graph based on the search space and uses gradient descent to finally find the best subgraph. There are different training approaches, such as [training subgraphs (per mini-batch)][1], [training the full graph through dropout][6], and [training with architecture weights (regularization)][3]. Here we focus on the first approach, i.e., training subgraphs (ENAS). + +With the same annotated trial code, users can choose One-Shot NAS as the execution mode on NNI. Specifically, the compiled trial code builds the full graph (rather than a subgraph, as demonstrated above); it receives a chosen architecture, trains this architecture on the full graph for a mini-batch, and then requests another chosen architecture. This is supported by [NNI multi-phase](./multiPhase.md). We support this training approach because training a subgraph is very fast, and building the graph every time a subgraph is trained would induce too much overhead. + +![](../img/one-shot_training.png) + +The design of One-Shot NAS on NNI is shown in the above figure. One-Shot NAS usually has only one trial job with the full graph. NNI supports running multiple such trial jobs, each of which runs independently. As One-Shot NAS is not stable, running multiple instances helps find better models. Moreover, trial jobs are also able to synchronize weights during running (i.e., there is only one copy of the weights, like asynchronous parameter-server mode). This may speed up convergence. + +[__TODO__] Example of One-Shot NAS on NNI. + + +## General tuning algorithms for NAS + +Like hyperparameter tuning, a relatively general algorithm for NAS is required.
The general programming interface makes this task easier to some extent. We have an RL-based tuner algorithm for NAS from our contributors. We expect efforts from the community to design and implement better NAS algorithms. + +[__TODO__] More tuning algorithms for NAS. + +## Export best neural architecture and code + +[__TODO__] After the NNI experiment is done, users could run `nnictl experiment export --code` to export the trial code with the best neural architecture. + +## Conclusion and Future work + +There could be different NAS algorithms and execution modes, but they could all be supported with the same programming interface as demonstrated above. + +There are many interesting research topics in this area, in both systems and machine learning. + + +[1]: https://arxiv.org/abs/1802.03268 +[2]: https://arxiv.org/abs/1707.07012 +[3]: https://arxiv.org/abs/1806.09055 +[4]: https://arxiv.org/abs/1806.10282 +[5]: https://arxiv.org/abs/1703.01041 +[6]: http://proceedings.mlr.press/v80/bender18a/bender18a.pdf diff --git a/docs/en_US/Installation.md b/docs/en_US/Installation.md index 91156481b8..213012eee6 100644 --- a/docs/en_US/Installation.md +++ b/docs/en_US/Installation.md @@ -1,6 +1,6 @@ # Installation of NNI -Currently we support installation on Linux, Mac and Windows(local mode). +Currently we support installation on Linux, Mac and Windows (local, remote and pai mode). ## **Installation on Linux & Mac** @@ -15,7 +15,7 @@ Currently we support installation on Linux, Mac and Windows(local mode). Prerequisite: `python >=3.5`, `git`, `wget` ```bash - git clone -b v0.7 https://github.com/Microsoft/nni.git + git clone -b v0.8 https://github.com/Microsoft/nni.git cd nni ./install.sh ``` @@ -48,9 +48,9 @@ Currently we support installation on Linux, Mac and Windows(local mode).
you can install NNI as administrator or current user as follows: ```bash - git clone -b v0.7 https://github.com/Microsoft/nni.git + git clone -b v0.8 https://github.com/Microsoft/nni.git cd nni - powershell ./install.ps1 + powershell .\install.ps1 ``` ## **System requirements** diff --git a/docs/en_US/MnistExamples.md b/docs/en_US/MnistExamples.md index 916c32f153..5393a0eff5 100644 --- a/docs/en_US/MnistExamples.md +++ b/docs/en_US/MnistExamples.md @@ -51,7 +51,7 @@ This example is to show how to use hyperband to tune the model. There is one mor This example is to show that NNI also support nested search space. The search space file is an example of how to define nested search space. -`code directory: examples/trials/mnist-cascading-search-space/` +`code directory: examples/trials/mnist-nested-search-space/` **distributed MNIST (tensorflow) using kubeflow** diff --git a/docs/en_US/WindowsLocalMode.md b/docs/en_US/NniOnWindows.md similarity index 72% rename from docs/en_US/WindowsLocalMode.md rename to docs/en_US/NniOnWindows.md index d4a1e172e6..23d6d46f8b 100644 --- a/docs/en_US/WindowsLocalMode.md +++ b/docs/en_US/NniOnWindows.md @@ -1,39 +1,15 @@ -# Windows Local Mode (experimental feature) +# NNI on Windows (experimental feature) -Currently we only support local mode on Windows. Windows 10.1809 is well tested and recommended. +Currently we support local, remote and pai mode on Windows. Windows 10.1809 is well tested and recommended. 
## **Installation on Windows** - **Anaconda or Miniconda python(64-bit) is highly recommended.** - -When you use PowerShell to run script for the first time, you need **run PowerShell as administrator** with this command: - -```bash -Set-ExecutionPolicy -ExecutionPolicy Unrestricted -``` - -* __Install NNI through pip__ - - Prerequisite: `python(64-bit) >= 3.5` - - ```bash - python -m pip install --upgrade nni - ``` - -* __Install NNI through source code__ - - Prerequisite: `python >=3.5`, `git`, `PowerShell` - - ```bash - git clone -b v0.7 https://github.com/Microsoft/nni.git - cd nni - powershell ./install.ps1 - ``` + please refer to [Installation](Installation.md#installation-on-windows) for more details. When these things are done, use the **config_windows.yml** configuration to start an experiment for validation. ```bash -nnictl create --config nni/examples/trials/mnist/config_windows.yml +nnictl create --config nni\examples\trials\mnist\config_windows.yml ``` For other examples you need to change trial command `python3` into `python` in each example YAML. diff --git a/docs/en_US/Nnictl.md b/docs/en_US/Nnictl.md index 5eaa51e25e..8fca6d6034 100644 --- a/docs/en_US/Nnictl.md +++ b/docs/en_US/Nnictl.md @@ -124,21 +124,21 @@ Debug mode will disable version check function in Trialkeeper. nnictl stop ``` - 1. If there is an id specified, and the id matches the running experiment, nnictl will stop the corresponding experiment, or will print error message. + 2. If there is an id specified, and the id matches the running experiment, nnictl will stop the corresponding experiment, or will print error message. ```bash nnictl stop [experiment_id] ``` - 1. Users could use 'nnictl stop all' to stop all experiments. + 3. Users could use 'nnictl stop all' to stop all experiments. ```bash nnictl stop all ``` - 1. If the id ends with *, nnictl will stop all experiments whose ids matchs the regular. - 1. 
If the id does not exist but match the prefix of an experiment id, nnictl will stop the matched experiment. - 1. If the id does not exist but match multiple prefix of the experiment ids, nnictl will give id information. + 4. If the id ends with *, nnictl will stop all experiments whose ids match the pattern. + 5. If the id does not exist but matches the prefix of an experiment id, nnictl will stop the matched experiment. + 6. If the id does not exist but matches the prefix of multiple experiment ids, nnictl will print the id information. @@ -650,3 +650,4 @@ Debug mode will disable version check function in Trialkeeper. ```bash nnictl --version ``` + \ No newline at end of file diff --git a/docs/en_US/QuickStart.md b/docs/en_US/QuickStart.md index 9f7e929ac7..4d61c35748 100644 --- a/docs/en_US/QuickStart.md +++ b/docs/en_US/QuickStart.md @@ -2,7 +2,7 @@ ## Installation -We support Linux MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 and Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`. +We support Linux, MacOS and Windows at the current stage. Ubuntu 16.04 or higher, MacOS 10.14.1 and Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`. #### Linux and MacOS ```bash @@ -10,7 +10,7 @@ We support Linux MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or ``` #### Windows -If you choose Windows local mode and use PowerShell to run script, you need run below PowerShell command as administrator at first time. +If you are using NNI on Windows, you need to run the PowerShell command below as administrator the first time. ```bash Set-ExecutionPolicy -ExecutionPolicy Unrestricted ``` @@ -151,10 +151,10 @@ Run the **config.yml** file from your command line to start MNIST experiment. #### Windows Run the **config_windows.yml** file from your command line to start MNIST experiment.
-**Note**, if you're using windows local mode, it needs to change `python3` to `python` in the config.yml file, or use the config_windows.yml file to start the experiment. +**Note**: if you're using NNI on Windows, you need to change `python3` to `python` in the config.yml file, or use the config_windows.yml file to start the experiment. ```bash - nnictl create --config nni/examples/trials/mnist/config_windows.yml + nnictl create --config nni\examples\trials\mnist\config_windows.yml ``` Note, **nnictl** is a command line tool, which can be used to control experiments, such as start/stop/resume an experiment, start/stop NNIBoard, etc. Click [here](Nnictl.md) for more usage of `nnictl` diff --git a/docs/en_US/RemoteMachineMode.md b/docs/en_US/RemoteMachineMode.md index f5e0aa3859..8d7d1a3c34 100644 --- a/docs/en_US/RemoteMachineMode.md +++ b/docs/en_US/RemoteMachineMode.md @@ -55,7 +55,8 @@ machineList: username: bob passwd: bob123 ``` - +You can use different systems to run experiments on the remote machine. +#### Linux and MacOS Simply filling the `machineList` section and then run: ```bash nnictl create --config ~/nni/examples/trials/mnist-annotation/config_remote.yml ``` to start the experiment. +#### Windows +Simply fill in the `machineList` section and then run: + +```bash +nnictl create --config %userprofile%\nni\examples\trials\mnist-annotation\config_remote.yml +``` + +to start the experiment. + ## version check NNI support version check feature in since version 0.6, [refer](PaiMode.md) \ No newline at end of file diff --git a/docs/en_US/SearchSpaceSpec.md b/docs/en_US/SearchSpaceSpec.md index 0a5f06737f..cc7ce21204 100644 --- a/docs/en_US/SearchSpaceSpec.md +++ b/docs/en_US/SearchSpaceSpec.md @@ -29,16 +29,16 @@ All types of sampling strategies and their parameter are listed here: * Which means the variable's value is one of the options. Here 'options' should be a list. Each element of options is a number of string.
It could also be a nested sub-search-space; this sub-search-space takes effect only when the corresponding element is chosen. The variables in this sub-search-space could be seen as conditional variables. - * An simple [example](../../examples/trials/mnist-cascading-search-space/search_space.json) of [nested] search space definition. If an element in the options list is a dict, it is a sub-search-space, and for our built-in tuners you have to add a key '_name' in this dict, which helps you to identify which element is chosen. Accordingly, here is a [sample](../../examples/trials/mnist-cascading-search-space/sample.json) which users can get from nni with nested search space definition. Tuners which support nested search space is as follows: + * A simple [example](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nested-search-space/search_space.json) of [nested] search space definition. If an element in the options list is a dict, it is a sub-search-space, and for our built-in tuners you have to add a key '_name' in this dict, which helps you to identify which element is chosen. Accordingly, here is a [sample](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nested-search-space/sample.json) which users can get from nni with nested search space definition. Tuners that support nested search space are as follows: - Random Search - TPE - Anneal - Evolution -* {"_type":"randint","_value":[upper]} +* {"_type":"randint","_value":[lower, upper]} - * Which means the variable value is a random integer in the range [0, upper). The semantics of this distribution is that there is no more correlation in the loss function between nearby integer values, as compared with more distant integer values. This is an appropriate distribution for describing random seeds for example.
If the loss function is probably more correlated for nearby integer values, then you should probably use one of the "quantized" continuous distributions, such as either quniform, qloguniform, qnormal or qlognormal. Note that if you want to change lower bound, you can use `quniform` for now. + * For now, we implement the "randint" distribution with "quniform", which means the variable value is a value like round(uniform(lower, upper)). The chosen value is a float. If you want an integer value, please convert it explicitly. * {"_type":"uniform","_value":[low, high]} * Which means the variable value is a value uniformly between low and high. @@ -86,9 +86,19 @@ All types of sampling strategies and their parameter are listed here: | Hyperband Advisor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | Metis Tuner | ✓ | ✓ | ✓ | ✓ | | | | | | | -Note that In Grid Search Tuner, for users' convenience, the definition of `quniform` and `qloguniform` change, where q here specifies the number of values that will be sampled. Details about them are listed as follows -* Type 'quniform' will receive three values [low, high, q], where [low, high] specifies a range and 'q' specifies the number of values that will be sampled evenly. Note that q should be at least 2. It will be sampled in a way that the first sampled value is 'low', and each of the following values is (high-low)/q larger that the value in front of it. -* Type 'qloguniform' behaves like 'quniform' except that it will first change the range to [log(low), log(high)] and sample and then change the sampled value back. +Known Limitations: -Note that Metis Tuner only support numerical `choice` now +* Note that in the Grid Search Tuner, for users' convenience, the definition of `quniform` and `qloguniform` changes, where q here specifies the number of values that will be sampled.
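The two sampling behaviors described here can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not NNI's actual sampling code: `randint` is drawn as `round(uniform(lower, upper))` and reported as a float, while the Grid Search Tuner's reading of `quniform` enumerates `q` evenly spaced values starting at `low` with step `(high - low) / q`.

```python
import random

def sample_randint(lower, upper):
    # randint via quniform: round(uniform(lower, upper)).
    # Mirroring the doc, the chosen value is reported as a float;
    # convert to int explicitly if your trial code needs an integer.
    return float(round(random.uniform(lower, upper)))

def grid_quniform(low, high, q):
    # Grid Search reading of quniform: q evenly sampled values, the first
    # being low and each following one (high - low) / q larger.
    step = (high - low) / q
    return [low + i * step for i in range(q)]

value = sample_randint(2, 10)
assert 2.0 <= value <= 10.0 and isinstance(value, float)
print(grid_quniform(0.0, 10.0, 5))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```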
Details about them are listed as follows + + * Type 'quniform' will receive three values [low, high, q], where [low, high] specifies a range and 'q' specifies the number of values that will be sampled evenly. Note that q should be at least 2. It will be sampled in a way that the first sampled value is 'low', and each of the following values is (high-low)/q larger than the value in front of it. + + * Type 'qloguniform' behaves like 'quniform' except that it will first change the range to [log(low), log(high)], sample, and then change the sampled value back. + +* Note that the Metis Tuner only supports numerical `choice` now + +* Note that for nested search space: + + * Only the Random Search/TPE/Anneal/Evolution tuners support nested search space + + * We do not support the nested search space "Hyper Parameter" parallel graph now; the enhancement is being considered in [#1110](https://github.com/microsoft/nni/issues/1110), and any suggestions, discussions, or contributions are warmly welcomed diff --git a/docs/en_US/WebUI.md b/docs/en_US/WebUI.md index 6cbe8c88d4..17eb48cebf 100644 --- a/docs/en_US/WebUI.md +++ b/docs/en_US/WebUI.md @@ -6,6 +6,8 @@ Click the tab "Overview". * See the experiment trial profile and search space message. * Support to download the experiment result. +* Support exporting the nni-manager and dispatcher log files. +* If you have any questions, you can click "Feedback" to report them. ![](../img/webui-img/over1.png) * See good performance trials. @@ -52,6 +54,14 @@ Click the tab "Trials Detail" to see the status of the all trials. Specifically: ![](../img/webui-img/detail-local.png) +* The "Add column" button lets you select which columns to show in the table. If your experiment's final result is a dict, you can see its other keys in the table. + +![](../img/webui-img/addColumn.png) + +* You can use the button named "Copy as python" to copy a trial's parameters.
+ +![](../img/webui-img/copyParameter.png) + * If you run on OpenPAI or Kubeflow platform, you can also see the hdfsLog. ![](../img/webui-img/detail-pai.png) diff --git a/docs/en_US/automl_practice_sharing.rst b/docs/en_US/automl_practice_sharing.rst deleted file mode 100644 index ccb37051aa..0000000000 --- a/docs/en_US/automl_practice_sharing.rst +++ /dev/null @@ -1,8 +0,0 @@ -################# -AutoML Practice Sharing -################# - -.. toctree:: - :maxdepth: 2 - - Neural Architecture Search Comparison diff --git a/docs/en_US/community_sharings.rst b/docs/en_US/community_sharings.rst index 6f8ed9ec01..b663d22620 100644 --- a/docs/en_US/community_sharings.rst +++ b/docs/en_US/community_sharings.rst @@ -8,4 +8,5 @@ In addtion to the official tutorilas and examples, we encourage community contri :maxdepth: 2 NNI Practice Sharing - AutoML Practice Sharing + Neural Architecture Search Comparison + Hyper-parameter Tuning Algorithm Comparison diff --git a/docs/en_US/nni_practice_sharing.rst b/docs/en_US/nni_practice_sharing.rst index ffadc90aa2..ee935774f4 100644 --- a/docs/en_US/nni_practice_sharing.rst +++ b/docs/en_US/nni_practice_sharing.rst @@ -7,5 +7,4 @@ Sharing the practice of leveraging NNI to tune models and systems. ..
toctree:: :maxdepth: 2 - Tuning SVD of Recommenders on NNI - Auto-tuning AutoGBDT and RocksDB on NNI \ No newline at end of file + Tuning SVD of Recommenders on NNI \ No newline at end of file diff --git a/docs/img/example_combined.png b/docs/img/example_combined.png new file mode 100644 index 0000000000..4757892266 Binary files /dev/null and b/docs/img/example_combined.png differ diff --git a/docs/img/example_connectchoice.png b/docs/img/example_connectchoice.png new file mode 100644 index 0000000000..74559b8e47 Binary files /dev/null and b/docs/img/example_connectchoice.png differ diff --git a/docs/img/example_enas.png b/docs/img/example_enas.png new file mode 100644 index 0000000000..19c47ec89d Binary files /dev/null and b/docs/img/example_enas.png differ diff --git a/docs/img/example_layerchoice.png b/docs/img/example_layerchoice.png new file mode 100644 index 0000000000..d325328e58 Binary files /dev/null and b/docs/img/example_layerchoice.png differ diff --git a/docs/img/nas_on_nni.png b/docs/img/nas_on_nni.png new file mode 100644 index 0000000000..7359c210d8 Binary files /dev/null and b/docs/img/nas_on_nni.png differ diff --git a/docs/img/nas_weight_share.png b/docs/img/nas_weight_share.png new file mode 100644 index 0000000000..e66beb7829 Binary files /dev/null and b/docs/img/nas_weight_share.png differ diff --git a/docs/img/one-shot_training.png b/docs/img/one-shot_training.png new file mode 100644 index 0000000000..746d7008b1 Binary files /dev/null and b/docs/img/one-shot_training.png differ diff --git a/docs/img/webui-img/addColumn.png b/docs/img/webui-img/addColumn.png new file mode 100644 index 0000000000..70935e5e8b Binary files /dev/null and b/docs/img/webui-img/addColumn.png differ diff --git a/docs/img/webui-img/copyParameter.png b/docs/img/webui-img/copyParameter.png new file mode 100644 index 0000000000..91f06af350 Binary files /dev/null and b/docs/img/webui-img/copyParameter.png differ diff --git a/docs/img/webui-img/detail-local.png 
b/docs/img/webui-img/detail-local.png index 065221fda0..d1249eac79 100644 Binary files a/docs/img/webui-img/detail-local.png and b/docs/img/webui-img/detail-local.png differ diff --git a/docs/zh_CN/AdvancedNAS.md b/docs/zh_CN/AdvancedNas.md similarity index 99% rename from docs/zh_CN/AdvancedNAS.md rename to docs/zh_CN/AdvancedNas.md index f77ef2fa1c..e326d3686a 100644 --- a/docs/zh_CN/AdvancedNAS.md +++ b/docs/zh_CN/AdvancedNas.md @@ -15,7 +15,7 @@ ```yaml tuner: codeDir: path/to/customer_tuner - classFileName: customer_tuner.py + classFileName: customer_tuner.py className: CustomerTuner classArgs: ... diff --git a/docs/zh_CN/AnnotationSpec.md b/docs/zh_CN/AnnotationSpec.md index 3405aa2cc2..a7e8afc928 100644 --- a/docs/zh_CN/AnnotationSpec.md +++ b/docs/zh_CN/AnnotationSpec.md @@ -9,6 +9,7 @@ ```python '''@nni.variable(nni.choice(0.1, 0.01, 0.001), name=learning_rate)''' learning_rate = 0.1 + ``` 此样例中,NNI 会从 (0.1, 0.01, 0.001) 中选择一个值赋给 learning_rate 变量。 第一行就是 NNI 的 Annotation,是 Python 中的一个字符串。 接下来的一行需要是赋值语句。 NNI 会根据 Annotation 行的信息,来给这一行的变量赋上相应的值。 diff --git a/docs/zh_CN/batchTuner.md b/docs/zh_CN/BatchTuner.md similarity index 90% rename from docs/zh_CN/batchTuner.md rename to docs/zh_CN/BatchTuner.md index 6231965559..89b361d08a 100644 --- a/docs/zh_CN/batchTuner.md +++ b/docs/zh_CN/BatchTuner.md @@ -1,6 +1,6 @@ # Batch Tuner -## Batch Tuner(批量调参器) +## Batch Tuner(批处理 Tuner) Batch Tuner 能让用户简单的提供几组配置(如,超参选项的组合)。 当所有配置都执行完后,Experiment 即结束。 Batch Tuner 的[搜索空间](SearchSpaceSpec.md)只支持 `choice`。 diff --git a/docs/zh_CN/Blog/index.rst b/docs/zh_CN/Blog/index.rst deleted file mode 100644 index 0041f0f39e..0000000000 --- a/docs/zh_CN/Blog/index.rst +++ /dev/null @@ -1,9 +0,0 @@ -###################### -博客 -###################### - -.. 
toctree:: - :maxdepth: 2 - - 超参优化的对比 - 神经网络结构搜索(NAS)的对比 \ No newline at end of file diff --git a/docs/zh_CN/bohbAdvisor.md b/docs/zh_CN/BohbAdvisor.md similarity index 99% rename from docs/zh_CN/bohbAdvisor.md rename to docs/zh_CN/BohbAdvisor.md index 25731f5750..f319494f74 100644 --- a/docs/zh_CN/bohbAdvisor.md +++ b/docs/zh_CN/BohbAdvisor.md @@ -10,7 +10,7 @@ BOHB 依赖 HB(Hyperband)来决定每次跑多少组参数和每组参数分 ### HB(Hyperband) -按照 Hyperband 的方式来选择每次跑的参数个数与分配多少资源(budget),并继续使用“连续减半(SuccessiveHalving)”策略,更多有关Hyperband算法的细节,请参考[NNI 中的 Hyperband](hyperbandAdvisor.md) 和 [Hyperband 的参考论文](https://arxiv.org/abs/1603.06560)。 下面的伪代码描述了这个过程。 +按照 Hyperband 的方式来选择每次跑的参数个数与分配多少资源(budget),并继续使用“连续减半(SuccessiveHalving)”策略,更多有关Hyperband算法的细节,请参考[NNI 中的 Hyperband](HyperbandAdvisor.md) 和 [Hyperband 的参考论文](https://arxiv.org/abs/1603.06560)。 下面的伪代码描述了这个过程。 ![](../img/bohb_1.png) diff --git a/docs/zh_CN/Builtin_Assessors.md b/docs/zh_CN/BuiltinAssessors.md similarity index 100% rename from docs/zh_CN/Builtin_Assessors.md rename to docs/zh_CN/BuiltinAssessors.md diff --git a/docs/zh_CN/Builtin_Tuner.md b/docs/zh_CN/BuiltinTuner.md similarity index 99% rename from docs/zh_CN/Builtin_Tuner.md rename to docs/zh_CN/BuiltinTuner.md index a4f2aa42b9..caf1f1b027 100644 --- a/docs/zh_CN/Builtin_Tuner.md +++ b/docs/zh_CN/BuiltinTuner.md @@ -2,7 +2,7 @@ NNI 提供了先进的调优算法,使用上也很简单。 下面是内置 Tuner 的简单介绍: -注意:点击 **Tuner 的名称**可跳转到算法的详细描述,点击**用法**可看到 Tuner 的安装要求、建议场景和使用样例等等。 [此文章](./Blog/HPOComparison.md)对比了不同 Tuner 在几个问题下的不同效果。 +注意:点击 **Tuner 的名称**可跳转到算法的详细描述,点击**用法**可看到 Tuner 的安装要求、建议场景和使用样例等等。 [此文章](./CommunitySharings/HPOComparison.md)对比了不同 Tuner 在几个问题下的不同效果。 当前支持的 Tuner: diff --git a/docs/zh_CN/cifar10_examples.md b/docs/zh_CN/Cifar10Examples.md similarity index 100% rename from docs/zh_CN/cifar10_examples.md rename to docs/zh_CN/Cifar10Examples.md diff --git a/docs/zh_CN/Blog/NASComparison.md b/docs/zh_CN/CommunitySharings/AutomlPracticeSharing/NasComparison.md similarity index 91% rename from 
docs/zh_CN/Blog/NASComparison.md rename to docs/zh_CN/CommunitySharings/AutomlPracticeSharing/NasComparison.md index 969340ef8d..dc2524aa71 100644 --- a/docs/zh_CN/Blog/NASComparison.md +++ b/docs/zh_CN/CommunitySharings/AutomlPracticeSharing/NasComparison.md @@ -14,7 +14,7 @@ - NAO: -## 实验描述 +## 实验说明 为了避免算法仅仅在 **CIFAR-10** 数据集上过拟合,还对比了包括 Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (ImageNet的子集) 和 ImageNet-10-2 (ImageNet 的另一个子集) 在内的其它 5 个数据集。 分别从 ImageNet 中抽取 10 种不同类别标签的子集,组成 ImageNet10-1 和 ImageNet10-2 数据集 。 @@ -33,7 +33,7 @@ NAO 需要太多的计算资源,因此只使用提供 Pipeline 脚本的 NAO-WS。 -对于 Autkeras,使用了 0.2.18 版本的代码, 因为这是开始实验时的最新版本。 +对于 AutoKeras,使用了 0.2.18 版本的代码, 因为这是开始实验时的最新版本。 ## NAS 结果对比 @@ -54,13 +54,13 @@ NAO 需要太多的计算资源,因此只使用提供 Pipeline 脚本的 NAO-W | --------- | ------------ |:----------------:|:----------------:|:--------------:|:-----------:| | CIFAR- 10 | 88.56(best) | 96.13(best) | 97.11(best) | 97.17(average) | 96.47(best) | -对于 AutoKeras,由于其算法中的随机因素,它在所有数据集中的表现相对较差。 +AutoKeras,由于其算法中的随机因素,它在所有数据集中的表现相对较差。 -对于ENAS,ENAS(macro)在 OUI-Adience-Age 数据集中表现较好,并且 ENAS(micro)在 CIFAR-10 数据集中表现较好。 +ENAS,ENAS(macro)在 OUI-Adience-Age 数据集中表现较好,并且 ENAS(micro)在 CIFAR-10 数据集中表现较好。 对于DARTS,在某些数据集上具有良好的结果,但在某些数据集中具有比较大的方差。 DARTS 三次实验中的差异在 OUI-Audience-Age 数据集上可达 5.37%(绝对值),在 ImageNet-10-1 数据集上可达4.36%(绝对值)。 -对于 NAO-WS,它在 ImageNet-10-2 中显示良好,但在 OUI-Adience-Age 中表现非常差。 +NAO-WS 在 ImageNet-10-2 中表现良好,但在 OUI-Adience-Age 中表现非常差。 ## 参考文献 diff --git a/docs/zh_CN/Blog/HPOComparison.md b/docs/zh_CN/CommunitySharings/NniPracticeSharing/HpoComparison.md similarity index 57% rename from docs/zh_CN/Blog/HPOComparison.md rename to docs/zh_CN/CommunitySharings/NniPracticeSharing/HpoComparison.md index 4a7d867909..f5d10a33f5 100644 --- a/docs/zh_CN/Blog/HPOComparison.md +++ b/docs/zh_CN/CommunitySharings/NniPracticeSharing/HpoComparison.md @@ -71,33 +71,33 @@ ### 结果 -| 算法 | 最好的损失值 | 最好的 5 次损失的平均值 | 最好的 10 次损失的平均 | -| ------------- | ------------ | ------------- | ------------- | -| Random Search | 
0.418854 | 0.420352 | 0.421553 | -| Random Search | 0.417364 | 0.420024 | 0.420997 | -| Random Search | 0.417861 | 0.419744 | 0.420642 | -| Grid Search | 0.498166 | 0.498166 | 0.498166 | -| Evolution | 0.409887 | 0.409887 | 0.409887 | -| Evolution | 0.413620 | 0.413875 | 0.414067 | -| Evolution | 0.409887 | 0.409887 | 0.409887 | -| Anneal | 0.414877 | 0.417289 | 0.418281 | -| Anneal | 0.409887 | 0.409887 | 0.410118 | -| Anneal | 0.413683 | 0.416949 | 0.417537 | -| Metis | 0.416273 | 0.420411 | 0.422380 | -| Metis | 0.420262 | 0.423175 | 0.424816 | -| Metis | 0.421027 | 0.424172 | 0.425714 | -| TPE | 0.414478 | 0.414478 | 0.414478 | -| TPE | 0.415077 | 0.417986 | 0.418797 | -| TPE | 0.415077 | 0.417009 | 0.418053 | -| SMAC | **0.408386** | **0.408386** | **0.408386** | -| SMAC | 0.414012 | 0.414012 | 0.414012 | -| SMAC | **0.408386** | **0.408386** | **0.408386** | -| BOHB | 0.410464 | 0.415319 | 0.417755 | -| BOHB | 0.418995 | 0.420268 | 0.422604 | -| BOHB | 0.415149 | 0.418072 | 0.418932 | -| HyperBand | 0.414065 | 0.415222 | 0.417628 | -| HyperBand | 0.416807 | 0.417549 | 0.418828 | -| HyperBand | 0.415550 | 0.415977 | 0.417186 | +| 算法 | 最好的损失值 | 最好的 5 次损失的平均值 | 最好的 10 次损失的平均 | +| ------------------- | ------------ | ------------- | ------------- | +| Random Search(随机搜索) | 0.418854 | 0.420352 | 0.421553 | +| Random Search(随机搜索) | 0.417364 | 0.420024 | 0.420997 | +| Random Search(随机搜索) | 0.417861 | 0.419744 | 0.420642 | +| Grid Search(遍历搜索) | 0.498166 | 0.498166 | 0.498166 | +| Evolution | 0.409887 | 0.409887 | 0.409887 | +| Evolution | 0.413620 | 0.413875 | 0.414067 | +| Evolution | 0.409887 | 0.409887 | 0.409887 | +| Anneal(退火算法) | 0.414877 | 0.417289 | 0.418281 | +| Anneal(退火算法) | 0.409887 | 0.409887 | 0.410118 | +| Anneal(退火算法) | 0.413683 | 0.416949 | 0.417537 | +| Metis | 0.416273 | 0.420411 | 0.422380 | +| Metis | 0.420262 | 0.423175 | 0.424816 | +| Metis | 0.421027 | 0.424172 | 0.425714 | +| TPE | 0.414478 | 0.414478 | 0.414478 | +| TPE | 0.415077 | 
0.417986 | 0.418797 | +| TPE | 0.415077 | 0.417009 | 0.418053 | +| SMAC | **0.408386** | **0.408386** | **0.408386** | +| SMAC | 0.414012 | 0.414012 | 0.414012 | +| SMAC | **0.408386** | **0.408386** | **0.408386** | +| BOHB | 0.410464 | 0.415319 | 0.417755 | +| BOHB | 0.418995 | 0.420268 | 0.422604 | +| BOHB | 0.415149 | 0.418072 | 0.418932 | +| HyperBand | 0.414065 | 0.415222 | 0.417628 | +| HyperBand | 0.416807 | 0.417549 | 0.418828 | +| HyperBand | 0.415550 | 0.415977 | 0.417186 | Metis 算法因为其高斯计算过程的复杂度为 O(n^3) 而运行非常慢,因此仅执行了 300 次 Trial。 @@ -188,14 +188,14 @@ IOPS 与在线处理能力有关,我们在实验中使用 IOPS 作为指标。 #### fillrandom 基准 -| 模型 | 最高 IOPS(重复 1 次) | 最高 IOPS(重复 2 次) | 最高 IOPS(重复 3 次) | -| --------- | --------------- | --------------- | --------------- | -| Random | 449901 | 427620 | 477174 | -| Anneal | 461896 | 467150 | 437528 | -| Evolution | 436755 | 389956 | 389790 | -| TPE | 378346 | 482316 | 468989 | -| SMAC | 491067 | 490472 | **491136** | -| Metis | 444920 | 457060 | 454438 | +| 模型 | 最高 IOPS(重复 1 次) | 最高 IOPS(重复 2 次) | 最高 IOPS(重复 3 次) | +| ------------ | --------------- | --------------- | --------------- | +| Random | 449901 | 427620 | 477174 | +| Anneal(退火算法) | 461896 | 467150 | 437528 | +| Evolution | 436755 | 389956 | 389790 | +| TPE | 378346 | 482316 | 468989 | +| SMAC | 491067 | 490472 | **491136** | +| Metis | 444920 | 457060 | 454438 | 图: @@ -203,14 +203,14 @@ IOPS 与在线处理能力有关,我们在实验中使用 IOPS 作为指标。 #### readrandom 基准 -| 模型 | 最高 IOPS(重复 1 次) | 最高 IOPS(重复 2 次) | 最高 IOPS(重复 3 次) | -| --------- | --------------- | --------------- | --------------- | -| Random | 2276157 | 2285301 | 2275142 | -| Anneal | 2286330 | 2282229 | 2284012 | -| Evolution | 2286524 | 2283673 | 2283558 | -| TPE | 2287366 | 2282865 | 2281891 | -| SMAC | 2270874 | 2284904 | 2282266 | -| Metis | **2287696** | 2283496 | 2277701 | +| 模型 | 最高 IOPS(重复 1 次) | 最高 IOPS(重复 2 次) | 最高 IOPS(重复 3 次) | +| ------------ | --------------- | --------------- | --------------- | +| Random | 2276157 | 2285301 | 
2275142 | +| Anneal(退火算法) | 2286330 | 2282229 | 2284012 | +| Evolution | 2286524 | 2283673 | 2283558 | +| TPE | 2287366 | 2282865 | 2281891 | +| SMAC | 2270874 | 2284904 | 2282266 | +| Metis | **2287696** | 2283496 | 2277701 | 图: diff --git a/docs/zh_CN/CommunitySharings/NniPracticeSharing/RecommendersSvd.md b/docs/zh_CN/CommunitySharings/NniPracticeSharing/RecommendersSvd.md new file mode 100644 index 0000000000..39666f7692 --- /dev/null +++ b/docs/zh_CN/CommunitySharings/NniPracticeSharing/RecommendersSvd.md @@ -0,0 +1,13 @@ +# 在 NNI 上自动调优 SVD + +本教程中,会首先介绍 GitHub 存储库:[Recommenders](https://github.com/Microsoft/Recommenders)。 它使用 Jupyter Notebook 提供了构建推荐系统的一些示例和实践技巧。 其中大量的模型被广泛的应用于推荐系统中。 为了提供完整的体验,每个示例都通过以下五个关键任务中展示: + +- [准备数据](https://github.com/Microsoft/Recommenders/blob/master/notebooks/01_prepare_data/README.md):为每个推荐算法准备并读取数据。 + - [模型](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/README.md):使用各种经典的以及深度学习推荐算法,如交替最小二乘法([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS))或极限深度分解机([xDeepFM](https://arxiv.org/abs/1803.05170))。 + - [评估](https://github.com/Microsoft/Recommenders/blob/master/notebooks/03_evaluate/README.md):使用离线指标来评估算法。 + - [模型选择和优化](https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/README.md):为推荐算法模型调优超参。 + - [运营](https://github.com/Microsoft/Recommenders/blob/master/notebooks/05_operationalize/README.md):在 Azure 的生产环境上运行模型。 + +在第四项调优模型超参的任务上,NNI 可以发挥作用。 在 NNI 上调优推荐模型的具体示例,采用了 [SVD](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb) 算法,以及数据集 Movielens100k。 此模型有超过 10 个超参需要调优。 + +由 Recommenders 提供的[ Jupyter notebook](https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) 中有非常详细的一步步的教程。 其中使用了不同的调优函数,包括 `Annealing`,`SMAC`,`Random Search`,`TPE`,`Hyperband`,`Metis` 以及 
`Evolution`。 最后比较了不同调优算法的结果。 请参考此 Notebook,来学习如何使用 NNI 调优 SVD 模型,并可以继续使用 NNI 来调优 Recommenders 中的其它模型。 \ No newline at end of file diff --git a/docs/zh_CN/CONTRIBUTING.md b/docs/zh_CN/Contributing.md similarity index 96% rename from docs/zh_CN/CONTRIBUTING.md rename to docs/zh_CN/Contributing.md index f4d622f0ad..5e9ea69550 100644 --- a/docs/zh_CN/CONTRIBUTING.md +++ b/docs/zh_CN/Contributing.md @@ -29,7 +29,7 @@ 拉取请求需要选好正确的标签,表明是 Bug 修复还是功能改进。 所有代码都需要遵循正确的命名约定和代码风格。 -参考[如何配置 NNI 的开发环境](./SetupNNIDeveloperEnvironment.md),来安装开发环境。 +参考[如何配置 NNI 的开发环境](./SetupNniDeveloperEnvironment.md),来安装开发环境。 与[快速入门](QuickStart.md)类似。 其它内容,参考[NNI 文档](http://nni.readthedocs.io)。 @@ -42,7 +42,7 @@ ## 代码风格和命名约定 * NNI 遵循 [PEP8](https://www.python.org/dev/peps/pep-0008/) 的 Python 代码命名约定。在提交拉取请求时,请尽量遵循此规范。 可通过`flake8`或`pylint`的提示工具来帮助遵循规范。 -* NNI 还遵循 [NumPy Docstring 风格](https://www.sphinx-doc.org/en/master/usage/extensions/example_numpy.html#example-numpy) 的 Python Docstring 命名方案。 Python API 使用了[sphinx.ext.napoleon](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html) 来[生成文档](CONTRIBUTING.md#documentation)。 +* NNI 还遵循 [NumPy Docstring 风格](https://www.sphinx-doc.org/en/master/usage/extensions/example_numpy.html#example-numpy) 的 Python Docstring 命名方案。 Python API 使用了[sphinx.ext.napoleon](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html) 来[生成文档](Contributing.md#documentation)。 ## 文档 diff --git a/docs/zh_CN/CurvefittingAssessor.md b/docs/zh_CN/CurvefittingAssessor.md new file mode 100644 index 0000000000..65a1d1ef9a --- /dev/null +++ b/docs/zh_CN/CurvefittingAssessor.md @@ -0,0 +1,72 @@ +# NNI 中的 Curve Fitting Assessor + +## 1. 
介绍 + +Curve Fitting Assessor 是一个 LPA (learning, predicting, assessing,即学习、预测、评估) 的算法。 如果预测的Trial X 在 step S 比性能最好的 Trial 要差,就会提前终止它。 + +此算法中,使用了 12 条曲线来拟合学习曲线,从[参考论文](http://aad.informatik.uni-freiburg.de/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf)中选择了大量的参数曲线模型。 学习曲线的形状与先验知识是一致的:都是典型的递增的、饱和的函数。 + +![](../img/curvefitting_learning_curve.PNG) + +所有学习曲线模型被合并到了单个,更强大的模型中。 合并的模型通过加权线性混合: + +![](../img/curvefitting_f_comb.gif) + +合并后的参数向量 + +![](../img/curvefitting_expression_xi.gif) + +假设增加一个高斯噪声,且噪声参数初始化为最大似然估计。 + +通过学习历史数据来确定新的组合参数向量的最大概率值。 用这样的方法来预测后面的 Trial 性能,并停止不好的 Trial 来节省计算资源。 + +具体来说,该算法有学习、预测和评估三个阶段。 + +* 步骤 1:学习。 从当前 Trial 的历史中学习,并从贝叶斯角度决定 \xi 。 首先,使用最小二乘法 (由 `fit_theta` 实现) 来节省时间。 获得参数后,过滤曲线并移除异常点(由 `filter_curve` 实现)。 最后,使用 MCMC 采样方法 (由 `mcmc_sampling` 实现) 来调整每个曲线的权重。 至此,确定了 \xi 中的所有参数。 + +* 步骤 2:预测。 用 \xi 和混合模型公式,在目标位置(例如 epoch 的总数)来计算期望的最终结果精度(由 `f_comb` 实现)。 + +* 步骤 3:如果拟合结果没有收敛,预测结果会是 `None`,并返回 `AssessResult.Good`,待下次有了更多精确信息后再次预测。 此外,会通过 `predict()` 函数获得正数。如果该值大于 __历史最好结果__ * `THRESHOLD`(默认为 0.95),则返回 `AssessResult.Good`,否则返回 `AssessResult.Bad`。 + +下图显示了此算法在 MNIST Trial 历史数据上结果。其中绿点表示 Assessor 获得的数据,蓝点表示将来,但未知的数据,红色线条是 Curve fitting Assessor 的预测曲线。 + +![](../img/curvefitting_example.PNG) + +## 2. 用法 + +要使用 Curve Fitting Assessor,需要在 Experiment 的 YAML 配置文件进行如下改动。 + + assessor: + builtinAssessorName: Curvefitting + classArgs: + # (必须) epoch 的总数。 + # 需要此数据来决定需要预测的点。 + epoch_num: 20 + # (可选) 选项: maximize, minimize + * optimize_mode 的默认值是 maximize + optimize_mode: maximize + # (可选) 为了节约计算资源,在收到了 start_step 个中间结果后,才开始预测。 + # start_step 的默认值是 6。 + start_step: 6 + # (可选) 决定是否提前终止的阈值。 + # 例如,如果 threshold = 0.95, optimize_mode = maximize,最好的历史结果是 0.9,那么会在 Trial 的预测值低于 0.95 * 0.9 = 0.855 时停止。 + * 阈值的默认值是 0.95。 + # 注意:如果选择了 minimize 模式,要让 threshold >= 1.0 (如 threshold=1.1) + threshold: 0.95 + # (可选) gap 是两次评估之间的间隔次数。 + # 例如:如果 gap = 2, start_step = 6,就会评估第 6, 8, 10, 12... 个中间结果。 + * gap 的默认值是 1。 + gap: 1 + + +## 3. 
文件结构 + +Assessor 有大量的文件、函数和类。 这里只简单介绍最重要的文件: + +* `curvefunctions.py` 包含了所有函数表达式和默认参数。 +* `modelfactory.py` 包括学习和预测部分,并实现了相应的计算部分。 +* `curvefitting_assessor.py` 是接收 Trial 历史数据并评估是否需要提前终止的 Assessor。 + +## 4. TODO + +* 进一步提高预测精度,并在更多模型上测试。 \ No newline at end of file diff --git a/docs/zh_CN/Customize_Advisor.md b/docs/zh_CN/CustomizeAdvisor.md similarity index 100% rename from docs/zh_CN/Customize_Advisor.md rename to docs/zh_CN/CustomizeAdvisor.md diff --git a/docs/zh_CN/Customize_Assessor.md b/docs/zh_CN/CustomizeAssessor.md similarity index 100% rename from docs/zh_CN/Customize_Assessor.md rename to docs/zh_CN/CustomizeAssessor.md diff --git a/docs/zh_CN/Customize_Tuner.md b/docs/zh_CN/CustomizeTuner.md similarity index 97% rename from docs/zh_CN/Customize_Tuner.md rename to docs/zh_CN/CustomizeTuner.md index 92d761f33b..54028e3aa4 100644 --- a/docs/zh_CN/Customize_Tuner.md +++ b/docs/zh_CN/CustomizeTuner.md @@ -109,4 +109,4 @@ tuner: ### 实现更高级的自动机器学习算法 -上述内容足够写出通用的 Tuner。 但有时可能需要更多的信息,例如,中间结果, Trial 的状态等等,从而能够实现更强大的自动机器学习算法。 因此,有另一个 `Advisor` 类,直接继承于 `MsgDispatcherBase`,它在 [`src/sdk/pynni/nni/msg_dispatcher_base.py`](https://github.com/Microsoft/nni/tree/master/src/sdk/pynni/nni/msg_dispatcher_base.py)。 参考[这里](Customize_Advisor.md)来了解如何实现自定义的 Advisor。 \ No newline at end of file +上述内容足够写出通用的 Tuner。 但有时可能需要更多的信息,例如,中间结果, Trial 的状态等等,从而能够实现更强大的自动机器学习算法。 因此,有另一个 `Advisor` 类,直接继承于 `MsgDispatcherBase`,它在 [`src/sdk/pynni/nni/msg_dispatcher_base.py`](https://github.com/Microsoft/nni/tree/master/src/sdk/pynni/nni/msg_dispatcher_base.py)。 参考[这里](CustomizeAdvisor.md)来了解如何实现自定义的 Advisor。 \ No newline at end of file diff --git a/docs/zh_CN/EvolutionTuner.md b/docs/zh_CN/EvolutionTuner.md new file mode 100644 index 0000000000..4b8ab7ef83 --- /dev/null +++ b/docs/zh_CN/EvolutionTuner.md @@ -0,0 +1,5 @@ +# Naive Evolution Tuner + +## Naive Evolution(进化算法) + +进化算法来自于 [Large-Scale Evolution of Image Classifiers](https://arxiv.org/pdf/1703.01041.pdf)。 
它会基于搜索空间随机生成一个种群。 在每一代中,会选择较好的结果,并对其下一代进行一些变异(例如,改动一个超参,增加或减少一层)。 进化算法需要很多次 Trial 才能有效,但它也非常简单,也很容易扩展新功能。 \ No newline at end of file diff --git a/docs/zh_CN/Examples.rst b/docs/zh_CN/Examples.rst deleted file mode 100644 index 0e4e0bccb2..0000000000 --- a/docs/zh_CN/Examples.rst +++ /dev/null @@ -1,12 +0,0 @@ -###################### -样例 -###################### - -.. toctree:: - :maxdepth: 2 - - MNIST - Cifar10 - Scikit-learn - EvolutionSQuAD - GBDT diff --git a/docs/zh_CN/ExperimentConfig.md b/docs/zh_CN/ExperimentConfig.md index da7fd92300..fea84cadf7 100644 --- a/docs/zh_CN/ExperimentConfig.md +++ b/docs/zh_CN/ExperimentConfig.md @@ -175,7 +175,7 @@ machineList: - **remote** 将任务提交到远程的 Ubuntu 上,必须用 **machineList** 来指定远程的 SSH 连接信息。 - - **pai** 提交任务到微软开源的 [OpenPAI](https://github.com/Microsoft/pai) 上。 更多 OpenPAI 配置,参考 [pai 模式](./PAIMode.md)。 + - **pai** 提交任务到微软开源的 [OpenPAI](https://github.com/Microsoft/pai) 上。 更多 OpenPAI 配置,参考 [pai 模式](./PaiMode.md)。 - **kubeflow** 提交任务至 [Kubeflow](https://www.kubeflow.org/docs/about/kubeflow/)。 NNI 支持基于 Kubeflow 的 Kubenetes,以及[Azure Kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/)。 diff --git a/docs/zh_CN/FAQ.md b/docs/zh_CN/FAQ.md index a796fdd1ec..c3c410bd43 100644 --- a/docs/zh_CN/FAQ.md +++ b/docs/zh_CN/FAQ.md @@ -37,9 +37,17 @@ nnictl 在执行时,使用 tmp 目录作为临时目录来复制 codeDir 下 将虚拟机的网络配置为桥接模式来让虚拟机能被网络访问,并确保虚拟机的防火墙没有禁止相关端口。 -### Windows 本机模式 +### 无法打开 Web 界面的链接 -参考 [NNI Windows 本机模式](WindowsLocalMode.md) +无法打开 Web 界面的链接可能有以下几个原因: + +* http://127.0.0.1,http://172.17.0.1 以及 http://10.0.0.15 都是 localhost。如果在服务器或远程计算机上启动 Experiment, 可将此 IP 替换为所连接的 IP 来查看 Web 界面,如 http://[远程连接的地址]:8080 +* 如果使用服务器 IP 后还是无法看到 Web 界面,可检查此服务器上是否有防火墙或需要代理。 或使用此运行 NNI Experiment 的服务器上的浏览器来查看 Web 界面。 +* 另一个可能的原因是 Experiment 启动失败了,NNI 无法读取 Experiment 的信息。 可在如下目录中查看 NNIManager 的日志: ~/nni/experiment/[your_experiment_id] /log/nnimanager.log + +### NNI 在 Windows 上的问题 + +参考 [Windows 上使用 NNI](NniOnWindows.md)。 ### 帮助改进 diff --git 
a/docs/zh_CN/FrameworkControllerMode.md b/docs/zh_CN/FrameworkControllerMode.md index cb815a3477..cc9775713a 100644 --- a/docs/zh_CN/FrameworkControllerMode.md +++ b/docs/zh_CN/FrameworkControllerMode.md @@ -106,4 +106,4 @@ frameworkcontroller 模式中的 Trial 配置使用以下主键: ## 版本校验 -从 0.6 开始,NNI 支持查看版本,详情参考[这里](PAIMode.md)。 \ No newline at end of file +从 0.6 开始,NNI 支持版本校验,详情参考[这里](PaiMode.md)。 \ No newline at end of file diff --git a/docs/zh_CN/gbdt_example.md b/docs/zh_CN/GbdtExample.md similarity index 100% rename from docs/zh_CN/gbdt_example.md rename to docs/zh_CN/GbdtExample.md diff --git a/docs/zh_CN/GridsearchTuner.md b/docs/zh_CN/GridsearchTuner.md new file mode 100644 index 0000000000..fd8fbd51c9 --- /dev/null +++ b/docs/zh_CN/GridsearchTuner.md @@ -0,0 +1,5 @@ +# Grid Search + +## Grid Search(遍历搜索) + +Grid Search 会穷举定义在搜索空间文件中的所有超参组合。 注意,搜索空间仅支持 `choice`, `quniform`, `qloguniform`。 `quniform` 和 `qloguniform` 中的 **数字 `q` 有不同的含义(与[搜索空间](SearchSpaceSpec.md)说明不同)。 这里的意义是在 `low` 和 `high` 之间均匀取值的数量。

\ No newline at end of file diff --git a/docs/zh_CN/HowToDebug.md b/docs/zh_CN/HowToDebug.md index f1dd099e1f..8bbaa824de 100644 --- a/docs/zh_CN/HowToDebug.md +++ b/docs/zh_CN/HowToDebug.md @@ -19,7 +19,7 @@ NNI 中有三种日志。 在创建 Experiment 时,可增加命令行参数 `- 在启动 NNI Experiment 时发生的错误,都可以在这里找到。 -通过 `nnictl log stderr` 命令来查看错误信息。 参考 [NNICTL](NNICTLDOC.md) 了解更多命令选项。 +通过 `nnictl log stderr` 命令来查看错误信息。 参考 [NNICTL](Nnictl.md) 了解更多命令选项。 ### Experiment 根目录 diff --git a/docs/zh_CN/HowToImplementTrainingService.md b/docs/zh_CN/HowToImplementTrainingService.md index 05be8f0b9c..e79c6a2246 100644 --- a/docs/zh_CN/HowToImplementTrainingService.md +++ b/docs/zh_CN/HowToImplementTrainingService.md @@ -8,7 +8,7 @@ TrainingService 是与平台管理、任务调度相关的模块。 TrainingServ ![](../img/NNIDesign.jpg) -NNI 的架构如图所示。 NNIManager 是系统的核心管理模块,负责调用 TrainingService 来管理 Trial,并负责不同模块之间的通信。 Dispatcher 是消息处理中心。 TrainingService 是管理任务的模块,它和 NNIManager 通信,并且根据平台的特点有不同的实现。 当前,NNI 支持本地平台、[远程平台](RemoteMachineMode.md)、[OpenPAI 平台](PAIMode.md)、[Kubeflow 平台](KubeflowMode.md)和[FrameworkController 平台](FrameworkController.md)。 +NNI 的架构如图所示。 NNIManager 是系统的核心管理模块,负责调用 TrainingService 来管理 Trial,并负责不同模块之间的通信。 Dispatcher 是消息处理中心。 TrainingService 是管理任务的模块,它和 NNIManager 通信,并且根据平台的特点有不同的实现。 当前,NNI 支持本地平台、[远程平台](RemoteMachineMode.md)、[OpenPAI 平台](PaiMode.md)、[Kubeflow 平台](KubeflowMode.md)和[FrameworkController 平台](FrameworkController.md)。 在这个文档中,会简要介绍 TrainingService 的设计。 如果要添加新的 TrainingService,只需要继承 TrainingServcie 类并实现相应的方法,不需要理解NNIManager、Dispatcher 等其它模块的细节。 ## 代码文件夹结构 @@ -151,4 +151,4 @@ NNI 提供了 TrialKeeper 工具,用来帮助维护 Trial 任务。 可以在 ## 参考 更多关于如何调试的信息,请[参考这里](HowToDebug.md)。 -关于如何贡献代码,请[参考这里](CONTRIBUTING)。 \ No newline at end of file +关于如何贡献代码,请[参考这里](Contributing.md)。 \ No newline at end of file diff --git a/docs/zh_CN/HyperbandAdvisor.md b/docs/zh_CN/HyperbandAdvisor.md new file mode 100644 index 0000000000..8fe96752b1 --- /dev/null +++ b/docs/zh_CN/HyperbandAdvisor.md @@ -0,0 +1,56 @@ +# NNI 中使用 Hyperband + +## 1. 
介绍 + +[Hyperband](https://arxiv.org/pdf/1603.06560.pdf) 是一种流行的自动机器学习算法。 Hyperband 的基本思想是对配置分组,每组有 `n` 个随机生成的超参配置,每个配置使用 `r` 次资源(如,epoch 数量,批处理数量等)。 当 `n` 个配置完成后,会选择最好的 `n/eta` 个配置,并增加 `r*eta` 次使用的资源。 最后,会选择出的最好配置。 + +## 2. 实现并行 + +首先,此样例是基于 MsgDispatcherBase 来实现的自动机器学习算法,而不是基于 Tuner 和Assessor。 这种实现方法下,Hyperband 集成了 Tuner 和 Assessor 两者的功能,因而将它叫做 Advisor。 + +其次,本实现完全利用了 Hyperband 内部的并行性。 具体来说,下一个分组不会严格的在当前分组结束后再运行,只要有资源,就可以开始运行新的分组。 + +## 3. 用法 + +要使用 Hyperband,需要在 Experiment 的 YAML 配置文件进行如下改动。 + + advisor: + #可选项: Hyperband + builtinAdvisorName: Hyperband + classArgs: + #R: 最大的步骤 + R: 100 + #eta: 丢弃的 Trial 的比例 + eta: 3 + #可选项: maximize, minimize + optimize_mode: maximize + + +注意,一旦使用了 Advisor,就不能在配置文件中添加 Tuner 和 Assessor。 使用 Hyperband 时,Trial 代码收到的超参(如键值对)中,除了用户定义的超参,会多一个 `TRIAL_BUDGET`。 **使用 `TRIAL_BUDGET`,Trial 能够控制其运行的时间。

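为了更直观地理解 `R` 和 `eta` 如何决定每个分组(bracket)中各轮的配置数 `n` 与资源 `r`,下面给出一个推算调度表的示意脚本。注意:这只是按上文"留下最好的 `n/eta` 个配置、资源增加 `eta` 倍"的规则写的草稿,其中初始 `n` 的计算公式是根据下文 `R=81`、`eta=3` 的表格反推出的假设,并非 NNI 的实际实现:

```python
def hyperband_schedule(R, eta):
    """推算 Hyperband 每个分组 (bracket) 中各轮的 (n, r)。
    仅为示意脚本,并非 NNI 的实际实现。"""
    # s_max 为最大的分组编号,满足 eta^s_max <= R
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1
    schedule = {}
    for s in range(s_max, -1, -1):
        # 每组初始的配置数 n 与单个配置的资源 r(公式由文中表格反推,属于假设)
        n = (s_max + 1) // (s + 1) * eta ** s
        r = R // eta ** s
        rounds = []
        for i in range(s + 1):
            rounds.append((n, r))
            # 留下最好的 n/eta 个配置,资源增加 eta 倍
            n //= eta
            r *= eta
        schedule[s] = rounds
    return schedule

if __name__ == "__main__":
    for s, rounds in sorted(hyperband_schedule(81, 3).items(), reverse=True):
        print("s=%d:" % s, rounds)
```

对 `R=81`、`eta=3`,该脚本推算出的各分组 (n, r) 序列与下文的表格一致,例如分组 s=1 为 (6, 27)、(2, 81)。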
+ +对于 Trial 代码中 `report_intermediate_result(metric)` 和 `report_final_result(metric)` 的 **`指标`,应该是数值,或者是一个 dict,并保证其中有键名为 default 的项目,其值也为数值型**。 这是需要进行最大化或者最小化优化的数值,如精度或者损失值。 + +`R` 和 `eta` 是 Hyperband 中可以改动的参数。 `R` 表示可以分配给 Trial 的最大资源。 这里,资源可以代表 epoch 或批处理数量。 `TRIAL_BUDGET` 应该被 Trial 代码用来控制运行的时长。 参考样例 `examples/trials/mnist-advisor/` ,了解详细信息。 + +`eta` 表示 `n` 个配置中的 `n/eta` 个配置会留存下来,并用更多的资源来运行。 + +下面是 `R=81` 且 `eta=3` 时的样例: + +| | s=4 | s=3 | s=2 | s=1 | s=0 | +| - | ---- | ---- | ---- | ---- | ---- | +| i | n r | n r | n r | n r | n r | +| 0 | 81 1 | 27 3 | 9 9 | 6 27 | 5 81 | +| 1 | 27 3 | 9 9 | 3 27 | 2 81 | | +| 2 | 9 9 | 3 27 | 1 81 | | | +| 3 | 3 27 | 1 81 | | | | +| 4 | 1 81 | | | | | + +`s` 表示分组,`n` 表示生成的配置数量,相应的 `r` 表示配置使用多少资源来运行。 `i` 表示轮数,如分组 4 有 5 轮,分组 3 有 4 轮。 + +关于如何实现 Trial 代码,参考 `examples/trials/mnist-hyperband/` 中的说明。 + +## 4. 待改进 + +当前实现的 Hyperband 算法可以通过支持更好的提前终止算法来进一步改进,原因是最好的 `n/eta` 个配置并不一定都表现很好。 不好的配置可以更早地终止。 + +在当前实现中,遵循了[此论文](https://arxiv.org/pdf/1603.06560.pdf)的设计,配置都是随机生成的。 要进一步提升,配置生成过程可以利用更高级的算法。 \ No newline at end of file diff --git a/docs/zh_CN/HyperoptTuner.md b/docs/zh_CN/HyperoptTuner.md new file mode 100644 index 0000000000..8eec53e3b7 --- /dev/null +++ b/docs/zh_CN/HyperoptTuner.md @@ -0,0 +1,13 @@ +# TPE, Random Search, Anneal Tuners + +## TPE + +Tree-structured Parzen Estimator (TPE) 是一种 sequential model-based optimization(SMBO,即序列化的基于模型的优化)方法。 SMBO 方法根据历史指标数据按顺序构造模型来估算超参的性能,随后基于此模型来选择新的超参。 TPE 方法对 P(x|y) 和 P(y) 建模,其中 x 表示超参,y 表示相关的评估指标。 P(x|y) 通过变换超参的生成过程来建模,用非参数密度(non-parametric densities)代替配置的先验分布。 细节可参考 [Algorithms for Hyper-Parameter Optimization](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)。 + +## Random Search(随机搜索) + +[Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) 中展示了随机搜索出人意料的简单和有效。 建议当不清楚超参的先验分布时,采用随机搜索作为基准。 + +## Anneal(退火算法) + +这种简单的退火算法从先前的采样开始,会越来越靠近已发现的最佳点取样。 此算法是随机搜索的简单变体,利用了响应面的平滑性。 退火率不是自适应的。 \ No newline at end of
file diff --git a/docs/zh_CN/Installation.md b/docs/zh_CN/Installation.md index da14cd2ef3..20a01deb96 100644 --- a/docs/zh_CN/Installation.md +++ b/docs/zh_CN/Installation.md @@ -1,6 +1,6 @@ # 安装 NNI -当前支持在 Linux,Mac 和 Windows(本机模式)下安装。 +当前支持在 Linux,Mac 和 Windows(本机,远程和 OpenPAI 模式)下安装。 ## **在 Linux 和 Mac 下安装** @@ -28,10 +28,12 @@ ## **在 Windows 上安装** -在第一次使用 PowerShell 运行脚本时,需要用**使用管理员权限**运行如下命令: +在第一次使用 PowerShell 运行脚本时,需要用**使用管理员权限**运行如下命令: bash - Set-ExecutionPolicy -ExecutionPolicy Unrestricted 强烈推荐使用 Anaconda。 + Set-ExecutionPolicy -ExecutionPolicy Unrestricted + +推荐使用 Anaconda 或 Miniconda。 * **通过 pip 命令安装 NNI** @@ -43,8 +45,9 @@ * **通过源代码安装 NNI** - 先决条件:`python >=3.5`, `git`, `powershell` - 可使用管理员或当前用户权限运行下列命令: + 先决条件:`python >=3.5`, `git`, `PowerShell` + + 然后可以使用管理员或当前用户安装 NNI: ```bash git clone -b v0.7 https://github.com/Microsoft/nni.git @@ -93,12 +96,12 @@ ## 更多 * [概述](Overview.md) -* [使用命令行工具 nnictl](NNICTLDOC.md) +* [使用命令行工具 nnictl](Nnictl.md) * [使用 NNIBoard](WebUI.md) * [定制搜索空间](SearchSpaceSpec.md) * [配置 Experiment](ExperimentConfig.md) * [如何在本机运行 Experiment (支持多 GPU 卡)?](LocalMode.md) * [如何在多机上运行 Experiment?](RemoteMachineMode.md) -* [如何在 OpenPAI 上运行 Experiment?](PAIMode.md) +* [如何在 OpenPAI 上运行 Experiment?](PaiMode.md) * [如何通过 Kubeflow 在 Kubernetes 上运行 Experiment?](KubeflowMode.md) * [如何通过 FrameworkController 在 Kubernetes 上运行 Experiment?](FrameworkControllerMode.md) \ No newline at end of file diff --git a/docs/zh_CN/KubeflowMode.md b/docs/zh_CN/KubeflowMode.md index 68467e8f94..dea7a859f2 100644 --- a/docs/zh_CN/KubeflowMode.md +++ b/docs/zh_CN/KubeflowMode.md @@ -204,6 +204,6 @@ Kubeflow 模式的配置有下列主键: ## 版本校验 -从 0.6 开始,NNI 支持版本校验,详情参考[这里](PAIMode.md)。 +从 0.6 开始,NNI 支持版本校验,详情参考[这里](PaiMode.md)。 如果在使用 Kubeflow 模式时遇到任何问题,请到 [NNI Github](https://github.com/Microsoft/nni) 中创建问题。 \ No newline at end of file diff --git a/docs/zh_CN/LocalMode.md b/docs/zh_CN/LocalMode.md index a635046b54..2fe2160f04 100644 --- a/docs/zh_CN/LocalMode.md +++ 
b/docs/zh_CN/LocalMode.md @@ -87,7 +87,7 @@ 上面的命令会写在 YAML 文件中。 参考[这里](Trials.md)来写出自己的 Experiment 代码。 -**准备 Tuner**: NNI 支持多种流行的自动机器学习算法,包括:Random Search(随机搜索),Tree of Parzen Estimators (TPE),Evolution(进化算法)等等。 也可以实现自己的 Tuner(参考[这里](Customize_Tuner.md))。下面使用了 NNI 内置的 Tuner: +**准备 Tuner**: NNI 支持多种流行的自动机器学习算法,包括:Random Search(随机搜索),Tree of Parzen Estimators (TPE),Evolution(进化算法)等等。 也可以实现自己的 Tuner(参考[这里](CustomizeTuner.md))。下面使用了 NNI 内置的 Tuner: tuner: builtinTunerName: TPE @@ -95,7 +95,7 @@ optimize_mode: maximize -*builtinTunerName* 用来指定 NNI 中的 Tuner,*classArgs* 是传入到 Tuner的参数(内置 Tuner 在[这里](Builtin_Tuner.md)),*optimization_mode* 表明需要最大化还是最小化 Trial 的结果。 +*builtinTunerName* 用来指定 NNI 中的 Tuner,*classArgs* 是传入到 Tuner的参数(内置 Tuner 在[这里](BuiltinTuner.md)),*optimization_mode* 表明需要最大化还是最小化 Trial 的结果。 **准备配置文件**:实现 Trial 的代码,并选择或实现自定义的 Tuner 后,就要准备 YAML 配置文件了。 NNI 为每个 Trial 样例都提供了演示的配置文件,用命令`cat ~/nni/examples/trials/mnist-annotation/config.yml` 来查看其内容。 大致内容如下: @@ -133,7 +133,7 @@ nnictl create --config ~/nni/examples/trials/mnist-annotation/config.yml -参考[这里](NNICTLDOC.md)来了解 *nnictl* 命令行工具的更多用法。 +参考[这里](Nnictl.md)来了解 *nnictl* 命令行工具的更多用法。 ## 查看 Experiment 结果 diff --git a/docs/zh_CN/MedianstopAssessor.md b/docs/zh_CN/MedianstopAssessor.md new file mode 100644 index 0000000000..86f6f3b48b --- /dev/null +++ b/docs/zh_CN/MedianstopAssessor.md @@ -0,0 +1,5 @@ +# Medianstop Assessor + +## Median Stop + +Medianstop 是一种简单的提前终止 Trial 的策略,可参考[论文](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf)。 如果 Trial X 的在步骤 S 的最好目标值比所有已完成 Trial 的步骤 S 的中位数值明显要低,这个 Trial 就会被提前停止。 \ No newline at end of file diff --git a/docs/zh_CN/MetisTuner.md b/docs/zh_CN/MetisTuner.md new file mode 100644 index 0000000000..153f2d8e4a --- /dev/null +++ b/docs/zh_CN/MetisTuner.md @@ -0,0 +1,19 @@ +# Metis Tuner + +## Metis Tuner + +大多数调参工具仅仅预测最优配置,而 [Metis](https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/) 
的优势在于有两个输出:(a) 最优配置的当前预测结果, 以及 (b) 下一次 Trial 的建议。 不再需要随机猜测! + +大多数工具假设训练集没有噪声数据,但 Metis 会知道是否需要对某个超参重新采样。 + +大多数工具都有着重于在已有结果上继续发展的问题,而 Metis 的搜索策略可以在探索,发展和重新采样(可选)中进行平衡。 + +Metis 属于基于序列的贝叶斯优化 (SMBO) 的类别,它也基于贝叶斯优化框架。 为了对超参-性能空间建模,Metis 同时使用了高斯过程(Gaussian Process)和高斯混合模型(GMM)。 由于每次 Trial 都可能有很高的时间成本,Metis 大量使用了已有模型来进行推理计算。 在每次迭代中,Metis 执行两个任务: + +在高斯过程空间中找到全局最优点。 这一点表示了最佳配置。 + +它会标识出下一个超参的候选项。 这是通过对隐含信息的探索、挖掘和重采样来实现的。 + +注意,搜索空间仅支持 `choice`, `quniform`, `uniform` 和 `randint`。 + +更多详情,参考论文:https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/ \ No newline at end of file diff --git a/docs/zh_CN/mnist_examples.md b/docs/zh_CN/MnistExamples.md similarity index 84% rename from docs/zh_CN/mnist_examples.md rename to docs/zh_CN/MnistExamples.md index 04a5546362..c1402f0edc 100644 --- a/docs/zh_CN/mnist_examples.md +++ b/docs/zh_CN/MnistExamples.md @@ -5,7 +5,7 @@ - [MNIST 中使用 NNI API](#mnist) - [MNIST 中使用 NNI 标记(annotation)](#mnist-annotation) - [在 Keras 中使用 MNIST](#mnist-keras) -- [MNIST -- 用批处理调参器来调优](#mnist-batch) +- [MNIST -- 用批处理 Tuner 来调优](#mnist-batch) - [MNIST -- 用 hyperband 调优](#mnist-hyperband) - [MNIST -- 用嵌套搜索空间调优](#mnist-nested) - [用 Kubeflow 运行分布式的 MNIST (tensorflow)](#mnist-kubeflow-tf) @@ -14,7 +14,7 @@ **MNIST 中使用 NNI API** -这是个简单的卷积网络,有两个卷积层,两个池化层和一个全连接层。 调优的超参包括 dropout 比率,卷积层大小,隐藏层(全连接层)大小等等。 它能用 NNI 中大部分内置的调参器来调优,如 TPE,SMAC,Random。 样例的 YAML 文件也启用了评估器来提前终止一些中间结果不好的尝试。 +这是个简单的卷积网络,有两个卷积层,两个池化层和一个全连接层。 调优的超参包括 dropout 比率,卷积层大小,隐藏层(全连接层)大小等等。 它能用 NNI 中大部分内置的 Tuner 来调优,如 TPE,SMAC,Random。 样例的 YAML 文件也启用了评估器来提前终止一些中间结果不好的尝试。 `代码目录: examples/trials/mnist/` @@ -33,9 +33,9 @@ `代码目录: examples/trials/mnist-keras/` -**MNIST -- 用批处理调参器来调优** +**MNIST -- 用批处理 Tuner 来调优** -此样例演示了如何使用批处理调参器。 只需要在搜索空间文件中列出所有要尝试的配置, NNI 会逐个尝试。 +此样例演示了如何使用批处理 Tuner。 只需要在搜索空间文件中列出所有要尝试的配置, NNI 会逐个尝试。 `代码目录: examples/trials/mnist-batch-tune-keras/` @@ -51,7 +51,7 @@ 此样例演示了 NNI 如何支持嵌套的搜索空间。 搜索空间文件示了如何定义嵌套的搜索空间。 -`代码目录: 
examples/trials/mnist-cascading-search-space/` +`代码目录: examples/trials/mnist-nested-search-space/` **用 Kubeflow 运行分布式的 MNIST (tensorflow)** diff --git a/docs/zh_CN/MultiPhase.md b/docs/zh_CN/MultiPhase.md new file mode 100644 index 0000000000..5ea2cba300 --- /dev/null +++ b/docs/zh_CN/MultiPhase.md @@ -0,0 +1,46 @@ +## 多阶段 Experiment + +通常,每个 Trial 任务只需要从 Tuner 获取一个配置(超参等),然后使用这个配置执行并报告结果,然后退出。 但有时,一个 Trial 任务可能需要从 Tuner 请求多次配置。 这是一个非常有用的功能。 例如: + +1. 在一些训练平台上,需要数十秒来启动一个任务。 如果一个配置只需要一分钟就能完成,那么每个 Trial 任务中只运行一个配置就会非常低效。 这种情况下,可以在同一个 Trial 任务中,完成一个配置后,再请求并完成另一个配置。 极端情况下,一个 Trial 任务可以运行无数个配置。 如果设置了并发(例如设为 6),那么就会有 6 个**长时间**运行的任务来不断尝试不同的配置。 + +2. 有些类型的模型需要进行多阶段的训练,而下一个阶段的配置依赖于前一个阶段的结果。 例如,为了找到模型最好的量化结果,训练过程通常为:自动量化算法(例如 NNI 中的 TunerJ)选择一个位宽(如 16 位), Trial 任务获得此配置,并训练数个 epoch,并返回结果(例如精度)。 算法收到结果后,决定是将 16 位改为 8 位,还是 32 位。 此过程会重复多次。 + +上述情况都可以通过多阶段执行的功能来支持。 为了支持这些情况,一个 Trial 任务需要能从 Tuner 请求多个配置。 Tuner 需要知道两次配置请求是否来自同一个 Trial 任务。 同时,多阶段中的 Trial 任务需要多次返回最终结果。 + +注意, `nni.get_next_parameter()` 和 `nni.report_final_result()` 需要被依次调用:**先调用前者,然后调用后者,并按此顺序重复调用**。 如果 `nni.get_next_parameter()` 被连续多次调用,然后再调用 `nni.report_final_result()`,这会造成最终结果只会与 get_next_parameter 所返回的最后一个配置相关联。 因此,前面的 get_next_parameter 调用都没有关联的结果,这可能会造成一些多阶段算法出问题。 + +## 创建多阶段的 Experiment + +### 编写使用多阶段的 Trial 代码: + +**1. 更新 Trial 代码** + +Trial 代码中使用多阶段非常容易,样例如下: + + ```python + # ... + for i in range(5): + # 从 Tuner 中获得参数 + tuner_param = nni.get_next_parameter() + + # 使用参数 + # ... + # 为上面获取的参数返回最终结果 + nni.report_final_result() + # ... + # ... + ``` + + +**2. 
修改 Experiment 配置** + +要启用多阶段,需要在 Experiment 的 YAML 配置文件中增加 `multiPhase: true`。 如果不添加此参数,`nni.get_next_parameter()` 会一直返回同样的配置。 对于所有内置的 Tuner 和 Advisor,不需要修改任何代码,就直接支持多阶段请求配置。 + +### 编写使用多阶段的 Tuner: + +强烈建议首先阅读[自定义 Tuner](https://nni.readthedocs.io/en/latest/Customize_Tuner.html),再开始编写多阶段 Tuner。 与普通 Tuner 不同的是,必须继承于 `MultiPhaseTuner`(在 nni.multi_phase_tuner 中)。 `Tuner` 与 `MultiPhaseTuner` 之间最大的不同是,MultiPhaseTuner 多了一些信息,即 `trial_job_id`。 有了这个信息, Tuner 能够知道哪个 Trial 在请求配置信息, 返回的结果是哪个 Trial 的。 通过此信息,Tuner 能够灵活的为不同的 Trial 及其阶段实现功能。 例如,可在 generate_parameters 方法中使用 trial_job_id 来为特定的 Trial 任务生成超参。 + +当然,要使用自定义的多阶段 Tuner ,也需要**在 Experiment 的 YAML 配置文件中增加`multiPhase: true`**。 + +[ENAS Tuner](https://github.com/countif/enas_nni/blob/master/nni/examples/tuners/enas/nni_controller_ptb.py) 是多阶段 Tuner 的样例。 \ No newline at end of file diff --git a/docs/zh_CN/NetworkmorphismTuner.md b/docs/zh_CN/NetworkmorphismTuner.md new file mode 100644 index 0000000000..5a167e4575 --- /dev/null +++ b/docs/zh_CN/NetworkmorphismTuner.md @@ -0,0 +1,245 @@ +# Network Morphism Tuner + +## 1. 介绍 + +[Autokeras](https://arxiv.org/abs/1806.10282) 是使用 Network Morphism 算法的流行的自动机器学习工具。 Autokeras 的基本理念是使用贝叶斯回归来预测神经网络架构的指标。 每次都会从父网络生成几个子网络。 然后使用朴素贝叶斯回归,从网络的历史训练结果来预测它的指标值。 接下来,会选择预测结果最好的子网络加入训练队列中。 在[此代码](https://github.com/jhfjhfj1/autokeras)的启发下,我们在 NNI 中实现了 Network Morphism 算法。 + +要了解 Network Morphism Trial 的用法,参考 [Readme_zh_CN.md](https://github.com/Microsoft/nni/blob/master/examples/trials/network_morphism/README_zh_CN.md),了解更多细节。 + +## 2. 
用法 + +要使用 Network Morphism,需要如下配置 `config.yml` 文件: + +```yaml +tuner: + #选择: NetworkMorphism + builtinTunerName: NetworkMorphism + classArgs: + #可选项: maximize, minimize + optimize_mode: maximize + #当前仅支持 cv 领域 + task: cv + #修改来支持实际图像宽度 + input_width: 32 + #修改来支持实际图像通道 + input_channel: 3 + #修改来支持实际的分类数量 + n_output_node: 10 +``` + +在训练过程中,会生成一个 JSON 文件来表示网络图。 可调用 `json_to_graph()` 函数来将 JSON 文件转化为 PyTorch 或 Keras 模型。 + +```python +import nni +from nni.networkmorphism_tuner.graph import json_to_graph + +def build_graph_from_json(ir_model_json): + """从 JSON 生成 PyTorch 模型 + """ + graph = json_to_graph(ir_model_json) + model = graph.produce_torch_model() + return model + +# 从网络形态 Tuner 中获得下一组参数 +RCV_CONFIG = nni.get_next_parameter() +# 调用函数来生成 PyTorch 或 Keras 模型 +net = build_graph_from_json(RCV_CONFIG) + +# 训练过程 +# .... + +# 将最终精度返回给 NNI +nni.report_final_result(best_acc) +``` + +如果需要保存并**读取最佳模型**,推荐采用以下方法。 + +```python +# 1. 使用 NNI API +## 从 Web 界面获取最佳模型的 ID +## 或查看 `nni/experiments/experiment_id/log/model_path/best_model.txt` 文件 + +## 从 JSON 文件中读取,并使用 NNI API 来加载 +with open("best-model.json") as json_file: + json_of_model = json_file.read() +model = build_graph_from_json(json_of_model) + +# 2.
使用框架的 API (与具体框架相关) +## 2.1 Keras API + +## 在 Trial 代码中使用 Keras API 保存 +## 最好保存 NNI 的 ID +model_id = nni.get_sequence_id() +## 将模型序列化为 JSON +model_json = model.to_json() +with open("model-{}.json".format(model_id), "w") as json_file: + json_file.write(model_json) +## 将权重序列化至 HDF5 +model.save_weights("model-{}.h5".format(model_id)) + +## 重用模型时,使用 Keras API 读取 +## 读取 JSON 文件,并创建模型 +model_id = "" # 需要重用的模型 ID +with open('model-{}.json'.format(model_id), 'r') as json_file: + loaded_model_json = json_file.read() +loaded_model = model_from_json(loaded_model_json) +## 将权重加载到新模型中 +loaded_model.load_weights("model-{}.h5".format(model_id)) + +## 2.2 PyTorch API + +## 在 Trial 代码中使用 PyTorch API 保存 +model_id = nni.get_sequence_id() +torch.save(model, "model-{}.pt".format(model_id)) + +## 重用模型时,使用 PyTorch API 读取 +model_id = "" # 需要重用的模型 ID +loaded_model = torch.load("model-{}.pt".format(model_id)) + +``` + +## 3. 文件结构 + +Tuner 有大量的文件、函数和类。 这里只简单介绍最重要的文件: + +- `networkmorphism_tuner.py` 是使用 network morphism 算法的 Tuner。 + +- `bayesian.py` 是基于已搜索到的模型来预测未知模型指标的贝叶斯算法。 + +- `graph.py` 是元图数据结构。 类 Graph 表示了模型的神经网络图。 + - Graph 从模型中抽取神经网络。 + - 图中的每个节点都是层之间的中间张量。 + - 在图中,边表示层。 + - 注意,多条边可能会表示同一层。 + +- `graph_transformer.py` 包含了一些图转换,包括变宽,变深,或在图中增加跳跃连接。 + +- `layers.py` 包括模型中用到的所有层。 + +- `layer_transformer.py` 包含了一些层转换,包括变宽,变深,或在层中增加跳跃连接。 +- `nn.py` 包含生成初始网络的类。 +- `metric.py` 包括了一些指标类,如 Accuracy 和 MSE。 +- `utils.py` 是使用 Keras 在数据集 `cifar10` 上搜索神经网络的样例。 + +## 4.
网络表示的 JSON 样例 + +这是定义的中间表示 JSON 样例,在架构搜索过程中会从 Tuner 传到 Trial。 可调用 "json\_to\_graph()" 函数来将 JSON 文件转化为 Pytoch 或 Keras 模型。 样例如下。 + +```json +{ + "input_shape": [32, 32, 3], + "weighted": false, + "operation_history": [], + "layer_id_to_input_node_ids": {"0": [0],"1": [1],"2": [2],"3": [3],"4": [4],"5": [5],"6": [6],"7": [7],"8": [8],"9": [9],"10": [10],"11": [11],"12": [12],"13": [13],"14": [14],"15": [15],"16": [16] + }, + "layer_id_to_output_node_ids": {"0": [1],"1": [2],"2": [3],"3": [4],"4": [5],"5": [6],"6": [7],"7": [8],"8": [9],"9": [10],"10": [11],"11": [12],"12": [13],"13": [14],"14": [15],"15": [16],"16": [17] + }, + "adj_list": { + "0": [[1, 0]], + "1": [[2, 1]], + "2": [[3, 2]], + "3": [[4, 3]], + "4": [[5, 4]], + "5": [[6, 5]], + "6": [[7, 6]], + "7": [[8, 7]], + "8": [[9, 8]], + "9": [[10, 9]], + "10": [[11, 10]], + "11": [[12, 11]], + "12": [[13, 12]], + "13": [[14, 13]], + "14": [[15, 14]], + "15": [[16, 15]], + "16": [[17, 16]], + "17": [] + }, + "reverse_adj_list": { + "0": [], + "1": [[0, 0]], + "2": [[1, 1]], + "3": [[2, 2]], + "4": [[3, 3]], + "5": [[4, 4]], + "6": [[5, 5]], + "7": [[6, 6]], + "8": [[7, 7]], + "9": [[8, 8]], + "10": [[9, 9]], + "11": [[10, 10]], + "12": [[11, 11]], + "13": [[12, 12]], + "14": [[13, 13]], + "15": [[14, 14]], + "16": [[15, 15]], + "17": [[16, 16]] + }, + "node_list": [ + [0, [32, 32, 3]], + [1, [32, 32, 3]], + [2, [32, 32, 64]], + [3, [32, 32, 64]], + [4, [16, 16, 64]], + [5, [16, 16, 64]], + [6, [16, 16, 64]], + [7, [16, 16, 64]], + [8, [8, 8, 64]], + [9, [8, 8, 64]], + [10, [8, 8, 64]], + [11, [8, 8, 64]], + [12, [4, 4, 64]], + [13, [64]], + [14, [64]], + [15, [64]], + [16, [64]], + [17, [10]] + ], + "layer_list": [ + [0, ["StubReLU", 0, 1]], + [1, ["StubConv2d", 1, 2, 3, 64, 3]], + [2, ["StubBatchNormalization2d", 2, 3, 64]], + [3, ["StubPooling2d", 3, 4, 2, 2, 0]], + [4, ["StubReLU", 4, 5]], + [5, ["StubConv2d", 5, 6, 64, 64, 3]], + [6, ["StubBatchNormalization2d", 6, 7, 64]], + [7, ["StubPooling2d", 7, 8, 2, 
2, 0]], + [8, ["StubReLU", 8, 9]], + [9, ["StubConv2d", 9, 10, 64, 64, 3]], + [10, ["StubBatchNormalization2d", 10, 11, 64]], + [11, ["StubPooling2d", 11, 12, 2, 2, 0]], + [12, ["StubGlobalPooling2d", 12, 13]], + [13, ["StubDropout2d", 13, 14, 0.25]], + [14, ["StubDense", 14, 15, 64, 64]], + [15, ["StubReLU", 15, 16]], + [16, ["StubDense", 16, 17, 64, 10]] + ] + } +``` + +每个模型的定义都是一个 JSON 对象 (也可以认为模型是一个 [有向无环图](https://en.wikipedia.org/wiki/Directed_acyclic_graph)): + +- `input_shape` 是整数的列表,不包括批量维度。 +- `weighted` 表示是否权重和偏移值应该包含在此神经网络图中。 +- `operation_history` 是保存了所有网络形态操作的列表。 +- `layer_id_to_input_node_ids` 是字典实例,将层的标识映射到输入节点标识。 +- `layer_id_to_output_node_ids` 是字典实例,将层的标识映射到输出节点标识。 +- `adj_list` 是二维列表。 是图的邻接列表。 第一维是张量标识。 在每条边的列表中,元素是两元组(张量标识,层标识)。 +- `reverse_adj_list` 是与 adj_list 格式一样的反向邻接列表。 +- `node_list` 是一个整数列表。 列表的索引是标识。 +- `layer_list` 是层的列表。 列表的索引是标识。 + + - 对于 `StubConv (StubConv1d, StubConv2d, StubConv3d)`,后面的数字表示节点的输入 id(或 id 列表),节点输出 id,input_channel,filters,kernel_size,stride 和 padding。 + + - 对于 `StubDense`,后面的数字表示节点的输入 id (或 id 列表),节点输出 id,input_units 和 units。 + + - 对于 `StubBatchNormalization (StubBatchNormalization1d, StubBatchNormalization2d, StubBatchNormalization3d)`,后面的数字表示节点输入 id(或 id 列表),节点输出 id,和特征数量。 + + - 对于 `StubDropout(StubDropout1d, StubDropout2d, StubDropout3d)`,后面的数字表示节点的输入 id (或 id 列表),节点的输出 id 和 dropout 率。 + + - 对于 `StubPooling (StubPooling1d, StubPooling2d, StubPooling3d)`后面的数字表示节点的输入 id(或 id 列表),节点输出 id,kernel_size, stride 和 padding。 + + - 对于其它层,后面的数字表示节点的输入 id(或 id 列表)以及节点的输出 id。 + +## 5. 
TODO + +下一步,会将 API 从固定的网络生成方法改为更多的网络操作生成方法。 此外,还会使用 ONNX 格式来替代 JSON 作为中间表示结果。 \ No newline at end of file diff --git a/docs/zh_CN/WindowsLocalMode.md b/docs/zh_CN/NniOnWindows.md similarity index 81% rename from docs/zh_CN/WindowsLocalMode.md rename to docs/zh_CN/NniOnWindows.md index 900d9a2020..683ae71691 100644 --- a/docs/zh_CN/WindowsLocalMode.md +++ b/docs/zh_CN/NniOnWindows.md @@ -1,10 +1,10 @@ -# Windows 本地模式(测试中) +# Windows 上的 NNI(实验阶段的功能) -当前 Windows 下仅支持本机模式。 推荐 Windows 10 的 1809 版,其经过了测试。 +当前 Windows 上支持本机、远程和 OpenPAI 模式。 推荐 Windows 10 的 1809 版,其经过了测试。 ## **在 Windows 上安装** -**强烈推荐使用 Anaconda python(64 位)。** +**强烈推荐使用 Anaconda 或 Miniconda Python(64位)。** 在第一次使用 PowerShell 运行脚本时,需要用**使用管理员权限**运行如下命令: @@ -22,18 +22,18 @@ Set-ExecutionPolicy -ExecutionPolicy Unrestricted * __通过代码安装 NNI__ - 先决条件: `python >=3.5`, `git`, `powershell` + 先决条件: `python >=3.5`, `git`, `PowerShell` ```bash - git clone -b v0.7 https://github.com/Microsoft/nni.git + git clone -b v0.8 https://github.com/Microsoft/nni.git cd nni - powershell ./install.ps1 + powershell -file install.ps1 ``` 运行完以上脚本后,从命令行使用 **config_windows.yml** 来启动 Experiment,完成安装验证。 ```bash -nnictl create --config nni/examples/trials/mnist/config_windows.yml +nnictl create --config nni\examples\trials\mnist\config_windows.yml ``` 同样,其它示例的 YAML 配置中也需将 Trial 命令的 `python3` 替换为 `python`。 @@ -58,7 +58,7 @@ Set-ExecutionPolicy -ExecutionPolicy Unrestricted ### 在命令行或 PowerShell 中,Trial 因为缺少 DLL 而失败 -此错误因为缺少 LIBIFCOREMD.DLL 和 LIBMMD.DLL 文件,且 SciPy 安装失败。 使用 Anaconda Python(64-bit) 可解决此问题。 +此错误因为缺少 LIBIFCOREMD.DLL 和 LIBMMD.DLL 文件,且 SciPy 安装失败。 使用 Anaconda 或 Miniconda 和 Python(64位)可解决。 > ImportError: DLL load failed diff --git a/docs/zh_CN/NNICTLDOC.md b/docs/zh_CN/Nnictl.md similarity index 99% rename from docs/zh_CN/NNICTLDOC.md rename to docs/zh_CN/Nnictl.md index 155daf5f66..5427b8c2ce 100644 --- a/docs/zh_CN/NNICTLDOC.md +++ b/docs/zh_CN/Nnictl.md @@ -461,7 +461,7 @@ nnictl 支持的命令: > 将数据导入运行中的 Experiment ```bash 
- nnictl experiment [experiment_id] -f experiment_data.json + nnictl experiment import [experiment_id] -f experiment_data.json ``` diff --git a/docs/zh_CN/Overview.md b/docs/zh_CN/Overview.md index b01ab843c2..cad8b33676 100644 --- a/docs/zh_CN/Overview.md +++ b/docs/zh_CN/Overview.md @@ -49,11 +49,11 @@ Experiment 的运行过程为:Tuner 接收搜索空间并生成配置。 这 * [开始使用](QuickStart.md) * [如何为 NNI 调整代码?](Trials.md) -* [NNI 支持哪些 Tuner?](Builtin_Tuner.md) -* [如何自定义 Tuner?](Customize_Tuner.md) -* [NNI 支持哪些 Assessor?](Builtin_Assessors.md) -* [如何自定义 Assessor?](Customize_Assessor.md) +* [NNI 支持哪些 Tuner?](BuiltinTuner.md) +* [如何自定义 Tuner?](CustomizeTuner.md) +* [NNI 支持哪些 Assessor?](BuiltinAssessors.md) +* [如何自定义 Assessor?](CustomizeAssessor.md) * [如何在本机上运行 Experiment?](LocalMode.md) * [如何在多机上运行 Experiment?](RemoteMachineMode.md) -* [如何在 OpenPAI 上运行 Experiment?](PAIMode.md) -* [样例](mnist_examples.md) \ No newline at end of file +* [如何在 OpenPAI 上运行 Experiment?](PaiMode.md) +* [样例](MnistExamples.md) \ No newline at end of file diff --git a/docs/zh_CN/PaiMode.md b/docs/zh_CN/PaiMode.md new file mode 100644 index 0000000000..b1df85aeb8 --- /dev/null +++ b/docs/zh_CN/PaiMode.md @@ -0,0 +1,97 @@ +# **在 OpenPAI 上运行 Experiment** + +NNI 支持在 [OpenPAI](https://github.com/Microsoft/pai) (简称 pai)上运行 Experiment,即 pai 模式。 在使用 NNI 的 pai 模式前, 需要有 [OpenPAI](https://github.com/Microsoft/pai) 群集的账户。 如果没有 OpenPAI 账户,参考[这里](https://github.com/Microsoft/pai#how-to-deploy)来进行部署。 在 pai 模式中,会在 Docker 创建的容器中运行 Trial 程序。 + +## 设置环境 + +参考[指南](QuickStart.md)安装 NNI。 + +## 运行 Experiment + +以 `examples/trials/mnist-annotation` 为例。 NNI 的 YAML 配置文件如下: + +```yaml +authorName: your_name +experimentName: auto_mnist +# 并发运行的 Trial 数量 +trialConcurrency: 2 +# Experiment 的最长持续运行时间 +maxExecDuration: 3h +# 空表示一直运行 +maxTrialNum: 100 +# 可选项: local, remote, pai +trainingServicePlatform: pai +# 可选项: true, false +useAnnotation: true +tuner: + builtinTunerName: TPE + classArgs: + optimize_mode: maximize +trial: + 
command: python3 mnist.py + codeDir: ~/nni/examples/trials/mnist-annotation + gpuNum: 0 + cpuNum: 1 + memoryMB: 8196 + image: openpai/pai.example.tensorflow + dataDir: hdfs://10.1.1.1:9000/nni + outputDir: hdfs://10.1.1.1:9000/nni +# 配置访问的 OpenPAI 集群 +paiConfig: + userName: your_pai_nni_user + passWord: your_pai_password + host: 10.1.1.1 +``` + +注意:如果用 pai 模式运行,需要在 YAML 文件中设置 `trainingServicePlatform: pai`。 + +与本机模式,以及[远程计算机模式](RemoteMachineMode.md)相比,pai 模式的 Trial 有额外的配置: + +* cpuNum + * 必填。 Trial 程序的 CPU 需求,必须为正数。 +* memoryMB + * 必填。 Trial 程序的内存需求,必须为正数。 +* image + * 必填。 在 pai 模式中,Trial 程序由 OpenPAI 在 [Docker 容器](https://www.docker.com/)中安排运行。 此键用来指定 Trial 程序的容器使用的 Docker 映像。 + * [Docker Hub](https://hub.docker.com/) 上有预制的 NNI Docker 映像 [nnimsra/nni](https://hub.docker.com/r/msranni/nni/)。 它包含了用来启动 NNI Experiment 所依赖的所有 Python 包,Node 模块和 JavaScript。 生成此 Docker 映像的文件在[这里](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile)。 可以直接使用此映像,或参考它来生成自己的映像。 +* dataDir + * 可选。 指定了 Trial 用于下载数据的 HDFS 数据目录。 格式应为 hdfs://{your HDFS host}:9000/{数据目录} +* outputDir + * 可选。 指定了 Trial 的 HDFS 输出目录。 Trial 在完成(成功或失败)后,Trial 的 stdout, stderr 会被 NNI 自动复制到此目录中。 格式应为 hdfs://{your HDFS host}:9000/{输出目录} +* virtualCluster + * 可选。 设置 OpenPAI 的 virtualCluster,即虚拟集群。 如果未设置此参数,将使用默认的虚拟集群。 +* shmMB + * 可选。 设置 OpenPAI 的 shmMB,即 Docker 中的共享内存。 + +完成并保存 NNI Experiment 配置文件后(例如可保存为:exp_pai.yml),运行以下命令: + + nnictl create --config exp_pai.yml + + +来在 pai 模式下启动 Experiment。 NNI 会为每个 Trial 创建 OpenPAI 作业,作业名称的格式为 `nni_exp_{experiment_id}_trial_{trial_id}`。 可以在 OpenPAI 集群的网站中看到 NNI 创建的作业,例如: ![](../img/nni_pai_joblist.jpg) + +注意:pai 模式下,NNIManager 会启动 RESTful 服务,监听端口为 NNI 网页服务器的端口加1。 例如,如果网页端口为`8080`,那么 RESTful 服务器会监听在 `8081`端口,来接收运行在 Kubernetes 中的 Trial 作业的指标。 因此,需要在防火墙中启用端口 `8081` 的 TCP 协议,以允许传入流量。 + +当一个 Trial 作业完成后,可以在 NNI 网页的概述页面(如:http://localhost:8080/oview)中查看 Trial 的信息。 + +在 Trial 列表页面中展开 Trial 信息,点击如下的 logPath: ![](../img/nni_webui_joblist.jpg) + +接着将会打开 HDFS 的 WEB 
界面,并浏览到 Trial 的输出文件: ![](../img/nni_trial_hdfs_output.jpg) + +在输出目录中可以看到三个文件:stderr, stdout, 以及 trial.log + +如果希望将 Trial 的模型数据等其它输出保存到HDFS中,可在 Trial 代码中使用 `NNI_OUTPUT_DIR` 来自己保存输出文件,NNI SDK会从 Trial 的容器中将 `NNI_OUTPUT_DIR` 中的文件复制到 HDFS 中。 + +如果在使用 pai 模式时遇到任何问题,请到 [NNI Github](https://github.com/Microsoft/nni) 中创建问题。 + +## 版本校验 + +从 0.6 开始,NNI 支持版本校验。确保 NNIManager 与 trialKeeper 的版本一致,避免兼容性错误。 +检查策略: + +1. 0.6 以前的 NNIManager 可与任何版本的 trialKeeper 一起运行,trialKeeper 支持向后兼容。 +2. 从 NNIManager 0.6 开始,与 triakKeeper 的版本必须一致。 例如,如果 NNIManager 是 0.6 版,则 trialKeeper 也必须是 0.6 版。 +3. 注意,只有版本的前两位数字才会被检查。例如,NNIManager 0.6.1 可以和 trialKeeper 的 0.6 或 0.6.2 一起使用,但不能与 trialKeeper 的 0.5.1 或 0.7 版本一起使用。 + +如果 Experiment 无法运行,而且不能确认是否是因为版本不匹配造成的,可以在 Web 界面检查是否有相关的错误消息。 +![](../img/version_check.png) \ No newline at end of file diff --git a/docs/zh_CN/QuickStart.md b/docs/zh_CN/QuickStart.md index c5c4cd3f07..a4bb38b2da 100644 --- a/docs/zh_CN/QuickStart.md +++ b/docs/zh_CN/QuickStart.md @@ -2,7 +2,7 @@ ## 安装 -当前支持 Linux,MacOS 和 Windows(本机模式),在 Ubuntu 16.04 或更高版本,MacOS 10.14.1 以及 Windows 10.1809 上进行了测试。 在 `python >= 3.5` 的环境中,只需要运行 `pip install` 即可完成安装。 +当前支持 Linux,MacOS 和 Windows,在 Ubuntu 16.04 或更高版本,MacOS 10.14.1 以及 Windows 10.1809 上进行了测试。 在 `python >= 3.5` 的环境中,只需要运行 `pip install` 即可完成安装。 #### Linux 和 MacOS @@ -12,7 +12,7 @@ #### Windows -如果选择 Windows 本机模式并使用 PowerShell 运行脚本,需要首次以管理员身份在 PowerShell 环境中运行以下命令。 +如果在 Windows 上使用 NNI,首次使用 PowerShell 时,需要以管理员身份运行下列命令。 ```bash Set-ExecutionPolicy -ExecutionPolicy Unrestricted @@ -161,13 +161,13 @@ trial: 从命令行使用 **config_windows.yml** 文件启动 MNIST Experiment 。 -**注意**:如果使用了 Windows 本机模式,则需要在 config.yml 文件中,将 `python3` 改为 `python`,或者使用 config_windows.yml 来开始 Experiment。 +**注意**:如果使用 Windows,则需要在 config.yml 文件中,将 `python3` 改为 `python`,或者使用 config_windows.yml 来开始 Experiment。 ```bash nnictl create --config nni/examples/trials/mnist/config_windows.yml ``` -注意:**nnictl** 是一个命令行工具,用来控制 NNI Experiment,如启动、停止、继续 Experiment,启动、停止 NNIBoard 等等。 
查看[这里](NNICTLDOC.md),了解 `nnictl` 更多用法。 +注意:**nnictl** 是一个命令行工具,用来控制 NNI Experiment,如启动、停止、继续 Experiment,启动、停止 NNIBoard 等等。 查看[这里](Nnictl.md),了解 `nnictl` 更多用法。 在命令行中等待输出 `INFO: Successfully started experiment!`。 此消息表明 Experiment 已成功启动。 期望的输出如下: @@ -208,7 +208,7 @@ You can use these commands to get more information about the experiment The Web UI urls are: [IP 地址]:8080 ``` -在浏览器中打开 `Web 界面地址`(即:`[IP 地址]:8080`),就可以看到 Experiment 的详细信息,以及所有的 Trial 任务。 +在浏览器中打开 `Web 界面地址`(即:`[IP 地址]:8080`),就可以看到 Experiment 的详细信息,以及所有的 Trial 任务。 如果无法打开终端中的 Web 界面链接,可以参考 [FAQ](FAQ.md)。 #### 查看概要页面 @@ -254,12 +254,12 @@ Experiment 相关信息会显示在界面上,配置和搜索空间等。 可 ## 相关主题 -* [尝试不同的 Tuner](Builtin_Tuner.md) -* [尝试不同的 Assessor](Builtin_Assessors.md) -* [使用命令行工具 nnictl](NNICTLDOC.md) +* [尝试不同的 Tuner](BuiltinTuner.md) +* [尝试不同的 Assessor](BuiltinAssessors.md) +* [使用命令行工具 nnictl](Nnictl.md) * [如何编写 Trial 代码](Trials.md) * [如何在本机运行 Experiment (支持多 GPU 卡)?](LocalMode.md) * [如何在多机上运行 Experiment?](RemoteMachineMode.md) -* [如何在 OpenPAI 上运行 Experiment?](PAIMode.md) +* [如何在 OpenPAI 上运行 Experiment?](PaiMode.md) * [如何通过 Kubeflow 在 Kubernetes 上运行 Experiment?](KubeflowMode.md) * [如何通过 FrameworkController 在 Kubernetes 上运行 Experiment?](FrameworkControllerMode.md) \ No newline at end of file diff --git a/docs/zh_CN/RELEASE.md b/docs/zh_CN/Release.md similarity index 94% rename from docs/zh_CN/RELEASE.md rename to docs/zh_CN/Release.md index 722db6cfa4..b9d6ebec23 100644 --- a/docs/zh_CN/RELEASE.md +++ b/docs/zh_CN/Release.md @@ -6,9 +6,9 @@ * [支持在 Windows 上使用 NNI](./WindowsLocalMode.md) * NNI 可在 Windows 上使用本机模式 -* [支持新的 Advisor: BOHB](./bohbAdvisor.md) +* [支持新的 Advisor: BOHB](./BohbAdvisor.md) * 支持新的 BOHB Advisor,这是一个健壮而有效的超参调优算法,囊括了贝叶斯优化和 Hyperband 的优点 -* [支持通过 nnictl 来导入导出 Experiment 数据](./NNICTLDOC.md#experiment) +* [支持通过 nnictl 来导入导出 Experiment 数据](./Nnictl.md#experiment) * 在 Experiment 执行完后,可生成分析结果报告 * 支持将先前的调优数据导入到 Tuner 和 Advisor 中 * [可为 NNI Trial 任务指定 GPU](./ExperimentConfig.md#localConfig) @@ -31,7 +31,7 @@ ### 
主要功能 -* [版本检查](https://github.com/Microsoft/nni/blob/master/docs/en_US/PAIMode.md#version-check) +* [版本检查](https://github.com/Microsoft/nni/blob/master/docs/zh_CN/PaiMode.md#version-check) * 检查 nniManager 和 trialKeeper 的版本是否一致 * [提前终止的任务也可返回最终指标](https://github.com/Microsoft/nni/issues/776) * 如果 includeIntermediateResults 为 true,最后一个 Assessor 的中间结果会被发送给 Tuner 作为最终结果。 includeIntermediateResults 的默认值为 false。 @@ -93,10 +93,10 @@ #### 支持新的 Tuner 和 Assessor -* 支持新的 [Metis Tuner](metisTuner.md)。 **在线**超参调优的场景下,Metis 算法已经被证明非常有效。 +* 支持新的 [Metis Tuner](MetisTuner.md)。 **在线**超参调优的场景下,Metis 算法已经被证明非常有效。 * 支持 [ENAS customized tuner](https://github.com/countif/enas_nni)。由 GitHub 社区用户所贡献。它是神经网络的搜索算法,能够通过强化学习来学习神经网络架构,比 NAS 的性能更好。 -* 支持 [Curve fitting (曲线拟合)Assessor](curvefittingAssessor.md),通过曲线拟合的策略来实现提前终止 Trial。 -* 进一步支持 [Weight Sharing(权重共享)](./AdvancedNAS.md):为 NAS Tuner 通过 NFS 来提供权重共享。 +* 支持 [Curve fitting (曲线拟合)Assessor](CurvefittingAssessor.md),通过曲线拟合的策略来实现提前终止 Trial。 +* 进一步支持 [Weight Sharing(权重共享)](./AdvancedNas.md):为 NAS Tuner 通过 NFS 来提供权重共享。 #### 改进训练平台 @@ -118,7 +118,7 @@ #### 支持新的 Tuner -* 支持新的 [network morphism](networkmorphismTuner.md) Tuner。 +* 支持新的 [network morphism](NetworkmorphismTuner.md) Tuner。 #### 改进训练平台 @@ -152,8 +152,8 @@ * [Kubeflow 训练服务](./KubeflowMode.md) * 支持 tf-operator * 使用 Kubeflow 的[分布式 Trial 样例](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed/dist_mnist.py) -* [网格搜索 Tuner](gridsearchTuner.md) -* [Hyperband Tuner](hyperbandAdvisor.md) +* [网格搜索 Tuner](GridsearchTuner.md) +* [Hyperband Tuner](HyperbandAdvisor.md) * 支持在 MAC 上运行 NNI Experiment * Web 界面 * 支持 hyperband Tuner @@ -187,7 +187,7 @@ nnictl create --port 8081 --config ``` -* 支持更新最大 Trial 的数量。 使用 `nnictl update --help` 了解详情。 或参考 [NNICTL](NNICTLDOC.md) 查看完整帮助。 +* 支持更新最大 Trial 的数量。 使用 `nnictl update --help` 了解详情。 或参考 [NNICTL](Nnictl.md) 查看完整帮助。 ### API 的新功能和更新 @@ -233,10 +233,10 @@ ### 主要功能 -* 支持 
[OpenPAI](https://github.com/Microsoft/pai) (又称 pai) 训练服务 (参考[这里](./PAIMode.md)来了解如何在 OpenPAI 下提交 NNI 任务) +* 支持 [OpenPAI](https://github.com/Microsoft/pai) (又称 pai) 训练服务 (参考[这里](./PaiMode.md)来了解如何在 OpenPAI 下提交 NNI 任务) * 支持 pai 模式的训练服务。 NNI Trial 可发送至 OpenPAI 集群上运行 * NNI Trial 输出 (包括日志和模型文件) 会被复制到 OpenPAI 的 HDFS 中。 -* 支持 [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) Tuner (参考[这里](smacTuner.md),了解如何使用 SMAC Tuner) +* 支持 [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) Tuner (参考[这里](SmacTuner.md),了解如何使用 SMAC Tuner) * [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) 基于 Sequential Model-Based Optimization (SMBO). 它会利用使用过的结果好的模型(高斯随机过程模型),并将随机森林引入到 SMBO 中,来处理分类参数。 NNI 的 SMAC 通过包装 [SMAC3](https://github.com/automl/SMAC3) 来支持。 * 支持将 NNI 安装在 [conda](https://conda.io/docs/index.html) 和 Python 虚拟环境中。 * 其它 diff --git a/docs/zh_CN/RemoteMachineMode.md b/docs/zh_CN/RemoteMachineMode.md index 8f50913191..dcd4f7de2e 100644 --- a/docs/zh_CN/RemoteMachineMode.md +++ b/docs/zh_CN/RemoteMachineMode.md @@ -56,6 +56,10 @@ machineList: passwd: bob123 ``` +可以使用不同系统来在远程计算机上运行 Experiment。 + +#### Linux 和 macOS + 填好 `machineList` 部分,然后运行: ```bash @@ -64,6 +68,16 @@ nnictl create --config ~/nni/examples/trials/mnist-annotation/config_remote.yml 来启动 Experiment。 +#### Windows + +填好 `machineList` 部分,然后运行: + +```bash +nnictl create --config %userprofile%\nni\examples\trials\mnist-annotation\config_remote.yml +``` + +来启动 Experiment。 + ## 版本校验 -从 0.6 开始,NNI 支持版本校验,详情参考[这里](PAIMode.md)。 \ No newline at end of file +从 0.6 开始,NNI 支持版本校验,详情参考[这里](PaiMode.md)。 \ No newline at end of file diff --git a/docs/zh_CN/SearchSpaceSpec.md b/docs/zh_CN/SearchSpaceSpec.md index 41829ccc03..c8c7829ec3 100644 --- a/docs/zh_CN/SearchSpaceSpec.md +++ b/docs/zh_CN/SearchSpaceSpec.md @@ -27,7 +27,14 @@ * {"_type":"choice","_value":options} - * 这表示变量值应该是列表中的选项之一。 选项的元素也可以是 [nested](嵌套的)随机表达式。 在这种情况下,随机选项仅会在条件满足时出现。 + * 表示变量的值是选项之一。 这里的 'options' 是一个数组。 选项的每个元素都是字符串。 
也可以是嵌套的子搜索空间。此子搜索空间仅在相应的元素选中后才起作用。 该子搜索空间中的变量可看作是条件变量。 + + * 这是个简单的 [nested] 搜索空间定义的[示例](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nested-search-space/search_space.json)。 如果选项列表中的元素是 dict,则它是一个子搜索空间,对于内置的 Tuner,必须在此 dict 中添加键 “_name”,这有助于标识选中的元素。 相应的,这是从 NNI 中获得的嵌套搜索空间定义的[示例](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nested-search-space/sample.json)。 以下 Tuner 支持嵌套搜索空间: + + * Random Search(随机搜索) + * TPE + * Anneal(退火算法) + * Evolution * {"_type":"randint","_value":[upper]} diff --git a/docs/zh_CN/SetupNNIDeveloperEnvironment.md b/docs/zh_CN/SetupNniDeveloperEnvironment.md similarity index 94% rename from docs/zh_CN/SetupNNIDeveloperEnvironment.md rename to docs/zh_CN/SetupNniDeveloperEnvironment.md index e243be2a11..3ad07ae443 100644 --- a/docs/zh_CN/SetupNNIDeveloperEnvironment.md +++ b/docs/zh_CN/SetupNniDeveloperEnvironment.md @@ -57,4 +57,4 @@ Trial 启动 Experiment 来检查环境。 例如,运行命令 * * * -最后,希望一切顺利。 参考[贡献](./CONTRIBUTING.md)文档,来了解更多创建拉取请求或问题的指南。 \ No newline at end of file +最后,希望一切顺利。 参考[贡献](./Contributing.md)文档,来了解更多创建拉取请求或问题的指南。 \ No newline at end of file diff --git a/docs/zh_CN/sklearn_examples.md b/docs/zh_CN/SklearnExamples.md similarity index 100% rename from docs/zh_CN/sklearn_examples.md rename to docs/zh_CN/SklearnExamples.md diff --git a/docs/zh_CN/SmacTuner.md b/docs/zh_CN/SmacTuner.md new file mode 100644 index 0000000000..a350717f87 --- /dev/null +++ b/docs/zh_CN/SmacTuner.md @@ -0,0 +1,7 @@ +# SMAC Tuner + +## SMAC + +[SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) 基于 Sequential Model-Based Optimization (SMBO). 
它利用使用过的结果好的模型(高斯随机过程模型),并将随机森林引入到 SMBO 中,来处理分类参数。 NNI 的 SMAC 通过包装 [SMAC3](https://github.com/automl/SMAC3) 来支持。 + +NNI 中的 SMAC 只支持部分类型的[搜索空间](SearchSpaceSpec.md),包括`choice`, `randint`, `uniform`, `loguniform`, `quniform(q=1)`。 \ No newline at end of file diff --git a/docs/zh_CN/SQuAD_evolution_examples.md b/docs/zh_CN/SquadEvolutionExamples.md similarity index 100% rename from docs/zh_CN/SQuAD_evolution_examples.md rename to docs/zh_CN/SquadEvolutionExamples.md diff --git a/docs/zh_CN/Trials.md b/docs/zh_CN/Trials.md index 56d9dd7457..993952c72a 100644 --- a/docs/zh_CN/Trials.md +++ b/docs/zh_CN/Trials.md @@ -43,7 +43,7 @@ RECEIVED_PARAMS = nni.get_next_parameter() nni.report_intermediate_result(metrics) ``` -`指标`可以是任意的 Python 对象。 如果使用了 NNI 内置的 Tuner/Assessor,`指标`只可以是两种类型:1) 数值类型,如 float、int, 2) dict 对象,其中必须由键名为 `default`,值为数值的项目。 `指标`会发送给[Assessor](Builtin_Assessors.md)。 通常,`指标`是损失值或精度。 +`指标`可以是任意的 Python 对象。 如果使用了 NNI 内置的 Tuner/Assessor,`指标`只可以是两种类型:1) 数值类型,如 float、int, 2) dict 对象,其中必须由键名为 `default`,值为数值的项目。 `指标`会发送给[Assessor](BuiltinAssessors.md)。 通常,`指标`是损失值或精度。 * 返回配置的最终性能 @@ -51,7 +51,7 @@ nni.report_intermediate_result(metrics) nni.report_final_result(metrics) ``` -`指标`也可以是任意的 Python 对象。 如果使用了内置的 Tuner/Assessor,`指标`格式和 `report_intermediate_result` 中一样,这个数值表示模型的性能,如精度、损失值等。 `指标`会发送给 [Tuner](Builtin_Tuner.md)。 +`指标`也可以是任意的 Python 对象。 如果使用了内置的 Tuner/Assessor,`指标`格式和 `report_intermediate_result` 中一样,这个数值表示模型的性能,如精度、损失值等。 `指标`会发送给 [Tuner](BuiltinTuner.md)。 ### 第三步:启用 NNI API @@ -162,8 +162,8 @@ echo $? 
`date +%s000` >/home/user_name/nni/experiments/$experiment_id$/trials/$t ## 更多 Trial 的样例 -* [MNIST 样例](mnist_examples.md) -* [为 CIFAR 10 分类找到最佳的 optimizer](cifar10_examples.md) -* [如何在 NNI 调优 SciKit-learn 的参数](sklearn_examples.md) -* [在阅读理解上使用自动模型架构搜索。](SQuAD_evolution_examples.md) -* [如何在 NNI 上调优 GBDT](gbdt_example.md) \ No newline at end of file +* [MNIST 样例](MnistExamples.md) +* [为 CIFAR 10 分类找到最佳的 optimizer](Cifar10Examples.md) +* [如何在 NNI 调优 SciKit-learn 的参数](SklearnExamples.md) +* [在阅读理解上使用自动模型架构搜索。](SquadEvolutionExamples.md) +* [如何在 NNI 上调优 GBDT](GbdtExample.md) \ No newline at end of file diff --git a/docs/zh_CN/WebUI.md b/docs/zh_CN/WebUI.md index 8250be6411..c8f4dfba35 100644 --- a/docs/zh_CN/WebUI.md +++ b/docs/zh_CN/WebUI.md @@ -6,6 +6,8 @@ * 查看 Experiment 的配置和搜索空间内容。 * 支持下载 Experiment 结果。 +* 支持导出 nni-manager 和 dispatcher 的日志文件。 +* 如果有任何问题,可以点击 “Feedback” 告诉我们。 ![](../img/webui-img/over1.png) @@ -52,6 +54,14 @@ ![](../img/webui-img/detail-local.png) +* "Add column" 按钮可选择在表格中显示的列。 如果 Experiment 的最终结果是 dict,则可以在表格中查看其它键。 + +![](../img/webui-img/addColumn.png) + +* 可使用 "Copy as python" 按钮来拷贝 Trial 的参数。 + +![](../img/webui-img/copyParameter.png) + * 如果在 OpenPAI 或 Kubeflow 平台上运行,还可以看到 hdfsLog。 ![](../img/webui-img/detail-pai.png) diff --git a/docs/zh_CN/advanced.rst b/docs/zh_CN/advanced.rst index 7a227ab00e..15118d74c5 100644 --- a/docs/zh_CN/advanced.rst +++ b/docs/zh_CN/advanced.rst @@ -2,5 +2,5 @@ ===================== .. toctree:: - 多阶段 - 高级网络架构搜索(AdvancedNAS) \ No newline at end of file + 多阶段 + 高级网络架构搜索 \ No newline at end of file diff --git a/docs/zh_CN/assessors.rst b/docs/zh_CN/assessors.rst index 00e3f2d649..2c2829c8aa 100644 --- a/docs/zh_CN/assessors.rst +++ b/docs/zh_CN/assessors.rst @@ -15,5 +15,5 @@ Assessor 从 Trial 中接收中间结果,并通过指定的算法决定此 Tri .. 
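The Trials.md hunk above states the metric format the built-in tuners/assessors accept: a plain number, or a dict with a numeric `default` key. A minimal sketch of validating that contract (the helper name is an assumption for illustration, not an NNI API):

```python
# Metric contract described above: a number, or a dict whose "default" key
# holds a number. Anything else is rejected for the built-in tuners/assessors.
def normalize_metric(metric):
    if isinstance(metric, (int, float)):
        return float(metric)
    if isinstance(metric, dict) and isinstance(metric.get("default"), (int, float)):
        return float(metric["default"])
    raise TypeError("metric must be numeric or a dict with a numeric 'default'")

assert normalize_metric(0.93) == 0.93
assert normalize_metric({"default": 0.93, "loss": 0.20}) == 0.93
```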
toctree:: :maxdepth: 2 - 内置 Assessor - 自定义 Assessor + 内置 Assessor + 自定义 Assessor diff --git a/docs/zh_CN/builtinAssessor.rst b/docs/zh_CN/builtinAssessor.rst deleted file mode 100644 index 174fea8aa0..0000000000 --- a/docs/zh_CN/builtinAssessor.rst +++ /dev/null @@ -1,9 +0,0 @@ -内置 Assessor -================= - -.. toctree:: - :maxdepth: 1 - - 介绍 - Medianstop - Curvefitting \ No newline at end of file diff --git a/docs/zh_CN/builtinTuner.rst b/docs/zh_CN/builtinTuner.rst deleted file mode 100644 index 4088a7b130..0000000000 --- a/docs/zh_CN/builtinTuner.rst +++ /dev/null @@ -1,18 +0,0 @@ -内置 Tuner -================== - -.. toctree:: - :maxdepth: 1 - - 介绍 - TPE - Random Search - Anneal - Naive Evolution - SMAC - Batch Tuner - Grid Search - Hyperband - Network Morphism - Metis Tuner - BOHB \ No newline at end of file diff --git a/docs/zh_CN/builtin_assessor.rst b/docs/zh_CN/builtin_assessor.rst new file mode 100644 index 0000000000..7844bb7d2b --- /dev/null +++ b/docs/zh_CN/builtin_assessor.rst @@ -0,0 +1,9 @@ +内置 Assessor +================= + +.. toctree:: + :maxdepth: 1 + + 介绍 + Medianstop + Curvefitting \ No newline at end of file diff --git a/docs/zh_CN/builtin_tuner.rst b/docs/zh_CN/builtin_tuner.rst new file mode 100644 index 0000000000..b384f82c26 --- /dev/null +++ b/docs/zh_CN/builtin_tuner.rst @@ -0,0 +1,18 @@ +内置 Tuner +================== + +.. toctree:: + :maxdepth: 1 + + 介绍 + TPE + Random Search + Anneal + Naive Evolution + SMAC + Batch Tuner + Grid Search + Hyperband + Network Morphism + Metis Tuner + BOHB \ No newline at end of file diff --git a/docs/zh_CN/community_sharings.rst b/docs/zh_CN/community_sharings.rst new file mode 100644 index 0000000000..82f65c9b7a --- /dev/null +++ b/docs/zh_CN/community_sharings.rst @@ -0,0 +1,12 @@ +###################### +社区分享 +###################### + +除了官方的教程和示例之外,也支持社区贡献者分享自己的自动机器学习实践经验,特别是使用 NNI 的实践经验。 + +.. 
toctree:: + :maxdepth: 2 + + NNI 经验分享 + 神经网络结构搜索的对比 + 超参调优算法的对比 diff --git a/docs/zh_CN/Contribution.rst b/docs/zh_CN/contribution.rst similarity index 52% rename from docs/zh_CN/Contribution.rst rename to docs/zh_CN/contribution.rst index 74940f7c7a..227aa22c03 100644 --- a/docs/zh_CN/Contribution.rst +++ b/docs/zh_CN/contribution.rst @@ -3,5 +3,5 @@ ############################### .. toctree:: - 设置开发环境 - 贡献指南 \ No newline at end of file + 设置开发环境 + 贡献指南 \ No newline at end of file diff --git a/docs/zh_CN/examples.rst b/docs/zh_CN/examples.rst new file mode 100644 index 0000000000..95437431c1 --- /dev/null +++ b/docs/zh_CN/examples.rst @@ -0,0 +1,12 @@ +###################### +样例 +###################### + +.. toctree:: + :maxdepth: 2 + + MNIST + Cifar10 + Scikit-learn + EvolutionSQuAD + GBDT diff --git a/docs/zh_CN/index.rst b/docs/zh_CN/index.rst index 401712bfd9..a8bcca6c60 100644 --- a/docs/zh_CN/index.rst +++ b/docs/zh_CN/index.rst @@ -13,10 +13,10 @@ Neural Network Intelligence(NNI)文档 概述 入门 - 教程 - 样例 - 参考 + 教程 + 示例 + 参考 常见问答 - 贡献 - 版本日志 - 博客 + 贡献 + 更改日志 + 社区经验分享 diff --git a/docs/zh_CN/nni_practice_sharing.rst b/docs/zh_CN/nni_practice_sharing.rst new file mode 100644 index 0000000000..4ccec032cd --- /dev/null +++ b/docs/zh_CN/nni_practice_sharing.rst @@ -0,0 +1,10 @@ +################# +教程 +################# + +分享使用 NNI 来调优模型和系统的经验 + +.. toctree:: + :maxdepth: 2 + + 在 NNI 上调优 Recommenders 的 SVD \ No newline at end of file diff --git a/docs/zh_CN/Reference.rst b/docs/zh_CN/reference.rst similarity index 90% rename from docs/zh_CN/Reference.rst rename to docs/zh_CN/reference.rst index bcace87714..67f41eed70 100644 --- a/docs/zh_CN/Reference.rst +++ b/docs/zh_CN/reference.rst @@ -4,7 +4,7 @@ .. 
toctree:: :maxdepth: 3 - 命令行 + 命令行 Python API Annotation 配置 diff --git a/docs/zh_CN/training_services.rst b/docs/zh_CN/training_services.rst index 287232b644..fff6244c7a 100644 --- a/docs/zh_CN/training_services.rst +++ b/docs/zh_CN/training_services.rst @@ -4,6 +4,6 @@ NNI 支持的训练平台介绍 .. toctree:: 本机 远程 - OpenPAI + OpenPAI Kubeflow FrameworkController \ No newline at end of file diff --git a/docs/zh_CN/tuners.rst b/docs/zh_CN/tuners.rst index 9f5d62ad47..1c9d872bc6 100644 --- a/docs/zh_CN/tuners.rst +++ b/docs/zh_CN/tuners.rst @@ -13,6 +13,6 @@ Tuner 从 Trial 接收指标结果,来评估一组超参或网络结构的性 .. toctree:: :maxdepth: 2 - 内置 Tuner - 自定义 Tuner - 自定义 Advisor \ No newline at end of file + 内置 Tuner + 自定义 Tuner + 自定义 Advisor \ No newline at end of file diff --git a/docs/zh_CN/tutorials.rst b/docs/zh_CN/tutorials.rst new file mode 100644 index 0000000000..fd9f7bd153 --- /dev/null +++ b/docs/zh_CN/tutorials.rst @@ -0,0 +1,16 @@ +###################### +教程 +###################### + +.. toctree:: + :maxdepth: 2 + + 安装 + 实现 Trial + Tuner + Assessor + Web 界面 + 训练平台 + 如何使用 Docker + 高级功能 + 如何调试 \ No newline at end of file diff --git a/examples/trials/NAS/README.md b/examples/trials/NAS/README.md new file mode 100644 index 0000000000..375c7178a1 --- /dev/null +++ b/examples/trials/NAS/README.md @@ -0,0 +1,8 @@ + **Run Neural Network Architecture Search in NNI** + === + +Now we have an NAS example, [NNI-NAS-Example](https://github.com/Crysple/NNI-NAS-Example), contributed by the community, which runs in NNI through the NAS interface. + +Thanks to our lovely contributors. + +We welcome more and more people to join us!
diff --git a/examples/trials/auto-gbdt/main.py b/examples/trials/auto-gbdt/main.py index 12c7a71f84..f83b4635d8 100644 --- a/examples/trials/auto-gbdt/main.py +++ b/examples/trials/auto-gbdt/main.py @@ -74,6 +74,8 @@ def load_data(train_path='./data/regression.train', test_path='./data/regression def run(lgb_train, lgb_eval, params, X_test, y_test): print('Start training...') + params['num_leaves'] = int(params['num_leaves']) + # train gbm = lgb.train(params, lgb_train, diff --git a/examples/trials/auto-gbdt/search_space.json b/examples/trials/auto-gbdt/search_space.json index bdb8eedcd0..ea09eca9e7 100644 --- a/examples/trials/auto-gbdt/search_space.json +++ b/examples/trials/auto-gbdt/search_space.json @@ -1,5 +1,5 @@ { - "num_leaves":{"_type":"choice","_value":[31, 28, 24, 20]}, + "num_leaves":{"_type":"randint","_value":[20, 31]}, "learning_rate":{"_type":"choice","_value":[0.01, 0.05, 0.1, 0.2]}, "bagging_fraction":{"_type":"uniform","_value":[0.7, 1.0]}, "bagging_freq":{"_type":"choice","_value":[1, 2, 4, 8, 10]} diff --git a/examples/trials/cifar10_pytorch/config.yml b/examples/trials/cifar10_pytorch/config.yml index a35756a132..a44cfafa2b 100644 --- a/examples/trials/cifar10_pytorch/config.yml +++ b/examples/trials/cifar10_pytorch/config.yml @@ -1,6 +1,6 @@ authorName: default experimentName: example_pytorch_cifar10 -trialConcurrency: 1 +trialConcurrency: 4 maxExecDuration: 100h maxTrialNum: 10 #choice: local, remote, pai @@ -19,3 +19,5 @@ trial: command: python3 main.py codeDir: . 
gpuNum: 1 +localConfig: + maxTrialNumPerGpu: 2 diff --git a/examples/trials/mnist-cascading-search-space/config.yml b/examples/trials/mnist-nested-search-space/config.yml similarity index 86% rename from examples/trials/mnist-cascading-search-space/config.yml rename to examples/trials/mnist-nested-search-space/config.yml index 1c2a4643a4..7c1715b97c 100644 --- a/examples/trials/mnist-cascading-search-space/config.yml +++ b/examples/trials/mnist-nested-search-space/config.yml @@ -1,5 +1,5 @@ authorName: default -experimentName: mnist-cascading-search-space +experimentName: mnist-nested-search-space trialConcurrency: 2 maxExecDuration: 1h maxTrialNum: 100 diff --git a/examples/trials/mnist-cascading-search-space/mnist.py b/examples/trials/mnist-nested-search-space/mnist.py similarity index 99% rename from examples/trials/mnist-cascading-search-space/mnist.py rename to examples/trials/mnist-nested-search-space/mnist.py index a90c7640cb..ed892ff2b5 100644 --- a/examples/trials/mnist-cascading-search-space/mnist.py +++ b/examples/trials/mnist-nested-search-space/mnist.py @@ -14,7 +14,7 @@ import nni -logger = logging.getLogger('mnist_cascading_search_space') +logger = logging.getLogger('mnist_nested_search_space') FLAGS = None class MnistNetwork(object): diff --git a/examples/trials/mnist-cascading-search-space/requirments.txt b/examples/trials/mnist-nested-search-space/requirments.txt similarity index 100% rename from examples/trials/mnist-cascading-search-space/requirments.txt rename to examples/trials/mnist-nested-search-space/requirments.txt diff --git a/examples/trials/mnist-cascading-search-space/sample.json b/examples/trials/mnist-nested-search-space/sample.json similarity index 100% rename from examples/trials/mnist-cascading-search-space/sample.json rename to examples/trials/mnist-nested-search-space/sample.json diff --git a/examples/trials/mnist-cascading-search-space/search_space.json b/examples/trials/mnist-nested-search-space/search_space.json similarity 
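The auto-gbdt change above moves `num_leaves` from a `choice` list to `randint` over `[20, 31]` and adds an `int()` cast in `main.py`; a sketch of why the cast matters, assuming `randint` draws an integer in `[lower, upper)` but the tuner may hand the sampled value back as a float:

```python
import random

search_space = {"num_leaves": {"_type": "randint", "_value": [20, 31]}}

def sample_randint(spec):
    # Assumed semantics: randint draws an integer in [lower, upper); some
    # tuners return it as a float, hence the int() cast added in main.py.
    lower, upper = spec["_value"]
    return float(random.randrange(lower, upper))

params = {"num_leaves": sample_randint(search_space["num_leaves"])}
params["num_leaves"] = int(params["num_leaves"])  # the cast from the diff
assert 20 <= params["num_leaves"] < 31
```

Without the cast, LightGBM would receive `num_leaves` as a float and reject it.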
index 100% rename from examples/trials/mnist-cascading-search-space/search_space.json rename to examples/trials/mnist-nested-search-space/search_space.json diff --git a/install.ps1 b/install.ps1 index 31d8ba2fe7..5de27a45a2 100644 --- a/install.ps1 +++ b/install.ps1 @@ -15,7 +15,7 @@ $yarnUrl = "https://yarnpkg.com/latest.tar.gz" $unzipNodeDir = "node-v*" $unzipYarnDir = "yarn-v*" -$NNI_DEPENDENCY_FOLDER = "C:\tmp\$env:USERNAME" +$NNI_DEPENDENCY_FOLDER = [System.IO.Path]::GetTempPath()+$env:USERNAME $WHICH_PYTHON = where.exe python if($WHICH_PYTHON -eq $null){ diff --git a/src/nni_manager/common/datastore.ts b/src/nni_manager/common/datastore.ts index 209292be73..769fe48017 100644 --- a/src/nni_manager/common/datastore.ts +++ b/src/nni_manager/common/datastore.ts @@ -70,6 +70,18 @@ interface TrialJobInfo { stderrPath?: string; } +interface HyperParameterFormat { + parameter_source: string; + parameters: Object; + parameter_id: number; +} + +interface ExportedDataFormat { + parameter: Object; + value: Object; + id: string; +} + abstract class DataStore { public abstract init(): Promise; public abstract close(): Promise; @@ -82,6 +94,8 @@ abstract class DataStore { public abstract getTrialJob(trialJobId: string): Promise; public abstract storeMetricData(trialJobId: string, data: string): Promise; public abstract getMetricData(trialJobId?: string, metricType?: MetricType): Promise; + public abstract exportTrialHpConfigs(): Promise; + public abstract getImportedData(): Promise; } abstract class Database { @@ -99,5 +113,5 @@ abstract class Database { export { DataStore, Database, TrialJobEvent, MetricType, MetricData, TrialJobInfo, - ExperimentProfileRecord, TrialJobEventRecord, MetricDataRecord + ExperimentProfileRecord, TrialJobEventRecord, MetricDataRecord, HyperParameterFormat, ExportedDataFormat }; diff --git a/src/nni_manager/common/manager.ts b/src/nni_manager/common/manager.ts index a20e4d010b..4933465b92 100644 --- a/src/nni_manager/common/manager.ts +++ 
b/src/nni_manager/common/manager.ts @@ -100,6 +100,7 @@ abstract class Manager { public abstract getExperimentProfile(): Promise; public abstract updateExperimentProfile(experimentProfile: ExperimentProfile, updateType: ProfileUpdateType): Promise; public abstract importData(data: string): Promise; + public abstract exportData(): Promise; public abstract addCustomizedTrialJob(hyperParams: string): Promise; public abstract cancelTrialJobByUser(trialJobId: string): Promise; diff --git a/src/nni_manager/common/utils.ts b/src/nni_manager/common/utils.ts index b741b4e9a9..f3a57cf9e0 100644 --- a/src/nni_manager/common/utils.ts +++ b/src/nni_manager/common/utils.ts @@ -43,11 +43,11 @@ function getExperimentRootDir(): string { .getLogDir(); } -function getLogDir(): string{ +function getLogDir(): string { return path.join(getExperimentRootDir(), 'log'); } -function getLogLevel(): string{ +function getLogLevel(): string { return getExperimentStartupInfo() .getLogLevel(); } @@ -149,7 +149,7 @@ function parseArg(names: string[]): string { return ''; } -function encodeCmdLineArgs(args:any):any{ +function encodeCmdLineArgs(args: any): any { if(process.platform === 'win32'){ return JSON.stringify(args); } @@ -158,7 +158,7 @@ function encodeCmdLineArgs(args:any):any{ } } -function getCmdPy():string{ +function getCmdPy(): string { let cmd = 'python3'; if(process.platform === 'win32'){ cmd = 'python'; @@ -390,7 +390,7 @@ async function getVersion(): Promise { /** * run command as ChildProcess */ -function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess{ +function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess { let cmd: string = command; let arg: string[] = []; let newShell: boolean = true; @@ -411,7 +411,7 @@ function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newE /** * judge whether the process is alive */ -async function isAlive(pid:any): Promise{ +async function 
isAlive(pid:any): Promise { let deferred : Deferred = new Deferred(); let alive: boolean = false; if(process.platform ==='win32'){ @@ -439,7 +439,7 @@ async function isAlive(pid:any): Promise{ /** * kill process */ -async function killPid(pid:any): Promise{ +async function killPid(pid:any): Promise { let deferred : Deferred = new Deferred(); try { if (process.platform === "win32") { @@ -455,7 +455,7 @@ async function killPid(pid:any): Promise{ return deferred.promise; } -function getNewLine(): string{ +function getNewLine(): string { if (process.platform === "win32") { return "\r\n"; } diff --git a/src/nni_manager/core/nniDataStore.ts b/src/nni_manager/core/nniDataStore.ts index 4f9f84d4b6..86defd0971 100644 --- a/src/nni_manager/core/nniDataStore.ts +++ b/src/nni_manager/core/nniDataStore.ts @@ -24,7 +24,8 @@ import { Deferred } from 'ts-deferred'; import * as component from '../common/component'; import { Database, DataStore, MetricData, MetricDataRecord, MetricType, - TrialJobEvent, TrialJobEventRecord, TrialJobInfo } from '../common/datastore'; + TrialJobEvent, TrialJobEventRecord, TrialJobInfo, HyperParameterFormat, + ExportedDataFormat } from '../common/datastore'; import { NNIError } from '../common/errors'; import { getExperimentId, isNewExperiment } from '../common/experimentStartupInfo'; import { getLogger, Logger } from '../common/log'; @@ -171,6 +172,61 @@ class NNIDataStore implements DataStore { return this.db.queryMetricData(trialJobId, metricType); } + public async exportTrialHpConfigs(): Promise { + const jobs: TrialJobInfo[] = await this.listTrialJobs(); + let exportedData: ExportedDataFormat[] = []; + for (const job of jobs) { + if (job.hyperParameters && job.finalMetricData) { + if (job.hyperParameters.length === 1 && job.finalMetricData.length === 1) { + // optimization for non-multi-phase case + const parameters: HyperParameterFormat = JSON.parse(job.hyperParameters[0]); + const oneEntry: ExportedDataFormat = { + parameter: 
parameters.parameters, + value: JSON.parse(job.finalMetricData[0].data), + id: job.id + }; + exportedData.push(oneEntry); + } else { + let paraMap: Map = new Map(); + let metricMap: Map = new Map(); + for (const eachPara of job.hyperParameters) { + const parameters: HyperParameterFormat = JSON.parse(eachPara); + paraMap.set(parameters.parameter_id, parameters.parameters); + } + for (const eachMetric of job.finalMetricData) { + const value: Object = JSON.parse(eachMetric.data); + metricMap.set(Number(eachMetric.parameterId), value); + } + paraMap.forEach((value: Object, key: number) => { + const metricValue: Object | undefined = metricMap.get(key); + if (metricValue) { + const oneEntry: ExportedDataFormat = { + parameter: value, + value: metricValue, + id: job.id + }; + exportedData.push(oneEntry); + } + }); + } + } + } + + return JSON.stringify(exportedData); + } + + public async getImportedData(): Promise { + let importedData: string[] = []; + const importDataEvents: TrialJobEventRecord[] = await this.db.queryTrialJobEvent(undefined, 'IMPORT_DATA'); + for (const event of importDataEvents) { + if (event.data) { + importedData.push(event.data); + } + } + + return importedData; + } + private async queryTrialJobs(status?: TrialJobStatus, trialJobId?: string): Promise { const result: TrialJobInfo[] = []; const trialJobEvents: TrialJobEventRecord[] = await this.db.queryTrialJobEvent(trialJobId); diff --git a/src/nni_manager/core/nnimanager.ts b/src/nni_manager/core/nnimanager.ts index 9eee97c91e..bc564289f8 100644 --- a/src/nni_manager/core/nnimanager.ts +++ b/src/nni_manager/core/nnimanager.ts @@ -58,7 +58,10 @@ class NNIManager implements Manager { private status: NNIManagerStatus; private waitingTrials: string[]; private trialJobs: Map; + private trialDataForTuner: string; + private trialJobMetricListener: (metric: TrialJobMetric) => void; + constructor() { this.currSubmittedTrialNum = 0; this.trialConcurrencyChange = 0; @@ -68,6 +71,7 @@ class NNIManager implements 
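The `exportTrialHpConfigs` implementation above flattens each finished trial into a `{parameter, value, id}` record (the shape of the new `ExportedDataFormat` interface) and returns the list as JSON; a sketch of consuming that export on the Python side, with record contents and trial ids invented for illustration:

```python
import json

# Record shape assumed from the ExportedDataFormat interface in the diff;
# the parameter values and ids below are illustrative only.
exported = json.dumps([
    {"parameter": {"num_leaves": 24, "learning_rate": 0.05},
     "value": {"default": 0.97}, "id": "Ab1cd"},
    {"parameter": {"num_leaves": 28, "learning_rate": 0.10},
     "value": {"default": 0.91}, "id": "Ef2gh"},
])

records = json.loads(exported)
best = max(records, key=lambda r: r["value"]["default"])
assert best["parameter"]["num_leaves"] == 24
```

On experiment resume, nnimanager concatenates this export with previously imported data and replays it to the tuner via `IMPORT_DATA`.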
Manager { this.dispatcherPid = 0; this.waitingTrials = []; this.trialJobs = new Map(); + this.trialDataForTuner = ''; this.log = getLogger(); this.dataStore = component.get(DataStore); @@ -76,6 +80,11 @@ class NNIManager implements Manager { status: 'INITIALIZED', errors: [] }; + this.trialJobMetricListener = (metric: TrialJobMetric) => { + this.onTrialJobMetrics(metric).catch((err: Error) => { + this.criticalError(NNIError.FromError(err, 'Job metrics error: ')); + }); + }; } public updateExperimentProfile(experimentProfile: ExperimentProfile, updateType: ProfileUpdateType): Promise { @@ -110,6 +119,10 @@ class NNIManager implements Manager { return this.dataStore.storeTrialJobEvent('IMPORT_DATA', '', data); } + public async exportData(): Promise { + return this.dataStore.exportTrialHpConfigs(); + } + public addCustomizedTrialJob(hyperParams: string): Promise { if (this.currSubmittedTrialNum >= this.experimentProfile.params.maxTrialNum) { return Promise.reject( @@ -206,6 +219,16 @@ class NNIManager implements Manager { .filter((job: TrialJobInfo) => job.status === 'WAITING' || job.status === 'RUNNING') .map((job: TrialJobInfo) => this.dataStore.storeTrialJobEvent('FAILED', job.id))); + // Collect generated trials and imported trials + const finishedTrialData: string = await this.exportData(); + const importedData: string[] = await this.dataStore.getImportedData(); + let trialData: Object[] = JSON.parse(finishedTrialData); + for (const oneImportedData of importedData) { + // do not deduplicate + trialData = trialData.concat(JSON.parse(oneImportedData)); + } + this.trialDataForTuner = JSON.stringify(trialData); + if (this.experimentProfile.execDuration < this.experimentProfile.params.maxExecDuration && this.currSubmittedTrialNum < this.experimentProfile.params.maxTrialNum && this.experimentProfile.endTime) { @@ -342,6 +365,7 @@ class NNIManager implements Manager { if (this.dispatcher === undefined) { throw new Error('Error: tuner has not been setup'); } + 
this.trainingService.removeTrialJobMetricListener(this.trialJobMetricListener); this.dispatcher.sendCommand(TERMINATE); let tunerAlive: boolean = true; // gracefully terminate tuner and assessor here, wait at most 30 seconds. @@ -589,11 +613,7 @@ class NNIManager implements Manager { if (this.dispatcher === undefined) { throw new Error('Error: tuner or job maintainer have not been setup'); } - this.trainingService.addTrialJobMetricListener((metric: TrialJobMetric) => { - this.onTrialJobMetrics(metric).catch((err: Error) => { - this.criticalError(NNIError.FromError(err, 'Job metrics error: ')); - }); - }); + this.trainingService.addTrialJobMetricListener(this.trialJobMetricListener); this.dispatcher.onCommand((commandType: string, content: string) => { this.onTunerCommand(commandType, content).catch((err: Error) => { @@ -644,6 +664,12 @@ class NNIManager implements Manager { switch (commandType) { case INITIALIZED: // Tuner is intialized, search space is set, request tuner to generate hyper parameters + if (this.trialDataForTuner.length > 0) { + if (this.dispatcher === undefined) { + throw new Error('Dispatcher error: tuner has not been setup'); + } + this.dispatcher.sendCommand(IMPORT_DATA, this.trialDataForTuner); + } this.requestTrialJobs(this.experimentProfile.params.trialConcurrency); break; case NEW_TRIAL_JOB: diff --git a/src/nni_manager/core/test/mockedDatastore.ts b/src/nni_manager/core/test/mockedDatastore.ts index d08b5b801b..1e4c580a04 100644 --- a/src/nni_manager/core/test/mockedDatastore.ts +++ b/src/nni_manager/core/test/mockedDatastore.ts @@ -210,6 +210,16 @@ class MockedDataStore implements DataStore { return result; } + async exportTrialHpConfigs(): Promise { + const ret: string = ''; + return Promise.resolve(ret); + } + + async getImportedData(): Promise { + const ret: string[] = []; + return Promise.resolve(ret); + } + public getTrialJob(trialJobId: string): Promise { throw new Error("Method not implemented."); } diff --git 
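The nnimanager change above stores the metric callback in `this.trialJobMetricListener` so that `removeTrialJobMetricListener` can later be handed the same function object; a Python sketch of why an anonymous callback could never be unregistered:

```python
# A callback can only be unregistered via the same object that was registered;
# a fresh lambda with identical code is a different object.
class Emitter:
    def __init__(self):
        self.listeners = []

    def add_listener(self, fn):
        self.listeners.append(fn)

    def remove_listener(self, fn):
        self.listeners.remove(fn)  # raises ValueError if fn was never added

emitter = Emitter()
listener = lambda metric: None  # keep one reference, as the diff now does
emitter.add_listener(listener)
emitter.remove_listener(listener)
assert emitter.listeners == []
```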
a/src/nni_manager/rest_server/restHandler.ts b/src/nni_manager/rest_server/restHandler.ts index ded9bd6232..7513aec496 100644 --- a/src/nni_manager/rest_server/restHandler.ts +++ b/src/nni_manager/rest_server/restHandler.ts @@ -72,6 +72,7 @@ class NNIRestHandler { this.addTrialJob(router); this.cancelTrialJob(router); this.getMetricData(router); + this.exportData(router); // Express-joi-validator configuration router.use((err: any, req: Request, res: Response, next: any) => { @@ -261,6 +262,16 @@ class NNIRestHandler { }); } + private exportData(router: Router): void { + router.get('/export-data', (req: Request, res: Response) => { + this.nniManager.exportData().then((exportedData: string) => { + res.send(exportedData); + }).catch((err: Error) => { + this.handle_error(err, res); + }); + }); + } + private setErrorPathForFailedJob(jobInfo: TrialJobInfo): TrialJobInfo { if (jobInfo === undefined || jobInfo.status !== 'FAILED' || jobInfo.logPath === undefined) { return jobInfo; diff --git a/src/nni_manager/rest_server/restValidationSchemas.ts b/src/nni_manager/rest_server/restValidationSchemas.ts index a62a6b1ea6..f794df4d70 100644 --- a/src/nni_manager/rest_server/restValidationSchemas.ts +++ b/src/nni_manager/rest_server/restValidationSchemas.ts @@ -31,10 +31,14 @@ export namespace ValidationSchemas { passwd: joi.string(), sshKeyPath: joi.string(), passphrase: joi.string(), - gpuIndices: joi.string() + gpuIndices: joi.string(), + maxTrialNumPerGpu: joi.number(), + useActiveGpu: joi.boolean() })), local_config: joi.object({ - gpuIndices: joi.string() + gpuIndices: joi.string(), + maxTrialNumPerGpu: joi.number(), + useActiveGpu: joi.boolean() }), trial_config: joi.object({ image: joi.string().min(1), diff --git a/src/nni_manager/rest_server/test/mockedNNIManager.ts b/src/nni_manager/rest_server/test/mockedNNIManager.ts index d65ad1ed62..299c473aa6 100644 --- a/src/nni_manager/rest_server/test/mockedNNIManager.ts +++ 
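The `restValidationSchemas` hunk above extends both `machineList` entries and `local_config` with `maxTrialNumPerGpu` and `useActiveGpu`; a rough Python equivalent of that shape check (the real validation is the joi schema shown in the diff, this is only a sketch):

```python
# Rough equivalent of the extended local_config schema: gpuIndices is a
# string, maxTrialNumPerGpu a number, useActiveGpu a boolean.
def validate_local_config(cfg):
    allowed = {"gpuIndices": str, "maxTrialNumPerGpu": int, "useActiveGpu": bool}
    for key, value in cfg.items():
        if key not in allowed or not isinstance(value, allowed[key]):
            raise ValueError(f"invalid local_config field: {key}")
    return cfg

validate_local_config({"gpuIndices": "0,1",
                       "maxTrialNumPerGpu": 2,
                       "useActiveGpu": False})
```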
b/src/nni_manager/rest_server/test/mockedNNIManager.ts @@ -49,6 +49,10 @@ export class MockedNNIManager extends Manager { public importData(data: string): Promise { return Promise.resolve(); } + public async exportData(): Promise { + const ret: string = ''; + return Promise.resolve(ret); + } public getTrialJobStatistics(): Promise { const deferred: Deferred = new Deferred(); deferred.resolve([{ diff --git a/src/nni_manager/training_service/common/util.ts b/src/nni_manager/training_service/common/util.ts index 031d277fab..556dc79806 100644 --- a/src/nni_manager/training_service/common/util.ts +++ b/src/nni_manager/training_service/common/util.ts @@ -24,7 +24,10 @@ import { getLogger } from "common/log"; import { countFilesRecursively } from '../../common/utils' import * as cpp from 'child-process-promise'; import * as cp from 'child_process'; -import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData' +import * as os from 'os'; +import * as fs from 'fs'; +import { getNewLine } from '../../common/utils'; +import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData'; import * as path from 'path'; import { String } from 'typescript-string-operations'; import { file } from "../../node_modules/@types/tmp"; @@ -66,6 +69,20 @@ export async function execMkdir(directory: string): Promise { return Promise.resolve(); } +/** + * copy files to the directory + * @param source + * @param destination + */ +export async function execCopydir(source: string, destination: string): Promise { + if (process.platform === 'win32') { + await cpp.exec(`powershell.exe Copy-Item ${source} -Destination ${destination} -Recurse`); + } else { + await cpp.exec(`cp -r ${source} ${destination}`); + } + return Promise.resolve(); +} + /** * crete a new file * @param filename @@ -91,8 +108,6 @@ export function execScript(filePath: string): cp.ChildProcess { } } - - /** * output the last line of a file * @param filePath @@ -111,9 +126,9 
@@ export async function execTail(filePath: string): Promise{ +export async function execRemove(directory: string): Promise { if (process.platform === 'win32') { - await cpp.exec(`powershell.exe Remove-Item ${directory}`); + await cpp.exec(`powershell.exe Remove-Item ${directory} -Recurse -Force`); } else { await cpp.exec(`rm -rf ${directory}`); } @@ -124,7 +139,7 @@ export async function execRemove(directory: string): Promise{ * kill a process * @param directory */ -export async function execKill(pid: string): Promise{ +export async function execKill(pid: string): Promise { if (process.platform === 'win32') { await cpp.exec(`cmd /c taskkill /PID ${pid} /T /F`); } else { @@ -138,7 +153,7 @@ export async function execKill(pid: string): Promise{ * @param variable * @returns command string */ -export function setEnvironmentVariable(variable: { key: string; value: string }): string{ +export function setEnvironmentVariable(variable: { key: string; value: string }): string { if (process.platform === 'win32') { return `$env:${variable.key}="${variable.value}"`; } @@ -147,6 +162,32 @@ export function setEnvironmentVariable(variable: { key: string; value: string }) } } +/** + * Compress files in directory to tar file + * @param source_path + * @param tar_path + */ +export async function tarAdd(tar_path: string, source_path: string): Promise { + if (process.platform === 'win32') { + tar_path = tar_path.split('\\').join('\\\\'); + source_path = source_path.split('\\').join('\\\\'); + let script: string[] = []; + script.push( + `import os`, + `import tarfile`, + String.Format(`tar = tarfile.open("{0}","w:gz")\r\nfor root,dir,files in os.walk("{1}"):`, tar_path, source_path), + ` for file in files:`, + ` fullpath = os.path.join(root,file)`, + ` tar.add(fullpath, arcname=file)`, + `tar.close()`); + await fs.promises.writeFile(path.join(os.tmpdir(), 'tar.py'), script.join(getNewLine()), { encoding: 'utf8', mode: 0o777 }); + const tarScript: string = path.join(os.tmpdir(), 
'tar.py'); + await cpp.exec(`python ${tarScript}`); + } else { + await cpp.exec(`tar -czf ${tar_path} -C ${source_path} .`); + } + return Promise.resolve(); +} /** * generate script file name diff --git a/src/nni_manager/training_service/local/gpuScheduler.ts b/src/nni_manager/training_service/local/gpuScheduler.ts index 04ea3d3390..0cf48e34ba 100644 --- a/src/nni_manager/training_service/local/gpuScheduler.ts +++ b/src/nni_manager/training_service/local/gpuScheduler.ts @@ -71,14 +71,15 @@ class GPUScheduler { execScript(gpuMetricsCollectorScriptPath) } - public getAvailableGPUIndices(): number[] { + public getAvailableGPUIndices(useActiveGpu: boolean, occupiedGpuIndexNumMap: Map): number[] { if (this.gpuSummary !== undefined) { - if(process.platform === 'win32') { + if(process.platform === 'win32' || useActiveGpu) { return this.gpuSummary.gpuInfos.map((info: GPUInfo) => info.index); } else{ - return this.gpuSummary.gpuInfos.filter((info: GPUInfo) => info.activeProcessNum === 0) - .map((info: GPUInfo) => info.index); + return this.gpuSummary.gpuInfos.filter((info: GPUInfo) => + occupiedGpuIndexNumMap.get(info.index) === undefined && info.activeProcessNum === 0 || + occupiedGpuIndexNumMap.get(info.index) !== undefined).map((info: GPUInfo) => info.index); } } diff --git a/src/nni_manager/training_service/local/localTrainingService.ts b/src/nni_manager/training_service/local/localTrainingService.ts index aed17b3fab..2ab316a8b3 100644 --- a/src/nni_manager/training_service/local/localTrainingService.ts +++ b/src/nni_manager/training_service/local/localTrainingService.ts @@ -97,11 +97,19 @@ class LocalTrialJobDetail implements TrialJobDetail { * Local training service config */ class LocalConfig { + public maxTrialNumPerGpu?: number; public gpuIndices?: string; - constructor(gpuIndices?: string) { + public useActiveGpu?: boolean; + constructor(gpuIndices?: string, maxTrialNumPerGpu?: number, useActiveGpu?: boolean) { if (gpuIndices !== undefined) { this.gpuIndices = 
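On Windows, `tarAdd` above writes and runs a small Python script that walks the source directory and adds each file under its bare name; a standalone sketch of the same `tarfile` logic:

```python
import os
import tarfile
import tempfile

def tar_add(tar_path, source_path):
    # Mirrors the generated script: walk source_path and add every file under
    # its bare name (arcname), writing a gzip-compressed tar.
    with tarfile.open(tar_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(source_path):
            for name in files:
                tar.add(os.path.join(root, name), arcname=name)

with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    with open(os.path.join(src, "trial.log"), "w") as f:
        f.write("ok")
    out = os.path.join(dst, "out.tar.gz")
    tar_add(out, src)
    with tarfile.open(out) as tar:
        extracted_names = tar.getnames()

assert extracted_names == ["trial.log"]
```

On Linux/macOS the same result comes from `tar -czf`, as the non-Windows branch shows.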
gpuIndices; } + if (maxTrialNumPerGpu !== undefined) { + this.maxTrialNumPerGpu = maxTrialNumPerGpu; + } + if (useActiveGpu !== undefined) { + this.useActiveGpu = useActiveGpu; + } } } @@ -117,13 +125,15 @@ class LocalTrainingService implements TrainingService { private rootDir!: string; private trialSequenceId: number; private gpuScheduler!: GPUScheduler; - private occupiedGpuIndices: Set; + private occupiedGpuIndexNumMap: Map; private designatedGpuIndices!: Set; private log: Logger; private localTrailConfig?: TrialConfig; private localConfig?: LocalConfig; - private isMultiPhase: boolean = false; + private isMultiPhase: boolean; private jobStreamMap: Map; + private maxTrialNumPerGpu: number; + private useActiveGpu: boolean; constructor() { this.eventEmitter = new EventEmitter(); @@ -135,7 +145,10 @@ class LocalTrainingService implements TrainingService { this.trialSequenceId = -1; this.jobStreamMap = new Map(); this.log.info('Construct local machine training service.'); - this.occupiedGpuIndices = new Set(); + this.occupiedGpuIndexNumMap = new Map(); + this.maxTrialNumPerGpu = 1; + this.useActiveGpu = false; + this.isMultiPhase = false; } public async run(): Promise { @@ -304,6 +317,13 @@ class LocalTrainingService implements TrainingService { throw new Error('gpuIndices can not be empty if specified.'); } } + if (this.localConfig.maxTrialNumPerGpu !== undefined) { + this.maxTrialNumPerGpu = this.localConfig.maxTrialNumPerGpu; + } + + if (this.localConfig.useActiveGpu !== undefined) { + this.useActiveGpu = this.localConfig.useActiveGpu; + } break; case TrialConfigMetadataKey.MULTI_PHASE: this.isMultiPhase = (value === 'true' || value === 'True'); @@ -356,7 +376,14 @@ class LocalTrainingService implements TrainingService { if (trialJob.gpuIndices !== undefined && trialJob.gpuIndices.length > 0 && this.gpuScheduler !== undefined) { if (oldStatus === 'RUNNING' && trialJob.status !== 'RUNNING') { for (const index of trialJob.gpuIndices) { - 
this.occupiedGpuIndices.delete(index); + let num: number | undefined = this.occupiedGpuIndexNumMap.get(index); + if(num === undefined) { + throw new Error(`gpu resource schedule error`); + } else if(num === 1) { + this.occupiedGpuIndexNumMap.delete(index); + } else { + this.occupiedGpuIndexNumMap.set(index, num - 1) + } } } } @@ -396,8 +423,14 @@ class LocalTrainingService implements TrainingService { return [true, resource]; } - let selectedGPUIndices: number[] = this.gpuScheduler.getAvailableGPUIndices() - .filter((index: number) => !this.occupiedGpuIndices.has(index)); + let selectedGPUIndices: number[] = []; + let availableGpuIndices: number[] = this.gpuScheduler.getAvailableGPUIndices(this.useActiveGpu, this.occupiedGpuIndexNumMap); + for(let index of availableGpuIndices) { + let num: number | undefined = this.occupiedGpuIndexNumMap.get(index); + if(num === undefined || num < this.maxTrialNumPerGpu) { + selectedGPUIndices.push(index); + } + } if (this.designatedGpuIndices !== undefined) { this.checkSpecifiedGpuIndices(); @@ -428,7 +461,12 @@ class LocalTrainingService implements TrainingService { private occupyResource(resource: {gpuIndices: number[]}): void { if (this.gpuScheduler !== undefined) { for (const index of resource.gpuIndices) { - this.occupiedGpuIndices.add(index); + let num: number | undefined = this.occupiedGpuIndexNumMap.get(index); + if(num === undefined) { + this.occupiedGpuIndexNumMap.set(index, 1) + } else { + this.occupiedGpuIndexNumMap.set(index, num + 1) + } } } } diff --git a/src/nni_manager/training_service/remote_machine/gpuScheduler.ts b/src/nni_manager/training_service/remote_machine/gpuScheduler.ts index 77c99bda46..766a57d7aa 100644 --- a/src/nni_manager/training_service/remote_machine/gpuScheduler.ts +++ b/src/nni_manager/training_service/remote_machine/gpuScheduler.ts @@ -23,7 +23,8 @@ import * as assert from 'assert'; import { getLogger, Logger } from '../../common/log'; import { randomSelect } from '../../common/utils'; import 
{ GPUInfo } from '../common/gpuData'; -import { parseGpuIndices, RemoteMachineMeta, RemoteMachineScheduleResult, ScheduleResultType, SSHClientManager } from './remoteMachineData'; +import { RemoteMachineTrialJobDetail, parseGpuIndices, RemoteMachineMeta, RemoteMachineScheduleResult, ScheduleResultType, SSHClientManager } from './remoteMachineData'; +import { TrialJobDetail } from 'common/trainingService'; /** * A simple GPU scheduler implementation @@ -45,7 +46,7 @@ export class GPUScheduler { * Schedule a machine according to the constraints (requiredGPUNum) * @param requiredGPUNum required GPU number */ - public scheduleMachine(requiredGPUNum: number, trialJobId : string) : RemoteMachineScheduleResult { + public scheduleMachine(requiredGPUNum: number, trialJobDetail : RemoteMachineTrialJobDetail) : RemoteMachineScheduleResult { assert(requiredGPUNum >= 0); const allRMs: RemoteMachineMeta[] = Array.from(this.machineSSHClientMap.keys()); assert(allRMs.length > 0); @@ -66,7 +67,7 @@ export class GPUScheduler { // Currenty the requireGPUNum parameter for all trial jobs are identical. 
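The availability rule this patch introduces in both the local and remote schedulers (unoccupied-and-idle, or occupied below the sharing cap) can be sketched outside of NNI. This is an illustrative Python sketch, not NNI's actual API: the function name `available_gpu_indices` and the `(index, active_process_num)` tuples standing in for `GPUInfo` are assumptions made for the example.

```python
# Sketch of the GPU-availability rule from this patch: a GPU index is
# schedulable when it is unoccupied and idle (or `use_active_gpu` opts into
# GPUs with foreign processes), or when it already runs fewer of our own
# trials than `max_trial_num_per_gpu`.

def available_gpu_indices(gpu_infos, occupied, max_trial_num_per_gpu=1, use_active_gpu=False):
    """gpu_infos: list of (index, active_process_num); occupied: dict index -> running trial count."""
    selected = []
    for index, active_process_num in gpu_infos:
        num = occupied.get(index)
        if num is None:
            # Fresh GPU: require it to be idle unless the user opts into active GPUs.
            if use_active_gpu or active_process_num == 0:
                selected.append(index)
        elif num < max_trial_num_per_gpu:
            # Already occupied by our own trials: allow sharing up to the cap.
            selected.append(index)
    return selected
```

With the default cap of one trial per GPU this reduces to the old exclusive-reservation behavior; raising `max_trial_num_per_gpu` enables the GPU sharing the patch is about.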
         if (requiredGPUNum > 0) {
             // Trial job requires GPU
-            const result: RemoteMachineScheduleResult | undefined = this.scheduleGPUHost(requiredGPUNum, trialJobId);
+            const result: RemoteMachineScheduleResult | undefined = this.scheduleGPUHost(requiredGPUNum, trialJobDetail);
             if (result !== undefined) {
                 return result;
             }
@@ -74,9 +75,9 @@ export class GPUScheduler {
             // Trail job does not need GPU
             const allocatedRm: RemoteMachineMeta = this.selectMachine(allRMs);
 
-            return this.allocateHost(requiredGPUNum, allocatedRm, [], trialJobId);
+            return this.allocateHost(requiredGPUNum, allocatedRm, [], trialJobDetail);
         }
-        this.log.warning(`Scheduler: trialJob id ${trialJobId}, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU `);
+        this.log.warning(`Scheduler: trialJob id ${trialJobDetail.id}, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU `);
 
         return {
             resultType : ScheduleResultType.TMP_NO_AVAILABLE_GPU,
@@ -87,21 +88,35 @@ export class GPUScheduler {
     /**
      * remove the job's gpu reversion
     */
-    public removeGpuReservation(trialJobId: string, rmMeta?: RemoteMachineMeta): void {
-        // If remote machine has no GPU, gpuReservcation is not initialized, so check if it's undefined
-        if (rmMeta !== undefined && rmMeta.gpuReservation !== undefined) {
-            rmMeta.gpuReservation.forEach((reserveTrialJobId : string, gpuIndex : number) => {
-                if (reserveTrialJobId === trialJobId) {
-                    rmMeta.gpuReservation.delete(gpuIndex);
+    public removeGpuReservation(trialJobId: string, trialJobMap: Map<string, RemoteMachineTrialJobDetail>): void {
+        let trialJobDetail: RemoteMachineTrialJobDetail | undefined = trialJobMap.get(trialJobId);
+        if(trialJobDetail === undefined) {
+            throw new Error(`could not get trialJobDetail by id ${trialJobId}`);
+        }
+        if (trialJobDetail.rmMeta !== undefined &&
+            trialJobDetail.rmMeta.occupiedGpuIndexMap !== undefined &&
+            trialJobDetail.gpuIndices !== undefined &&
+            trialJobDetail.gpuIndices.length > 0) {
+            for (const gpuInfo of trialJobDetail.gpuIndices) {
+                let num: number | undefined = trialJobDetail.rmMeta.occupiedGpuIndexMap.get(gpuInfo.index);
+                if(num !== undefined) {
+                    if(num === 1) {
+                        trialJobDetail.rmMeta.occupiedGpuIndexMap.delete(gpuInfo.index);
+                    } else {
+                        trialJobDetail.rmMeta.occupiedGpuIndexMap.set(gpuInfo.index, num - 1)
+                    }
                 }
-            });
+            }
         }
+        trialJobDetail.gpuIndices = [];
+        trialJobMap.set(trialJobId, trialJobDetail);
     }
 
-    private scheduleGPUHost(requiredGPUNum: number, trialJobId: string): RemoteMachineScheduleResult | undefined {
+    private scheduleGPUHost(requiredGPUNum: number, trialJobDetail: RemoteMachineTrialJobDetail): RemoteMachineScheduleResult | undefined {
         const totalResourceMap: Map = this.gpuResourceDetection();
         const qualifiedRMs: RemoteMachineMeta[] = [];
         totalResourceMap.forEach((gpuInfos: GPUInfo[], rmMeta: RemoteMachineMeta) => {
+            if (gpuInfos !== undefined && gpuInfos.length >= requiredGPUNum) {
                 qualifiedRMs.push(rmMeta);
             }
@@ -110,7 +125,7 @@ export class GPUScheduler {
         const allocatedRm: RemoteMachineMeta = this.selectMachine(qualifiedRMs);
         const gpuInfos: GPUInfo[] | undefined = totalResourceMap.get(allocatedRm);
         if (gpuInfos !== undefined) { // should always true
-            return this.allocateHost(requiredGPUNum, allocatedRm, gpuInfos, trialJobId);
+            return this.allocateHost(requiredGPUNum, allocatedRm, gpuInfos, trialJobDetail);
         } else {
             assert(false, 'gpuInfos is undefined');
         }
@@ -130,9 +145,6 @@ export class GPUScheduler {
         // Assgin totoal GPU count as init available GPU number
         if (rmMeta.gpuSummary !== undefined) {
             const availableGPUs: GPUInfo[] = [];
-            if (rmMeta.gpuReservation === undefined) {
-                rmMeta.gpuReservation = new Map();
-            }
             const designatedGpuIndices: Set<number> | undefined = parseGpuIndices(rmMeta.gpuIndices);
             if (designatedGpuIndices !== undefined) {
                 for (const gpuIndex of designatedGpuIndices) {
@@ -145,10 +157,20 @@ export class GPUScheduler {
             rmMeta.gpuSummary.gpuInfos.forEach((gpuInfo: GPUInfo) => {
                 // if the GPU has active process, OR be reserved by a job,
                 // or index not in gpuIndices configuration in machineList,
+                // or trial number on a GPU reach max number,
                 // We should NOT allocate this GPU
-                if (gpuInfo.activeProcessNum === 0 && !rmMeta.gpuReservation.has(gpuInfo.index)
-                    && (designatedGpuIndices === undefined || designatedGpuIndices.has(gpuInfo.index))) {
-                    availableGPUs.push(gpuInfo);
+                // if users set useActiveGpu, use the gpu whether there is another activeProcess
+                if (designatedGpuIndices === undefined || designatedGpuIndices.has(gpuInfo.index)) {
+                    if(rmMeta.occupiedGpuIndexMap !== undefined) {
+                        let num = rmMeta.occupiedGpuIndexMap.get(gpuInfo.index);
+                        let maxTrialNumPerGpu: number = rmMeta.maxTrialNumPerGpu? rmMeta.maxTrialNumPerGpu: 1;
+                        if((num === undefined && (!rmMeta.useActiveGpu && gpuInfo.activeProcessNum === 0 || rmMeta.useActiveGpu)) ||
+                            (num !== undefined && num < maxTrialNumPerGpu)) {
+                            availableGPUs.push(gpuInfo);
+                        }
+                    } else {
+                        throw new Error(`occupiedGpuIndexMap initialize error!`);
+                    }
                 }
             });
             totalResourceMap.set(rmMeta, availableGPUs);
@@ -170,14 +192,22 @@ export class GPUScheduler {
     }
 
     private allocateHost(requiredGPUNum: number, rmMeta: RemoteMachineMeta,
-                         gpuInfos: GPUInfo[], trialJobId: string): RemoteMachineScheduleResult {
+                         gpuInfos: GPUInfo[], trialJobDetail: RemoteMachineTrialJobDetail): RemoteMachineScheduleResult {
         assert(gpuInfos.length >= requiredGPUNum);
         const allocatedGPUs: GPUInfo[] = this.selectGPUsForTrial(gpuInfos, requiredGPUNum);
 
         allocatedGPUs.forEach((gpuInfo: GPUInfo) => {
-            rmMeta.gpuReservation.set(gpuInfo.index, trialJobId);
+            if(rmMeta.occupiedGpuIndexMap !== undefined) {
+                let num = rmMeta.occupiedGpuIndexMap.get(gpuInfo.index);
+                if(num === undefined) {
+                    num = 0;
+                }
+                rmMeta.occupiedGpuIndexMap.set(gpuInfo.index, num + 1);
+            }else {
+                throw new Error(`Machine ${rmMeta.ip} occupiedGpuIndexMap initialize error!`);
+            }
         });
-
+        trialJobDetail.gpuIndices = allocatedGPUs;
+        trialJobDetail.rmMeta = rmMeta;
         return {
             resultType: ScheduleResultType.SUCCEED,
             scheduleInfo: {
diff --git
a/src/nni_manager/training_service/remote_machine/remoteMachineData.ts b/src/nni_manager/training_service/remote_machine/remoteMachineData.ts
index 0a825e1cfa..c9a2112090 100644
--- a/src/nni_manager/training_service/remote_machine/remoteMachineData.ts
+++ b/src/nni_manager/training_service/remote_machine/remoteMachineData.ts
@@ -23,7 +23,7 @@
 import * as fs from 'fs';
 import { Client, ConnectConfig } from 'ssh2';
 import { Deferred } from 'ts-deferred';
 import { JobApplicationForm, TrialJobDetail, TrialJobStatus } from '../../common/trainingService';
-import { GPUSummary } from '../common/gpuData';
+import { GPUSummary, GPUInfo } from '../common/gpuData';
 
 /**
  * Metadata of remote machine for configuration and statuc query
@@ -36,20 +36,23 @@ export class RemoteMachineMeta {
     public readonly sshKeyPath?: string;
     public readonly passphrase?: string;
     public gpuSummary : GPUSummary | undefined;
-    // GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU
-    public gpuReservation : Map<number, string>;
     public readonly gpuIndices?: string;
+    public readonly maxTrialNumPerGpu?: number;
+    public occupiedGpuIndexMap: Map<number, number>;
+    public readonly useActiveGpu?: boolean = false;
 
     constructor(ip : string, port : number, username : string, passwd : string,
-                sshKeyPath: string, passphrase : string, gpuIndices?: string) {
+                sshKeyPath: string, passphrase : string, gpuIndices?: string, maxTrialNumPerGpu?: number, useActiveGpu?: boolean) {
         this.ip = ip;
         this.port = port;
         this.username = username;
         this.passwd = passwd;
         this.sshKeyPath = sshKeyPath;
         this.passphrase = passphrase;
-        this.gpuReservation = new Map();
         this.gpuIndices = gpuIndices;
+        this.maxTrialNumPerGpu = maxTrialNumPerGpu;
+        this.occupiedGpuIndexMap = new Map();
+        this.useActiveGpu = useActiveGpu;
     }
 }
@@ -97,6 +100,7 @@ export class RemoteMachineTrialJobDetail implements TrialJobDetail {
     public sequenceId: number;
     public rmMeta?: RemoteMachineMeta;
     public isEarlyStopped?: boolean;
+    public gpuIndices: GPUInfo[];
 
     constructor(id: string, status: TrialJobStatus, submitTime: number,
                 workingDirectory: string, form: JobApplicationForm, sequenceId: number) {
@@ -107,6 +111,7 @@
         this.form = form;
         this.sequenceId = sequenceId;
         this.tags = [];
+        this.gpuIndices = []
     }
 }
diff --git a/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts b/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
index d1a2a379a9..c471589948 100644
--- a/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
+++ b/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
@@ -36,7 +36,7 @@
 import { ObservableTimer } from '../../common/observableTimer';
 import {
     HostJobApplicationForm, HyperParameters, JobApplicationForm, TrainingService,
     TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
 } from '../../common/trainingService';
-import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir,getIPV4Address } from '../../common/utils';
+import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir,getIPV4Address, getVersion, unixPathJoin } from '../../common/utils';
 import { GPUSummary } from '../common/gpuData';
 import { TrialConfig } from '../common/trialConfig';
 import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
@@ -48,10 +48,9 @@
 } from './remoteMachineData';
 import { GPU_INFO_COLLECTOR_FORMAT_LINUX } from '../common/gpuData';
 import { SSHClientUtility } from './sshClientUtility';
-import { validateCodeDir } from '../common/util';
+import { validateCodeDir, execRemove, execMkdir, execCopydir } from '../common/util';
 import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer';
 import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
-import { mkDirP, getVersion } from '../../common/utils';
 
 /**
  * Training Service implementation for Remote Machine (Linux)
@@ -234,7 +233,7 @@
         } else if (form.jobType === 'TRIAL') {
             // Generate trial job id(random)
             const trialJobId: string = uniqueString(5);
-            const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+            const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
 
             const trialJobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
                 trialJobId,
@@ -283,7 +282,7 @@
     private updateGpuReservation() {
         for (const [key, value] of this.trialJobsMap) {
             if(!['WAITING', 'RUNNING'].includes(value.status)) {
-                this.gpuScheduler.removeGpuReservation(value.id, value.rmMeta);
+                this.gpuScheduler.removeGpuReservation(key, this.trialJobsMap);
             }
         };
     }
@@ -354,7 +353,7 @@
             case TrialConfigMetadataKey.MACHINE_LIST:
                 await this.setupConnections(value);
                 //remove local temp files
-                await cpp.exec(`rm -rf ${this.getLocalGpuMetricCollectorDir()}`);
+                await execRemove(this.getLocalGpuMetricCollectorDir());
                 break;
             case TrialConfigMetadataKey.TRIAL_CONFIG:
                 const remoteMachineTrailConfig: TrialConfig = JSON.parse(value);
@@ -417,7 +416,7 @@
     private async cleanupConnections(): Promise<void> {
         try{
             for (const [rmMeta, sshClientManager] of this.machineSSHClientMap.entries()) {
-                let jobpidPath: string = path.join(this.getRemoteScriptsPath(rmMeta.username), 'pid');
+                let jobpidPath: string = unixPathJoin(this.getRemoteScriptsPath(rmMeta.username), 'pid');
                 let client: Client | undefined = sshClientManager.getFirstSSHClient();
                 if(client) {
                     await SSHClientUtility.remoteExeCommand(`pkill -P \`cat ${jobpidPath}\``, client);
@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService {
     */
     private getLocalGpuMetricCollectorDir(): string {
         let userName: string = path.basename(os.homedir()); //get current user name of os
-        return `${os.tmpdir()}/${userName}/nni/scripts/`;
+        return path.join(os.tmpdir(), userName, 'nni', 'scripts');
     }
 
     /**
@@ -447,14 +446,14 @@
     */
     private async generateGpuMetricsCollectorScript(userName: string): Promise<void> {
         let gpuMetricCollectorScriptFolder : string = this.getLocalGpuMetricCollectorDir();
-        await cpp.exec(`mkdir -p ${path.join(gpuMetricCollectorScriptFolder, userName)}`);
+        await execMkdir(path.join(gpuMetricCollectorScriptFolder, userName));
         //generate gpu_metrics_collector.sh
         let gpuMetricsCollectorScriptPath: string = path.join(gpuMetricCollectorScriptFolder, userName, 'gpu_metrics_collector.sh');
         const remoteGPUScriptsDir: string = this.getRemoteScriptsPath(userName); // This directory is used to store gpu_metrics and pid created by script
         const gpuMetricsCollectorScriptContent: string = String.Format(
             GPU_INFO_COLLECTOR_FORMAT_LINUX,
             remoteGPUScriptsDir,
-            path.join(remoteGPUScriptsDir, 'pid'),
+            unixPathJoin(remoteGPUScriptsDir, 'pid'),
         );
         await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' });
     }
@@ -481,7 +480,7 @@
     private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
         // Create root working directory after ssh connection is ready
         await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later
-        const nniRootDir: string = `${os.tmpdir()}/nni`;
+        const nniRootDir: string = unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni');
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${this.remoteExpRootDir}`, conn);
         // Copy NNI scripts to remote expeirment
working directory
@@ -490,15 +489,15 @@
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
         await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn);
         //copy gpu_metrics_collector.sh to remote
-        await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);
+        await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);
 
         //Begin to execute gpu_metrics_collection scripts
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
 
         this.timer.subscribe(
             async (tick: number) => {
                 const cmdresult: RemoteCommandResult = await SSHClientUtility.remoteExeCommand(
-                    `tail -n 1 ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
+                    `tail -n 1 ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
                 if (cmdresult && cmdresult.stdout) {
                     rmMeta.gpuSummary = JSON.parse(cmdresult.stdout);
                 }
@@ -522,7 +521,7 @@
             return deferred.promise;
         }
         // get an ssh client from scheduler
-        const rmScheduleResult: RemoteMachineScheduleResult = this.gpuScheduler.scheduleMachine(this.trialConfig.gpuNum, trialJobId);
+        const rmScheduleResult: RemoteMachineScheduleResult = this.gpuScheduler.scheduleMachine(this.trialConfig.gpuNum, trialJobDetail);
         if (rmScheduleResult.resultType === ScheduleResultType.REQUIRE_EXCEED_TOTAL) {
             const errorMessage : string = `Required GPU number ${this.trialConfig.gpuNum} is too large, no machine can meet`;
             this.log.error(errorMessage);
@@ -531,7 +530,7 @@
         } else if (rmScheduleResult.resultType === ScheduleResultType.SUCCEED
             && rmScheduleResult.scheduleInfo !== undefined) {
             const rmScheduleInfo : RemoteMachineScheduleInfo = rmScheduleResult.scheduleInfo;
-            const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+            const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
 
             trialJobDetail.rmMeta = rmScheduleInfo.rmMeta;
@@ -543,6 +542,7 @@
             trialJobDetail.url = `file://${rmScheduleInfo.rmMeta.ip}:${trialWorkingFolder}`;
             trialJobDetail.startTime = Date.now();
 
+            this.trialJobsMap.set(trialJobId, trialJobDetail);
             deferred.resolve(true);
         } else if (rmScheduleResult.resultType === ScheduleResultType.TMP_NO_AVAILABLE_GPU) {
             this.log.info(`Right now no available GPU can be allocated for trial ${trialJobId}, will try to schedule later`);
@@ -575,7 +575,7 @@
         const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);
 
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${trialWorkingFolder}`, sshClient);
-        await SSHClientUtility.remoteExeCommand(`mkdir -p ${path.join(trialWorkingFolder, '.nni')}`, sshClient);
+        await SSHClientUtility.remoteExeCommand(`mkdir -p ${unixPathJoin(trialWorkingFolder, '.nni')}`, sshClient);
 
         // RemoteMachineRunShellFormat is the run shell format string,
         // See definition in remoteMachineData.ts
@@ -603,20 +603,20 @@
             getExperimentId(),
             trialJobDetail.sequenceId.toString(),
             this.isMultiPhase,
-            path.join(trialWorkingFolder, '.nni', 'jobpid'),
+            unixPathJoin(trialWorkingFolder, '.nni', 'jobpid'),
             command,
             nniManagerIp,
             this.remoteRestServerPort,
             version,
             this.logCollection,
-            path.join(trialWorkingFolder, '.nni', 'code')
+            unixPathJoin(trialWorkingFolder, '.nni', 'code')
         )
         //create tmp trial working folder locally.
-        await cpp.exec(`mkdir -p ${path.join(trialLocalTempFolder, '.nni')}`);
+        await execMkdir(path.join(trialLocalTempFolder, '.nni'));
         //create tmp trial working folder locally.
-        await cpp.exec(`cp -r ${this.trialConfig.codeDir}/* ${trialLocalTempFolder}`);
+        await execCopydir(path.join(this.trialConfig.codeDir, '*'), trialLocalTempFolder);
         const installScriptContent : string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
         // Write NNI installation file to local tmp files
         await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' });
@@ -626,7 +626,7 @@
         // Copy files in codeDir to remote working directory
         await SSHClientUtility.copyDirectoryToRemote(trialLocalTempFolder, trialWorkingFolder, sshClient, this.remoteOS);
         // Execute command in remote machine
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(trialWorkingFolder, 'run.sh')}`, sshClient);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(trialWorkingFolder, 'run.sh')}`, sshClient);
     }
 
     private async runHostJob(form: HostJobApplicationForm): Promise {
@@ -646,8 +646,8 @@
         );
         await fs.promises.writeFile(path.join(localDir, 'run.sh'), runScriptContent, { encoding: 'utf8' });
         await SSHClientUtility.copyFileToRemote(
-            path.join(localDir, 'run.sh'), path.join(remoteDir, 'run.sh'), sshClient);
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteDir, 'run.sh')}`, sshClient);
+            path.join(localDir, 'run.sh'), unixPathJoin(remoteDir, 'run.sh'), sshClient);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteDir, 'run.sh')}`, sshClient);
         const jobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
             jobId, 'RUNNING', Date.now(), remoteDir, form, this.generateSequenceId()
@@ -672,7 +672,7 @@
     private async updateTrialJobStatus(trialJob: RemoteMachineTrialJobDetail, sshClient: Client): Promise {
         const deferred: Deferred = new Deferred();
         const jobpidPath: string = this.getJobPidPath(trialJob.id);
-        const trialReturnCodeFilePath: string = path.join(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
+        const trialReturnCodeFilePath: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
         try {
             const killResult: number = (await SSHClientUtility.remoteExeCommand(`kill -0 \`cat ${jobpidPath}\``, sshClient)).exitCode;
             // if the process of jobpid is not alive any more
@@ -712,15 +712,15 @@
     }
 
     private getRemoteScriptsPath(userName: string): string {
-        return path.join(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
+        return unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
     }
 
     private getHostJobRemoteDir(jobId: string): string {
-        return path.join(this.remoteExpRootDir, 'hostjobs', jobId);
+        return unixPathJoin(this.remoteExpRootDir, 'hostjobs', jobId);
     }
 
     private getRemoteExperimentRootDir(): string{
-        return path.join(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
+        return unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
     }
 
     public get MetricsEmitter() : EventEmitter {
@@ -735,9 +735,9 @@
         let jobpidPath: string;
         if (trialJobDetail.form.jobType === 'TRIAL') {
-            jobpidPath = path.join(trialJobDetail.workingDirectory, '.nni', 'jobpid');
+            jobpidPath = unixPathJoin(trialJobDetail.workingDirectory, '.nni', 'jobpid');
         } else if (trialJobDetail.form.jobType === 'HOST') {
-            jobpidPath = path.join(this.getHostJobRemoteDir(jobId), 'jobpid');
+            jobpidPath = unixPathJoin(this.getHostJobRemoteDir(jobId), 'jobpid');
         } else {
             throw new Error(`Job type not supported: ${trialJobDetail.form.jobType}`);
         }
@@ -751,14 +751,14 @@
             throw new Error('sshClient is undefined.');
         }
 
-        const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+        const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
         const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);
         const fileName: string = generateParamFileName(hyperParameters);
         const localFilepath: string = path.join(trialLocalTempFolder, fileName);
         await fs.promises.writeFile(localFilepath, hyperParameters.value, { encoding: 'utf8' });
-        await SSHClientUtility.copyFileToRemote(localFilepath, path.join(trialWorkingFolder, fileName), sshClient);
+        await SSHClientUtility.copyFileToRemote(localFilepath, unixPathJoin(trialWorkingFolder, fileName), sshClient);
     }
 
     private generateSequenceId(): number {
diff --git a/src/nni_manager/training_service/remote_machine/sshClientUtility.ts b/src/nni_manager/training_service/remote_machine/sshClientUtility.ts
index f1f227ecc5..bd3aa0cf42 100644
--- a/src/nni_manager/training_service/remote_machine/sshClientUtility.ts
+++ b/src/nni_manager/training_service/remote_machine/sshClientUtility.ts
@@ -28,8 +28,9 @@
 import * as stream from 'stream';
 import { Deferred } from 'ts-deferred';
 import { NNIError, NNIErrorNames } from '../../common/errors';
 import { getLogger, Logger } from '../../common/log';
-import { uniqueString, getRemoteTmpDir } from '../../common/utils';
+import { uniqueString, getRemoteTmpDir, unixPathJoin } from '../../common/utils';
 import { RemoteCommandResult } from './remoteMachineData';
+import { execRemove, tarAdd } from '../common/util';
 
 /**
 *
@@ -47,13 +48,13 @@
         const deferred: Deferred = new Deferred();
         const tmpTarName: string = `${uniqueString(10)}.tar.gz`;
         const localTarPath: string
= path.join(os.tmpdir(), tmpTarName); - const remoteTarPath: string = path.join(getRemoteTmpDir(remoteOS), tmpTarName); + const remoteTarPath: string = unixPathJoin(getRemoteTmpDir(remoteOS), tmpTarName); // Compress files in local directory to experiment root directory - await cpp.exec(`tar -czf ${localTarPath} -C ${localDirectory} .`); + await tarAdd(localTarPath, localDirectory); // Copy the compressed file to remoteDirectory and delete it await copyFileToRemote(localTarPath, remoteTarPath, sshClient); - await cpp.exec(`rm ${localTarPath}`); + await execRemove(localTarPath); // Decompress the remote compressed file in and delete it await remoteExeCommand(`tar -oxzf ${remoteTarPath} -C ${remoteDirectory}`, sshClient); await remoteExeCommand(`rm ${remoteTarPath}`, sshClient); diff --git a/src/sdk/pynni/nni/batch_tuner/batch_tuner.py b/src/sdk/pynni/nni/batch_tuner/batch_tuner.py index 8e4e1a3e52..8e08fb3f82 100644 --- a/src/sdk/pynni/nni/batch_tuner/batch_tuner.py +++ b/src/sdk/pynni/nni/batch_tuner/batch_tuner.py @@ -22,11 +22,7 @@ class BatchTuner """ -import copy -from enum import Enum, unique -import random - -import numpy as np +import logging import nni from nni.tuner import Tuner @@ -35,6 +31,7 @@ class BatchTuner CHOICE = 'choice' VALUE = '_value' +logger = logging.getLogger('batch_tuner_AutoML') class BatchTuner(Tuner): """ @@ -46,7 +43,7 @@ class BatchTuner(Tuner): } } """ - + def __init__(self): self.count = -1 self.values = [] @@ -54,14 +51,14 @@ def __init__(self): def is_valid(self, search_space): """ Check the search space is valid: only contains 'choice' type - + Parameters ---------- search_space : dict """ if not len(search_space) == 1: raise RuntimeError('BatchTuner only supprt one combined-paramreters key.') - + for param in search_space: param_type = search_space[param][TYPE] if not param_type == CHOICE: @@ -73,8 +70,8 @@ def is_valid(self, search_space): return None def update_search_space(self, search_space): - """Update the search space - + 
"""Update the search space + Parameters ---------- search_space : dict @@ -88,8 +85,8 @@ def generate_parameters(self, parameter_id): ---------- parameter_id : int """ - self.count +=1 - if self.count>len(self.values)-1: + self.count += 1 + if self.count > len(self.values) - 1: raise nni.NoMoreTrialError('no more parameters now.') return self.values[self.count] @@ -97,4 +94,31 @@ def receive_trial_result(self, parameter_id, parameters, value): pass def import_data(self, data): - pass + """Import additional data for tuning + Parameters + ---------- + data: + a list of dictionarys, each of which has at least two keys, 'parameter' and 'value' + """ + if len(self.values) == 0: + logger.info("Search space has not been initialized, skip this data import") + return + + self.values = self.values[(self.count+1):] + self.count = -1 + + _completed_num = 0 + for trial_info in data: + logger.info("Importing data, current processing progress %s / %s", _completed_num, len(data)) + # simply validate data format + assert "parameter" in trial_info + _params = trial_info["parameter"] + assert "value" in trial_info + _value = trial_info['value'] + if not _value: + logger.info("Useless trial data, value is %s, skip this trial data.", _value) + continue + _completed_num += 1 + if _params in self.values: + self.values.remove(_params) + logger.info("Successfully import data to batch tuner, total data: %d, imported data: %d.", len(data), _completed_num) diff --git a/src/sdk/pynni/nni/bohb_advisor/bohb_advisor.py b/src/sdk/pynni/nni/bohb_advisor/bohb_advisor.py index 042848038f..7617677842 100644 --- a/src/sdk/pynni/nni/bohb_advisor/bohb_advisor.py +++ b/src/sdk/pynni/nni/bohb_advisor/bohb_advisor.py @@ -31,7 +31,7 @@ from nni.protocol import CommandType, send from nni.msg_dispatcher_base import MsgDispatcherBase -from nni.utils import OptimizeMode, extract_scalar_reward +from nni.utils import OptimizeMode, extract_scalar_reward, randint_to_quniform from .config_generator import CG_BOHB @@ 
-443,6 +443,7 @@ def handle_update_search_space(self, data): search space of this experiment """ search_space = data + randint_to_quniform(search_space) cs = CS.ConfigurationSpace() for var in search_space: _type = str(search_space[var]["_type"]) diff --git a/src/sdk/pynni/nni/evolution_tuner/evolution_tuner.py b/src/sdk/pynni/nni/evolution_tuner/evolution_tuner.py index b46d560bed..8caf9c4a59 100644 --- a/src/sdk/pynni/nni/evolution_tuner/evolution_tuner.py +++ b/src/sdk/pynni/nni/evolution_tuner/evolution_tuner.py @@ -26,7 +26,7 @@ import numpy as np from nni.tuner import Tuner -from nni.utils import NodeType, OptimizeMode, extract_scalar_reward, split_index +from nni.utils import NodeType, OptimizeMode, extract_scalar_reward, split_index, randint_to_quniform import nni.parameter_expressions as parameter_expressions @@ -175,6 +175,7 @@ def update_search_space(self, search_space): search_space : dict """ self.searchspace_json = search_space + randint_to_quniform(self.searchspace_json) self.space = json2space(self.searchspace_json) self.random_state = np.random.RandomState() diff --git a/src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py b/src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py index 7590672de7..f346236a9c 100644 --- a/src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py +++ b/src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py @@ -31,7 +31,7 @@ from nni.protocol import CommandType, send from nni.msg_dispatcher_base import MsgDispatcherBase from nni.common import init_logger -from nni.utils import NodeType, OptimizeMode, extract_scalar_reward +from nni.utils import NodeType, OptimizeMode, extract_scalar_reward, randint_to_quniform import nni.parameter_expressions as parameter_expressions _logger = logging.getLogger(__name__) @@ -357,6 +357,7 @@ def handle_update_search_space(self, data): number of trial jobs """ self.searchspace_json = data + randint_to_quniform(self.searchspace_json) self.random_state = np.random.RandomState() def 
handle_trial_end(self, data): diff --git a/src/sdk/pynni/nni/hyperopt_tuner/hyperopt_tuner.py b/src/sdk/pynni/nni/hyperopt_tuner/hyperopt_tuner.py index 650d4c2ffc..0b203a8e73 100644 --- a/src/sdk/pynni/nni/hyperopt_tuner/hyperopt_tuner.py +++ b/src/sdk/pynni/nni/hyperopt_tuner/hyperopt_tuner.py @@ -27,7 +27,7 @@ import hyperopt as hp import numpy as np from nni.tuner import Tuner -from nni.utils import NodeType, OptimizeMode, extract_scalar_reward, split_index +from nni.utils import NodeType, OptimizeMode, extract_scalar_reward, split_index, randint_to_quniform logger = logging.getLogger('hyperopt_AutoML') @@ -153,14 +153,14 @@ def _add_index(in_x, parameter): Will change to format in hyperopt, like: {'dropout_rate': 0.8, 'conv_size': {'_index': 1, '_value': 3}, 'hidden_size': {'_index': 1, '_value': 512}} """ - if TYPE not in in_x: # if at the top level + if NodeType.TYPE not in in_x: # if at the top level out_y = dict() for key, value in parameter.items(): out_y[key] = _add_index(in_x[key], value) return out_y elif isinstance(in_x, dict): - value_type = in_x[TYPE] - value_format = in_x[VALUE] + value_type = in_x[NodeType.TYPE] + value_format = in_x[NodeType.VALUE] if value_type == "choice": choice_name = parameter[0] if isinstance(parameter, list) else parameter @@ -173,15 +173,14 @@ def _add_index(in_x, parameter): choice_value_format = item[1] if choice_key == choice_name: return { - INDEX: - pos, - VALUE: [ + NodeType.INDEX: pos, + NodeType.VALUE: [ choice_name, _add_index(choice_value_format, parameter[1]) ] } elif choice_name == item: - return {INDEX: pos, VALUE: item} + return {NodeType.INDEX: pos, NodeType.VALUE: item} else: return parameter @@ -232,6 +231,8 @@ def update_search_space(self, search_space): search_space : dict """ self.json = search_space + randint_to_quniform(self.json) + search_space_instance = json2space(self.json) rstate = np.random.RandomState() trials = hp.Trials() diff --git a/src/sdk/pynni/nni/metis_tuner/metis_tuner.py 
b/src/sdk/pynni/nni/metis_tuner/metis_tuner.py index 5701232ede..dd1e273280 100644 --- a/src/sdk/pynni/nni/metis_tuner/metis_tuner.py +++ b/src/sdk/pynni/nni/metis_tuner/metis_tuner.py @@ -133,7 +133,7 @@ def update_search_space(self, search_space): self.x_bounds[idx] = bounds self.x_types[idx] = 'discrete_int' elif key_type == 'randint': - self.x_bounds[idx] = [0, key_range[0]] + self.x_bounds[idx] = [key_range[0], key_range[1]] self.x_types[idx] = 'range_int' elif key_type == 'uniform': self.x_bounds[idx] = [key_range[0], key_range[1]] diff --git a/src/sdk/pynni/nni/smac_tuner/smac_tuner.py b/src/sdk/pynni/nni/smac_tuner/smac_tuner.py index d6217367ff..5a334973fc 100644 --- a/src/sdk/pynni/nni/smac_tuner/smac_tuner.py +++ b/src/sdk/pynni/nni/smac_tuner/smac_tuner.py @@ -21,21 +21,24 @@ smac_tuner.py """ -from nni.tuner import Tuner -from nni.utils import OptimizeMode, extract_scalar_reward - import sys import logging import numpy as np -import json_tricks -from enum import Enum, unique -from .convert_ss_to_scenario import generate_scenario + +from nni.tuner import Tuner +from nni.utils import OptimizeMode, extract_scalar_reward from smac.utils.io.cmd_reader import CMDReader from smac.scenario.scenario import Scenario from smac.facade.smac_facade import SMAC from smac.facade.roar_facade import ROAR from smac.facade.epils_facade import EPILS +from ConfigSpaceNNI import Configuration + +from .convert_ss_to_scenario import generate_scenario + +from nni.tuner import Tuner +from nni.utils import OptimizeMode, extract_scalar_reward, randint_to_quniform class SMACTuner(Tuner): @@ -57,6 +60,7 @@ def __init__(self, optimize_mode): self.update_ss_done = False self.loguniform_key = set() self.categorical_dict = {} + self.cs = None def _main_cli(self): """Main function of SMAC for CLI interface @@ -66,7 +70,7 @@ def _main_cli(self): instance optimizer """ - self.logger.info("SMAC call: %s" % (" ".join(sys.argv))) + self.logger.info("SMAC call: %s", " ".join(sys.argv)) 
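The `logger.info` change just above swaps eager `%`-interpolation for the logging module's lazy argument passing, so the string is only built when a handler actually emits the record. A minimal sketch of the difference (the `argv` value here is hypothetical, not from the diff):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("smac_AutoML")

argv = ["smac", "--scenario", "scenario.txt"]  # hypothetical command line

# Eager: the string is formatted even if INFO is disabled.
# logger.info("SMAC call: %s" % " ".join(argv))

# Lazy: formatting is deferred until the record is emitted.
logger.info("SMAC call: %s", " ".join(argv))
```

Besides skipping work for disabled levels, the lazy form lets tools like pylint check the format string against its arguments.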
cmd_reader = CMDReader() args, _ = cmd_reader.read_cmd() @@ -95,6 +99,7 @@ def _main_cli(self): # Create scenario-object scen = Scenario(args.scenario_file, []) + self.cs = scen.cs if args.mode == "SMAC": optimizer = SMAC( @@ -134,6 +139,7 @@ def update_search_space(self, search_space): search_space: search space """ + randint_to_quniform(search_space) if not self.update_ss_done: self.categorical_dict = generate_scenario(search_space) if self.categorical_dict is None: @@ -258,4 +264,45 @@ def generate_multiple_parameters(self, parameter_id_list): return params def import_data(self, data): - pass + """Import additional data for tuning + Parameters + ---------- + data: + a list of dictionaries, each of which has at least two keys, 'parameter' and 'value' + """ + _completed_num = 0 + for trial_info in data: + self.logger.info("Importing data, current processing progress %s / %s", _completed_num, len(data)) + # simply validate data format + assert "parameter" in trial_info + _params = trial_info["parameter"] + assert "value" in trial_info + _value = trial_info['value'] + if not _value: + self.logger.info("Useless trial data, value is %s, skipping this trial.", _value) + continue + # convert the keys in loguniform and categorical types + valid_entry = True + for key, value in _params.items(): + if key in self.loguniform_key: + _params[key] = np.log(value) + elif key in self.categorical_dict: + if value in self.categorical_dict[key]: + _params[key] = self.categorical_dict[key].index(value) + else: + self.logger.info("The value %s of key %s is not in search space.", str(value), key) + valid_entry = False + break + if not valid_entry: + continue + # start importing this data entry + _completed_num += 1 + config = Configuration(self.cs, values=_params) + if self.optimize_mode is OptimizeMode.Maximize: + _value = -_value + if self.first_one: + self.smbo_solver.nni_smac_receive_first_run(config, _value) + self.first_one = False + else: +
self.smbo_solver.nni_smac_receive_runs(config, _value) + self.logger.info("Successfully imported data to SMAC tuner, total data: %d, imported data: %d.", len(data), _completed_num) diff --git a/src/sdk/pynni/nni/smartparam.py b/src/sdk/pynni/nni/smartparam.py index eb2a7c8b5a..99a9d19084 100644 --- a/src/sdk/pynni/nni/smartparam.py +++ b/src/sdk/pynni/nni/smartparam.py @@ -36,7 +36,8 @@ 'qnormal', 'lognormal', 'qlognormal', - 'function_choice' + 'function_choice', + 'mutable_layer' ] @@ -78,6 +79,9 @@ def qlognormal(mu, sigma, q, name=None): def function_choice(*funcs, name=None): return random.choice(funcs)() + def mutable_layer(): + raise RuntimeError('Cannot call nni.mutable_layer in this mode') + else: def choice(options, name=None, key=None): @@ -113,6 +117,42 @@ def qlognormal(mu, sigma, q, name=None, key=None): def function_choice(funcs, name=None, key=None): return funcs[_get_param(key)]() + def mutable_layer( + mutable_id, + mutable_layer_id, + funcs, + funcs_args, + fixed_inputs, + optional_inputs, + optional_input_size=0): + '''Execute the chosen function with the chosen inputs.
+ Below is an example of the chosen function and inputs: + { + "mutable_id": { + "mutable_layer_id": { + "chosen_layer": "pool", + "chosen_inputs": ["out1", "out3"] + } + } + } + Parameters: + --------------- + mutable_id: the name of this mutable_layer block (which could have multiple mutable layers) + mutable_layer_id: the name of a mutable layer in this block + funcs: dict of function calls + funcs_args: dict of the arguments passed to each function in funcs + fixed_inputs: inputs that are always fed to the chosen function + optional_inputs: dict of optional inputs + optional_input_size: number of candidate inputs to be chosen + ''' + mutable_block = _get_param(mutable_id) + chosen_layer = mutable_block[mutable_layer_id]["chosen_layer"] + chosen_inputs = mutable_block[mutable_layer_id]["chosen_inputs"] + real_chosen_inputs = [optional_inputs[input_name] for input_name in chosen_inputs] + layer_out = funcs[chosen_layer]([fixed_inputs, real_chosen_inputs], *funcs_args[chosen_layer]) + + return layer_out + def _get_param(key): if trial._params is None: trial.get_next_parameter() diff --git a/src/sdk/pynni/nni/utils.py b/src/sdk/pynni/nni/utils.py index 4df75a58f1..164590dd0d 100644 --- a/src/sdk/pynni/nni/utils.py +++ b/src/sdk/pynni/nni/utils.py @@ -40,6 +40,7 @@ class OptimizeMode(Enum): Minimize = 'minimize' Maximize = 'maximize' + class NodeType: """Node Type class """ @@ -83,6 +84,7 @@ def extract_scalar_reward(value, scalar_key='default'): raise RuntimeError('Incorrect final result: the final result should be float/int, or a dict which has a key named "default" whose value is float/int.') return reward + def convert_dict2tuple(value): """ convert dict type to tuple to solve unhashable problem.
@@ -94,9 +96,30 @@ def convert_dict2tuple(value): else: return value + def init_dispatcher_logger(): """ Initialize dispatcher logging configuration""" logger_file_path = 'dispatcher.log' if dispatcher_env_vars.NNI_LOG_DIRECTORY is not None: logger_file_path = os.path.join(dispatcher_env_vars.NNI_LOG_DIRECTORY, logger_file_path) init_logger(logger_file_path, dispatcher_env_vars.NNI_LOG_LEVEL) + + +def randint_to_quniform(in_x): + if isinstance(in_x, dict): + if NodeType.TYPE in in_x.keys(): + if in_x[NodeType.TYPE] == 'randint': + value = in_x[NodeType.VALUE] + value.append(1) + + in_x[NodeType.TYPE] = 'quniform' + in_x[NodeType.VALUE] = value + + elif in_x[NodeType.TYPE] == 'choice': + randint_to_quniform(in_x[NodeType.VALUE]) + else: + for key in in_x.keys(): + randint_to_quniform(in_x[key]) + elif isinstance(in_x, list): + for temp in in_x: + randint_to_quniform(temp) diff --git a/src/webui/src/components/Overview.tsx b/src/webui/src/components/Overview.tsx index e68c683c2b..d610db5d36 100644 --- a/src/webui/src/components/Overview.tsx +++ b/src/webui/src/components/Overview.tsx @@ -192,17 +192,21 @@ class Overview extends React.Component<{}, OverviewState> { method: 'GET' }) .then(res => { - if (res.status === 200 && this._isMounted) { + if (res.status === 200) { const errors = res.data.errors; if (errors.length !== 0) { - this.setState({ - status: res.data.status, - errorStr: res.data.errors[0] - }); + if (this._isMounted) { + this.setState({ + status: res.data.status, + errorStr: res.data.errors[0] + }); + } } else { - this.setState({ - status: res.data.status, - }); + if (this._isMounted) { + this.setState({ + status: res.data.status, + }); + } } } }); @@ -254,7 +258,8 @@ class Overview extends React.Component<{}, OverviewState> { case 'SUCCEEDED': profile.succTrial += 1; const desJobDetail: Parameters = { - parameters: {} + parameters: {}, + intermediate: [] }; const duration = (tableData[item].endTime - tableData[item].startTime) / 1000; const acc = 
getFinal(tableData[item].finalMetricData); diff --git a/src/webui/src/components/TrialsDetail.tsx b/src/webui/src/components/TrialsDetail.tsx index 15ad531ba8..b6c29066b1 100644 --- a/src/webui/src/components/TrialsDetail.tsx +++ b/src/webui/src/components/TrialsDetail.tsx @@ -27,6 +27,11 @@ interface TrialDetailState { entriesInSelect: string; searchSpace: string; isMultiPhase: boolean; + isTableLoading: boolean; + whichGraph: string; + hyperCounts: number; // user click the hyper-parameter counts + durationCounts: number; + intermediateCounts: number; } class TrialsDetail extends React.Component<{}, TrialDetailState> { @@ -70,9 +75,14 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { experimentLogCollection: false, entriesTable: 20, entriesInSelect: '20', - isHasSearch: false, searchSpace: '', - isMultiPhase: false + whichGraph: '1', + isHasSearch: false, + isMultiPhase: false, + isTableLoading: false, + hyperCounts: 0, + durationCounts: 0, + intermediateCounts: 0 }; } @@ -85,6 +95,9 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { ]) .then(axios.spread((res, res1) => { if (res.status === 200 && res1.status === 200) { + if (this._isMounted === true) { + this.setState(() => ({ isTableLoading: true })); + } const trialJobs = res.data; const metricSource = res1.data; const trialTable: Array = []; @@ -175,6 +188,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { } if (this._isMounted) { this.setState(() => ({ + isTableLoading: false, tableListSource: trialTable })); } @@ -239,26 +253,26 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { } handleEntriesSelect = (value: string) => { - switch (value) { - case '20': - this.setState(() => ({ entriesTable: 20 })); - break; - case '50': - this.setState(() => ({ entriesTable: 50 })); - break; - case '100': - this.setState(() => ({ entriesTable: 100 })); - break; - case 'all': - const { tableListSource } = this.state; - if (this._isMounted) { - 
this.setState(() => ({ - entriesInSelect: 'all', - entriesTable: tableListSource.length - })); - } - break; - default: + // user select isn't 'all' + if (value !== 'all') { + if (this._isMounted) { + this.setState(() => ({ entriesTable: parseInt(value, 10) })); + } + } else { + const { tableListSource } = this.state; + if (this._isMounted) { + this.setState(() => ({ + entriesInSelect: 'all', + entriesTable: tableListSource.length + })); + } + } + } + + handleWhichTabs = (activeKey: string) => { + // const which = JSON.parse(activeKey); + if (this._isMounted) { + this.setState(() => ({ whichGraph: activeKey })); } } @@ -315,18 +329,21 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { const { tableListSource, searchResultSource, isHasSearch, isMultiPhase, - entriesTable, experimentPlatform, searchSpace, experimentLogCollection + entriesTable, experimentPlatform, searchSpace, experimentLogCollection, + whichGraph, isTableLoading } = this.state; const source = isHasSearch ? searchResultSource : tableListSource; return (
- + + {/* */} @@ -335,14 +352,16 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { - + + {/* */} - +
@@ -388,6 +407,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> { { if (wei > 6) { result = `${lastVal.toFixed(6)}`; } - if (status === 'SUCCEEDED') { - result = `${lastVal.toFixed(6)} (FINAL)`; - } else { - result = `${lastVal.toFixed(6)} (LATEST)`; - } - + } + if (status === 'SUCCEEDED') { + result = `${result} (FINAL)`; + } else { + result = `${result} (LATEST)`; } } else { result = '--'; diff --git a/src/webui/src/components/public-child/OpenRow.tsx b/src/webui/src/components/public-child/OpenRow.tsx index 1fde2fee32..6f66eb0c44 100644 --- a/src/webui/src/components/public-child/OpenRow.tsx +++ b/src/webui/src/components/public-child/OpenRow.tsx @@ -3,9 +3,10 @@ import * as copy from 'copy-to-clipboard'; import PaiTrialLog from '../public-child/PaiTrialLog'; import TrialLog from '../public-child/TrialLog'; import { TableObj } from '../../static/interface'; -import { Row, Tabs, Button, message } from 'antd'; +import { Row, Tabs, Button, message, Modal } from 'antd'; import { MANAGER_IP } from '../../static/const'; import '../../static/style/overview.scss'; +import '../../static/style/copyParameter.scss'; import JSONTree from 'react-json-tree'; const TabPane = Tabs.TabPane; @@ -17,43 +18,62 @@ interface OpenRowProps { } interface OpenRowState { - idList: Array; + isShowFormatModal: boolean; + formatStr: string; } class OpenRow extends React.Component { + public _isMounted: boolean; constructor(props: OpenRowProps) { super(props); this.state = { - idList: [''] + isShowFormatModal: false, + formatStr: '' }; + } + + showFormatModal = (record: TableObj) => { + // get copy parameters + const params = JSON.stringify(record.description.parameters, null, 4); + // open modal with format string + if (this._isMounted === true) { + this.setState(() => ({ isShowFormatModal: true, formatStr: params })); + } + } + hideFormatModal = () => { + // close modal, destroy state format string data + if (this._isMounted === true) { + this.setState(() => ({ 
isShowFormatModal: false, formatStr: '' })); + } } - copyParams = (record: TableObj) => { + copyParams = () => { // json format - const params = JSON.stringify(record.description.parameters, null, 4); - if (copy(params)) { + const { formatStr } = this.state; + if (copy(formatStr)) { message.destroy(); message.success('Success copy parameters to clipboard in form of python dict !', 3); - const { idList } = this.state; - const copyIdList: Array = idList; - copyIdList[copyIdList.length - 1] = record.id; - this.setState(() => ({ - idList: copyIdList - })); } else { message.destroy(); message.error('Failed !', 2); } + this.hideFormatModal(); } + componentDidMount() { + this._isMounted = true; + } + + componentWillUnmount() { + this._isMounted = false; + } render() { const { trainingPlatform, record, logCollection, multiphase } = this.props; - const { idList } = this.state; + const { isShowFormatModal, formatStr } = this.state; let isClick = false; let isHasParameters = true; - if (idList.indexOf(record.id) !== -1) { isClick = true; } if (record.description.parameters.error) { isHasParameters = false; } @@ -101,7 +121,7 @@ class OpenRow extends React.Component { @@ -128,6 +148,21 @@ class OpenRow extends React.Component { } + + {/* write string in pre to show format string */} +
{formatStr}
+
); } diff --git a/src/webui/src/components/trial-detail/DefaultMetricPoint.tsx b/src/webui/src/components/trial-detail/DefaultMetricPoint.tsx index a4d023ff4d..bd6b5f0c4d 100644 --- a/src/webui/src/components/trial-detail/DefaultMetricPoint.tsx +++ b/src/webui/src/components/trial-detail/DefaultMetricPoint.tsx @@ -1,5 +1,6 @@ import * as React from 'react'; import ReactEcharts from 'echarts-for-react'; +import { filterByStatus } from '../../static/function'; import { TableObj, DetailAccurPoint, TooltipForAccuracy } from '../../static/interface'; require('echarts/lib/chart/scatter'); require('echarts/lib/component/tooltip'); @@ -8,11 +9,13 @@ require('echarts/lib/component/title'); interface DefaultPointProps { showSource: Array; height: number; + whichGraph: string; } interface DefaultPointState { defaultSource: object; accNodata: string; + succeedTrials: number; } class DefaultPoint extends React.Component { @@ -22,91 +25,130 @@ class DefaultPoint extends React.Component super(props); this.state = { defaultSource: {}, - accNodata: 'No data' + accNodata: '', + succeedTrials: 10000000 }; } - defaultMetric = (showSource: Array) => { + defaultMetric = (succeedSource: Array) => { const accSource: Array = []; - Object.keys(showSource).map(item => { - const temp = showSource[item]; - if (temp.status === 'SUCCEEDED' && temp.acc !== undefined) { - if (temp.acc.default !== undefined) { - const searchSpace = temp.description.parameters; - accSource.push({ - acc: temp.acc.default, - index: temp.sequenceId, - searchSpace: JSON.stringify(searchSpace) - }); + const showSource: Array = succeedSource.filter(filterByStatus); + const lengthOfSource = showSource.length; + const tooltipDefault = lengthOfSource === 0 ? 
'No data' : ''; + if (this._isMounted === true) { + this.setState(() => ({ + succeedTrials: lengthOfSource, + accNodata: tooltipDefault + })); + } + if (lengthOfSource === 0) { + const nullGraph = { + grid: { + left: '8%' + }, + xAxis: { + name: 'Trial', + type: 'category', + }, + yAxis: { + name: 'Default metric', + type: 'value', } + }; + if (this._isMounted === true) { + this.setState(() => ({ + defaultSource: nullGraph + })); } - }); - const resultList: Array[] = []; - Object.keys(accSource).map(item => { - const items = accSource[item]; - let temp: Array; - temp = [items.index, items.acc, JSON.parse(items.searchSpace)]; - resultList.push(temp); - }); - - const allAcuracy = { - grid: { - left: '8%' - }, - tooltip: { - trigger: 'item', - enterable: true, - position: function (point: Array, data: TooltipForAccuracy) { - if (data.data[0] < resultList.length / 2) { - return [point[0], 80]; - } else { - return [point[0] - 300, 80]; + } else { + const resultList: Array[] = []; + Object.keys(showSource).map(item => { + const temp = showSource[item]; + if (temp.acc !== undefined) { + if (temp.acc.default !== undefined) { + const searchSpace = temp.description.parameters; + accSource.push({ + acc: temp.acc.default, + index: temp.sequenceId, + searchSpace: JSON.stringify(searchSpace) + }); } - }, - formatter: function (data: TooltipForAccuracy) { - const result = '
' + - '
Trial No.: ' + data.data[0] + '
' + - '
Default metric: ' + data.data[1] + '
' + - '
Parameters: ' + - '
' + JSON.stringify(data.data[2], null, 4) + '
' + - '
' + - '
'; - return result; - } - }, - xAxis: { - name: 'Trial', - type: 'category', - }, - yAxis: { - name: 'Default metric', - type: 'value', - }, - series: [{ - symbolSize: 6, - type: 'scatter', - data: resultList - }] - }; - if (this._isMounted === true) { - this.setState({ defaultSource: allAcuracy }, () => { - if (resultList.length === 0) { - this.setState({ - accNodata: 'No data' - }); - } else { - this.setState({ - accNodata: '' - }); } }); + Object.keys(accSource).map(item => { + const items = accSource[item]; + let temp: Array; + temp = [items.index, items.acc, JSON.parse(items.searchSpace)]; + resultList.push(temp); + }); + + const allAcuracy = { + grid: { + left: '8%' + }, + tooltip: { + trigger: 'item', + enterable: true, + position: function (point: Array, data: TooltipForAccuracy) { + if (data.data[0] < resultList.length / 2) { + return [point[0], 80]; + } else { + return [point[0] - 300, 80]; + } + }, + formatter: function (data: TooltipForAccuracy) { + const result = '
' + + '
Trial No.: ' + data.data[0] + '
' + + '
Default metric: ' + data.data[1] + '
' + + '
Parameters: ' + + '
' + JSON.stringify(data.data[2], null, 4) + '
' + + '
' + + '
'; + return result; + } + }, + xAxis: { + name: 'Trial', + type: 'category', + }, + yAxis: { + name: 'Default metric', + type: 'value', + }, + series: [{ + symbolSize: 6, + type: 'scatter', + data: resultList + }] + }; + if (this._isMounted === true) { + this.setState(() => ({ + defaultSource: allAcuracy + })); + } } } // update parent component state componentWillReceiveProps(nextProps: DefaultPointProps) { - const showSource = nextProps.showSource; - this.defaultMetric(showSource); + + const { whichGraph, showSource } = nextProps; + if (whichGraph === '1') { + this.defaultMetric(showSource); + } + } + + shouldComponentUpdate(nextProps: DefaultPointProps, nextState: DefaultPointState) { + const { whichGraph } = nextProps; + const succTrial = this.state.succeedTrials; + const { succeedTrials } = nextState; + if (whichGraph === '1') { + if (succeedTrials !== succTrial) { + return true; + } + } + // only whichGraph !== '1', default metric can't update + return false; } componentDidMount() { @@ -116,7 +158,7 @@ class DefaultPoint extends React.Component componentWillUnmount() { this._isMounted = false; } - + render() { const { height } = this.props; const { defaultSource, accNodata } = this.state; @@ -131,6 +173,7 @@ class DefaultPoint extends React.Component }} theme="my_theme" notMerge={true} // update now + // lazyUpdate={true} />
{accNodata}
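The DefaultMetricPoint diff above now charts only trials that pass `filterByStatus`, imported from `../../static/function`. That helper is not shown in this chunk; a plausible sketch of it, assuming it simply keeps `SUCCEEDED` trials, is:

```typescript
// Hypothetical reimplementation of the filterByStatus helper used above;
// the real one lives in src/static/function.ts, which this diff does not show.
interface TrialLike {
    status: string;
}

const filterByStatus = (item: TrialLike): boolean => item.status === 'SUCCEEDED';

const succeedSource: TrialLike[] = [
    { status: 'SUCCEEDED' },
    { status: 'RUNNING' },
    { status: 'FAILED' },
    { status: 'SUCCEEDED' },
];

// Mirrors `const showSource = succeedSource.filter(filterByStatus);` in the diff.
const showSource = succeedSource.filter(filterByStatus);
console.log(showSource.length); // 2
```

Filtering up front keeps the chart-building code free of per-item status checks, which is what lets the diff drop the old `temp.status === 'SUCCEEDED'` test inside the mapping loop.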
diff --git a/src/webui/src/components/trial-detail/Duration.tsx b/src/webui/src/components/trial-detail/Duration.tsx index d1405a04fe..5cc96467a8 100644 --- a/src/webui/src/components/trial-detail/Duration.tsx +++ b/src/webui/src/components/trial-detail/Duration.tsx @@ -1,6 +1,7 @@ import * as React from 'react'; import ReactEcharts from 'echarts-for-react'; import { TableObj } from 'src/static/interface'; +import { filterDuration } from 'src/static/function'; require('echarts/lib/chart/bar'); require('echarts/lib/component/tooltip'); require('echarts/lib/component/title'); @@ -12,6 +13,7 @@ interface Runtrial { interface DurationProps { source: Array; + whichGraph: string; } interface DurationState { @@ -26,13 +28,64 @@ class Duration extends React.Component { super(props); this.state = { - durationSource: {} + durationSource: this.initDuration(this.props.source), }; } + initDuration = (source: Array) => { + const trialId: Array = []; + const trialTime: Array = []; + const trialJobs = source.filter(filterDuration); + Object.keys(trialJobs).map(item => { + const temp = trialJobs[item]; + trialId.push(temp.sequenceId); + trialTime.push(temp.duration); + }); + return { + tooltip: { + trigger: 'axis', + axisPointer: { + type: 'shadow' + } + }, + grid: { + bottom: '3%', + containLabel: true, + left: '1%', + right: '4%' + }, + + dataZoom: [{ + type: 'slider', + name: 'trial', + filterMode: 'filter', + yAxisIndex: 0, + orient: 'vertical' + }, { + type: 'slider', + name: 'trial', + filterMode: 'filter', + xAxisIndex: 0 + }], + xAxis: { + name: 'Time', + type: 'value', + }, + yAxis: { + name: 'Trial', + type: 'category', + data: trialId + }, + series: [{ + type: 'bar', + data: trialTime + }] + }; + } + getOption = (dataObj: Runtrial) => { - return { + return { tooltip: { trigger: 'axis', axisPointer: { @@ -45,7 +98,7 @@ class Duration extends React.Component { left: '1%', right: '4%' }, - + dataZoom: [{ type: 'slider', name: 'trial', @@ -74,17 +127,16 @@ class Duration 
extends React.Component { }; } - drawDurationGraph = (trialJobs: Array) => { - + drawDurationGraph = (source: Array) => { + // why does this function run twice when props change? const trialId: Array = []; const trialTime: Array = []; const trialRun: Array = []; + const trialJobs = source.filter(filterDuration); Object.keys(trialJobs).map(item => { const temp = trialJobs[item]; - if (temp.status !== 'WAITING') { - trialId.push(temp.sequenceId); - trialTime.push(temp.duration); - } + trialId.push(temp.sequenceId); + trialTime.push(temp.duration); }); trialRun.push({ trialId: trialId, @@ -97,18 +149,43 @@ class Duration extends React.Component { } } - componentWillReceiveProps(nextProps: DurationProps) { - const trialJobs = nextProps.source; - this.drawDurationGraph(trialJobs); - } - componentDidMount() { this._isMounted = true; - // init: user don't search - const {source} = this.props; + const { source } = this.props; this.drawDurationGraph(source); } + componentWillReceiveProps(nextProps: DurationProps) { + const { whichGraph, source } = nextProps; + if (whichGraph === '3') { + this.drawDurationGraph(source); + } + } + + shouldComponentUpdate(nextProps: DurationProps, nextState: DurationState) { + + const { whichGraph, source } = nextProps; + if (whichGraph === '3') { + const beforeSource = this.props.source; + if (whichGraph !== this.props.whichGraph) { + return true; + } + + if (source.length !== beforeSource.length) { + return true; + } + + if (source[source.length - 1].duration !== beforeSource[beforeSource.length - 1].duration) { + return true; + } + + if (source[source.length - 1].status !== beforeSource[beforeSource.length - 1].status) { + return true; + } + } + return false; + } + componentWillUnmount() { this._isMounted = false; } @@ -121,6 +198,7 @@ class Duration extends React.Component { option={durationSource} style={{ width: '95%', height: 412, margin: '0 auto' }} theme="my_theme" + notMerge={true} // update now /> ); diff --git
a/src/webui/src/components/trial-detail/Intermeidate.tsx b/src/webui/src/components/trial-detail/Intermeidate.tsx index e30ad2f9cf..5812f525cf 100644 --- a/src/webui/src/components/trial-detail/Intermeidate.tsx +++ b/src/webui/src/components/trial-detail/Intermeidate.tsx @@ -11,16 +11,21 @@ interface Intermedia { data: Array; // intermediate data hyperPara: object; // each trial hyperpara value } + interface IntermediateState { + detailSource: Array; interSource: object; filterSource: Array; eachIntermediateNum: number; // trial's intermediate number count isLoadconfirmBtn: boolean; isFilter: boolean; + length: number; + clickCounts: number; // user filter intermediate click confirm btn's counts } interface IntermediateProps { source: Array; + whichGraph: string; } class Intermediate extends React.Component { @@ -34,39 +39,25 @@ class Intermediate extends React.Component constructor(props: IntermediateProps) { super(props); this.state = { + detailSource: [], interSource: {}, filterSource: [], eachIntermediateNum: 1, isLoadconfirmBtn: false, - isFilter: false - }; - } - - initMediate = () => { - const option = { - grid: { - left: '5%', - top: 40, - containLabel: true - }, - xAxis: { - type: 'category', - boundaryGap: false, - }, - yAxis: { - type: 'value', - name: 'Scape' - } + isFilter: false, + length: 100000, + clickCounts: 0 }; - if (this._isMounted) { - this.setState(() => ({ - interSource: option - })); - } } drawIntermediate = (source: Array) => { if (source.length > 0) { + if (this._isMounted) { + this.setState(() => ({ + length: source.length, + detailSource: source + })); + } const trialIntermediate: Array = []; Object.keys(source).map(item => { const temp = source[item]; @@ -140,7 +131,24 @@ class Intermediate extends React.Component })); } } else { - this.initMediate(); + const nullData = { + grid: { + left: '5%', + top: 40, + containLabel: true + }, + xAxis: { + type: 'category', + boundaryGap: false, + }, + yAxis: { + type: 'value', + name: 'Scape' + } 
+ }; + if (this._isMounted) { + this.setState(() => ({ interSource: nullData })); + } } } @@ -183,8 +191,9 @@ class Intermediate extends React.Component this.setState({ filterSource: filterSource }); } this.drawIntermediate(filterSource); + const counts = this.state.clickCounts + 1; + this.setState({ isLoadconfirmBtn: false, clickCounts: counts }); } - this.setState({ isLoadconfirmBtn: false }); }); } } @@ -204,28 +213,73 @@ class Intermediate extends React.Component this.drawIntermediate(source); } - componentWillReceiveProps(nextProps: IntermediateProps) { - const { isFilter, filterSource } = this.state; - if (isFilter === true) { - const pointVal = this.pointInput !== null ? this.pointInput.value : ''; - const minVal = this.minValInput !== null ? this.minValInput.value : ''; - if (pointVal === '' && minVal === '') { - this.drawIntermediate(nextProps.source); + componentWillReceiveProps(nextProps: IntermediateProps, nextState: IntermediateState) { + const { isFilter, filterSource } = nextState; + const { whichGraph, source } = nextProps; + + if (whichGraph === '4') { + if (isFilter === true) { + const pointVal = this.pointInput !== null ? this.pointInput.value : ''; + const minVal = this.minValInput !== null ? 
this.minValInput.value : ''; + if (pointVal === '' && minVal === '') { + this.drawIntermediate(source); + } else { + this.drawIntermediate(filterSource); + } } else { - this.drawIntermediate(filterSource); + this.drawIntermediate(source); } - } else { - this.drawIntermediate(nextProps.source); } } + shouldComponentUpdate(nextProps: IntermediateProps, nextState: IntermediateState) { + const { whichGraph } = nextProps; + const beforeGraph = this.props.whichGraph; + if (whichGraph === '4') { + + const { source } = nextProps; + const { isFilter, length, clickCounts } = nextState; + const beforeLength = this.state.length; + const beforeSource = this.state.detailSource; + const beforeClickCounts = this.state.clickCounts; + + if (isFilter !== this.state.isFilter) { + return true; + } + + if (clickCounts !== beforeClickCounts) { + return true; + } + + if (isFilter === false) { + if (whichGraph !== beforeGraph) { + return true; + } + if (length !== beforeLength) { + return true; + } + if (source[source.length - 1].description.intermediate.length !== + beforeSource[beforeSource.length - 1].description.intermediate.length) { + return true; + } + if (source[source.length - 1].duration !== beforeSource[beforeSource.length - 1].duration) { + return true; + } + if (source[source.length - 1].status !== beforeSource[beforeSource.length - 1].status) { + return true; + } + } + } + + return false; + } + componentWillUnmount() { this._isMounted = false; } render() { const { interSource, isLoadconfirmBtn, isFilter } = this.state; - return (
{/* style in para.scss */} diff --git a/src/webui/src/components/trial-detail/Para.tsx b/src/webui/src/components/trial-detail/Para.tsx index 2a58e4714a..1a9190f702 100644 --- a/src/webui/src/components/trial-detail/Para.tsx +++ b/src/webui/src/components/trial-detail/Para.tsx @@ -1,7 +1,8 @@ import * as React from 'react'; import ReactEcharts from 'echarts-for-react'; +import { filterByStatus } from '../../static/function'; import { Row, Col, Select, Button, message } from 'antd'; -import { ParaObj, Dimobj, TableObj, SearchSpace } from '../../static/interface'; +import { ParaObj, Dimobj, TableObj } from '../../static/interface'; const Option = Select.Option; require('echarts/lib/chart/parallel'); require('echarts/lib/component/tooltip'); @@ -11,6 +12,7 @@ require('../../static/style/para.scss'); require('../../static/style/button.scss'); interface ParaState { + // paraSource: Array; option: object; paraBack: ParaObj; dimName: Array; @@ -19,11 +21,15 @@ interface ParaState { paraNodata: string; max: number; // graph color bar limit min: number; + sutrialCount: number; // succeed trial numbers for SUC + clickCounts: number; + isLoadConfirm: boolean; } interface ParaProps { dataSource: Array; expSearchSpace: string; + whichGraph: string; } message.config({ @@ -45,6 +51,8 @@ class Para extends React.Component { constructor(props: ParaProps) { super(props); this.state = { + // paraSource: [], + // option: this.hyperParaPic, option: {}, dimName: [], paraBack: { @@ -58,98 +66,20 @@ class Para extends React.Component { percent: 0, paraNodata: '', min: 0, - max: 1 + max: 1, + sutrialCount: 10000000, + clickCounts: 1, + isLoadConfirm: false }; } - componentDidMount() { - - this._isMounted = true; - this.reInit(); - } - getParallelAxis = ( - dimName: Array, searchRange: SearchSpace, - accPara: Array, - eachTrialParams: Array, paraYdata: number[][] + dimName: Array, parallelAxis: Array, + accPara: Array, eachTrialParams: Array ) => { - if (this._isMounted) { - 
this.setState(() => ({ - dimName: dimName - })); - } - const parallelAxis: Array = []; - // search space range and specific value [only number] - for (let i = 0; i < dimName.length; i++) { - const searchKey = searchRange[dimName[i]]; - switch (searchKey._type) { - case 'uniform': - case 'quniform': - parallelAxis.push({ - dim: i, - name: dimName[i], - max: searchKey._value[1], - min: searchKey._value[0] - }); - break; - - case 'randint': - parallelAxis.push({ - dim: i, - name: dimName[i], - max: searchKey._value[0] - 1, - min: 0 - }); - break; - - case 'choice': - const data: Array = []; - for (let j = 0; j < searchKey._value.length; j++) { - data.push(searchKey._value[j].toString()); - } - parallelAxis.push({ - dim: i, - name: dimName[i], - type: 'category', - data: data, - boundaryGap: true, - axisLine: { - lineStyle: { - type: 'dotted', // axis type,solid,dashed,dotted - width: 1 - } - }, - axisTick: { - show: true, - interval: 0, - alignWithLabel: true, - }, - axisLabel: { - show: true, - interval: 0, - // rotate: 30 - }, - }); - break; - // support log distribute - case 'loguniform': - parallelAxis.push({ - dim: i, - name: dimName[i], - type: 'log', - }); - break; - - default: - parallelAxis.push({ - dim: i, - name: dimName[i] - }); - - } - } // get data for every lines. 
if dim is choice type, number -> toString() + const paraYdata: number[][] = []; Object.keys(eachTrialParams).map(item => { let temp: Array = []; for (let i = 0; i < dimName.length; i++) { @@ -169,7 +99,7 @@ class Para extends React.Component { Object.keys(paraYdata).map(item => { paraYdata[item].push(accPara[item]); }); - // according acc to sort ydata + // sort ydata by accuracy to find the top-percent dataset if (paraYdata.length !== 0) { const len = paraYdata[0].length - 1; paraYdata.sort((a, b) => b[len] - a[len]); @@ -191,28 +121,153 @@ class Para extends React.Component { this.swapGraph(paraData, swapAxisArr); } this.getOption(paraData); + if (this._isMounted === true) { + this.setState(() => ({ paraBack: paraData })); + } } - hyperParaPic = (dataSource: Array, searchSpace: string) => { + hyperParaPic = (source: Array, searchSpace: string) => { + // filter succeeded trials [{}, {}, {}] + const dataSource: Array = source.filter(filterByStatus); + const lenOfDataSource: number = dataSource.length; const accPara: Array = []; // specific value array const eachTrialParams: Array = []; - const paraYdata: number[][] = []; // experiment interface search space obj - const searchRange = JSON.parse(searchSpace); + const searchRange = searchSpace !== undefined ?
JSON.parse(searchSpace) : ''; const dimName = Object.keys(searchRange); - // trial-jobs interface list - Object.keys(dataSource).map(item => { - const temp = dataSource[item]; - if (temp.status === 'SUCCEEDED') { - accPara.push(temp.acc.default); - eachTrialParams.push(temp.description.parameters); + if (this._isMounted === true) { + this.setState(() => ({ dimName: dimName })); + } + + const parallelAxis: Array = []; + // search space range and specific value [only number] + for (let i = 0; i < dimName.length; i++) { + const searchKey = searchRange[dimName[i]]; + switch (searchKey._type) { + case 'uniform': + case 'quniform': + parallelAxis.push({ + dim: i, + name: dimName[i], + max: searchKey._value[1], + min: searchKey._value[0] + }); + break; + + case 'randint': + parallelAxis.push({ + dim: i, + name: dimName[i], + max: searchKey._value[0] - 1, + min: 0 + }); + break; + + case 'choice': + const data: Array = []; + for (let j = 0; j < searchKey._value.length; j++) { + data.push(searchKey._value[j].toString()); + } + parallelAxis.push({ + dim: i, + name: dimName[i], + type: 'category', + data: data, + boundaryGap: true, + axisLine: { + lineStyle: { + type: 'dotted', // axis type,solid,dashed,dotted + width: 1 + } + }, + axisTick: { + show: true, + interval: 0, + alignWithLabel: true, + }, + axisLabel: { + show: true, + interval: 0, + // rotate: 30 + }, + }); + break; + // support log distribute + case 'loguniform': + parallelAxis.push({ + dim: i, + name: dimName[i], + type: 'log', + }); + break; + + default: + parallelAxis.push({ + dim: i, + name: dimName[i] + }); + } - }); - if (this._isMounted) { - this.setState({ max: Math.max(...accPara), min: Math.min(...accPara) }, () => { - this.getParallelAxis(dimName, searchRange, accPara, eachTrialParams, paraYdata); + } + if (lenOfDataSource === 0) { + const optionOfNull = { + parallelAxis, + tooltip: { + trigger: 'item' + }, + parallel: { + parallelAxisDefault: { + tooltip: { + show: true + }, + axisLabel: { + 
formatter: function (value: string) { + const length = value.length; + if (length > 16) { + const temp = value.split(''); + for (let i = 16; i < temp.length; i += 17) { + temp[i] += '\n'; + } + return temp.join(''); + } else { + return value; + } + } + }, + } + }, + visualMap: { + type: 'continuous', + min: 0, + max: 1, + color: ['#CA0000', '#FFC400', '#90EE90'] + } + }; + if (this._isMounted === true) { + this.setState({ + paraNodata: 'No data', + option: optionOfNull, + sutrialCount: 0 + }); + } + } else { + Object.keys(dataSource).map(item => { + const temp = dataSource[item]; + eachTrialParams.push(temp.description.parameters); + // a succeeded trial may not have a final result yet + // without this guard the whole detail page could break + if (temp.acc !== undefined) { + if (temp.acc.default !== undefined) { + accPara.push(temp.acc.default); + } + } }); + if (this._isMounted) { + this.setState({ max: Math.max(...accPara), min: Math.min(...accPara) }, () => { + this.getParallelAxis(dimName, parallelAxis, accPara, eachTrialParams); + }); + } } } @@ -229,9 +284,10 @@ class Para extends React.Component { // deal with response data into pic data getOption = (dataObj: ParaObj) => { + // dataObj [[y1], [y2]...
[default metric]] const { max, min } = this.state; - let parallelAxis = dataObj.parallelAxis; - let paralleData = dataObj.data; + const parallelAxis = dataObj.parallelAxis; + const paralleData = dataObj.data; let visualMapObj = {}; if (max === min) { visualMapObj = { @@ -251,7 +307,7 @@ class Para extends React.Component { color: ['#CA0000', '#FFC400', '#90EE90'] }; } - let optionown = { + const optionown = { parallelAxis, tooltip: { trigger: 'item' @@ -288,21 +344,11 @@ class Para extends React.Component { } }; // please wait the data - if (this._isMounted) { - if (paralleData.length === 0) { - this.setState({ - paraNodata: 'No data' - }); - } else { - this.setState({ - paraNodata: '' - }); - } - } - // draw search space graph if (this._isMounted) { this.setState(() => ({ - option: optionown + option: optionown, + paraNodata: '', + sutrialCount: paralleData.length })); } } @@ -320,6 +366,68 @@ class Para extends React.Component { this.hyperParaPic(dataSource, expSearchSpace); } + swapReInit = () => { + const { clickCounts } = this.state; + const val = clickCounts + 1; + if (this._isMounted) { + this.setState({ isLoadConfirm: true, clickCounts: val, }); + } + const { paraBack, swapAxisArr } = this.state; + const paralDim = paraBack.parallelAxis; + const paraData = paraBack.data; + let temp: number; + let dim1: number; + let dim2: number; + let bool1: boolean = false; + let bool2: boolean = false; + let bool3: boolean = false; + Object.keys(paralDim).map(item => { + const paral = paralDim[item]; + switch (paral.name) { + case swapAxisArr[0]: + dim1 = paral.dim; + bool1 = true; + break; + + case swapAxisArr[1]: + dim2 = paral.dim; + bool2 = true; + break; + + default: + } + if (bool1 && bool2) { + bool3 = true; + } + }); + // swap dim's number + Object.keys(paralDim).map(item => { + if (bool3) { + if (paralDim[item].name === swapAxisArr[0]) { + paralDim[item].dim = dim2; + } + if (paralDim[item].name === swapAxisArr[1]) { + paralDim[item].dim = dim1; + } + } + }); + 
paralDim.sort(this.sortDimY); + // swap data array + Object.keys(paraData).map(paraItem => { + + temp = paraData[paraItem][dim1]; + paraData[paraItem][dim1] = paraData[paraItem][dim2]; + paraData[paraItem][dim2] = temp; + }); + this.getOption(paraBack); + // please wait the data + if (this._isMounted) { + this.setState(() => ({ + isLoadConfirm: false + })); + } + } + sortDimY = (a: Dimobj, b: Dimobj) => { return a.dim - b.dim; } @@ -374,11 +482,39 @@ class Para extends React.Component { }); } + componentDidMount() { + this._isMounted = true; + this.reInit(); + } + componentWillReceiveProps(nextProps: ParaProps) { - const dataSource = nextProps.dataSource; - const expSearchSpace = nextProps.expSearchSpace; - this.hyperParaPic(dataSource, expSearchSpace); + const { dataSource, expSearchSpace, whichGraph } = nextProps; + if (whichGraph === '2') { + this.hyperParaPic(dataSource, expSearchSpace); + } + } + shouldComponentUpdate(nextProps: ParaProps, nextState: ParaState) { + + const { whichGraph } = nextProps; + const beforeGraph = this.props.whichGraph; + if (whichGraph === '2') { + if (whichGraph !== beforeGraph) { + return true; + } + + const { sutrialCount, clickCounts } = nextState; + const beforeCount = this.state.sutrialCount; + const beforeClickCount = this.state.clickCounts; + if (sutrialCount !== beforeCount) { + return true; + } + + if (clickCounts !== beforeClickCount) { + return true; + } + } + return false; } componentWillUnmount() { @@ -386,7 +522,7 @@ class Para extends React.Component { } render() { - const { option, paraNodata, dimName } = this.state; + const { option, paraNodata, dimName, isLoadConfirm } = this.state; return ( @@ -423,7 +559,8 @@ class Para extends React.Component { @@ -434,7 +571,7 @@ class Para extends React.Component {
{paraNodata}
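The Para component above filters succeeded trials, ranks the rows by the final metric stored in the last column, and keeps only the top percent for the parallel-coordinates plot. A minimal sketch of that selection logic (written in Python for brevity; the actual code is TypeScript, and `top_percent` is a hypothetical helper, not part of the diff):

```python
# Sketch of the "sort by accuracy, keep top percent" step in Para.tsx.
# Each row is [hyperparam_1, ..., hyperparam_n, default_metric].

def top_percent(rows, percent):
    """Keep the best `percent` of rows, ranked by the metric in the last column."""
    if not rows:
        return []
    # higher metric first, mirroring paraYdata.sort((a, b) => b[len] - a[len])
    ranked = sorted(rows, key=lambda row: row[-1], reverse=True)
    keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:keep]

rows = [
    [0.1, 32, 0.71],
    [0.5, 64, 0.93],
    [0.9, 16, 0.88],
    [0.3, 128, 0.65],
]
best = top_percent(rows, 50)  # keeps the two best-performing trials
```

The real component additionally caches the full dataset in `paraBack` so that axis swapping can reorder columns without refetching.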
diff --git a/src/webui/src/components/trial-detail/TableList.tsx b/src/webui/src/components/trial-detail/TableList.tsx index c7dc803bf4..c52bb0068c 100644 --- a/src/webui/src/components/trial-detail/TableList.tsx +++ b/src/webui/src/components/trial-detail/TableList.tsx @@ -1,16 +1,13 @@ import * as React from 'react'; import axios from 'axios'; import ReactEcharts from 'echarts-for-react'; -import { - Row, Table, Button, Popconfirm, Modal, Checkbox -} from 'antd'; +import { Row, Table, Button, Popconfirm, Modal, Checkbox } from 'antd'; const CheckboxGroup = Checkbox.Group; import { MANAGER_IP, trialJobStatus, COLUMN, COLUMN_INDEX } from '../../static/const'; import { convertDuration, intermediateGraphOption, killJob } from '../../static/function'; import { TableObj, TrialJob } from '../../static/interface'; import OpenRow from '../public-child/OpenRow'; -// import DefaultMetric from '../public-child/DefaultMetrc'; -import IntermediateVal from '../public-child/IntermediateVal'; +import IntermediateVal from '../public-child/IntermediateVal'; // table default metric column import '../../static/style/search.scss'; require('../../static/style/tableStatus.css'); require('../../static/style/logPath.scss'); @@ -33,6 +30,7 @@ interface TableListProps { platform: string; logCollection: boolean; isMultiPhase: boolean; + isTableLoading: boolean; } interface TableListState { @@ -197,7 +195,7 @@ class TableList extends React.Component { render() { - const { entries, tableSource, updateList } = this.props; + const { entries, tableSource, updateList, isTableLoading } = this.props; const { intermediateOption, modalVisible, isShowColumn, columnSelected } = this.state; let showTitle = COLUMN; let bgColor = ''; @@ -420,6 +418,7 @@ class TableList extends React.Component { dataSource={tableSource} className="commonTableStyle" pagination={{ pageSize: entries }} + loading={isTableLoading} /> {/* Intermediate Result Modal */} { if (num % 3600 === 0) { @@ -131,7 +129,16 @@ const killJob = 
(key: number, id: string, status: string, updateList: Function) }); }; +const filterByStatus = (item: TableObj) => { + return item.status === 'SUCCEEDED'; +}; + +// a waiting trial may not have a start time +const filterDuration = (item: TableObj) => { + return item.status !== 'WAITING'; +}; + export { - convertTime, convertDuration, getFinalResult, - getFinal, intermediateGraphOption, killJob + convertTime, convertDuration, getFinalResult, getFinal, + intermediateGraphOption, killJob, filterByStatus, filterDuration }; diff --git a/src/webui/src/static/interface.ts b/src/webui/src/static/interface.ts index bcf0bd6950..cd18cfe929 100644 --- a/src/webui/src/static/interface.ts +++ b/src/webui/src/static/interface.ts @@ -26,7 +26,7 @@ interface ErrorParameter { interface Parameters { parameters: ErrorParameter; logPath?: string; - intermediate?: Array; + intermediate: Array; } interface Experiment { diff --git a/src/webui/src/static/style/copyParameter.scss b/src/webui/src/static/style/copyParameter.scss new file mode 100644 index 0000000000..a906e56afb --- /dev/null +++ b/src/webui/src/static/style/copyParameter.scss @@ -0,0 +1,22 @@ +$color: #f2f2f2; +.formatStr{ + border: 1px solid #8f8f8f; + color: #333; + padding: 5px 10px; + background-color: #fff; +} + +.format { + .ant-modal-header{ + background-color: $color; + border-bottom: none; + } + .ant-modal-footer{ + background-color: $color; + border-top: none; + } + .ant-modal-body{ + background-color: $color; + padding: 10px 24px !important; + } +} diff --git a/src/webui/src/static/style/overview.scss b/src/webui/src/static/style/overview.scss index 353696aa6b..2cfc2436a8 100644 --- a/src/webui/src/static/style/overview.scss +++ b/src/webui/src/static/style/overview.scss @@ -52,4 +52,3 @@ .link{ margin-bottom: 10px; } - diff --git a/test/config_test/examples/mnist-cascading-search-space.test.yml b/test/config_test/examples/mnist-nested-search-space.test.yml similarity index 72% rename from
test/config_test/examples/mnist-cascading-search-space.test.yml rename to test/config_test/examples/mnist-nested-search-space.test.yml index 1904afac64..accbd8b88f 100644 --- a/test/config_test/examples/mnist-cascading-search-space.test.yml +++ b/test/config_test/examples/mnist-nested-search-space.test.yml @@ -3,7 +3,7 @@ experimentName: default_test maxExecDuration: 5m maxTrialNum: 4 trialConcurrency: 2 -searchSpacePath: ../../../examples/trials/mnist-cascading-search-space/search_space.json +searchSpacePath: ../../../examples/trials/mnist-nested-search-space/search_space.json tuner: #choice: TPE, Random, Anneal, Evolution @@ -13,7 +13,7 @@ assessor: classArgs: optimize_mode: maximize trial: - codeDir: ../../../examples/trials/mnist-cascading-search-space + codeDir: ../../../examples/trials/mnist-nested-search-space command: python3 mnist.py --batch_num 100 gpuNum: 0 diff --git a/test/pipelines-it-remote-windows.yml b/test/pipelines-it-remote-windows.yml new file mode 100644 index 0000000000..8eaff656c1 --- /dev/null +++ b/test/pipelines-it-remote-windows.yml @@ -0,0 +1,49 @@ +jobs: +- job: 'integration_test_remote_windows' + + steps: + - script: python -m pip install --upgrade pip setuptools + displayName: 'Install python tools' + - task: CopyFilesOverSSH@0 + inputs: + sshEndpoint: $(end_point) + targetFolder: /tmp/nnitest/$(Build.BuildId)/nni-remote + overwrite: true + displayName: 'Copy all files to remote machine' + - script: | + powershell.exe -file install.ps1 + displayName: 'Install nni toolkit via source code' + - script: | + python -m pip install scikit-learn==0.20.1 --user + displayName: 'Install dependencies for integration tests' + - task: SSH@0 + inputs: + sshEndpoint: $(end_point) + runOptions: inline + inline: cd /tmp/nnitest/$(Build.BuildId)/nni-remote/deployment/pypi;make build + continueOnError: true + displayName: 'build nni bdsit_wheel' + - task: SSH@0 + inputs: + sshEndpoint: $(end_point) + runOptions: commands + commands: python3 
/tmp/nnitest/$(Build.BuildId)/nni-remote/test/remote_docker.py --mode start --name $(Build.BuildId) --image nni/nni --os windows + displayName: 'Start docker' + - powershell: | + Write-Host "Downloading Putty..." + (New-Object Net.WebClient).DownloadFile("https://the.earth.li/~sgtatham/putty/latest/w64/pscp.exe", "$(Agent.TempDirectory)\pscp.exe") + $(Agent.TempDirectory)\pscp.exe -hostkey $(hostkey) -pw $(pscp_pwd) $(remote_user)@$(remote_host):/tmp/nnitest/$(Build.BuildId)/port test\port + Get-Content test\port + displayName: 'Get docker port' + - powershell: | + cd test + python generate_ts_config.py --ts remote --remote_user $(docker_user) --remote_host $(remote_host) --remote_port $(Get-Content port) --remote_pwd $(docker_pwd) --nni_manager_ip $(nni_manager_ip) + Get-Content training_service.yml + python config_test.py --ts remote --exclude cifar10,smac,bohb + displayName: 'integration test' + - task: SSH@0 + inputs: + sshEndpoint: $(end_point) + runOptions: commands + commands: python3 /tmp/nnitest/$(Build.BuildId)/nni-remote/test/remote_docker.py --mode stop --name $(Build.BuildId) --os windows + displayName: 'Stop docker' diff --git a/test/remote_docker.py b/test/remote_docker.py index 98f37a1444..576f54ffce 100644 --- a/test/remote_docker.py +++ b/test/remote_docker.py @@ -30,18 +30,33 @@ def find_wheel_package(dir): return file_name return None -def start_container(image, name): +def start_container(image, name, nnimanager_os): '''Start docker container, generate a port in /tmp/nnitest/{name}/port file''' port = find_port() source_dir = '/tmp/nnitest/' + name run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] output = check_output(run_cmds) commit_id = output.decode('utf-8') - wheel_name = find_wheel_package(os.path.join(source_dir, 'dist')) + + if nnimanager_os == 'windows': + wheel_name = find_wheel_package(os.path.join(source_dir, 
'nni-remote/deployment/pypi/dist')) + else: + wheel_name = find_wheel_package(os.path.join(source_dir, 'dist')) + if not wheel_name: print('Error: could not find wheel package in {0}'.format(source_dir)) exit(1) - sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/{0}'.format(wheel_name)] + + def get_dist(wheel_name): + '''get the wheel package path''' + if nnimanager_os == 'windows': + return '/tmp/nni/nni-remote/deployment/pypi/dist/{0}'.format(wheel_name) + else: + return '/tmp/nni/dist/{0}'.format(wheel_name) + + pip_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '--upgrade', 'pip'] + check_call(pip_cmds) + sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', get_dist(wheel_name)] check_call(sdk_cmds) with open(source_dir + '/port', 'w') as file: file.write(str(port)) @@ -58,8 +73,9 @@ def stop_container(name): parser.add_argument('--mode', required=True, choices=['start', 'stop'], dest='mode', help='start or stop a container') parser.add_argument('--name', required=True, dest='name', help='the name of container to be used') parser.add_argument('--image', dest='image', help='the image to be used') + parser.add_argument('--os', dest='os', default='unix', choices=['unix', 'windows'], help='nniManager os version') args = parser.parse_args() if args.mode == 'start': - start_container(args.image, args.name) + start_container(args.image, args.name, args.os) else: stop_container(args.name) diff --git a/tools/README_zh_CN.md b/tools/README_zh_CN.md index 3fae759e44..3339f77129 100644 --- a/tools/README_zh_CN.md +++ b/tools/README_zh_CN.md @@ -54,4 +54,4 @@ NNI CTL 模块用来控制 Neural Network Intelligence,包括开始新 Experim ## 开始使用 NNI CTL -参考 [NNI CTL 文档](../docs/zh_CN/NNICTLDOC.md)。 \ No newline at end of file +参考 [NNI CTL 文档](../docs/zh_CN/Nnictl.md)。 \ No newline at end of file diff --git a/tools/nni_annotation/code_generator.py b/tools/nni_annotation/code_generator.py index ce27596a8f..dfe24ca081 100644 
--- a/tools/nni_annotation/code_generator.py +++ b/tools/nni_annotation/code_generator.py @@ -25,6 +25,94 @@ # pylint: disable=unidiomatic-typecheck +def parse_annotation_mutable_layers(code, lineno): + """Parse the string of mutable layers in annotation. + Return a list of AST Expr nodes + code: annotation string (excluding '@') + """ + module = ast.parse(code) + assert type(module) is ast.Module, 'internal error #1' + assert len(module.body) == 1, 'Annotation mutable_layers contains more than one expression' + assert type(module.body[0]) is ast.Expr, 'Annotation is not expression' + call = module.body[0].value + nodes = [] + mutable_id = 'mutable_block_' + str(lineno) + mutable_layer_cnt = 0 + for arg in call.args: + fields = {'layer_choice': False, + 'fixed_inputs': False, + 'optional_inputs': False, + 'optional_input_size': False, + 'layer_output': False} + for k, value in zip(arg.keys, arg.values): + if k.id == 'layer_choice': + assert not fields['layer_choice'], 'Duplicated field: layer_choice' + assert type(value) is ast.List, 'Value of layer_choice should be a list' + call_funcs_keys = [] + call_funcs_values = [] + call_kwargs_values = [] + for call in value.elts: + assert type(call) is ast.Call, 'Element in layer_choice should be function call' + call_name = astor.to_source(call).strip() + call_funcs_keys.append(ast.Str(s=call_name)) + call_funcs_values.append(call.func) + assert not call.args, 'Number of args without keyword should be zero' + kw_args = [] + kw_values = [] + for kw in call.keywords: + kw_args.append(kw.arg) + kw_values.append(kw.value) + call_kwargs_values.append(ast.Dict(keys=kw_args, values=kw_values)) + call_funcs = ast.Dict(keys=call_funcs_keys, values=call_funcs_values) + call_kwargs = ast.Dict(keys=call_funcs_keys, values=call_kwargs_values) + fields['layer_choice'] = True + elif k.id == 'fixed_inputs': + assert not fields['fixed_inputs'], 'Duplicated field: fixed_inputs' + assert type(value) is ast.List, 'Value of fixed_inputs 
should be a list' + fixed_inputs = value + fields['fixed_inputs'] = True + elif k.id == 'optional_inputs': + assert not fields['optional_inputs'], 'Duplicated field: optional_inputs' + assert type(value) is ast.List, 'Value of optional_inputs should be a list' + var_names = [ast.Str(s=astor.to_source(var).strip()) for var in value.elts] + optional_inputs = ast.Dict(keys=var_names, values=value.elts) + fields['optional_inputs'] = True + elif k.id == 'optional_input_size': + assert not fields['optional_input_size'], 'Duplicated field: optional_input_size' + assert type(value) is ast.Num, 'Value of optional_input_size should be a number' + optional_input_size = value + fields['optional_input_size'] = True + elif k.id == 'layer_output': + assert not fields['layer_output'], 'Duplicated field: layer_output' + assert type(value) is ast.Name, 'Value of layer_output should be ast.Name type' + layer_output = value + fields['layer_output'] = True + else: + raise AssertionError('Unexpected field in mutable layer') + # make call for this mutable layer + assert fields['layer_choice'], 'layer_choice must exist' + assert fields['layer_output'], 'layer_output must exist' + mutable_layer_id = 'mutable_layer_' + str(mutable_layer_cnt) + mutable_layer_cnt += 1 + target_call_attr = ast.Attribute(value=ast.Name(id='nni', ctx=ast.Load()), attr='mutable_layer', ctx=ast.Load()) + target_call_args = [ast.Str(s=mutable_id), + ast.Str(s=mutable_layer_id), + call_funcs, + call_kwargs] + if fields['fixed_inputs']: + target_call_args.append(fixed_inputs) + else: + target_call_args.append(ast.NameConstant(value=None)) + if fields['optional_inputs']: + target_call_args.append(optional_inputs) + assert fields['optional_input_size'], 'optional_input_size must exist when optional_inputs exists' + target_call_args.append(optional_input_size) + else: + target_call_args.append(ast.NameConstant(value=None)) + target_call = ast.Call(func=target_call_attr, args=target_call_args, keywords=[]) + node = 
ast.Assign(targets=[layer_output], value=target_call) + nodes.append(node) + return nodes def parse_annotation(code): """Parse an annotation string. @@ -235,6 +323,9 @@ def _visit_string(self, node): or string.startswith('@nni.get_next_parameter('): return parse_annotation(string[1:]) # expand annotation string to code + if string.startswith('@nni.mutable_layers('): + return parse_annotation_mutable_layers(string[1:], node.lineno) + if string.startswith('@nni.variable(') \ or string.startswith('@nni.function_choice('): self.stack[-1] = string[1:] # mark that the next expression is annotated diff --git a/tools/nni_annotation/search_space_generator.py b/tools/nni_annotation/search_space_generator.py index ed200ce934..a7ded034a2 100644 --- a/tools/nni_annotation/search_space_generator.py +++ b/tools/nni_annotation/search_space_generator.py @@ -38,7 +38,8 @@ 'qnormal', 'lognormal', 'qlognormal', - 'function_choice' + 'function_choice', + 'mutable_layer' ] @@ -50,6 +51,18 @@ def __init__(self, module_name): self.search_space = {} self.last_line = 0 # last parsed line, useful for error reporting + def generate_mutable_layer_search_space(self, args): + mutable_block = args[0].s + mutable_layer = args[1].s + if mutable_block not in self.search_space: + self.search_space[mutable_block] = dict() + self.search_space[mutable_block][mutable_layer] = { + 'layer_choice': [key.s for key in args[2].keys], + 'optional_inputs': [key.s for key in args[5].keys], + 'optional_input_size': args[6].n + } + + def visit_Call(self, node): # pylint: disable=invalid-name self.generic_visit(node) @@ -68,6 +81,10 @@ def visit_Call(self, node): # pylint: disable=invalid-name self.last_line = node.lineno + if func == 'mutable_layer': + self.generate_mutable_layer_search_space(node.args) + return node + if node.keywords: # there is a `name` argument assert len(node.keywords) == 1, 'Smart parameter has keyword argument other than "name"' diff --git 
a/tools/nni_annotation/testcase/mutable_layer_usercode/simple.py b/tools/nni_annotation/testcase/mutable_layer_usercode/simple.py new file mode 100644 index 0000000000..5838e84f03 --- /dev/null +++ b/tools/nni_annotation/testcase/mutable_layer_usercode/simple.py @@ -0,0 +1,53 @@ +import time + +def add_one(inputs): + return inputs + 1 + +def add_two(inputs): + return inputs + 2 + +def add_three(inputs): + return inputs + 3 + +def add_four(inputs): + return inputs + 4 + + +def main(): + + images = 5 + + """@nni.mutable_layers( + { + layer_choice: [add_one(), add_two(), add_three(), add_four()], + optional_inputs: [images], + optional_input_size: 1, + layer_output: layer_1_out + }, + { + layer_choice: [add_one(), add_two(), add_three(), add_four()], + optional_inputs: [layer_1_out], + optional_input_size: 1, + layer_output: layer_2_out + }, + { + layer_choice: [add_one(), add_two(), add_three(), add_four()], + optional_inputs: [layer_1_out, layer_2_out], + optional_input_size: 1, + layer_output: layer_3_out + } + )""" + + """@nni.report_intermediate_result(layer_1_out)""" + time.sleep(2) + """@nni.report_intermediate_result(layer_2_out)""" + time.sleep(2) + """@nni.report_intermediate_result(layer_3_out)""" + time.sleep(2) + + layer_3_out = layer_3_out + 10 + + """@nni.report_final_result(layer_3_out)""" + +if __name__ == '__main__': + main() diff --git a/tools/nni_cmd/config_schema.py b/tools/nni_cmd/config_schema.py index b41a1e6f09..ef5173d8b9 100644 --- a/tools/nni_cmd/config_schema.py +++ b/tools/nni_cmd/config_schema.py @@ -63,7 +63,9 @@ def setPathCheck(key): Optional('advisor'): dict, Optional('assessor'): dict, Optional('localConfig'): { - Optional('gpuIndices'): Or(int, And(str, lambda x: len([int(i) for i in x.split(',')]) > 0), error='gpuIndex format error!') + Optional('gpuIndices'): Or(int, And(str, lambda x: len([int(i) for i in x.split(',')]) > 0), error='gpuIndex format error!'), + Optional('maxTrialNumPerGpu'): setType('maxTrialNumPerGpu', int), + 
Optional('useActiveGpu'): setType('useActiveGpu', bool) } } tuner_schema_dict = { @@ -310,26 +312,30 @@ def setPathCheck(key): }) } -machine_list_schima = { +machine_list_schema = { Optional('machineList'):[Or({ 'ip': setType('ip', str), Optional('port'): setNumberRange('port', int, 1, 65535), 'username': setType('username', str), 'passwd': setType('passwd', str), - Optional('gpuIndices'): Or(int, And(str, lambda x: len([int(i) for i in x.split(',')]) > 0), error='gpuIndex format error!') + Optional('gpuIndices'): Or(int, And(str, lambda x: len([int(i) for i in x.split(',')]) > 0), error='gpuIndex format error!'), + Optional('maxTrialNumPerGpu'): setType('maxTrialNumPerGpu', int), + Optional('useActiveGpu'): setType('useActiveGpu', bool) },{ 'ip': setType('ip', str), Optional('port'): setNumberRange('port', int, 1, 65535), 'username': setType('username', str), 'sshKeyPath': setPathCheck('sshKeyPath'), Optional('passphrase'): setType('passphrase', str), - Optional('gpuIndices'): Or(int, And(str, lambda x: len([int(i) for i in x.split(',')]) > 0), error='gpuIndex format error!') + Optional('gpuIndices'): Or(int, And(str, lambda x: len([int(i) for i in x.split(',')]) > 0), error='gpuIndex format error!'), + Optional('maxTrialNumPerGpu'): setType('maxTrialNumPerGpu', int), + Optional('useActiveGpu'): setType('useActiveGpu', bool) })] } LOCAL_CONFIG_SCHEMA = Schema({**common_schema, **common_trial_schema}) -REMOTE_CONFIG_SCHEMA = Schema({**common_schema, **common_trial_schema, **machine_list_schima}) +REMOTE_CONFIG_SCHEMA = Schema({**common_schema, **common_trial_schema, **machine_list_schema}) PAI_CONFIG_SCHEMA = Schema({**common_schema, **pai_trial_schema, **pai_config_schema}) diff --git a/tools/nni_cmd/config_utils.py b/tools/nni_cmd/config_utils.py index 646aaa7316..3883330041 100644 --- a/tools/nni_cmd/config_utils.py +++ b/tools/nni_cmd/config_utils.py @@ -119,4 +119,4 @@ def read_file(self): return json.load(file) except ValueError: return {} - return {} \ No 
newline at end of file + return {} diff --git a/tools/nni_cmd/constants.py b/tools/nni_cmd/constants.py index 1287539293..77bef794b4 100644 --- a/tools/nni_cmd/constants.py +++ b/tools/nni_cmd/constants.py @@ -86,12 +86,13 @@ 'Anneal', 'GridSearch', 'MetisTuner', - 'BOHB' + 'BOHB', + 'SMAC', + 'BatchTuner' } TUNERS_NO_NEED_TO_IMPORT_DATA = { 'Random', - 'Batch_tuner', 'Hyperband' } diff --git a/tools/nni_cmd/launcher.py b/tools/nni_cmd/launcher.py index 75fe952d32..8e0eb57fc6 100644 --- a/tools/nni_cmd/launcher.py +++ b/tools/nni_cmd/launcher.py @@ -125,18 +125,17 @@ def start_rest_server(port, platform, mode, config_file_name, experiment_id=None if mode == 'resume': cmds += ['--experiment_id', experiment_id] stdout_full_path, stderr_full_path = get_log_path(config_file_name) - stdout_file = open(stdout_full_path, 'a+') - stderr_file = open(stderr_full_path, 'a+') - time_now = time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) - #add time information in the header of log files - log_header = LOG_HEADER % str(time_now) - stdout_file.write(log_header) - stderr_file.write(log_header) - if sys.platform == 'win32': - from subprocess import CREATE_NEW_PROCESS_GROUP - process = Popen(cmds, cwd=entry_dir, stdout=stdout_file, stderr=stderr_file, creationflags=CREATE_NEW_PROCESS_GROUP) - else: - process = Popen(cmds, cwd=entry_dir, stdout=stdout_file, stderr=stderr_file) + with open(stdout_full_path, 'a+') as stdout_file, open(stderr_full_path, 'a+') as stderr_file: + time_now = time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())) + #add time information in the header of log files + log_header = LOG_HEADER % str(time_now) + stdout_file.write(log_header) + stderr_file.write(log_header) + if sys.platform == 'win32': + from subprocess import CREATE_NEW_PROCESS_GROUP + process = Popen(cmds, cwd=entry_dir, stdout=stdout_file, stderr=stderr_file, creationflags=CREATE_NEW_PROCESS_GROUP) + else: + process = Popen(cmds, cwd=entry_dir, stdout=stdout_file, 
                                stderr=stderr_file)
     return process, str(time_now)
 
 def set_trial_config(experiment_config, port, config_file_name):
@@ -160,9 +159,13 @@ def set_local_config(experiment_config, port, config_file_name):
     request_data = dict()
     if experiment_config.get('localConfig'):
         request_data['local_config'] = experiment_config['localConfig']
-        if request_data['local_config'] and request_data['local_config'].get('gpuIndices') \
-        and isinstance(request_data['local_config'].get('gpuIndices'), int):
-            request_data['local_config']['gpuIndices'] = str(request_data['local_config'].get('gpuIndices'))
+        if request_data['local_config']:
+            if request_data['local_config'].get('gpuIndices') and isinstance(request_data['local_config'].get('gpuIndices'), int):
+                request_data['local_config']['gpuIndices'] = str(request_data['local_config'].get('gpuIndices'))
+            if request_data['local_config'].get('maxTrialNumOnEachGpu'):
+                request_data['local_config']['maxTrialNumOnEachGpu'] = request_data['local_config'].get('maxTrialNumOnEachGpu')
+            if request_data['local_config'].get('useActiveGpu'):
+                request_data['local_config']['useActiveGpu'] = request_data['local_config'].get('useActiveGpu')
     response = rest_put(cluster_metadata_url(port), json.dumps(request_data), REST_TIME_OUT)
     err_message = ''
     if not response or not check_response(response):
@@ -343,6 +346,13 @@ def set_experiment(experiment_config, mode, port, config_file_name):
 def launch_experiment(args, experiment_config, mode, config_file_name, experiment_id=None):
     '''follow steps to start rest server and start experiment'''
     nni_config = Config(config_file_name)
+    # check execution policy in powershell
+    if sys.platform == 'win32':
+        execution_policy = check_output(['powershell.exe', 'Get-ExecutionPolicy']).decode('ascii').strip()
+        if execution_policy == 'Restricted':
+            print_error('PowerShell execution policy error, please run PowerShell as administrator with this command first:\r\n'\
+                + '\'Set-ExecutionPolicy -ExecutionPolicy Unrestricted\'')
+            exit(1)
     # check packages for tuner
     package_name, module_name = None, None
     if experiment_config.get('tuner') and experiment_config['tuner'].get('builtinTunerName'):
diff --git a/tools/nni_cmd/nnictl_utils.py b/tools/nni_cmd/nnictl_utils.py
index 6aa25a9aba..0fe3785379 100644
--- a/tools/nni_cmd/nnictl_utils.py
+++ b/tools/nni_cmd/nnictl_utils.py
@@ -26,8 +26,8 @@
 import time
 from subprocess import call, check_output
 from .rest_utils import rest_get, rest_delete, check_rest_server_quick, check_response
+from .url_utils import trial_jobs_url, experiment_url, trial_job_id_url, export_data_url
 from .config_utils import Config, Experiments
-from .url_utils import trial_jobs_url, experiment_url, trial_job_id_url
 from .constants import NNICTL_HOME_DIR, EXPERIMENT_INFORMATION_FORMAT, EXPERIMENT_DETAIL_FORMAT, \
     EXPERIMENT_MONITOR_INFO, TRIAL_MONITOR_HEAD, TRIAL_MONITOR_CONTENT, TRIAL_MONITOR_TAIL, REST_TIME_OUT
 from .common_utils import print_normal, print_error, print_warning, detect_process
@@ -450,30 +450,9 @@ def monitor_experiment(args):
         print_error(exception)
         exit(1)
 
-
-def parse_trial_data(content):
-    """output: List[Dict]"""
-    trial_records = []
-    for trial_data in content:
-        for phase_i in range(len(trial_data['hyperParameters'])):
-            hparam = json.loads(trial_data['hyperParameters'][phase_i])['parameters']
-            hparam['id'] = trial_data['id']
-            if 'finalMetricData' in trial_data.keys() and phase_i < len(trial_data['finalMetricData']):
-                reward = json.loads(trial_data['finalMetricData'][phase_i]['data'])
-                if isinstance(reward, (float, int)):
-                    dict_tmp = {**hparam, **{'reward': reward}}
-                elif isinstance(reward, dict):
-                    dict_tmp = {**hparam, **reward}
-                else:
-                    raise ValueError("Invalid finalMetricsData format: {}/{}".format(type(reward), reward))
-            else:
-                dict_tmp = hparam
-            trial_records.append(dict_tmp)
-    return trial_records
-
 def export_trials_data(args):
-    """export experiment metadata to csv
-    """
+    '''export experiment metadata to csv
+    '''
     nni_config = Config(get_config_filename(args))
     rest_port = nni_config.get_config('restServerPort')
     rest_pid = nni_config.get_config('restServerPid')
@@ -482,26 +461,28 @@ def export_trials_data(args):
         return
     running, response = check_rest_server_quick(rest_port)
     if running:
-        response = rest_get(trial_jobs_url(rest_port), 20)
+        response = rest_get(export_data_url(rest_port), 20)
         if response is not None and check_response(response):
-            content = json.loads(response.text)
-            # dframe = pd.DataFrame.from_records([parse_trial_data(t_data) for t_data in content])
-            # dframe.to_csv(args.csv_path, sep='\t')
-            records = parse_trial_data(content)
             if args.type == 'json':
-                json_records = []
-                for trial in records:
-                    value = trial.pop('reward', None)
-                    trial_id = trial.pop('id', None)
-                    json_records.append({'parameter': trial, 'value': value, 'id': trial_id})
-            with open(args.path, 'w') as file:
-                if args.type == 'csv':
-                    writer = csv.DictWriter(file, set.union(*[set(r.keys()) for r in records]))
+                with open(args.path, 'w') as file:
+                    file.write(response.text)
+            elif args.type == 'csv':
+                content = json.loads(response.text)
+                trial_records = []
+                for record in content:
+                    if not isinstance(record['value'], (float, int)):
+                        formated_record = {**record['parameter'], **record['value'], **{'id': record['id']}}
+                    else:
+                        formated_record = {**record['parameter'], **{'reward': record['value'], 'id': record['id']}}
+                    trial_records.append(formated_record)
+                with open(args.path, 'w') as file:
+                    writer = csv.DictWriter(file, set.union(*[set(r.keys()) for r in trial_records]))
                     writer.writeheader()
-                    writer.writerows(records)
-                else:
-                    json.dump(json_records, file)
+                    writer.writerows(trial_records)
+            else:
+                print_error('Unknown type: %s' % args.type)
+                exit(1)
         else:
             print_error('Export failed...')
     else:
-        print_error('Restful server is not Running')
+        print_error('Restful server is not Running')
\ No newline at end of file
diff --git a/tools/nni_cmd/tensorboard_utils.py b/tools/nni_cmd/tensorboard_utils.py
index 67bbfab76f..b4578c34b0 100644
--- a/tools/nni_cmd/tensorboard_utils.py
+++ b/tools/nni_cmd/tensorboard_utils.py
@@ -94,11 +94,9 @@ def start_tensorboard_process(args, nni_config, path_list, temp_nni_path):
     if detect_port(args.port):
         print_error('Port %s is used by another process, please reset port!' % str(args.port))
         exit(1)
-
-    stdout_file = open(os.path.join(temp_nni_path, 'tensorboard_stdout'), 'a+')
-    stderr_file = open(os.path.join(temp_nni_path, 'tensorboard_stderr'), 'a+')
-    cmds = ['tensorboard', '--logdir', format_tensorboard_log_path(path_list), '--port', str(args.port)]
-    tensorboard_process = Popen(cmds, stdout=stdout_file, stderr=stderr_file)
+    with open(os.path.join(temp_nni_path, 'tensorboard_stdout'), 'a+') as stdout_file, open(os.path.join(temp_nni_path, 'tensorboard_stderr'), 'a+') as stderr_file:
+        cmds = ['tensorboard', '--logdir', format_tensorboard_log_path(path_list), '--port', str(args.port)]
+        tensorboard_process = Popen(cmds, stdout=stdout_file, stderr=stderr_file)
     url_list = get_local_urls(args.port)
     print_normal(COLOR_GREEN_FORMAT % 'Start tensorboard success!\n' + 'Tensorboard urls: ' + ' '.join(url_list))
     tensorboard_process_pid_list = nni_config.get_config('tensorboardPidList')
diff --git a/tools/nni_cmd/url_utils.py b/tools/nni_cmd/url_utils.py
index 0e77a77b99..0dce341a59 100644
--- a/tools/nni_cmd/url_utils.py
+++ b/tools/nni_cmd/url_utils.py
@@ -35,6 +35,8 @@
 
 TRIAL_JOBS_API = '/trial-jobs'
 
+EXPORT_DATA_API = '/export-data'
+
 TENSORBOARD_API = '/tensorboard'
 
 
@@ -68,6 +70,11 @@ def trial_job_id_url(port, job_id):
     return '{0}:{1}{2}{3}/:{4}'.format(BASE_URL, port, API_ROOT_URL, TRIAL_JOBS_API, job_id)
 
 
+def export_data_url(port):
+    '''get export_data url'''
+    return '{0}:{1}{2}{3}'.format(BASE_URL, port, API_ROOT_URL, EXPORT_DATA_API)
+
+
 def tensorboard_url(port):
     '''get tensorboard url'''
     return '{0}:{1}{2}{3}'.format(BASE_URL, port, API_ROOT_URL, TENSORBOARD_API)
diff --git a/uninstall.ps1 b/uninstall.ps1
index 29446f3836..578a4f24b7 100644
--- a/uninstall.ps1
+++ b/uninstall.ps1
@@ -1,5 +1,4 @@
-
-$NNI_DEPENDENCY_FOLDER = "C:\tmp\$env:USERNAME"
+$NNI_DEPENDENCY_FOLDER = [System.IO.Path]::GetTempPath()+$env:USERNAME
 $env:PYTHONIOENCODING = "UTF-8"
 
 if($env:VIRTUAL_ENV){
@@ -27,4 +26,4 @@ Remove-Item "src/nni_manager/node_modules" -Recurse -Force
 Remove-Item "src/webui/build" -Recurse -Force
 Remove-Item "src/webui/node_modules" -Recurse -Force
 Remove-Item $NNI_YARN_FOLDER -Recurse -Force
-Remove-Item $NNI_NODE_FOLDER -Recurse -Force
\ No newline at end of file
+Remove-Item $NNI_NODE_FOLDER -Recurse -Force
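
---

The csv branch of the patched `export_trials_data` flattens each exported record into one row per trial: a dict-valued `value` is merged into the row, while a scalar `value` becomes a `reward` column, and the csv header is the union of all keys. A minimal standalone sketch of that transformation, using hypothetical sample data (the function name `flatten_records` and the sample records are illustrative, not part of the patch):

```python
import csv
import io
import json

def flatten_records(content):
    """Flatten exported trial records into one flat dict per trial.

    Mirrors the csv branch of the patch: a dict 'value' merges into the
    row; a numeric 'value' is stored under a 'reward' key.
    """
    trial_records = []
    for record in content:
        if not isinstance(record['value'], (float, int)):
            formated_record = {**record['parameter'], **record['value'], 'id': record['id']}
        else:
            formated_record = {**record['parameter'], 'reward': record['value'], 'id': record['id']}
        trial_records.append(formated_record)
    return trial_records

# Hypothetical sample of exported data: one scalar metric, one dict metric.
content = json.loads(
    '[{"parameter": {"lr": 0.1}, "value": 0.93, "id": "A"},'
    ' {"parameter": {"lr": 0.2}, "value": {"default": 0.95, "loss": 0.2}, "id": "B"}]'
)
records = flatten_records(content)
# Header is the union of keys across rows, as in the patch; missing keys
# in a row are written as empty cells by DictWriter's default restval.
fieldnames = set.union(*[set(r.keys()) for r in records])
buf = io.StringIO()
writer = csv.DictWriter(buf, sorted(fieldnames))
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Note that because the header is a set union, rows with a scalar metric leave the dict-metric columns blank (and vice versa); `csv.DictWriter` tolerates missing keys but would raise on unexpected extra keys.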