diff --git a/.gitignore b/.gitignore
index a0f8ba7486..bc67f86f1d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -81,6 +81,7 @@ typings/
__pycache__
build
*.egg-info
+.eggs/
setup.pye
**/__init__.pye
**/.ipynb_checkpoints
diff --git a/README.md b/README.md
index e03b339b80..42e6aa3552 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@
**NNI (Neural Network Intelligence)** is a lightweight but powerful toolkit to help users **automate** Feature Engineering, Neural Architecture Search, Hyperparameter Tuning and Model Compression.
-The tool manages automated machine learning (AutoML) experiments, **dispatches and runs** experiments' trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in **different training environments** like Local Machine, Remote Servers, OpenPAI, Kubeflow, FrameworkController on K8S (AKS etc.), DLWorkspace (aka. DLTS), AML (Azure Machine Learning) and other cloud options.
+The tool manages automated machine learning (AutoML) experiments, **dispatches and runs** experiments' trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in **different training environments** like Local Machine, Remote Servers, OpenPAI, Kubeflow, FrameworkController on K8S (AKS etc.), DLWorkspace (aka. DLTS), AML (Azure Machine Learning), AdaptDL (aka. ADL) and other cloud options.
## **Who should consider using NNI**
@@ -173,11 +173,13 @@ Within the following table, we summarized the current NNI capabilities, we are g
Remote Servers
AML(Azure Machine Learning)
Kubernetes based services
-
-
diff --git a/docs/en_US/Assessor/BuiltinAssessor.md b/docs/archive_en_US/Assessor/BuiltinAssessor.md
similarity index 100%
rename from docs/en_US/Assessor/BuiltinAssessor.md
rename to docs/archive_en_US/Assessor/BuiltinAssessor.md
diff --git a/docs/en_US/Assessor/CurvefittingAssessor.md b/docs/archive_en_US/Assessor/CurvefittingAssessor.md
similarity index 100%
rename from docs/en_US/Assessor/CurvefittingAssessor.md
rename to docs/archive_en_US/Assessor/CurvefittingAssessor.md
diff --git a/docs/en_US/Assessor/CustomizeAssessor.md b/docs/archive_en_US/Assessor/CustomizeAssessor.md
similarity index 100%
rename from docs/en_US/Assessor/CustomizeAssessor.md
rename to docs/archive_en_US/Assessor/CustomizeAssessor.md
diff --git a/docs/en_US/Assessor/MedianstopAssessor.md b/docs/archive_en_US/Assessor/MedianstopAssessor.md
similarity index 100%
rename from docs/en_US/Assessor/MedianstopAssessor.md
rename to docs/archive_en_US/Assessor/MedianstopAssessor.md
diff --git a/docs/en_US/CommunitySharings/AutoCompletion.md b/docs/archive_en_US/CommunitySharings/AutoCompletion.md
similarity index 100%
rename from docs/en_US/CommunitySharings/AutoCompletion.md
rename to docs/archive_en_US/CommunitySharings/AutoCompletion.md
diff --git a/docs/en_US/CommunitySharings/HpoComparison.md b/docs/archive_en_US/CommunitySharings/HpoComparison.md
similarity index 100%
rename from docs/en_US/CommunitySharings/HpoComparison.md
rename to docs/archive_en_US/CommunitySharings/HpoComparison.md
diff --git a/docs/en_US/CommunitySharings/ModelCompressionComparison.md b/docs/archive_en_US/CommunitySharings/ModelCompressionComparison.md
similarity index 100%
rename from docs/en_US/CommunitySharings/ModelCompressionComparison.md
rename to docs/archive_en_US/CommunitySharings/ModelCompressionComparison.md
diff --git a/docs/en_US/CommunitySharings/NNI_AutoFeatureEng.md b/docs/archive_en_US/CommunitySharings/NNI_AutoFeatureEng.md
similarity index 100%
rename from docs/en_US/CommunitySharings/NNI_AutoFeatureEng.md
rename to docs/archive_en_US/CommunitySharings/NNI_AutoFeatureEng.md
diff --git a/docs/en_US/CommunitySharings/NNI_colab_support.md b/docs/archive_en_US/CommunitySharings/NNI_colab_support.md
similarity index 100%
rename from docs/en_US/CommunitySharings/NNI_colab_support.md
rename to docs/archive_en_US/CommunitySharings/NNI_colab_support.md
diff --git a/docs/en_US/CommunitySharings/NasComparison.md b/docs/archive_en_US/CommunitySharings/NasComparison.md
similarity index 100%
rename from docs/en_US/CommunitySharings/NasComparison.md
rename to docs/archive_en_US/CommunitySharings/NasComparison.md
diff --git a/docs/en_US/CommunitySharings/ParallelizingTpeSearch.md b/docs/archive_en_US/CommunitySharings/ParallelizingTpeSearch.md
similarity index 100%
rename from docs/en_US/CommunitySharings/ParallelizingTpeSearch.md
rename to docs/archive_en_US/CommunitySharings/ParallelizingTpeSearch.md
diff --git a/docs/en_US/CommunitySharings/RecommendersSvd.md b/docs/archive_en_US/CommunitySharings/RecommendersSvd.md
similarity index 100%
rename from docs/en_US/CommunitySharings/RecommendersSvd.md
rename to docs/archive_en_US/CommunitySharings/RecommendersSvd.md
diff --git a/docs/en_US/CommunitySharings/SptagAutoTune.md b/docs/archive_en_US/CommunitySharings/SptagAutoTune.md
similarity index 100%
rename from docs/en_US/CommunitySharings/SptagAutoTune.md
rename to docs/archive_en_US/CommunitySharings/SptagAutoTune.md
diff --git a/docs/en_US/Compression/AutoPruningUsingTuners.md b/docs/archive_en_US/Compression/AutoPruningUsingTuners.md
similarity index 100%
rename from docs/en_US/Compression/AutoPruningUsingTuners.md
rename to docs/archive_en_US/Compression/AutoPruningUsingTuners.md
diff --git a/docs/en_US/Compression/CompressionReference.md b/docs/archive_en_US/Compression/CompressionReference.md
similarity index 100%
rename from docs/en_US/Compression/CompressionReference.md
rename to docs/archive_en_US/Compression/CompressionReference.md
diff --git a/docs/en_US/Compression/CompressionUtils.md b/docs/archive_en_US/Compression/CompressionUtils.md
similarity index 100%
rename from docs/en_US/Compression/CompressionUtils.md
rename to docs/archive_en_US/Compression/CompressionUtils.md
diff --git a/docs/en_US/Compression/CustomizeCompressor.md b/docs/archive_en_US/Compression/CustomizeCompressor.md
similarity index 100%
rename from docs/en_US/Compression/CustomizeCompressor.md
rename to docs/archive_en_US/Compression/CustomizeCompressor.md
diff --git a/docs/en_US/Compression/DependencyAware.md b/docs/archive_en_US/Compression/DependencyAware.md
similarity index 100%
rename from docs/en_US/Compression/DependencyAware.md
rename to docs/archive_en_US/Compression/DependencyAware.md
diff --git a/docs/en_US/Compression/Framework.md b/docs/archive_en_US/Compression/Framework.md
similarity index 100%
rename from docs/en_US/Compression/Framework.md
rename to docs/archive_en_US/Compression/Framework.md
diff --git a/docs/en_US/Compression/ModelSpeedup.md b/docs/archive_en_US/Compression/ModelSpeedup.md
similarity index 100%
rename from docs/en_US/Compression/ModelSpeedup.md
rename to docs/archive_en_US/Compression/ModelSpeedup.md
diff --git a/docs/en_US/Compression/Overview.md b/docs/archive_en_US/Compression/Overview.md
similarity index 100%
rename from docs/en_US/Compression/Overview.md
rename to docs/archive_en_US/Compression/Overview.md
diff --git a/docs/en_US/Compression/Pruner.md b/docs/archive_en_US/Compression/Pruner.md
similarity index 100%
rename from docs/en_US/Compression/Pruner.md
rename to docs/archive_en_US/Compression/Pruner.md
diff --git a/docs/en_US/Compression/Quantizer.md b/docs/archive_en_US/Compression/Quantizer.md
similarity index 100%
rename from docs/en_US/Compression/Quantizer.md
rename to docs/archive_en_US/Compression/Quantizer.md
diff --git a/docs/en_US/Compression/QuickStart.md b/docs/archive_en_US/Compression/QuickStart.md
similarity index 100%
rename from docs/en_US/Compression/QuickStart.md
rename to docs/archive_en_US/Compression/QuickStart.md
diff --git a/docs/en_US/FeatureEngineering/GBDTSelector.md b/docs/archive_en_US/FeatureEngineering/GBDTSelector.md
similarity index 100%
rename from docs/en_US/FeatureEngineering/GBDTSelector.md
rename to docs/archive_en_US/FeatureEngineering/GBDTSelector.md
diff --git a/docs/en_US/FeatureEngineering/GradientFeatureSelector.md b/docs/archive_en_US/FeatureEngineering/GradientFeatureSelector.md
similarity index 100%
rename from docs/en_US/FeatureEngineering/GradientFeatureSelector.md
rename to docs/archive_en_US/FeatureEngineering/GradientFeatureSelector.md
diff --git a/docs/en_US/FeatureEngineering/Overview.md b/docs/archive_en_US/FeatureEngineering/Overview.md
similarity index 100%
rename from docs/en_US/FeatureEngineering/Overview.md
rename to docs/archive_en_US/FeatureEngineering/Overview.md
diff --git a/docs/en_US/NAS/Advanced.md b/docs/archive_en_US/NAS/Advanced.md
similarity index 100%
rename from docs/en_US/NAS/Advanced.md
rename to docs/archive_en_US/NAS/Advanced.md
diff --git a/docs/en_US/NAS/Benchmarks.md b/docs/archive_en_US/NAS/Benchmarks.md
similarity index 100%
rename from docs/en_US/NAS/Benchmarks.md
rename to docs/archive_en_US/NAS/Benchmarks.md
diff --git a/docs/en_US/NAS/CDARTS.md b/docs/archive_en_US/NAS/CDARTS.md
similarity index 100%
rename from docs/en_US/NAS/CDARTS.md
rename to docs/archive_en_US/NAS/CDARTS.md
diff --git a/docs/en_US/NAS/ClassicNas.md b/docs/archive_en_US/NAS/ClassicNas.md
similarity index 100%
rename from docs/en_US/NAS/ClassicNas.md
rename to docs/archive_en_US/NAS/ClassicNas.md
diff --git a/docs/en_US/NAS/Cream.md b/docs/archive_en_US/NAS/Cream.md
similarity index 100%
rename from docs/en_US/NAS/Cream.md
rename to docs/archive_en_US/NAS/Cream.md
diff --git a/docs/en_US/NAS/DARTS.md b/docs/archive_en_US/NAS/DARTS.md
similarity index 100%
rename from docs/en_US/NAS/DARTS.md
rename to docs/archive_en_US/NAS/DARTS.md
diff --git a/docs/en_US/NAS/ENAS.md b/docs/archive_en_US/NAS/ENAS.md
similarity index 100%
rename from docs/en_US/NAS/ENAS.md
rename to docs/archive_en_US/NAS/ENAS.md
diff --git a/docs/en_US/NAS/NasGuide.md b/docs/archive_en_US/NAS/NasGuide.md
similarity index 100%
rename from docs/en_US/NAS/NasGuide.md
rename to docs/archive_en_US/NAS/NasGuide.md
diff --git a/docs/en_US/NAS/NasReference.md b/docs/archive_en_US/NAS/NasReference.md
similarity index 100%
rename from docs/en_US/NAS/NasReference.md
rename to docs/archive_en_US/NAS/NasReference.md
diff --git a/docs/en_US/NAS/Overview.md b/docs/archive_en_US/NAS/Overview.md
similarity index 100%
rename from docs/en_US/NAS/Overview.md
rename to docs/archive_en_US/NAS/Overview.md
diff --git a/docs/en_US/NAS/PDARTS.md b/docs/archive_en_US/NAS/PDARTS.md
similarity index 100%
rename from docs/en_US/NAS/PDARTS.md
rename to docs/archive_en_US/NAS/PDARTS.md
diff --git a/docs/en_US/NAS/Proxylessnas.md b/docs/archive_en_US/NAS/Proxylessnas.md
similarity index 100%
rename from docs/en_US/NAS/Proxylessnas.md
rename to docs/archive_en_US/NAS/Proxylessnas.md
diff --git a/docs/en_US/NAS/SPOS.md b/docs/archive_en_US/NAS/SPOS.md
similarity index 100%
rename from docs/en_US/NAS/SPOS.md
rename to docs/archive_en_US/NAS/SPOS.md
diff --git a/docs/en_US/NAS/SearchSpaceZoo.md b/docs/archive_en_US/NAS/SearchSpaceZoo.md
similarity index 100%
rename from docs/en_US/NAS/SearchSpaceZoo.md
rename to docs/archive_en_US/NAS/SearchSpaceZoo.md
diff --git a/docs/en_US/NAS/TextNAS.md b/docs/archive_en_US/NAS/TextNAS.md
similarity index 100%
rename from docs/en_US/NAS/TextNAS.md
rename to docs/archive_en_US/NAS/TextNAS.md
diff --git a/docs/en_US/NAS/Visualization.md b/docs/archive_en_US/NAS/Visualization.md
similarity index 100%
rename from docs/en_US/NAS/Visualization.md
rename to docs/archive_en_US/NAS/Visualization.md
diff --git a/docs/en_US/NAS/WriteSearchSpace.md b/docs/archive_en_US/NAS/WriteSearchSpace.md
similarity index 100%
rename from docs/en_US/NAS/WriteSearchSpace.md
rename to docs/archive_en_US/NAS/WriteSearchSpace.md
diff --git a/docs/en_US/Overview.md b/docs/archive_en_US/Overview.md
similarity index 100%
rename from docs/en_US/Overview.md
rename to docs/archive_en_US/Overview.md
diff --git a/docs/en_US/Release.md b/docs/archive_en_US/Release.md
similarity index 100%
rename from docs/en_US/Release.md
rename to docs/archive_en_US/Release.md
diff --git a/docs/en_US/ResearchPublications.md b/docs/archive_en_US/ResearchPublications.md
similarity index 100%
rename from docs/en_US/ResearchPublications.md
rename to docs/archive_en_US/ResearchPublications.md
diff --git a/docs/en_US/SupportedFramework_Library.md b/docs/archive_en_US/SupportedFramework_Library.md
similarity index 100%
rename from docs/en_US/SupportedFramework_Library.md
rename to docs/archive_en_US/SupportedFramework_Library.md
diff --git a/docs/en_US/TrainingService/AMLMode.md b/docs/archive_en_US/TrainingService/AMLMode.md
similarity index 100%
rename from docs/en_US/TrainingService/AMLMode.md
rename to docs/archive_en_US/TrainingService/AMLMode.md
diff --git a/docs/en_US/TrainingService/AdaptDLMode.md b/docs/archive_en_US/TrainingService/AdaptDLMode.md
similarity index 100%
rename from docs/en_US/TrainingService/AdaptDLMode.md
rename to docs/archive_en_US/TrainingService/AdaptDLMode.md
diff --git a/docs/en_US/TrainingService/DLTSMode.md b/docs/archive_en_US/TrainingService/DLTSMode.md
similarity index 100%
rename from docs/en_US/TrainingService/DLTSMode.md
rename to docs/archive_en_US/TrainingService/DLTSMode.md
diff --git a/docs/en_US/TrainingService/FrameworkControllerMode.md b/docs/archive_en_US/TrainingService/FrameworkControllerMode.md
similarity index 100%
rename from docs/en_US/TrainingService/FrameworkControllerMode.md
rename to docs/archive_en_US/TrainingService/FrameworkControllerMode.md
diff --git a/docs/en_US/TrainingService/HowToImplementTrainingService.md b/docs/archive_en_US/TrainingService/HowToImplementTrainingService.md
similarity index 100%
rename from docs/en_US/TrainingService/HowToImplementTrainingService.md
rename to docs/archive_en_US/TrainingService/HowToImplementTrainingService.md
diff --git a/docs/en_US/TrainingService/KubeflowMode.md b/docs/archive_en_US/TrainingService/KubeflowMode.md
similarity index 100%
rename from docs/en_US/TrainingService/KubeflowMode.md
rename to docs/archive_en_US/TrainingService/KubeflowMode.md
diff --git a/docs/en_US/TrainingService/LocalMode.md b/docs/archive_en_US/TrainingService/LocalMode.md
similarity index 100%
rename from docs/en_US/TrainingService/LocalMode.md
rename to docs/archive_en_US/TrainingService/LocalMode.md
diff --git a/docs/en_US/TrainingService/Overview.md b/docs/archive_en_US/TrainingService/Overview.md
similarity index 100%
rename from docs/en_US/TrainingService/Overview.md
rename to docs/archive_en_US/TrainingService/Overview.md
diff --git a/docs/en_US/TrainingService/PaiMode.md b/docs/archive_en_US/TrainingService/PaiMode.md
similarity index 100%
rename from docs/en_US/TrainingService/PaiMode.md
rename to docs/archive_en_US/TrainingService/PaiMode.md
diff --git a/docs/en_US/TrainingService/PaiYarnMode.md b/docs/archive_en_US/TrainingService/PaiYarnMode.md
similarity index 100%
rename from docs/en_US/TrainingService/PaiYarnMode.md
rename to docs/archive_en_US/TrainingService/PaiYarnMode.md
diff --git a/docs/en_US/TrainingService/RemoteMachineMode.md b/docs/archive_en_US/TrainingService/RemoteMachineMode.md
similarity index 100%
rename from docs/en_US/TrainingService/RemoteMachineMode.md
rename to docs/archive_en_US/TrainingService/RemoteMachineMode.md
diff --git a/docs/en_US/TrialExample/Cifar10Examples.md b/docs/archive_en_US/TrialExample/Cifar10Examples.md
similarity index 100%
rename from docs/en_US/TrialExample/Cifar10Examples.md
rename to docs/archive_en_US/TrialExample/Cifar10Examples.md
diff --git a/docs/en_US/TrialExample/EfficientNet.md b/docs/archive_en_US/TrialExample/EfficientNet.md
similarity index 67%
rename from docs/en_US/TrialExample/EfficientNet.md
rename to docs/archive_en_US/TrialExample/EfficientNet.md
index e22da7e42e..f71a0f7f08 100644
--- a/docs/en_US/TrialExample/EfficientNet.md
+++ b/docs/archive_en_US/TrialExample/EfficientNet.md
@@ -9,7 +9,7 @@ Use Grid search to find the best combination of alpha, beta and gamma for Effici
[Example code](https://github.com/microsoft/nni/tree/v1.9/examples/trials/efficientnet)
1. Set your working directory here in the example code directory.
-2. Run `git clone https://github.com/ultmaster/EfficientNet-PyTorch` to clone this modified version of [EfficientNet-PyTorch](https://github.com/lukemelas/EfficientNet-PyTorch). The modifications were done to adhere to the original [Tensorflow version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet) as close as possible (including EMA, label smoothing and etc.); also added are the part which gets parameters from tuner and reports intermediate/final results. Clone it into `EfficientNet-PyTorch`; the files like `main.py`, `train_imagenet.sh` will appear inside, as specified in the configuration files.
+2. Run `git clone https://github.com/ultmaster/EfficientNet-PyTorch` to clone the [ultmaster modified version](https://github.com/ultmaster/EfficientNet-PyTorch) of the original [EfficientNet-PyTorch](https://github.com/lukemelas/EfficientNet-PyTorch). The modifications adhere to the original [Tensorflow version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet) as closely as possible (including EMA, label smoothing, etc.); they also add the part that gets parameters from the tuner and reports intermediate/final results. Clone it into `EfficientNet-PyTorch`; files such as `main.py` and `train_imagenet.sh` will appear inside, as specified in the configuration files.
3. Run `nnictl create --config config_local.yml` (use `config_pai.yml` for OpenPAI) to find the best EfficientNet-B1. Adjust the training service (PAI/local/remote), batch size in the config files according to the environment.
For training on ImageNet, read `EfficientNet-PyTorch/train_imagenet.sh`. Download ImageNet beforehand and extract it adhering to [PyTorch format](https://pytorch.org/docs/stable/torchvision/datasets.html#imagenet) and then replace `/mnt/data/imagenet` in with the location of the ImageNet storage. This file should also be a good example to follow for mounting ImageNet into the container on OpenPAI.
diff --git a/docs/en_US/TrialExample/GbdtExample.md b/docs/archive_en_US/TrialExample/GbdtExample.md
similarity index 100%
rename from docs/en_US/TrialExample/GbdtExample.md
rename to docs/archive_en_US/TrialExample/GbdtExample.md
diff --git a/docs/en_US/TrialExample/KDExample.md b/docs/archive_en_US/TrialExample/KDExample.md
similarity index 100%
rename from docs/en_US/TrialExample/KDExample.md
rename to docs/archive_en_US/TrialExample/KDExample.md
diff --git a/docs/en_US/TrialExample/MnistExamples.md b/docs/archive_en_US/TrialExample/MnistExamples.md
similarity index 100%
rename from docs/en_US/TrialExample/MnistExamples.md
rename to docs/archive_en_US/TrialExample/MnistExamples.md
diff --git a/docs/en_US/TrialExample/OpEvoExamples.md b/docs/archive_en_US/TrialExample/OpEvoExamples.md
similarity index 100%
rename from docs/en_US/TrialExample/OpEvoExamples.md
rename to docs/archive_en_US/TrialExample/OpEvoExamples.md
diff --git a/docs/en_US/TrialExample/RocksdbExamples.md b/docs/archive_en_US/TrialExample/RocksdbExamples.md
similarity index 100%
rename from docs/en_US/TrialExample/RocksdbExamples.md
rename to docs/archive_en_US/TrialExample/RocksdbExamples.md
diff --git a/docs/en_US/TrialExample/SklearnExamples.md b/docs/archive_en_US/TrialExample/SklearnExamples.md
similarity index 100%
rename from docs/en_US/TrialExample/SklearnExamples.md
rename to docs/archive_en_US/TrialExample/SklearnExamples.md
diff --git a/docs/en_US/TrialExample/SquadEvolutionExamples.md b/docs/archive_en_US/TrialExample/SquadEvolutionExamples.md
similarity index 100%
rename from docs/en_US/TrialExample/SquadEvolutionExamples.md
rename to docs/archive_en_US/TrialExample/SquadEvolutionExamples.md
diff --git a/docs/en_US/TrialExample/Trials.md b/docs/archive_en_US/TrialExample/Trials.md
similarity index 100%
rename from docs/en_US/TrialExample/Trials.md
rename to docs/archive_en_US/TrialExample/Trials.md
diff --git a/docs/en_US/Tuner/BatchTuner.md b/docs/archive_en_US/Tuner/BatchTuner.md
similarity index 100%
rename from docs/en_US/Tuner/BatchTuner.md
rename to docs/archive_en_US/Tuner/BatchTuner.md
diff --git a/docs/en_US/Tuner/BohbAdvisor.md b/docs/archive_en_US/Tuner/BohbAdvisor.md
similarity index 100%
rename from docs/en_US/Tuner/BohbAdvisor.md
rename to docs/archive_en_US/Tuner/BohbAdvisor.md
diff --git a/docs/en_US/Tuner/BuiltinTuner.md b/docs/archive_en_US/Tuner/BuiltinTuner.md
similarity index 100%
rename from docs/en_US/Tuner/BuiltinTuner.md
rename to docs/archive_en_US/Tuner/BuiltinTuner.md
diff --git a/docs/en_US/Tuner/CustomizeAdvisor.md b/docs/archive_en_US/Tuner/CustomizeAdvisor.md
similarity index 100%
rename from docs/en_US/Tuner/CustomizeAdvisor.md
rename to docs/archive_en_US/Tuner/CustomizeAdvisor.md
diff --git a/docs/en_US/Tuner/CustomizeTuner.md b/docs/archive_en_US/Tuner/CustomizeTuner.md
similarity index 100%
rename from docs/en_US/Tuner/CustomizeTuner.md
rename to docs/archive_en_US/Tuner/CustomizeTuner.md
diff --git a/docs/en_US/Tuner/EvolutionTuner.md b/docs/archive_en_US/Tuner/EvolutionTuner.md
similarity index 100%
rename from docs/en_US/Tuner/EvolutionTuner.md
rename to docs/archive_en_US/Tuner/EvolutionTuner.md
diff --git a/docs/en_US/Tuner/GPTuner.md b/docs/archive_en_US/Tuner/GPTuner.md
similarity index 100%
rename from docs/en_US/Tuner/GPTuner.md
rename to docs/archive_en_US/Tuner/GPTuner.md
diff --git a/docs/en_US/Tuner/GridsearchTuner.md b/docs/archive_en_US/Tuner/GridsearchTuner.md
similarity index 100%
rename from docs/en_US/Tuner/GridsearchTuner.md
rename to docs/archive_en_US/Tuner/GridsearchTuner.md
diff --git a/docs/en_US/Tuner/HyperbandAdvisor.md b/docs/archive_en_US/Tuner/HyperbandAdvisor.md
similarity index 100%
rename from docs/en_US/Tuner/HyperbandAdvisor.md
rename to docs/archive_en_US/Tuner/HyperbandAdvisor.md
diff --git a/docs/en_US/Tuner/HyperoptTuner.md b/docs/archive_en_US/Tuner/HyperoptTuner.md
similarity index 100%
rename from docs/en_US/Tuner/HyperoptTuner.md
rename to docs/archive_en_US/Tuner/HyperoptTuner.md
diff --git a/docs/en_US/Tuner/InstallCustomizedTuner.md b/docs/archive_en_US/Tuner/InstallCustomizedTuner.md
similarity index 100%
rename from docs/en_US/Tuner/InstallCustomizedTuner.md
rename to docs/archive_en_US/Tuner/InstallCustomizedTuner.md
diff --git a/docs/en_US/Tuner/MetisTuner.md b/docs/archive_en_US/Tuner/MetisTuner.md
similarity index 100%
rename from docs/en_US/Tuner/MetisTuner.md
rename to docs/archive_en_US/Tuner/MetisTuner.md
diff --git a/docs/en_US/Tuner/NetworkmorphismTuner.md b/docs/archive_en_US/Tuner/NetworkmorphismTuner.md
similarity index 100%
rename from docs/en_US/Tuner/NetworkmorphismTuner.md
rename to docs/archive_en_US/Tuner/NetworkmorphismTuner.md
diff --git a/docs/en_US/Tuner/PBTTuner.md b/docs/archive_en_US/Tuner/PBTTuner.md
similarity index 100%
rename from docs/en_US/Tuner/PBTTuner.md
rename to docs/archive_en_US/Tuner/PBTTuner.md
diff --git a/docs/en_US/Tuner/PPOTuner.md b/docs/archive_en_US/Tuner/PPOTuner.md
similarity index 100%
rename from docs/en_US/Tuner/PPOTuner.md
rename to docs/archive_en_US/Tuner/PPOTuner.md
diff --git a/docs/en_US/Tuner/SmacTuner.md b/docs/archive_en_US/Tuner/SmacTuner.md
similarity index 100%
rename from docs/en_US/Tuner/SmacTuner.md
rename to docs/archive_en_US/Tuner/SmacTuner.md
diff --git a/docs/en_US/Tutorial/AnnotationSpec.md b/docs/archive_en_US/Tutorial/AnnotationSpec.md
similarity index 100%
rename from docs/en_US/Tutorial/AnnotationSpec.md
rename to docs/archive_en_US/Tutorial/AnnotationSpec.md
diff --git a/docs/en_US/Tutorial/Contributing.md b/docs/archive_en_US/Tutorial/Contributing.md
similarity index 100%
rename from docs/en_US/Tutorial/Contributing.md
rename to docs/archive_en_US/Tutorial/Contributing.md
diff --git a/docs/en_US/Tutorial/ExperimentConfig.md b/docs/archive_en_US/Tutorial/ExperimentConfig.md
similarity index 100%
rename from docs/en_US/Tutorial/ExperimentConfig.md
rename to docs/archive_en_US/Tutorial/ExperimentConfig.md
diff --git a/docs/en_US/Tutorial/FAQ.md b/docs/archive_en_US/Tutorial/FAQ.md
similarity index 100%
rename from docs/en_US/Tutorial/FAQ.md
rename to docs/archive_en_US/Tutorial/FAQ.md
diff --git a/docs/en_US/Tutorial/HowToDebug.md b/docs/archive_en_US/Tutorial/HowToDebug.md
similarity index 100%
rename from docs/en_US/Tutorial/HowToDebug.md
rename to docs/archive_en_US/Tutorial/HowToDebug.md
diff --git a/docs/en_US/Tutorial/HowToUseDocker.md b/docs/archive_en_US/Tutorial/HowToUseDocker.md
similarity index 100%
rename from docs/en_US/Tutorial/HowToUseDocker.md
rename to docs/archive_en_US/Tutorial/HowToUseDocker.md
diff --git a/docs/en_US/Tutorial/InstallCustomizedAlgos.md b/docs/archive_en_US/Tutorial/InstallCustomizedAlgos.md
similarity index 100%
rename from docs/en_US/Tutorial/InstallCustomizedAlgos.md
rename to docs/archive_en_US/Tutorial/InstallCustomizedAlgos.md
diff --git a/docs/en_US/Tutorial/InstallationLinux.md b/docs/archive_en_US/Tutorial/InstallationLinux.md
similarity index 100%
rename from docs/en_US/Tutorial/InstallationLinux.md
rename to docs/archive_en_US/Tutorial/InstallationLinux.md
diff --git a/docs/en_US/Tutorial/InstallationWin.md b/docs/archive_en_US/Tutorial/InstallationWin.md
similarity index 100%
rename from docs/en_US/Tutorial/InstallationWin.md
rename to docs/archive_en_US/Tutorial/InstallationWin.md
diff --git a/docs/en_US/Tutorial/Nnictl.md b/docs/archive_en_US/Tutorial/Nnictl.md
similarity index 100%
rename from docs/en_US/Tutorial/Nnictl.md
rename to docs/archive_en_US/Tutorial/Nnictl.md
diff --git a/docs/en_US/Tutorial/QuickStart.md b/docs/archive_en_US/Tutorial/QuickStart.md
similarity index 100%
rename from docs/en_US/Tutorial/QuickStart.md
rename to docs/archive_en_US/Tutorial/QuickStart.md
diff --git a/docs/en_US/Tutorial/SearchSpaceSpec.md b/docs/archive_en_US/Tutorial/SearchSpaceSpec.md
similarity index 100%
rename from docs/en_US/Tutorial/SearchSpaceSpec.md
rename to docs/archive_en_US/Tutorial/SearchSpaceSpec.md
diff --git a/docs/en_US/Tutorial/SetupNniDeveloperEnvironment.md b/docs/archive_en_US/Tutorial/SetupNniDeveloperEnvironment.md
similarity index 100%
rename from docs/en_US/Tutorial/SetupNniDeveloperEnvironment.md
rename to docs/archive_en_US/Tutorial/SetupNniDeveloperEnvironment.md
diff --git a/docs/en_US/Tutorial/WebUI.md b/docs/archive_en_US/Tutorial/WebUI.md
similarity index 100%
rename from docs/en_US/Tutorial/WebUI.md
rename to docs/archive_en_US/Tutorial/WebUI.md
diff --git a/docs/en_US/autotune_ref.md b/docs/archive_en_US/autotune_ref.md
similarity index 100%
rename from docs/en_US/autotune_ref.md
rename to docs/archive_en_US/autotune_ref.md
diff --git a/docs/en_US/nnicli_ref.md b/docs/archive_en_US/nnicli_ref.md
similarity index 100%
rename from docs/en_US/nnicli_ref.md
rename to docs/archive_en_US/nnicli_ref.md
diff --git a/docs/en_US/Assessor/BuiltinAssessor.rst b/docs/en_US/Assessor/BuiltinAssessor.rst
new file mode 100644
index 0000000000..6b85253a73
--- /dev/null
+++ b/docs/en_US/Assessor/BuiltinAssessor.rst
@@ -0,0 +1,101 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+Built-in Assessors
+==================
+
+NNI provides state-of-the-art early stopping algorithms in its builtin assessors and makes them easy to use. Below is a brief overview of NNI's current builtin Assessors.
+
+Note: Click the **Assessor's name** to get each Assessor's installation requirements, suggested usage scenario, and a config example. A link to a detailed description of each algorithm is provided at the end of the suggested scenario for each Assessor.
+
+Currently, we support the following Assessors:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Assessor
+ - Brief Introduction of Algorithm
+ * - `Medianstop <#MedianStop>`__
+ - Medianstop is a simple early stopping rule. It stops a pending trial X at step S if the trial’s best objective value by step S is strictly worse than the median value of the running averages of all completed trials’ objectives reported up to step S. `Reference Paper `__
+ * - `Curvefitting <#Curvefitting>`__
+     - Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of the final epoch's performance is worse than the best final performance in the trial history. In this algorithm, we use 12 curves to fit the accuracy curve. `Reference Paper `__
+
+
+Usage of Builtin Assessors
+--------------------------
+
+Usage of builtin assessors provided by the NNI SDK requires one to declare the **builtinAssessorName** and **classArgs** in the ``config.yml`` file. In this part, we will introduce the details of usage and the suggested scenarios, classArg requirements, and an example for each assessor.
+
+Note: Please follow the provided format when writing your ``config.yml`` file.
+
+:raw-html:`<a name="MedianStop"></a>`
+
+Median Stop Assessor
+^^^^^^^^^^^^^^^^^^^^
+
+..
+
+ Builtin Assessor Name: **Medianstop**
+
+
+**Suggested scenario**
+
+It is applicable to a wide range of performance curves; thus, it can be used in various scenarios to speed up the tuning process. `Detailed Description <./MedianstopAssessor.rst>`__
+
+**classArgs requirements:**
+
+
+* **optimize_mode** (*maximize or minimize, optional, default = maximize*\ ) - If 'maximize', the assessor will **stop** trials with a smaller expected final result. If 'minimize', the assessor will **stop** trials with a larger expected final result.
+* **start_step** (*int, optional, default = 0*\ ) - A trial is judged for early stopping only after it has reported start_step intermediate results.
+
+**Usage example:**
+
+.. code-block:: yaml
+
+ # config.yml
+ assessor:
+ builtinAssessorName: Medianstop
+ classArgs:
+ optimize_mode: maximize
+ start_step: 5
+
+:raw-html:`<br>`
+
+:raw-html:`<a name="Curvefitting"></a>`
+
+Curve Fitting Assessor
+^^^^^^^^^^^^^^^^^^^^^^
+
+..
+
+ Builtin Assessor Name: **Curvefitting**
+
+
+**Suggested scenario**
+
+It is applicable to a wide range of performance curves; thus, it can be used in various scenarios to speed up the tuning process. Even better, it is able to handle and assess curves with similar performance. `Detailed Description <./CurvefittingAssessor.rst>`__
+
+**Note**\ , according to the original paper, only incremental functions are supported. Therefore this assessor can only be used to maximize optimization metrics. For example, it can be used for accuracy, but not for loss.
+
+**classArgs requirements:**
+
+
+* **epoch_num** (*int, required*\ ) - The total number of epochs. We need to know the number of epochs to determine which point to predict.
+* **start_step** (*int, optional, default = 6*\ ) - A trial is judged for early stopping only after it has reported start_step intermediate results.
+* **threshold** (*float, optional, default = 0.95*\ ) - The threshold used to decide whether to early stop the worst-performing curves. For example: if threshold = 0.95 and the best performance in the history is 0.9, then we stop any trial whose predicted value is lower than 0.95 * 0.9 = 0.855.
+* **gap** (*int, optional, default = 1*\ ) - The interval between Assessor judgements. For example: if gap = 2 and start_step = 6, then we assess the result after receiving 6, 8, 10, 12... intermediate results.
+
+**Usage example:**
+
+.. code-block:: yaml
+
+ # config.yml
+ assessor:
+ builtinAssessorName: Curvefitting
+ classArgs:
+ epoch_num: 20
+ start_step: 6
+ threshold: 0.95
+ gap: 1
diff --git a/docs/en_US/Assessor/CurvefittingAssessor.rst b/docs/en_US/Assessor/CurvefittingAssessor.rst
new file mode 100644
index 0000000000..41c6d2c147
--- /dev/null
+++ b/docs/en_US/Assessor/CurvefittingAssessor.rst
@@ -0,0 +1,101 @@
+Curve Fitting Assessor on NNI
+=============================
+
+Introduction
+------------
+
+The Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of the final epoch's performance is worse than the best final performance in the trial history.
+
+In this algorithm, we use 12 curves to fit the learning curve. The set of parametric curve models is chosen from this `reference paper `__. The learning curves' shapes coincide with our prior knowledge about the form of learning curves: they are typically increasing, saturating functions.
+
+
+.. image:: ../../img/curvefitting_learning_curve.PNG
+ :target: ../../img/curvefitting_learning_curve.PNG
+ :alt: learning_curve
+
+
+We combine all learning curve models into a single, more powerful model. This combined model is given by a weighted linear combination:
+
+
+.. image:: ../../img/curvefitting_f_comb.gif
+ :target: ../../img/curvefitting_f_comb.gif
+ :alt: f_comb
+
+
+with the new combined parameter vector
+
+
+.. image:: ../../img/curvefitting_expression_xi.gif
+ :target: ../../img/curvefitting_expression_xi.gif
+ :alt: expression_xi
+
+
+We assume additive Gaussian noise, with the noise parameter initialized to its maximum likelihood estimate.
+
+We determine the maximum-probability value of the combined parameter vector from the historical data. We use this value to predict future trial performance and stop inadequate trials early to save computing resources.
+
+Concretely, this algorithm goes through three stages of learning, predicting, and assessing.
+
+
+*
+  Step 1: Learning. We learn from the trial history of the current trial and determine \xi from a Bayesian perspective. First, we fit each curve using the least-squares method, implemented in ``fit_theta``. After obtaining the parameters, we filter the curves and remove outliers, implemented in ``filter_curve``. Finally, we use MCMC sampling, implemented in ``mcmc_sampling``\ , to adjust the weight of each curve. At this point, all the parameters in \xi have been determined.
+
+*
+  Step 2: Predicting. We calculate the expected final accuracy, implemented in ``f_comb``\ , at the target position (i.e., the total number of epochs), using \xi and the formula of the combined model.
+
+*
+  Step 3: Assessing. If the fitting result does not converge, the predicted value is ``None``\ ; in that case we return ``AssessResult.Good`` to wait for more accuracy information and predict again later. Otherwise, ``predict()`` returns a positive value: if it is strictly greater than the best final performance in history multiplied by ``THRESHOLD`` (default value = 0.95), we return ``AssessResult.Good``\ ; otherwise, we return ``AssessResult.Bad``. A minimal sketch of this logic follows the list.
+
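+The following sketch illustrates the assessing step described above; ``predict`` is a hypothetical placeholder for the assessor's internal curve-fitting prediction, not NNI's actual API:
+
+.. code-block:: python
+
+   from nni.assessor import AssessResult
+
+   THRESHOLD = 0.95
+
+   def assess(trial_history, target_epoch, best_history_performance, predict):
+       # `predict` fits the combined curve model to `trial_history` and returns the
+       # expected accuracy at `target_epoch`, or None if the fit did not converge.
+       predicted = predict(trial_history, target_epoch)
+       if predicted is None:
+           # Not enough information yet: keep the trial running and assess again later.
+           return AssessResult.Good
+       if predicted > best_history_performance * THRESHOLD:
+           return AssessResult.Good
+       return AssessResult.Bad
+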
+The figure below shows the result of our algorithm on MNIST trial history data, where the green points represent the data obtained by the Assessor, the blue points represent the future but unknown data, and the red line is the curve predicted by the Curve Fitting Assessor.
+
+
+.. image:: ../../img/curvefitting_example.PNG
+ :target: ../../img/curvefitting_example.PNG
+ :alt: examples
+
+
+Usage
+-----
+
+To use the Curve Fitting Assessor, add the following spec to your experiment's YAML config file:
+
+.. code-block:: yaml
+
+ assessor:
+ builtinAssessorName: Curvefitting
+ classArgs:
+       # (required) The total number of epochs.
+       # We need to know the number of epochs to determine which point to predict.
+       epoch_num: 20
+       # (optional) A trial is judged only after it has reported start_step intermediate results,
+       # in order to save computing resources.
+       # The default value of start_step is 6.
+       start_step: 6
+       # (optional) The threshold used to decide whether to early stop the worst-performing curves.
+       # For example: if threshold = 0.95 and the best performance in the history is 0.9, then we stop any trial whose predicted value is lower than 0.95 * 0.9 = 0.855.
+       # The default value of threshold is 0.95.
+       threshold: 0.95
+       # (optional) The interval between Assessor judgements.
+       # For example: if gap = 2 and start_step = 6, then we assess the result after receiving 6, 8, 10, 12... intermediate results.
+       # The default value of gap is 1.
+       gap: 1
+
+Limitation
+----------
+
+According to the original paper, only incremental functions are supported. Therefore this assessor can only be used to maximize optimization metrics. For example, it can be used for accuracy, but not for loss.
+
+File Structure
+--------------
+
+The assessor has a lot of different files, functions, and classes. Here we briefly describe a few of them.
+
+
+* ``curvefunctions.py`` includes all the function expressions and default parameters.
+* ``modelfactory.py`` includes learning and predicting; the corresponding calculation part is also implemented here.
+* ``curvefitting_assessor.py`` is the assessor which receives the trial history and assesses whether to early stop the trial.
+
+TODO
+----
+
+
+* Further improve the accuracy of the prediction and test it on more models.
diff --git a/docs/en_US/Assessor/CustomizeAssessor.rst b/docs/en_US/Assessor/CustomizeAssessor.rst
new file mode 100644
index 0000000000..3926d7a306
--- /dev/null
+++ b/docs/en_US/Assessor/CustomizeAssessor.rst
@@ -0,0 +1,67 @@
+Customize Assessor
+==================
+
+NNI supports building your own assessor to meet your tuning needs.
+
+If you want to implement a customized Assessor, there are three things to do:
+
+
+#. Inherit the base Assessor class
+#. Implement assess_trial function
+#. Configure your customized Assessor in experiment YAML config file
+
+**1. Inherit the base Assessor class**
+
+.. code-block:: python
+
+ from nni.assessor import Assessor
+
+ class CustomizedAssessor(Assessor):
+ def __init__(self, ...):
+ ...
+
+**2. Implement the assess_trial function**
+
+.. code-block:: python
+
+   from nni.assessor import Assessor, AssessResult
+
+   class CustomizedAssessor(Assessor):
+       def __init__(self, ...):
+           ...
+
+       def assess_trial(self, trial_history):
+           """
+           Determines whether a trial should be killed. Must override.
+           trial_history: a list of intermediate result objects.
+           Returns AssessResult.Good or AssessResult.Bad.
+           """
+           # Implement your code here.
+           ...
+
+**3. Configure your customized Assessor in experiment YAML config file**
+
+NNI needs to locate your customized Assessor class and instantiate the class, so you need to specify the location of the customized Assessor class and pass literal values as parameters to the __init__ constructor.
+
+.. code-block:: yaml
+
+ assessor:
+ codeDir: /home/abc/myassessor
+ classFileName: my_customized_assessor.py
+ className: CustomizedAssessor
+     # Any parameter you need to pass to your Assessor class's __init__ constructor
+ # can be specified in this optional classArgs field, for example
+ classArgs:
+ arg1: value1
+
+Please note that in **2**, the object ``trial_history`` is exactly the object that the Trial sends to the Assessor via the SDK function ``report_intermediate_result``.
+
+The working directory of your assessor is ``/nni-experiments//log``\ , which can be retrieved from the environment variable ``NNI_LOG_DIRECTORY``.
+
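+As a concrete illustration, here is a minimal, hypothetical assessor that follows the skeleton above: it stops a trial once its latest intermediate result falls below a fixed threshold passed through ``classArgs`` (e.g. ``threshold: 0.5`` in place of the placeholder ``arg1`` above), assuming higher metrics are better. It would be registered with the ``codeDir``/``classFileName``/``className`` fields shown earlier.
+
+.. code-block:: python
+
+   # my_customized_assessor.py (illustrative example, not part of NNI)
+   from nni.assessor import Assessor, AssessResult
+
+   class CustomizedAssessor(Assessor):
+       def __init__(self, threshold=0.5):
+           # `threshold` comes from the classArgs section of config.yml.
+           self.threshold = threshold
+
+       def assess_trial(self, trial_history):
+           # trial_history holds the values reported via nni.report_intermediate_result().
+           if trial_history and trial_history[-1] < self.threshold:
+               return AssessResult.Bad
+           return AssessResult.Good
+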
+For more detailed examples, see:
+
+..
+
+ * :githublink:`medianstop-assessor `
+ * :githublink:`curvefitting-assessor `
+
diff --git a/docs/en_US/Assessor/MedianstopAssessor.rst b/docs/en_US/Assessor/MedianstopAssessor.rst
new file mode 100644
index 0000000000..5a307bf0d3
--- /dev/null
+++ b/docs/en_US/Assessor/MedianstopAssessor.rst
@@ -0,0 +1,7 @@
+Medianstop Assessor on NNI
+==========================
+
+Median Stop
+-----------
+
+Medianstop is a simple early stopping rule mentioned in this `paper `__. It stops a pending trial X after step S if the trial’s best objective value by step S is strictly worse than the median value of the running averages of all completed trials’ objectives reported up to step S.
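+
+The rule is simple enough to sketch in a few lines of Python. The helper below is only an illustration (a hypothetical function, not NNI's actual implementation), assuming higher objective values are better:
+
+.. code-block:: python
+
+   from statistics import median
+
+   def should_median_stop(trial_history, completed_histories, step):
+       """Return True if the pending trial should be stopped at step S = `step`."""
+       # Best objective value the pending trial has reported by step S.
+       best_so_far = max(trial_history[:step])
+       # Running averages (up to step S) of all completed trials that reached step S.
+       running_avgs = [
+           sum(history[:step]) / step
+           for history in completed_histories
+           if len(history) >= step
+       ]
+       if not running_avgs:
+           return False
+       # Stop if strictly worse than the median of those running averages.
+       return best_so_far < median(running_avgs)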
diff --git a/docs/en_US/CommunitySharings/AutoCompletion.rst b/docs/en_US/CommunitySharings/AutoCompletion.rst
new file mode 100644
index 0000000000..cb0c76c12f
--- /dev/null
+++ b/docs/en_US/CommunitySharings/AutoCompletion.rst
@@ -0,0 +1,55 @@
+Auto Completion for nnictl Commands
+===================================
+
+NNI's command line tool **nnictl** supports auto-completion, i.e., you can complete an nnictl command by pressing the ``tab`` key.
+
+For example, if the current command is
+
+.. code-block:: bash
+
+ nnictl cre
+
+By pressing the ``tab`` key, it will be completed to
+
+.. code-block:: bash
+
+ nnictl create
+
+For now, auto-completion is not enabled by default when you install NNI through ``pip``\ , and it only works on Linux with the bash shell. To enable this feature on your computer, follow the steps below:
+
+Step 1. Download ``bash-completion``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: bash
+
+ cd ~
+ wget https://raw.githubusercontent.com/microsoft/nni/{nni-version}/tools/bash-completion
+
+Here, {nni-version} should be replaced by the version of NNI, e.g., ``master``\ , ``v1.9``. You can also check the latest ``bash-completion`` script :githublink:`here `.
+
+Step 2. Install the script
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you are running as root and want to install this script for all users
+
+.. code-block:: bash
+
+ install -m644 ~/bash-completion /usr/share/bash-completion/completions/nnictl
+
+If you just want to install this script for yourself
+
+.. code-block:: bash
+
+ mkdir -p ~/.bash_completion.d
+ install -m644 ~/bash-completion ~/.bash_completion.d/nnictl
+ echo '[[ -f ~/.bash_completion.d/nnictl ]] && source ~/.bash_completion.d/nnictl' >> ~/.bash_completion
+
+Step 3. Reopen your terminal
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Reopen your terminal and you should be able to use the auto-completion feature. Enjoy!
+
+Step 4. Uninstall
+^^^^^^^^^^^^^^^^^
+
+If you want to uninstall this feature, just revert the changes in the steps above.
diff --git a/docs/en_US/CommunitySharings/HpoComparison.rst b/docs/en_US/CommunitySharings/HpoComparison.rst
new file mode 100644
index 0000000000..75925ab2e9
--- /dev/null
+++ b/docs/en_US/CommunitySharings/HpoComparison.rst
@@ -0,0 +1,385 @@
+Hyper Parameter Optimization Comparison
+=======================================
+
+*Posted by Anonymous Author*
+
+Comparison of Hyperparameter Optimization (HPO) algorithms on several problems.
+
+The hyperparameter optimization algorithms compared are listed below:
+
+
+* `Random Search <../Tuner/BuiltinTuner.rst>`__
+* `Grid Search <../Tuner/BuiltinTuner.rst>`__
+* `Evolution <../Tuner/BuiltinTuner.rst>`__
+* `Anneal <../Tuner/BuiltinTuner.rst>`__
+* `Metis <../Tuner/BuiltinTuner.rst>`__
+* `TPE <../Tuner/BuiltinTuner.rst>`__
+* `SMAC <../Tuner/BuiltinTuner.rst>`__
+* `HyperBand <../Tuner/BuiltinTuner.rst>`__
+* `BOHB <../Tuner/BuiltinTuner.rst>`__
+
+All algorithms run in NNI local environment.
+
+Machine Environment:
+
+.. code-block:: bash
+
+ OS: Linux Ubuntu 16.04 LTS
+ CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz 2600 MHz
+ Memory: 112 GB
+ NNI Version: v0.7
+ NNI Mode(local|pai|remote): local
+ Python version: 3.6
+ Is conda or virtualenv used?: Conda
+ is running in docker?: no
+
+AutoGBDT Example
+----------------
+
+Problem Description
+^^^^^^^^^^^^^^^^^^^
+
+A nonconvex problem: the hyper-parameter search of the `AutoGBDT <../TrialExample/GbdtExample.rst>`__ example.
+
+Search Space
+^^^^^^^^^^^^
+
+.. code-block:: json
+
+ {
+ "num_leaves": {
+ "_type": "choice",
+ "_value": [10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 48, 64, 96, 128]
+ },
+ "learning_rate": {
+ "_type": "choice",
+ "_value": [0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
+ },
+ "max_depth": {
+ "_type": "choice",
+ "_value": [-1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 48, 64, 96, 128]
+ },
+ "feature_fraction": {
+ "_type": "choice",
+ "_value": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
+ },
+ "bagging_fraction": {
+ "_type": "choice",
+ "_value": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
+ },
+ "bagging_freq": {
+ "_type": "choice",
+ "_value": [1, 2, 4, 8, 10, 12, 14, 16]
+ }
+ }
+
+The total search space size is 1,204,224; we set the maximum number of trials to 1,000. The time limit is 48 hours.
+
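+The size of this search space is simply the product of the number of choices for each parameter, which a few lines of Python confirm:
+
+.. code-block:: python
+
+   # Number of candidate values for each parameter in the search space above.
+   choices = {
+       'num_leaves': 14,
+       'learning_rate': 8,
+       'max_depth': 21,
+       'feature_fraction': 8,
+       'bagging_fraction': 8,
+       'bagging_freq': 8,
+   }
+
+   size = 1
+   for n in choices.values():
+       size *= n
+   print(size)  # 1204224
+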
+Results
+^^^^^^^
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Algorithm
+ - Best loss
+ - Average of Best 5 Losses
+ - Average of Best 10 Losses
+ * - Random Search
+ - 0.418854
+ - 0.420352
+ - 0.421553
+ * - Random Search
+ - 0.417364
+ - 0.420024
+ - 0.420997
+ * - Random Search
+ - 0.417861
+ - 0.419744
+ - 0.420642
+ * - Grid Search
+ - 0.498166
+ - 0.498166
+ - 0.498166
+ * - Evolution
+ - 0.409887
+ - 0.409887
+ - 0.409887
+ * - Evolution
+ - 0.413620
+ - 0.413875
+ - 0.414067
+ * - Evolution
+ - 0.409887
+ - 0.409887
+ - 0.409887
+ * - Anneal
+ - 0.414877
+ - 0.417289
+ - 0.418281
+ * - Anneal
+ - 0.409887
+ - 0.409887
+ - 0.410118
+ * - Anneal
+ - 0.413683
+ - 0.416949
+ - 0.417537
+ * - Metis
+ - 0.416273
+ - 0.420411
+ - 0.422380
+ * - Metis
+ - 0.420262
+ - 0.423175
+ - 0.424816
+ * - Metis
+ - 0.421027
+ - 0.424172
+ - 0.425714
+ * - TPE
+ - 0.414478
+ - 0.414478
+ - 0.414478
+ * - TPE
+ - 0.415077
+ - 0.417986
+ - 0.418797
+ * - TPE
+ - 0.415077
+ - 0.417009
+ - 0.418053
+ * - SMAC
+ - **0.408386**
+ - **0.408386**
+ - **0.408386**
+ * - SMAC
+ - 0.414012
+ - 0.414012
+ - 0.414012
+ * - SMAC
+ - **0.408386**
+ - **0.408386**
+ - **0.408386**
+ * - BOHB
+ - 0.410464
+ - 0.415319
+ - 0.417755
+ * - BOHB
+ - 0.418995
+ - 0.420268
+ - 0.422604
+ * - BOHB
+ - 0.415149
+ - 0.418072
+ - 0.418932
+ * - HyperBand
+ - 0.414065
+ - 0.415222
+ - 0.417628
+ * - HyperBand
+ - 0.416807
+ - 0.417549
+ - 0.418828
+ * - HyperBand
+ - 0.415550
+ - 0.415977
+ - 0.417186
+ * - GP
+ - 0.414353
+ - 0.418563
+ - 0.420263
+ * - GP
+ - 0.414395
+ - 0.418006
+ - 0.420431
+ * - GP
+ - 0.412943
+ - 0.416566
+ - 0.418443
+
+
+In this example, all the algorithms are used with their default parameters. For Metis, there were only about 300 trials because it runs slowly due to the O(n^3) time complexity of its Gaussian Process.
+
+RocksDB Benchmark 'fillrandom' and 'readrandom'
+-----------------------------------------------
+
+Problem Description
+^^^^^^^^^^^^^^^^^^^
+
+`DB_Bench `__ is the main tool used to benchmark `RocksDB `__\ 's performance. It has many hyperparameters to tune.
+
+The performance of ``DB_Bench`` depends on the machine configuration and installation method. We ran ``DB_Bench`` on a Linux machine and installed RocksDB as a shared library.
+
+Machine configuration
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: bash
+
+ RocksDB: version 6.1
+ CPU: 6 * Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
+ CPUCache: 35840 KB
+ Keys: 16 bytes each
+ Values: 100 bytes each (50 bytes after compression)
+ Entries: 1000000
+
+Storage performance
+^^^^^^^^^^^^^^^^^^^
+
+**Latency**\ : each IO request takes some time to complete; this is called the average latency. Several factors affect this time, including network connection quality and hard-disk IO performance.
+
+**IOPS**\ : **IO operations per second**\ , which means the number of *read or write operations* that can be done in one second.
+
+**IO size**\ : **the size of each IO request**. Depending on the operating system and the application/service that needs disk access, a request is issued to read or write a certain amount of data at a time.
+
+**Throughput (in MB/s) = Average IO size x IOPS**
+
+IOPS is related to online processing ability, so we use IOPS as the metric in our experiments.
+
+Search Space
+^^^^^^^^^^^^
+
+.. code-block:: json
+
+ {
+ "max_background_compactions": {
+ "_type": "quniform",
+ "_value": [1, 256, 1]
+ },
+ "block_size": {
+ "_type": "quniform",
+ "_value": [1, 500000, 1]
+ },
+ "write_buffer_size": {
+ "_type": "quniform",
+ "_value": [1, 130000000, 1]
+ },
+ "max_write_buffer_number": {
+ "_type": "quniform",
+ "_value": [1, 128, 1]
+ },
+ "min_write_buffer_number_to_merge": {
+ "_type": "quniform",
+ "_value": [1, 32, 1]
+ },
+ "level0_file_num_compaction_trigger": {
+ "_type": "quniform",
+ "_value": [1, 256, 1]
+ },
+ "level0_slowdown_writes_trigger": {
+ "_type": "quniform",
+ "_value": [1, 1024, 1]
+ },
+ "level0_stop_writes_trigger": {
+ "_type": "quniform",
+ "_value": [1, 1024, 1]
+ },
+ "cache_size": {
+ "_type": "quniform",
+ "_value": [1, 30000000, 1]
+ },
+ "compaction_readahead_size": {
+ "_type": "quniform",
+ "_value": [1, 30000000, 1]
+ },
+ "new_table_reader_for_compaction_inputs": {
+ "_type": "randint",
+ "_value": [1]
+ }
+ }
+
+The search space is enormous (about 10^40), so we set the maximum number of trials to 100 to limit the computation resources.
+
+Results
+^^^^^^^
+
+'fillrandom' Benchmark
+^^^^^^^^^^^^^^^^^^^^^^
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model
+ - Best IOPS (Repeat 1)
+ - Best IOPS (Repeat 2)
+ - Best IOPS (Repeat 3)
+ * - Random
+ - 449901
+ - 427620
+ - 477174
+ * - Anneal
+ - 461896
+ - 467150
+ - 437528
+ * - Evolution
+ - 436755
+ - 389956
+ - 389790
+ * - TPE
+ - 378346
+ - 482316
+ - 468989
+ * - SMAC
+ - 491067
+ - 490472
+ - **491136**
+ * - Metis
+ - 444920
+ - 457060
+ - 454438
+
+
+Figure:
+
+
+.. image:: ../../img/hpo_rocksdb_fillrandom.png
+ :target: ../../img/hpo_rocksdb_fillrandom.png
+ :alt:
+
+
+'readrandom' Benchmark
+^^^^^^^^^^^^^^^^^^^^^^
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model
+ - Best IOPS (Repeat 1)
+ - Best IOPS (Repeat 2)
+ - Best IOPS (Repeat 3)
+ * - Random
+ - 2276157
+ - 2285301
+ - 2275142
+ * - Anneal
+ - 2286330
+ - 2282229
+ - 2284012
+ * - Evolution
+ - 2286524
+ - 2283673
+ - 2283558
+ * - TPE
+ - 2287366
+ - 2282865
+ - 2281891
+ * - SMAC
+ - 2270874
+ - 2284904
+ - 2282266
+ * - Metis
+ - **2287696**
+ - 2283496
+ - 2277701
+
+
+Figure:
+
+
+.. image:: ../../img/hpo_rocksdb_readrandom.png
+ :target: ../../img/hpo_rocksdb_readrandom.png
+ :alt:
+
diff --git a/docs/en_US/CommunitySharings/ModelCompressionComparison.rst b/docs/en_US/CommunitySharings/ModelCompressionComparison.rst
new file mode 100644
index 0000000000..12cc009e25
--- /dev/null
+++ b/docs/en_US/CommunitySharings/ModelCompressionComparison.rst
@@ -0,0 +1,133 @@
+Comparison of Filter Pruning Algorithms
+=======================================
+
+To provide an initial insight into the performance of various filter pruning algorithms,
+we conduct extensive experiments with various pruning algorithms on some benchmark models and datasets.
+We present the experiment results in this document.
+In addition, we provide friendly instructions on the re-implementation of these experiments to facilitate further contributions to this effort.
+
+Experiment Setting
+------------------
+
+The experiments are performed with the following pruners/datasets/models:
+
+
+*
+ Models: :githublink:`VGG16, ResNet18, ResNet50 `
+
+*
+ Datasets: CIFAR-10
+
+*
+ Pruners:
+
+
+ * These pruners are included:
+
+    * Pruners with scheduling: ``SimulatedAnnealing Pruner``\ , ``NetAdapt Pruner``\ , ``AutoCompress Pruner``.
+      Given the overall sparsity requirement, these pruners can automatically generate a sparsity distribution among different layers.
+ * One-shot pruners: ``L1Filter Pruner``\ , ``L2Filter Pruner``\ , ``FPGM Pruner``.
+ The sparsity of each layer is set the same as the overall sparsity in this experiment.
+
+ *
+ Only **filter pruning** performances are compared here.
+
+    For the pruners with scheduling, ``L1Filter Pruner`` is used as the base algorithm. That is to say, after the sparsity distribution is decided by the scheduling algorithm, ``L1Filter Pruner`` is used to perform the actual pruning.
+
+ *
+ All the pruners listed above are implemented in :githublink:`nni `.
+
+Experiment Result
+-----------------
+
+For each dataset/model/pruner combination, we prune the model to different levels by setting a series of target sparsities for the pruner.
+
+Here we plot both the **Number of Weights - Performance** curve and the **FLOPs - Performance** curve.
+As a reference, we also plot the result declared in the paper `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates `__ for models VGG16 and ResNet18 on CIFAR-10.
+
+The experiment results are shown in the following figures:
+
+CIFAR-10, VGG16:
+
+
+.. image:: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_vgg16.png
+ :target: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_vgg16.png
+ :alt:
+
+
+CIFAR-10, ResNet18:
+
+
+.. image:: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet18.png
+ :target: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet18.png
+ :alt:
+
+
+CIFAR-10, ResNet50:
+
+
+.. image:: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet50.png
+ :target: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet50.png
+ :alt:
+
+
+Analysis
+--------
+
+From the experiment result, we get the following conclusions:
+
+
+* Given the constraint on the number of parameters, the pruners with scheduling (``AutoCompress Pruner``\ , ``SimulatedAnnealing Pruner``\ ) perform better than the others when the constraint is strict. However, they have no such advantage in the FLOPs/Performance comparison, since only the number-of-parameters constraint is considered in the optimization process;
+* The basic algorithms ``L1Filter Pruner``\ , ``L2Filter Pruner``\ , and ``FPGM Pruner`` perform very similarly in these experiments;
+* ``NetAdapt Pruner`` cannot achieve a very high compression rate, because it prunes only one layer in each pruning iteration; this leads to unacceptable complexity when the sparsity per iteration is much lower than the overall sparsity constraint.
+
+Experiments Reproduction
+------------------------
+
+Implementation Details
+^^^^^^^^^^^^^^^^^^^^^^
+
+
+*
+ The experiment results are all collected with the default configuration of the pruners in nni, which means that when we call a pruner class in nni, we don't change any default class arguments.
+
+*
+ Both FLOPs and the number of parameters are counted with :githublink:`Model FLOPs/Parameters Counter ` after :githublink:`model speed up `.
+  This avoids the potential issue of counting them on masked models, where pruned weights are masked but not actually removed.
+
+*
+ The experiment code can be found :githublink:`here `.
+
+Experiment Result Rendering
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+*
+ If you follow the practice in the :githublink:`example `\ , for every single pruning experiment, the experiment result will be saved in JSON format as follows:
+
+ .. code-block:: json
+
+ {
+ "performance": {"original": 0.9298, "pruned": 0.1, "speedup": 0.1, "finetuned": 0.7746},
+ "params": {"original": 14987722.0, "speedup": 167089.0},
+ "flops": {"original": 314018314.0, "speedup": 38589922.0}
+ }
+
+*
+ The experiment results are saved :githublink:`here `.
+ You can refer to :githublink:`analyze ` to plot new performance comparison figures.
+
+Contribution
+------------
+
+TODO Items
+^^^^^^^^^^
+
+
+* Pruners constrained by FLOPS/latency
+* More pruning algorithms/datasets/models
+
+Issues
+^^^^^^
+
+For algorithm implementation & experiment issues, please `create an issue `__.
diff --git a/docs/en_US/CommunitySharings/NNI_AutoFeatureEng.rst b/docs/en_US/CommunitySharings/NNI_AutoFeatureEng.rst
new file mode 100644
index 0000000000..d01a824517
--- /dev/null
+++ b/docs/en_US/CommunitySharings/NNI_AutoFeatureEng.rst
@@ -0,0 +1,141 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+NNI review article from Zhihu - By Garvin Li
+========================================================================================================================
+
+This article is by an NNI user on the Zhihu forum. In it, Garvin shares his experience of using NNI for automatic feature engineering. We think this article is very useful for users who are interested in using NNI for feature engineering. With the author's permission, we translated the original article into English.
+
+**Source (原文)**\ : `如何看待微软最新发布的AutoML平台NNI? By Garvin Li `__ (How to view Microsoft's newly released AutoML platform NNI?)
+
+01 Overview of AutoML
+---------------------
+
+In the author's opinion, AutoML is not only about hyperparameter optimization, but
+also a process that can target various stages of the machine learning process,
+including feature engineering, NAS, HPO, etc.
+
+02 Overview of NNI
+------------------
+
+NNI (Neural Network Intelligence) is an open source AutoML toolkit from
+Microsoft, to help users design and tune machine learning models, neural network
+architectures, or a complex system’s parameters in an efficient and automatic
+way.
+
+Link: https://github.com/Microsoft/nni
+
+In general, most Microsoft tools share one prominent characteristic: the
+design is highly reasonable (regardless of the degree of technical innovation).
+NNI's AutoFeatureENG basically meets all user requirements for automatic
+feature engineering with a very reasonably designed underlying framework.
+
+03 Details of NNI-AutoFeatureENG
+--------------------------------
+
+..
+
+   The article follows the GitHub project: `https://github.com/SpongebBob/tabular_automl_NNI `__.
+
+
+New users can do AutoFeatureENG with NNI easily and efficiently. To explore the AutoFeatureENG capability, download the required files shown below, and then install NNI through pip.
+
+
+.. image:: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
+ :target: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
+ :alt:
+
+NNI treats AutoFeatureENG as a two-step task: feature generation exploration and feature selection. Feature generation exploration is mainly about feature derivation and high-order feature combination.
+
+04 Feature Exploration
+----------------------
+
+For feature derivation, NNI offers many operations which can automatically generate new features, listed \ `as follows `__\ :
+
+**count**\ : Count encoding is based on replacing categories with their counts computed on the train set, also named frequency encoding.
+
+**target**\ : Target encoding is based on encoding categorical variable values with the mean of target variable per value.
+
+**embedding**\ : Regard features as sentences, generate vectors using *Word2Vec.*
+
+**crosscout**\ : Count encoding on more than one dimension, similar to CTR (Click Through Rate).
+
+**aggregete**\ : Decide the aggregation functions of the features, including min/max/mean/var.
+
+**nunique**\ : Statistics of the number of unique features.
+
+**histsta**\ : Statistics of feature buckets, like histogram statistics.
+
+The search space can be defined in a **JSON file**\ : it specifies how features intersect, which two columns are crossed, and how new features are generated from the corresponding columns.
+
+
+.. image:: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
+ :target: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
+ :alt:
+
+
+The picture shows the procedure of defining a search space. NNI provides count encoding as a 1-order op, as well as cross count encoding and aggregate statistics (min, max, var, mean, median, nunique) as 2-order ops.
+
+For example, if we want to search for features that are frequency-encoded (value count) on the columns named {"C1", ..., "C26"}, we can define it in the following way:
+
+
+.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
+ :target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
+ :alt:
+
+
+We can define a cross frequency encoding (value count on crossed dimensions) method on columns {"C1",...,"C26"} x {"C1",...,"C26"} in the following way:
+
+
+.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
+ :target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
+ :alt:
+
+
+The purpose of exploration is to generate new features. You can use the **get_next_parameter** function to get the feature candidates received by one trial:
+
+.. code-block:: python
+
+   RECEIVED_PARAMS = nni.get_next_parameter()
+
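+As an illustration only, a trial script built around this call might look like the sketch below. The toy dataset, the LightGBM settings and the omitted feature-generation step are placeholders and are not part of the original project; only ``nni.get_next_parameter`` and ``nni.report_final_result`` are the actual NNI APIs involved.
+
+.. code-block:: python
+
+   import nni
+   import lightgbm as lgb
+   import pandas as pd
+   from sklearn.datasets import make_classification
+   from sklearn.metrics import roc_auc_score
+   from sklearn.model_selection import train_test_split
+
+   # Toy data standing in for the real tabular dataset.
+   X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
+   X = pd.DataFrame(X, columns=[f"I{i}" for i in range(10)])
+   X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
+
+   # The received parameters describe which derived features to generate;
+   # a real trial would expand the dataframe accordingly (omitted here).
+   RECEIVED_PARAMS = nni.get_next_parameter()
+
+   model = lgb.LGBMClassifier(n_estimators=100)
+   model.fit(X_train, y_train)
+   auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
+
+   nni.report_final_result(auc)  # the tuner uses this metric to propose better feature combinations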
+
+05 Feature selection
+--------------------
+
+To avoid feature explosion and overfitting, feature selection is necessary. For feature selection, NNI-AutoFeatureENG mainly promotes LightGBM (Light Gradient Boosting Machine), a gradient boosting framework developed by Microsoft.
+
+
+.. image:: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
+ :target: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
+ :alt:
+
+
+If you have used **XGBoost** or **GBDT**\ , you would know that tree-based algorithms can easily calculate the importance of each feature for the results. LightGBM is thus able to do feature selection naturally.
+
+The issue is that the selected features might be applicable to *GBDT* (Gradient Boosting Decision Tree), but not to linear algorithms like *LR* (Logistic Regression).
+
+
+.. image:: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
+ :target: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
+ :alt:
+
+
+06 Summary
+----------
+
+NNI's AutoFeatureEng sets a well-established standard, showing us the operation procedure and the available modules, and it is highly convenient to use. However, a simple model is probably not enough for good results.
+
+Suggestions to NNI
+------------------
+
+About exploration: it would be better to consider using a DNN (like xDeepFM) to extract high-order features.
+
+About selection: there could be more intelligent options, such as an automatic selection system based on downstream models.
+
+Conclusion: NNI can offer users some design inspiration, and it is a good open source project. I suggest researchers leverage it to accelerate their AI research.
+
+Tip: because the scripts of the open source project are compiled with gcc 7, macOS users may encounter gcc (GNU Compiler Collection) problems. The solution is as follows:
+
+.. code-block:: bash
+
+   brew install libomp
diff --git a/docs/en_US/CommunitySharings/NNI_colab_support.rst b/docs/en_US/CommunitySharings/NNI_colab_support.rst
new file mode 100644
index 0000000000..438f66bb26
--- /dev/null
+++ b/docs/en_US/CommunitySharings/NNI_colab_support.rst
@@ -0,0 +1,47 @@
+Use NNI on Google Colab
+=======================
+
+NNI can easily run on the Google Colab platform. However, Colab doesn't expose its public IP and ports, so by default you cannot access NNI's Web UI on Colab. To solve this, you need reverse proxy software like ``ngrok`` or ``frp``. This tutorial will show you how to use ngrok to access NNI's Web UI on Colab.
+
+How to Open NNI's Web UI on Google Colab
+----------------------------------------
+
+
+#. Install the required packages and software.
+
+.. code-block:: bash
+
+ ! pip install nni # install nni
+ ! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip # download ngrok and unzip it
+ ! unzip ngrok-stable-linux-amd64.zip
+ ! mkdir -p nni_repo
+    ! git clone https://github.com/microsoft/nni.git nni_repo/nni # clone NNI's official repo to get examples
+
+
+#. Register an ngrok account `here `__\ , then connect to your account using your authtoken.
+
+.. code-block:: bash
+
+ ! ./ngrok authtoken
+
+
+#. Start an NNI example on a port greater than 1024, then start ngrok with the same port. If you want to use a GPU, make sure ``gpuNum >= 1`` in ``config.yml``. Use ``get_ipython()`` to start ngrok, since the cell will get stuck if you use ``! ngrok http 5000 &``.
+
+.. code-block:: bash
+
+ ! nnictl create --config nni_repo/nni/examples/trials/mnist-pytorch/config.yml --port 5000 &
+ get_ipython().system_raw('./ngrok http 5000 &')
+
+
+#. Check the public URL.
+
+.. code-block:: bash
+
+ ! curl -s http://localhost:4040/api/tunnels # don't change the port number 4040
+
+You will see a URL like http://xxxx.ngrok.io after step 4; open this URL and you will find NNI's Web UI. Have fun :)
+
+Access Web UI with frp
+----------------------
+
+frp is another reverse proxy with similar functions. However, frp doesn't provide free public URLs, so you may need a server with a public IP as the frp server. See `here `__ to learn more about how to deploy frp.
diff --git a/docs/en_US/CommunitySharings/NasComparison.rst b/docs/en_US/CommunitySharings/NasComparison.rst
new file mode 100644
index 0000000000..d2a9ac1131
--- /dev/null
+++ b/docs/en_US/CommunitySharings/NasComparison.rst
@@ -0,0 +1,165 @@
+Neural Architecture Search Comparison
+=====================================
+
+*Posted by Anonymous Author*
+
+Train and Compare NAS (Neural Architecture Search) models including Autokeras, DARTS, ENAS and NAO.
+
+Their source code links are as follows:
+
+
+*
+ Autokeras: `https://github.com/jhfjhfj1/autokeras `__
+
+*
+ DARTS: `https://github.com/quark0/darts `__
+
+*
+ ENAS: `https://github.com/melodyguan/enas `__
+
+*
+ NAO: `https://github.com/renqianluo/NAO `__
+
+Experiment Description
+----------------------
+
+To avoid over-fitting on **CIFAR-10**\ , we also compare the models on five other datasets: Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (a subset of ImageNet), and ImageNet-10-2 (another subset of ImageNet). Each ImageNet-10 subset is sampled from ImageNet with 10 different labels.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Dataset
+ - Training Size
+     - Number of Classes
+ - Descriptions
+ * - `Fashion-MNIST `__
+ - 60,000
+ - 10
+ - T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.
+ * - `CIFAR-10 `__
+ - 50,000
+ - 10
+ - Airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks.
+ * - `CIFAR-100 `__
+ - 50,000
+ - 100
+ - Similar to CIFAR-10 but with 100 classes and 600 images each.
+ * - `OUI-Adience-Age `__
+ - 26,580
+ - 8
+ - 8 age groups/labels (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-).
+ * - `ImageNet-10-1 `__
+ - 9,750
+ - 10
+ - Coffee mug, computer keyboard, dining table, wardrobe, lawn mower, microphone, swing, sewing machine, odometer and gas pump.
+ * - `ImageNet-10-2 `__
+ - 9,750
+ - 10
+     - Drum, banjo, whistle, grand piano, violin, organ, acoustic guitar, trombone, flute and sax.
+
+
+We do not change the default fine-tuning technique in their source code. To match each task, only the code for the input image shape and the number of output classes is changed.
+
+The search phase for all NAS methods is limited to **two days**\ , and so is the retraining time. Average results are reported based on **three repeated runs**. Our evaluation machines have one Nvidia Tesla P100 GPU, 112GB of RAM and one 2.60GHz CPU (Intel E5-2690).
+
+NAO requires too many computing resources, so we only use NAO-WS, which provides the pipeline script.
+
+For AutoKeras, we used version 0.2.18 because it was the latest version when we started the experiment.
+
+NAS Performance
+---------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - NAS
+ - AutoKeras (%)
+ - ENAS (macro) (%)
+ - ENAS (micro) (%)
+ - DARTS (%)
+ - NAO-WS (%)
+ * - Fashion-MNIST
+ - 91.84
+ - 95.44
+ - 95.53
+ - **95.74**
+ - 95.20
+ * - CIFAR-10
+ - 75.78
+ - 95.68
+ - **96.16**
+ - 94.23
+ - 95.64
+ * - CIFAR-100
+ - 43.61
+ - 78.13
+ - 78.84
+ - **79.74**
+ - 75.75
+ * - OUI-Adience-Age
+ - 63.20
+ - **80.34**
+ - 78.55
+ - 76.83
+ - 72.96
+ * - ImageNet-10-1
+ - 61.80
+ - 77.07
+ - 79.80
+ - **80.48**
+ - 77.20
+ * - ImageNet-10-2
+ - 37.20
+ - 58.13
+ - 56.47
+ - 60.53
+ - **61.20**
+
+
+Unfortunately, we could not reproduce all the results reported in the papers.
+
+The best or average results reported in the papers are:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - NAS
+ - AutoKeras(%)
+ - ENAS (macro) (%)
+ - ENAS (micro) (%)
+ - DARTS (%)
+ - NAO-WS (%)
+   * - CIFAR-10
+ - 88.56(best)
+ - 96.13(best)
+ - 97.11(best)
+ - 97.17(average)
+ - 96.47(best)
+
+
+AutoKeras has relatively poor performance across all datasets due to the random factor in its network morphism.
+
+For ENAS, ENAS (macro) shows good results on OUI-Adience-Age and ENAS (micro) shows good results on CIFAR-10.
+
+DARTS performs well on some datasets, but we found high variance on others. The difference among three runs can be up to 5.37% on OUI-Adience-Age and 4.36% on ImageNet-10-1.
+
+NAO-WS shows good results on ImageNet-10-2, but it can perform very poorly on OUI-Adience-Age.
+
+Reference
+---------
+
+
+#.
+ Jin, Haifeng, Qingquan Song, and Xia Hu. "Efficient neural architecture search with network morphism." *arXiv preprint arXiv:1806.10282* (2018).
+
+#.
+ Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search." arXiv preprint arXiv:1806.09055 (2018).
+
+#.
+ Pham, Hieu, et al. "Efficient Neural Architecture Search via Parameters Sharing." international conference on machine learning (2018): 4092-4101.
+
+#.
+ Luo, Renqian, et al. "Neural Architecture Optimization." neural information processing systems (2018): 7827-7838.
diff --git a/docs/en_US/CommunitySharings/ParallelizingTpeSearch.rst b/docs/en_US/CommunitySharings/ParallelizingTpeSearch.rst
new file mode 100644
index 0000000000..3d75962f6c
--- /dev/null
+++ b/docs/en_US/CommunitySharings/ParallelizingTpeSearch.rst
@@ -0,0 +1,183 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+Parallelizing a Sequential Algorithm TPE
+========================================
+
+TPE approaches were actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. For the TPE approach, the so-called constant liar approach was used: each time a candidate point x∗ was proposed, a fake fitness evaluation y was assigned temporarily, until the evaluation completed and reported the actual loss f(x∗).
+
+Introduction and Problems
+-------------------------
+
+Sequential Model-based Global Optimization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive. In an application where the true fitness function f: X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in the figure below. SMBO algorithms differ in the criterion they optimize to obtain x∗ given a model (or surrogate) of f, and in how they model f via the observation history H.
+
+
+.. image:: ../../img/parallel_tpe_search4.PNG
+ :target: ../../img/parallel_tpe_search4.PNG
+ :alt:
+
+
+The algorithms in this work optimize the criterion of Expected Improvement (EI). Other criteria have been suggested, such as Probability of Improvement, minimizing the Conditional Entropy of the Minimizer, and bandit-based criteria. We chose to use the EI criterion in TPE because it is intuitive and has been shown to work well in a variety of settings. Expected improvement is the expectation under some model M of f : X → R^N that f(x) will exceed (negatively) some threshold y∗:
+
+
+.. image:: ../../img/parallel_tpe_search_ei.PNG
+ :target: ../../img/parallel_tpe_search_ei.PNG
+ :alt:
+
+
+Since the calculation of p(y|x) is expensive, the TPE approach models p(y|x) via p(x|y) and p(y). TPE defines p(x|y) using two densities:
+
+
+.. image:: ../../img/parallel_tpe_search_tpe.PNG
+ :target: ../../img/parallel_tpe_search_tpe.PNG
+ :alt:
+
+
+where l(x) is the density formed by using the observations {x(i)} such that the corresponding loss
+f(x(i)) was less than y∗, and g(x) is the density formed by using the remaining observations. The TPE algorithm depends on a y∗ that is larger than the best observed f(x) so that some points can be used to form l(x). The TPE algorithm chooses y∗ to be some quantile γ of the observed y values, so that p(y<\ ``y∗``\ ) = γ, but no specific model for p(y) is necessary. The tree-structured form of l and g makes it easy to draw many candidates according to l and evaluate them according to g(x)/l(x). On each iteration, the algorithm returns the candidate x∗ with the greatest EI.
+
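+A compact numerical sketch of this candidate-selection step is shown below: a toy 1-D problem with Gaussian kernel density estimates standing in for TPE's tree-structured densities (the data and kernel choice are illustrative assumptions, not NNI's implementation).
+
+.. code-block:: python
+
+   import numpy as np
+   from scipy.stats import gaussian_kde
+
+   rng = np.random.default_rng(0)
+
+   # Toy observation history: a 1-D hyperparameter x and its loss y.
+   xs = rng.uniform(-5, 5, size=50)
+   ys = (xs - 1.0) ** 2 + rng.normal(scale=0.5, size=50)
+
+   gamma = 0.25                              # quantile that defines y*
+   y_star = np.quantile(ys, gamma)
+   good, bad = xs[ys < y_star], xs[ys >= y_star]
+
+   l = gaussian_kde(good)                    # density of the "good" observations
+   g = gaussian_kde(bad)                     # density of the remaining observations
+
+   candidates = l.resample(100)[0]           # draw candidates according to l(x)
+   scores = l(candidates) / g(candidates)    # EI grows with l(x)/g(x)
+   print(candidates[np.argmax(scores)])      # the proposal x* for this iteration
+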
+Here is a simulation of the TPE algorithm in a two-dimensional search space. The differences in background color represent different objective values. It can be seen that TPE combines exploration and exploitation very well. (Black indicates the points sampled in this round, and yellow indicates the points taken in previous rounds.)
+
+
+.. image:: ../../img/parallel_tpe_search1.gif
+ :target: ../../img/parallel_tpe_search1.gif
+ :alt:
+
+
+**Since EI is a continuous function, the x with the highest EI is fully determined for a given state.** As shown in the figure below, the blue triangle is the point that is most likely to be sampled in this state.
+
+
+.. image:: ../../img/parallel_tpe_search_ei2.PNG
+ :target: ../../img/parallel_tpe_search_ei2.PNG
+ :alt:
+
+
+TPE performs well when used sequentially, but if we provide a larger concurrency, **a large number of points will be produced in the same EI state**\ ; points that are too concentrated reduce the exploration ability of the tuner, resulting in wasted resources.
+
+Here is the simulation when we set ``concurrency=60``\ ; this phenomenon is clearly visible.
+
+
+.. image:: ../../img/parallel_tpe_search2.gif
+ :target: ../../img/parallel_tpe_search2.gif
+ :alt:
+
+
+Research solution
+-----------------
+
+Approximated q-EI Maximization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The multi-points criterion presented below can potentially be used to deliver an additional design of experiments in one step through the resolution of the optimization problem.
+
+
+.. image:: ../../img/parallel_tpe_search_qEI.PNG
+ :target: ../../img/parallel_tpe_search_qEI.PNG
+ :alt:
+
+
+However, the computation of q-EI becomes intensive as q increases. From our research, there are four popular greedy strategies that approximate the result of this problem while avoiding its numerical cost; two of them are described below.
+
+Solution 1: Believing the OK Predictor: The KB(Kriging Believer) Heuristic Strategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Kriging Believer strategy replaces the conditional knowledge about the responses at the sites chosen within the last iterations by deterministic values equal to the expectation of the Kriging predictor. Keeping the same notations as previously, the strategy can be summed up as follows:
+
+
+.. image:: ../../img/parallel_tpe_search_kb.PNG
+ :target: ../../img/parallel_tpe_search_kb.PNG
+ :alt:
+
+
+This sequential strategy delivers a q-points design and is computationally affordable since it relies on the analytically known EI, optimized in d dimensions. However, there is a risk of failure, since believing an OK predictor that overshoots the observed data may lead to a sequence that gets trapped in a non-optimal region for many iterations. We now propose a second strategy that reduces this risk.
+
+Solution 2: The CL(Constant Liar) Heuristic Strategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Let us now consider a sequential strategy in which the metamodel is updated (still without hyperparameter re-estimation) at each iteration with a value L exogenously fixed by the user, here called a "lie". The strategy referred to as the Constant Liar consists in lying with the same value L at every iteration: maximize EI (i.e. find x(n+1)), update the model as if y(x(n+1)) = L, and so on, always with the same L ∈ R:
+
+
+.. image:: ../../img/parallel_tpe_search_cl.PNG
+ :target: ../../img/parallel_tpe_search_cl.PNG
+ :alt:
+
+
+L should logically be determined on the basis of the values taken by y at X. Three values, min{Y}, mean{Y}, and max{Y} are considered here. **The larger L is, the more explorative the algorithm will be, and vice versa.**
+
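+The loop below is a small, self-contained sketch of this Constant Liar batching scheme. The ``suggest`` function is only a stand-in for the sequential tuner's proposal step (in TPE it would fit l(x)/g(x) on the history and maximize EI); everything else simply shows how the fake observations are appended.
+
+.. code-block:: python
+
+   import random
+
+   def suggest(observed):
+       """Stand-in for a sequential tuner's proposal step (e.g. TPE's EI maximization).
+
+       ``observed`` is a list of (x, y) pairs; a real tuner would fit l(x)/g(x)
+       and return the candidate with the highest EI. Here we return a random
+       point so that the sketch stays self-contained.
+       """
+       return [random.uniform(-5, 5), random.uniform(-5, 5)]
+
+   def constant_liar_batch(observed, batch_size, lie):
+       """Propose ``batch_size`` points at once with the Constant Liar heuristic."""
+       fake_history = list(observed)
+       batch = []
+       for _ in range(batch_size):
+           x_star = suggest(fake_history)      # maximize EI on the current (partly fake) history
+           fake_history.append((x_star, lie))  # pretend the evaluation already returned the lie L
+           batch.append(x_star)
+       return batch
+
+   # CL[mean]: lie with the mean of the observed losses.
+   observed = [([0.0, 0.0], 1.2), ([1.0, -2.0], 0.7)]
+   lie = sum(y for _, y in observed) / len(observed)
+   print(constant_liar_batch(observed, batch_size=4, lie=lie))
+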
+We have simulated the method above. The following figure shows the result of using the mean-value lie to maximize q-EI. We find that the sampled points are now much more scattered.
+
+
+.. image:: ../../img/parallel_tpe_search3.gif
+ :target: ../../img/parallel_tpe_search3.gif
+ :alt:
+
+
+Experiment
+----------
+
+Branin-Hoo
+^^^^^^^^^^
+
+The four optimization strategies presented in the last section are now compared on the Branin-Hoo function which is a classical test-case in global optimization.
+
+
+.. image:: ../../img/parallel_tpe_search_branin.PNG
+ :target: ../../img/parallel_tpe_search_branin.PNG
+ :alt:
+
+
+The recommended values of a, b, c, r, s and t are: a = 1, b = 5.1 ⁄ (4π²), c = 5 ⁄ π, r = 6, s = 10 and t = 1 ⁄ (8π). This function has three global minimizers: (-3.14, 12.27), (3.14, 2.27), (9.42, 2.47).
+
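+The function itself is easy to reproduce. The sketch below uses the recommended parameter values quoted above; the minimizer coordinates are the standard ones, which the rounded values above approximate.
+
+.. code-block:: python
+
+   import math
+
+   def branin(x1, x2, a=1.0, b=5.1 / (4 * math.pi ** 2), c=5 / math.pi,
+              r=6.0, s=10.0, t=1 / (8 * math.pi)):
+       """Branin-Hoo function with the recommended parameter values."""
+       return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * math.cos(x1) + s
+
+   # All three global minimizers evaluate to roughly the same minimum (~0.3979).
+   for x1, x2 in [(-math.pi, 12.275), (math.pi, 2.275), (9.42478, 2.475)]:
+       print(round(branin(x1, x2), 4))
+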
+Next is the comparison of the q-EI associated with the q first points (q ∈ [1,10]) given by the constant liar strategies (min and max), 2000 q-points designs uniformly drawn for every q, and 2000 q-points LHS designs taken at random for every q.
+
+
+.. image:: ../../img/parallel_tpe_search_result.PNG
+ :target: ../../img/parallel_tpe_search_result.PNG
+ :alt:
+
+
+As we can see in the figure, CL[max] and CL[min] offer very good q-EI results compared to random designs, especially for small values of q.
+
+Gaussian Mixed Model function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We also compared the cases with and without parallel optimization. A two-dimensional multimodal Gaussian mixture distribution is used for the simulation; the results are as follows:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * -
+ - concurrency=80
+ - concurrency=60
+ - concurrency=40
+ - concurrency=20
+ - concurrency=10
+   * - Without parallel optimization
+     - avg = 0.4841, var = 0.1953
+     - avg = 0.5155, var = 0.2219
+     - avg = 0.5773, var = 0.2570
+     - avg = 0.4680, var = 0.1994
+     - avg = 0.2774, var = 0.1217
+   * - With parallel optimization
+     - avg = 0.2132, var = 0.0700
+     - avg = 0.2177, var = 0.0796
+     - avg = 0.1835, var = 0.0533
+     - avg = 0.1671, var = 0.0413
+     - avg = 0.1918, var = 0.0697
+
+
+Note: the total number of samples per test is 240 (to ensure that the budgets are equal). The trials in each setting were repeated 1000 times; the values are the average and variance of the best results over the 1000 trials.
+
+References
+----------
+
+[1] James Bergstra, Remi Bardenet, Yoshua Bengio, Balazs Kegl. "Algorithms for Hyper-Parameter Optimization". `Link `__
+
+[2] Meng-Hiot Lim, Yew-Soon Ong. "Computational Intelligence in Expensive Optimization Problems". `Link `__
+
+[3] Christopher M. Bishop. "Pattern Recognition and Machine Learning". `Link `__
diff --git a/docs/en_US/CommunitySharings/RecommendersSvd.rst b/docs/en_US/CommunitySharings/RecommendersSvd.rst
new file mode 100644
index 0000000000..5c90b2b507
--- /dev/null
+++ b/docs/en_US/CommunitySharings/RecommendersSvd.rst
@@ -0,0 +1,15 @@
+Automatically tuning SVD (NNI in Recommenders)
+==============================================
+
+In this tutorial, we first introduce the GitHub repo `Recommenders `__. It provides examples and best practices for building recommendation systems, offered as Jupyter notebooks. It has various models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, each example is presented in five key tasks, as shown below:
+
+
+* `Prepare Data `__\ : Preparing and loading data for each recommender algorithm.
+* `Model `__\ : Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (\ `ALS `__\ ) or eXtreme Deep Factorization Machines (\ `xDeepFM `__\ ).
+* `Evaluate `__\ : Evaluating algorithms with offline metrics.
+* `Model Select and Optimize `__\ : Tuning and optimizing hyperparameters for recommender models.
+* `Operationalize `__\ : Operationalizing models in a production environment on Azure.
+
+The fourth task, tuning and optimizing the models' hyperparameters, is where NNI can help. To give a concrete example of NNI tuning the models in Recommenders, let's demonstrate with the model `SVD `__ and the Movielens100k data. There are more than 10 hyperparameters to be tuned in this model.
+
+`This Jupyter notebook `__ provided by Recommenders is a very detailed step-by-step tutorial for this example. It uses different built-in tuning algorithms in NNI, including ``Annealing``\ , ``SMAC``\ , ``Random Search``\ , ``TPE``\ , ``Hyperband``\ , ``Metis`` and ``Evolution``. Finally, the results of the different tuning algorithms are compared. Please go through this notebook to learn how to use NNI to tune the SVD model; you can then use NNI to tune other models in Recommenders.
diff --git a/docs/en_US/CommunitySharings/SptagAutoTune.rst b/docs/en_US/CommunitySharings/SptagAutoTune.rst
new file mode 100644
index 0000000000..6f6e8df601
--- /dev/null
+++ b/docs/en_US/CommunitySharings/SptagAutoTune.rst
@@ -0,0 +1,9 @@
+Automatically tuning SPTAG with NNI
+===================================
+
+`SPTAG `__ (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by `Microsoft Research (MSR) `__ and `Microsoft Bing `__.
+
+This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 distance or cosine distance. The vectors returned for a query vector are the vectors that have the smallest L2 distance or cosine distance to the query vector.
+SPTAG provides two methods: kd-tree plus relative neighborhood graph (SPTAG-KDT), and balanced k-means tree plus relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index-building cost, and SPTAG-BKT is advantageous in search accuracy on very high-dimensional data.
+
+In SPTAG, there are tens of parameters that can be tuned for specific scenarios or datasets. NNI is a great tool for automatically tuning those parameters. The authors of SPTAG tried NNI for auto tuning and found well-performing parameters easily, so they shared their practice of tuning SPTAG with NNI in their document `here `__. Please refer to it for a detailed tutorial.
diff --git a/docs/en_US/Compression/AutoPruningUsingTuners.rst b/docs/en_US/Compression/AutoPruningUsingTuners.rst
new file mode 100644
index 0000000000..abda796614
--- /dev/null
+++ b/docs/en_US/Compression/AutoPruningUsingTuners.rst
@@ -0,0 +1,121 @@
+Automatic Model Pruning using NNI Tuners
+========================================
+
+It's convenient to implement automatic model pruning with NNI compression and NNI tuners.
+
+First, model compression with NNI
+---------------------------------
+
+You can easily compress a model with NNI compression. Take pruning as an example: you can prune a pretrained model with LevelPruner like this:
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import LevelPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
+ pruner = LevelPruner(model, config_list)
+ pruner.compress()
+
+The 'default' op_type stands for the module types defined in :githublink:`default_layers.py ` for PyTorch.
+
+Therefore ``{ 'sparsity': 0.8, 'op_types': ['default'] }`` means that **all layers with the specified op_types will be compressed with the same 0.8 sparsity**. When ``pruner.compress()`` is called, the model is compressed with masks; after that you can fine-tune the model normally, and the **pruned (masked) weights won't be updated**.
+
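+A minimal sketch of that fine-tuning step is given below. It is a plain PyTorch training loop, not an NNI API; ``model`` comes from the snippet above, while ``train_loader`` is a placeholder for your own data loader.
+
+.. code-block:: python
+
+   import torch
+   import torch.nn.functional as F
+
+   optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
+   for epoch in range(3):
+       model.train()
+       for data, target in train_loader:
+           optimizer.zero_grad()
+           loss = F.cross_entropy(model(data), target)
+           loss.backward()
+           optimizer.step()  # the masked weights stay pruned during these updates
+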
+Then, make this automatic
+-------------------------
+
+The previous example manually chose LevelPruner and pruned all layers with the same sparsity. This is obviously sub-optimal because different layers may have different amounts of redundancy. Layer sparsities should be carefully tuned to minimize model performance degradation, and this can be done with NNI tuners.
+
+The first thing we need to do is design a search space. Here we use a nested search space that contains both the choice of pruning algorithm and the layer sparsities to optimize.
+
+.. code-block:: json
+
+ {
+ "prune_method": {
+ "_type": "choice",
+ "_value": [
+ {
+ "_name": "agp",
+ "conv0_sparsity": {
+ "_type": "uniform",
+ "_value": [
+ 0.1,
+ 0.9
+ ]
+ },
+ "conv1_sparsity": {
+ "_type": "uniform",
+ "_value": [
+ 0.1,
+ 0.9
+ ]
+ },
+ },
+ {
+ "_name": "level",
+ "conv0_sparsity": {
+ "_type": "uniform",
+ "_value": [
+ 0.1,
+ 0.9
+ ]
+ },
+ "conv1_sparsity": {
+ "_type": "uniform",
+ "_value": [
+ 0.01,
+ 0.9
+ ]
+ },
+ }
+ ]
+ }
+ }
+
+Then we need to modify a few lines of our code:
+
+.. code-block:: python
+
+ import nni
+ from nni.algorithms.compression.pytorch.pruning import *
+ params = nni.get_parameters()
+ conv0_sparsity = params['prune_method']['conv0_sparsity']
+ conv1_sparsity = params['prune_method']['conv1_sparsity']
+ # these raw sparsity should be scaled if you need total sparsity constrained
+ config_list_level = [{ 'sparsity': conv0_sparsity, 'op_name': 'conv0' },
+ { 'sparsity': conv1_sparsity, 'op_name': 'conv1' }]
+ config_list_agp = [{'initial_sparsity': 0, 'final_sparsity': conv0_sparsity,
+ 'start_epoch': 0, 'end_epoch': 3,
+ 'frequency': 1,'op_name': 'conv0' },
+ {'initial_sparsity': 0, 'final_sparsity': conv1_sparsity,
+ 'start_epoch': 0, 'end_epoch': 3,
+ 'frequency': 1,'op_name': 'conv1' },]
+ PRUNERS = {'level':LevelPruner(model, config_list_level), 'agp':AGPPruner(model, config_list_agp)}
+    pruner = PRUNERS[params['prune_method']['_name']]
+ pruner.compress()
+ ... # fine tuning
+ acc = evaluate(model) # evaluation
+    nni.report_final_result(acc)
+
+Last, define our experiment so that NNI automatically tunes the pruning method together with the layer sparsities:
+
+.. code-block:: yaml
+
+ authorName: default
+ experimentName: Auto_Compression
+ trialConcurrency: 2
+ maxExecDuration: 100h
+ maxTrialNum: 500
+ #choice: local, remote, pai
+ trainingServicePlatform: local
+ #choice: true, false
+ useAnnotation: False
+ searchSpacePath: search_space.json
+ tuner:
+ #choice: TPE, Random, Anneal...
+ builtinTunerName: TPE
+ classArgs:
+ #choice: maximize, minimize
+ optimize_mode: maximize
+ trial:
+ command: bash run_prune.sh
+ codeDir: .
+ gpuNum: 1
diff --git a/docs/en_US/Compression/CompressionReference.rst b/docs/en_US/Compression/CompressionReference.rst
new file mode 100644
index 0000000000..0ead9cec81
--- /dev/null
+++ b/docs/en_US/Compression/CompressionReference.rst
@@ -0,0 +1,33 @@
+Python API Reference of Compression Utilities
+=============================================
+
+.. contents::
+
+Sensitivity Utilities
+---------------------
+
+.. autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
+ :members:
+
+Topology Utilities
+------------------
+
+.. autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
+ :members:
+
+.. autoclass:: nni.compression.pytorch.utils.shape_dependency.GroupDependency
+ :members:
+
+.. autoclass:: nni.compression.pytorch.utils.mask_conflict.CatMaskPadding
+ :members:
+
+.. autoclass:: nni.compression.pytorch.utils.mask_conflict.GroupMaskConflict
+ :members:
+
+.. autoclass:: nni.compression.pytorch.utils.mask_conflict.ChannelMaskConflict
+ :members:
+
+Model FLOPs/Parameters Counter
+------------------------------
+
+.. autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
diff --git a/docs/en_US/Compression/CompressionUtils.rst b/docs/en_US/Compression/CompressionUtils.rst
new file mode 100644
index 0000000000..c56d80d085
--- /dev/null
+++ b/docs/en_US/Compression/CompressionUtils.rst
@@ -0,0 +1,175 @@
+Analysis Utils for Model Compression
+====================================
+
+.. contents::
+
+We provide several easy-to-use tools for users to analyze their model during model compression.
+
+Sensitivity Analysis
+--------------------
+
+First, we provide a sensitivity analysis tool (\ **SensitivityAnalysis**\ ) for users to analyze the sensitivity of each convolutional layer in their model. Specifically, SensitivityAnalysis gradually prunes each layer of the model and tests the accuracy of the model at the same time. Note that SensitivityAnalysis only prunes one layer at a time, while the other layers keep their original weights. According to the accuracies of different convolutional layers under different sparsities, we can easily find out which layers the model accuracy is more sensitive to.
+
+Usage
+^^^^^
+
+The following codes show the basic usage of the SensitivityAnalysis.
+
+.. code-block:: python
+
+   import os
+   import torch
+
+   from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis
+
+ def val(model):
+ model.eval()
+ total = 0
+ correct = 0
+ with torch.no_grad():
+ for batchid, (data, label) in enumerate(val_loader):
+ data, label = data.cuda(), label.cuda()
+ out = model(data)
+ _, predicted = out.max(1)
+ total += data.size(0)
+ correct += predicted.eq(label).sum().item()
+ return correct / total
+
+ s_analyzer = SensitivityAnalysis(model=net, val_func=val)
+ sensitivity = s_analyzer.analysis(val_args=[net])
+   os.makedirs(outdir, exist_ok=True)
+ s_analyzer.export(os.path.join(outdir, filename))
+
+Two key parameters of SensitivityAnalysis are ``model`` and ``val_func``. ``model`` is the neural network to be analyzed, and ``val_func`` is the validation function that returns the model accuracy/loss or other metrics on the validation dataset. Because different scenarios may calculate the loss/accuracy in different ways, users should prepare a function that returns the model accuracy/loss on their dataset and pass it to SensitivityAnalysis.
+SensitivityAnalysis can also export the sensitivity results as a CSV file, as shown in the example above.
+
+Furthermore, users can specify the sparsity values used to prune each layer via the optional parameter ``sparsities``.
+
+.. code-block:: python
+
+ s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75])
+
+SensitivityAnalysis will gradually prune 25%, 50%, and 75% of the weights of each layer and record the model's accuracy at the same time (SensitivityAnalysis only prunes one layer at a time; the other layers keep their original weights). If ``sparsities`` is not set, SensitivityAnalysis will use ``numpy.arange(0.1, 1.0, 0.1)`` as the default sparsity values.
+
+Users can also speed up the sensitivity analysis with the ``early_stop_mode`` and ``early_stop_value`` options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. In contrast, when ``early_stop_mode`` and ``early_stop_value`` are set, the sensitivity analysis for a layer stops as soon as the accuracy/loss meets the threshold set by ``early_stop_value``. We support four early stop modes: minimize, maximize, dropped, raised.
+
+minimize: The analysis stops when the validation metric returned by ``val_func`` is lower than ``early_stop_value``.
+
+maximize: The analysis stops when the validation metric returned by ``val_func`` is larger than ``early_stop_value``.
+
+dropped: The analysis stops when the validation metric has dropped by ``early_stop_value``.
+
+raised: The analysis stops when the validation metric has risen by ``early_stop_value``.
+
+.. code-block:: python
+
+ s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75], early_stop_mode='dropped', early_stop_value=0.1)
+
+If users only want to analyze several specific convolutional layers, they can specify the target conv layers via ``specified_layers`` in the ``analysis`` function. ``specified_layers`` is a list of the PyTorch module names of the conv layers. For example:
+
+.. code-block:: python
+
+ sensitivity = s_analyzer.analysis(val_args=[net], specified_layers=['Conv1'])
+
+In this example, only the ``Conv1`` layer is analyzed. In addition, users can quickly and easily parallelize the analysis by launching multiple processes and assigning different conv layers of the same model to each process.
+
+Output example
+^^^^^^^^^^^^^^
+
+The following lines are an example CSV file exported from SensitivityAnalysis. The first line consists of 'layername' followed by the list of sparsities. Here the sparsity value means how much of the layer's weight SensitivityAnalysis prunes. Each following line records the model accuracy when the corresponding layer is pruned under different sparsities. Note that, due to the early stop option, some layers may
+not have model accuracies/losses under all sparsities, for example when the accuracy drop has already exceeded the threshold set by the user.
+
+.. code-block:: bash
+
+ layername,0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.85,0.95
+ features.0,0.54566,0.46308,0.06978,0.0374,0.03024,0.01512,0.00866,0.00492,0.00184
+ features.3,0.54878,0.51184,0.37978,0.19814,0.07178,0.02114,0.00438,0.00442,0.00142
+ features.6,0.55128,0.53566,0.4887,0.4167,0.31178,0.19152,0.08612,0.01258,0.00236
+ features.8,0.55696,0.54194,0.48892,0.42986,0.33048,0.2266,0.09566,0.02348,0.0056
+ features.10,0.55468,0.5394,0.49576,0.4291,0.3591,0.28138,0.14256,0.05446,0.01578
+
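+One straightforward way to consume this file is to load it with pandas and rank the layers, as in the short sketch below (``'sensitivity.csv'`` is a placeholder for whatever path you passed to ``export``):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   df = pd.read_csv('sensitivity.csv', index_col='layername')
+   # Layers with the lowest remaining accuracy at 50% sparsity are the most sensitive.
+   print(df['0.5'].sort_values())
+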
+Topology Analysis
+-----------------
+
+We also provide several tools for topology analysis during model compression. These tools help users compress their model better. Because of the complex topology of the network, when compressing a model, users often need to spend a lot of effort checking whether the compression configuration is reasonable. So we provide these topology analysis tools to reduce the burden on users.
+
+ChannelDependency
+^^^^^^^^^^^^^^^^^
+
+Complicated models may have residual connections or concat operations. When users prune these models, they need to be careful about the channel-count dependencies between the convolutional layers in the model. Taking the following residual block in resnet18 as an example: the output features of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` are added together, so the number of output channels of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` should be the same, or there may be a tensor shape conflict.
+
+
+.. image:: ../../img/channel_dependency_example.jpg
+ :target: ../../img/channel_dependency_example.jpg
+ :alt:
+
+
+If layers that have a channel dependency are assigned different sparsities (here we only discuss structured pruning by L1FilterPruner/L2FilterPruner), then there will be a shape conflict between these layers. Even if the pruned model with masks works fine, the pruned model cannot be sped up directly into the final model that runs on devices, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies to help users prune their model better.
+
+Usage
+^^^^^
+
+.. code-block:: python
+
+ from nni.compression.pytorch.utils.shape_dependency import ChannelDependency
+ data = torch.ones(1, 3, 224, 224).cuda()
+ channel_depen = ChannelDependency(net, data)
+ channel_depen.export('dependency.csv')
+
+Output Example
+^^^^^^^^^^^^^^
+
+The following lines are the output example of torchvision.models.resnet18 exported by ChannelDependency. The layers on the same line have output channel dependencies with each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 have output channel dependencies with each other, which means the output channel (filter) counts of these three layers should be the same; otherwise, the model may have a shape conflict.
+
+.. code-block:: bash
+
+ Dependency Set,Convolutional Layers
+ Set 1,layer1.1.conv2,layer1.0.conv2,conv1
+ Set 2,layer1.0.conv1
+ Set 3,layer1.1.conv1
+ Set 4,layer2.0.conv1
+ Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
+ Set 6,layer2.1.conv1
+ Set 7,layer3.0.conv1
+ Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
+ Set 9,layer3.1.conv1
+ Set 10,layer4.0.conv1
+ Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
+ Set 12,layer4.1.conv1
+
+MaskConflict
+^^^^^^^^^^^^
+
+When the masks of different layers in a model conflict (for example, when different sparsities are assigned to layers that have a channel dependency), we can fix the mask conflict with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.), checks whether there is a mask conflict, and if so, sets the conflicting masks to the same value.
+
+.. code-block:: python
+
+ from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict
+ fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
+
+Model FLOPs/Parameters Counter
+------------------------------
+
+We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating the FLOPs/parameters of a normal model without masks, and it can also calculate the FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression with NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask and do not take the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
+
+We support two modes for collecting information about modules. The first mode is ``default``\ , which only collects the information of convolution and linear layers. The second mode is ``full``\ , which also collects the information of other operations. Users can easily use the collected ``results`` for further analysis.
+
+Usage
+^^^^^
+
+.. code-block:: python
+
+ from nni.compression.pytorch.utils.counter import count_flops_params
+
+ # Given input size (1, 1, 28, 28)
+ flops, params, results = count_flops_params(model, (1, 1, 28, 28))
+
+ # Given input tensor with size (1, 1, 28, 28) and switch to full mode
+ x = torch.randn(1, 1, 28, 28)
+
+   flops, params, results = count_flops_params(model, (x,), mode='full') # tuple of tensors as input
+
+   # Format the output in M (i.e., 10^6)
+   print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')
+   print(results)
+   # Example output:
+   # {
+   #     'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
+   #     'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
+   # }
diff --git a/docs/en_US/Compression/CustomizeCompressor.rst b/docs/en_US/Compression/CustomizeCompressor.rst
new file mode 100644
index 0000000000..7457439c9c
--- /dev/null
+++ b/docs/en_US/Compression/CustomizeCompressor.rst
@@ -0,0 +1,179 @@
+Customize New Compression Algorithm
+===================================
+
+.. contents::
+
+In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface, which covers pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then demonstrate how to customize a new quantization algorithm.
+
+**Important Note**: To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. See the `Framework overview of model compression `__.
+
+Customize a new pruning algorithm
+---------------------------------
+
+Implementing a new pruning algorithm requires implementing a ``weight masker`` class, which should be a subclass of ``WeightMasker``\ , and a ``pruner`` class, which should be a subclass of ``Pruner``.
+
+An implementation of ``weight masker`` may look like this:
+
+.. code-block:: python
+
+ class MyMasker(WeightMasker):
+ def __init__(self, model, pruner):
+ super().__init__(model, pruner)
+ # You can do some initialization here, such as collecting some statistics data
+ # if it is necessary for your algorithms to calculate the masks.
+
+ def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
+ # calculate the masks based on the wrapper.weight, and sparsity,
+ # and anything else
+ # mask = ...
+ return {'weight_mask': mask}
+
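+For a concrete (if simplistic) instance of the template above, the sketch below fills in ``calc_mask`` with a magnitude-based rule. It assumes the wrapped module's weight tensor is reachable as ``wrapper.module.weight`` (as in NNI's built-in maskers) and that ``WeightMasker`` is imported as in the template; it is an illustration, not one of NNI's shipped maskers.
+
+.. code-block:: python
+
+   import torch
+
+   class MagnitudeMasker(WeightMasker):
+       def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
+           # Keep the (1 - sparsity) fraction of weights with the largest magnitude.
+           weight = wrapper.module.weight.data
+           k = int(weight.numel() * sparsity)
+           if k == 0:
+               return {'weight_mask': torch.ones_like(weight)}
+           threshold = weight.abs().view(-1).topk(k, largest=False)[0].max()
+           mask = (weight.abs() > threshold).type_as(weight)
+           return {'weight_mask': mask}
+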
+You can refer to the :githublink:`weight masker ` implementations provided by NNI to implement your own weight masker.
+
+A basic ``pruner`` looks like this:
+
+.. code-block:: python
+
+ class MyPruner(Pruner):
+ def __init__(self, model, config_list, optimizer):
+ super().__init__(model, config_list, optimizer)
+ self.set_wrappers_attribute("if_calculated", False)
+ # construct a weight masker instance
+ self.masker = MyMasker(model, self)
+
+ def calc_mask(self, wrapper, wrapper_idx=None):
+ sparsity = wrapper.config['sparsity']
+ if wrapper.if_calculated:
+ # Already pruned, do not prune again as a one-shot pruner
+ return None
+ else:
+               # call your masker to actually calculate the mask for this layer
+ masks = self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper, wrapper_idx=wrapper_idx)
+ wrapper.if_calculated = True
+ return masks
+
+Refer to the :githublink:`pruner ` implementations provided by NNI to implement your own pruner class.
+
+----
+
+Customize a new quantization algorithm
+--------------------------------------
+
+To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``. Then, override the member functions with the logic of your algorithm. The main member function to override is ``quantize_weight``. ``quantize_weight`` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
+
+.. code-block:: python
+
+ from nni.compression.pytorch import Quantizer
+
+ class YourQuantizer(Quantizer):
+ def __init__(self, model, config_list):
+ """
+           We suggest using the NNI-defined spec for config
+ """
+ super().__init__(model, config_list)
+
+ def quantize_weight(self, weight, config, **kwargs):
+ """
+ quantize should overload this method to quantize weight tensors.
+ This method is effectively hooked to :meth:`forward` of the model.
+
+ Parameters
+ ----------
+ weight : Tensor
+ weight that needs to be quantized
+ config : dict
+ the configuration for weight quantization
+ """
+
+ # Put your code to generate `new_weight` here
+
+ return new_weight
+
+ def quantize_output(self, output, config, **kwargs):
+ """
+ quantize should overload this method to quantize output.
+           This method is effectively hooked to :meth:`forward` of the model.
+
+ Parameters
+ ----------
+ output : Tensor
+ output that needs to be quantized
+ config : dict
+ the configuration for output quantization
+ """
+
+ # Put your code to generate `new_output` here
+
+ return new_output
+
+ def quantize_input(self, *inputs, config, **kwargs):
+ """
+ quantize should overload this method to quantize input.
+ This method is effectively hooked to :meth:`forward` of the model.
+
+ Parameters
+ ----------
+ inputs : Tensor
+ inputs that needs to be quantized
+ config : dict
+ the configuration for inputs quantization
+ """
+
+ # Put your code to generate `new_input` here
+
+ return new_input
+
+ def update_epoch(self, epoch_num):
+ pass
+
+ def step(self):
+ """
+           Can do some processing based on the model or weights bound
+           in the func bind_model
+ """
+ pass
+
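+As a concrete (and deliberately naive) illustration of the template above, the sketch below quantizes weights with a symmetric uniform quantizer. It only fills in ``quantize_weight`` and keeps the signature shown above; the bit-width handling is an assumption about the config layout shown earlier (``'quant_bits'`` may be an int or a per-type dict), and this is not one of NNI's built-in quantizers.
+
+.. code-block:: python
+
+   import torch
+   from nni.compression.pytorch import Quantizer
+
+   class NaiveSymmetricQuantizer(Quantizer):
+       """A minimal sketch: symmetric uniform quantization of weights."""
+
+       def quantize_weight(self, weight, config, **kwargs):
+           bits = config.get('quant_bits', 8)
+           if isinstance(bits, dict):
+               bits = bits.get('weight', 8)
+           qmax = 2 ** (bits - 1) - 1
+           scale = weight.abs().max() / qmax
+           if scale == 0:
+               return weight
+           # Round to the nearest representable level, then de-quantize.
+           return torch.round(weight / scale) * scale
+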
+Customize backward function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes it's necessary for a quantization operation to have a customized backward function, such as the `Straight-Through Estimator `__. Users can customize a backward function as follows:
+
+.. code-block:: python
+
+ from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
+
+ class ClipGrad(QuantGrad):
+ @staticmethod
+ def quant_backward(tensor, grad_output, quant_type):
+ """
+           This method should be overridden by the subclass to provide a customized backward function;
+ default implementation is Straight-Through Estimator
+ Parameters
+ ----------
+ tensor : Tensor
+ input of quantization operation
+ grad_output : Tensor
+ gradient of the output of quantization operation
+ quant_type : QuantType
+ the type of quantization, it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, `QuantType.QUANT_OUTPUT`,
+ you can define different behavior for different types.
+ Returns
+ -------
+ tensor
+ gradient of the input of quantization operation
+ """
+
+ # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1
+ if quant_type == QuantType.QUANT_OUTPUT:
+ grad_output[torch.abs(tensor) > 1] = 0
+ return grad_output
+
+
+ class YourQuantizer(Quantizer):
+ def __init__(self, model, config_list):
+ super().__init__(model, config_list)
+ # set your customized backward function to overwrite default backward function
+ self.quant_grad = ClipGrad
+
+If you do not customize ``QuantGrad``\ , the default backward function is the Straight-Through Estimator.
+
+*Coming Soon* ...
diff --git a/docs/en_US/Compression/DependencyAware.rst b/docs/en_US/Compression/DependencyAware.rst
new file mode 100644
index 0000000000..5001ca7430
--- /dev/null
+++ b/docs/en_US/Compression/DependencyAware.rst
@@ -0,0 +1,77 @@
+Dependency-aware Mode for Filter Pruning
+========================================
+
+Currently, we have several filter pruning algorithms for convolutional layers: FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner. In these filter pruning algorithms, the pruner prunes each convolutional layer separately. While pruning a convolutional layer, the algorithm quantifies the importance of each filter based on some specific rule (such as the L1 norm) and prunes the less important filters.
+
+As the `dependency analysis utils <./CompressionUtils.rst>`__ show, if the output channels of two convolutional layers (conv1, conv2) are added together, then these two conv layers have a channel dependency with each other (for more details please see `Compression Utils <./CompressionUtils.rst>`__\ ). Take the following figure as an example.
+
+
+.. image:: ../../img/mask_conflict.jpg
+ :target: ../../img/mask_conflict.jpg
+ :alt:
+
+
+Suppose we prune the first 50% of the output channels (filters) of conv1 and the last 50% of the output channels of conv2. Although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels. In this case, we cannot harvest the speed benefit from model pruning.
+
+To better gain the speed benefit of model pruning, we add a dependency-aware mode for the filter pruners. In the dependency-aware mode, the pruner prunes the model based not only on the L1 norm of each filter, but also on the topology of the whole network architecture.
+
+In the dependency-aware mode (\ ``dependency_aware`` is set to ``True``\ ), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
+
+
+.. image:: ../../img/dependency-aware.jpg
+ :target: ../../img/dependency-aware.jpg
+ :alt:
+
+
+Take the dependency-aware mode of L1Filter Pruner as an example. Specifically, the pruner calculates, for each channel, the sum of the L1 norms of that channel across all the layers in the dependency set. Obviously, the number of channels that can actually be pruned in this dependency set is determined by the minimum sparsity among the layers in the set (denoted by ``min_sparsity``\ ). According to the L1 norm sum of each channel, the pruner prunes the same ``min_sparsity`` fraction of channels for all the layers. Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on that layer's own per-channel L1 norms. For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively. In this case, the ``dependency-aware pruner`` will:
+
+* First, prune the same 20% of channels for ``conv1`` and ``conv2`` according to the L1 norm sum of ``conv1`` and ``conv2``.
+* Second, additionally prune 10% of the channels of ``conv1`` according to the L1 norm of each channel of ``conv1`` (an illustrative sketch of this channel-selection step is given below).
+
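+The snippet below is an illustrative sketch of that joint channel-selection step only (it is not NNI's internal code): two convolution layers whose outputs are added together form one dependency set, and the jointly pruned channels are chosen by the summed per-channel L1 norms.
+
+.. code-block:: python
+
+   import torch
+   import torch.nn as nn
+
+   # Two conv layers whose outputs are added together form one dependency set.
+   conv1, conv2 = nn.Conv2d(16, 32, 3), nn.Conv2d(16, 32, 1)
+   min_sparsity = 0.2   # min(0.3, 0.2) from the example above
+
+   # Per-output-channel L1 norms, summed across the dependency set.
+   l1_sum = (conv1.weight.detach().abs().sum(dim=(1, 2, 3))
+             + conv2.weight.detach().abs().sum(dim=(1, 2, 3)))
+
+   # These channels are pruned jointly in every layer of the set; each layer then
+   # prunes extra channels on its own until it reaches its configured sparsity.
+   n_pruned = int(min_sparsity * l1_sum.numel())
+   common_pruned = torch.argsort(l1_sum)[:n_pruned]
+   print(common_pruned)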
+
+In addition, for convolutional layers that have more than one filter group, the ``dependency-aware pruner`` will also try to prune the same number of channels for each filter group. Overall, this pruner prunes the model according to the L1 norm of each filter while trying to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
+
+In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
+
+Usage
+-----
+
+In this section, we will show how to enable the dependency-aware mode for a filter pruner. Currently, only the one-shot pruners, such as FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner, support the dependency-aware mode.
+
+To enable the dependency-aware mode for ``L1FilterPruner``\ :
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+ # dummy_input is necessary for the dependency_aware mode
+ dummy_input = torch.ones(1, 3, 224, 224).cuda()
+ pruner = L1FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
+ # for L2FilterPruner
+ # pruner = L2FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
+ # for FPGMPruner
+ # pruner = FPGMPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
+ # for ActivationAPoZRankFilterPruner
+   # pruner = ActivationAPoZRankFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
+ # for ActivationMeanRankFilterPruner
+ # pruner = ActivationMeanRankFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
+ # for TaylorFOWeightFilterPruner
+ # pruner = TaylorFOWeightFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
+
+ pruner.compress()
+
+Evaluation
+----------
+
+In order to compare the performance of the pruner with and without the dependency-aware mode, we use L1FilterPruner to prune MobileNetV2 with the dependency-aware mode turned on and off. To simplify the experiment, we use uniform pruning, which means we allocate the same sparsity to all convolutional layers in the model.
+We trained a MobileNetV2 model on the CIFAR-10 dataset and pruned the model based on this pretrained checkpoint. The following figure shows the accuracy and FLOPs of the model pruned by the different pruners.
+
+
+.. image:: ../../img/mobilev2_l1_cifar.jpg
+ :target: ../../img/mobilev2_l1_cifar.jpg
+ :alt:
+
+
+In the figure, ``Dependency-aware`` represents the L1FilterPruner with the dependency-aware mode enabled, ``L1 Filter`` is the normal ``L1FilterPruner`` without the dependency-aware mode, and ``No-Dependency`` means the pruner only prunes the layers that have no channel dependency with other layers. As we can see in the figure, when the dependency-aware mode is enabled, the pruner can achieve higher accuracy under the same FLOPs.
diff --git a/docs/en_US/Compression/Framework.rst b/docs/en_US/Compression/Framework.rst
new file mode 100644
index 0000000000..fa46b60230
--- /dev/null
+++ b/docs/en_US/Compression/Framework.rst
@@ -0,0 +1,209 @@
+Framework overview of model compression
+=======================================
+
+.. contents::
+
+Below picture shows the components overview of model compression framework.
+
+
+.. image:: ../../img/compressor_framework.jpg
+ :target: ../../img/compressor_framework.jpg
+ :alt:
+
+
+There are 3 major components/classes in NNI model compression framework: ``Compressor``\ , ``Pruner`` and ``Quantizer``. Let's look at them in detail one by one:
+
+Compressor
+----------
+
+Compressor is the base class for pruners and quantizers. It provides a unified interface so that pruners and quantizers can be used by end users in the same way. For example, to use a pruner:
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import LevelPruner
+
+ # load a pretrained model or train a model before using a pruner
+
+ configure_list = [{
+ 'sparsity': 0.7,
+ 'op_types': ['Conv2d', 'Linear'],
+ }]
+
+ optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
+ pruner = LevelPruner(model, configure_list, optimizer)
+ model = pruner.compress()
+
+ # model is ready for pruning, now start finetune the model,
+ # the model will be pruned during training automatically
+
+To use a quantizer:
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
+
+ configure_list = [{
+ 'quant_types': ['weight'],
+ 'quant_bits': {
+ 'weight': 8,
+ },
+ 'op_types':['Conv2d', 'Linear']
+ }]
+ optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
+ quantizer = DoReFaQuantizer(model, configure_list, optimizer)
+ quantizer.compress()
+
+View :githublink:`example code ` for more information.
+
+``Compressor`` class provides some utility methods for subclass and users:
+
+Set wrapper attribute
+^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes ``calc_mask`` must save some state data, so users can use the ``set_wrappers_attribute`` API to register attributes, just like how buffers are registered in PyTorch modules. These buffers are registered to the ``module wrapper``\ , and users can access them through the ``module wrapper``.
+In the above example, we use ``set_wrappers_attribute`` to set a buffer ``if_calculated``\ , which is used as a flag indicating whether the mask of a layer has already been calculated.
+
+Collect data during forward
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes users want to collect some data during the modules' forward method, for example, the mean value of the activation. This can be done by adding a customized collector to the module.
+
+.. code-block:: python
+
+   import torch
+
+   # ``WeightMasker`` is the base class of pruning maskers in NNI
+   class MyMasker(WeightMasker):
+ def __init__(self, model, pruner):
+ super().__init__(model, pruner)
+ # Set attribute `collected_activation` for all wrappers to store
+ # activations for each layer
+ self.pruner.set_wrappers_attribute("collected_activation", [])
+ self.activation = torch.nn.functional.relu
+
+ def collector(wrapper, input_, output):
+ # The collected activation can be accessed via each wrapper's collected_activation
+ # attribute
+ wrapper.collected_activation.append(self.activation(output.detach().cpu()))
+
+ self.pruner.hook_id = self.pruner.add_activation_collector(collector)
+
+The collector function will be called each time the forward method runs.
+
+Users can also remove this collector like this:
+
+.. code-block:: python
+
+ # Save the collector identifier
+ collector_id = self.pruner.add_activation_collector(collector)
+
+   # When the collector is not used any more, it can be removed using
+ # the saved collector identifier
+ self.pruner.remove_activation_collector(collector_id)
+
+----
+
+Pruner
+------
+
+A pruner receives ``model``\ , ``config_list`` and ``optimizer`` as arguments. It prunes the model per the ``config_list`` during the training loop by adding a hook on ``optimizer.step()``.
+
+The Pruner class is a subclass of Compressor, so it contains everything in the Compressor class plus some additional components that are only used for pruning:
+
+Weight masker
+^^^^^^^^^^^^^
+
+A ``weight masker`` is the implementation of a pruning algorithm; it can prune a specified layer wrapped by a ``module wrapper`` with a specified sparsity.
+
+Pruning module wrapper
+^^^^^^^^^^^^^^^^^^^^^^
+
+A ``pruning module wrapper`` is a module containing:
+
+
+#. the origin module
+#. some buffers used by ``calc_mask``
+#. a new forward method that applies masks before running the original forward method.
+
+The reasons for using a ``module wrapper`` (a minimal sketch of such a wrapper is shown below):
+
+
+#. some buffers are needed by ``calc_mask`` to calculate masks, and these buffers should be registered in the ``module wrapper`` so that the original modules are not contaminated.
+#. a new ``forward`` method is needed to apply masks to the weight before calling the real ``forward`` method.
+
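+A conceptual sketch of such a wrapper is shown below; it is an illustration of the three components above, not NNI's actual implementation:
+
+.. code-block:: python
+
+   import torch
+   import torch.nn as nn
+
+   class MyModuleWrapper(nn.Module):
+       def __init__(self, module):
+           super().__init__()
+           # 1. the original module
+           self.module = module
+           # 2. buffers used by ``calc_mask``
+           self.register_buffer('weight_mask', torch.ones_like(module.weight))
+
+       def forward(self, *inputs):
+           # 3. apply the mask before running the original forward method
+           self.module.weight.data = self.module.weight.data.mul(self.weight_mask)
+           return self.module(*inputs)
+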
+Pruning hook
+^^^^^^^^^^^^
+
+A pruning hook is installed on a pruner when the pruner is constructed. It is used to call the pruner's ``calc_mask`` method when ``optimizer.step()`` is invoked.
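+
+Conceptually, the hook works like the following sketch (an illustration, not NNI's actual code): the original ``optimizer.step`` is wrapped so that ``calc_mask`` is re-run for every wrapped module after each weight update.
+
+.. code-block:: python
+
+   def patch_optimizer(pruner, optimizer):
+       original_step = optimizer.step
+
+       def patched_step(*args, **kwargs):
+           result = original_step(*args, **kwargs)
+           # recompute the masks after the weights have been updated
+           for wrapper in pruner.get_modules_wrapper():
+               pruner.calc_mask(wrapper)
+           return result
+
+       optimizer.step = patched_step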
+
+----
+
+Quantizer
+---------
+
+The Quantizer class is also a subclass of ``Compressor``. It is used to compress models by reducing the number of bits required to represent weights or activations, which can reduce computation and inference time. It contains:
+
+Quantization module wrapper
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Each module/layer of the model to be quantized is wrapped by a quantization module wrapper, which provides a new ``forward`` method to quantize the original module's weight, input and output.
+
+Quantization hook
+^^^^^^^^^^^^^^^^^
+
+A quantization hook is installed on a quantizer when it is constructed; it is called when ``optimizer.step()`` is invoked.
+
+Quantization methods
+^^^^^^^^^^^^^^^^^^^^
+
+``Quantizer`` class provides following methods for subclass to implement quantization algorithms:
+
+.. code-block:: python
+
+ class Quantizer(Compressor):
+ """
+ Base quantizer for pytorch quantizer
+ """
+ def quantize_weight(self, weight, wrapper, **kwargs):
+ """
+ quantize should overload this method to quantize weight.
+ This method is effectively hooked to :meth:`forward` of the model.
+ Parameters
+ ----------
+ weight : Tensor
+ weight that needs to be quantized
+ wrapper : QuantizerModuleWrapper
+ the wrapper for origin module
+ """
+ raise NotImplementedError('Quantizer must overload quantize_weight()')
+
+ def quantize_output(self, output, wrapper, **kwargs):
+ """
+ quantize should overload this method to quantize output.
+ This method is effectively hooked to :meth:`forward` of the model.
+ Parameters
+ ----------
+ output : Tensor
+ output that needs to be quantized
+ wrapper : QuantizerModuleWrapper
+ the wrapper for origin module
+ """
+ raise NotImplementedError('Quantizer must overload quantize_output()')
+
+ def quantize_input(self, *inputs, wrapper, **kwargs):
+ """
+ quantize should overload this method to quantize input.
+ This method is effectively hooked to :meth:`forward` of the model.
+ Parameters
+ ----------
+ inputs : Tensor
+ inputs that needs to be quantized
+ wrapper : QuantizerModuleWrapper
+ the wrapper for origin module
+ """
+ raise NotImplementedError('Quantizer must overload quantize_input()')
+
+----
+
+Multi-GPU support
+-----------------
+
+During multi-GPU training, buffers and parameters are copied to each GPU every time the ``forward`` method runs. If buffers and parameters are updated in the ``forward`` method, an ``in-place`` update is needed to make the update effective.
+Since ``calc_mask`` is called in the ``optimizer.step`` method, which happens after the ``forward`` method and only on one GPU, multi-GPU training is supported naturally.
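+
+For example, assume a wrapper keeps a registered buffer ``step_count`` that is updated inside ``forward`` (the buffer name is hypothetical, for illustration only):
+
+.. code-block:: python
+
+   def forward(self, *inputs):
+       # Rebinding the attribute creates a new tensor on the replica, so the
+       # update is lost when the replica is discarded after the forward pass:
+       #     self.step_count = self.step_count + 1
+       # An in-place update writes into the existing buffer instead:
+       self.step_count.add_(1)
+       return self.module(*inputs)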
diff --git a/docs/en_US/Compression/ModelSpeedup.rst b/docs/en_US/Compression/ModelSpeedup.rst
new file mode 100644
index 0000000000..ed7ec2a78b
--- /dev/null
+++ b/docs/en_US/Compression/ModelSpeedup.rst
@@ -0,0 +1,190 @@
+Speed up Masked Model
+=====================
+
+*This feature is in Beta version.*
+
+Introduction
+------------
+
+Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
+to check the model performance of a specific pruning (or sparsity), but there is no real speedup.
+Since model speedup is the ultimate goal of model pruning, we try to provide a tool for users
+to convert a model to a smaller one based on user-provided masks (the masks come from the
+pruning algorithms).
+
+There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights or input/output tensors; a sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., channels), where the shape of weights and input/output tensors usually changes due to the pruning. To speed up this kind of pruning there is no need for a sparse kernel; the pruned layer can simply be replaced with a smaller one. Since the support for sparse kernels in the community is limited, we only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning for the future.
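+
+The difference can be illustrated with a small sketch (illustrative shapes only):
+
+.. code-block:: python
+
+   import torch
+
+   weight = torch.randn(8, 4, 3, 3)          # a Conv2d weight: (out, in, kH, kW)
+
+   # fine-grained mask: element-wise, same shape as the weight, no shape change
+   fine_mask = (weight.abs() > 0.5).float()
+
+   # coarse-grained (filter) mask: whole output channels are removed, so the
+   # layer can be replaced by a smaller Conv2d with 6 output channels
+   channel_mask = torch.ones(8, 1, 1, 1)
+   channel_mask[[0, 3]] = 0.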
+
+Design and Implementation
+-------------------------
+
+To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask, or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors, thus, we should do shape inference to check whether there are other unpruned layers that should also be replaced due to the shape change. Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced; second, replace the modules. The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
+
+For each module, we should prepare four functions, three for shape inference and one for module replacement. The three shape inference functions are: given the weight shape, infer the input/output shape; given the input shape, infer the weight/output shape; given the output shape, infer the weight/input shape. The module replacement function returns a newly created module which is smaller.
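+
+As an illustration, a replacement function for ``Conv2d`` could look like the sketch below; this is a simplified, hypothetical helper that ignores grouped convolutions, not the function NNI actually uses:
+
+.. code-block:: python
+
+   import torch.nn as nn
+
+   def replace_conv2d(conv: nn.Conv2d, in_idx, out_idx):
+       # in_idx / out_idx: indices of the input / output channels that are kept
+       new_conv = nn.Conv2d(len(in_idx), len(out_idx),
+                            kernel_size=conv.kernel_size,
+                            stride=conv.stride,
+                            padding=conv.padding,
+                            bias=conv.bias is not None)
+       # copy the surviving weights into the smaller layer
+       new_conv.weight.data = conv.weight.data[out_idx][:, in_idx].clone()
+       if conv.bias is not None:
+           new_conv.bias.data = conv.bias.data[out_idx].clone()
+       return new_conv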
+
+Usage
+-----
+
+.. code-block:: python
+
+   import time
+
+   from nni.compression.pytorch import ModelSpeedup
+ # model: the model you want to speed up
+ # dummy_input: dummy input of the model, given to `jit.trace`
+ # masks_file: the mask file created by pruning algorithms
+ m_speedup = ModelSpeedup(model, dummy_input.to(device), masks_file)
+ m_speedup.speedup_model()
+ dummy_input = dummy_input.to(device)
+ start = time.time()
+ out = model(dummy_input)
+ print('elapsed time: ', time.time() - start)
+
+For complete examples please refer to :githublink:`the code `.
+
+NOTE: The current implementation supports PyTorch 1.3.1 or newer.
+
+Limitations
+-----------
+
+Since every module requires four functions for shape inference and module replacement, implementing them is a large amount of work, so we only implemented the ones that are required by the examples. If you want to speed up your own model which is not supported by the current implementation, you are welcome to contribute.
+
+For PyTorch we can only replace modules; if functions in ``forward`` need to be replaced, our current implementation does not work. One workaround is to make the function a PyTorch module.
+
+Speedup Results of Examples
+---------------------------
+
+The code of these experiments can be found :githublink:`here `.
+
+slim pruner example
+^^^^^^^^^^^^^^^^^^^
+
+on one V100 GPU,
+input tensor: ``torch.randn(64, 3, 32, 32)``
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Times
+ - Mask Latency
+ - Speedup Latency
+ * - 1
+ - 0.01197
+ - 0.005107
+ * - 2
+ - 0.02019
+ - 0.008769
+ * - 4
+ - 0.02733
+ - 0.014809
+ * - 8
+ - 0.04310
+ - 0.027441
+ * - 16
+ - 0.07731
+ - 0.05008
+ * - 32
+ - 0.14464
+ - 0.10027
+
+
+fpgm pruner example
+^^^^^^^^^^^^^^^^^^^
+
+on CPU,
+input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
+the variance of the measurements is large
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Times
+ - Mask Latency
+ - Speedup Latency
+ * - 1
+ - 0.01383
+ - 0.01839
+ * - 2
+ - 0.01167
+ - 0.003558
+ * - 4
+ - 0.01636
+ - 0.01088
+ * - 40
+ - 0.14412
+ - 0.08268
+ * - 40
+ - 1.29385
+ - 0.14408
+ * - 40
+ - 0.41035
+ - 0.46162
+ * - 400
+ - 6.29020
+ - 5.82143
+
+
+l1filter pruner example
+^^^^^^^^^^^^^^^^^^^^^^^
+
+on one V100 GPU,
+input tensor: ``torch.randn(64, 3, 32, 32)``
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Times
+ - Mask Latency
+ - Speedup Latency
+ * - 1
+ - 0.01026
+ - 0.003677
+ * - 2
+ - 0.01657
+ - 0.008161
+ * - 4
+ - 0.02458
+ - 0.020018
+ * - 8
+ - 0.03498
+ - 0.025504
+ * - 16
+ - 0.06757
+ - 0.047523
+ * - 32
+ - 0.10487
+ - 0.086442
+
+
+APoZ pruner example
+^^^^^^^^^^^^^^^^^^^
+
+on one V100 GPU,
+input tensor: ``torch.randn(64, 3, 32, 32)``
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Times
+ - Mask Latency
+ - Speedup Latency
+ * - 1
+ - 0.01389
+ - 0.004208
+ * - 2
+ - 0.01628
+ - 0.008310
+ * - 4
+ - 0.02521
+ - 0.014008
+ * - 8
+ - 0.03386
+ - 0.023923
+ * - 16
+ - 0.06042
+ - 0.046183
+ * - 32
+ - 0.12421
+ - 0.087113
+
diff --git a/docs/en_US/Compression/Overview.rst b/docs/en_US/Compression/Overview.rst
new file mode 100644
index 0000000000..676d2d586f
--- /dev/null
+++ b/docs/en_US/Compression/Overview.rst
@@ -0,0 +1,118 @@
+Model Compression with NNI
+==========================
+
+.. contents::
+
+As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
+
+NNI provides a model compression toolkit to help users compress and speed up their models with state-of-the-art compression algorithms and strategies. There are several core features supported by NNI model compression:
+
+
+* Support many popular pruning and quantization algorithms.
+* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
+* Speed up a compressed model to reduce its inference latency and model size.
+* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
+* Concise interface for users to customize their own compression algorithms.
+
+*Note that the interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.*
+
+Supported Algorithms
+--------------------
+
+The algorithms include pruning algorithms and quantization algorithms.
+
+Pruning Algorithms
+^^^^^^^^^^^^^^^^^^
+
+Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and address the over-fitting issue.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Name
+ - Brief Introduction of Algorithm
+ * - `Level Pruner `__
+ - Pruning the specified ratio on each weight based on absolute values of weights
+ * - `AGP Pruner `__
+ - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper `__
+ * - `Lottery Ticket Pruner `__
+ - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper `__
+ * - `FPGM Pruner `__
+ - Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper `__
+ * - `L1Filter Pruner `__
+ - Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) `Reference Paper `__
+ * - `L2Filter Pruner `__
+ - Pruning filters with the smallest L2 norm of weights in convolution layers
+ * - `ActivationAPoZRankFilterPruner `__
+ - Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. `Reference Paper `__
+ * - `ActivationMeanRankFilterPruner `__
+ - Pruning filters based on the metric that calculates the smallest mean value of output activations
+ * - `Slim Pruner `__
+ - Pruning channels in convolution layers by pruning scaling factors in BN layers(Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper `__
+ * - `TaylorFO Pruner `__
+ - Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) `Reference Paper `__
+ * - `ADMM Pruner `__
+ - Pruning based on ADMM optimization technique `Reference Paper `__
+ * - `NetAdapt Pruner `__
+ - Automatically simplify a pretrained network to meet the resource budget by iterative pruning `Reference Paper `__
+ * - `SimulatedAnnealing Pruner `__
+ - Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm `Reference Paper `__
+ * - `AutoCompress Pruner `__
+ - Automatic pruning by iteratively call SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper `__
+ * - `AMC Pruner `__
+ - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper `__
+
+
+You can refer to this :githublink:`benchmark ` for the performance of these pruners on some benchmark problems.
+
+Quantization Algorithms
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Name
+ - Brief Introduction of Algorithm
+ * - `Naive Quantizer `__
+ - Quantize weights to default 8 bits
+ * - `QAT Quantizer `__
+ - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper `__
+ * - `DoReFa Quantizer `__
+ - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper `__
+ * - `BNN Quantizer `__
+ - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper `__
+
+
+Automatic Model Compression
+---------------------------
+
+Given a targeted compression ratio, it is pretty hard to obtain the best compressed ratio in one shot. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users could leverage NNI's auto tuning power to automatically compress a model. The detailed document can be found `here <./AutoPruningUsingTuners.rst>`__.
+
+Model Speedup
+-------------
+
+The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of compressed model, for example, using masks for pruning algorithms, and storing quantized values still in float32 for quantization algorithms. Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model. The detailed tutorial of Model Speedup can be found `here <./ModelSpeedup.rst>`__.
+
+Compression Utilities
+---------------------
+
+Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users could check sensitivity of each layer to pruning. Users could easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
+
+Customize Your Own Compression Algorithms
+-----------------------------------------
+
+NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to make users focus on the compression logic while hiding framework-specific implementation details. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found `here <./Framework.rst>`__.
+
+Reference and Feedback
+----------------------
+
+
+* To `report a bug `__ for this feature in GitHub;
+* To `file a feature or improvement request `__ for this feature in GitHub;
+* To know more about `Feature Engineering with NNI <../FeatureEngineering/Overview.rst>`__\ ;
+* To know more about `NAS with NNI <../NAS/Overview.rst>`__\ ;
+* To know more about `Hyperparameter Tuning with NNI <../Tuner/BuiltinTuner.rst>`__\ ;
diff --git a/docs/en_US/Compression/Pruner.rst b/docs/en_US/Compression/Pruner.rst
new file mode 100644
index 0000000000..e677f69b46
--- /dev/null
+++ b/docs/en_US/Compression/Pruner.rst
@@ -0,0 +1,801 @@
+Supported Pruning Algorithms on NNI
+===================================
+
+We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Fine-grained Pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter Pruning** achieves acceleration by removing entire filters. We also provide an algorithm to control the **pruning schedule**.
+
+**Fine-grained Pruning**
+
+
+* `Level Pruner <#level-pruner>`__
+
+**Filter Pruning**
+
+
+* `Slim Pruner <#slim-pruner>`__
+* `FPGM Pruner <#fpgm-pruner>`__
+* `L1Filter Pruner <#l1filter-pruner>`__
+* `L2Filter Pruner <#l2filter-pruner>`__
+* `Activation APoZ Rank Filter Pruner <#activationAPoZRankFilter-pruner>`__
+* `Activation Mean Rank Filter Pruner <#activationmeanrankfilter-pruner>`__
+* `Taylor FO On Weight Pruner <#taylorfoweightfilter-pruner>`__
+
+**Pruning Schedule**
+
+
+* `AGP Pruner <#agp-pruner>`__
+* `NetAdapt Pruner <#netadapt-pruner>`__
+* `SimulatedAnnealing Pruner <#simulatedannealing-pruner>`__
+* `AutoCompress Pruner <#autocompress-pruner>`__
+* `AMC Pruner <#amc-pruner>`__
+* `Sensitivity Pruner <#sensitivity-pruner>`__
+
+**Others**
+
+
+* `ADMM Pruner <#admm-pruner>`__
+* `Lottery Ticket Hypothesis <#lottery-ticket-hypothesis>`__
+
+Level Pruner
+------------
+
+This is a basic one-shot pruner: you can set a target sparsity level (expressed as a fraction, 0.6 means we will prune 60% of the weight parameters).
+
+We first sort the weights in the specified layer by their absolute values, and then mask to zero the smallest-magnitude weights until the desired sparsity level is reached.
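+
+The masking rule can be sketched as follows (illustrative code, not NNI's implementation):
+
+.. code-block:: python
+
+   import torch
+
+   def level_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
+       num_prune = int(weight.numel() * sparsity)
+       if num_prune == 0:
+           return torch.ones_like(weight)
+       # the threshold is the k-th smallest absolute value
+       threshold = weight.abs().flatten().kthvalue(num_prune).values
+       return (weight.abs() > threshold).float()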
+
+Usage
+^^^^^
+
+Tensorflow code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.tensorflow.pruning import LevelPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
+ pruner = LevelPruner(model, config_list)
+ pruner.compress()
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import LevelPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
+ pruner = LevelPruner(model, config_list)
+ pruner.compress()
+
+User configuration for Level Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.LevelPruner
+
+TensorFlow
+""""""""""
+
+.. autoclass:: nni.algorithms.compression.tensorflow.pruning.LevelPruner
+
+Slim Pruner
+-----------
+
+This is a one-shot pruner proposed in `'Learning Efficient Convolutional Networks through Network Slimming' `__ by Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang.
+
+
+.. image:: ../../img/slim_pruner.png
+ :target: ../../img/slim_pruner.png
+ :alt:
+
+
+..
+
+   Slim Pruner **prunes channels in the convolution layers by masking corresponding scaling factors in the later BN layers**. L1 regularization on the scaling factors should be applied to the batch normalization (BN) layers while training, and the scaling factors of the BN layers are **globally ranked** while pruning, so the sparse model can be automatically found for a given sparsity.
+
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import SlimPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }]
+ pruner = SlimPruner(model, config_list)
+ pruner.compress()
+
+User configuration for Slim Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.SlimPruner
+
+Reproduced Experiment
+^^^^^^^^^^^^^^^^^^^^^
+
+We implemented one of the experiments in `'Learning Efficient Convolutional Networks through Network Slimming' `__\ ; we pruned 70% of the channels in the **VGGNet** for CIFAR-10, in which 88.5% of the parameters are pruned. Our experiment results are as follows:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model
+ - Error(paper/ours)
+ - Parameters
+ - Pruned
+ * - VGGNet
+ - 6.34/6.40
+ - 20.04M
+ -
+ * - Pruned-VGGNet
+ - 6.20/6.26
+ - 2.03M
+ - 88.5%
+
+
+The experiments code can be found at :githublink:`examples/model_compress `
+
+----
+
+FPGM Pruner
+-----------
+
+This is a one-shot pruner. FPGM Pruner is an implementation of the paper `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `__.
+
+FPGMPruner prunes the filters that are closest to the geometric median of the filters in the same layer, i.e., the most replaceable ones.
+
+
+.. image:: ../../img/fpgm_fig1.png
+ :target: ../../img/fpgm_fig1.png
+ :alt:
+
+
+..
+
+ Previous works utilized “smaller-norm-less-important” criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with “relatively less” importance.
+
+
+We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please reference `dependency-aware <./DependencyAware.rst>`__ for more details.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import FPGMPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = FPGMPruner(model, config_list)
+ pruner.compress()
+
+User configuration for FPGM Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.FPGMPruner
+
+L1Filter Pruner
+---------------
+
+This is a one-shot pruner described in `'PRUNING FILTERS FOR EFFICIENT CONVNETS' `__ by Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf.
+
+
+.. image:: ../../img/l1filter_pruner.png
+ :target: ../../img/l1filter_pruner.png
+ :alt:
+
+
+..
+
+ L1Filter Pruner prunes filters in the **convolution layers**
+
+ The procedure of pruning m filters from the ith convolutional layer is as follows:
+
+
+ #. For each filter :math:`F_{i,j}`, calculate the sum of its absolute kernel weights :math:`s_j=\sum_{l=1}^{n_i}\sum|K_l|`.
+
+ #. Sort the filters by :math:`s_j`.
+
+ #. Prune :math:`m` filters with the smallest sum values and their corresponding feature maps. The
+ kernels in the next convolutional layer corresponding to the pruned feature maps are also removed.
+
+ #. A new kernel matrix is created for both the :math:`i`-th and :math:`i+1`-th layers, and the remaining kernel
+ weights are copied to the new model.
+
+
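+The per-filter score in step 1 and the selection in step 3 can be sketched as follows (illustrative code operating on a ``Conv2d`` weight of shape ``(out_channels, in_channels, kH, kW)``\ , not NNI's implementation):
+
+.. code-block:: python
+
+   import torch
+
+   def l1_filter_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
+       num_prune = int(weight.size(0) * sparsity)
+       # s_j: sum of absolute kernel weights of each filter
+       s = weight.abs().sum(dim=(1, 2, 3))
+       # indices of the filters with the smallest sums
+       prune_idx = torch.argsort(s)[:num_prune]
+       mask = torch.ones_like(weight)
+       mask[prune_idx] = 0.
+       return mask
+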
+In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please reference `dependency-aware mode <./DependencyAware.rst>`__.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+ pruner = L1FilterPruner(model, config_list)
+ pruner.compress()
+
+User configuration for L1Filter Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.L1FilterPruner
+
+Reproduced Experiment
+^^^^^^^^^^^^^^^^^^^^^
+
+We implemented one of the experiments in `'PRUNING FILTERS FOR EFFICIENT CONVNETS' `__ with **L1FilterPruner**\ ; we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** as in the paper, in which 64% of the parameters are pruned. Our experiment results are as follows:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model
+ - Error(paper/ours)
+ - Parameters
+ - Pruned
+ * - VGG-16
+ - 6.75/6.49
+ - 1.5x10^7
+ -
+ * - VGG-16-pruned-A
+ - 6.60/6.47
+ - 5.4x10^6
+ - 64.0%
+
+
+The experiments code can be found at :githublink:`examples/model_compress `
+
+----
+
+L2Filter Pruner
+---------------
+
+This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
+
+We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please reference `dependency-aware <./DependencyAware.rst>`__ for more details.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import L2FilterPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+ pruner = L2FilterPruner(model, config_list)
+ pruner.compress()
+
+User configuration for L2Filter Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.L2FilterPruner
+
+----
+
+ActivationAPoZRankFilter Pruner
+-------------------------------
+
+ActivationAPoZRankFilter Pruner prunes the filters with the smallest importance, measured by the criterion ``APoZ`` calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures `__.
+
+The APoZ is defined as:
+
+
+.. image:: ../../img/apoz.png
+ :target: ../../img/apoz.png
+ :alt:
+
+
+We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please reference `dependency-aware <./DependencyAware.rst>`__ for more details.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import ActivationAPoZRankFilterPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = ActivationAPoZRankFilterPruner(model, config_list, statistics_batch_num=1)
+ pruner.compress()
+
+Note: ActivationAPoZRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
+
+You can view :githublink:`example ` for more information.
+
+User configuration for ActivationAPoZRankFilter Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationAPoZRankFilterPruner
+
+----
+
+ActivationMeanRankFilter Pruner
+-------------------------------
+
+ActivationMeanRankFilterPruner prunes the filters with the smallest importance, measured by the criterion ``mean activation`` calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference `__. Other pruning criteria mentioned in this paper will be supported in a future release.
+
+We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please reference `dependency-aware <./DependencyAware.rst>`__ for more details.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import ActivationMeanRankFilterPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = ActivationMeanRankFilterPruner(model, config_list, statistics_batch_num=1)
+ pruner.compress()
+
+Note: ActivationMeanRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
+
+You can view :githublink:`example ` for more information.
+
+User configuration for ActivationMeanRankFilterPruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationMeanRankFilterPruner
+
+----
+
+TaylorFOWeightFilter Pruner
+---------------------------
+
+TaylorFOWeightFilter Pruner prunes convolutional layers based on the estimated importance calculated from the first-order Taylor expansion on weights, to achieve a preset level of network sparsity. The estimated importance of filters is defined in the paper `Importance Estimation for Neural Network Pruning `__. Other pruning criteria mentioned in this paper will be supported in a future release.
+
+
+
+.. image:: ../../img/importance_estimation_sum.png
+ :target: ../../img/importance_estimation_sum.png
+ :alt:
+
+
+We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please reference `dependency-aware <./DependencyAware.rst>`__ for more details.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = TaylorFOWeightFilterPruner(model, config_list, statistics_batch_num=1)
+ pruner.compress()
+
+User configuration for TaylorFOWeightFilter Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.TaylorFOWeightFilterPruner
+
+----
+
+AGP Pruner
+----------
+
+This is an iterative pruner. In `To prune, or not to prune: exploring the efficacy of pruning for model compression `__\ , authors Michael Zhu and Suyog Gupta provide an algorithm to prune the weights gradually.
+
+..
+
+ We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value si (usually 0) to a final sparsity value sf over a span of n pruning steps, starting at training step t0 and with pruning frequency ∆t:
+
+ .. image:: ../../img/agp_pruner.png
+ :target: ../../img/agp_pruner.png
+ :alt:
+
+
+ The binary weight masks are updated every ∆t steps as the network is trained to gradually increase the sparsity of the network while allowing the network training steps to recover from any pruning-induced loss in accuracy. In our experience, varying the pruning frequency ∆t between 100 and 1000 training steps had a negligible impact on the final model quality. Once the model achieves the target sparsity sf , the weight masks are no longer updated. The intuition behind this sparsity function in equation (1).
+
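+The sparsity schedule described above can be written as the following sketch (``s_i``\ /``s_f`` are the initial/final sparsity, ``t0`` the starting step, ``n`` the number of pruning steps and ``dt`` the pruning frequency; this illustrates the formula and is not NNI's code):
+
+.. code-block:: python
+
+   def agp_sparsity(t, s_i=0.0, s_f=0.8, t0=0, n=10, dt=1):
+       # clamp t into the pruning span [t0, t0 + n * dt]
+       t = max(min(t, t0 + n * dt), t0)
+       return s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3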
+
+Usage
+^^^^^
+
+You can prune all weights from an initial sparsity of 0% to a final sparsity of 80% over 10 epochs with the code below.
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import AGPPruner
+ config_list = [{
+ 'initial_sparsity': 0,
+ 'final_sparsity': 0.8,
+ 'start_epoch': 0,
+ 'end_epoch': 10,
+ 'frequency': 1,
+ 'op_types': ['default']
+ }]
+
+ # load a pretrained model or train a model before using a pruner
+ # model = MyModel()
+ # model.load_state_dict(torch.load('mycheckpoint.pth'))
+
+ # AGP pruner prunes model while fine tuning the model by adding a hook on
+ # optimizer.step(), so an optimizer is required to prune the model.
+ optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
+
+ pruner = AGPPruner(model, config_list, optimizer, pruning_algorithm='level')
+ pruner.compress()
+
+The AGP pruner uses the ``LevelPruner`` algorithm to prune the weights by default; however, you can set the ``pruning_algorithm`` parameter to other values to use other pruning algorithms:
+
+
+* ``level``\ : LevelPruner
+* ``slim``\ : SlimPruner
+* ``l1``\ : L1FilterPruner
+* ``l2``\ : L2FilterPruner
+* ``fpgm``\ : FPGMPruner
+* ``taylorfo``\ : TaylorFOWeightFilterPruner
+* ``apoz``\ : ActivationAPoZRankFilterPruner
+* ``mean_activation``\ : ActivationMeanRankFilterPruner
+
+You should add the code below to update the epoch number when you finish one epoch in your training code.
+
+PyTorch code
+
+.. code-block:: python
+
+ pruner.update_epoch(epoch)
+
+You can view :githublink:`example ` for more information.
+
+User configuration for AGP Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.AGPPruner
+
+----
+
+NetAdapt Pruner
+---------------
+
+NetAdapt allows a user to automatically simplify a pretrained network to meet the resource budget.
+Given the overall sparsity, NetAdapt will automatically generate the sparsity distribution among different layers by iterative pruning.
+
+For more details, please refer to `NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications `__.
+
+
+.. image:: ../../img/algo_NetAdapt.png
+ :target: ../../img/algo_NetAdapt.png
+ :alt:
+
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import NetAdaptPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+   pruner = NetAdaptPruner(model, config_list, short_term_fine_tuner=short_term_fine_tuner, evaluator=evaluator, base_algo='l1', experiment_data_dir='./')
+ pruner.compress()
+
+You can view :githublink:`example ` for more information.
+
+User configuration for NetAdapt Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.NetAdaptPruner
+
+SimulatedAnnealing Pruner
+-------------------------
+
+We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm, with an enhancement that guides the search based on prior experience.
+The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights often tolerates a higher degree of compression with less impact on overall accuracy.
+
+
+* Randomly initialize a pruning rate distribution (sparsities).
+* While current_temperature < stop_temperature:
+
+  #. Generate a perturbation to the current distribution
+  #. Perform a fast evaluation on the perturbed distribution
+  #. Accept the perturbation according to the performance and probability (see the sketch below); if not accepted, return to step 1
+  #. Cool down, current_temperature <- current_temperature * cool_down_rate
+
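+The acceptance rule in step 3 can be sketched as follows (illustrative only; the exact scaling used by the implementation may differ):
+
+.. code-block:: python
+
+   import math
+   import random
+
+   def accept(current_perf, perturbed_perf, temperature):
+       delta = perturbed_perf - current_perf
+       if delta > 0:
+           # a better distribution is always accepted
+           return True
+       # a worse one is accepted with a probability that shrinks as we cool down
+       return random.random() < math.exp(delta / temperature)
+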
+For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates `__.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import SimulatedAnnealingPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = SimulatedAnnealingPruner(model, config_list, evaluator=evaluator, base_algo='l1', cool_down_rate=0.9, experiment_data_dir='./')
+ pruner.compress()
+
+You can view :githublink:`example ` for more information.
+
+User configuration for SimulatedAnnealing Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.SimulatedAnnealingPruner
+
+AutoCompress Pruner
+-------------------
+
+In each round, AutoCompressPruner prunes the model with the same sparsity to achieve the overall sparsity:
+
+.. code-block:: bash
+
+ 1. Generate sparsities distribution using SimulatedAnnealingPruner
+ 2. Perform ADMM-based structured pruning to generate pruning result for the next round.
+ Here we use `speedup` to perform real pruning.
+
+
+For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates `__.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.pruning import AutoCompressPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = AutoCompressPruner(
+ model, config_list, trainer=trainer, evaluator=evaluator,
+ dummy_input=dummy_input, num_iterations=3, optimize_mode='maximize', base_algo='l1',
+ cool_down_rate=0.9, admm_num_iterations=30, admm_training_epochs=5, experiment_data_dir='./')
+ pruner.compress()
+
+You can view :githublink:`example ` for more information.
+
+User configuration for AutoCompress Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.AutoCompressPruner
+
+AMC Pruner
+----------
+
+AMC pruner leverages reinforcement learning to provide a model compression policy.
+This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio,
+better preserving accuracy, and freeing human labor.
+
+
+.. image:: ../../img/amc_pruner.jpg
+ :target: ../../img/amc_pruner.jpg
+ :alt:
+
+
+For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices `__.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import AMCPruner
+ config_list = [{
+ 'op_types': ['Conv2d', 'Linear']
+ }]
+ pruner = AMCPruner(model, config_list, evaluator, val_loader, flops_ratio=0.5)
+ pruner.compress()
+
+You can view :githublink:`example ` for more information.
+
+User configuration for AMC Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.AMCPruner
+
+Reproduced Experiment
+^^^^^^^^^^^^^^^^^^^^^
+
+We implemented one of the experiments in `AMC: AutoML for Model Compression and Acceleration on Mobile Devices `__\ ; we pruned **MobileNet** to 50% FLOPs on ImageNet as in the paper. Our experiment results are as follows:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model
+ - Top 1 acc.(paper/ours)
+ - Top 5 acc. (paper/ours)
+ - FLOPS
+ * - MobileNet
+ - 70.5% / 69.9%
+ - 89.3% / 89.1%
+ - 50%
+
+
+The experiments code can be found at :githublink:`examples/model_compress `
+
+ADMM Pruner
+-----------
+
+Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique
+that decomposes the original nonconvex problem into two subproblems which can be solved iteratively. In the weight pruning problem, these two subproblems are solved via 1) the gradient descent algorithm and 2) Euclidean projection, respectively.
+
+During the process of solving these two subproblems, the weights of the original model will be changed. A one-shot pruner is then applied to prune the model according to the given config list.
+
+This solution framework applies both to non-structured and different variations of structured pruning schemes.
+
+For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers `__.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import ADMMPruner
+ config_list = [{
+ 'sparsity': 0.8,
+ 'op_types': ['Conv2d'],
+ 'op_names': ['conv1']
+ }, {
+ 'sparsity': 0.92,
+ 'op_types': ['Conv2d'],
+ 'op_names': ['conv2']
+ }]
+ pruner = ADMMPruner(model, config_list, trainer=trainer, num_iterations=30, epochs=5)
+ pruner.compress()
+
+You can view :githublink:`example ` for more information.
+
+User configuration for ADMM Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.ADMMPruner
+
+Lottery Ticket Hypothesis
+-------------------------
+
+In `The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks `__\ , authors Jonathan Frankle and Michael Carbin provide comprehensive measurement and analysis, and articulate the *lottery ticket hypothesis*\ : dense, randomly-initialized, feed-forward networks contain subnetworks (*winning tickets*\ ) that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations.
+
+In this paper, the authors use the following process to prune a model, called *iterative pruning*\ :
+
+..
+
+   #. Randomly initialize a neural network f(x;theta_0) (where theta_0 follows D_theta).
+ #. Train the network for j iterations, arriving at parameters theta_j.
+ #. Prune p% of the parameters in theta_j, creating a mask m.
+ #. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0).
+ #. Repeat step 2, 3, and 4.
+
+
+If the configured final sparsity is P (e.g., 0.8) and there are n pruning iterations, each iteration prunes 1-(1-P)^(1/n) of the weights that survived the previous round.
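+
+For example, with a final sparsity of P = 0.8 and n = 5 iterations, each round prunes about 27.5% of the surviving weights:
+
+.. code-block:: python
+
+   P, n = 0.8, 5
+   per_round = 1 - (1 - P) ** (1 / n)
+   print(round(per_round, 4))  # 0.2752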
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner
+ config_list = [{
+ 'prune_iterations': 5,
+ 'sparsity': 0.8,
+ 'op_types': ['default']
+ }]
+ pruner = LotteryTicketPruner(model, config_list, optimizer)
+ pruner.compress()
+ for _ in pruner.get_prune_iterations():
+ pruner.prune_iteration_start()
+ for epoch in range(epoch_num):
+ ...
+
+The above configuration means that there are 5 iterations of pruning. As the 5 pruning iterations are executed in the same run, LotteryTicketPruner needs ``model`` and ``optimizer`` (\ **note that an ``lr_scheduler`` should also be added if used**\ ) to reset their states every time a new prune iteration starts. Please use ``get_prune_iterations`` to get the pruning iterations, and invoke ``prune_iteration_start`` at the beginning of each iteration. ``epoch_num`` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity can be comparable with that obtained in the first round.
+
+*Tensorflow version will be supported later.*
+
+User configuration for LotteryTicket Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.LotteryTicketPruner
+
+Reproduced Experiment
+^^^^^^^^^^^^^^^^^^^^^
+
+We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found :githublink:`here `. In this experiment, we prune 10 times; for each pruning we train the pruned model for 50 epochs.
+
+
+.. image:: ../../img/lottery_ticket_mnist_fc.png
+ :target: ../../img/lottery_ticket_mnist_fc.png
+ :alt:
+
+
+The above figure shows the result of the fully connected network. ``round0-sparsity-0.0`` is the performance without pruning. Consistent with the paper, pruning around 80% of the weights also obtains performance similar to no pruning, and converges a little faster. If we prune too much, e.g., more than 94%, the accuracy becomes lower and convergence becomes a little slower. Slightly different from the paper, the trend of the data in the paper is clearer.
+
+Sensitivity Pruner
+------------------
+
+In each round, SensitivityPruner prunes the model based on each layer's sensitivity (i.e., its impact on accuracy) until the final configured sparsity of the whole model is met:
+
+.. code-block:: bash
+
+ 1. Analyze the sensitivity of each layer in the current state of the model.
+ 2. Prune each layer according to the sensitivity.
+
+
+For more details, please refer to `Learning both Weights and Connections for Efficient Neural Networks `__.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import SensitivityPruner
+ config_list = [{
+ 'sparsity': 0.5,
+ 'op_types': ['Conv2d']
+ }]
+ pruner = SensitivityPruner(model, config_list, finetuner=fine_tuner, evaluator=evaluator)
+ # eval_args and finetune_args are the parameters passed to the evaluator and finetuner respectively
+ pruner.compress(eval_args=[model], finetune_args=[model])
+
+User configuration for Sensitivity Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+.. autoclass:: nni.algorithms.compression.pytorch.pruning.SensitivityPruner
diff --git a/docs/en_US/Compression/Quantizer.rst b/docs/en_US/Compression/Quantizer.rst
new file mode 100644
index 0000000000..61d0607b8c
--- /dev/null
+++ b/docs/en_US/Compression/Quantizer.rst
@@ -0,0 +1,184 @@
+Supported Quantization Algorithms on NNI
+========================================
+
+Index of supported quantization algorithms
+
+
+* `Naive Quantizer <#naive-quantizer>`__
+* `QAT Quantizer <#qat-quantizer>`__
+* `DoReFa Quantizer <#dorefa-quantizer>`__
+* `BNN Quantizer <#bnn-quantizer>`__
+
+Naive Quantizer
+---------------
+
+We provide the Naive Quantizer to quantize weights to 8 bits by default. You can use it to test quantization algorithms without any configuration.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ model = nni.algorithms.compression.pytorch.quantization.NaiveQuantizer(model).compress()
+
+----
+
+QAT Quantizer
+-------------
+
+In `Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference `__\ , authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training.
+
+..
+
+ We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme
+
+
+ * Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are “folded into” the weights before quantization.
+ * Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
+
+
+Usage
+^^^^^
+
+You can quantize your model to 8 bits with the code below before your training code.
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+ model = Mnist()
+
+ config_list = [{
+ 'quant_types': ['weight'],
+ 'quant_bits': {
+ 'weight': 8,
+    }, # you can just use `int` here because all `quant_types` share the same bit length, see the config for `ReLU6` below.
+ 'op_types':['Conv2d', 'Linear']
+ }, {
+ 'quant_types': ['output'],
+ 'quant_bits': 8,
+ 'quant_start_step': 7000,
+ 'op_types':['ReLU6']
+ }]
+ quantizer = QAT_Quantizer(model, config_list)
+ quantizer.compress()
+
+You can view the example for more information.
+
+User configuration for QAT Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
+
+Configuration needed by this algorithm:
+
+
+* **quant_start_step:** int
+
+Disable quantization until the model has been run for a certain number of steps. This allows the network to enter a more stable
+state, where activation quantization ranges do not exclude a significant fraction of values. The default value is 0.
+
+Note
+^^^^
+
+Batch normalization folding is currently not supported.
+
+----
+
+DoReFa Quantizer
+----------------
+
+In `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients `__\ , authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize weights, activations and gradients during training.
+
+Usage
+^^^^^
+
+To use the DoReFa Quantizer, you can add the code below before your training code.
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
+ config_list = [{
+ 'quant_types': ['weight'],
+ 'quant_bits': 8,
+ 'op_types': 'default'
+ }]
+ quantizer = DoReFaQuantizer(model, config_list)
+ quantizer.compress()
+
+You can view the example for more information.
+
+User configuration for DoReFa Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
+
+Configuration needed by this algorithm:
+
+----
+
+BNN Quantizer
+-------------
+
+In `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 `__\ , the authors describe the method as follows:
+
+..
+
+ We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
+
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
+ model = VGG_Cifar10(num_classes=10)
+
+ configure_list = [{
+ 'quant_bits': 1,
+ 'quant_types': ['weight'],
+ 'op_types': ['Conv2d', 'Linear'],
+ 'op_names': ['features.0', 'features.3', 'features.7', 'features.10', 'features.14', 'features.17', 'classifier.0', 'classifier.3']
+ }, {
+ 'quant_bits': 1,
+ 'quant_types': ['output'],
+ 'op_types': ['Hardtanh'],
+ 'op_names': ['features.6', 'features.9', 'features.13', 'features.16', 'features.20', 'classifier.2', 'classifier.5']
+ }]
+
+ quantizer = BNNQuantizer(model, configure_list)
+ model = quantizer.compress()
+
+You can view the example :githublink:`examples/model_compress/BNN_quantizer_cifar10.py ` for more information.
+
+User configuration for BNN Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
+
+Configuration needed by this algorithm:
+
+Experiment
+^^^^^^^^^^
+
+We implemented one of the experiments in `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 `__\ ; we quantized the **VGGNet** for CIFAR-10 as in the paper. Our experiment results are as follows:
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model
+ - Accuracy
+ * - VGGNet
+ - 86.93%
+
+
+The experiments code can be found at :githublink:`examples/model_compress/BNN_quantizer_cifar10.py `
diff --git a/docs/en_US/Compression/QuickStart.rst b/docs/en_US/Compression/QuickStart.rst
new file mode 100644
index 0000000000..85a1930bfe
--- /dev/null
+++ b/docs/en_US/Compression/QuickStart.rst
@@ -0,0 +1,212 @@
+Tutorial for Model Compression
+==============================
+
+.. contents::
+
+In this tutorial, we use the `first section <#quick-start-to-compress-a-model>`__ to quickly go through the usage of model compression on NNI, and then use the `second section <#detailed-usage-guide>`__ to explain the usage in more detail.
+
+Quick Start to Compress a Model
+-------------------------------
+
+NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. Their usage is the same, so here we use `slim pruner `__ as an example to show the usage.
+
+Write configuration
+^^^^^^^^^^^^^^^^^^^
+
+Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``BatchNorm2d``\ s to sparsity 0.7 while keeping other layers unpruned.
+
+.. code-block:: python
+
+ configure_list = [{
+ 'sparsity': 0.7,
+ 'op_types': ['BatchNorm2d'],
+ }]
+
+The specification of the configuration can be found `here <#specification-of-config-list>`__. Note that different pruners may have their own fields in the configuration, for example ``start_epoch`` in the AGP pruner. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
+
+Choose a compression algorithm
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model.
+
+.. code-block:: python
+
+ pruner = SlimPruner(model, configure_list)
+ model = pruner.compress()
+
+Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during the training. Some pruners prune once at the beginning, and the following training can be seen as fine-tuning. Other pruners prune your model iteratively, adjusting the masks epoch by epoch during training.
+
+Export compression result
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.
+
+.. code-block:: python
+
+ pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
+
+The complete code of model compression examples can be found :githublink:`here `.
+
+Speed up the model
+^^^^^^^^^^^^^^^^^^
+
+Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks, thus, we provide an API to speed up your model as shown below. After invoking ``apply_compression_results`` on your model, your model becomes a smaller one with shorter inference latency.
+
+.. code-block:: python
+
+ from nni.compression.pytorch import apply_compression_results
+ apply_compression_results(model, 'mask_vgg19_cifar10.pth')
+
+Please refer to `here `__ for a detailed description.
+
+Detailed Usage Guide
+--------------------
+
+The example code for users to apply model compression on a user model can be found below:
+
+PyTorch code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.pytorch.pruning import LevelPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
+ pruner = LevelPruner(model, config_list)
+ pruner.compress()
+
+Tensorflow code
+
+.. code-block:: python
+
+ from nni.algorithms.compression.tensorflow.pruning import LevelPruner
+ config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
+ pruner = LevelPruner(tf.get_default_graph(), config_list)
+ pruner.compress()
+
+You can use other compression algorithms in the package of ``nni.compression``. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under ``nni.compression.pytorch`` and ``nni.compression.tensorflow`` respectively. You can refer to `Pruner <./Pruner.rst>`__ and `Quantizer <./Quantizer.rst>`__ for a detailed description of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to `KDExample <../TrialExample/KDExample.rst>`__.
+
+A compression algorithm is first instantiated with a ``config_list`` passed in. The specification of this ``config_list`` will be described later.
+
+The function call ``pruner.compress()`` modifies the user-defined model (in TensorFlow the model can be obtained with ``tf.get_default_graph()``\ , while in PyTorch the model is the defined model object) by inserting masks. Then, when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
+
+*Note that ``pruner.compress`` simply adds masks to the model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after ``pruner.compress``.*
+
+Specification of ``config_list``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, specify different ratios for different types of operations, exclude certain types of operations, or compress only certain types of operations. To express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.
+
+The ``dict``\ s in the ``list`` are applied one by one; that is, the configurations in a latter ``dict`` will overwrite the configurations in former ones for the operations that fall within the scope of both.
+
+There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:
+
+
+* **op_types**\ : This specifies the types of operations to be compressed. 'default' means following the algorithm's default setting.
+* **op_names**\ : This specifies the operations to be compressed by name. If this field is omitted, operations will not be filtered by it.
+* **exclude**\ : Defaults to False. If True, the operations with the specified types and names will be excluded from compression.
+
+Some other keys are specific to certain algorithms; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
+
+A simple example of configuration is shown below:
+
+.. code-block:: python
+
+ [
+ {
+ 'sparsity': 0.8,
+ 'op_types': ['default']
+ },
+ {
+ 'sparsity': 0.6,
+ 'op_names': ['op_name1', 'op_name2']
+ },
+ {
+ 'exclude': True,
+ 'op_names': ['op_name3']
+ }
+ ]
+
+It means: follow the algorithm's default setting for compressed operations with sparsity 0.8, but use sparsity 0.6 for ``op_name1`` and ``op_name2``, and do not compress ``op_name3``.
+
+Quantization specific keys
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.
+
+
+* **quant_types** : list of string.
+
+The types of quantization you want to apply; currently 'weight', 'input' and 'output' are supported. 'weight' means applying quantization to the weight parameter of modules. 'input' means applying quantization
+to the input of the module's forward method. 'output' means applying quantization to the output of the module's forward method, which is often called 'activation' in some papers.
+
+
+* **quant_bits** : int or dict of {str : int}
+
+The bit width of quantization. When a dict is given, the key is the quantization type and the value is the bit width, e.g.:
+
+.. code-block:: python
+
+ {
+     'quant_bits': {
+         'weight': 8,
+         'output': 4,
+     },
+ }
+
+When the value is an int, all quantization types share the same bit width, e.g.:
+
+.. code-block:: python
+
+ {
+     'quant_bits': 8,  # both weight and output are quantized with 8 bits
+ }
+
+The following example shows a more complete ``config_list``. It uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.
+
+.. code-block:: python
+
+ configure_list = [{
+ 'quant_types': ['weight'],
+ 'quant_bits': 8,
+ 'op_names': ['conv1']
+ }, {
+ 'quant_types': ['weight'],
+ 'quant_bits': 4,
+ 'quant_start_step': 0,
+ 'op_names': ['conv2']
+ }, {
+ 'quant_types': ['weight'],
+ 'quant_bits': 3,
+ 'op_names': ['fc1']
+ },
+ {
+ 'quant_types': ['weight'],
+ 'quant_bits': 2,
+ 'op_names': ['fc2']
+ }
+ ]
+
+In this example, ``op_names`` specifies the layer names, and the four layers will be quantized with different ``quant_bits``.
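+
+As an illustration, such a ``configure_list`` is passed to a quantizer in the same way a pruner receives its configuration. The sketch below assumes the QAT quantizer is available under ``nni.algorithms.compression.pytorch.quantization`` and that ``model`` and ``optimizer`` are already defined:
+
+.. code-block:: python
+
+ # assumed import path; check the quantizer documentation for your NNI version
+ from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+
+ quantizer = QAT_Quantizer(model, configure_list, optimizer)
+ quantizer.compress()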
+
+APIs for Updating Fine Tuning Status
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Some compression algorithms use epochs to control the progress of compression (e.g. `AGP `__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.
+
+``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
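+
+A sketch of where the two calls fit into a standard training loop (only needed when an algorithm's documentation asks for them; names such as ``train_loader``, ``criterion`` and ``optimizer`` come from your own training code):
+
+.. code-block:: python
+
+ for epoch in range(num_epochs):
+     pruner.update_epoch(epoch)        # once per epoch
+     for data, target in train_loader:
+         optimizer.zero_grad()
+         loss = criterion(model(data), target)
+         loss.backward()
+         optimizer.step()
+         pruner.step()                 # once per minibatch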
+
+Export Compressed Model
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If you are pruning your model, you can easily export the compressed model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. In this exported ``model.pth``\ , the masked weights are zero.
+
+.. code-block:: python
+
+ pruner.export_model(model_path='model.pth')
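+
+To load the exported weights back, rebuild the model first and then load the ``state_dict`` as usual. A sketch, where ``VGG19`` stands in for whatever model class was pruned:
+
+.. code-block:: python
+
+ import torch
+
+ model = VGG19()  # hypothetical constructor of the original architecture
+ model.load_state_dict(torch.load('model.pth'))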
+
+The ``mask_dict`` and the pruned model in ``onnx`` format (\ ``input_shape`` needs to be specified) can also be exported like this:
+
+.. code-block:: python
+
+ pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
+
+If you want to really speed up the compressed model, please refer to `NNI model speedup <./ModelSpeedup.rst>`__ for details.
diff --git a/docs/en_US/FeatureEngineering/GBDTSelector.rst b/docs/en_US/FeatureEngineering/GBDTSelector.rst
new file mode 100644
index 0000000000..f645b12785
--- /dev/null
+++ b/docs/en_US/FeatureEngineering/GBDTSelector.rst
@@ -0,0 +1,70 @@
+GBDTSelector
+------------
+
+GBDTSelector is based on `LightGBM `__\ , which is a gradient boosting framework that uses tree-based learning algorithms.
+
+When the data is passed into the GBDT model, the model constructs the boosting trees, and the feature importance comes from the scores in construction, which indicate how useful or valuable each feature was in the construction of the boosted decision trees within the model.
+
+This method can be used as a strong baseline for feature selection, especially when a GBDT model is used as the classifier or regressor.
+
+For now, we support ``importance_type`` values of ``split`` and ``gain``. We will support customized ``importance_type`` in the future, which means the user will be able to define how to calculate the ``feature score`` by themselves.
+
+Usage
+^^^^^
+
+First you need to install dependency:
+
+.. code-block:: bash
+
+ pip install lightgbm
+
+Then
+
+.. code-block:: python
+
+ from nni.feature_engineering.gbdt_selector import GBDTSelector
+
+ # load data
+ ...
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+
+ # initialize a selector
+ fgs = GBDTSelector()
+ # fit data
+ fgs.fit(X_train, y_train, ...)
+ # get important features
+ # this will return the indices of the important features
+ print(fgs.get_selected_features(10))
+
+ ...
+
+You can also refer to the examples in ``/examples/feature_engineering/gbdt_selector/``.
+
+**Requirement of ``fit`` FuncArgs**
+
+
+*
+ **X** (array-like, required) - The training input samples, whose shape is [n_samples, n_features].
+
+*
+ **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), whose shape is [n_samples].
+
+*
+ **lgb_params** (dict, required) - The parameters for the LightGBM model. For details, see `here `__.
+
+*
+ **eval_ratio** (float, required) - The ratio used to split the evaluation data and training data from ``self.X``.
+
+*
+ **early_stopping_rounds** (int, required) - The early stopping setting in LightGBM. For details, see `here `__.
+
+*
+ **importance_type** (str, required) - Either 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gain of the splits which use the feature. For details, see `here `__.
+
+*
+ **num_boost_round** (int, required) - The number of boosting rounds. For details, see `here `__.
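+
+Putting these arguments together, a hypothetical ``fit`` call might look as follows. The LightGBM parameters and the values are illustrative only, and the keyword names are assumed to match the list above:
+
+.. code-block:: python
+
+ # illustrative LightGBM parameters, not a recommendation
+ lgb_params = {'objective': 'binary', 'num_leaves': 31, 'learning_rate': 0.05}
+
+ fgs = GBDTSelector()
+ fgs.fit(X_train, y_train,
+         lgb_params=lgb_params,
+         eval_ratio=0.2,
+         early_stopping_rounds=10,
+         importance_type='gain',
+         num_boost_round=1000)
+ print(fgs.get_selected_features(10))  # indices of the top-10 features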
+
+**Requirement of ``get_selected_features`` FuncArgs**
+
+
+* **topk** (int, required) - The number of top important features you want to select.
diff --git a/docs/en_US/FeatureEngineering/GradientFeatureSelector.rst b/docs/en_US/FeatureEngineering/GradientFeatureSelector.rst
new file mode 100644
index 0000000000..1b4b212bdd
--- /dev/null
+++ b/docs/en_US/FeatureEngineering/GradientFeatureSelector.rst
@@ -0,0 +1,107 @@
+GradientFeatureSelector
+-----------------------
+
+The algorithm in GradientFeatureSelector comes from `"Feature Gradients: Scalable Feature Selection via Discrete Relaxation" `__.
+
+GradientFeatureSelector is a gradient-based search algorithm
+for feature selection.
+
+1) This approach extends a recent result on the estimation of
+learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.
+
+2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
+
+3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
+
+Usage
+^^^^^
+
+.. code-block:: python
+
+ from nni.feature_engineering.gradient_selector import FeatureGradientSelector
+
+ # load data
+ ...
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+
+ # initialize a selector
+ fgs = FeatureGradientSelector(n_features=10)
+ # fit data
+ fgs.fit(X_train, y_train)
+ # get important features
+ # this will return the indices of the important features
+ print(fgs.get_selected_features())
+
+ ...
+
+You can also refer to the examples in ``/examples/feature_engineering/gradient_feature_selector/``.
+
+**Parameters of class FeatureGradientSelector constructor**
+
+
+*
+ **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
+
+*
+ **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.
+
+*
+ **n_features** (int, optional, default = None) - If None, will automatically choose number of features based on search. Otherwise, the number of top features to select.
+
+*
+ **max_features** (int, optional, default = None) - If not None, will use the 'elbow method' to determine the number of features with max_features as the upper limit.
+
+*
+ **learning_rate** (float, optional, default = 1e-1) - The learning rate.
+
+*
+ **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*\ ) - How to initialize the vector of scores. 'zero' is the default.
+
+*
+ **n_epochs** (int, optional, default = 1) - number of epochs to run
+
+*
+ **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
+
+*
+ **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
+
+*
+ **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
+
+*
+ **classification** (bool, optional, default = True) - If True, problem is classification, else regression.
+
+*
+ **ordinal** (bool, optional, default = True) - If True, problem is ordinal classification. Requires classification to be True.
+
+*
+ **balanced** (bool, optional, default = True) - If true, each class is weighted equally in optimization, otherwise weighted is done via support of each class. Requires classification to be True.
+
+*
+ **preprocess** (str, optional, default = 'zscore') - 'zscore' refers to centering the data and normalizing it to unit variance, while 'center' only centers the data to zero mean.
+
+*
+ **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.
+
+*
+ **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, or to 1 or higher to print every ``verbose`` number of gradient steps.
+
+*
+ **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.
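+
+An illustrative constructor call combining several of the options above; the values are arbitrary and only show how the parameters fit together:
+
+.. code-block:: python
+
+ from nni.feature_engineering.gradient_selector import FeatureGradientSelector
+
+ # arbitrary example settings, not recommended defaults
+ fgs = FeatureGradientSelector(order=2,
+                               n_features=20,
+                               batch_size=1000,
+                               classification=True,
+                               device='cpu')
+ fgs.fit(X_train, y_train)
+ print(fgs.get_selected_features())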
+
+**Requirement of ``fit`` FuncArgs**
+
+
+*
+ **X** (array-like, required) - The training input samples, whose shape is [n_samples, n_features].
+
+*
+ **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), whose shape is [n_samples].
+
+*
+ **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit. For example, [0, 0, 1, 2] specifies that the first two columns are part of the same group. Its shape is [n_features].
+
+**Requirement of ``get_selected_features`` FuncArgs**
+
+ For now, the ``get_selected_features`` function has no parameters.
diff --git a/docs/en_US/FeatureEngineering/Overview.rst b/docs/en_US/FeatureEngineering/Overview.rst
new file mode 100644
index 0000000000..c6fedfeeaa
--- /dev/null
+++ b/docs/en_US/FeatureEngineering/Overview.rst
@@ -0,0 +1,320 @@
+Feature Engineering with NNI
+============================
+
+We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in an experimental phase and might evolve based on user feedback. We'd like to invite you to use it, give feedback, and even contribute.
+
+For now, we support the following feature selectors:
+
+
+* `GradientFeatureSelector <./GradientFeatureSelector.rst>`__
+* `GBDTSelector <./GBDTSelector.rst>`__
+
+These selectors are suitable for tabular data (i.e., they do not cover image, speech, or text data).
+
+In addition, these selectors only perform feature selection. If you want to:
+1) generate high-order combined features on NNI while doing feature selection;
+2) leverage your distributed resources;
+you could try this :githublink:`example `.
+
+How to use?
+-----------
+
+.. code-block:: python
+
+ from nni.feature_engineering.gradient_selector import FeatureGradientSelector
+ # from nni.feature_engineering.gbdt_selector import GBDTSelector
+
+ # load data
+ ...
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+
+ # initialize a selector
+ fgs = FeatureGradientSelector(...)
+ # fit data
+ fgs.fit(X_train, y_train)
+ # get important features
+ # this will return the indices of the important features
+ print(fgs.get_selected_features(...))
+
+ ...
+
+When using a built-in selector, you first need to ``import`` the feature selector and ``initialize`` it. You can call the ``fit`` function of the selector to pass in the data. After that, you can use ``get_selected_features`` to get the important features. The function parameters may differ between selectors, so please check the docs before using them.
+
+How to customize?
+-----------------
+
+NNI provides *state-of-the-art* feature selection algorithms as built-in selectors. NNI also supports building a feature selector by yourself.
+
+If you want to implement a customized feature selector, you need to:
+
+
+#. Inherit the base FeatureSelector class
+#. Implement the ``fit`` and ``get_selected_features`` functions
+#. Integrate with sklearn (Optional)
+
+Here is an example:
+
+**1. Inherit the base FeatureSelector Class**
+
+.. code-block:: python
+
+ from nni.feature_engineering.feature_selector import FeatureSelector
+
+ class CustomizedSelector(FeatureSelector):
+ def __init__(self, ...):
+ ...
+
+**2. Implement the fit and get_selected_features Functions**
+
+.. code-block:: python
+
+ from nni.tuner import Tuner
+
+ from nni.feature_engineering.feature_selector import FeatureSelector
+
+ class CustomizedSelector(FeatureSelector):
+ def __init__(self, ...):
+ ...
+
+ def fit(self, X, y, **kwargs):
+ """
+ Fit the training data to FeatureSelector
+
+ Parameters
+ ------------
+ X : array-like numpy matrix
+ The training input samples, which shape is [n_samples, n_features].
+ y: array-like numpy matrix
+ The target values (class labels in classification, real numbers in regression). Which shape is [n_samples].
+ """
+ self.X = X
+ self.y = y
+ ...
+
+ def get_selected_features(self):
+ """
+ Get important feature
+
+ Returns
+ -------
+ list :
+ Return the index of the important feature.
+ """
+ ...
+ return self.selected_features_
+
+ ...
+
+**3. Integrate with Sklearn**
+
+``sklearn.pipeline.Pipeline`` can connect models in series, such as feature selector, normalization, and classification/regression to form a typical machine learning problem workflow.
+The following steps could help us integrate better with sklearn, which means we could treat the customized feature selector as a module of the pipeline.
+
+
+#. Inherit the class ``sklearn.base.BaseEstimator``
+#. Implement the ``get_params`` and ``set_params`` functions of ``BaseEstimator``
+#. Inherit the class ``sklearn.feature_selection.base.SelectorMixin``
+#. Implement the ``get_support``\ , ``transform`` and ``inverse_transform`` functions of ``SelectorMixin``
+
+Here is an example:
+
+**1. Inherit the BaseEstimator Class and its Function**
+
+.. code-block:: python
+
+ from sklearn.base import BaseEstimator
+ from nni.feature_engineering.feature_selector import FeatureSelector
+
+ class CustomizedSelector(FeatureSelector, BaseEstimator):
+ def __init__(self, ...):
+ ...
+
+ def get_params(self, ...):
+ """
+ Get parameters for this estimator.
+ """
+ params = self.__dict__
+ params = {key: val for (key, val) in params.items()
+ if not key.endswith('_')}
+ return params
+
+ def set_params(self, **params):
+ """
+ Set the parameters of this estimator.
+ """
+ for param in params:
+ if hasattr(self, param):
+ setattr(self, param, params[param])
+ return self
+
+**2. Inherit the SelectorMixin Class and its Function**
+
+.. code-block:: python
+
+ from sklearn.base import BaseEstimator
+ from sklearn.feature_selection.base import SelectorMixin
+
+ from nni.feature_engineering.feature_selector import FeatureSelector
+
+ class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
+ def __init__(self, ...):
+ ...
+
+ def get_params(self, ...):
+ """
+ Get parameters for this estimator.
+ """
+ params = self.__dict__
+ params = {key: val for (key, val) in params.items()
+ if not key.endswith('_')}
+ return params
+
+ def set_params(self, **params):
+ """
+ Set the parameters of this estimator.
+ """
+ for param in params:
+ if hasattr(self, param):
+ setattr(self, param, params[param])
+ return self
+
+ def get_support(self, indices=False):
+ """
+ Get a mask, or integer index, of the features selected.
+
+ Parameters
+ ----------
+ indices : bool
+ Default False. If True, the return value will be an array of integers, rather than a boolean mask.
+
+ Returns
+ -------
+ list :
+ returns support: An index that selects the retained features from a feature vector.
+ If indices are False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
+ If indices are True, this is an integer array of shape [# output features] whose values
+ are indices into the input feature vector.
+ """
+ ...
+ return mask
+
+
+ def transform(self, X):
+ """Reduce X to the selected features.
+
+ Parameters
+ ----------
+ X : array
+ which shape is [n_samples, n_features]
+
+ Returns
+ -------
+ X_r : array
+ which shape is [n_samples, n_selected_features]
+ The input samples with only the selected features.
+ """
+ ...
+ return X_r
+
+
+ def inverse_transform(self, X):
+ """
+ Reverse the transformation operation
+
+ Parameters
+ ----------
+ X : array
+ shape is [n_samples, n_selected_features]
+
+ Returns
+ -------
+ X_r : array
+ shape is [n_samples, n_original_features]
+ """
+ ...
+ return X_r
+
+After integrating with Sklearn, we could use the feature selector as follows:
+
+.. code-block:: python
+
+ from sklearn.linear_model import LogisticRegression
+
+ # load data
+ ...
+ X_train, y_train = ...
+
+ # build a pipeline
+ pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
+ pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
+ pipeline.fit(X_train, y_train)
+
+ # score
+ print("Pipeline Score: ", pipeline.score(X_train, y_train))
+
+Benchmark
+---------
+
+``Baseline`` means no feature selection: we directly pass the data to LogisticRegression. For this benchmark, we only use 10% of the training data as test data. For the GradientFeatureSelector, we only take the top 20 features. The metric is the mean accuracy on the given test data and labels.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Dataset
+ - All Features + LR (acc, time, memory)
+ - GradientFeatureSelector + LR (acc, time, memory)
+ - TreeBasedClassifier + LR (acc, time, memory)
+ - #Train
+ - #Feature
+ * - colon-cancer
+ - 0.7547, 890ms, 348MiB
+ - 0.7368, 363ms, 286MiB
+ - 0.7223, 171ms, 1171 MiB
+ - 62
+ - 2,000
+ * - gisette
+ - 0.9725, 215ms, 584MiB
+ - 0.89416, 446ms, 397MiB
+ - 0.9792, 911ms, 234MiB
+ - 6,000
+ - 5,000
+ * - avazu
+ - 0.8834, N/A, N/A
+ - N/A, N/A, N/A
+ - N/A, N/A, N/A
+ - 40,428,967
+ - 1,000,000
+ * - rcv1
+ - 0.9644, 557ms, 241MiB
+ - 0.7333, 401ms, 281MiB
+ - 0.9615, 752ms, 284MiB
+ - 20,242
+ - 47,236
+ * - news20.binary
+ - 0.9208, 707ms, 361MiB
+ - 0.6870, 565ms, 371MiB
+ - 0.9070, 904ms, 364MiB
+ - 19,996
+ - 1,355,191
+ * - real-sim
+ - 0.9681, 433ms, 274MiB
+ - 0.7969, 251ms, 274MiB
+ - 0.9591, 643ms, 367MiB
+ - 72,309
+ - 20,958
+
+
+The benchmark datasets can be downloaded from `here `__.
+
+The benchmark code can be found in ``/examples/feature_engineering/gradient_feature_selector/benchmark_test.py``.
+
+Reference and Feedback
+----------------------
+
+
+* To `report a bug `__ for this feature in GitHub;
+* To `file a feature or improvement request `__ for this feature in GitHub;
+* To know more about :githublink:`Neural Architecture Search with NNI `\ ;
+* To know more about :githublink:`Model Compression with NNI `\ ;
+* To know more about :githublink:`Hyperparameter Tuning with NNI `\ ;
diff --git a/docs/en_US/NAS/Advanced.rst b/docs/en_US/NAS/Advanced.rst
new file mode 100644
index 0000000000..7930245d23
--- /dev/null
+++ b/docs/en_US/NAS/Advanced.rst
@@ -0,0 +1,136 @@
+Customize a NAS Algorithm
+=========================
+
+Extend the Ability of One-Shot Trainers
+---------------------------------------
+
+Users might want to do multiple things if they are using the trainers on real tasks, for example, distributed training, half-precision training, logging periodically, writing tensorboard, dumping checkpoints and so on. As mentioned previously, some trainers do have support for some of the items listed above; others might not. Generally, there are two recommended ways to add anything you want to an existing trainer: inherit an existing trainer and override, or copy an existing trainer and modify.
+
+Either way, you are walking into the scope of implementing a new trainer. Basically, implementing a one-shot trainer is no different from implementing any traditional deep learning trainer, except that a new concept, the mutator, will reveal itself. The implementation will therefore differ in at least two places:
+
+
+* Initialization
+
+.. code-block:: python
+
+ model = Model()
+ mutator = MyMutator(model)
+
+
+* Training
+
+.. code-block:: python
+
+ for _ in range(epochs):
+ for x, y in data_loader:
+ mutator.reset() # reset all the choices in model
+ out = model(x) # like traditional model
+ loss = criterion(out, y)
+ loss.backward()
+ # no difference below
+
+To demonstrate what mutators are for, we need to know how one-shot NAS normally works. Usually, one-shot NAS "co-optimizes model weights and architecture weights". It repeatedly samples an architecture or a combination of several architectures from the supernet, trains the chosen architectures like a traditional deep learning model, updates the trained parameters to the supernet, and uses the metrics or loss as a signal to guide the architecture sampler. The mutator is the architecture sampler here, often defined to be another deep-learning model. Therefore, you can treat it like any model, by defining parameters in it and optimizing it with optimizers. One mutator is initialized with exactly one model. Once a mutator is bound to a model, it cannot be rebound to another model.
+
+``mutator.reset()`` is the core step. That's where all the choices in the model are finalized. The reset result remains effective until the next reset flushes it. After the reset, the model can be treated as a traditional model for forward and backward passes.
+
+Finally, mutators provide a method called ``mutator.export()`` that exports a dict describing the architecture of the model. Note that currently this dict is a mapping from keys of mutables to tensors of selections. So, in order to dump it to JSON, users need to convert the tensors explicitly into Python lists.
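+
+For example, here is a sketch of dumping an exported architecture to JSON, assuming the exported values are PyTorch tensors as described above:
+
+.. code-block:: python
+
+ import json
+
+ exported = mutator.export()  # mapping from mutable keys to tensors
+ # convert each tensor into a plain Python list so that it is JSON-serializable
+ serializable = {key: value.detach().cpu().tolist() for key, value in exported.items()}
+ with open('architecture.json', 'w') as f:
+     json.dump(serializable, f)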
+
+Meanwhile, NNI provides some useful tools so that users can implement trainers more easily. See `Trainers <./NasReference.rst>`__ for details.
+
+Implement New Mutators
+----------------------
+
+To start with, here is the pseudo-code that demonstrates what happens on ``mutator.reset()`` and ``mutator.export()``.
+
+.. code-block:: python
+
+ def reset(self):
+ self.apply_on_model(self.sample_search())
+
+.. code-block:: python
+
+ def export(self):
+ return self.sample_final()
+
+On reset, a new architecture is sampled with ``sample_search()`` and applied to the model. Then the model is trained for one or more steps in the search phase. On export, a new architecture is sampled with ``sample_final()`` and **nothing is done to the model**. This is used either for checkpointing or for exporting the final architecture.
+
+The requirements on the return values of ``sample_search()`` and ``sample_final()`` are the same: a mapping from mutable keys to tensors. Each tensor can be either a BoolTensor (true for selected, false for not selected) or a FloatTensor which applies a weight to each candidate. The selected branches will then be computed (in ``LayerChoice``\ , modules will be called; in ``InputChoice``\ , it's just the tensors themselves) and reduced with the reduction operation specified in the choices. For most algorithms, which only need to worry about the former part, here is an example of a mutator implementation.
+
+.. code-block:: python
+
+ class RandomMutator(Mutator):
+ def __init__(self, model):
+ super().__init__(model) # don't forget to call super
+ # do something else
+
+ def sample_search(self):
+ result = dict()
+ for mutable in self.mutables: # this is all the mutable modules in user model
+ # mutables sharing the same key will be de-duplicated
+ if isinstance(mutable, LayerChoice):
+ # decided that this mutable should choose `gen_index`
+ gen_index = np.random.randint(mutable.length)
+ result[mutable.key] = torch.tensor([i == gen_index for i in range(mutable.length)],
+ dtype=torch.bool)
+ elif isinstance(mutable, InputChoice):
+ if mutable.n_chosen is None: # n_chosen is None, then choose any number
+ result[mutable.key] = torch.randint(high=2, size=(mutable.n_candidates,)).view(-1).bool()
+ # else do something else
+ return result
+
+ def sample_final(self):
+ return self.sample_search() # use the same logic here. you can do something different
+
+The complete example of random mutator can be found :githublink:`here `.
+
+For advanced usages, e.g., users want to manipulate the way modules in ``LayerChoice`` are executed, they can inherit ``BaseMutator``\ , and overwrite ``on_forward_layer_choice`` and ``on_forward_input_choice``\ , which are the callback implementation of ``LayerChoice`` and ``InputChoice`` respectively. Users can still use property ``mutables`` to get all ``LayerChoice`` and ``InputChoice`` in the model code. For details, please refer to :githublink:`reference ` here to learn more.
+
+.. tip::
+ A useful application of the random mutator is debugging. Running
+
+ .. code-block:: python
+
+ mutator = RandomMutator(model)
+ mutator.reset()
+
+ will immediately set one possible candidate in the search space as the active one.
+
+Implement a Distributed NAS Tuner
+-----------------------------------
+
+Before learning how to write a distributed NAS tuner, users should first learn how to write a general tuner. Read `Customize Tuner <../Tuner/CustomizeTuner.rst>`__ for a tutorial.
+
+When users call "\ `nnictl ss_gen <../Tutorial/Nnictl.rst>`__\ " to generate search space file, a search space file like this will be generated:
+
+.. code-block:: json
+
+ {
+ "key_name": {
+ "_type": "layer_choice",
+ "_value": ["op1_repr", "op2_repr", "op3_repr"]
+ },
+ "key_name": {
+ "_type": "input_choice",
+ "_value": {
+ "candidates": ["in1_key", "in2_key", "in3_key"],
+ "n_chosen": 1
+ }
+ }
+ }
+
+This is the exact search space tuners will receive in ``update_search_space``. It's then tuners' responsibility to interpret the search space and generate new candidates in ``generate_parameters``. A valid "parameters" will be in the following format:
+
+.. code-block:: json
+
+ {
+ "key_name": {
+ "_value": "op1_repr",
+ "_idx": 0
+ },
+ "key_name": {
+ "_value": ["in2_key"],
+ "_idex": [1]
+ }
+ }
+
+Send it through ``generate_parameters``\ , and the tuner would look like any HPO tuner. Refer to `SPOS <./SPOS.rst>`__ example code for an example.
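+
+As a sketch (not the SPOS implementation), a random-search tuner that consumes the generated search space and emits parameters in the format above could look like this:
+
+.. code-block:: python
+
+ import random
+ from nni.tuner import Tuner
+
+ class RandomNasTuner(Tuner):
+     def update_search_space(self, search_space):
+         # receive the search space generated by `nnictl ss_gen`
+         self.search_space = search_space
+
+     def generate_parameters(self, parameter_id, **kwargs):
+         chosen = {}
+         for key, spec in self.search_space.items():
+             if spec['_type'] == 'layer_choice':
+                 idx = random.randrange(len(spec['_value']))
+                 chosen[key] = {'_value': spec['_value'][idx], '_idx': idx}
+             elif spec['_type'] == 'input_choice':
+                 candidates = spec['_value']['candidates']
+                 idxs = sorted(random.sample(range(len(candidates)), spec['_value']['n_chosen']))
+                 chosen[key] = {'_value': [candidates[i] for i in idxs], '_idx': idxs}
+         return chosen
+
+     def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
+         pass  # a purely random tuner ignores trial results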
diff --git a/docs/en_US/NAS/Benchmarks.rst b/docs/en_US/NAS/Benchmarks.rst
new file mode 100644
index 0000000000..a81e1785b5
--- /dev/null
+++ b/docs/en_US/NAS/Benchmarks.rst
@@ -0,0 +1,168 @@
+NAS Benchmarks
+==============
+
+.. toctree::
+ :hidden:
+
+ Example Usages
+
+Introduction
+------------
+
+To improve the reproducibility of NAS algorithms and reduce computing resource requirements, researchers have proposed a series of NAS benchmarks such as `NAS-Bench-101 `__\ , `NAS-Bench-201 `__\ , `NDS `__\ , etc. NNI provides a query interface for users to acquire these benchmarks. Within just a few lines of code, researchers are able to evaluate their NAS algorithms easily and fairly by utilizing these benchmarks.
+
+Prerequisites
+-------------
+
+
+* Please prepare a folder to hold all the benchmark databases. By default, it is ``${HOME}/.nni/nasbenchmark``. You can place it anywhere you like and specify it in ``NASBENCHMARK_DIR`` via ``export NASBENCHMARK_DIR=/path/to/your/nasbenchmark`` before importing NNI.
+* Please install ``peewee`` via ``pip3 install peewee``\ , which NNI uses to connect to the database.
+
+Data Preparation
+----------------
+
+To avoid storage and legality issues, we do not provide any prepared databases. Please follow the following steps.
+
+
+#.
+ Clone NNI to your machine and enter ``examples/nas/benchmarks`` directory.
+
+ .. code-block:: bash
+
+ git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
+ cd nni/examples/nas/benchmarks
+
+ Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v1.9``.
+
+#.
+ Install dependencies via ``pip3 install -r xxx.requirements.txt``. ``xxx`` can be ``nasbench101``\ , ``nasbench201`` or ``nds``.
+
+#. Generate the database via ``./xxx.sh``. The directory that stores the benchmark files can be configured with the ``NASBENCHMARK_DIR`` environment variable, which defaults to ``~/.nni/nasbenchmark``. Note that the NAS-Bench-201 dataset will be downloaded from Google Drive.
+
+Please make sure there is at least 10 GB of free disk space, and note that the conversion process can take several hours to complete.
+
+Example Usages
+--------------
+
+Please refer to the `example usages of the benchmarks API <./BenchmarksExample>`__.
+
+NAS-Bench-101
+-------------
+
+`Paper link `__ `Open-source `__
+
+NAS-Bench-101 contains 423,624 unique neural networks, combined with 4 variations in number of epochs (4, 12, 36, 108), each of which is trained 3 times. It is a cell-wise search space, which constructs and stacks a cell by enumerating DAGs with at most 7 operators, and no more than 9 connections. All operators can be chosen from ``CONV3X3_BN_RELU``\ , ``CONV1X1_BN_RELU`` and ``MAXPOOL3X3``\ , except the first operator (always ``INPUT``\ ) and last operator (always ``OUTPUT``\ ).
+
+Notably, NAS-Bench-101 eliminates invalid cells (e.g., there is no path from input to output, or there is redundant computation). Furthermore, isomorphic cells are de-duplicated, i.e., all the remaining cells are computationally unique.
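+
+A minimal query sketch follows; the exact signature is documented below, and passing ``None`` for the architecture is assumed to act as a wildcard:
+
+.. code-block:: python
+
+ from nni.nas.benchmarks.nasbench101 import query_nb101_trial_stats
+
+ # iterate over the recorded trials that were trained for 108 epochs
+ for trial in query_nb101_trial_stats(None, 108):
+     print(trial)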
+
+API Documentation
+^^^^^^^^^^^^^^^^^
+
+.. autofunction:: nni.nas.benchmarks.nasbench101.query_nb101_trial_stats
+
+.. autoattribute:: nni.nas.benchmarks.nasbench101.INPUT
+
+.. autoattribute:: nni.nas.benchmarks.nasbench101.OUTPUT
+
+.. autoattribute:: nni.nas.benchmarks.nasbench101.CONV3X3_BN_RELU
+
+.. autoattribute:: nni.nas.benchmarks.nasbench101.CONV1X1_BN_RELU
+
+.. autoattribute:: nni.nas.benchmarks.nasbench101.MAXPOOL3X3
+
+.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101TrialConfig
+
+.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101TrialStats
+
+.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101IntermediateStats
+
+.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.nasbench_format_to_architecture_repr
+
+.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.infer_num_vertices
+
+.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.hash_module
+
+NAS-Bench-201
+-------------
+
+`Paper link `__ `Open-source API `__ \ `Implementations `__
+
+NAS-Bench-201 is a cell-wise search space that views nodes as tensors and edges as operators. The search space contains all possible densely-connected DAGs with 4 nodes, resulting in 15,625 candidates in total. Each operator (i.e., edge) is selected from a pre-defined operator set (\ ``NONE``\ , ``SKIP_CONNECT``\ , ``CONV_1X1``\ , ``CONV_3X3`` and ``AVG_POOL_3X3``\ ). Training approaches vary in the dataset used (CIFAR-10, CIFAR-100, ImageNet) and the number of epochs scheduled (12 and 200). Each combination of architecture and training approach is repeated 1 - 3 times with different random seeds.
+
+API Documentation
+^^^^^^^^^^^^^^^^^
+
+.. autofunction:: nni.nas.benchmarks.nasbench201.query_nb201_trial_stats
+
+.. autoattribute:: nni.nas.benchmarks.nasbench201.NONE
+
+.. autoattribute:: nni.nas.benchmarks.nasbench201.SKIP_CONNECT
+
+.. autoattribute:: nni.nas.benchmarks.nasbench201.CONV_1X1
+
+.. autoattribute:: nni.nas.benchmarks.nasbench201.CONV_3X3
+
+.. autoattribute:: nni.nas.benchmarks.nasbench201.AVG_POOL_3X3
+
+.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201TrialConfig
+
+.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201TrialStats
+
+.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201IntermediateStats
+
+NDS
+---
+
+`Paper link `__ `Open-source `__
+
+*On Network Design Spaces for Visual Recognition* released trial statistics of over 100,000 configurations (models + hyper-parameters) sampled from multiple model families, including vanilla (feedforward networks loosely inspired by VGG), ResNet and ResNeXt (residual basic block and residual bottleneck block) and NAS cells (following popular designs from NASNet, Amoeba, PNAS, ENAS and DARTS). Most configurations are trained only once with a fixed seed, except a few that are trained twice or three times.
+
+Instead of storing results obtained with different configurations in separate files, we dump them into one single database to enable comparison in multiple dimensions. Specifically, we use ``model_family`` to distinguish model types, ``model_spec`` for all hyper-parameters needed to build this model, ``cell_spec`` for detailed information on operators and connections if it is a NAS cell, ``generator`` to denote the sampling policy through which this configuration is generated. Refer to API documentation for details.
+
+Available Operators
+-------------------
+
+Here is a list of available operators used in NDS.
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.NONE
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SKIP_CONNECT
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.AVG_POOL_3X3
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_3X3
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_5X5
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_7X7
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_1X1
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_3X3
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_3X1_1X3
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_7X1_1X7
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_CONV_3X3
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_CONV_5X5
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_3X3
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_5X5
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_7X7
+
+.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_SEP_CONV_3X3
+
+API Documentation
+^^^^^^^^^^^^^^^^^
+
+.. autofunction:: nni.nas.benchmarks.nds.query_nds_trial_stats
+
+.. autoclass:: nni.nas.benchmarks.nds.NdsTrialConfig
+
+.. autoclass:: nni.nas.benchmarks.nds.NdsTrialStats
+
+.. autoclass:: nni.nas.benchmarks.nds.NdsIntermediateStats
diff --git a/docs/en_US/NAS/CDARTS.rst b/docs/en_US/NAS/CDARTS.rst
new file mode 100644
index 0000000000..90d7804383
--- /dev/null
+++ b/docs/en_US/NAS/CDARTS.rst
@@ -0,0 +1,72 @@
+CDARTS
+======
+
+Introduction
+------------
+
+`CDARTS `__ builds a cyclic feedback mechanism between the search and evaluation networks. First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network.
+
+In the implementation of ``CdartsTrainer``\ , two models and two mutators (one for each) are first instantiated. The first model is the so-called "search network", which is mutated with a ``RegularizedDartsMutator`` -- a mutator with subtle differences from ``DartsMutator``. The second model is the "evaluation network", which is mutated with a discrete mutator that leverages the previous search network mutator to sample a single path each time. The trainer trains the models and mutators alternately. Users can refer to the `paper `__ if they are interested in more details on these trainers and mutators.
+
+Reproduction Results
+--------------------
+
+This is CDARTS based on the NNI platform, which currently supports CIFAR10 search and retrain. ImageNet search and retrain should also be supported, and we provide corresponding interfaces. Our reproduced results on NNI are slightly lower than those in the paper but much higher than those of the original DARTS. Here we show the results of three independent experiments on CIFAR10.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Runs
+ - Paper
+ - NNI
+ * - 1
+ - 97.52
+ - 97.44
+ * - 2
+ - 97.53
+ - 97.48
+ * - 3
+ - 97.58
+ - 97.56
+
+
+Examples
+--------
+
+`Example code `__
+
+.. code-block:: bash
+
+ # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+ git clone https://github.com/Microsoft/nni.git
+
+ # install apex for distributed training.
+ git clone https://github.com/NVIDIA/apex
+ cd apex
+ python setup.py install --cpp_ext --cuda_ext
+
+ # search the best architecture
+ cd examples/nas/cdarts
+ bash run_search_cifar.sh
+
+ # train the best architecture.
+ bash run_retrain_cifar.sh
+
+Reference
+---------
+
+PyTorch
+^^^^^^^
+
+.. autoclass:: nni.algorithms.nas.pytorch.cdarts.CdartsTrainer
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.cdarts.RegularizedDartsMutator
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.cdarts.DartsDiscreteMutator
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.cdarts.RegularizedMutatorParallel
+ :members:
diff --git a/docs/en_US/NAS/ClassicNas.rst b/docs/en_US/NAS/ClassicNas.rst
new file mode 100644
index 0000000000..d8aa47a5c4
--- /dev/null
+++ b/docs/en_US/NAS/ClassicNas.rst
@@ -0,0 +1,59 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+Classic NAS Algorithms
+======================
+
+In classic NAS algorithms, each architecture is trained as a trial and the NAS algorithm acts as a tuner. Thus, this training mode naturally fits within the NNI hyper-parameter tuning framework, where the tuner generates a new architecture for each trial and the trials run in the training service.
+
+Quick Start
+-----------
+
+The following example shows how to use classic NAS algorithms. You can see it is quite similar to NNI hyper-parameter tuning.
+
+.. code-block:: python
+
+ model = Net()
+
+ # get the chosen architecture from tuner and apply it on model
+ get_and_apply_next_architecture(model)
+ train(model) # your code for training the model
+ acc = test(model) # test the trained model
+ nni.report_final_result(acc) # report the performance of the chosen architecture
+
+First, instantiate the model. The search space has been defined in this model through ``LayerChoice`` and ``InputChoice``. After that, users should invoke ``get_and_apply_next_architecture(model)`` to settle on a specific architecture. This function receives the architecture from the tuner (i.e., the classic NAS algorithm) and applies it to ``model``. At this point, ``model`` becomes a specific architecture rather than a search space. Then users are free to train this model just like a normal PyTorch model. After getting the accuracy of this model, users should invoke ``nni.report_final_result(acc)`` to report the result to the tuner.
+
+At this point, the trial code is ready. Then, we can prepare an NNI experiment, i.e., a search space file and an experiment config file. Different from NNI hyper-parameter tuning, the search space file is automatically generated from the trial code by running the following command (detailed usage of this command can be found `here <../Tutorial/Nnictl.rst>`__\ ):
+
+``nnictl ss_gen --trial_command="the command for running your trial code"``
+
+A file named ``nni_auto_gen_search_space.json`` is generated by this command. Then put the path of the generated search space in the field ``searchSpacePath`` of the experiment config file. The other fields of the config file can be filled in by referring to `this tutorial <../Tutorial/QuickStart.rst>`__.
+
+Currently, we only support :githublink:`PPO Tuner ` for classic NAS. More classic NAS algorithms will be supported soon.
+
+The complete examples can be found :githublink:`here ` for PyTorch and :githublink:`here ` for TensorFlow.
+
+Standalone mode for easy debugging
+----------------------------------
+
+We support a standalone mode for easy debugging, where you can directly run the trial command without launching an NNI experiment. This is for checking whether your trial code can correctly run. The first candidate(s) are chosen for ``LayerChoice`` and ``InputChoice`` in this standalone mode.
+
+
+Regularized Evolution Tuner
+---------------------------
+
+This is a tuner geared for NNI’s Neural Architecture Search (NAS) interface. It uses the `evolution algorithm `__.
+
+The tuner first randomly initializes ``population`` models and evaluates them. After that, each time it produces a new architecture, the tuner randomly samples ``sample`` architectures from the ``population``\ , then mutates the best model in the ``sample`` (the parent model) to produce a child model. The mutation includes the hidden state mutation and the op mutation. The hidden state mutation consists of replacing a hidden state with another hidden state from within the cell, subject to the constraint that no loops are formed. The op mutation replaces one op with another op from the op set. Note that keeping the child model the same as its parent is not allowed. After the child model is evaluated, it is added to the tail of the ``population`` and the front one is popped.
+
+Note that **trial concurrency should be less than the population of the model**\ , otherwise a NO_MORE_TRIAL exception will be raised.
+
+The whole procedure is summarized by the pseudocode below.
+
+
+.. image:: ../../img/EvoNasTuner.png
+ :target: ../../img/EvoNasTuner.png
+ :alt:
+
diff --git a/docs/en_US/NAS/Cream.rst b/docs/en_US/NAS/Cream.rst
new file mode 100644
index 0000000000..7ad06784b4
--- /dev/null
+++ b/docs/en_US/NAS/Cream.rst
@@ -0,0 +1,158 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search
+=======================================================================================
+
+ **`[Paper] `__ `[Models-Google Drive] `__\ `[Models-Baidu Disk (PWD: wqw6)] `__ `[BibTex] `__**
+
+In this work, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. The discovered architectures achieve superior performance compared to the recent `MobileNetV3 `__ and `EfficientNet `__ families under aligned settings.
+
+Reproduced Results
+------------------
+
+Top-1 accuracy on ImageNet. The top-1 accuracy of the Cream search algorithm surpasses MobileNetV3 and EfficientNet-B0/B1 on ImageNet.
+Training with 16 GPUs is slightly better than with 8 GPUs, as shown below.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Model (M Flops)
+ - 8Gpus
+ - 16Gpus
+ * - 14M
+ - 53.7
+ - 53.8
+ * - 43M
+ - 65.8
+ - 66.5
+ * - 114M
+ - 72.1
+ - 72.8
+ * - 287M
+ - 76.7
+ - 77.6
+ * - 481M
+ - 78.9
+ - 79.2
+ * - 604M
+ - 79.4
+ - 80.0
+
+
+
+
+
+
+Examples
+--------
+
+`Example code `__
+
+Please run the following scripts in the example folder.
+
+Data Preparation
+----------------
+
+You need to first download the `ImageNet-2012 `__ dataset to the folder ``./data/imagenet`` and move the validation set to the subfolder ``./data/imagenet/val``. To move the validation set, you could use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
+
+Put the ImageNet data in ``./data``. It should look like the following:
+
+.. code-block:: bash
+
+ ./data/imagenet/train
+ ./data/imagenet/val
+ ...
+
+Quick Start
+-----------
+
+I. Search
+^^^^^^^^^
+
+First, build environments for searching.
+
+.. code-block:: bash
+
+ pip install -r ./requirements
+
+ git clone https://github.com/NVIDIA/apex.git
+ cd apex
+ python setup.py install --cpp_ext --cuda_ext
+
+To search for an architecture, you need to configure the parameters ``FLOPS_MINIMUM`` and ``FLOPS_MAXIMUM`` to specify the desired range of model FLOPs, such as [0, 600] M FLOPs. You can specify the FLOPs interval by changing these two parameters in ``./configs/train.yaml``.
+
+.. code-block:: bash
+
+ FLOPS_MINIMUM: 0 # Minimum Flops of Architecture
+ FLOPS_MAXIMUM: 600 # Maximum Flops of Architecture
+
+For example, if you expect to search an architecture with model flops <= 200M, please set the ``FLOPS_MINIMUM`` and ``FLOPS_MAXIMUM`` to be ``0`` and ``200``.
+
+After you specify the flops of the architectures you would like to search, you can search an architecture now by running:
+
+.. code-block:: bash
+
+ python -m torch.distributed.launch --nproc_per_node=8 ./train.py --cfg ./configs/train.yaml
+
+The searched architectures need to be retrained to obtain the final model. The final model is saved in ``.pth.tar`` format. Retraining code will be released soon.
+
+II. Retrain
+^^^^^^^^^^^
+
+To train the searched architectures, you need to configure the parameter ``MODEL_SELECTION`` to specify the model FLOPs. To specify which model to train, you should add ``MODEL_SELECTION`` in ``./configs/retrain.yaml``. You can select one from [14, 43, 114, 287, 481, 604], which stand for different FLOPs (M).
+
+.. code-block:: bash
+
+ MODEL_SELECTION: 43 # Retrain 43m model
+ MODEL_SELECTION: 481 # Retrain 481m model
+ ......
+
+To train random architectures, you need to set ``MODEL_SELECTION`` to ``-1`` and configure the parameter ``INPUT_ARCH``\ :
+
+.. code-block:: bash
+
+ MODEL_SELECTION: -1 # Train random architectures
+ INPUT_ARCH: [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] # Random Architectures
+ ......
+
+After adding ``MODEL_SELECTION`` in ``./configs/retrain.yaml``\ , you need to use the following command to train the model.
+
+.. code-block:: bash
+
+ python -m torch.distributed.launch --nproc_per_node=8 ./retrain.py --cfg ./configs/retrain.yaml
+
+III. Test
+^^^^^^^^^
+
+To test the trained models, you need to use ``MODEL_SELECTION`` in ``./configs/test.yaml`` to specify which model to test.
+
+.. code-block:: bash
+
+ MODEL_SELECTION: 43 # test 43m model
+ MODEL_SELECTION: 481 # test 481m model
+ ......
+
+After specifying the flops of the model, you need to write the path to the resume model in ``./test.sh``.
+
+.. code-block:: bash
+
+ RESUME_PATH: './43.pth.tar'
+ RESUME_PATH: './481.pth.tar'
+ ......
+
+We provide 14M/43M/114M/287M/481M/604M pretrained models in `google drive `__ or `[Models-Baidu Disk (password: wqw6)] `__ .
+
+After downloading the pretrained models and adding ``MODEL_SELECTION`` and ``RESUME_PATH`` in './configs/test.yaml', you need to use the following command to test the model.
+
+.. code-block:: bash
+
+ python -m torch.distributed.launch --nproc_per_node=8 ./test.py --cfg ./configs/test.yaml
diff --git a/docs/en_US/NAS/DARTS.rst b/docs/en_US/NAS/DARTS.rst
new file mode 100644
index 0000000000..021c554a4a
--- /dev/null
+++ b/docs/en_US/NAS/DARTS.rst
@@ -0,0 +1,69 @@
+DARTS
+=====
+
+Introduction
+------------
+
+The paper `DARTS: Differentiable Architecture Search `__ addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Their method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
+
+The authors' code optimizes the network weights and architecture weights alternately in mini-batches. They further explore the possibility of using second-order optimization (unroll) instead of first-order to improve performance.
+
+The implementation on NNI is based on the `official implementation `__ and a `popular 3rd-party repo `__. DARTS on NNI is designed to be general for arbitrary search spaces. A CNN search space tailored for CIFAR10, the same as in the original paper, is implemented as a use case of DARTS.
+
+Reproduction Results
+--------------------
+
+The above-mentioned example is meant to reproduce the results in the paper; we run experiments with both first- and second-order optimization. Due to time limits, we retrain *only the best architecture* derived from the search phase and repeat the experiment *only once*. Our results are currently on par with the results reported in the paper. We will add more results later when they are ready.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * -
+ - In paper
+ - Reproduction
+ * - First order (CIFAR10)
+ - 3.00 +/- 0.14
+ - 2.78
+ * - Second order (CIFAR10)
+ - 2.76 +/- 0.09
+ - 2.80
+
+
+Examples
+--------
+
+CNN Search Space
+^^^^^^^^^^^^^^^^
+
+:githublink:`Example code `
+
+.. code-block:: bash
+
+ # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+ git clone https://github.com/Microsoft/nni.git
+
+ # search the best architecture
+ cd examples/nas/darts
+ python3 search.py
+
+ # train the best architecture
+ python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
+
+Reference
+---------
+
+PyTorch
+^^^^^^^
+
+.. autoclass:: nni.algorithms.nas.pytorch.darts.DartsTrainer
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.darts.DartsMutator
+ :members:
+
+Limitations
+-----------
+
+
+* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
diff --git a/docs/en_US/NAS/ENAS.rst b/docs/en_US/NAS/ENAS.rst
new file mode 100644
index 0000000000..4ee0d03573
--- /dev/null
+++ b/docs/en_US/NAS/ENAS.rst
@@ -0,0 +1,46 @@
+ENAS
+====
+
+Introduction
+------------
+
+The paper `Efficient Neural Architecture Search via Parameter Sharing `__ uses parameter sharing between child models to accelerate the NAS process. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss.
+
+The implementation on NNI is based on the `official implementation in Tensorflow `__\ , including a general-purpose reinforcement-learning controller and a trainer that trains the target network and this controller alternately. Following the paper, we have also implemented the macro and micro search spaces on CIFAR10 to demonstrate how to use these trainers. Since the code to train from scratch on NNI is not ready yet, reproduction results are currently unavailable.
+
+Examples
+--------
+
+CIFAR10 Macro/Micro Search Space
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:githublink:`Example code `
+
+.. code-block:: bash
+
+ # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+ git clone https://github.com/Microsoft/nni.git
+
+ # search the best architecture
+ cd examples/nas/enas
+
+ # search in macro search space
+ python3 search.py --search-for macro
+
+ # search in micro search space
+ python3 search.py --search-for micro
+
+ # view more options for search
+ python3 search.py -h
+
+Reference
+---------
+
+PyTorch
+^^^^^^^
+
+.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasTrainer
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasMutator
+ :members:
diff --git a/docs/en_US/NAS/NasGuide.rst b/docs/en_US/NAS/NasGuide.rst
new file mode 100644
index 0000000000..45475c686a
--- /dev/null
+++ b/docs/en_US/NAS/NasGuide.rst
@@ -0,0 +1,88 @@
+One-shot NAS algorithms
+=======================
+
+Besides `classic NAS algorithms <./ClassicNas.rst>`__\ , users also apply more advanced one-shot NAS algorithms to find better models from a search space. There are lots of related works about one-shot NAS algorithms, such as `SMASH `__\ , `ENAS `__\ , `DARTS `__\ , `FBNet `__\ , `ProxylessNAS `__\ , `SPOS `__\ , `Single-Path NAS `__\ , `Understanding One-shot `__ and `GDAS `__. One-shot NAS algorithms usually build a supernet containing every candidate in the search space as its subnetwork, and in each step, a subnetwork or combination of several subnetworks is trained.
+
+Currently, several one-shot NAS methods are supported on NNI. For example, ``DartsTrainer``\ , which uses SGD to train architecture weights and model weights iteratively, and ``ENASTrainer``\ , which `uses a controller to train the model `__. New and more efficient NAS trainers keep emerging in the research community and some will be implemented in future releases of NNI.
+
+Search with One-shot NAS Algorithms
+-----------------------------------
+
+Each one-shot NAS algorithm implements a trainer, for which users can find usage details in the description of each algorithm. Here is a simple example, demonstrating how users can use ``EnasTrainer``.
+
+.. code-block:: python
+
+ # this is exactly same as traditional model training
+ model = Net()
+ dataset_train = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
+ dataset_valid = CIFAR10(root="./data", train=False, download=True, transform=valid_transform)
+ criterion = nn.CrossEntropyLoss()
+ optimizer = torch.optim.SGD(model.parameters(), 0.05, momentum=0.9, weight_decay=1.0E-4)
+
+ # use NAS here
+ def top1_accuracy(output, target):
+ # this is the function that computes the reward, as required by ENAS algorithm
+ batch_size = target.size(0)
+ _, predicted = torch.max(output.data, 1)
+ return (predicted == target).sum().item() / batch_size
+
+ def metrics_fn(output, target):
+ # metrics function receives output and target and computes a dict of metrics
+ return {"acc1": top1_accuracy(output, target)}
+
+ from nni.algorithms.nas.pytorch import enas
+ trainer = enas.EnasTrainer(model,
+ loss=criterion,
+ metrics=metrics_fn,
+ reward_function=top1_accuracy,
+ optimizer=optimizer,
+                              batch_size=128,
+ num_epochs=10, # 10 epochs
+ dataset_train=dataset_train,
+ dataset_valid=dataset_valid,
+ log_frequency=10) # print log every 10 steps
+ trainer.train() # training
+ trainer.export(file="model_dir/final_architecture.json") # export the final architecture to file
+
+``model`` is the model with the `user-defined search space <./WriteSearchSpace.rst>`__. Then users should prepare the training data and the model evaluation metrics. To search from the defined search space, a one-shot algorithm is instantiated as a trainer (e.g., ``EnasTrainer``). The trainer exposes a few arguments that you can customize, for example, the loss function, the metrics function, the optimizer, and the datasets. These should satisfy most usage requirements, and we do our best to make sure our built-in trainers work on as many models, tasks, and datasets as possible.
+
+**Note that** when using one-shot NAS algorithms, there is no need to start an NNI experiment. Users can directly run this Python script (i.e., ``train.py``\ ) through ``python3 train.py`` without ``nnictl``. After training, users can export the best one of the found models through ``trainer.export()``.
+
+Each trainer in NNI has its targeted scenario and usage. Some trainers have the assumption that the task is a classification task; some trainers might have a different definition of "epoch" (e.g., an ENAS epoch = some child steps + some controller steps). Most trainers do not have support for distributed training: they won't wrap your model with ``DataParallel`` or ``DistributedDataParallel`` to do that. So after a few tryouts, if you want to actually use the trainers on your very customized applications, you might need to `customize your trainer <./Advanced.rst#extend-the-ability-of-one-shot-trainers>`__.
+
+Furthermore, one-shot NAS can be visualized with our NAS UI. `See more details. <./Visualization.rst>`__
+
+Retrain with Exported Architecture
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After the search phase, it's time to train the found architecture. Unlike many open-source NAS algorithms, which write a whole new model specifically for retraining, we found that the search model and the retraining model are usually very similar, so you can construct your final model with the exact same model code. For example:
+
+.. code-block:: python
+
+ model = Net()
+ apply_fixed_architecture(model, "model_dir/final_architecture.json")
+
+The JSON is simply a mapping from mutable keys to choices. A choice can be expressed as:
+
+
+* A string: select the candidate with the corresponding name.
+* A number: select the candidate with the corresponding index.
+* A list of strings: select the candidates with the corresponding names.
+* A list of numbers: select the candidates with the corresponding indices.
+* A list of boolean values: a multi-hot array.
+
+For example,
+
+.. code-block:: json
+
+ {
+ "LayerChoice1": "conv5x5",
+ "LayerChoice2": 6,
+ "InputChoice3": ["layer1", "layer3"],
+ "InputChoice4": [1, 2],
+ "InputChoice5": [false, true, false, false, true]
+ }
+
+After applying, the model is then fixed and ready for final training. The model works as a single model, and unused parameters and modules are pruned.
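+
+As a minimal sketch (assuming the same ``Net`` class used during the search and hypothetical ``train_loader`` and ``num_epochs`` variables), retraining is just an ordinary PyTorch training loop on the fixed model:
+
+.. code-block:: python
+
+   import torch
+   import torch.nn as nn
+   from nni.nas.pytorch.fixed import apply_fixed_architecture
+
+   model = Net()
+   apply_fixed_architecture(model, "model_dir/final_architecture.json")
+
+   # from here on, the model behaves like an ordinary PyTorch module
+   criterion = nn.CrossEntropyLoss()
+   optimizer = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4)
+   for epoch in range(num_epochs):
+       for inputs, targets in train_loader:
+           optimizer.zero_grad()
+           loss = criterion(model(inputs), targets)
+           loss.backward()
+           optimizer.step()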
+
+Also, refer to `DARTS <./DARTS.rst>`__ for code exemplifying retraining.
diff --git a/docs/en_US/NAS/NasReference.rst b/docs/en_US/NAS/NasReference.rst
new file mode 100644
index 0000000000..6df2be425b
--- /dev/null
+++ b/docs/en_US/NAS/NasReference.rst
@@ -0,0 +1,99 @@
+NAS Reference
+=============
+
+.. contents::
+
+Mutables
+--------
+
+.. autoclass:: nni.nas.pytorch.mutables.Mutable
+ :members:
+
+.. autoclass:: nni.nas.pytorch.mutables.LayerChoice
+ :members:
+
+.. autoclass:: nni.nas.pytorch.mutables.InputChoice
+ :members:
+
+.. autoclass:: nni.nas.pytorch.mutables.MutableScope
+ :members:
+
+Utilities
+^^^^^^^^^
+
+.. autofunction:: nni.nas.pytorch.utils.global_mutable_counting
+
+Mutators
+--------
+
+.. autoclass:: nni.nas.pytorch.base_mutator.BaseMutator
+ :members:
+
+.. autoclass:: nni.nas.pytorch.mutator.Mutator
+ :members:
+
+Random Mutator
+^^^^^^^^^^^^^^
+
+.. autoclass:: nni.algorithms.nas.pytorch.random.RandomMutator
+ :members:
+
+Utilities
+^^^^^^^^^
+
+.. autoclass:: nni.nas.pytorch.utils.StructuredMutableTreeNode
+ :members:
+
+Trainers
+--------
+
+Trainer
+^^^^^^^
+
+.. autoclass:: nni.nas.pytorch.base_trainer.BaseTrainer
+ :members:
+
+.. autoclass:: nni.nas.pytorch.trainer.Trainer
+ :members:
+
+Retrain
+^^^^^^^
+
+.. autofunction:: nni.nas.pytorch.fixed.apply_fixed_architecture
+
+.. autoclass:: nni.nas.pytorch.fixed.FixedArchitecture
+ :members:
+
+Distributed NAS
+^^^^^^^^^^^^^^^
+
+.. autofunction:: nni.algorithms.nas.pytorch.classic_nas.get_and_apply_next_architecture
+
+.. autoclass:: nni.algorithms.nas.pytorch.classic_nas.mutator.ClassicMutator
+ :members:
+
+Callbacks
+^^^^^^^^^
+
+.. autoclass:: nni.nas.pytorch.callbacks.Callback
+ :members:
+
+.. autoclass:: nni.nas.pytorch.callbacks.LRSchedulerCallback
+ :members:
+
+.. autoclass:: nni.nas.pytorch.callbacks.ArchitectureCheckpoint
+ :members:
+
+.. autoclass:: nni.nas.pytorch.callbacks.ModelCheckpoint
+ :members:
+
+Utilities
+^^^^^^^^^
+
+.. autoclass:: nni.nas.pytorch.utils.AverageMeterGroup
+ :members:
+
+.. autoclass:: nni.nas.pytorch.utils.AverageMeter
+ :members:
+
+.. autofunction:: nni.nas.pytorch.utils.to_device
diff --git a/docs/en_US/NAS/Overview.rst b/docs/en_US/NAS/Overview.rst
new file mode 100644
index 0000000000..3583816f5d
--- /dev/null
+++ b/docs/en_US/NAS/Overview.rst
@@ -0,0 +1,112 @@
+Neural Architecture Search (NAS) on NNI
+=======================================
+
+.. contents::
+
+Overview
+--------
+
+Automatic neural architecture search is taking an increasingly important role in finding better models. Recent research has proved the feasibility of automatic NAS and has led to models that beat many manually designed and tuned models. Some representative works are `NASNet `__\ , `ENAS `__\ , `DARTS `__\ , `Network Morphism `__\ , and `Evolution `__. Further, new innovations keep emerging.
+
+However, it takes a great effort to implement NAS algorithms, and it's hard to reuse the code base of existing algorithms for new ones. To facilitate NAS innovations (e.g., the design and implementation of new NAS models, the comparison of different NAS models side-by-side, etc.), an easy-to-use and flexible programming interface is crucial.
+
+With this motivation, our ambition is to provide a unified architecture in NNI, accelerate innovations on NAS, and apply state-of-the-art algorithms to real-world problems faster.
+
+With the unified interface, there are two different modes for architecture search. `One <#supported-one-shot-nas-algorithms>`__ is the so-called one-shot NAS, where a super-net is built based on a search space and one-shot training is used to generate a good-performing child model. `The other <#supported-classic-nas-algorithms>`__ is the traditional search-based approach, where each child model within the search space runs as an independent trial. We call it classic NAS.
+
+NNI also provides dedicated `visualization tool <#nas-visualization>`__ for users to check the status of the neural architecture search process.
+
+Supported Classic NAS Algorithms
+--------------------------------
+
+The procedure of classic NAS algorithms is similar to hyper-parameter tuning: users start experiments with ``nnictl`` and each model runs as a trial. The difference is that the search space file is automatically generated from the user model (with the search space embedded in it) by running ``nnictl ss_gen``. The following table lists the supported tuning algorithms for classic NAS mode. More algorithms will be supported in future releases.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Name
+ - Brief Introduction of Algorithm
+ * - :githublink:`Random Search `
+ - Randomly pick a model from search space
+ * - `PPO Tuner `__
+ - PPO Tuner is a Reinforcement Learning tuner based on PPO algorithm. `Reference Paper `__
+
+
+Please refer to `here `__ for the usage of classic NAS algorithms.
+
+Supported One-shot NAS Algorithms
+---------------------------------
+
+NNI currently supports the one-shot NAS algorithms listed below and is adding more. Users can reproduce an algorithm or use it on their own dataset. We also encourage users to implement other algorithms with `NNI API <#use-nni-api>`__\ , to benefit more people.
+
+.. list-table::
+ :header-rows: 1
+ :widths: auto
+
+ * - Name
+ - Brief Introduction of Algorithm
+ * - `ENAS `__
+ - `Efficient Neural Architecture Search via Parameter Sharing `__. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. It uses parameter sharing between child models to achieve fast speed and excellent performance.
+ * - `DARTS `__
+ - `DARTS: Differentiable Architecture Search `__ introduces a novel algorithm for differentiable network architecture search on bilevel optimization.
+ * - `P-DARTS `__
+ - `Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation `__ is based on DARTS. It introduces an efficient algorithm which allows the depth of searched architectures to grow gradually during the training procedure.
+ * - `SPOS `__
+ - `Single Path One-Shot Neural Architecture Search with Uniform Sampling `__ constructs a simplified supernet trained with a uniform path sampling method and applies an evolutionary algorithm to efficiently search for the best-performing architectures.
+ * - `CDARTS `__
+ - `Cyclic Differentiable Architecture Search `__ builds a cyclic feedback mechanism between the search and evaluation networks. It introduces a cyclic differentiable architecture search framework which integrates the two networks into a unified architecture.
+ * - `ProxylessNAS `__
+ - `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware `__. It removes proxy, directly learns the architectures for large-scale target tasks and target hardware platforms.
+ * - `TextNAS `__
+ - `TextNAS: A Neural Architecture Search Space tailored for Text Representation `__. It is a neural architecture search algorithm tailored for text representation.
+
+
+One-shot algorithms run **standalone without nnictl**. NNI supports both PyTorch and Tensorflow 2.X.
+
+Here are some common dependencies to run the examples. PyTorch needs to be above 1.2 to use ``BoolTensor``.
+
+
+* tensorboard
+* PyTorch 1.2+
+* git
+
+Please refer to `here `__ for the usage of one-shot NAS algorithms.
+
+One-shot NAS can be visualized with our visualization tool. Learn more details `here <./Visualization.rst>`__.
+
+Search Space Zoo
+----------------
+
+NNI provides some predefined search spaces which can be easily reused. By stacking the extracted cells, users can quickly reproduce those NAS models.
+
+Search Space Zoo contains the following NAS cells:
+
+
+* `DartsCell <./SearchSpaceZoo.rst#DartsCell>`__
+* `ENAS micro <./SearchSpaceZoo.rst#ENASMicroLayer>`__
+* `ENAS macro <./SearchSpaceZoo.rst#ENASMacroLayer>`__
+* `NAS Bench 201 <./SearchSpaceZoo.rst#nas-bench-201>`__
+
+Using NNI API to Write Your Search Space
+----------------------------------------
+
+A programming interface for designing and searching a model is often needed in two scenarios.
+
+
+#. When designing a neural network, there may be multiple operation choices on a layer, sub-model, or connection, and it's undetermined which one or which combination performs best. So, an easy way to express the candidate layers or sub-models is needed.
+#. When applying NAS on a neural network, a unified way to express the search space of architectures is needed, so that trial code doesn't need to be updated for different search algorithms.
+
+To use NNI NAS, we suggest that users first go through `the tutorial of the NAS API for building a search space <./WriteSearchSpace.rst>`__.
+
+NAS Visualization
+-----------------
+
+To help users track the process and status of how the model is searched under the specified search space, we developed a visualization tool. It visualizes the search space as a super-net and shows the importance of subnets and layers/operations, as well as how the importance changes along with the search process. Please refer to `the document of NAS visualization <./Visualization.rst>`__ for how to use it.
+
+Reference and Feedback
+----------------------
+
+
+* To `report a bug `__ for this feature in GitHub;
+* To `file a feature or improvement request `__ for this feature in GitHub.
diff --git a/docs/en_US/NAS/PDARTS.rst b/docs/en_US/NAS/PDARTS.rst
new file mode 100644
index 0000000000..ae4d5daa06
--- /dev/null
+++ b/docs/en_US/NAS/PDARTS.rst
@@ -0,0 +1,20 @@
+P-DARTS
+=======
+
+Examples
+--------
+
+:githublink:`Example code `
+
+.. code-block:: bash
+
+ # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+ git clone https://github.com/Microsoft/nni.git
+
+ # search the best architecture
+ cd examples/nas/pdarts
+ python3 search.py
+
+ # train the best architecture, it's the same progress as darts.
+ cd ../darts
+ python3 retrain.py --arc-checkpoint ../pdarts/checkpoints/epoch_2.json
diff --git a/docs/en_US/NAS/Proxylessnas.rst b/docs/en_US/NAS/Proxylessnas.rst
new file mode 100644
index 0000000000..56857fb2ab
--- /dev/null
+++ b/docs/en_US/NAS/Proxylessnas.rst
@@ -0,0 +1,74 @@
+ProxylessNAS on NNI
+===================
+
+Introduction
+------------
+
+The paper `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware `__ removes the proxy and directly learns the architectures for large-scale target tasks and target hardware platforms. It addresses the high memory consumption issue of differentiable NAS and reduces the computational cost to the same level as regular training while still allowing a large candidate set. Please refer to the paper for details.
+
+Usage
+-----
+
+To use the ProxylessNAS training/searching approach, users need to specify the search space in their model using the `NNI NAS interface `__\ , e.g., ``LayerChoice``\ , ``InputChoice``. After defining and instantiating the model, the remaining work can be left to ``ProxylessNasTrainer`` by instantiating the trainer and passing the model to it.
+
+.. code-block:: python
+
+ trainer = ProxylessNasTrainer(model,
+ model_optim=optimizer,
+ train_loader=data_provider.train,
+ valid_loader=data_provider.valid,
+ device=device,
+ warmup=True,
+ ckpt_path=args.checkpoint_path,
+ arch_path=args.arch_path)
+ trainer.train()
+ trainer.export(args.arch_path)
+
+The complete example code can be found :githublink:`here `.
+
+**Input arguments of ProxylessNasTrainer**
+
+
+* **model** (*PyTorch model, required*\ ) - The model that users want to tune/search. It has mutables to specify search space.
+* **model_optim** (*PyTorch optimizer, required*\ ) - The optimizer used to train the model.
+* **device** (*device, required*\ ) - The device(s) on which to train/search. The trainer applies data parallelism to the model for users.
+* **train_loader** (*PyTorch data loader, required*\ ) - The data loader for training set.
+* **valid_loader** (*PyTorch data loader, required*\ ) - The data loader for validation set.
+* **label_smoothing** (*float, optional, default = 0.1*\ ) - The degree of label smoothing.
+* **n_epochs** (*int, optional, default = 120*\ ) - The number of epochs to train/search.
+* **init_lr** (*float, optional, default = 0.025*\ ) - The initial learning rate for training the model.
+* **binary_mode** (*'two', 'full', or 'full_v2', optional, default = 'full_v2'*\ ) - The forward/backward mode for the binary weights in mutator. 'full' means forward all the candidate ops, 'two' means only forward two sampled ops, 'full_v2' means recomputing the inactive ops during backward.
+* **arch_init_type** (*'normal' or 'uniform', optional, default = 'normal'*\ ) - The way to init architecture parameters.
+* **arch_init_ratio** (*float, optional, default = 1e-3*\ ) - The ratio to init architecture parameters.
+* **arch_optim_lr** (*float, optional, default = 1e-3*\ ) - The learning rate of the architecture parameters optimizer.
+* **arch_weight_decay** (*float, optional, default = 0*\ ) - Weight decay of the architecture parameters optimizer.
+* **grad_update_arch_param_every** (*int, optional, default = 5*\ ) - Update architecture weights every this number of minibatches.
+* **grad_update_steps** (*int, optional, default = 1*\ ) - During each update of architecture weights, the number of steps to train architecture weights.
+* **warmup** (*bool, optional, default = True*\ ) - Whether to do warmup.
+* **warmup_epochs** (*int, optional, default = 25*\ ) - The number of epochs to do during warmup.
+* **arch_valid_frequency** (*int, optional, default = 1*\ ) - The frequency of printing validation result.
+* **load_ckpt** (*bool, optional, default = False*\ ) - Whether to load checkpoint.
+* **ckpt_path** (*str, optional, default = None*\ ) - checkpoint path, if load_ckpt is True, ckpt_path cannot be None.
+* **arch_path** (*str, optional, default = None*\ ) - The path to store chosen architecture.
+
+Implementation
+--------------
+
+The implementation on NNI is based on the `official implementation `__. The official implementation supports two training approaches, gradient descent and RL based, and different target hardware, including 'mobile', 'cpu', 'gpu8', and 'flops'. Our current implementation on NNI supports the gradient descent training approach, but does not yet support different hardware targets. Complete support is ongoing.
+
+Below we will describe implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. For users to flexibly define their own search space and use built-in ProxylessNAS training approach, we put the specified search space in :githublink:`example code ` using :githublink:`NNI NAS interface `.
+
+
+.. image:: ../../img/proxylessnas.png
+ :target: ../../img/proxylessnas.png
+ :alt:
+
+
+The ProxylessNAS training approach is composed of ProxylessNasMutator and ProxylessNasTrainer. ProxylessNasMutator instantiates a MixedOp for each mutable (i.e., LayerChoice) and manages the architecture weights in the MixedOp. **For DataParallel**\ , architecture weights should be included in the user model. Specifically, in the ProxylessNAS implementation, we add the MixedOp to the corresponding mutable (i.e., LayerChoice) as a member variable. The mutator also exposes two member functions, i.e., ``arch_requires_grad`` and ``arch_disable_grad``\ , for the trainer to control the training of architecture weights.
+
+ProxylessNasMutator also implements the forward logic of the mutables (i.e., LayerChoice).
+
+Reproduce Results
+-----------------
+
+To reproduce the result, we first run the search. We found that although it runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it. The test accuracy of the found architecture is top1: 72.31, top5: 90.26.
diff --git a/docs/en_US/NAS/SPOS.rst b/docs/en_US/NAS/SPOS.rst
new file mode 100644
index 0000000000..86bf901afd
--- /dev/null
+++ b/docs/en_US/NAS/SPOS.rst
@@ -0,0 +1,124 @@
+Single Path One-Shot (SPOS)
+===========================
+
+Introduction
+------------
+
+Proposed in `Single Path One-Shot Neural Architecture Search with Uniform Sampling `__\ , SPOS is a one-shot NAS method that addresses the difficulties of training one-shot NAS models by constructing a simplified supernet trained with a uniform path sampling method, so that all underlying architectures (and their weights) are trained fully and equally. An evolutionary algorithm is then applied to efficiently search for the best-performing architectures without any fine-tuning.
+
+The implementation on NNI is based on the `official repo `__. We implement a trainer that trains the supernet and an evolution tuner that leverages the NNI framework to speed up the evolutionary search phase.
+
+Examples
+--------
+
+Here is a use case, which is the search space in the paper, together with the way to use the FLOPs limit to perform uniform sampling.
+
+:githublink:`Example code `
+
+Requirements
+^^^^^^^^^^^^
+
+NVIDIA DALI >= 0.16 is needed as we use DALI to accelerate the data loading of ImageNet. `Installation guide `__
+
+Download the flops lookup table from `here `__ (maintained by `Megvii `__\ ).
+Put ``op_flops_dict.pkl`` and ``checkpoint-150000.pth.tar`` (if you don't want to retrain the supernet) under ``data`` directory.
+
+Prepare ImageNet in the standard format (follow the script `here `__\ ). Linking it to ``data/imagenet`` will be more convenient.
+
+After preparation, it's expected to have the following code structure:
+
+.. code-block:: bash
+
+ spos
+ ├── architecture_final.json
+ ├── blocks.py
+ ├── config_search.yml
+ ├── data
+ │ ├── imagenet
+ │ │ ├── train
+ │ │ └── val
+ │ └── op_flops_dict.pkl
+ ├── dataloader.py
+ ├── network.py
+ ├── readme.md
+ ├── scratch.py
+ ├── supernet.py
+ ├── tester.py
+ ├── tuner.py
+ └── utils.py
+
+Step 1. Train Supernet
+^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: bash
+
+ python supernet.py
+
+This will export the checkpoint to the ``checkpoints`` directory for the next step.
+
+NOTE: The data loading used in the official repo is `slightly different from usual `__\ , as they use BGR tensors and intentionally keep the values between 0 and 255 to align with their own DL framework. The option ``--spos-preprocessing`` will simulate the original behavior and enable you to use the pretrained checkpoints.
+
+Step 2. Evolution Search
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Single Path One-Shot leverages an evolutionary algorithm to search for the best architecture. The tester, which is responsible for testing the sampled architecture, recalculates all the batch norm statistics on a subset of training images and evaluates the architecture on the full validation set.
+
+In order to make the tuner aware of the FLOPs limit and able to calculate FLOPs, we created a new tuner called ``EvolutionWithFlops`` in ``tuner.py``\ , which inherits the tuner in the SDK.
+
+To have a search space ready for NNI framework, first run
+
+.. code-block:: bash
+
+ nnictl ss_gen -t "python tester.py"
+
+This will generate a file called ``nni_auto_gen_search_space.json``\ , which is a serialized representation of your search space.
+
+By default, it will use ``checkpoint-150000.pth.tar`` downloaded previously. In case you want to use the checkpoint trained by yourself from the last step, specify ``--checkpoint`` in the command in ``config_search.yml``.
+
+Then search with evolution tuner.
+
+.. code-block:: bash
+
+ nnictl create --config config_search.yml
+
+The final architecture exported from every epoch of evolution can be found in ``checkpoints`` under the working directory of your tuner, which, by default, is ``$HOME/nni-experiments/your_experiment_id/log``.
+
+Step 3. Train from Scratch
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: bash
+
+ python scratch.py
+
+By default, it will use ``architecture_final.json``. This architecture is provided by the official repo (converted into NNI format). You can use any architecture (e.g., the architecture found in step 2) with ``--fixed-arc`` option.
+
+Reference
+---------
+
+PyTorch
+^^^^^^^
+
+.. autoclass:: nni.algorithms.nas.pytorch.spos.SPOSEvolution
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.spos.SPOSSupernetTrainer
+ :members:
+
+.. autoclass:: nni.algorithms.nas.pytorch.spos.SPOSSupernetTrainingMutator
+ :members:
+
+Known Limitations
+-----------------
+
+
+* Block search only. Channel search is not supported yet.
+* Only GPU version is provided here.
+
+Current Reproduction Results
+----------------------------
+
+Reproduction is still ongoing. Due to the gap between the official release and the original paper, we compare our current results with the official repo (our run) and the paper.
+
+
+* The evolution phase is almost aligned with the official repo. Our evolution algorithm shows a converging trend and reaches ~65% accuracy at the end of the search. Nevertheless, this result is not on par with the paper. For details, please refer to `this issue `__.
+* The retrain phase is not aligned. Our retraining code, which uses the architecture released by the authors, reaches 72.14% accuracy, still leaving a gap to the 73.61% of the official release and the 74.3% reported in the original paper.
diff --git a/docs/en_US/NAS/SearchSpaceZoo.rst b/docs/en_US/NAS/SearchSpaceZoo.rst
new file mode 100644
index 0000000000..5f2fc87a15
--- /dev/null
+++ b/docs/en_US/NAS/SearchSpaceZoo.rst
@@ -0,0 +1,281 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+Search Space Zoo
+================
+
+DartsCell
+---------
+
+DartsCell is extracted from :githublink:`CNN model `. A DartsCell is a directed acyclic graph containing an ordered sequence of N nodes, where each node stands for a latent representation (e.g., a feature map in a convolutional network). Directed edges from Node 1 to Node 2 are associated with some operations that transform Node 1, and the result is stored on Node 2. The `candidate operators <#predefined-operations-darts>`__ between nodes are predefined and unchangeable. One edge represents an operation that is chosen from the predefined ones to be applied to the starting node of the edge. One cell contains two input nodes, a single output node, and ``n_node`` other nodes. The input nodes are defined as the cell outputs of the previous two layers. The output of the cell is obtained by applying a reduction operation (e.g., concatenation) to all the intermediate nodes. To make the search space continuous, the categorical choice of a particular operation is relaxed to a softmax over all possible operations. By adjusting the softmax weights on every node, the operation with the highest probability is chosen to be part of the final structure. A CNN model can be formed by stacking several cells together, which builds a search space. Note that, in the DARTS paper, all cells in the model share the same structure.
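+
+The softmax relaxation itself is easy to state in code. The toy module below is not NNI's ``DartsCell``\ ; the class name, candidate operations, and shapes are made up only to illustrate how one edge mixes its candidates with softmax-weighted architecture parameters:
+
+.. code-block:: python
+
+   import torch
+   import torch.nn as nn
+   import torch.nn.functional as F
+
+   class RelaxedEdge(nn.Module):
+       """Toy illustration of the continuous relaxation used on a DARTS edge."""
+       def __init__(self, ops):
+           super().__init__()
+           self.ops = nn.ModuleList(ops)                     # candidate operations
+           self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture weights
+
+       def forward(self, x):
+           weights = F.softmax(self.alpha, dim=0)
+           # mixed output: softmax-weighted sum over all candidate operations
+           return sum(w * op(x) for w, op in zip(weights, self.ops))
+
+   edge = RelaxedEdge([nn.Conv2d(16, 16, 3, padding=1),
+                       nn.MaxPool2d(3, stride=1, padding=1)])
+   y = edge(torch.randn(1, 16, 8, 8))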
+
+One structure in the DARTS search space is shown below. Note that NNI merges the last of the four intermediate nodes and the output node.
+
+
+.. image:: ../../img/NAS_Darts_cell.svg
+ :target: ../../img/NAS_Darts_cell.svg
+ :alt:
+
+
+The predefined operators are shown `here <#predefined-operations-darts>`__.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.DartsCell
+ :members:
+
+Example code
+^^^^^^^^^^^^
+
+:githublink:`example code `
+
+.. code-block:: bash
+
+ git clone https://github.com/Microsoft/nni.git
+ cd nni/examples/nas/search_space_zoo
+ # search the best structure
+ python3 darts_example.py
+
+
+Candidate operators
+^^^^^^^^^^^^^^^^^^^
+
+All supported operators for Darts are listed below.
+
+
+*
+ MaxPool / AvgPool
+
+
+ * MaxPool: Call ``torch.nn.MaxPool2d``. This operation applies a 2D max pooling over all input channels. Its parameters ``kernel_size=3`` and ``padding=1`` are fixed. The pooling result will pass through a BatchNorm2d then return as the result.
+ *
+ AvgPool: Call ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels. Its parameters ``kernel_size=3`` and ``padding=1`` are fixed. The pooling result will pass through a BatchNorm2d then return as the result.
+
+ MaxPool / AvgPool with ``kernel_size=3`` and ``padding=1`` followed by BatchNorm2d
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.darts_ops.PoolBN
+
+*
+ SkipConnect
+
+ There is no operation between two nodes. Call ``torch.nn.Identity`` to forward what it gets to the output.
+
+*
+ Zero operation
+
+ There is no connection between two nodes.
+
+*
+ DilConv3x3 / DilConv5x5
+
+  DilConv3x3: (Dilated) depthwise separable Conv. It's a 3x3 depthwise convolution with ``C_in`` groups, followed by a 1x1 pointwise convolution. It reduces the number of parameters. The input is first passed through ReLU, then DilConv, and finally BatchNorm2d. **Note that the operation is not a dilated convolution, but we follow the convention in NAS papers to name it DilConv.** 3x3 DilConv has parameters ``kernel_size=3``\ , ``padding=1`` and 5x5 DilConv has parameters ``kernel_size=5``\ , ``padding=4``.
+
+ .. autoclass:: nni.nas.pytorch.search_space_zoo.darts_ops.DilConv
+
+*
+ SepConv3x3 / SepConv5x5
+
+ Composed of two DilConvs with fixed ``kernel_size=3``\ , ``padding=1`` or ``kernel_size=5``\ , ``padding=2`` sequentially.
+
+ .. autoclass:: nni.nas.pytorch.search_space_zoo.darts_ops.SepConv
+
+ENASMicroLayer
+--------------
+
+This layer is extracted from the model designed :githublink:`here `. A model contains several blocks that share the same architecture. A block is made up of some normal layers and reduction layers, ``ENASMicroLayer`` is a unified implementation of the two types of layers. The only difference between the two layers is that reduction layers apply all operations with ``stride=2``.
+
+ENAS Micro employs a DAG with N nodes in one cell, where the nodes represent local computations and the edges represent the flow of information between the N nodes. One cell contains two input nodes and a single output node. The following nodes choose two previous nodes as input and apply two operations from the `predefined ones <#predefined-operations-enas>`__\ , then add the results as the output of this node. For example, Node 4 chooses Node 1 and Node 3 as inputs, applies ``MaxPool`` and ``AvgPool`` on the inputs respectively, and then sums them as the output of Node 4. Nodes that do not serve as input for any other node are viewed as the output of the layer. If there are multiple output nodes, the model will calculate the average of these nodes as the layer output.
+
+The ENAS micro search space is shown below.
+
+
+.. image:: ../../img/NAS_ENAS_micro.svg
+ :target: ../../img/NAS_ENAS_micro.svg
+ :alt:
+
+
+The predefined operators can be seen `here <#predefined-operations-enas>`__.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.ENASMicroLayer
+ :members:
+
+The Reduction Layer is made up of two Conv operations followed by BatchNorm; each of them outputs ``C_out//2`` channels, and the results are concatenated along the channel dimension as the output. The convolutions have ``kernel_size=1`` and ``stride=2``\ , and they perform alternate sampling on the input to reduce the resolution without loss of information. This layer is wrapped in ``ENASMicroLayer``.
+
+Example code
+^^^^^^^^^^^^
+
+:githublink:`example code `
+
+.. code-block:: bash
+
+ git clone https://github.com/Microsoft/nni.git
+ cd nni/examples/nas/search_space_zoo
+ # search the best cell structure
+ python3 enas_micro_example.py
+
+
+Candidate operators
+^^^^^^^^^^^^^^^^^^^
+
+All supported operators for ENAS micro search are listed below.
+
+
+*
+ MaxPool / AvgPool
+
+
+ * MaxPool: Call ``torch.nn.MaxPool2d``. This operation applies a 2D max pooling over all input channels followed by BatchNorm2d. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+ * AvgPool: Call ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels followed by BatchNorm2d. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.Pool
+
+*
+ SepConv
+
+
+ * SepConvBN3x3: ReLU followed by a `DilConv <#DilConv>`__ and BatchNorm. Convolution parameters are ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+ *
+    SepConvBN5x5: Does the same operation as the previous one but with a different kernel size and padding, which are set to 5 and 2 respectively.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.SepConvBN
+
+*
+ SkipConnect
+
+ Call ``torch.nn.Identity`` to connect directly to the next cell.
+
+ENASMacroLayer
+--------------
+
+In macro search, the controller makes two decisions for each layer: i) the `operation <#macro-operations>`__ to perform on the result of the previous layer, and ii) which previous layer to connect to for skip connections. ENAS uses a controller to design the whole model architecture instead of one of its components. The output of the operation is concatenated with the tensor of the layer chosen for the skip connection. NNI provides `predefined operators <#macro-operations>`__ for macro search, which are listed in `Candidate operators <#macro-operations>`__.
+
+Part of one structure in the ENAS macro search space is shown below.
+
+
+.. image:: ../../img/NAS_ENAS_macro.svg
+ :target: ../../img/NAS_ENAS_macro.svg
+ :alt:
+
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.ENASMacroLayer
+ :members:
+
+To describe the whole search space, NNI provides a model, which is built by stacking the layers.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.ENASMacroGeneralModel
+ :members:
+
+Example code
+^^^^^^^^^^^^
+
+:githublink:`example code `
+
+.. code-block:: bash
+
+ git clone https://github.com/Microsoft/nni.git
+ cd nni/examples/nas/search_space_zoo
+ # search the best cell structure
+ python3 enas_macro_example.py
+
+
+Candidate operators
+^^^^^^^^^^^^^^^^^^^
+
+All supported operators for ENAS macro search are listed below.
+
+
+*
+ ConvBranch
+
+  All input first passes through a StdConv, which is made up of a 1x1 Conv followed by BatchNorm2d and ReLU. Then the intermediate result goes through one of the operations listed below. The final result passes through a BatchNorm2d and ReLU as a post-processing step.
+
+
+ * Separable Conv3x3: If ``separable=True``\ , the cell will use `SepConv <#DilConv>`__ instead of normal Conv operation. SepConv's ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+ * Separable Conv5x5: SepConv's ``kernel_size=5``\ , ``stride=1`` and ``padding=2``.
+  * Normal Conv3x3: If ``separable=False``\ , the cell will use a normal Conv operation with ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+ *
+ Normal Conv5x5: Conv's ``kernel_size=5``\ , ``stride=1`` and ``padding=2``.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.ConvBranch
+
+*
+ PoolBranch
+
+  All input first passes through a StdConv, which is made up of a 1x1 Conv followed by BatchNorm2d and ReLU. Then the intermediate result goes through a pooling operation followed by BatchNorm.
+
+
+ * AvgPool: Call ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+ *
+ MaxPool: Call ``torch.nn.MaxPool2d``. This operation applies a 2D max pooling over all input channels. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
+
+.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.PoolBranch
+
+NAS-Bench-201
+-------------
+
+NAS Bench 201 defines a unified search space, which is algorithm agnostic. The predefined skeleton consists of a stack of cells that share the same architecture. Every cell contains four nodes and a DAG is formed by connecting edges among them, where the node represents the sum of feature maps and the edge stands for an operation transforming a tensor from the source node to the target node. The predefined candidate operators can be found in `Candidate operators <#nas-bench-201-reference>`__.
+
+The search space of NAS Bench 201 is shown below.
+
+
+.. image:: ../../img/NAS_Bench_201.svg
+ :target: ../../img/NAS_Bench_201.svg
+ :alt:
+
+
+.. autoclass:: nni.nas.pytorch.nasbench201.NASBench201Cell
+ :members:
+
+Example code
+^^^^^^^^^^^^
+
+:githublink:`example code `
+
+.. code-block:: bash
+
+ # for structure searching
+ git clone https://github.com/Microsoft/nni.git
+ cd nni/examples/nas/search_space_zoo
+ python3 nas_bench_201.py
+
+
+Candidate operators
+^^^^^^^^^^^^^^^^^^^
+
+All supported operators for NAS Bench 201 are listed below.
+
+
+*
+ AvgPool
+
+ If the number of input channels is not equal to the number of output channels, the input will first pass through a ``ReLUConvBN`` layer with ``kernel_size=1``\ , ``stride=1``\ , ``padding=0``\ , and ``dilation=0``.
+ Call ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels followed by BatchNorm2d. Its parameters are fixed to ``kernel_size=3`` and ``padding=1``.
+
+.. autoclass:: nni.nas.pytorch.nasbench201.nasbench201_ops.Pooling
+ :members:
+
+*
+ Conv
+
+
+  * Conv1x1: Consists of a sequence of ReLU, ``nn.Conv2d`` and BatchNorm. The Conv operation's parameters are fixed to ``kernel_size=1``\ , ``padding=0``\ , and ``dilation=1``.
+  * Conv3x3: Consists of a sequence of ReLU, ``nn.Conv2d`` and BatchNorm. The Conv operation's parameters are fixed to ``kernel_size=3``\ , ``padding=1``\ , and ``dilation=1``.
+
+.. autoclass:: nni.nas.pytorch.nasbench201.nasbench201_ops.ReLUConvBN
+ :members:
+
+*
+ SkipConnect
+
+ Call ``torch.nn.Identity`` to connect directly to the next cell.
+
+*
+ Zeroize
+
+ Generate zero tensors indicating there is no connection from the source node to the target node.
+
+.. autoclass:: nni.nas.pytorch.nasbench201.nasbench201_ops.Zero
+ :members:
diff --git a/docs/en_US/NAS/TextNAS.rst b/docs/en_US/NAS/TextNAS.rst
new file mode 100644
index 0000000000..9bf9420f88
--- /dev/null
+++ b/docs/en_US/NAS/TextNAS.rst
@@ -0,0 +1,94 @@
+TextNAS
+=======
+
+Introduction
+------------
+
+This is the implementation of the TextNAS algorithm proposed in the paper `TextNAS: A Neural Architecture Search Space tailored for Text Representation `__. TextNAS is a neural architecture search algorithm tailored for text representation. More specifically, TextNAS is based on a novel search space consisting of operators widely adopted to solve various NLP tasks, and it also supports multi-path ensembles within a single network to balance the width and depth of the architecture.
+
+The search space of TextNAS contains:
+
+.. code-block:: bash
+
+ * 1-D convolutional operator with filter size 1, 3, 5, 7
+ * recurrent operator (bi-directional GRU)
+ * self-attention operator
+ * pooling operator (max/average)
+
+
+Following the ENAS algorithm, TextNAS also utilizes parameter sharing to accelerate the search speed and adopts a reinforcement-learning controller for the architecture sampling and generation. Please refer to the paper for more details of TextNAS.
+
+Preparation
+-----------
+
+Prepare the word vectors and the SST dataset, and organize them in the data directory as shown below:
+
+.. code-block:: bash
+
+ textnas
+ ├── data
+ │ ├── sst
+ │ │ └── trees
+ │ │ ├── dev.txt
+ │ │ ├── test.txt
+ │ │ └── train.txt
+ │ └── glove.840B.300d.txt
+ ├── dataloader.py
+ ├── model.py
+ ├── ops.py
+ ├── README.md
+ ├── search.py
+ └── utils.py
+
+The following links might be helpful for finding and downloading the corresponding datasets:
+
+
+* `GloVe: Global Vectors for Word Representation `__
+
+ * `glove.840B.300d.txt `__
+
+* `Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank `__
+
+ * `trainDevTestTrees_PTB.zip `__
+
+Examples
+--------
+
+Search Space
+^^^^^^^^^^^^
+
+:githublink:`Example code `
+
+.. code-block:: bash
+
+ # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+ git clone https://github.com/Microsoft/nni.git
+
+ # search the best architecture
+ cd examples/nas/textnas
+
+ # view more options for search
+ python3 search.py -h
+
+After each search epoch, 10 sampled architectures will be tested directly. Their performances are expected to be 40% - 42% after 10 epochs.
+
+By default, 20 sampled architectures will be exported into ``checkpoints`` directory for next step.
+
+Retrain
+^^^^^^^
+
+.. code-block:: bash
+
+ # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+ git clone https://github.com/Microsoft/nni.git
+
+   # enter the code folder
+ cd examples/nas/textnas
+
+ # default to retrain on sst-2
+ sh run_retrain.sh
+
+Reference
+---------
+
+TextNAS directly uses EnasTrainer, please refer to `ENAS <./ENAS.rst>`__ for the trainer APIs.
diff --git a/docs/en_US/NAS/Visualization.rst b/docs/en_US/NAS/Visualization.rst
new file mode 100644
index 0000000000..d1e70f05e0
--- /dev/null
+++ b/docs/en_US/NAS/Visualization.rst
@@ -0,0 +1,86 @@
+NAS Visualization (Experimental)
+================================
+
+Built-in Trainers Support
+-------------------------
+
+Currently, only ENAS and DARTS support visualization. The examples of `ENAS <./ENAS.rst>`__ and `DARTS <./DARTS.rst>`__ have demonstrated how to enable visualization in your code, namely, adding this before ``trainer.train()``\ :
+
+.. code-block:: python
+
+ trainer.enable_visualization()
+
+This will create a directory ``logs/`` in your working folder, in which you will find two files ``graph.json`` and ``log``.
+
+You don't have to wait until your program finishes to launch the NAS UI, but it's important that these two files have already been created. Launch the NAS UI with
+
+.. code-block:: bash
+
+ nnictl webui nas --logdir logs/ --port
+
+Visualize a Customized Trainer
+------------------------------
+
+If you are interested in how to customize a trainer, please read this `doc <./Advanced.rst#extend-the-ability-of-one-shot-trainers>`__.
+
+You should do two modifications to an existing trainer to enable visualization:
+
+
+#. Export your graph before training, with
+
+.. code-block:: python
+
+ vis_graph = self.mutator.graph(inputs)
+ # `inputs` is a dummy input to your model. For example, torch.randn((1, 3, 32, 32)).cuda()
+ # If your model has multiple inputs, it should be a tuple.
+ with open("/path/to/your/logdir/graph.json", "w") as f:
+ json.dump(vis_graph, f)
+
+
+#. Log the choices you've made. You can do it once per epoch, once per mini-batch, or at whatever frequency you'd like.
+
+.. code-block:: python
+
+ def __init__(self):
+ # ...
+ self.status_writer = open("/path/to/your/logdir/log", "w") # create a writer
+
+ def train(self):
+ # ...
+ print(json.dumps(self.mutator.status()), file=self.status_writer, flush=True) # dump a record of status
+
+If you are implementing your customized trainer by inheriting ``Trainer``\ , we have provided ``enable_visualization()`` and ``_write_graph_status()`` for ease of use. All you need to do is call ``trainer.enable_visualization()`` before training starts, and ``trainer._write_graph_status()`` each time you want to do the logging. But remember that both of these APIs are experimental and subject to change in the future.
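+
+For illustration, here is a rough sketch in the same fragment style as the snippets above; it assumes your trainer inherits ``Trainer`` and only shows where the two calls would go:
+
+.. code-block:: python
+
+   def train(self):
+       for epoch in range(self.num_epochs):
+           # ... train the supernet and the controller for one epoch ...
+           self._write_graph_status()  # record the currently selected choices for NAS UI
+
+   # before training starts
+   trainer.enable_visualization()  # writes graph.json into the log directory
+   trainer.train()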
+
+Last but not least, invoke the NAS UI with
+
+.. code-block:: bash
+
+ nnictl webui nas --logdir /path/to/your/logdir
+
+NAS UI Preview
+--------------
+
+
+.. image:: ../../img/nasui-1.png
+ :target: ../../img/nasui-1.png
+ :alt:
+
+
+
+.. image:: ../../img/nasui-2.png
+ :target: ../../img/nasui-2.png
+ :alt:
+
+
+Limitations
+-----------
+
+
+* NAS visualization only works with PyTorch >=1.4. We've tested it on PyTorch 1.3.1 and it doesn't work.
+* We rely on PyTorch support for tensorboard for graph export, which relies on ``torch.jit``. It will not work if your model doesn't support ``jit``.
+* There are known performance issues when loading a moderate-size graph with many op choices (like DARTS search space).
+
+Feedback
+--------
+
+NAS UI is currently experimental. We welcome your feedback. `Here `__ we have listed all the to-do items of NAS UI in the future. Feel free to comment (or `submit a new issue `__\ ) if you have other suggestions.
diff --git a/docs/en_US/NAS/WriteSearchSpace.rst b/docs/en_US/NAS/WriteSearchSpace.rst
new file mode 100644
index 0000000000..0281d19692
--- /dev/null
+++ b/docs/en_US/NAS/WriteSearchSpace.rst
@@ -0,0 +1,70 @@
+.. role:: raw-html(raw)
+   :format: html
+
+Write A Search Space
+====================
+
+Generally, a search space describes candidate architectures from which users want to find the best one. Different search algorithms, whether classic NAS or one-shot NAS, can be applied to the search space. NNI provides APIs to unify the expression of neural architecture search spaces.
+
+A search space can be built on a base model. This is also a common practice when a user wants to apply NAS on an existing model. Take `MNIST on PyTorch `__ as an example. Note that NNI provides the same APIs for expressing search space on PyTorch and TensorFlow.
+
+.. code-block:: python
+
+ from nni.nas.pytorch import mutables
+
+ class Net(nn.Module):
+ def __init__(self):
+ super(Net, self).__init__()
+ self.conv1 = mutables.LayerChoice([
+ nn.Conv2d(1, 32, 3, 1),
+ nn.Conv2d(1, 32, 5, 3)
+ ]) # try 3x3 kernel and 5x5 kernel
+ self.conv2 = nn.Conv2d(32, 64, 3, 1)
+ self.dropout1 = nn.Dropout2d(0.25)
+ self.dropout2 = nn.Dropout2d(0.5)
+ self.fc1 = nn.Linear(9216, 128)
+ self.fc2 = nn.Linear(128, 10)
+
+ def forward(self, x):
+ x = self.conv1(x)
+ x = F.relu(x)
+ # ... same as original ...
+ return output
+
+The example above adds an option of choosing conv5x5 at conv1. The modification is as simple as declaring a ``LayerChoice`` with the original conv3x3 and a new conv5x5 as its parameter. That's it! You don't have to modify the forward function in any way. You can imagine conv1 as any other module without NAS.
+
+So how about the possibilities of connections? This can be done using ``InputChoice``. To allow for a skip connection on the MNIST example, we add another layer called conv3. In the following example, a possible connection from conv2 is added to the output of conv3.
+
+.. code-block:: python
+
+ from nni.nas.pytorch import mutables
+
+ class Net(nn.Module):
+ def __init__(self):
+ # ... same ...
+ self.conv2 = nn.Conv2d(32, 64, 3, 1)
+ self.conv3 = nn.Conv2d(64, 64, 1, 1)
+ # declaring that there is exactly one candidate to choose from
+ # search strategy will choose one or None
+ self.skipcon = mutables.InputChoice(n_candidates=1)
+ # ... same ...
+
+ def forward(self, x):
+ x = self.conv1(x)
+ x = F.relu(x)
+ x = self.conv2(x)
+ x0 = self.skipcon([x]) # choose one or none from [x]
+ x = self.conv3(x)
+ if x0 is not None: # skipconnection is open
+ x += x0
+ x = F.max_pool2d(x, 2)
+ # ... same ...
+ return output
+
+Input choice can be thought of as a callable module that receives a list of tensors and outputs the concatenation/sum/mean of some of them (sum by default), or ``None`` if none is selected. Like layer choices, input choices should be **initialized in ``__init__`` and called in ``forward``**. This is to allow search algorithms to identify these choices and do necessary preparations.
+
+``LayerChoice`` and ``InputChoice`` are both **mutables**. Mutable means "changeable". As opposed to traditional deep learning layers/modules, which have fixed operation types once defined, models with mutables are essentially a family of possible models.
+
+Users can specify a **key** for each mutable. By default, NNI will assign one for you that is globally unique, but in case users want to share choices (for example, there are two ``LayerChoice``\ s with the same candidate operations and you want them to have the same choice, i.e., if first one chooses the i-th op, the second one also chooses the i-th op), they can give them the same key. The key marks the identity for this choice and will be used in the dumped checkpoint. So if you want to increase the readability of your exported architecture, manually assigning keys to each mutable would be a good idea. For advanced usage on mutables (e.g., ``LayerChoice`` and ``InputChoice``\ ), see `Mutables <./NasReference.rst>`__.
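+
+For example, here is a minimal sketch of sharing one choice between two mutables by giving them the same key. It assumes ``LayerChoice`` accepts a ``key`` argument as described above; the class name, layer names, and channel sizes are made up:
+
+.. code-block:: python
+
+   import torch.nn as nn
+   from nni.nas.pytorch import mutables
+
+   class TwinConvNet(nn.Module):
+       def __init__(self):
+           super().__init__()
+           # both choices carry the key "conv_op", so the search algorithm
+           # picks the same candidate index (3x3 or 5x5) for both layers
+           self.conv_a = mutables.LayerChoice([nn.Conv2d(32, 32, 3, padding=1),
+                                               nn.Conv2d(32, 32, 5, padding=2)], key="conv_op")
+           self.conv_b = mutables.LayerChoice([nn.Conv2d(32, 32, 3, padding=1),
+                                               nn.Conv2d(32, 32, 5, padding=2)], key="conv_op")
+
+       def forward(self, x):
+           return self.conv_b(self.conv_a(x))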
+
+With the search space defined, the next step is to search for the best model in it. Please refer to `classic NAS algorithms <./ClassicNas.rst>`__ and `one-shot NAS algorithms <./NasGuide.rst>`__ for how to search from your defined search space.
diff --git a/docs/en_US/Overview.rst b/docs/en_US/Overview.rst
new file mode 100644
index 0000000000..639dae8b71
--- /dev/null
+++ b/docs/en_US/Overview.rst
@@ -0,0 +1,123 @@
+Overview
+========
+
+NNI (Neural Network Intelligence) is a toolkit to help users design and tune machine learning models (e.g., hyperparameters), neural network architectures, or complex system's parameters, in an efficient and automatic way. NNI has several appealing properties: ease-of-use, scalability, flexibility, and efficiency.
+
+
+* **Ease-of-use**\ : NNI can be easily installed through python pip. Only a few lines need to be added to your code in order to use NNI's power. You can use both the command-line tool and the WebUI to work with your experiments.
+* **Scalability**\ : Tuning hyperparameters or the neural architecture often demands a large amount of computational resources, while NNI is designed to fully leverage different computation resources, such as remote machines and training platforms (e.g., OpenPAI, Kubernetes). Hundreds of trials can run in parallel, depending on the capacity of your configured training platforms.
+* **Flexibility**\ : Besides rich built-in algorithms, NNI allows users to customize various hyperparameter tuning algorithms, neural architecture search algorithms, early stopping algorithms, etc. Users can also extend NNI with more training platforms, such as virtual machines, kubernetes service on the cloud. Moreover, NNI can connect to external environments to tune special applications/models on them.
+* **Efficiency**\ : We are intensively working on more efficient model tuning on both the system and algorithm level. For example, we leverage early feedback to speedup the tuning procedure.
+
+The figure below shows the high-level architecture of NNI.
+
+
+.. figure placeholder: the high-level architecture of NNI
+
+Key Concepts
+------------
+
+
+*
+ *Experiment*\ : One task of, for example, finding out the best hyperparameters of a model, finding out the best neural network architecture, etc. It consists of trials and AutoML algorithms.
+
+*
+ *Search Space*\ : The feasible region for tuning the model. For example, the value range of each hyperparameter.
+
+*
+ *Configuration*\ : An instance from the search space, that is, each hyperparameter has a specific value.
+
+*
+ *Trial*\ : An individual attempt at applying a new configuration (e.g., a set of hyperparameter values, a specific neural architecture, etc.). Trial code should be able to run with the provided configuration.
+
+*
+ *Tuner*\ : An AutoML algorithm, which generates a new configuration for the next try. A new trial will run with this configuration.
+
+*
+ *Assessor*\ : Analyze a trial's intermediate results (e.g., periodically evaluated accuracy on test dataset) to tell whether this trial can be early stopped or not.
+
+*
+ *Training Platform*\ : Where trials are executed. Depending on your experiment's configuration, it could be your local machine, or remote servers, or large-scale training platform (e.g., OpenPAI, Kubernetes).
+
+Basically, an experiment runs as follows: the tuner receives the search space and generates configurations. These configurations are submitted to training platforms, such as the local machine, remote machines, or training clusters. Their performance is reported back to the tuner. Then, new configurations are generated and submitted.
+
+For each experiment, the user only needs to define a search space and update a few lines of code, and then leverage NNI's built-in Tuner/Assessor and training platforms to search for the best hyperparameters and/or neural architecture. There are basically 3 steps:
+
+..
+
+ Step 1: `Define search space `__
+
+ Step 2: `Update model codes `__
+
+ Step 3: `Define Experiment `__
+
+
+
+.. figure placeholder: the three steps of running an NNI experiment
+
+For more details about how to run an experiment, please refer to `Get Started `__.
+
+Core Features
+-------------
+
+NNI provides a key capability to run multiple instances in parallel to find the best combination of parameters. This feature can be used in various domains, like finding the best hyperparameters for a deep learning model or finding the best configuration for databases and other complex systems with real data.
+
+NNI also provides algorithm toolkits for machine learning and deep learning, especially neural architecture search (NAS) algorithms, model compression algorithms, and feature engineering algorithms.
+
+Hyperparameter Tuning
+^^^^^^^^^^^^^^^^^^^^^
+
+This is a core and basic feature of NNI; we provide many popular `automatic tuning algorithms `__ (i.e., tuners) and `early stop algorithms `__ (i.e., assessors). You can follow the `Quick Start `__ to tune your model (or system). Basically, you follow the above three steps and then start an NNI experiment.
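+
+As a rough sketch of what the trial code in step 2 looks like (``train_and_evaluate`` and the reported metric are placeholders for your own training routine):
+
+.. code-block:: python
+
+   import nni
+
+   def train_and_evaluate(params):
+       """Hypothetical user-defined training routine returning a validation metric."""
+       ...
+
+   if __name__ == '__main__':
+       params = nni.get_next_parameter()    # receive one configuration from the tuner
+       accuracy = train_and_evaluate(params)
+       nni.report_final_result(accuracy)    # report the metric back to NNI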
+
+General NAS Framework
+^^^^^^^^^^^^^^^^^^^^^
+
+This NAS framework is for users to easily specify candidate neural architectures, for example, one can specify multiple candidate operations (e.g., separable conv, dilated conv) for a single layer, and specify possible skip connections. NNI will find the best candidate automatically. On the other hand, the NAS framework provides a simple interface for another type of user (e.g., NAS algorithm researchers) to implement new NAS algorithms. A detailed description of NAS and its usage can be found `here `__.
+
+NNI has support for many one-shot NAS algorithms such as ENAS and DARTS through NNI trial SDK. To use these algorithms you do not have to start an NNI experiment. Instead, import an algorithm in your trial code and simply run your trial code. If you want to tune the hyperparameters in the algorithms or want to run multiple instances, you can choose a tuner and start an NNI experiment.
+
+Other than one-shot NAS, NAS can also run in a classic mode where each candidate architecture runs as an independent trial job. In this mode, similar to hyperparameter tuning, users have to start an NNI experiment and choose a tuner for NAS.
+
+Model Compression
+^^^^^^^^^^^^^^^^^
+
+NNI provides an easy-to-use model compression framework to compress deep neural networks; the compressed networks typically have a much smaller model size and much faster
+inference speed without losing performance significantly. Model compression on NNI includes pruning algorithms and quantization algorithms. NNI provides many pruning and
+quantization algorithms through the NNI trial SDK. Users can directly use them in their trial code and run the trial code without starting an NNI experiment. Users can also use the NNI model compression framework to customize their own pruning and quantization algorithms.
+
+A detailed description of model compression and its usage can be found `here `__.
+
+Automatic Feature Engineering
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Automatic feature engineering is for users to find the best features for their tasks. A detailed description of automatic feature engineering and its usage can be found `here `__. It is supported through NNI trial SDK, which means you do not have to create an NNI experiment. Instead, simply import a built-in auto-feature-engineering algorithm in your trial code and directly run your trial code.
+
+The auto-feature-engineering algorithms usually have a bunch of hyperparameters themselves. If you want to automatically tune those hyperparameters, you can leverage hyperparameter tuning of NNI, that is, choose a tuning algorithm (i.e., tuner) and start an NNI experiment for it.
+
+Learn More
+----------
+
+
+* `Get started `__
+* `How to adapt your trial code on NNI? `__
+* `What are tuners supported by NNI? `__
+* `How to customize your own tuner? `__
+* `What are assessors supported by NNI? `__
+* `How to customize your own assessor? `__
+* `How to run an experiment on local? `__
+* `How to run an experiment on multiple machines? `__
+* `How to run an experiment on OpenPAI? `__
+* `Examples `__
+* `Neural Architecture Search on NNI `__
+* `Model Compression on NNI `__
+* `Automatic feature engineering on NNI `__
diff --git a/docs/en_US/Release.rst b/docs/en_US/Release.rst
new file mode 100644
index 0000000000..3b4838c9bc
--- /dev/null
+++ b/docs/en_US/Release.rst
@@ -0,0 +1,1123 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
+ChangeLog
+=========
+
+Release 1.9 - 10/22/2020
+========================
+
+Major updates
+-------------
+
+Neural architecture search
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+* Support regularized evolution algorithm for NAS scenario (#2802)
+* Add NASBench201 in search space zoo (#2766)
+
+Model compression
+^^^^^^^^^^^^^^^^^
+
+
+* AMC pruner improvement: support ResNet, support reproduction of the experiments (default parameters in our example code) in the AMC paper (#2876 #2906)
+* Support constraint-aware pruning in some of our pruners to improve model compression efficiency (#2657)
+* Support "tf.keras.Sequential" in model compression for TensorFlow (#2887)
+* Support customized op in the model flops counter (#2795)
+* Support quantizing bias in QAT quantizer (#2914)
+
+Training service
+^^^^^^^^^^^^^^^^
+
+
+* Support configuring python environment using "preCommand" in remote mode (#2875)
+* Support AML training service in Windows (#2882)
+* Support reuse mode for remote training service (#2923)
+
+WebUI & nnictl
+^^^^^^^^^^^^^^
+
+
+* The "Overview" page on WebUI is redesigned with new layout (#2914)
+* Upgraded node, yarn and FabricUI, and enabled Eslint (#2894 #2873 #2744)
+* Add/Remove columns in hyper-parameter chart and trials table in "Trials detail" page (#2900)
+* JSON format beautification utility on WebUI (#2863)
+* Support nnictl command auto-completion (#2857)
+
+UT & IT
+-------
+
+
+* Add integration test for experiment import and export (#2878)
+* Add integration test for user installed builtin tuner (#2859)
+* Add unit test for nnictl (#2912)
+
+Documentation
+-------------
+
+
+* Refactor of the document for model compression (#2919)
+
+Bug fixes
+---------
+
+
+* Fix a bug in the naïve evolution tuner to correctly handle trial failures (#2695)
+* Resolve the warning "WARNING (nni.protocol) IPC pipeline not exists, maybe you are importing tuner/assessor from trial code?" (#2864)
+* Fix search space issue in experiment save/load (#2886)
+* Fix bug in experiment import data (#2878)
+* Fix annotation in remote mode (python 3.8 ast update issue) (#2881)
+* Support boolean type for "choice" hyper-parameter when customizing trial configuration on WebUI (#3003)
+
+Release 1.8 - 8/27/2020
+=======================
+
+Major updates
+-------------
+
+Training service
+^^^^^^^^^^^^^^^^
+
+
+* Access trial log directly on WebUI (local mode only) (#2718)
+* Add OpenPAI trial job detail link (#2703)
+* Support GPU scheduler in reusable environment (#2627) (#2769)
+* Add timeout for ``web_channel`` in ``trial_runner`` (#2710)
+* Show environment error message in AzureML mode (#2724)
+* Add more log information when copying data in OpenPAI mode (#2702)
+
+WebUI, nnictl and nnicli
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+* Improve hyper-parameter parallel coordinates plot (#2691) (#2759)
+* Add pagination for trial job list (#2738) (#2773)
+* Enable panel close when clicking overlay region (#2734)
+* Remove support for Multiphase on WebUI (#2760)
+* Support save and restore experiments (#2750)
+* Add intermediate results in export result (#2706)
+* Add `command `__ to list trial results with highest/lowest metrics (#2747)
+* Improve the user experience of `nnicli `__ with `examples `__ (#2713)
+
+Neural architecture search
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+* `Search space zoo: ENAS and DARTS `__ (#2589)
+* API to query intermediate results in NAS benchmark (#2728)
+
+Model compression
+^^^^^^^^^^^^^^^^^
+
+
+* Support the List/Tuple Construct/Unpack operation for TorchModuleGraph (#2609)
+* Model speedup improvement: Add support of DenseNet and InceptionV3 (#2719)
+* Support multiple successive tuple unpack operations (#2768)
+* `Doc of comparing the performance of supported pruners `__ (#2742)
+* New pruners: `Sensitivity pruner `__ (#2684) and `AMC pruner `__ (#2573) (#2786)
+* TensorFlow v2 support in model compression (#2755)
+
+Backward incompatible changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+* Update the default experiment folder from ``$HOME/nni/experiments`` to ``$HOME/nni-experiments``. If you want to view the experiments created by previous NNI releases, you can move the experiments folders from ``$HOME/nni/experiments`` to ``$HOME/nni-experiments`` manually. (#2686) (#2753)
+* Dropped support for Python 3.5 and scikit-learn 0.20 (#2778) (#2777) (#2783) (#2787) (#2788) (#2790)
+
+Others
+^^^^^^
+
+
+* Upgrade TensorFlow version in Docker image (#2732) (#2735) (#2720)
+
+Examples
+--------
+
+
+* Remove gpuNum in assessor examples (#2641)
+
+Documentation
+-------------
+
+
+* Improve customized tuner documentation (#2628)
+* Fix several typos and grammar mistakes in documentation (#2637 #2638, thanks @tomzx)
+* Improve AzureML training service documentation (#2631)
+* Improve CI of Chinese translation (#2654)
+* Improve OpenPAI training service documentation (#2685)
+* Improve documentation of community sharing (#2640)
+* Add tutorial of Colab support (#2700)
+* Improve documentation structure for model compression (#2676)
+
+Bug fixes
+---------
+
+
+* Fix mkdir error in training service (#2673)
+* Fix bug when using chmod in remote training service (#2689)
+* Fix dependency issue by making ``_graph_utils`` imported inline (#2675)
+* Fix mask issue in ``SimulatedAnnealingPruner`` (#2736)
+* Fix intermediate graph zooming issue (#2738)
+* Fix issue when dict is unordered when querying NAS benchmark (#2728)
+* Fix import issue for gradient selector dataloader iterator (#2690)
+* Fix support of adding tens of machines in remote training service (#2725)
+* Fix several styling issues in WebUI (#2762 #2737)
+* Fix support of unusual types in metrics including NaN and Infinity (#2782)
+* Fix nnictl experiment delete (#2791)
+
+Release 1.7 - 7/8/2020
+======================
+
+Major Features
+--------------
+
+Training Service
+^^^^^^^^^^^^^^^^
+
+
+* Support the AML (Azure Machine Learning) platform as an NNI training service.
+* OpenPAI jobs can be reused. When a trial is completed, the OpenPAI job won't stop; it waits for the next trial. `Refer to the reuse flag in OpenPAI config `__.
+* `Support ignoring files and folders in the code directory with .nniignore when uploading the code directory to the training service `__.
+
+Neural Architecture Search (NAS)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+*
+ `Provide NAS Open Benchmarks (NasBench101, NasBench201, NDS) with friendly APIs `__.
+
+*
+ `Support Classic NAS (i.e., non-weight-sharing mode) on TensorFlow 2.X `__.
+
+Model Compression
+^^^^^^^^^^^^^^^^^
+
+
+* Improve Model Speedup: track more dependencies among layers and automatically resolve mask conflicts; support the speedup of pruned ResNet.
+* Added new pruners, including three auto model pruning algorithms (`NetAdapt Pruner `__\ , `SimulatedAnnealing Pruner `__\ , `AutoCompress Pruner `__\ ) and the `ADMM Pruner `__.
+* Added `model sensitivity analysis tool