add PBT tuner #2139

RayMeng8 · 2020-03-09T11:36:42Z

The implementation of the paper "Population Based Training of Neural Networks" on NNI

msftclas · 2020-03-09T11:36:55Z

All CLA requirements met.

QuanluZhang · 2020-03-17T08:24:36Z

@RayMeng8 please add document for pbt tuner under docs/en_US/Tuner

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

examples/trials/mnist-pbt-tuner-pytorch/config.yml

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

QuanluZhang · 2020-03-24T13:27:28Z

@RayMeng8 please add doc and unittest for this tuner

docs/en_US/Tuner/BuiltinTuner.md

examples/trials/mnist-pbt-tuner-pytorch/mnist.py

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

docs/en_US/Tuner/BuiltinTuner.md

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

leckie-chn · 2020-03-26T11:23:56Z

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

+            hyper_parameters[key] = hyper_parameters['save_checkpoint_dir']
+        elif key == 'save_checkpoint_dir':
+            hyper_parameters[key] = os.path.join(bot_checkpoint_dir, str(epoch))
+        elif isinstance(hyper_parameters[key], float):


why not perturb other types of hyper-parameters such as int, string?

In the paper, this way of exploration was introduced, but it is not applicable to other types of data. I am not sure how to perturb other types of data. Maybe I can add them in the future.

good point @leckie-chn . @RayMeng8 if you want to support other types in future, please make it clear what types of search space does PBT support. https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/SearchSpaceSpec.md#search-space-types-supported-by-each-tuner

QuanluZhang · 2020-03-27T02:19:21Z

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

+            hyper_parameters[key] = os.path.join(bot_checkpoint_dir, str(epoch))
+        elif isinstance(hyper_parameters[key], float):
+            perturb = np.random.choice(factors)
+            hyper_parameters[key] *= perturb


should make sure that after perturb the value is still within search space

QuanluZhang · 2020-03-27T02:30:34Z

src/sdk/pynni/tests/test_builtin_tuners.py

@@ -192,6 +193,9 @@ def test_networkmorphism(self):
    def test_ppo(self):
        pass

+    def test_pbt(self):
+        pass


this is not adding unittest for pbt, please follow other tuners' unittest to think about how to the unittest of pbt should be.

ultmaster · 2020-03-27T13:27:00Z

src/sdk/pynni/tests/test_builtin_tuners.py

+            if isinstance(tuner, PBTTuner):
+                parameters = tuner.generate_multiple_parameters(list(range(i * self.params_each_round,
+                                                                       (i + 1) * self.params_each_round)), st_callback=self.send_trial_callback)
+            else:


I think generate_multiple_parameters of other tuners can be called with st_callback anyway?

ultmaster · 2020-03-27T13:30:18Z

src/sdk/pynni/tests/test_builtin_tuners.py

+_trial_params = {}
+
+
+def _pack_parameter(parameter_id, params, customized=False, trial_job_id=None, parameter_index=None):


Directly import it from msg_dispatcher instead of copying it?

ultmaster · 2020-03-27T13:31:16Z

src/sdk/pynni/nni/utils.py

 import functools
 from enum import Enum, unique
 import json_tricks

+import nni.parameter_expressions as parameter_expressions


from . import parameter_expressions

ultmaster · 2020-03-27T13:32:20Z

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py

+    bot_trial_info.clean_id()
+
+
+class Trial_Info:


Style comment: TrialInfo.

ultmaster · 2020-03-27T13:33:47Z

src/sdk/pynni/tests/test_builtin_tuners.py

@@ -192,6 +224,9 @@ def test_networkmorphism(self):
    def test_ppo(self):
        pass

+    def test_pbt(self):
+        self.search_space_test_all(lambda: PBTTuner(all_checkpoint_dir="~/nni/checkpoint/test/", population_size=100))


No need to specify all_checkpoint_dir?

ultmaster · 2020-03-27T13:34:41Z

docs/en_US/Tuner/PBTTuner.md

+
+Population Based Training(PBT) comes from [Population Based Training of Neural Networks](https://arxiv.org/abs/1711.09846v1). It's a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training. 
+
+PBTTuner initializes a population with several trials. Users can set a specific number of training epochs. After a certain number of epochs, the parameters and hyperparameters in the trial with bad metrics will be replaced with a better trial (exploit). Then the hyperparameters are purturbed (explore). 


ultmaster · 2020-03-27T13:45:25Z

docs/en_US/Tuner/PBTTuner.md

+
+PBTTuner initializes a population with several trials. Users can set a specific number of training epochs. After a certain number of epochs, the parameters and hyperparameters in the trial with bad metrics will be replaced with a better trial (exploit). Then the hyperparameters are purturbed (explore). 
+
+In our implementation, training epochs in the trial code is regarded as a step of PBT, different with other tuners. When a step is over, PBTTuner will perform exploitation and exploration. The checkpoint is not assigned explicitly, but by continuously changing load_checkpoint_dir and save_checkpoint_dir, we can directly change load_checkpoint_dir to replace parameters and hyperparameters. And save_checkpoint_dir used to save checkpoint which can be loaded in next step. Therefore, the directory need to be accessible by all the trials. If the experiment is local mode, users could provide all_checkpoint_dir which decides load_checkpoint_dir and save_checkpoint_dir(checkpoint_dir is set to "all_checkpoint_dir/<population-id>/<step>"), otherwise the directory would be "~/nni/checkpoint/<exp-id>". If the experiment is not local mode, then users should provide a path in a shared storage which can be accessed by all the trials as all_checkpoint_dir.


In our implementation, training epochs in the trial code is regarded as a step of PBT, different with other tuners. At the end of each step, PBT tuner will do exploitation and exploration -- replacing some trials with new trials. This is implemented by constantly modifying the values of load_checkpoint_dir and save_checkpoint_dir. We can directly change load_checkpoint_dir to replace parameters and hyperparameters, and save_checkpoint_dir to save a checkpoint that will be loaded in the next step. To this end, we need a shared folder which is accessible to all trials.

If the experiment is running in local mode, users could provide an argument all_checkpoint_dir which will be the base folder of load_checkpoint_dir and save_checkpoint_dir (checkpoint_dir is set to all_checkpoint_dir/<population-id>/<step>). By default, all_checkpoint_dir is set to be ~/nni/checkpoint/<exp-id>. If the experiment is in non-local mode, then users should provide a path in a shared storage folder which is mounted at all_checkpoint_dir on worker machines (but it's not necessarily available on the machine which runs tuner).

ultmaster · 2020-03-27T13:46:10Z

docs/en_US/Tuner/BuiltinTuner.md

+
+**Suggested scenario**
+
+Population Based Training (PBT) which bridges and extends parallel search methods and sequential optimization methods. It has a wallclock run time that is no greater than that of a single optimisation process, does not require sequential runs, and is also able to use fewer computational resources than naive search methods. Therefore, it's effective when you want to save computational resources and time. Besides, PBT returns hyperparameter scheduler instead of configuration. If you don't need to get a specific configuration, but just expect good results, you can choose this tuner. It should be noted that, in our implementation, the operation of checkpoint storage location is involved. A trial is considered as several traning epochs of training, so the loading and saving of checkpoint must be specified in the trial code, which is different with other tuners. Otherwise, if the experiment is not local mode, users should provide a path in a shared storage which can be accessed by all the trials. You could try it on very simple task, such as the [mnist-pbt-tuner-pytorch](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-pbt-tuner-pytorch) example. [See details](./PBTTuner.md)


optimization

add pbt-tuner

cebb7d9

QuanluZhang changed the title ~~add pbt-tuner~~ add PBT tuner Mar 9, 2020

QuanluZhang requested review from QuanluZhang and xuehui1991 March 9, 2020 12:17

QuanluZhang linked an issue Mar 9, 2020 that may be closed by this pull request

Add Population Based Training Tuner #1470

Closed

QuanluZhang reviewed Mar 17, 2020

View reviewed changes

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py Outdated Show resolved Hide resolved

QuanluZhang reviewed Mar 17, 2020

View reviewed changes

src/sdk/pynni/nni/pbt_tuner/pbt_tuner.py Show resolved Hide resolved