This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

add PBT tuner #2139

Merged
merged 26 commits into from
Mar 30, 2020

Conversation

RayMeng8
Contributor

@RayMeng8 RayMeng8 commented Mar 9, 2020

The implementation of the paper "Population Based Training of Neural Networks" on NNI
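For readers unfamiliar with the algorithm, a minimal sketch of the exploit-and-explore step the paper describes; all names and the perturbation factors here are illustrative, not the code in this PR:

```python
import numpy as np

def exploit_and_explore(top_hyper_parameters, factors=(0.8, 1.2)):
    """Exploit: copy the hyperparameters of a better population member.
    Explore: perturb each float hyperparameter by a random factor."""
    hyper_parameters = dict(top_hyper_parameters)
    for key, value in hyper_parameters.items():
        if isinstance(value, float):
            hyper_parameters[key] = value * float(np.random.choice(factors))
    return hyper_parameters
```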

@msftclas

msftclas commented Mar 9, 2020

CLA assistant check
All CLA requirements met.

@QuanluZhang QuanluZhang changed the title add pbt-tuner add PBT tuner Mar 9, 2020
@QuanluZhang QuanluZhang linked an issue Mar 9, 2020 that may be closed by this pull request
@QuanluZhang
Contributor

@RayMeng8 please add documentation for the PBT tuner under docs/en_US/Tuner

@QuanluZhang QuanluZhang marked this pull request as ready for review March 23, 2020 02:42
@QuanluZhang
Contributor

@RayMeng8 please add documentation and unit tests for this tuner

@QuanluZhang QuanluZhang requested a review from leckie-chn March 25, 2020 02:53
hyper_parameters[key] = hyper_parameters['save_checkpoint_dir']
elif key == 'save_checkpoint_dir':
hyper_parameters[key] = os.path.join(bot_checkpoint_dir, str(epoch))
elif isinstance(hyper_parameters[key], float):
Contributor

why not perturb other types of hyper-parameters such as int, string?

Contributor Author

The paper introduces this way of exploration, but it is not applicable to other types of data, and I am not sure how to perturb those. Maybe I can add support for them in the future.

Contributor

Good point @leckie-chn. @RayMeng8, if you want to support other types in the future, please make it clear what types of search space PBT supports. https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/SearchSpaceSpec.md#search-space-types-supported-by-each-tuner
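One possible way to extend the perturbation to `choice` and integer parameters, should support be added later, would be to resample or step within the search space. A hypothetical sketch, not part of this PR; the search-space layout follows NNI's search space spec, everything else is illustrative:

```python
import numpy as np

def perturb(key, value, search_space, factors=(0.8, 1.2)):
    """Hypothetical perturbation that also covers choice and integer parameters."""
    spec = search_space[key]
    if spec['_type'] == 'choice':
        # move to a neighbouring choice instead of multiplying
        choices = spec['_value']
        idx = choices.index(value) + int(np.random.choice([-1, 1]))
        return choices[min(max(idx, 0), len(choices) - 1)]
    if spec['_type'] == 'randint':
        # perturb, round, and clamp back into [lower, upper)
        low, high = spec['_value']
        return int(np.clip(round(value * np.random.choice(factors)), low, high - 1))
    if isinstance(value, float):
        return value * float(np.random.choice(factors))
    return value
```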

hyper_parameters[key] = os.path.join(bot_checkpoint_dir, str(epoch))
elif isinstance(hyper_parameters[key], float):
perturb = np.random.choice(factors)
hyper_parameters[key] *= perturb
Contributor

We should make sure that after the perturbation the value is still within the search space.
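A minimal sketch of such a bounds check, assuming the range is available as [low, high]; the names are illustrative, not the PR's code:

```python
import numpy as np

def perturb_within_bounds(value, low, high, factors=(0.8, 1.2)):
    """Perturb a float hyperparameter and clamp it back into [low, high]."""
    return float(np.clip(value * np.random.choice(factors), low, high))
```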

@@ -192,6 +193,9 @@ def test_networkmorphism(self):
def test_ppo(self):
pass

def test_pbt(self):
pass
Contributor

This is not adding a unit test for PBT. Please follow the other tuners' unit tests and think about what the unit test for PBT should look like.

if isinstance(tuner, PBTTuner):
parameters = tuner.generate_multiple_parameters(list(range(i * self.params_each_round,
(i + 1) * self.params_each_round)), st_callback=self.send_trial_callback)
else:
Contributor

I think generate_multiple_parameters of other tuners can be called with st_callback anyway?
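If the base Tuner.generate_multiple_parameters does accept and forward extra keyword arguments (an assumption here, not something this PR confirms), the branch above could collapse into one unconditional call:

```python
# assumption: tuners whose generate_parameters does not use st_callback
# simply ignore the extra keyword argument
parameters = tuner.generate_multiple_parameters(
    list(range(i * self.params_each_round, (i + 1) * self.params_each_round)),
    st_callback=self.send_trial_callback)
```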

_trial_params = {}


def _pack_parameter(parameter_id, params, customized=False, trial_job_id=None, parameter_index=None):
Contributor

Directly import it from msg_dispatcher instead of copying it?

import functools
from enum import Enum, unique
import json_tricks

import nni.parameter_expressions as parameter_expressions
Contributor

from . import parameter_expressions

bot_trial_info.clean_id()


class Trial_Info:
Contributor

Style comment: TrialInfo.

@@ -192,6 +224,9 @@ def test_networkmorphism(self):
def test_ppo(self):
pass

def test_pbt(self):
self.search_space_test_all(lambda: PBTTuner(all_checkpoint_dir="~/nni/checkpoint/test/", population_size=100))
Contributor

No need to specify all_checkpoint_dir?


Population Based Training (PBT) comes from [Population Based Training of Neural Networks](https://arxiv.org/abs/1711.09846v1). It's a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training.

PBTTuner initializes a population with several trials. Users can set a specific number of training epochs. After a certain number of epochs, the parameters and hyperparameters of a trial with bad metrics will be replaced with those of a better trial (exploit). Then the hyperparameters are perturbed (explore).
Contributor

perturbed


PBTTuner initializes a population with several trials. Users can set a specific number of training epochs. After a certain number of epochs, the parameters and hyperparameters of a trial with bad metrics will be replaced with those of a better trial (exploit). Then the hyperparameters are perturbed (explore).

In our implementation, training epochs in the trial code are regarded as a step of PBT, different from other tuners. When a step is over, PBTTuner performs exploitation and exploration. The checkpoint is not assigned explicitly; instead, by continuously changing load_checkpoint_dir and save_checkpoint_dir, we can directly change load_checkpoint_dir to replace parameters and hyperparameters, while save_checkpoint_dir is used to save a checkpoint that can be loaded in the next step. Therefore, the directory needs to be accessible by all the trials. If the experiment is in local mode, users could provide all_checkpoint_dir, which decides load_checkpoint_dir and save_checkpoint_dir (checkpoint_dir is set to "all_checkpoint_dir/<population-id>/<step>"); otherwise the directory would be "~/nni/checkpoint/<exp-id>". If the experiment is not in local mode, users should provide a path in a shared storage which can be accessed by all the trials as all_checkpoint_dir.
Contributor

@ultmaster ultmaster Mar 27, 2020

In our implementation, training epochs in the trial code are regarded as a step of PBT, different from other tuners. At the end of each step, the PBT tuner will do exploitation and exploration -- replacing some trials with new trials. This is implemented by constantly modifying the values of load_checkpoint_dir and save_checkpoint_dir. We can directly change load_checkpoint_dir to replace parameters and hyperparameters, and save_checkpoint_dir to save a checkpoint that will be loaded in the next step. To this end, we need a shared folder which is accessible to all trials.

If the experiment is running in local mode, users could provide an argument all_checkpoint_dir which will be the base folder of load_checkpoint_dir and save_checkpoint_dir (checkpoint_dir is set to all_checkpoint_dir/<population-id>/<step>). By default, all_checkpoint_dir is set to be ~/nni/checkpoint/<exp-id>. If the experiment is in non-local mode, then users should provide a path in a shared storage folder which is mounted at all_checkpoint_dir on worker machines (but it's not necessarily available on the machine which runs tuner).
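On the trial side, this translates into restoring from load_checkpoint_dir and writing to save_checkpoint_dir every step. A minimal PyTorch-flavoured sketch: the parameter names load_checkpoint_dir and save_checkpoint_dir come from the description above, while run_one_step, train_one_step and model.pth are illustrative:

```python
import os
import torch

def run_one_step(model, optimizer, train_one_step, params):
    """One PBT step inside the trial: restore, train this step's epochs, checkpoint."""
    load_path = os.path.join(params['load_checkpoint_dir'], 'model.pth')
    if os.path.isfile(load_path):
        # exploit: continue from the checkpoint the tuner selected
        state = torch.load(load_path)
        model.load_state_dict(state['model'])
        optimizer.load_state_dict(state['optimizer'])

    train_one_step(model, optimizer)  # user-supplied training loop for this step

    os.makedirs(params['save_checkpoint_dir'], exist_ok=True)
    torch.save({'model': model.state_dict(), 'optimizer': optimizer.state_dict()},
               os.path.join(params['save_checkpoint_dir'], 'model.pth'))
```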


**Suggested scenario**

Population Based Training (PBT) bridges and extends parallel search methods and sequential optimization methods. It has a wallclock run time that is no greater than that of a single optimization process, does not require sequential runs, and is also able to use fewer computational resources than naive search methods. Therefore, it's effective when you want to save computational resources and time. Besides, PBT returns a hyperparameter schedule instead of a single configuration. If you don't need a specific configuration, but just expect good results, you can choose this tuner. Note that, in our implementation, the checkpoint storage location is involved: a trial is considered as several training epochs, so the loading and saving of checkpoints must be specified in the trial code, which is different from other tuners. Moreover, if the experiment is not in local mode, users should provide a path in a shared storage which can be accessed by all the trials. You could try it on a very simple task, such as the [mnist-pbt-tuner-pytorch](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-pbt-tuner-pytorch) example. [See details](./PBTTuner.md)
Contributor

optimization

@liuzhe-lz liuzhe-lz merged commit a82b4a3 into microsoft:master Mar 30, 2020
@RayMeng8 RayMeng8 deleted the dev-pbt-tuner branch April 3, 2020 07:54
Development

Successfully merging this pull request may close these issues.

Add Population Based Training Tuner