This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Support remote training service use reuse mode #2923

Merged: 49 commits, Oct 10, 2020
Changes from 34 commits
dcd2ffd
Merge pull request #251 from microsoft/master
SparkSnail May 29, 2020
3b8b6fb
Merge pull request #252 from microsoft/master
SparkSnail Jun 7, 2020
916e444
Merge pull request #253 from microsoft/master
SparkSnail Jun 15, 2020
caeffb8
Merge pull request #254 from microsoft/master
SparkSnail Jun 17, 2020
57c300e
Merge pull request #255 from microsoft/master
SparkSnail Jun 28, 2020
65660e6
Merge pull request #257 from microsoft/master
SparkSnail Jun 30, 2020
9376d6a
Merge pull request #258 from microsoft/master
SparkSnail Jul 1, 2020
5fef3cf
Merge pull request #259 from microsoft/master
SparkSnail Jul 3, 2020
5544ae8
Merge pull request #261 from microsoft/master
SparkSnail Jul 10, 2020
f9fdfee
Merge pull request #262 from microsoft/master
SparkSnail Jul 16, 2020
c5e26ef
add trial job detail link
SparkSnail Jul 19, 2020
10a04ba
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Jul 23, 2020
aa64fe6
Merge pull request #263 from microsoft/master
SparkSnail Jul 27, 2020
4ed907f
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Jul 27, 2020
c6a5f8c
Merge pull request #264 from microsoft/master
SparkSnail Jul 31, 2020
68abe2f
Merge pull request #265 from microsoft/master
SparkSnail Aug 4, 2020
c2b50d2
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Aug 6, 2020
14e9619
Merge pull request #266 from microsoft/master
SparkSnail Aug 13, 2020
f69e206
Merge pull request #267 from microsoft/master
SparkSnail Aug 13, 2020
a5bb753
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Aug 21, 2020
12ef0aa
Merge pull request #270 from microsoft/master
SparkSnail Sep 10, 2020
7600a0f
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Sep 10, 2020
ddcf229
Merge pull request #271 from microsoft/master
SparkSnail Sep 15, 2020
bd327d4
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Sep 15, 2020
c4f6e66
Merge pull request #272 from microsoft/master
SparkSnail Sep 21, 2020
da2d1c4
Merge branch 'master' of https://github.com/SparkSnail/nni
SparkSnail Sep 21, 2020
529c29f
init
SparkSnail Sep 21, 2020
2a386d3
init
SparkSnail Sep 21, 2020
169e65f
init
SparkSnail Sep 21, 2020
88f8c1b
Merge pull request #273 from microsoft/master
SparkSnail Sep 22, 2020
870b2d0
Merge branch 'master' of https://github.com/SparkSnail/nni into dev-r…
SparkSnail Sep 22, 2020
60ff833
init
SparkSnail Sep 22, 2020
c4fa1c3
init
SparkSnail Sep 23, 2020
4e56975
fix eslint
SparkSnail Sep 25, 2020
a428853
fix gpu scheduler
SparkSnail Sep 29, 2020
3b57f94
init
SparkSnail Oct 9, 2020
8d106ba
update
SparkSnail Oct 9, 2020
492ff8e
fix eslint
SparkSnail Oct 9, 2020
41e3ebd
update doc
SparkSnail Oct 9, 2020
9b5b3f7
update doc
SparkSnail Oct 9, 2020
1dabc88
format code
SparkSnail Oct 9, 2020
c68a7f3
fix comments
SparkSnail Oct 9, 2020
abd660c
fix comments
SparkSnail Oct 9, 2020
d998599
remove machine scheduler
SparkSnail Oct 9, 2020
c8ec30a
fix comments
SparkSnail Oct 10, 2020
ebc12d2
fix comments
SparkSnail Oct 10, 2020
e772871
fix comments
SparkSnail Oct 10, 2020
863100c
fix build
SparkSnail Oct 10, 2020
1387f38
fix eslint
SparkSnail Oct 10, 2020
8 changes: 8 additions & 0 deletions docs/en_US/Tutorial/ExperimentConfig.md
@@ -592,6 +592,14 @@ Specifies the pre-command that will be executed before the remote machine execut

__Note__: Because __preCommand__ is executed before other commands every time, setting a __preCommand__ that changes the system (e.g. `mkdir` or `touch`) is strongly discouraged.

### remoteConfig
Review comment (Contributor): It is a little strange to have both "machineList" and "remoteConfig" at the same level.

Reply (Contributor Author): machineList is a list-type field. I considered merging machineList under remoteConfig, but that may cause compatibility problems.

Optional field in remote mode. Set remote machine related configuration.
Review comment (Contributor): The description is not clear.

Reply (Contributor Author): Updated.

#### reuse

Optional. Set whether to use trial_runner to maintain multiple trials.
Review comment (Contributor): This is also not clear. You could describe the benefit when reuse is set to True.

Reply (Contributor Author): Updated.

### kubeflowConfig

#### operator
5 changes: 1 addition & 4 deletions src/nni_manager/main.ts
@@ -23,9 +23,6 @@ import { KubeflowTrainingService } from './training_service/kubernetes/kubeflow/
import { LocalTrainingService } from './training_service/local/localTrainingService';
import { RouterTrainingService } from './training_service/reusable/routerTrainingService';
import { PAIYarnTrainingService } from './training_service/pai/paiYarn/paiYarnTrainingService';
import {
RemoteMachineTrainingService
} from './training_service/remote_machine/remoteMachineTrainingService';
import { DLTSTrainingService } from './training_service/dlts/dltsTrainingService';

function initStartupInfo(
@@ -43,7 +40,7 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN
.scope(Scope.Singleton);
} else if (platformMode === 'remote') {
Container.bind(TrainingService)
.to(RemoteMachineTrainingService)
.to(RouterTrainingService)
.scope(Scope.Singleton);
} else if (platformMode === 'pai') {
Container.bind(TrainingService)
3 changes: 3 additions & 0 deletions src/nni_manager/rest_server/restValidationSchemas.ts
@@ -166,6 +166,9 @@ export namespace ValidationSchemas {
}),
nni_manager_ip: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
nniManagerIp: joi.string().min(1)
}),
remote_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
reuse: joi.boolean()
})
}
};
@@ -10,6 +10,7 @@ export enum TrialConfigMetadataKey {
MACHINE_LIST = 'machine_list',
LOCAL_CONFIG = 'local_config',
TRIAL_CONFIG = 'trial_config',
REMOTE_CONFIG = 'remote_config',
EXPERIMENT_ID = 'experimentId',
MULTI_PHASE = 'multiPhase',
RANDOM_SCHEDULER = 'random_scheduler',
Large diffs are not rendered by default.

43 changes: 43 additions & 0 deletions src/nni_manager/training_service/reusable/remote/remoteConfig.ts
@@ -0,0 +1,43 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

import { EnvironmentInformation } from '../environment';
import { RemoteMachineTrialJobDetail } from '../../remote_machine/remoteMachineData';
import { TrialJobApplicationForm } from '../../../common/trainingService';


/**
 * work around here, need RemoteMachineTrialJobDetail data structure to schedule machines
 */
export class RemoteMachineMetaDetail extends RemoteMachineTrialJobDetail {
    constructor() {
        // work around, the form data is a placeholder
        const form: TrialJobApplicationForm = {
            sequenceId: 0,
            hyperParameters: {
                value: '',
                index: 0
            }
        };
        super('', 'WAITING', 1, '', form);
    }
}

/**
 * RemoteMachineEnvironmentInformation
 */
export class RemoteMachineEnvironmentInformation extends EnvironmentInformation {
    public rmMachineMetaDetail?: RemoteMachineMetaDetail;
}

export class RemoteConfig {
    public readonly reuse: boolean;

    /**
     * Constructor
     * @param reuse If job is reusable for multiple trials
     */
    constructor(reuse: boolean) {
        this.reuse = reuse;
    }
}
15 changes: 15 additions & 0 deletions src/nni_manager/training_service/reusable/routerTrainingService.ts
@@ -12,12 +12,15 @@ import { delay } from '../../common/utils';
import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
import { PAIClusterConfig } from '../pai/paiConfig';
import { PAIK8STrainingService } from '../pai/paiK8S/paiK8STrainingService';
import { RemoteMachineTrainingService } from '../remote_machine/remoteMachineTrainingService';
import { EnvironmentService } from './environment';
import { OpenPaiEnvironmentService } from './environments/openPaiEnvironmentService';
import { AMLEnvironmentService } from './environments/amlEnvironmentService';
import { RemoteEnvironmentService } from './environments/remoteEnvironmentService';
import { MountedStorageService } from './storages/mountedStorageService';
import { StorageService } from './storageService';
import { TrialDispatcher } from './trialDispatcher';
import { RemoteConfig } from './remote/remoteConfig';


/**
@@ -146,6 +149,18 @@ class RouterTrainingService implements TrainingService {
await this.internalTrainingService.setClusterMetadata(key, value);

this.metaDataCache.clear();
} else if (key === TrialConfigMetadataKey.REMOTE_CONFIG) {
const config = <RemoteConfig>JSON.parse(value);
if (config.reuse === true) {
this.log.info(`reuse flag enabled, use EnvironmentManager.`);
this.internalTrainingService = component.get(TrialDispatcher);
Container.bind(EnvironmentService)
.to(RemoteEnvironmentService)
.scope(Scope.Singleton);
} else {
this.log.info(`reuse flag disabled, use RemoteMachineTrainingService.`);
this.internalTrainingService = component.get(RemoteMachineTrainingService);
}
} else {
this.log.debug(`caching metadata key:{} value:{}, as training service is not determined.`);
this.metaDataCache.set(key, value);
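
The branch added to `setClusterMetadata` picks the internal training service from the `reuse` flag carried in the `remote_config` metadata. A minimal Python sketch of that decision follows, for illustration only: `select_remote_backend` and the returned label strings are hypothetical names, not NNI APIs, and the sketch leaves out the dependency-injection binding done in the TypeScript code.

```python
import json


def select_remote_backend(remote_config_json: str) -> str:
    """Return a label for the backend the router would pick (hypothetical helper)."""
    config = json.loads(remote_config_json)
    # Mirrors `config.reuse === true`: only a literal boolean true enables reuse.
    if config.get("reuse") is True:
        return "TrialDispatcher+RemoteEnvironmentService"
    # Anything else (false, missing, or a non-boolean value) keeps the old path.
    return "RemoteMachineTrainingService"
```

With this strict comparison, a missing `reuse` key or a truthy non-boolean such as `1` falls back to the non-reuse service, which matches the `=== true` check in the diff.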
8 changes: 7 additions & 1 deletion tools/nni_cmd/config_schema.py
@@ -372,6 +372,12 @@ def validate(self, data):
})
}

remote_config_schema = {
    Optional('remoteConfig'): {
        'reuse': setType('reuse', bool)
    }
}

machine_list_schema = {
'machineList': [Or(
{
@@ -399,7 +405,7 @@ def validate(self, data):

training_service_schema_dict = {
'local': Schema({**common_schema, **common_trial_schema}),
'remote': Schema({**common_schema, **common_trial_schema, **machine_list_schema}),
'remote': Schema({**common_schema, **common_trial_schema, **machine_list_schema, **remote_config_schema}),
'pai': Schema({**common_schema, **pai_trial_schema, **pai_config_schema}),
'paiYarn': Schema({**common_schema, **pai_yarn_trial_schema, **pai_yarn_config_schema}),
'kubeflow': Schema({**common_schema, **kubeflow_trial_schema, **kubeflow_config_schema}),
8 changes: 6 additions & 2 deletions tools/nni_cmd/launcher.py
@@ -138,6 +138,7 @@ def set_remote_config(experiment_config, port, config_file_name):
'''Call setClusterMetadata to pass trial'''
#set machine_list
request_data = dict()
request_data['remote_config'] = experiment_config['remoteConfig']
request_data['machine_list'] = experiment_config['machineList']
if request_data['machine_list']:
for i in range(len(request_data['machine_list'])):
@@ -301,7 +302,6 @@ def set_experiment(experiment_config, mode, port, config_file_name):
request_data['maxTrialNum'] = experiment_config['maxTrialNum']
request_data['searchSpace'] = experiment_config.get('searchSpace')
request_data['trainingServicePlatform'] = experiment_config.get('trainingServicePlatform')

if experiment_config.get('description'):
request_data['description'] = experiment_config['description']
if experiment_config.get('multiPhase'):
@@ -332,7 +332,6 @@ def set_experiment(experiment_config, mode, port, config_file_name):
request_data['versionCheck'] = experiment_config.get('versionCheck')
if experiment_config.get('logCollection'):
request_data['logCollection'] = experiment_config.get('logCollection')

request_data['clusterMetaData'] = []
if experiment_config['trainingServicePlatform'] == 'local':
request_data['clusterMetaData'].append(
@@ -344,6 +343,11 @@ def set_experiment(experiment_config, mode, port, config_file_name):
{'key': 'machine_list', 'value': experiment_config['machineList']})
request_data['clusterMetaData'].append(
{'key': 'trial_config', 'value': experiment_config['trial']})
if not experiment_config.get('remoteConfig'):
# set default value of reuse in remoteConfig to False
experiment_config['remoteConfig'] = {'reuse': False}
request_data['clusterMetaData'].append(
{'key': 'remote_config', 'value': experiment_config['remoteConfig']})
elif experiment_config['trainingServicePlatform'] == 'pai':
request_data['clusterMetaData'].append(
{'key': 'pai_config', 'value': experiment_config['paiConfig']})
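
The remote branch of `set_experiment` defaults `reuse` to `False` when the experiment config has no `remoteConfig`, then forwards it as a `remote_config` metadata entry. A small standalone sketch of that defaulting is shown below. The helper name `build_remote_cluster_metadata` is hypothetical, and unlike the diff it does not write the default back into `experiment_config`.

```python
def build_remote_cluster_metadata(experiment_config: dict) -> list:
    """Build the clusterMetaData entries for remote mode (illustrative helper)."""
    # Fall back to reuse=False when `remoteConfig` is missing or empty,
    # without mutating the caller's dict.
    remote_config = experiment_config.get('remoteConfig') or {'reuse': False}
    return [
        {'key': 'machine_list', 'value': experiment_config['machineList']},
        {'key': 'trial_config', 'value': experiment_config['trial']},
        {'key': 'remote_config', 'value': remote_config},
    ]
```

The same `.get(...) or default` pattern may also be useful wherever `experiment_config['remoteConfig']` is indexed directly, since a plain subscript raises `KeyError` when the optional field is absent.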
5 changes: 5 additions & 0 deletions tools/nni_trial_tool/trial_runner.py
@@ -27,6 +27,10 @@ def main_loop(args):
gpu_refresh_last_time = datetime.now() - timedelta(minutes=1)

try:
if args.job_pid_file:
Review comment (Member): If this used the same approach as OpenPAI, it would not be needed here.

Reply (Contributor Author, SparkSnail, Sep 29, 2020): This PID file is used to check the trial_runner process status for the environment. In OpenPAI, we use the RESTful API to get the trial_runner status, but in remote mode we need to maintain the PID file to get it.

with open(args.job_pid_file, 'w') as job_file:
job_file.write("%d" % os.getpid())

trials = dict()

command_channel = args.command_channel
@@ -143,6 +147,7 @@ def check_version(args):
PARSER.add_argument('--nni_manager_version', type=str, help='the nni version transmitted from nniManager')
PARSER.add_argument('--log_collection', type=str, help='set the way to collect log in trial runner')
PARSER.add_argument('--node_count', type=int, help='number of nodes, it determines how to consume command and save code file')
PARSER.add_argument('--job_pid_file', type=str, help='save trial runner process pid')
args, unknown = PARSER.parse_known_args()

setting_file = "settings.json"
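
As the author's reply in the review thread explains, the PID file lets the remote environment check whether trial_runner is still alive. How the remote environment service actually performs that check is not visible in the rendered diff, so the sketch below is only one plausible approach under stated assumptions: `write_pid_file` and `is_process_alive` are hypothetical helper names, the check runs on the same POSIX host as the process, and liveness is probed with signal 0.

```python
import os


def write_pid_file(path: str) -> None:
    """Record the current process ID, as trial_runner does with --job_pid_file."""
    with open(path, 'w') as pid_file:
        pid_file.write("%d" % os.getpid())


def is_process_alive(path: str) -> bool:
    """Return True if the PID stored in `path` refers to a running process (POSIX)."""
    try:
        with open(path) as pid_file:
            pid = int(pid_file.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed PID file: treat as not running.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A stale PID file can point at an unrelated process that reused the same PID, so a production check would likely need an extra safeguard such as comparing the process command line.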