This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Reusable environment support GPU scheduler, add test cases and refactoring. #2627

Merged
merged 25 commits into master from dev-2391-improvement
Jul 30, 2020

Conversation

squirrelsc
Member

@squirrelsc squirrelsc commented Jul 3, 2020

GPU scheduler experience

  1. Add gpuNum to paiConfig; it is used when submitting the OpenPAI job template. So, if gpuNum is 1 in the trial config and 4 in the pai config, one job can run 4 trials in parallel.
  2. Add maxTrialNumPerGpu and useActiveGpu to paiConfig. The pattern follows remote machine mode, but all OpenPAI jobs share the same settings.
  3. Add cpuNum and memoryMB to paiConfig. The same settings also exist in the trial config, and either of them works, but since they describe the OpenPAI environment, it makes sense to keep them in the pai config.
  4. Using the GPU scheduler on a multi-node environment is not supported. First, it is hard to control in this case; second, multi-node is for distributed training of one trial, while the GPU scheduler means one node serves multiple trials. Both are rarely needed together.
  5. The GPU scheduler is enabled only when enableGpuCollector is set and gpuNum in the trial config is specified. The GPU count of an environment is collected at runtime.
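The maxTrialNumPerGpu / useActiveGpu behavior described above can be sketched as follows. This is an illustrative Python sketch, not the actual TypeScript implementation; the field names (active_process_num, assigned_trial_num) are assumptions for illustration only.

```python
def pick_gpu(gpu_infos, max_trial_num_per_gpu=1, use_active_gpu=False):
    """Pick a GPU index for a new trial, or return None if no GPU qualifies.

    gpu_infos: list of dicts, one per GPU, with
      'active_process_num' - processes started outside NNI on this GPU
      'assigned_trial_num' - trials NNI already placed on this GPU
    (hypothetical field names for illustration only).
    """
    for index, info in enumerate(gpu_infos):
        if info["active_process_num"] > 0 and not use_active_gpu:
            continue  # skip GPUs occupied by non-NNI processes
        if info["assigned_trial_num"] >= max_trial_num_per_gpu:
            continue  # this GPU already holds its allowed number of trials
        return index
    return None
```

With defaults, a GPU that has any foreign process or has reached maxTrialNumPerGpu is skipped; setting useActiveGpu makes busy GPUs eligible again.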

GPU scheduler code changes

  1. Copied the GPU scheduler logic from remote machine mode, since it supports multiple environments, unlike local mode.
  2. Add TrialGpuSummary to save assignment states together with GPU info.
  3. Add runningTrialCount to the environment, to support non-GPU scheduling well.
  4. Add usableGpus to the environment, so it can specify which GPUs are assigned to NNI. It is not exposed to end users, but it is already covered by UT.
  5. Although multi-node does not support the GPU scheduler, GPU information of all nodes is collected. defaultGpuSummary is used to get the GPU information for the GPU scheduler.
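The bookkeeping in points 2-5 can be sketched roughly like this. This is an illustrative Python sketch under assumed names; the real code is TypeScript and its exact shapes differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TrialGpuSummary:
    """GPU info of one node plus which trials are assigned to each GPU."""
    gpu_count: int
    # gpu index -> ids of trials scheduled on that GPU
    assigned_trials: Dict[int, List[str]] = field(default_factory=dict)

class EnvironmentInformation:
    def __init__(self) -> None:
        self.gpu_summaries: Dict[int, TrialGpuSummary] = {}  # node id -> summary
        self.running_trial_count = 0    # also supports non-GPU scheduling
        self.usable_gpus: Optional[List[int]] = None  # restrict GPUs given to NNI

    def default_gpu_summary(self) -> Optional[TrialGpuSummary]:
        # Multi-node environments collect GPU info for every node, but the
        # GPU scheduler only consults the first node's summary.
        return self.gpu_summaries.get(0)
```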

Refactoring

  1. Add an isRunnerReady flag so that environment.status only stores the environment status.
  2. Add reuseEnvironment to the trial config, not exposed to users yet. It can be used to support single-trial environments in the future; assignedTrialCount in the environment supports this feature.
  3. Make the channel in the runner more stable when an environment exits due to timeout.
  4. Make the environment loop interval configurable for different environment services.
  5. Add a hasMoreEnvironments flag, so that an environment service can stop creating environments if there is any error. The remote service can stop creating more when all servers are in use.
  6. Add a shouldUpdateTrials flag in TrialDispatcher, so that the loop can run faster. It saves time not only for UT, but also for production.
  7. Fix some bugs in the storage service that were found by UT.
  8. Fix a TrialDispatcher bug: if metrics are sent too fast, some may be lost.
  9. Update getIPV4Address to let the user know how to fix failures.
  10. Remove useless code in AmlCommandChannel and AMLEnvironmentService.
  11. Move some shared code out of the remote machine GPU scheduler.
  12. Add some flags to prevent too many duplicate logs.
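Points 4 and 6 can be illustrated with a small sketch of the dispatcher loop timing. This is illustrative Python only; class and flag names mirror the description but are assumptions, not the actual TypeScript code.

```python
class TrialDispatcherLoop:
    """Sketch of a polling loop whose interval is configurable (point 4)
    and which can be woken early by a shouldUpdateTrials flag (point 6)."""

    def __init__(self, interval_seconds: float = 5.0) -> None:
        self.interval_seconds = interval_seconds    # per environment service
        self.should_update_trials = False           # set by metric/status callbacks

    def notify_trial_update(self) -> None:
        """Called when new trial data (e.g. a metric) arrives."""
        self.should_update_trials = True

    def next_wait(self) -> float:
        # If new trial data arrived, skip the sleep so both UT and
        # production react immediately instead of waiting a full interval.
        if self.should_update_trials:
            self.should_update_trials = False
            return 0.0
        return self.interval_seconds
```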

Test updates

  1. Update test code to support IT.
  2. Add UT for TrialDispatcher and the mounted storage service.

@squirrelsc squirrelsc linked an issue Jul 6, 2020 that may be closed by this pull request
@squirrelsc squirrelsc force-pushed the dev-2391-improvement branch from 566c148 to 0e8c494 Compare July 6, 2020 09:28
@squirrelsc squirrelsc changed the title Reusable environment tests and refactoring Reusable environment support GPU scheduler, add test cases and refactoring. Jul 10, 2020
@scarlett2018 scarlett2018 mentioned this pull request Jul 14, 2020
// Temporarily, no enough available GPU right now
TMP_NO_AVAILABLE_GPU,

// Cannot match requirement even if all GPU are a
Contributor

a => available?

Member Author

Copied from the previous code; it should read like below:

No environment can match the hard requirement, e.g. when the GPU count is smaller than what the trial asked for.

this.userName = userName;
this.passWord = passWord;
this.host = host;
this.token = token;
this.reuse = reuse;
this.cpuNum = cpuNum;
this.memoryMB = memoryMB;
this.gpuNum = gpuNum;
Contributor

In other platforms like pai, remote, and kubeflow, we added the gpuNum and cpuNum settings in trialConfig, not in a cluster config field. I think we'd better unify them.

Member Author

There are three things about gpuNum:

  1. The settings here tell OpenPAI what kind of environment is needed. This is unique to OpenPAI, not to remote or aml; I'm not sure about kubeflow.
  2. Environment capacity. This is general to all environments and doesn't need to be configured like this, since the GPU collector (nvidia-smi) detects the environment capacity at runtime.
  3. The requirement of a single trial. It's in trialConfig. With the environment capacity above, we can schedule multiple trials on one environment. BTW, cpuNum and memoryMB in the trial config are not used today; they may be useful in the future.
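Combining the detected capacity with the per-trial requirement, an upper bound on concurrent trials per environment follows directly. A minimal sketch, assuming whole-GPU arithmetic; the real scheduler also tracks per-GPU assignment state.

```python
def trials_per_environment(env_gpu_count: int,
                           trial_gpu_num: int,
                           max_trial_num_per_gpu: int = 1) -> int:
    """Upper bound on concurrent trials in one environment, given the
    runtime-detected GPU capacity and the per-trial gpuNum requirement."""
    if trial_gpu_num <= 0:
        raise ValueError("trial gpuNum must be positive for GPU scheduling")
    # Each GPU may be shared by up to max_trial_num_per_gpu trials.
    return (env_gpu_count * max_trial_num_per_gpu) // trial_gpu_num
```

For example, an environment with 4 GPUs and a trial gpuNum of 1 can host 4 trials in parallel, matching the paiConfig example in the PR description.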

@@ -39,6 +39,7 @@
"@types/express": "^4.16.0",
"@types/glob": "^7.1.1",
"@types/js-base64": "^2.3.1",
"@types/js-yaml": "^3.12.5",
Contributor

Doesn't this js-yaml package already exist? Refer to line 72.

Member Author

It is used to provide a TypeScript type declaration, so that we can import the js library like import * as yaml from 'js-yaml'; instead of const yaml = require('js-yaml');.

@@ -0,0 +1,235 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
Contributor

It seems the main logic of this scheduler is very similar to the remote machine GPU scheduler; I suggest refactoring the remote machine GPU scheduler and reusing it for EnvironmentInformation.

Member Author

Let's make minimal changes each time. The original GpuScheduler has many dependencies on the remote machine schema, and the new GpuScheduler also changed the schema of the GPU info.

The long-term goal is to migrate remote machine mode to TrialDispatcher. That can improve remote machine performance and make it more stable through UT at the TrialDispatcher level.

'userName': setType('userName', str),
'token': setType('token', str),
Or('passWord', 'token', only_one=True): str,
Contributor

What is only_one used for?

Member Author

It selects only one key from the Or().
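In other words, exactly one of the alternative keys may appear: passWord XOR token. A standalone sketch of that check, independent of the schema-library subclass the actual config validation uses:

```python
def validate_only_one(config: dict, keys: tuple) -> str:
    """Require that exactly one of `keys` is present in config,
    e.g. 'passWord' XOR 'token'. Returns the key that was found."""
    present = [key for key in keys if key in config]
    if len(present) != 1:
        raise ValueError(
            f"exactly one of {keys} must be set, found: {present or 'none'}")
    return present[0]
```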

@squirrelsc
Member Author

@SparkSnail Shinai, please help merge the code, if you think it's ok.

@SparkSnail
Contributor

@SparkSnail Shinai, please help merge the code, if you think it's ok.

ok, will merge it after the pipeline pass.

@squirrelsc squirrelsc left a comment
Member Author

@SparkSnail Shinai, please help merge the code, if you think it's ok.

ok, will merge it after the pipeline pass.

How about aml? does it support nvidia-smi?


@SparkSnail
Contributor

SparkSnail commented Jul 30, 2020

@SparkSnail Shinai, please help merge the code, if you think it's ok.

ok, will merge it after the pipeline pass.

How about aml? does it support nvidia-smi?


Cannot detect nvidia-smi in my test case; not sure whether it is related to the aml cluster. I'll double confirm, but due to lack of resources one experiment needs to queue for a very long time, so I'm still waiting. Let's merge this PR first; aml will get another PR after the confirmation.

@SparkSnail SparkSnail merged commit 143c661 into master Jul 30, 2020
@SparkSnail
Contributor


Double confirmed: aml can get the nvidia-smi command in another cluster.
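A probe like the following is one way to check whether a cluster image exposes nvidia-smi at all, as the GPU collector requires. This is a minimal sketch, not NNI's actual collector code.

```python
import shutil
import subprocess

def nvidia_smi_available(timeout: float = 10.0) -> bool:
    """Return True if nvidia-smi exists on PATH and runs successfully."""
    path = shutil.which("nvidia-smi")
    if path is None:
        return False  # binary not installed in this image
    try:
        # '-L' lists GPUs; a zero exit code means the driver answered.
        subprocess.run([path, "-L"], capture_output=True,
                       timeout=timeout, check=True)
        return True
    except (subprocess.SubprocessError, OSError):
        return False
```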

@squirrelsc
Member Author

Double confirmed, aml can get nvidia-smi command in other cluster.

Can you try whether the current trial.py can get it as well?

@liuzhe-lz liuzhe-lz deleted the dev-2391-improvement branch October 18, 2020 04:32

Successfully merging this pull request may close these issues.

Advancing job performance: working pool to reuse job
3 participants