-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Reusable environment support GPU scheduler, add test cases and refactoring. #2627
Conversation
and refine log.
566c148
to
0e8c494
Compare
… into dev-2391-improvement
// Temporarily, no enough available GPU right now | ||
TMP_NO_AVAILABLE_GPU, | ||
|
||
// Cannot match requirement even if all GPU are a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a => available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied from previous code, should be like below,
No environment can match the hard requirement, like the GPU number is smaller than trial asked.
this.userName = userName; | ||
this.passWord = passWord; | ||
this.host = host; | ||
this.token = token; | ||
this.reuse = reuse; | ||
this.cpuNum = cpuNum; | ||
this.memoryMB = memoryMB; | ||
this.gpuNum = gpuNum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in other platforms like pai, remote, kubeflow, we added gpuNum, cpuNum setting in trialConfig, not cluster config field. I think we better to unify them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three things about gpuNum,
- Settings are here to tell OpenPAI what kind of environment is needed. It's unique to OpenPAI, not to remote, aml. I'm not sure about kueflow.
- Environment capacity. It's general to all environments. It doesn't need config like this, since gpu collector (nvdia-smi) dectects environment capacity in runtime.
- The requirement of a
single
trial. It's in trialConfig. With above environment capacity, we can schedule multiple trials on one environment. BTW, cpuNum and memoryMB in trial config is not used today, it may be useful in future.
@@ -39,6 +39,7 @@ | |||
"@types/express": "^4.16.0", | |||
"@types/glob": "^7.1.1", | |||
"@types/js-base64": "^2.3.1", | |||
"@types/js-yaml": "^3.12.5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already has this js-yaml
package? refer line 72
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It uses to provide a TypeScript type declaration, so that we can import js lib like import * as yaml from 'js-yaml';
, instead of const yaml = require('js-yaml');
.
@@ -0,0 +1,235 @@ | |||
// Copyright (c) Microsoft Corporation. | |||
// Licensed under the MIT license. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the main logic of this scheduler is very similar with remote machine gpu scheduler, suggest to refactor remote machine gpu scheduler and reuse it for EnvironmentInformation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's minimum changes each time. The original GpuScheduler has many dependencies on remote machine schema, and new GpuScheduler changed schema of gpu info also.
The long term goal is to migrate remote machine to trialdispatcher. It can improve remote machine performance, and more stable by UT on trialDispatcher level.
'userName': setType('userName', str), | ||
'token': setType('token', str), | ||
Or('passWord', 'token', only_one=True): str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is only_one
used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select only one from the Or()
@SparkSnail Shinai, please help merge the code, if you think it's ok. |
ok, will merge it after the pipeline pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SparkSnail Shinai, please help merge the code, if you think it's ok.
ok, will merge it after the pipeline pass.
How about aml? does it support nvidia-smi?
'userName': setType('userName', str), | ||
'token': setType('token', str), | ||
Or('passWord', 'token', only_one=True): str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select only one from the Or()
Can not detect |
Double confirmed, aml can get |
Can you try if current trial.py can get it also? |
GPU scheduler experience
GPU scheduler code changes
Refactoring
Test updates