This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Reusable environment support GPU scheduler, add test cases and refactoring. #2627

Merged
merged 25 commits into master from dev-2391-improvement
Jul 30, 2020

Conversation

squirrelsc
Member

@squirrelsc squirrelsc commented Jul 3, 2020

GPU scheduler experience

  1. Add gpuNum to paiConfig; it is used when submitting the OpenPAI job template. So, if gpuNum is 1 in the trial config and 4 in the pai config, one job can run 4 trials in parallel.
  2. Add maxTrialNumPerGpu and useActiveGpu to paiConfig. The pattern follows remote machine mode, but all OpenPAI jobs share the same settings.
  3. Add cpuNum and memoryMB to paiConfig. The same settings also exist in the trial config, and either of them works, but since they describe the OpenPAI environment, it makes sense to keep them in the pai config.
  4. Using the GPU scheduler on a multi-node environment is not supported. First, it is hard to control in this case; second, multi-node is for distributed training of one trial, while the GPU scheduler means one node serves multiple trials. Both are rarely needed together.
  5. The GPU scheduler is enabled only when enableGpuCollector is set and gpuNum in the trial config is specified. The GPU count of an environment is collected at runtime.
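The maxTrialNumPerGpu / useActiveGpu behavior described above can be sketched as follows. This is an illustrative Python sketch, not the actual TypeScript implementation; the field names (active_process_num, assigned_trial_num) are assumptions for illustration only.

```python
def pick_gpu(gpu_infos, max_trial_num_per_gpu=1, use_active_gpu=False):
    """Pick a GPU index for a new trial, or return None if no GPU qualifies.

    gpu_infos: list of dicts, one per GPU, with
      'active_process_num' - processes started outside NNI on this GPU
      'assigned_trial_num' - trials NNI already placed on this GPU
    (hypothetical field names for illustration only).
    """
    for index, info in enumerate(gpu_infos):
        if info["active_process_num"] > 0 and not use_active_gpu:
            continue  # skip GPUs occupied by non-NNI processes
        if info["assigned_trial_num"] >= max_trial_num_per_gpu:
            continue  # this GPU already holds its allowed number of trials
        return index
    return None
```

With defaults, a GPU that has any foreign process or has reached maxTrialNumPerGpu is skipped; setting useActiveGpu makes busy GPUs eligible again.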

GPU scheduler code changes

  1. Copied the GPU scheduler logic from remote machine mode, since it supports multiple environments, unlike local mode.
  2. Add TrialGpuSummary to save assignment states together with GPU info.
  3. Add runningTrialCount to the environment, to support non-GPU scheduling well.
  4. Add usableGpus to the environment, so it can specify which GPUs are assigned to NNI. It is not exposed to end users, but it is already covered by UT.
  5. Although multi-node does not support the GPU scheduler, GPU information of all nodes is collected. defaultGpuSummary is used to get the GPU information for the GPU scheduler.
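The bookkeeping in points 2-5 can be sketched roughly like this. This is an illustrative Python sketch under assumed names; the real code is TypeScript and its exact shapes differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TrialGpuSummary:
    """GPU info of one node plus which trials are assigned to each GPU."""
    gpu_count: int
    # gpu index -> ids of trials scheduled on that GPU
    assigned_trials: Dict[int, List[str]] = field(default_factory=dict)

class EnvironmentInformation:
    def __init__(self) -> None:
        self.gpu_summaries: Dict[int, TrialGpuSummary] = {}  # node id -> summary
        self.running_trial_count = 0    # also supports non-GPU scheduling
        self.usable_gpus: Optional[List[int]] = None  # restrict GPUs given to NNI

    def default_gpu_summary(self) -> Optional[TrialGpuSummary]:
        # Multi-node environments collect GPU info for every node, but the
        # GPU scheduler only consults the first node's summary.
        return self.gpu_summaries.get(0)
```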

Refactoring

  1. Add an isRunnerReady flag so that environment.status only stores the environment status.
  2. Add reuseEnvironment to the trial config, not exposed to users yet. It can be used to support single-trial environments in the future; assignedTrialCount in the environment supports this feature.
  3. Make the channel in the runner more stable when an environment exits due to timeout.
  4. Make the environment loop interval configurable for different environment services.
  5. Add a hasMoreEnvironments flag, so that an environment service can stop creating environments if there is any error. The remote service can stop creating more when all servers are in use.
  6. Add a shouldUpdateTrials flag in TrialDispatcher, so that the loop can run faster. It saves time not only for UT, but also for production.
  7. Fix some bugs in the storage service that were found by UT.
  8. Fix a TrialDispatcher bug: if metrics are sent too fast, some may be lost.
  9. Update getIPV4Address to let the user know how to fix failures.
  10. Remove useless code in AmlCommandChannel and AMLEnvironmentService.
  11. Move some shared code out of the remote machine GPU scheduler.
  12. Add some flags to prevent too many duplicate logs.
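Points 4 and 6 can be illustrated with a small sketch of the dispatcher loop timing. This is illustrative Python only; class and flag names mirror the description but are assumptions, not the actual TypeScript code.

```python
class TrialDispatcherLoop:
    """Sketch of a polling loop whose interval is configurable (point 4)
    and which can be woken early by a shouldUpdateTrials flag (point 6)."""

    def __init__(self, interval_seconds: float = 5.0) -> None:
        self.interval_seconds = interval_seconds    # per environment service
        self.should_update_trials = False           # set by metric/status callbacks

    def notify_trial_update(self) -> None:
        """Called when new trial data (e.g. a metric) arrives."""
        self.should_update_trials = True

    def next_wait(self) -> float:
        # If new trial data arrived, skip the sleep so both UT and
        # production react immediately instead of waiting a full interval.
        if self.should_update_trials:
            self.should_update_trials = False
            return 0.0
        return self.interval_seconds
```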

Test updates

  1. Update test code to support IT.
  2. Add UT for TrialDispatcher and the mounted storage service.

@squirrelsc squirrelsc linked an issue Jul 6, 2020 that may be closed by this pull request
@squirrelsc squirrelsc force-pushed the dev-2391-improvement branch from 566c148 to 0e8c494 Compare July 6, 2020 09:28
@squirrelsc squirrelsc changed the title Reusable environment tests and refactoring Reusable environment support GPU scheduler, add test cases and refactoring. Jul 10, 2020
@scarlett2018 scarlett2018 mentioned this pull request Jul 14, 2020
// Temporarily, no enough available GPU right now
TMP_NO_AVAILABLE_GPU,

// Cannot match requirement even if all GPU are a
Contributor

a => available?

Member Author

Copied from the previous code; it should read like below:

No environment can match the hard requirement, e.g. when the GPU count is smaller than what the trial asked for.

this.userName = userName;
this.passWord = passWord;
this.host = host;
this.token = token;
this.reuse = reuse;
this.cpuNum = cpuNum;
this.memoryMB = memoryMB;
this.gpuNum = gpuNum;
Contributor

In other platforms like pai, remote, and kubeflow, we added the gpuNum and cpuNum settings in trialConfig, not in a cluster config field. I think we'd better unify them.

Member Author

There are three things about gpuNum:

  1. The settings here tell OpenPAI what kind of environment is needed. This is unique to OpenPAI, not to remote or aml; I'm not sure about kubeflow.
  2. Environment capacity. This is general to all environments and doesn't need to be configured like this, since the GPU collector (nvidia-smi) detects the environment capacity at runtime.
  3. The requirement of a single trial. It's in trialConfig. With the environment capacity above, we can schedule multiple trials on one environment. BTW, cpuNum and memoryMB in the trial config are not used today; they may be useful in the future.
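Combining the detected capacity with the per-trial requirement, an upper bound on concurrent trials per environment follows directly. A minimal sketch, assuming whole-GPU arithmetic; the real scheduler also tracks per-GPU assignment state.

```python
def trials_per_environment(env_gpu_count: int,
                           trial_gpu_num: int,
                           max_trial_num_per_gpu: int = 1) -> int:
    """Upper bound on concurrent trials in one environment, given the
    runtime-detected GPU capacity and the per-trial gpuNum requirement."""
    if trial_gpu_num <= 0:
        raise ValueError("trial gpuNum must be positive for GPU scheduling")
    # Each GPU may be shared by up to max_trial_num_per_gpu trials.
    return (env_gpu_count * max_trial_num_per_gpu) // trial_gpu_num
```

For example, an environment with 4 GPUs and a trial gpuNum of 1 can host 4 trials in parallel, matching the paiConfig example in the PR description.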

@@ -39,6 +39,7 @@
"@types/express": "^4.16.0",
"@types/glob": "^7.1.1",
"@types/js-base64": "^2.3.1",
"@types/js-yaml": "^3.12.5",
Contributor

Doesn't this js-yaml package already exist? Refer to line 72.

Member Author

It is used to provide a TypeScript type declaration, so that we can import the js library like import * as yaml from 'js-yaml'; instead of const yaml = require('js-yaml');.

@@ -0,0 +1,235 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
Contributor

It seems the main logic of this scheduler is very similar to the remote machine GPU scheduler; I suggest refactoring the remote machine GPU scheduler and reusing it for EnvironmentInformation.

Member Author

Let's make minimal changes each time. The original GpuScheduler has many dependencies on the remote machine schema, and the new GpuScheduler also changed the schema of the GPU info.

The long-term goal is to migrate remote machine mode to TrialDispatcher. That can improve remote machine performance and make it more stable through UT at the TrialDispatcher level.

'userName': setType('userName', str),
'token': setType('token', str),
Or('passWord', 'token', only_one=True): str,
Contributor

What is only_one used for?

Member Author

It selects only one key from the Or().
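In other words, exactly one of the alternative keys may appear: passWord XOR token. A standalone sketch of that check, independent of the schema-library subclass the actual config validation uses:

```python
def validate_only_one(config: dict, keys: tuple) -> str:
    """Require that exactly one of `keys` is present in config,
    e.g. 'passWord' XOR 'token'. Returns the key that was found."""
    present = [key for key in keys if key in config]
    if len(present) != 1:
        raise ValueError(
            f"exactly one of {keys} must be set, found: {present or 'none'}")
    return present[0]
```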

@squirrelsc
Member Author

@SparkSnail Shinai, please help merge the code, if you think it's ok.

@SparkSnail
Contributor

@SparkSnail Shinai, please help merge the code, if you think it's ok.

ok, will merge it after the pipeline pass.

@squirrelsc squirrelsc left a comment
Member Author

@SparkSnail Shinai, please help merge the code, if you think it's ok.

ok, will merge it after the pipeline pass.

How about aml? does it support nvidia-smi?


@SparkSnail
Contributor

SparkSnail commented Jul 30, 2020

@SparkSnail Shinai, please help merge the code, if you think it's ok.

ok, will merge it after the pipeline pass.

How about aml? does it support nvidia-smi?


Cannot detect nvidia-smi in my test case; not sure whether it is related to the aml cluster. I'll double confirm, but due to lack of resources one experiment needs to queue for a very long time, so I'm still waiting. Let's merge this PR first; aml will get another PR after the confirmation.

@SparkSnail SparkSnail merged commit 143c661 into master Jul 30, 2020
@SparkSnail
Contributor


Double confirmed: aml can get the nvidia-smi command in another cluster.
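A probe like the following is one way to check whether a cluster image exposes nvidia-smi at all, as the GPU collector requires. This is a minimal sketch, not NNI's actual collector code.

```python
import shutil
import subprocess

def nvidia_smi_available(timeout: float = 10.0) -> bool:
    """Return True if nvidia-smi exists on PATH and runs successfully."""
    path = shutil.which("nvidia-smi")
    if path is None:
        return False  # binary not installed in this image
    try:
        # '-L' lists GPUs; a zero exit code means the driver answered.
        subprocess.run([path, "-L"], capture_output=True,
                       timeout=timeout, check=True)
        return True
    except (subprocess.SubprocessError, OSError):
        return False
```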

@squirrelsc
Member Author

Double confirmed, aml can get nvidia-smi command in other cluster.

Can you try whether the current trial.py can get it as well?

@liuzhe-lz liuzhe-lz deleted the dev-2391-improvement branch October 18, 2020 04:32

Successfully merging this pull request may close these issues.

Advancing job performance: working pool to reuse job
3 participants