This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
Add recently-idle environment scheduler in reuse mode #3375
Merged
Changes from 149 commits
150 commits
dcd2ffd
Merge pull request #251 from microsoft/master
SparkSnail a738331
init changes
squirrelsc 3177aeb
Merge remote-tracking branch 'official/master' into 2391-reuse-job
squirrelsc 2aafac1
refactors
squirrelsc 0435b7f
refactoring
squirrelsc 2e5ef51
minor fix, and take some review comments.
squirrelsc 6d7bc62
move reuse to upper level
squirrelsc c67b162
support multi nodes
e13a620
fix eslint errors
59d4a71
support multi environments better
eae0540
Merge remote-tracking branch 'official/master' into 2391-reuse-job
81c49cf
code refactor
92cab3a
fix openpai yaml format
0674d88
fix k8s yaml schema
e5b9665
rename forward training service
67ef648
Merge remote-tracking branch 'official/master' into 2391-reuse-job
3b8b6fb
Merge pull request #252 from microsoft/master
SparkSnail 1e626fd
add trialService
c6b6061
not send stop for single node
916e444
Merge pull request #253 from microsoft/master
SparkSnail b8e47be
rename environmentManager to trialDispatcher
0ee933a
support no central storage service
c7973be
init
SparkSnail c094057
improve delopment support
c2735d3
Merge remote-tracking branch 'official/master' into 2391-reuse-job
d0b2504
use latest storage component
c8d4696
add gpu info
1a9f19f
work version
SparkSnail 3f4c177
separate channel and add gpu collector in runner
648e0bb
Merge branch '2391-reuse-job' of https://github.com/squirrelsc/nni in…
SparkSnail 2fa4a77
init
SparkSnail caeffb8
Merge pull request #254 from microsoft/master
SparkSnail d0768b0
add more GPU information, and improve debugging.
squirrelsc 8dff16f
fix GPU info collector
bea8ed6
Merge branch '2391-reuse-job' of https://github.com/squirrelsc/nni in…
SparkSnail e297aa5
update
SparkSnail 500c1cb
channel support single file
d880512
refine code, and implement command channel
5c33d11
update
SparkSnail 45424e8
support concurrent trials in runner.
9ca3444
implement web channel
5018039
fix eslint errors, and rename rest to web
283bceb
remove trial service, as it's replaced by channel.
0c67c5c
Merge remote-tracking branch 'official/master' into 2391-reuse-job
671f5d8
fix merged problem, and small refine for ut.
a65a810
fix pylint errors
6d36ae5
fix lint error
5e352f7
init
SparkSnail b9d1aa5
fix conflict
SparkSnail a3a91d8
format
SparkSnail 57c300e
Merge pull request #255 from microsoft/master
SparkSnail 69a5170
remove useless deferred.
edc4608
fix package
SparkSnail c1f0239
fix incorrect check logic
af97bb1
make license header consistent
10feb6a
Merge remote-tracking branch 'official/master' into 2391-reuse-job
c00cd31
add missed await.
78f1386
add doc and example
SparkSnail 586d6ac
support log level in UT
2db8ff8
refine interface to support aml better.
f631e4c
fix runtime error on exit
5982fb3
Merge remote-tracking branch 'official/master' into 2391-reuse-job
f687a6e
fix eslint error
476ffec
send metric data from channel
0f2367c
support version check
9d7bd3c
fix pylint errors
130ed27
fix non-local failed ITs
ab86080
fix comments
SparkSnail 4b11a53
fix conflict
SparkSnail 15ee064
fix conflict
SparkSnail 7c48610
format
SparkSnail c0c7d96
format code
SparkSnail 93eefb2
format code
SparkSnail 53cea0f
remove unused code
SparkSnail 34d9351
format code
SparkSnail 25a9dab
fix comments
SparkSnail cada76a
fix comments
SparkSnail de7dc7c
fix comments
SparkSnail 428dc3d
add blank line
SparkSnail 2e9c70e
fix comments
SparkSnail 8cf8583
fix comments
SparkSnail fd5fd9e
fix build
SparkSnail 54a22af
fix comments
SparkSnail 525b961
fix channel async calls
8ec5e7d
fix comments
SparkSnail bdd3840
fix comments
SparkSnail b341dce
fix comments
SparkSnail ce81c51
Merge remote-tracking branch 'snail/dev-aml' into 2391-improve
e66dc23
fix comments
SparkSnail 9cf6744
merge code logic
bd77f5c
Merge remote-tracking branch 'snail/dev-aml' into 2391-improve
ddfb0cc
Merge branch 'master' of https://github.com/microsoft/nni into dev-aml
SparkSnail 65660e6
Merge pull request #257 from microsoft/master
SparkSnail fec8a67
Merge branch 'master' of https://github.com/SparkSnail/nni into dev-aml
SparkSnail 5200a3a
Merge remote-tracking branch 'snail/dev-aml' into 2391-improve
51befa5
fix eslint errors
478629f
add run fo messages
c299ce1
Merge pull request #256 from squirrelsc/2391-improve
SparkSnail 0517e13
fix comments
SparkSnail fc4b978
sort class
SparkSnail e527743
fix eslint
SparkSnail b047681
fix eslint
SparkSnail 4acc7e8
fix annotation
SparkSnail ec1475a
fix import aml
SparkSnail 8eaeebf
fix comments
SparkSnail 56b6818
fix doc build
SparkSnail e09ff79
fix trial_runner import
SparkSnail ecf615d
fix doc
SparkSnail a7a3baf
fix pylint
SparkSnail 9376d6a
Merge pull request #258 from microsoft/master
SparkSnail 870f1e6
Merge branch 'master' of https://github.com/SparkSnail/nni into dev-aml
SparkSnail 7eaa105
add doc for aml
SparkSnail a0ea554
add content
SparkSnail 972822c
supplement dlts doc
SparkSnail b5f7f06
add doc content
SparkSnail f6d9c3f
fix content
SparkSnail f39036f
fix content
SparkSnail 5e366cf
fix broken link
SparkSnail 43b12b6
Merge branch 'v1.7' into dev-aml
squirrelsc 5fef3cf
Merge pull request #259 from microsoft/master
SparkSnail 9a864fd
Merge branch 'master' of https://github.com/SparkSnail/nni into dev-aml
SparkSnail 3515185
Merge branch 'dev-aml' of https://github.com/SparkSnail/nni into dev-aml
SparkSnail b3ec35a
fix aml docker image
SparkSnail 5544ae8
Merge pull request #261 from microsoft/master
SparkSnail f9fdfee
Merge pull request #262 from microsoft/master
SparkSnail e6c11b1
Merge branch 'master' of https://github.com/SparkSnail/nni into dev-aml
SparkSnail 5e4c09d
fix aml error information
SparkSnail 7242add
add annotation
SparkSnail aa64fe6
Merge pull request #263 from microsoft/master
SparkSnail c6a5f8c
Merge pull request #264 from microsoft/master
SparkSnail 68abe2f
Merge pull request #265 from microsoft/master
SparkSnail 14e9619
Merge pull request #266 from microsoft/master
SparkSnail f69e206
Merge pull request #267 from microsoft/master
SparkSnail 12ef0aa
Merge pull request #270 from microsoft/master
SparkSnail ddcf229
Merge pull request #271 from microsoft/master
SparkSnail c4f6e66
Merge pull request #272 from microsoft/master
SparkSnail 88f8c1b
Merge pull request #273 from microsoft/master
SparkSnail 7eb15f8
Merge pull request #274 from microsoft/master
SparkSnail f73367f
Merge pull request #275 from microsoft/master
SparkSnail 765bc33
Merge pull request #276 from microsoft/master
SparkSnail cff51cc
Merge pull request #277 from microsoft/master
SparkSnail 4232fea
Merge pull request #278 from microsoft/master
SparkSnail cb9efcc
Merge pull request #279 from microsoft/master
SparkSnail ee71f16
Merge pull request #280 from microsoft/master
SparkSnail c3921ed
Merge pull request #281 from microsoft/master
SparkSnail 561f1ad
Merge pull request #284 from microsoft/master
SparkSnail daf028a
Merge pull request #285 from microsoft/master
SparkSnail 0d41f82
fix conflict
SparkSnail f2c2f8b
init
SparkSnail 2706cb1
fix eslint
SparkSnail
@@ -10,7 +10,7 @@ import { GPUInfo, ScheduleResultType } from '../common/gpuData';
 import { EnvironmentInformation } from './environment';
 import { TrialDetail } from './trial';
 
-type SCHEDULE_POLICY_NAME = 'random' | 'round-robin';
+type SCHEDULE_POLICY_NAME = 'random' | 'round-robin' | 'recently-idle';
 
 export class GpuSchedulerSetting {
     public useActiveGpu: boolean = false;
@@ -30,7 +30,7 @@ export class GpuScheduler {
 
     // private readonly machineExecutorMap: Set<TrialDetail>;
     private readonly log: Logger = getLogger();
-    private readonly policyName: SCHEDULE_POLICY_NAME = 'round-robin';
+    private readonly policyName: SCHEDULE_POLICY_NAME = 'recently-idle';
     private defaultSetting: GpuSchedulerSetting;
     private roundRobinIndex: number = 0;
 
@@ -101,6 +101,7 @@ export class GpuScheduler {
             trial.environment.defaultGpuSummary !== undefined &&
             trial.assignedGpus !== undefined &&
             trial.assignedGpus.length > 0) {
+
             for (const gpuInfo of trial.assignedGpus) {
                 const defaultGpuSummary = trial.environment.defaultGpuSummary;
                 const num: number | undefined = defaultGpuSummary.assignedGpuIndexMap.get(gpuInfo.index);
@@ -190,10 +191,30 @@ export class GpuScheduler {
             return randomSelect(qualifiedEnvironments);
         } else if (this.policyName === 'round-robin') {
             return this.roundRobinSelect(qualifiedEnvironments, allEnvironments);
+        } else if (this.policyName === 'recently-idle') {
+            return this.recentlyIdleSelect(qualifiedEnvironments, allEnvironments);
         } else {
             throw new Error(`Unsupported schedule policy: ${this.policyName}`);
         }
     }
+
+    // Select the environment which is idle most recently. If all environments are not idle, use round robin to select an environment.
+    private recentlyIdleSelect(qualifiedEnvironments: EnvironmentInformation[], allEnvironments: EnvironmentInformation[]) : EnvironmentInformation {
+        const now = Date.now();
+        let selectedEnvironment: EnvironmentInformation | undefined = undefined;
+        let minTimeInterval = Number.MAX_SAFE_INTEGER;
+        for (let environment of qualifiedEnvironments) {
+            if (environment.latestTrialReleasedTime > 0 && (now - environment.latestTrialReleasedTime) < minTimeInterval) {
+                selectedEnvironment = environment;
+                minTimeInterval = now - environment.latestTrialReleasedTime;
+            }
+        }
+        if (selectedEnvironment === undefined) {
+            return this.roundRobinSelect(qualifiedEnvironments, allEnvironments);
+        }
+        selectedEnvironment.latestTrialReleasedTime = -1;

BTW, does one environment run only one trial at a time?

If so, the policy looks good. If not, and one environment can run two trials concurrently, then this policy becomes round robin.

One environment can run multiple trials. Why would it become round-robin?

+        return selectedEnvironment;
+    }
 
     private roundRobinSelect(qualifiedEnvironments: EnvironmentInformation[], allEnvironments: EnvironmentInformation[]): EnvironmentInformation {
         while (!qualifiedEnvironments.includes(allEnvironments[this.roundRobinIndex % allEnvironments.length])) {
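
For readers who want to try the selection rule outside the scheduler, here is a minimal standalone sketch of the recently-idle pick. It is an illustration, not the PR's code: `IdleEnv` and `pickRecentlyIdle` are hypothetical names, `IdleEnv` stands in for `EnvironmentInformation` with only the one field the policy reads, and the round-robin fallback is represented by returning `undefined`.

```typescript
// Hypothetical stand-in for EnvironmentInformation (assumption: only this field matters here).
interface IdleEnv {
    id: string;
    // Time (ms) at which the latest trial was released; a value <= 0 means "not marked idle".
    latestTrialReleasedTime: number;
}

// Returns the qualified environment with the smallest time since its last trial release,
// or undefined when no environment is marked idle (the diff falls back to round robin there).
function pickRecentlyIdle(qualified: IdleEnv[], now: number): IdleEnv | undefined {
    let selected: IdleEnv | undefined = undefined;
    let minInterval = Number.MAX_SAFE_INTEGER;
    for (const env of qualified) {
        const interval = now - env.latestTrialReleasedTime;
        if (env.latestTrialReleasedTime > 0 && interval < minInterval) {
            selected = env;
            minInterval = interval;
        }
    }
    if (selected !== undefined) {
        // Clear the idle mark, as the diff does, so one release is not counted twice.
        selected.latestTrialReleasedTime = -1;
    }
    return selected;
}

const sampleEnvs: IdleEnv[] = [
    { id: 'a', latestTrialReleasedTime: 1000 },
    { id: 'b', latestTrialReleasedTime: 4000 },
    { id: 'c', latestTrialReleasedTime: -1 },
];
const firstPick = pickRecentlyIdle(sampleEnvs, 5000);  // 'b': released 1000 ms ago
const secondPick = pickRecentlyIdle(sampleEnvs, 5000); // 'a': 'b' is no longer marked idle
const thirdPick = pickRecentlyIdle(sampleEnvs, 5000);  // undefined: nothing is marked idle
```

Because the mark is cleared on selection, an environment that still has capacity for a second concurrent trial stops counting as recently idle until another trial is released on it. If that is how the timestamp is maintained elsewhere in the code (an assumption, since that part is not shown in this diff), later picks would fall through to round robin, which may be the situation the review comment above has in mind.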
Is it possible to apply this policy only for the AML training service?
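
One way the question above could be addressed is to choose the policy per training service instead of hard-coding it as the field default. The sketch below is only an assumption about how that might look: `policyForPlatform` and the platform string `'aml'` are hypothetical, and nothing on this page shows whether the PR took this route.

```typescript
// Mirrors the union type from the diff; renamed here to keep the sketch self-contained.
type SchedulePolicyName = 'random' | 'round-robin' | 'recently-idle';

// Hypothetical helper: use recently-idle only for AML and keep the previous
// round-robin default for every other training service.
function policyForPlatform(platform: string): SchedulePolicyName {
    return platform === 'aml' ? 'recently-idle' : 'round-robin';
}

const amlPolicy = policyForPlatform('aml');
const otherPolicy = policyForPlatform('pai');
```

The scheduler would then need to receive the platform name (or the resulting policy) through its constructor or settings, which would also make the policy easier to override in tests.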