-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix trial status not correct when trial is early stoped #4005
fix trial status not correct when trial is early stoped #4005
Conversation
0611874
to
38ac353
Compare
await delay(500); | ||
} | ||
|
||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SIGTERM
and wait for a few time because maybe the trial has some clean logic, why this will cause the wrong job status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user canceled(early stop) a trial, the status is not changed immediately, which will keep "RUNNING" for a while, From "nnimanager" main loop, there is a logic that set "RUNNING" to "FAILED" when status is "RUNNING" but the PID is not alive.
That will cause the status is "FAILED" which should be "EARLY_STOP" sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, but change SIGTERM
to SIGKILL
may cause the early stop trial to skip its clean-up, is this a serious change? @QuanluZhang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's discuss it tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updates: set early_stop before isalive check.
38ac353
to
766f2c2
Compare
ts/nni_manager/core/nnimanager.ts
Outdated
@@ -681,7 +683,7 @@ class NNIManager implements Manager { | |||
this.currSubmittedTrialNum++; | |||
this.log.info('submitTrialJob: form:', form); | |||
const trialJobDetail: TrialJobDetail = await this.trainingService.submitTrialJob(form); | |||
setTimeout(async () => this.stopTrialJobIfOverMaxDurationTimer(trialJobDetail.id), 1000 * this.maxTrialDuration); | |||
setTimeout(async () => this.stopTrialJobIfOverMaxDurationTimer(trialJobDetail.id), this.maxTrialDuration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should only add timer when duration is not infinity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, changed.
a1ce34f
to
e66ef2d
Compare
Caused: the trial status was update when the trial process is exit completely(e.g. from RUNING to EARLY_STOPPED), but process exiting will cost some times which could cause status not match.
Resue mode do not have this issue.