This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Fix incorrect trial status when a trial is early stopped #4005

Merged

Contributor

@acured acured commented Aug 2, 2021

Cause: the trial status was updated only after the trial process had exited completely (e.g. from RUNNING to EARLY_STOPPED), but process exit takes some time, which could leave the status out of sync.

Reuse mode does not have this issue.

@acured acured changed the title fix trial status not correct when trial is early stoped(orver max dur… fix trial status not correct when trial is early stoped Aug 2, 2021
@acured acured force-pushed the FixStatusNotCorrectWhenTrialEarlyStop branch from 0611874 to 38ac353 Compare August 3, 2021 08:34
await delay(500);
}

});
Contributor

We send SIGTERM and wait a short while because the trial may have some clean-up logic. Why would this cause the wrong job status?

Contributor Author

If the user cancels (early stops) a trial, the status does not change immediately and stays "RUNNING" for a while. The "nnimanager" main loop has logic that sets "RUNNING" to "FAILED" when the status is "RUNNING" but the PID is not alive.

As a result, the status is sometimes "FAILED" when it should be "EARLY_STOPPED".
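
The race described here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not NNI's actual code: the `TrialRecord` shape and the `reconcileStatus` function are hypothetical names invented for this example.

```typescript
// Hypothetical, simplified model of the main-loop check described above.
type TrialStatus = 'RUNNING' | 'EARLY_STOPPED' | 'FAILED';

interface TrialRecord {
    status: TrialStatus;
    pidAlive: boolean; // assumed result of a "is the PID alive?" probe
}

// A trial that is still marked RUNNING but whose process is gone
// is treated as FAILED; every other status is left untouched.
function reconcileStatus(trial: TrialRecord): TrialStatus {
    if (trial.status === 'RUNNING' && !trial.pidAlive) {
        return 'FAILED';
    }
    return trial.status;
}

// The race: the trial was early stopped and its process has already
// exited, but the status has not been updated yet.
const racedTrial: TrialRecord = { status: 'RUNNING', pidAlive: false };
const observedStatus: TrialStatus = reconcileStatus(racedTrial);
console.log(observedStatus);
```

In this model the early-stopped trial is reported as FAILED simply because the check runs before the status update.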

Contributor

Got it, but changing SIGTERM to SIGKILL may cause the early-stopped trial to skip its clean-up. Is this a serious change? @QuanluZhang
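
One common way to balance the two signals is to send SIGTERM first and escalate to SIGKILL only if the process outlives a grace period. The decision step can be sketched as a pure function. This is an illustrative assumption, not what this PR implements: `nextStopAction`, `StopAction`, and the grace-period parameter are hypothetical names.

```typescript
// Hypothetical decision helper for a "SIGTERM first, SIGKILL later" policy.
type StopAction = 'DONE' | 'WAIT' | 'SEND_SIGKILL';

// alive:     whether the trial process still exists
// elapsedMs: time since SIGTERM was sent
// graceMs:   how long the trial is given to run its clean-up logic
function nextStopAction(alive: boolean, elapsedMs: number, graceMs: number): StopAction {
    if (!alive) {
        return 'DONE'; // process exited on its own, clean-up had its chance
    }
    // Still alive: keep polling until the grace period is used up.
    return elapsedMs >= graceMs ? 'SEND_SIGKILL' : 'WAIT';
}

console.log(nextStopAction(true, 1000, 5000));
```

A polling loop would call this after each delay and act on the result, so clean-up is skipped only for trials that ignore SIGTERM.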

Contributor

let's discuss it tomorrow

Contributor Author

Update: the early-stop status is now set before the isalive check.
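
The ordering change can be illustrated with a minimal sketch. It assumes a simplified in-memory job object; `Job`, `requestStop`, `checkAlive`, and `simulateEarlyStop` are hypothetical names, not the identifiers used in NNI.

```typescript
// Hypothetical model: record the intended terminal status first,
// then let the liveness check run.
type JobStatus = 'RUNNING' | 'EARLY_STOPPED' | 'USER_CANCELED' | 'FAILED';

interface Job {
    status: JobStatus;
    alive: boolean;
}

// Mark the job with its final status as soon as the stop is requested,
// before the process has actually exited.
function requestStop(job: Job, isEarlyStop: boolean): void {
    if (job.status === 'RUNNING') {
        job.status = isEarlyStop ? 'EARLY_STOPPED' : 'USER_CANCELED';
    }
}

// Liveness check: only jobs still marked RUNNING can become FAILED.
function checkAlive(job: Job): void {
    if (job.status === 'RUNNING' && !job.alive) {
        job.status = 'FAILED';
    }
}

function simulateEarlyStop(): JobStatus {
    const job: Job = { status: 'RUNNING', alive: true };
    requestStop(job, true); // status is set first
    job.alive = false;      // the process exits some time later
    checkAlive(job);        // the check no longer sees RUNNING
    return job.status;
}

console.log(simulateEarlyStop());
```

Because the status leaves RUNNING before the process dies, the liveness check has nothing to overwrite.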

@acured acured force-pushed the FixStatusNotCorrectWhenTrialEarlyStop branch from 38ac353 to 766f2c2 Compare August 4, 2021 01:40
@QuanluZhang QuanluZhang requested a review from liuzhe-lz August 4, 2021 05:39
@@ -681,7 +683,7 @@ class NNIManager implements Manager {
 this.currSubmittedTrialNum++;
 this.log.info('submitTrialJob: form:', form);
 const trialJobDetail: TrialJobDetail = await this.trainingService.submitTrialJob(form);
-setTimeout(async () => this.stopTrialJobIfOverMaxDurationTimer(trialJobDetail.id), 1000 * this.maxTrialDuration);
+setTimeout(async () => this.stopTrialJobIfOverMaxDurationTimer(trialJobDetail.id), this.maxTrialDuration);
Contributor

I think the timer should only be added when the duration is not infinite.

Contributor Author

Yes, changed.
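
A guard of the kind suggested in this thread might look like the sketch below. It is an assumption about the shape of the change, not the merged code: `scheduleMaxDurationTimer` and its parameters are hypothetical names. The guard matters because Node.js does not wait forever on an infinite delay: a delay larger than 2147483647 ms is set to 1 ms, so the callback fires almost immediately.

```typescript
// Hypothetical helper: only schedule the stop timer for a finite duration.
function scheduleMaxDurationTimer(
    maxDurationMs: number,
    onTimeout: () => void
): ReturnType<typeof setTimeout> | undefined {
    // Infinity (or NaN) means "no limit", so no timer is created.
    if (!Number.isFinite(maxDurationMs)) {
        return undefined;
    }
    return setTimeout(onTimeout, maxDurationMs);
}

// No limit configured: nothing is scheduled.
const unlimited = scheduleMaxDurationTimer(Infinity, () => console.log('stop trial'));
console.log(unlimited === undefined);
```

Very long finite durations (above 2147483647 ms, roughly 24.8 days) hit the same timer limit and would need separate handling.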

@acured acured force-pushed the FixStatusNotCorrectWhenTrialEarlyStop branch from a1ce34f to e66ef2d Compare August 5, 2021 01:54
@QuanluZhang QuanluZhang merged commit e99c579 into microsoft:master Aug 5, 2021
4 participants