Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Fix remote reuse bugs #2981

Merged
merged 49 commits into from
Oct 20, 2020
Merged

Fix remote reuse bugs #2981

merged 49 commits into from
Oct 20, 2020

Conversation

SparkSnail
Copy link
Contributor

No description provided.

@QuanluZhang QuanluZhang removed the request for review from ultmaster October 19, 2020 02:48
for child in psutil.Process(self.process.pid).children(True):
child.kill()
self.process.kill()
except Exception as ex:
Copy link
Contributor

@liuzhe-lz liuzhe-lz Oct 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we catch a more specific exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, add a NoSuchProcess.

try:
nni_log(LogType.Info, "%s: killing trial" % self.name)
for child in psutil.Process(self.process.pid).children(True):
child.kill()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is try catch, it's better to try catch for each kill. So that one fail won't effect others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is used to catch psutil.Process(), not child kill. In some kind of scene, trial has already exited, and kill() command is sent later. Will throw process not exist error.

child.kill()
self.process.kill()
except Exception as ex:
nni_log(LogType.Error, "kill trial %s failed: %s " % (trial_id, str(ex)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It likes a clean up, don't need error level log. debug or info is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

psutil.NoSuchProcess is expected exit issue, use info level. For other kinds of unexpected issue, I think use error level is better.

}
}
} catch (error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to release environment resource here, why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a case that environment is submitted, but it starts slowly and hasn't start process and create pid file, the system call refresh function to read pid file, it will cause no such file exception.

Copy link
Member

@squirrelsc squirrelsc Oct 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it looks you don't need to check this file. Check if environment.isRunnerReady, then check the file. It will depend on first initialized message. And you set env status to running is too earlier. In remote, it's better to wait isRunnerReady first, then check file status, and set to running, success or failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, if the environment is failed to start, the isRunnerReady will always be false, but we need to refresh env status to failed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know it failed to start? You may can wait the pid file when initializing, instead of set env to running directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check process return code to detect if env is failed to start. Added detecting logic for pid file exist.

@QuanluZhang
Copy link
Contributor

@SparkSnail could you briefly explain why your changes fix the problem?

@@ -664,17 +664,16 @@ class TrialDispatcher implements TrainingService {
}

private releaseEnvironment(trial: TrialDetail): void {
if (undefined === trial.environment) {
throw new Error(`TrialDispatcher: environment is not assigned to trial ${trial.id}, and cannot be released!`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's reason to remove the check? it helps to find unexpected behavior

Copy link
Contributor Author

@SparkSnail SparkSnail Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IT find a case that assessor report two kill command continuous for a trial, a trial released environment, then release again will throw exception. This behavior is by design in assessor, so trialDispatcher should handle this kinds of case.

@chicm-ms chicm-ms closed this Oct 20, 2020
@chicm-ms chicm-ms reopened this Oct 20, 2020
@@ -29,6 +29,7 @@ abstract class OsCommands {
public abstract extractFile(tarFileName: string, targetFolder: string): string;
public abstract executeScript(script: string, isFile: boolean): string;
public abstract addPreCommand(preCommand: string | undefined, command: string | undefined): string | undefined;
public abstract fileExistCommand(filePath: string): string | undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fileExists is enough. Command is for command related, but this one is not related.

@SparkSnail SparkSnail merged commit add7ca6 into v1.9 Oct 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants