…into dev-remote-pipeline
    for child in psutil.Process(self.process.pid).children(True):
        child.kill()
    self.process.kill()
except Exception as ex:
Can we catch a more specific exception?
OK, adding a catch for psutil.NoSuchProcess.
try:
    nni_log(LogType.Info, "%s: killing trial" % self.name)
    for child in psutil.Process(self.process.pid).children(True):
        child.kill()
If we use try/except here, it's better to wrap each kill separately, so that one failure won't affect the others.
This try/except is meant to catch errors from psutil.Process(), not from the child kills. In some scenarios the trial has already exited by the time the kill command is sent, and psutil.Process() then throws a "process does not exist" error.
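Both points in this thread can be satisfied at once by guarding the process lookup and each kill separately. The following is a minimal sketch under stated assumptions, not the code in this PR: `kill_process_tree`, the `lookup` hook, and the stand-in `NoSuchProcess` class are hypothetical names chosen so the snippet runs without psutil installed. With psutil, one would pass `psutil.Process` as `lookup` and catch `psutil.NoSuchProcess` instead.

```python
class NoSuchProcess(Exception):
    """Stand-in for psutil.NoSuchProcess (assumption: keeps the sketch dependency-free)."""


def kill_process_tree(lookup, pid, log=print):
    """Kill the process `pid` and all of its descendants.

    `lookup` is a hypothetical hook returning a process object that exposes
    children(recursive) and kill(); psutil.Process has that shape.
    Returns the number of processes that were actually killed.
    """
    try:
        parent = lookup(pid)
        targets = parent.children(True) + [parent]
    except NoSuchProcess:
        # The trial exited before the kill command arrived: nothing to do.
        log("process %s already exited" % pid)
        return 0

    killed = 0
    for proc in targets:
        try:
            proc.kill()
            killed += 1
        except NoSuchProcess:
            # One process disappearing must not stop the remaining kills.
            log("a process in the tree exited before it could be killed")
    return killed
```

Children are killed before the parent, matching the order in the diff, and a process that vanishes mid-loop is logged and skipped rather than aborting the cleanup.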
        child.kill()
    self.process.kill()
except Exception as ex:
    nni_log(LogType.Error, "kill trial %s failed: %s " % (trial_id, str(ex)))
This looks like cleanup, so it doesn't need an error-level log; debug or info is enough.
psutil.NoSuchProcess is expected when the process has already exited, so info level is used for it. For other, unexpected errors, I think error level is better.
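The split described here, info for the expected "already exited" case and error for everything else, can be written as a small mapping from exception type to log level. This is only a sketch: it uses Python's standard `logging` module instead of `nni_log`, and `NoSuchProcess`, `kill_failure_level`, and `report_kill_failure` are hypothetical names (the class stands in for `psutil.NoSuchProcess` so the snippet has no third-party dependency).

```python
import logging


class NoSuchProcess(Exception):
    """Stand-in for psutil.NoSuchProcess (assumption: avoids the psutil dependency)."""


def kill_failure_level(ex):
    """Pick the log level for an exception raised while killing a trial."""
    if isinstance(ex, NoSuchProcess):
        return logging.INFO   # expected: the trial had already exited
    return logging.ERROR      # unexpected: keep it visible


def report_kill_failure(logger, trial_id, ex):
    """Log the failure at the chosen level and return that level."""
    level = kill_failure_level(ex)
    logger.log(level, "kill trial %s failed: %s", trial_id, ex)
    return level
```

Keeping the decision in one function means a later change of policy, for example downgrading another exception type, touches a single place.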
        }
    }
} catch (error) {
Why is there no need to release the environment resource here?
There is a case where the environment has been submitted but starts slowly, so it hasn't started the process or created the pid file yet. When the system then calls the refresh function to read the pid file, it throws a "no such file" exception.
I see. It looks like you don't need to check this file right away. Check environment.isRunnerReady first, then check the file; that way the check depends on the first initialized message. Also, you are setting the env status to running too early. In remote mode, it's better to wait for isRunnerReady first, then check the file status, and only then set the status to running, success, or failed.
No. If the environment fails to start, isRunnerReady will always be false, but we still need to refresh the env status to failed here.
How do you know it failed to start? You could wait for the pid file during initialization instead of setting the env to running directly.
The process return code is checked to detect whether the env failed to start. I also added logic to detect whether the pid file exists.
@SparkSnail could you briefly explain why your changes fix the problem?
@@ -664,17 +664,16 @@ class TrialDispatcher implements TrainingService {
    }

    private releaseEnvironment(trial: TrialDetail): void {
        if (undefined === trial.environment) {
            throw new Error(`TrialDispatcher: environment is not assigned to trial ${trial.id}, and cannot be released!`);
What's the reason for removing this check? It helps find unexpected behavior.
IT found a case where the assessor sends two consecutive kill commands for a trial: the trial releases its environment on the first, and releasing it again on the second throws an exception. This behavior is by design in the assessor, so TrialDispatcher should handle this kind of case.
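The argument here amounts to making the release idempotent. A minimal sketch of that idea follows, written in Python rather than the TypeScript of TrialDispatcher; `Trial`, `Environment`, `running_trial_count`, and the boolean return value are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Environment:
    running_trial_count: int = 0   # hypothetical bookkeeping field


@dataclass
class Trial:
    id: str
    environment: Optional[Environment] = None


def release_environment(trial, log=print):
    """Release the trial's environment; a repeated call is a logged no-op."""
    if trial.environment is None:
        # Second kill command for the same trial: tolerate it instead of raising.
        log("environment already released for trial %s" % trial.id)
        return False
    trial.environment.running_trial_count -= 1
    trial.environment = None
    return True
```

Logging the repeated call, rather than silently returning, keeps some of the diagnostic value of the removed check without failing on the double kill.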
@@ -29,6 +29,7 @@ abstract class OsCommands {
    public abstract extractFile(tarFileName: string, targetFolder: string): string;
    public abstract executeScript(script: string, isFile: boolean): string;
    public abstract addPreCommand(preCommand: string | undefined, command: string | undefined): string | undefined;
    public abstract fileExistCommand(filePath: string): string | undefined;
fileExists is enough as a name. The "Command" suffix is for command-related methods, but this one isn't command-related.