Fix remote reuse bugs #2981

SparkSnail · 2020-10-19T02:10:36Z

No description provided.


liuzhe-lz · 2020-10-19T03:01:59Z

tools/nni_trial_tool/trial.py

+                    for child in psutil.Process(self.process.pid).children(True):
+                        child.kill()
+                    self.process.kill()
+                except Exception as ex:


Can we catch a more specific exception?

OK, added a NoSuchProcess case.

squirrelsc · 2020-10-19T03:56:31Z

tools/nni_trial_tool/trial.py

+                try:
+                    nni_log(LogType.Info, "%s: killing trial" % self.name)
+                    for child in psutil.Process(self.process.pid).children(True):
+                        child.kill()


If there is a try/except, it's better to wrap each kill in its own try/except, so that one failure won't affect the others.

This except clause is there to catch errors from psutil.Process(), not from the child kill. In some scenarios the trial has already exited by the time the kill() command is sent, which raises a process-does-not-exist error.

squirrelsc · 2020-10-19T03:56:59Z

tools/nni_trial_tool/trial.py

+                        child.kill()
+                    self.process.kill()
+                except Exception as ex:
+                    nni_log(LogType.Error, "kill trial %s failed: %s " % (trial_id, str(ex)))


This looks like cleanup, so it doesn't need an error-level log; debug or info is enough.

psutil.NoSuchProcess is an expected exit case, so it uses info level. For other, unexpected issues, I think error level is better.
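
The three review threads on `trial.py` point at one shape: look up the process first, treat an already-exited process as expected, and guard each kill on its own. A minimal sketch of that shape follows, assuming the `psutil` package is installed. The function name `kill_process_tree` and the `logging` logger are hypothetical; the PR itself logs through `nni_log` and may be structured differently.

```python
import logging

import psutil

# Hypothetical logger for this sketch; the PR uses nni_log instead.
logger = logging.getLogger("trial_sketch")


def kill_process_tree(pid: int) -> bool:
    """Kill `pid` and its descendants; return False if `pid` was already gone."""
    try:
        parent = psutil.Process(pid)
        targets = parent.children(recursive=True) + [parent]
    except psutil.NoSuchProcess:
        # Expected when the trial exited before the kill request arrived.
        logger.info("process %d already exited", pid)
        return False

    for proc in targets:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            # The process exited between the lookup and the kill: also expected.
            logger.info("process %d already exited", proc.pid)
        except psutil.Error as ex:
            # Anything else (for example AccessDenied) is unexpected.
            # Log it at error level and keep killing the remaining processes.
            logger.error("failed to kill process %d: %s", proc.pid, ex)
    return True
```

Separating the lookup from the kill loop keeps the "trial already exited" case at info level while still reporting other failures as errors, and a failure on one child does not stop the remaining kills.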

QuanluZhang · 2020-10-19T04:02:21Z

src/nni_manager/training_service/reusable/environments/remoteEnvironmentService.ts

                }
            }
+        } catch (error) {


Why is there no need to release the environment resource here?

There is a case where the environment has been submitted but starts slowly: the process has not started and the pid file has not been created yet. If the system then calls the refresh function to read the pid file, it raises a no-such-file exception.

I see. It looks like you don't need to check this file up front. Check environment.isRunnerReady first, then check the file; isRunnerReady depends on the first initialized message. Also, you set the env status to running too early. For remote, it's better to wait for isRunnerReady first, then check the file status, and only then set the status to running, success, or failed.

No. If the environment fails to start, isRunnerReady will always be false, but we still need to refresh the env status to failed here.

How do you know it failed to start? You could wait for the pid file during initialization instead of setting the env to running directly.

We check the process return code to detect whether the env failed to start. I added logic to detect whether the pid file exists.
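
The decision order described in this thread can be sketched as a small pure function. This is an illustration in Python rather than the PR's TypeScript. `EnvStatus`, `refresh_status`, and its three parameters are hypothetical names, and treating "pid file present, process gone, no return code" as failed is an assumption rather than something the thread states.

```python
from enum import Enum
from typing import Optional


class EnvStatus(Enum):
    # Hypothetical status names for this sketch.
    WAITING = "WAITING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


def refresh_status(pid_file_exists: bool,
                   process_alive: bool,
                   return_code: Optional[int]) -> EnvStatus:
    """Decide an environment status without reading a pid file that may not exist."""
    if return_code is not None:
        # A recorded return code settles the outcome, including a failed start.
        return EnvStatus.SUCCEEDED if return_code == 0 else EnvStatus.FAILED
    if not pid_file_exists:
        # Submitted but slow to start: do not try to read the pid file yet.
        return EnvStatus.WAITING
    # Assumption: a vanished process with no return code counts as failed.
    return EnvStatus.RUNNING if process_alive else EnvStatus.FAILED
```

Checking the return code before the pid file is what lets a failed start be reported even when the runner never becomes ready.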

QuanluZhang · 2020-10-19T04:10:37Z

@SparkSnail could you briefly explain why your changes fix the problem?

squirrelsc · 2020-10-19T23:52:57Z

src/nni_manager/training_service/reusable/trialDispatcher.ts

@@ -664,17 +664,16 @@ class TrialDispatcher implements TrainingService {
    }

    private releaseEnvironment(trial: TrialDetail): void {
-        if (undefined === trial.environment) {
-            throw new Error(`TrialDispatcher: environment is not assigned to trial ${trial.id}, and cannot be released!`);


What's the reason for removing this check? It helps to find unexpected behavior.

IT found a case where the assessor reports two consecutive kill commands for one trial: the trial releases its environment, and releasing it again throws an exception. This behavior is by design in the assessor, so trialDispatcher should handle this kind of case.
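
The double-kill case amounts to making the release idempotent. A minimal sketch of that idea follows, written in Python rather than the PR's TypeScript. The `Environment` and `Trial` classes, the `running_trial_count` field, and the boolean return value are all hypothetical and are not taken from `trialDispatcher.ts`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Environment:
    # Hypothetical stand-in for the dispatcher's environment record.
    id: str
    running_trial_count: int = 0


@dataclass
class Trial:
    # Hypothetical stand-in for TrialDetail.
    id: str
    environment: Optional[Environment] = None


def release_environment(trial: Trial) -> bool:
    """Release the trial's environment once; later calls are harmless no-ops."""
    env = trial.environment
    if env is None:
        # A second kill command for the same trial lands here.
        return False
    env.running_trial_count -= 1
    trial.environment = None
    return True
```

The trade-off the reviewer raised still applies: a silent no-op hides genuinely unexpected double releases, so a debug-level log in the `None` branch may be worth keeping.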

squirrelsc · 2020-10-20T03:32:45Z

src/nni_manager/training_service/remote_machine/osCommands.ts

@@ -29,6 +29,7 @@ abstract class OsCommands {
    public abstract extractFile(tarFileName: string, targetFolder: string): string;
    public abstract executeScript(script: string, isFile: boolean): string;
    public abstract addPreCommand(preCommand: string | undefined, command: string | undefined): string | undefined;
+    public abstract fileExistCommand(filePath: string): string | undefined;


fileExists is enough. The Command suffix is for command-related methods, but this one is not.

SparkSnail and others added 30 commits February 10, 2020 11:51

fix endpoint

5982baa

add private key

5653624

Merge branch 'master' of https://github.com/microsoft/nni into dev-remote-pipeline

e5ee726

fix torchversion

1cb2349

Merge branch 'master' of https://github.com/microsoft/nni into dev-remote-pipeline

e53d4ec

add debug info

99b0c08

add port in pscp.exe

fe60ba5

fix remote pipeline

52aa6ad

Merge branch 'v1.4' of https://github.com/microsoft/nni into dev-remote-pipeline

533c504

Merge branch 'v1.4.1' of https://github.com/microsoft/nni into dev-remote-pipeline

8f024fa

fix remote-windows-pipeline

6a96403

Merge branch 'master' of https://github.com/microsoft/nni into dev-remote-pipeline

7afab04

remove sudo

13bb9f2

fix error

6e1d822

Merge branch 'dev-remote-pipeline' of https://github.com/microsoft/nni into dev-remote-pipeline

5a91061

format code

4faeb06

fix error

21aa3b5

fix error

ffa8c6b

debug

4c49f60

remove clean step

1057ad2

format code

0a52185

Merge branch 'v1.5' of https://github.com/microsoft/nni into dev-remote-pipeline

8b20632

fix windows copy

f07f3f4

fix conflict

939b1d4

fix pipeline

1f78669

fix remote pipeline

7b7cadc

Merge branch 'master' of https://github.com/microsoft/nni into dev-remote-pipeline

96d41e4

fix remote it

04f5645

format annotation

beafde2

fix platform judge method

c3c6135

SparkSnail added 5 commits October 14, 2020 11:00

debug kill process

bdbfa6e

revert change

0dd7c97

fix check status

5cd22f0

Merge branch 'v1.9' of https://github.com/microsoft/nni into dev-remote-pipeline

7e54be7

fix v1.9

4dc2a41

SparkSnail requested review from chicm-ms, QuanluZhang, liuzhe-lz and ultmaster October 19, 2020 02:10

QuanluZhang removed the request for review from ultmaster October 19, 2020 02:48

fix eslint

5cc6898

liuzhe-lz reviewed Oct 19, 2020

QuanluZhang requested a review from squirrelsc October 19, 2020 03:49

squirrelsc reviewed Oct 19, 2020

QuanluZhang reviewed Oct 19, 2020

SparkSnail added 4 commits October 19, 2020 15:32

fix comments

620e4ec

add check file command

fbf758a

update

7c80d45

revert change

706b1b0

squirrelsc reviewed Oct 19, 2020

chicm-ms closed this Oct 20, 2020

chicm-ms reopened this Oct 20, 2020

QuanluZhang approved these changes Oct 20, 2020

add annotation

cb69b81

chicm-ms approved these changes Oct 20, 2020

squirrelsc reviewed Oct 20, 2020

squirrelsc approved these changes Oct 20, 2020

SparkSnail merged commit add7ca6 into v1.9 Oct 20, 2020
