This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

experiment management backend #3081

Merged

J-shang merged 39 commits into microsoft:master from J-shang:experiment-backend

Nov 30, 2020

Contributor

J-shang commented Nov 11, 2020

TODO:
~~- If we can add a timestamp in NNIManager.status, like NNIManager.status.timestamp?~~

Known bugs:

If an error occurs in ExperimentsManager.withLock, .experiment.lock will not be deleted. This will block nnictl r&d .experiment. We need to implement an expiration check in the Python filelock.
(quick fix kill command under windows #3106 ) It seems the experiment cannot receive SIGTERM on Windows. Use psutil.Process(pid).terminate()?

Ning Shang added 10 commits

November 11, 2020 17:10


          step 1 nnictl generate experimentId & merge folder

0a595a9


          step 2.1 modify .experiment structure

cb7485b


          step 2.2 add lock for .experiment rw in nnictl

f4ffbee


          step 2.2 add filelock dependence

4fb7c33


          step 2.2 remove uniqueString from main.js

d29d85b


          fix test bug

e39de46


          fix test bug

02952af


          setp 3.1 add experiment manager

594619a


          step 3.2 add getExperimentsInfo

1c4aabd


          fix eslint

5c8f59a

liuzhe-lz requested review from SparkSnail and liuzhe-lz

November 13, 2020 08:54

liuzhe-lz mentioned this pull request

v2.0 Release Plan #2935

Closed

77 tasks


          add a simple file lock to support stale

ea0b553

J-shang closed this

J-shang reopened this

Ning Shang added 3 commits

November 16, 2020 15:25


          step 3.3 add test

863fc1d


          divide abs experiment manager from manager

eeedf3d


          experiment manager refactor

b83d9aa

J-shang force-pushed the experiment-backend branch from 1f439a1 to b83d9aa

November 19, 2020 01:50

Ning Shang added 7 commits

November 19, 2020 12:20


          support .experiment sync update status

fbe4d7c


          nnictl no longer uses rest api to update status or endtime

a2fbc55


          nnictl no longer uses rest api to update status or endtime

8ec133f


          fix eslint

df01921


          support .experiment sync update endtime

a100f10


          fix test

41b6eac


          fix settimeout bug

23bb387

J-shang force-pushed the experiment-backend branch from e5539bf to 23bb387

November 19, 2020 14:53

Ning Shang added 2 commits

November 19, 2020 23:49


          fix test

13d6b07


          adjust experiment endTime

fbdb128

chicm-ms reviewed

nni/tools/nnictl/common_utils.py Outdated

+                              count += 1
+              def get_file_lock(path: string, timeout=-1, stale=10):
+                  return SimpleFileLock(path + '.lock', timeout=timeout, stale=stale)

Contributor

chicm-ms Nov 25, 2020

What is the purpose of setting stale = 10 seconds?

Contributor Author

J-shang Nov 25, 2020

Setting stale=10 is unreasonable. Maybe changing it to stale=-1 is better? That would mean it seizes the lock immediately.

chicm-ms reviewed

nni/tools/nnictl/config_utils.py Outdated

-                  def add_experiment(self, expId, port, startTime, file_name, platform, experiment_name, endTime='N/A', status='INITIALIZED'):
+                  def add_experiment(self, expId, port, startTime, platform, experiment_name, endTime='N/A', status='INITIALIZED',
+                                     tag=[], pid=None, webuiUrl=[], logDir=[]):
                       '''set {key:value} paris to self.experiment'''

Contributor

chicm-ms Nov 25, 2020

typo: paris

chicm-ms reviewed

nni/tools/nnictl/config_utils.py

+                          if expId not in self.experiments:
+                              return False
+                          self.experiments[expId][key] = value
+                          self.write_file()

Contributor

chicm-ms Nov 25, 2020

Suggest adding an indent when dumping the JSON file to make the file more readable.

Contributor Author

J-shang Nov 25, 2020

Indeed, I will add indent.
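
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the helper names write_experiments and read_experiments and the indent width of 4 are assumptions made for the example.

```python
import json


def write_experiments(path, experiments, indent=4):
    """Dump experiment records as indented JSON so the file is easy to read.

    `path`, `experiments` and the default indent of 4 are illustrative
    assumptions, not values taken from the pull request.
    """
    with open(path, 'w') as fp:
        json.dump(experiments, fp, indent=indent)


def read_experiments(path):
    """Load the records back; indentation does not affect parsing."""
    with open(path) as fp:
        return json.load(fp)
```

Indentation only changes the on-disk layout, so readers of the file need no changes.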

chicm-ms reviewed

ts/nni_manager/core/nniExperimentsManager.ts Outdated

+                          if (result !== undefined) {
+                              return result;
+                          } else {
+                              return this.getExperimentsInfo();

Contributor

chicm-ms Nov 25, 2020

it seems we need await here?

Contributor

chicm-ms Nov 25, 2020

and do we need a delay here ?

Contributor Author

J-shang Nov 25, 2020

Yes, we need await, and adding a delay is better. I will add both.

chicm-ms reviewed

ts/nni_manager/core/nniExperimentsManager.ts Outdated

+                                  experimentsInformation[experimentId][key] = value;
+                                  fs.writeFileSync(this.experimentsPath, JSON.stringify(experimentsInformation));
+                              } else {
+                                  this.log.error(`Experiment Manager: Experiment Id ${experimentId} not found, this should not happen`);

Contributor

chicm-ms Nov 25, 2020

If this should not happen, it may be better to throw an error and crash with an assert.

chicm-ms reviewed

ts/nni_manager/core/nniExperimentsManager.ts

+                              }
+                          });
+                      } catch (err) {
+                          this.log.error(err);

Contributor

chicm-ms Nov 25, 2020

is this err recoverable?

Contributor Author

J-shang Nov 26, 2020

This error occurs when another process holds the lock on the file. I added some details to distinguish this kind of error.

QuanluZhang reviewed

ts/nni_manager/core/nniExperimentsManager.ts

+                  public setExperimentInfo(experimentId: string, key: string, value: any): void {
+                      try {
+                          if (this.profileUpdateTimer[key] !== undefined) {
+                              clearTimeout(this.profileUpdateTimer[key]);

Contributor

QuanluZhang Nov 25, 2020

Better to add a comment explaining why the timer is cleared here.

QuanluZhang reviewed

ts/nni_manager/core/nniExperimentsManager.ts

+                  }
+                  private async checkCrashed(expId: string, pid: number): Promise<CrashedInfo> {
+                      const alive: boolean = await isAlive(pid);

Contributor

QuanluZhang Nov 25, 2020

You said different experiments may use the same pid. Is there any plan to fix this issue?

Contributor Author

J-shang Nov 25, 2020

It is not that different experiments may use the same pid. Some other process may reuse the pid of a crashed experiment that is recorded in .experiment, so using "is the pid alive" to decide whether the experiment is alive can give the wrong answer. An easy way to fix this issue is to check whether the experiment id running on the port matches the experiment id recorded in .experiment.
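
The check proposed in the reply above might look roughly like this in Python. It is only a sketch under stated assumptions: the record layout (keys 'id', 'pid', 'port') and the two injected callables are hypothetical and are not taken from the NNI code base.

```python
def is_experiment_alive(record, pid_is_alive, fetch_running_id):
    """Decide whether a recorded experiment is still running.

    record:           dict with 'id', 'pid' and 'port' keys (hypothetical layout).
    pid_is_alive:     callable(pid) -> bool, e.g. a process-table lookup.
    fetch_running_id: callable(port) -> experiment id served on that port,
                      or None; may raise OSError if nothing is listening.
    """
    # A dead pid means the experiment is certainly gone.
    if not pid_is_alive(record['pid']):
        return False
    # A live pid may belong to an unrelated process that reused the number,
    # so also confirm that the port serves the same experiment id.
    try:
        running_id = fetch_running_id(record['port'])
    except OSError:
        return False
    return running_id == record['id']
```

Passing the two lookups in as callables keeps the sketch independent of any particular process or REST library.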


          fix issue in comments

7b1853e

QuanluZhang approved these changes

Ning Shang added 4 commits

November 26, 2020 10:38


          fix rest api format

b1ada7a


          add indent in json in experiments manager

ccd906a


          fix unittest


          fix unittest

155c132

chicm-ms reviewed

nni/tools/nnictl/config_utils.py Outdated

@@ -54,39 +56,53 @@ class Experiments:
                   def __init__(self, home_dir=NNICTL_HOME_DIR):
                       os.makedirs(home_dir, exist_ok=True)
                       self.experiment_file = os.path.join(home_dir, '.experiment')
-                      self.experiments = self.read_file()
+                      self.lock = get_file_lock(self.experiment_file, timeout=1, stale=2)

Contributor

chicm-ms Nov 26, 2020

What problem are we trying to solve with timeout=1, stale=2?

Contributor

chicm-ms Nov 26, 2020

The name of the timeout parameter of get_file_lock and SimpleFileLock is a little misleading: it is used in an inner loop and is not the overall timeout.

Contributor Author

J-shang Nov 26, 2020

We check whether the lock file was last modified more than two seconds ago to determine whether the current lock needs to be forcibly released.

This is because we assume that all operations on the .experiment file complete within 2 seconds. If the lock file has not been modified for more than 2 seconds, we assume the process that created the lock may have crashed without releasing it.

The async lock implementation in the TS code has the same logic. The only difference is that TS retries the lock only 100 times, at an interval of 0.1s, and then throws an error. In the Python code the lock retries indefinitely and prints a warning at an interval of timeout, because we think users are able to judge when nnictl should be forced to end.

timeout=1 means we check whether the lock file has expired at an interval of 1 second.

Contributor Author

J-shang Nov 26, 2020

Indeed. Maybe timeout -> check_interval and SimpleFileLock -> ExpiredFileLock would be better?

Contributor

chicm-ms Nov 26, 2020

We could solve this problem by handling the timeout exception specially, with no need to introduce stale.
check_interval is good; SimpleFileLock -> SimplePreemptiveLock

def __enter__(self):
    while True:
        try:
            self.acquire()
            return self
        except Timeout:
            # here remove self._lock_file
            os.remove(self._lock_file)
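
The stale-lock behaviour discussed in this thread can be illustrated with a small standalone sketch. It does not reproduce SimpleFileLock or the filelock package: the function names, the use of os.open with O_EXCL, and the default values (which mirror the 2 second staleness and the 100 retries at 0.1s mentioned above) are assumptions made for the example.

```python
import os
import time


def acquire_lock(lock_path, stale=2.0, check_interval=0.1, max_retries=100):
    """Try to create lock_path exclusively; return True on success.

    A lock file whose mtime is older than `stale` seconds is assumed to
    belong to a crashed holder and is removed before retrying.
    """
    for _ in range(max_retries):
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            try:
                age = time.time() - os.stat(lock_path).st_mtime
            except FileNotFoundError:
                continue  # released between open() and stat(); retry at once
            if age > stale:
                # Presumed crashed holder: force-release the lock.
                try:
                    os.remove(lock_path)
                except FileNotFoundError:
                    pass
                continue
            time.sleep(check_interval)
    return False


def release_lock(lock_path):
    """Remove the lock file if it still exists."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

Note that force-removing a stale file is itself racy: two waiting processes can both decide the lock is stale, and one may delete a lock the other has just created. The sketch accepts that risk for brevity.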

liuzhe-lz reviewed

ts/nni_manager/main.ts Outdated

               import { DLTSTrainingService } from './training_service/dlts/dltsTrainingService';
               function initStartupInfo(
-                  startExpMode: string, resumeExperimentId: string, basePort: number, platform: string,
+                  startExpMode: string, ExperimentId: string, basePort: number, platform: string,

Contributor

liuzhe-lz Nov 26, 2020

It should be "camelCase".


          refector file lock

df491d4

J-shang force-pushed the experiment-backend branch from 184e409 to df491d4

November 26, 2020 16:32

liuzhe-lz mentioned this pull request

Create experiment from Python code #3111

Merged


          fix eslint

chicm-ms reviewed

nni/tools/nnictl/common_utils.py Outdated

+                          lock_file_names = glob.glob(self._lock_file + '.*')
+                          for file_name in lock_file_names:
+                              if os.path.exists(file_name) and time.time() - os.stat(file_name).st_mtime < self._timeout:
+                                  raise TimeoutError()

Contributor

chicm-ms Nov 27, 2020

Returning None here is more consistent with the logic of the base class.

chicm-ms reviewed

nni/tools/nnictl/common_utils.py Outdated

+                      while True:
+                          try:
+                              self.acquire()
+                              return self

Contributor

chicm-ms Nov 27, 2020

The base class already has a loop. We can set timeout=-1 for the base class, and then we do not need this __enter__.

Contributor

chicm-ms Nov 27, 2020

We can rename check_interval back to stale with this new design.

Contributor Author

J-shang Nov 27, 2020

Indeed, we do not need __enter__ anymore. I will remove it.


          remove '__enter__' in filelock

4dd700a

chicm-ms reviewed

nni/tools/nnictl/common_utils.py Outdated

+                      try:
+                          lock_file_names = glob.glob(self._lock_file + '.*')
+                          for file_name in lock_file_names:
+                              if os.path.exists(file_name) and time.time() - os.stat(file_name).st_mtime < self._stale:

Contributor

chicm-ms Nov 27, 2020

We need special handling for a stale value of -1 so that it means never expire:

if os.path.exists(file_name) and (self._stale < 0 or time.time() - os.stat(file_name).st_mtime < self._stale):

Contributor Author

J-shang Nov 27, 2020

Indeed, I will handle it.
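
The condition suggested above can be isolated into a small predicate to make the "never expire" case explicit. This is a sketch only: the function name lock_is_held is hypothetical, and the real check lives inside the lock class in common_utils.py.

```python
import os
import time


def lock_is_held(file_name, stale):
    """Return True if the lock file exists and has not expired.

    A negative `stale` means the lock never expires, so an existing lock
    file always counts as held.
    """
    if not os.path.exists(file_name):
        return False
    if stale < 0:
        return True
    return time.time() - os.stat(file_name).st_mtime < stale
```

With stale=-1 a crashed holder blocks other processes until the lock file is removed by hand, which is the trade-off of the never-expire setting.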


          filelock support never expire

fa75dc3

chicm-ms approved these changes

J-shang merged commit 95f731e into microsoft:master

J-shang deleted the experiment-backend branch

December 15, 2020 02:31

kvartet added retiarii-v2.0 and removed retiarii-v2.0 labels
