Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

experiment management backend #3081

Merged
merged 39 commits into from
Nov 30, 2020
Merged

Conversation

J-shang
Copy link
Contributor

@J-shang J-shang commented Nov 11, 2020

  • nnictl generate experimentId & merge folder
  • Modify .experiment structure
  • Experiment manager
    • add experiment manager
    • add test
  • Refactor
  • (stretch) Merge .config and db.experimentProfile.param

TODO:
- If we can add a timestamp in NNIManager.status, like NNIManager.status.timestamp?

Konwn bug:

  • If an error occurs in ExperimentsManager.withLock, .experiment.lock will not be del. This will block nnictl r&d .experiment. Need implement check expiration function in python filelock.
  • (quick fix kill command under windows #3106 ) It seems the experiment can not receive SIGTERM in windows. psutil.Process(pid).terminate()?

@liuzhe-lz liuzhe-lz mentioned this pull request Nov 13, 2020
77 tasks
@J-shang J-shang closed this Nov 16, 2020
@J-shang J-shang reopened this Nov 16, 2020
count += 1

def get_file_lock(path: string, timeout=-1, stale=10):
return SimpleFileLock(path + '.lock', timeout=timeout, stale=stale)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose to set stale = 10 seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting stale=10 is unreasonable. Maybe modify it to stale=-1 is better? means it will seize the lock immediately.


def add_experiment(self, expId, port, startTime, file_name, platform, experiment_name, endTime='N/A', status='INITIALIZED'):
def add_experiment(self, expId, port, startTime, platform, experiment_name, endTime='N/A', status='INITIALIZED',
tag=[], pid=None, webuiUrl=[], logDir=[]):
'''set {key:value} paris to self.experiment'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: paris

if expId not in self.experiments:
return False
self.experiments[expId][key] = value
self.write_file()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to add indent to dump the json file to make the file more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I will add indent.

if (result !== undefined) {
return result;
} else {
return this.getExperimentsInfo();
Copy link
Contributor

@chicm-ms chicm-ms Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems we need await here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and do we need a delay here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we need await and a delay is better, will add them.

experimentsInformation[experimentId][key] = value;
fs.writeFileSync(this.experimentsPath, JSON.stringify(experimentsInformation));
} else {
this.log.error(`Experiment Manager: Experiment Id ${experimentId} not found, this should not happen`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this should not happen, maybe better to throw error and crash with assert

}
});
} catch (err) {
this.log.error(err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this err recoverable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is due to other processes lock the file, I add some details to distinguish this kind of error.

public setExperimentInfo(experimentId: string, key: string, value: any): void {
try {
if (this.profileUpdateTimer[key] !== undefined) {
clearTimeout(this.profileUpdateTimer[key]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add comment to explain why clear timer here

}

private async checkCrashed(expId: string, pid: number): Promise<CrashedInfo> {
const alive: boolean = await isAlive(pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you said different experiments may use the same pid, is there any plan to fix this issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not different experiments may use the same pid. Some other process may use the crashed experiment pid recorded in .experiment, so when we check if pid is alive to check if experiment is alive, it may make some mistake. An easy way to fix this issue is to check if the experiment id running on the port matches the experiment id recorded in .experiment.

@@ -54,39 +56,53 @@ class Experiments:
def __init__(self, home_dir=NNICTL_HOME_DIR):
os.makedirs(home_dir, exist_ok=True)
self.experiment_file = os.path.join(home_dir, '.experiment')
self.experiments = self.read_file()
self.lock = get_file_lock(self.experiment_file, timeout=1, stale=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the problem that we try to solve with timeout=1, stale=2 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name timeout parameter of get_file_lock and SimpleFileLock is a little bit misleading, it is used in an inner loop, not the overall timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will check if the lock file is modified over two seconds to determine whether the current lock need to be forced released.

This because we assume that all operations on the .experiment file should be completed within 2 seconds. If the lock file has not been modified for more than 2 seconds, we think the process that generated the lock may have crashed without releasing the lock.

The async lock implementation in TS code has the same logic. The only difference is that TS only retries to lock 100 times, with an interval of 0.1s, then it will throw error. In python code lock will non-stop retry and print warning with an interval of timeout because we think users has the ability to judge when nnictl should be forced to end.

timeout=1 means we will check if lockfile is expired with an interval of 1 second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, timeout -> check_interval, SimpleFileLock -> ExpiredFileLock maybe better?

Copy link
Contributor

@chicm-ms chicm-ms Nov 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could solve this problem with special handling the timeout exception, no need to introduce stale.
check_interval is good, SimpleFileLock -> SimplePreemptiveLock

def __enter__()
    while True:
        try:
            self.aquire()
            return self
        except Timeout:
            # here remove self._lock_file

@@ -27,11 +29,10 @@ import { PAIYarnTrainingService } from './training_service/pai/paiYarn/paiYarnTr
import { DLTSTrainingService } from './training_service/dlts/dltsTrainingService';

function initStartupInfo(
startExpMode: string, resumeExperimentId: string, basePort: number, platform: string,
startExpMode: string, ExperimentId: string, basePort: number, platform: string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be "camelCase".

lock_file_names = glob.glob(self._lock_file + '.*')
for file_name in lock_file_names:
if os.path.exists(file_name) and time.time() - os.stat(file_name).st_mtime < self._timeout:
raise TimeoutError()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return None here is more consistent with the logic of base class

while True:
try:
self.acquire()
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base class already has a loop, we can set timeout=-1 for base class, then we do not need this __enter__

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can rename the check_interval back to stale with this new design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we do not need __enter__ any more, I will remove it.

try:
lock_file_names = glob.glob(self._lock_file + '.*')
for file_name in lock_file_names:
if os.path.exists(file_name) and time.time() - os.stat(file_name).st_mtime < self._stale:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need special handling -1 value for stale to make it means never expire:

if os.path.exists(file_name) and (self._stale < 0 or time.time() - os.stat(file_name).st_mtime < self._stale):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, handle it.

@J-shang J-shang merged commit 95f731e into microsoft:master Nov 30, 2020
@J-shang J-shang deleted the experiment-backend branch December 15, 2020 02:31
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants