-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Conversation
1f439a1
to
b83d9aa
Compare
e5539bf
to
23bb387
Compare
nni/tools/nnictl/common_utils.py
Outdated
count += 1 | ||
|
||
def get_file_lock(path: string, timeout=-1, stale=10): | ||
return SimpleFileLock(path + '.lock', timeout=timeout, stale=stale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose to set stale = 10 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting stale=10
is unreasonable. Maybe modify it to stale=-1
is better? means it will seize the lock immediately.
nni/tools/nnictl/config_utils.py
Outdated
|
||
def add_experiment(self, expId, port, startTime, file_name, platform, experiment_name, endTime='N/A', status='INITIALIZED'): | ||
def add_experiment(self, expId, port, startTime, platform, experiment_name, endTime='N/A', status='INITIALIZED', | ||
tag=[], pid=None, webuiUrl=[], logDir=[]): | ||
'''set {key:value} paris to self.experiment''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: paris
if expId not in self.experiments: | ||
return False | ||
self.experiments[expId][key] = value | ||
self.write_file() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to add indent to dump the json file to make the file more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I will add indent.
if (result !== undefined) { | ||
return result; | ||
} else { | ||
return this.getExperimentsInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems we need await
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and do we need a delay
here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we need await
and a delay
is better, will add them.
experimentsInformation[experimentId][key] = value; | ||
fs.writeFileSync(this.experimentsPath, JSON.stringify(experimentsInformation)); | ||
} else { | ||
this.log.error(`Experiment Manager: Experiment Id ${experimentId} not found, this should not happen`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this should not happen, maybe better to throw error and crash with assert
} | ||
}); | ||
} catch (err) { | ||
this.log.error(err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this err recoverable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is due to other processes lock the file, I add some details to distinguish this kind of error.
public setExperimentInfo(experimentId: string, key: string, value: any): void { | ||
try { | ||
if (this.profileUpdateTimer[key] !== undefined) { | ||
clearTimeout(this.profileUpdateTimer[key]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add comment to explain why clear timer here
} | ||
|
||
private async checkCrashed(expId: string, pid: number): Promise<CrashedInfo> { | ||
const alive: boolean = await isAlive(pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you said different experiments may use the same pid, is there any plan to fix this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not different experiments may use the same pid. Some other process may use the crashed experiment pid recorded in .experiment
, so when we check if pid is alive to check if experiment is alive, it may make some mistake. An easy way to fix this issue is to check if the experiment id
running on the port matches the experiment id
recorded in .experiment
.
nni/tools/nnictl/config_utils.py
Outdated
@@ -54,39 +56,53 @@ class Experiments: | |||
def __init__(self, home_dir=NNICTL_HOME_DIR): | |||
os.makedirs(home_dir, exist_ok=True) | |||
self.experiment_file = os.path.join(home_dir, '.experiment') | |||
self.experiments = self.read_file() | |||
self.lock = get_file_lock(self.experiment_file, timeout=1, stale=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the problem that we try to solve with timeout=1, stale=2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name timeout
parameter of get_file_lock
and SimpleFileLock
is a little bit misleading, it is used in an inner loop, not the overall timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will check if the lock file is modified over two seconds to determine whether the current lock need to be forced released.
This because we assume that all operations on the .experiment
file should be completed within 2
seconds. If the lock file has not been modified for more than 2
seconds, we think the process that generated the lock may have crashed without releasing the lock.
The async lock
implementation in TS code has the same logic. The only difference is that TS only retries to lock 100 times, with an interval of 0.1s, then it will throw error. In python code lock will non-stop retry and print warning with an interval of timeout
because we think users has the ability to judge when nnictl
should be forced to end.
timeout=1
means we will check if lockfile is expired with an interval of 1
second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, timeout
-> check_interval
, SimpleFileLock
-> ExpiredFileLock
maybe better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could solve this problem with special handling the timeout exception, no need to introduce stale
.
check_interval
is good, SimpleFileLock
-> SimplePreemptiveLock
def __enter__()
while True:
try:
self.aquire()
return self
except Timeout:
# here remove self._lock_file
ts/nni_manager/main.ts
Outdated
@@ -27,11 +29,10 @@ import { PAIYarnTrainingService } from './training_service/pai/paiYarn/paiYarnTr | |||
import { DLTSTrainingService } from './training_service/dlts/dltsTrainingService'; | |||
|
|||
function initStartupInfo( | |||
startExpMode: string, resumeExperimentId: string, basePort: number, platform: string, | |||
startExpMode: string, ExperimentId: string, basePort: number, platform: string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be "camelCase".
184e409
to
df491d4
Compare
nni/tools/nnictl/common_utils.py
Outdated
lock_file_names = glob.glob(self._lock_file + '.*') | ||
for file_name in lock_file_names: | ||
if os.path.exists(file_name) and time.time() - os.stat(file_name).st_mtime < self._timeout: | ||
raise TimeoutError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return None
here is more consistent with the logic of base class
nni/tools/nnictl/common_utils.py
Outdated
while True: | ||
try: | ||
self.acquire() | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base class already has a loop, we can set timeout=-1
for base class, then we do not need this __enter__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can rename the check_interval
back to stale
with this new design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we do not need __enter__
any more, I will remove it.
nni/tools/nnictl/common_utils.py
Outdated
try: | ||
lock_file_names = glob.glob(self._lock_file + '.*') | ||
for file_name in lock_file_names: | ||
if os.path.exists(file_name) and time.time() - os.stat(file_name).st_mtime < self._stale: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need special handling -1
value for stale to make it means never expire:
if os.path.exists(file_name) and (self._stale < 0 or time.time() - os.stat(file_name).st_mtime < self._stale):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, handle it.
TODO:
- If we can add a timestamp inNNIManager.status
, likeNNIManager.status.timestamp
?Konwn bug:
ExperimentsManager.withLock
,.experiment.lock
will not be del. This will block nnictl r&d.experiment
. Need implement check expiration function in python filelock.SIGTERM
in windows.psutil.Process(pid).terminate()
?