-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Conversation
@@ -101,12 +101,7 @@ class GPUScheduler { | |||
*/ | |||
private async runGpuMetricsCollectorScript(): Promise<void> { | |||
await execMkdir(this.gpuMetricCollectorScriptFolder, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add unmask 0
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.gpuMetricCollectorScriptFolder
is ${os.tmpdir()}/nni/script
, I think it is better to add username in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true
means share=True
, so umask
is 0 here. As a magic value true
is hard to understand and I'll add some comments.
This folder is to store collected metrics. So it cannot have user name in it so long as the collector script runs exclusively. Consider user A starts a collector and it writes metrics to /tmp/nni/A
; then user B will fail to start another collector and she does not know where the metrics locates in.
// the directory to store temp scripts in remote machine | ||
const remoteGpuScriptCollectorDir: string = this.getRemoteScriptsPath(rmMeta.username); | ||
await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn); | ||
await SSHClientUtility.remoteExeCommand(`(umask 0 ; mkdir -p ${remoteGpuScriptCollectorDir})`, conn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to add umask, because this unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
is how remoteGpuScriptCollectorDir
come from, userName
in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, then we are in trouble.
As stated in comment above, one machine can only have one collector running, which makes it impossible for the second user to collect GPU metrics if the folder is not shared.
This is a big problem and it may take some time to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to blame this behaviour was introduced 10 months ago. I will be surprised if a bug can be hidden for so long.
I don't have a remote GPU machine for now. I'll check it on Monday.
I have confirmed pgrep -fx
will detect process launched by other users though.
// the directory to store temp scripts in remote machine | ||
const remoteGpuScriptCollectorDir: string = this.getRemoteScriptsPath(rmMeta.username); | ||
await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn); | ||
await SSHClientUtility.remoteExeCommand(`(umask 0 ; mkdir -p ${remoteGpuScriptCollectorDir})`, conn); | ||
await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is nniRootDir
used for? @SparkSnail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the dir used in remote machine of NNI. Normally it is /tmp/nni
folder, the folder is used for nni trial logs and gpu script files.
src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
Outdated
Show resolved
Hide resolved
Updated. Now each user will have a seperate GPU metrics collector. |
Fix issue #1665
Solution: Don't create script on Linux, use one-line command instead. Windows routine is not modified.