Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

fix gpu script permission issue #1707

Merged
merged 2 commits into from
Nov 21, 2019
Merged

Conversation

liuzhe-lz
Copy link
Contributor

@liuzhe-lz liuzhe-lz commented Nov 5, 2019

Fix issue #1665

Solution: Don't create script on Linux, use one-line command instead. Windows routine is not modified.

@@ -101,12 +101,7 @@ class GPUScheduler {
*/
private async runGpuMetricsCollectorScript(): Promise<void> {
await execMkdir(this.gpuMetricCollectorScriptFolder, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add unmask 0 here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.gpuMetricCollectorScriptFolder is ${os.tmpdir()}/nni/script, I think it is better to add username in it.

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Nov 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true means share=True, so umask is 0 here. As a magic value true is hard to understand and I'll add some comments.
This folder is to store collected metrics. So it cannot have user name in it so long as the collector script runs exclusively. Consider user A starts a collector and it writes metrics to /tmp/nni/A; then user B will fail to start another collector and she does not know where the metrics locates in.

// the directory to store temp scripts in remote machine
const remoteGpuScriptCollectorDir: string = this.getRemoteScriptsPath(rmMeta.username);
await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
await SSHClientUtility.remoteExeCommand(`(umask 0 ; mkdir -p ${remoteGpuScriptCollectorDir})`, conn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to add umask, because this unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts'); is how remoteGpuScriptCollectorDir come from, userName in it.

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Nov 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then we are in trouble.
As stated in comment above, one machine can only have one collector running, which makes it impossible for the second user to collect GPU metrics if the folder is not shared.
This is a big problem and it may take some time to fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to blame this behaviour was introduced 10 months ago. I will be surprised if a bug can be hidden for so long.
I don't have a remote GPU machine for now. I'll check it on Monday.
I have confirmed pgrep -fx will detect process launched by other users though.

// the directory to store temp scripts in remote machine
const remoteGpuScriptCollectorDir: string = this.getRemoteScriptsPath(rmMeta.username);
await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
await SSHClientUtility.remoteExeCommand(`(umask 0 ; mkdir -p ${remoteGpuScriptCollectorDir})`, conn);
await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is nniRootDir used for? @SparkSnail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the dir used in remote machine of NNI. Normally it is /tmp/nni folder, the folder is used for nni trial logs and gpu script files.

@liuzhe-lz
Copy link
Contributor Author

Updated. Now each user will have a seperate GPU metrics collector.

@liuzhe-lz liuzhe-lz merged commit 6a5864c into microsoft:master Nov 21, 2019
@liuzhe-lz liuzhe-lz deleted the gpu-perm branch November 22, 2019 09:45
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants