Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

fix gpu script permission issue #1707

Merged
merged 2 commits into from
Nov 21, 2019
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 0 additions & 8 deletions src/nni_manager/training_service/common/gpuData.ts
Original file line number Diff line number Diff line change
Expand Up @@ -59,14 +59,6 @@ export class GPUSummary {
}
}

export const GPU_INFO_COLLECTOR_FORMAT_LINUX: string =
`
#!/bin/bash
export METRIC_OUTPUT_DIR={0}
echo $$ >{1}
python3 -m nni_gpu_tool.gpu_metrics_collector
`;

export const GPU_INFO_COLLECTOR_FORMAT_WINDOWS: string =
`
$env:METRIC_OUTPUT_DIR="{0}"
Expand Down
26 changes: 10 additions & 16 deletions src/nni_manager/training_service/common/util.ts
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ import * as path from 'path';
import { String } from 'typescript-string-operations';
import { countFilesRecursively, getNewLine, validateFileNameRecursively } from '../../common/utils';
import { file } from '../../node_modules/@types/tmp';
import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData';
import { GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData';

/**
* Validate codeDir, calculate file count recursively under codeDir, and throw error if any rule is broken
Expand Down Expand Up @@ -219,22 +219,16 @@ export function getScriptName(fileNamePrefix: string): string {
}
}

/**
* generate script file
* @param gpuMetricCollectorScriptFolder
*/
export function getgpuMetricsCollectorScriptContent(gpuMetricCollectorScriptFolder: string): string {
export function getGpuMetricsCollectorBashScriptContent(scriptFolder: string): string {
return `echo $$ > ${scriptFolder}/pid ; METRIC_OUTPUT_DIR=${scriptFolder} python3 -m nni_gpu_tool.gpu_metrics_collector`;
}

export function runGpuMetricsCollector(scriptFolder: string): void {
if (process.platform === 'win32') {
return String.Format(
GPU_INFO_COLLECTOR_FORMAT_WINDOWS,
gpuMetricCollectorScriptFolder,
path.join(gpuMetricCollectorScriptFolder, 'pid')
);
const scriptPath = path.join(scriptFolder, 'gpu_metrics_collector.ps1');
const content = String.Format(GPU_INFO_COLLECTOR_FORMAT_WINDOWS, scriptFolder, path.join(scriptFolder, 'pid'));
fs.writeFile(scriptPath, content, { encoding: 'utf8' }, () => { runScript(scriptPath); });
} else {
return String.Format(
GPU_INFO_COLLECTOR_FORMAT_LINUX,
gpuMetricCollectorScriptFolder,
path.join(gpuMetricCollectorScriptFolder, 'pid')
);
cp.exec(getGpuMetricsCollectorBashScriptContent(scriptFolder), { shell: '/bin/bash' });
}
}
9 changes: 2 additions & 7 deletions src/nni_manager/training_service/local/gpuScheduler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ import { String } from 'typescript-string-operations';
import { getLogger, Logger } from '../../common/log';
import { delay } from '../../common/utils';
import { GPUInfo, GPUSummary } from '../common/gpuData';
import { execKill, execMkdir, execRemove, execTail, getgpuMetricsCollectorScriptContent, getScriptName, runScript } from '../common/util';
import { execKill, execMkdir, execRemove, execTail, runGpuMetricsCollector } from '../common/util';

/**
* GPUScheduler for local training service
Expand Down Expand Up @@ -101,12 +101,7 @@ class GPUScheduler {
*/
private async runGpuMetricsCollectorScript(): Promise<void> {
await execMkdir(this.gpuMetricCollectorScriptFolder, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add unmask 0 here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.gpuMetricCollectorScriptFolder is ${os.tmpdir()}/nni/script, I think it is better to add username in it.

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Nov 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true means share=True, so umask is 0 here. As a magic value true is hard to understand and I'll add some comments.
This folder is to store collected metrics. So it cannot have user name in it so long as the collector script runs exclusively. Consider user A starts a collector and it writes metrics to /tmp/nni/A; then user B will fail to start another collector and she does not know where the metrics locates in.

//generate gpu_metrics_collector script
const gpuMetricsCollectorScriptPath: string =
path.join(this.gpuMetricCollectorScriptFolder, getScriptName('gpu_metrics_collector'));
const gpuMetricsCollectorScriptContent: string = getgpuMetricsCollectorScriptContent(this.gpuMetricCollectorScriptFolder);
await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' });
runScript(gpuMetricsCollectorScriptPath);
runGpuMetricsCollector(this.gpuMetricCollectorScriptFolder);
}

// tslint:disable:non-literal-fs-path
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -42,10 +42,10 @@ import {
getVersion, uniqueString, unixPathJoin
} from '../../common/utils';
import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPUSummary } from '../common/gpuData';
import { GPUSummary } from '../common/gpuData';
import { TrialConfig } from '../common/trialConfig';
import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
import { execCopydir, execMkdir, execRemove, validateCodeDir } from '../common/util';
import { execCopydir, execMkdir, execRemove, validateCodeDir, getGpuMetricsCollectorBashScriptContent } from '../common/util';
import { GPUScheduler } from './gpuScheduler';
import {
HOST_JOB_SHELL_FORMAT, RemoteCommandResult, REMOTEMACHINE_TRIAL_COMMAND_FORMAT, RemoteMachineMeta,
Expand Down Expand Up @@ -334,8 +334,6 @@ class RemoteMachineTrainingService implements TrainingService {
break;
case TrialConfigMetadataKey.MACHINE_LIST:
await this.setupConnections(value);
//remove local temp files
await execRemove(this.getLocalGpuMetricCollectorDir());
break;
case TrialConfigMetadataKey.TRIAL_CONFIG:
const remoteMachineTrailConfig: TrialConfig = <TrialConfig>JSON.parse(value);
Expand Down Expand Up @@ -428,34 +426,6 @@ class RemoteMachineTrainingService implements TrainingService {
return Promise.resolve();
}

/**
* Generate gpu metric collector directory to store temp gpu metric collector script files
*/
private getLocalGpuMetricCollectorDir(): string {
const userName: string = path.basename(os.homedir()); //get current user name of os

return path.join(os.tmpdir(), userName, 'nni', 'scripts');
}

/**
* Generate gpu metric collector shell script in local machine,
* used to run in remote machine, and will be deleted after uploaded from local.
*/
private async generateGpuMetricsCollectorScript(userName: string): Promise<void> {
const gpuMetricCollectorScriptFolder : string = this.getLocalGpuMetricCollectorDir();
await execMkdir(path.join(gpuMetricCollectorScriptFolder, userName));
//generate gpu_metrics_collector.sh
const gpuMetricsCollectorScriptPath: string = path.join(gpuMetricCollectorScriptFolder, userName, 'gpu_metrics_collector.sh');
// This directory is used to store gpu_metrics and pid created by script
const remoteGPUScriptsDir: string = this.getRemoteScriptsPath(userName);
const gpuMetricsCollectorScriptContent: string = String.Format(
GPU_INFO_COLLECTOR_FORMAT_LINUX,
remoteGPUScriptsDir,
unixPathJoin(remoteGPUScriptsDir, 'pid')
);
await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' });
}

private async setupConnections(machineList: string): Promise<void> {
this.log.debug(`Connecting to remote machines: ${machineList}`);
const deferred: Deferred<void> = new Deferred<void>();
Expand All @@ -480,23 +450,18 @@ class RemoteMachineTrainingService implements TrainingService {
private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
// Create root working directory after ssh connection is ready
// generate gpu script in local machine first, will copy to remote machine later
await this.generateGpuMetricsCollectorScript(rmMeta.username);
const nniRootDir: string = unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni');
await SSHClientUtility.remoteExeCommand(`mkdir -p ${this.remoteExpRootDir}`, conn);

// Copy NNI scripts to remote expeirment working directory
const localGpuScriptCollectorDir: string = this.getLocalGpuMetricCollectorDir();
// the directory to store temp scripts in remote machine
const remoteGpuScriptCollectorDir: string = this.getRemoteScriptsPath(rmMeta.username);
await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
await SSHClientUtility.remoteExeCommand(`(umask 0 ; mkdir -p ${remoteGpuScriptCollectorDir})`, conn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to add umask, because this unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts'); is how remoteGpuScriptCollectorDir come from, userName in it.

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Nov 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then we are in trouble.
As stated in comment above, one machine can only have one collector running, which makes it impossible for the second user to collect GPU metrics if the folder is not shared.
This is a big problem and it may take some time to fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to blame this behaviour was introduced 10 months ago. I will be surprised if a bug can be hidden for so long.
I don't have a remote GPU machine for now. I'll check it on Monday.
I have confirmed pgrep -fx will detect process launched by other users though.

await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is nniRootDir used for? @SparkSnail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the dir used in remote machine of NNI. Normally it is /tmp/nni folder, the folder is used for nni trial logs and gpu script files.

//copy gpu_metrics_collector.sh to remote
await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'),
unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);

//Begin to execute gpu_metrics_collection scripts
// tslint:disable-next-line: no-floating-promises
SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
const script = getGpuMetricsCollectorBashScriptContent(remoteGpuScriptCollectorDir);
SSHClientUtility.remoteExeCommand(`bash -c '${script}'`, conn);

const disposable: Rx.IDisposable = this.timer.subscribe(
async (tick: number) => {
Expand Down