Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Support Windows as remote node. #2431

Merged
merged 72 commits into from
May 19, 2020
Merged

Support Windows as remote node. #2431

merged 72 commits into from
May 19, 2020

Conversation

squirrelsc
Copy link
Member

@squirrelsc squirrelsc commented May 14, 2020

  1. Add WindowsCommand.ts to support different shell commands in windows.
  2. Update documents to explain how to enable remote node in Windows.

Fixed some bugs,

  1. Fix folder support problem on tarAdd function in training_service/common/util.ts.
  2. Fix Keras compatibility issue on val_acc or val_accuracy name in logs.
  3. Fix killChildProcesses of linux to support kill all descendant processes, instead of only direct child processes.
  4. Fix too muchGPU stats commands

Small improvements

  1. Let nni_trial_tool.trial_keeper to save its process id itself, to simplify script, and be more reliable.
  2. Migrate non-cross platform code into osCommands.ts.
  3. Refine ExecutorManager and ShellExecutor, to let ShellExecutor manage MaxTrials, and make it 1 to reduce concurrency issues.
  4. Move map of Trial to shellexecutor from RemoteMachineTrainingService to ExecutorManager. It let resource management be more convenient, and avoid null reference.
  5. Add RandomInt in utils.ts.
  6. Move gpu collector related files into experiment folder for multi-run.
  7. Let OsCommands can normalize folder like / to \ in windows, and versus in Linux.
  8. Extend timeout of setMetaddata to 30 seconds for high latency remote machines like in cloud.
  9. Fix partial success of pipelines-it-remote-windows.yml
  10. Remove gpuNum: 0 in test cases to use GPU.
  11. Improved some logs.

@squirrelsc squirrelsc linked an issue May 14, 2020 that may be closed by this pull request
2 tasks
*/
public getFirstExecutor(): ShellExecutor {
return this.executorArray[0];
// init a new executor if no free one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comments does not match the code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return this.executorArray[0];
// init a new executor if no free one.
if (executor === undefined) {
throw new Error("executor shouldn't be undefined before return!");
Copy link
Contributor

@chicm-ms chicm-ms May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block is duplicated with above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

public get getUsedConnectionNumber(): number {
return this.usedConnectionNumber;
return this.usedCount;
}

public addUsedConnectionNumber(): void {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this function addUsedConnectionNumber be removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

await executor.createFolder(trialWorkingFolder);
await executor.createFolder(unixPathJoin(trialWorkingFolder, '.nni'));
await executor.createFolder(trialJobDetail.workingDirectory);
await executor.createFolder(executor.joinPath(trialJobDetail.workingDirectory, '.nni'));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a function support create folder recursively?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
}

async function getRemoteFileContentLoop(executor: ShellExecutor): Promise<void> {
for (let i: number = 0; i < 10; i++) {
const remoteFullName = executor.joinPath(executor.getTempPath(), REMOTEFILE);
for (let i: number = 0; i < 3; i++) {
// console.log(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -41,14 +44,16 @@ describe('ShellExecutor test', () => {
rmMeta = JSON.parse(fs.readFileSync('../../.vscode/rminfo.json', 'utf8'));
console.log(rmMeta);
} catch (err) {
console.log(`Please configure rminfo.json to enable remote machine test.${err}`);
console.log(`Please configure rminfo.json to enable remote machine test. ${err}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove console.log() in TS code, use this.log() instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried, it doesn't work in test code.

@@ -14,7 +14,6 @@ assessor:
trial:
codeDir: ../../../examples/trials/cifar10_pytorch
command: python3 main.py --epochs 1 --batches 1
gpuNum: 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove gpuNum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks the test environment is used by two test suites sometime. So if set it to 1, NNI will wait another finished, but it causes timeout error. Remove this setting, it lets test cases to use default GPU, not wait each other, and reduced failure chance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add examples to test gpuScheduler in NNI, if remove this configuration, we could not test gpu related functions in pipeline.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will add it back. But it's test is very limited.

@SparkSnail SparkSnail merged commit 69cae21 into master May 19, 2020
@squirrelsc squirrelsc deleted the dev-windows-node branch May 20, 2020 02:07
@squirrelsc squirrelsc restored the dev-windows-node branch May 20, 2020 02:07
@squirrelsc squirrelsc deleted the dev-windows-node branch May 20, 2020 02:08
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Windows as Remote Training Node
3 participants