[OTX] Enable multi-GPU training #1392

Merged: 23 commits merged into feature/otx from multigpu_enablement on Dec 28, 2022

Conversation

@eunwoosh (Contributor) commented Nov 25, 2022

Summary

Enable multi-GPU training for the classification, detection, and segmentation tasks.

This PR includes:

  • Enable multi-GPU training (a minimal launcher sketch follows this list)
  • Make it possible to set output_path in each task class
  • Add test code for multi-GPU training
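
For readers less familiar with the mechanism, the sketch below shows the general pattern a feature like this typically builds on: spawn one worker process per GPU, initialize a process group, and wrap the model in DistributedDataParallel. It is a minimal illustration only; the worker function, placeholder model, and output path are hypothetical and do not reproduce the actual otx/cli/utils/multi_gpu.py implementation.

```python
# Minimal sketch (not the PR's actual code): one process per GPU, model wrapped in DDP.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank: int, world_size: int, output_path: str) -> None:
    """Hypothetical per-GPU worker; each spawned process drives exactly one GPU."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 2).cuda(rank)  # placeholder for the real task model
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across ranks

    # ... build a DataLoader with a DistributedSampler and run the training loop here ...

    if rank == 0:  # only rank 0 writes results to output_path
        os.makedirs(output_path, exist_ok=True)
        torch.save(ddp_model.module.state_dict(), os.path.join(output_path, "weights.pth"))
    dist.destroy_process_group()


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(train_worker, args=(num_gpus, "./outputs"), nprocs=num_gpus)
```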

@eunwoosh requested a review from supersoob on November 25, 2022 12:13
@eunwoosh force-pushed the multigpu_enablement branch from 916da4e to a67a36b on December 8, 2022 06:49
@github-actions bot added the ALGO label on Dec 13, 2022
@eunwoosh force-pushed the multigpu_enablement branch from a2d5e2d to a7d48b0 on December 13, 2022 04:28
@github-actions bot added the TEST label on Dec 13, 2022
@eunwoosh changed the title from "Multigpu enablement" to "[OTX] Enable multi-GPU training" on Dec 15, 2022
@eunwoosh force-pushed the multigpu_enablement branch from ef4eadd to 10967d0 on December 15, 2022 02:36
@github-actions bot removed the ALGO label on Dec 15, 2022
@eunwoosh marked this pull request as ready for review on December 15, 2022 05:01
@eunwoosh requested a review from a team as a code owner on December 15, 2022 05:01
Review threads (resolved): otx/cli/tools/train.py (×3)
@jaegukhyun (Contributor) left a comment

Generally it looks good to me. Frankly speaking, I'm not an expert on multiprocessing, so let's check the test results. I left some comments; most of them are actually questions. I also have two questions about the overall PR.

  1. Does using multiprocessing give more benefit than using nn.parallel.DistributedDataParallel? I have read this article.
  2. Don't we need a stress test for multiprocessing? Multiprocessing jobs always look fine when the workload is small (training schedule, number of jobs), but when the volume of work grows they can suddenly fail with errors.

@eunwoosh (Contributor, Author) commented Dec 15, 2022

Thanks for the comments! I think I can answer your questions.

Answer 1: You're right, so I implemented it with nn.parallel.DistributedDataParallel.
Answer 2: Actually, I'm not sure I understand your exact point. Do you mean that we need a test that runs multiple multi-GPU trainings?
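
As a side note on Answer 1: unlike nn.DataParallel, each DistributedDataParallel process loads its own batches, so the dataset is usually sharded with a DistributedSampler. The snippet below is a minimal sketch with hypothetical dataset and loader names, not code from this PR.

```python
# Sketch only: give each DDP rank a disjoint shard of the dataset.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def build_loader(rank: int, world_size: int, batch_size: int = 32) -> DataLoader:
    # Dummy dataset standing in for the real task dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    # Call sampler.set_epoch(epoch) at the start of every epoch so shuffling differs per epoch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```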

@jaegukhyun (Contributor)

Regarding Answer 2: I'm not sure, but I think we should check whether multi-GPU training can run multiple training jobs at the same time. For example, run the training jobs below simultaneously (a launch sketch follows this list):

  1. one training job on GPUs 1, 2
  2. one training job on GPUs 1, 2, 3
  3. one training job on GPUs 2, 3
  4. one training job on GPUs 2, 3, 4

Did you check this situation in your environment? The test code can cover the simple case, but we should also check the complex case in a local environment.
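
One rough way to reproduce this scenario locally is to launch several jobs at once, each pinned to its own GPU set via CUDA_VISIBLE_DEVICES. This is only a sketch; train_job.py is a hypothetical stand-in for whatever training command is actually used.

```python
# Sketch of the stress scenario: several concurrent jobs on overlapping GPU sets.
import os
import subprocess

GPU_SETS = ["1,2", "1,2,3", "2,3", "2,3,4"]  # mirrors the four jobs listed above

procs = []
for gpus in GPU_SETS:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    # "train_job.py" is a placeholder for the real training command.
    procs.append(subprocess.Popen(["python", "train_job.py"], env=env))

for proc in procs:
    proc.wait()  # all jobs should finish without errors
```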

@eunwoosh (Contributor, Author)

I understand. I'll check this situation in a local environment and update the results.

@eunwoosh (Contributor, Author)

@jaegukhyun I checked that four multi-GPU trainings ran well in parallel.

@harimkang (Contributor) left a comment

I left a few comments.

Review threads (resolved): otx/algorithms/anomaly/tasks/inference.py, otx/cli/tools/train.py (×2), otx/cli/utils/hpo.py
@sungmanc (Contributor) left a comment

Please replace the print statements with the logger. I also left some comments.

Review threads (resolved): otx/cli/tools/train.py (×2), otx/cli/utils/multi_gpu.py (×4)
@eunwoosh (Contributor, Author) commented Dec 28, 2022

There are two failing cases in the pre-merge check, as below.

TestToolsMPAMultilabelClassification.test_otx_eval_openvino[Custom_Image_Classification_EfficientNet-V2-S]
AssertionError: trained_performance[k]=1.0, exported_performance[k]=0.9607843137254902
TestToolsMPAMultilabelClassification.test_otx_eval_openvino[Custom_Image_Classification_EfficientNet-B0]
AssertionError: trained_performance[k]=0.9803921568627451, exported_performance[k]=0.9607843137254902

These failures are due to the gap between the exported model's performance and the trained model's performance.
I think the difference is acceptable, so could you approve the PR?
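
For context, checks like the failing assertions above usually allow a small accuracy drop after export. The snippet below is a minimal sketch of such a tolerance-based comparison; the tolerance value is an assumption, not the project's actual threshold.

```python
# Sketch: accept a bounded accuracy drop between the trained and exported models.
TOLERANCE = 0.05  # assumed value; the real pre-merge threshold may differ


def check_exported(trained_acc: float, exported_acc: float) -> None:
    assert exported_acc >= trained_acc - TOLERANCE, (
        f"exported accuracy {exported_acc:.4f} is more than {TOLERANCE} "
        f"below trained accuracy {trained_acc:.4f}"
    )


# Both reported gaps (~0.04 and ~0.02) would pass with this assumed tolerance.
check_exported(1.0, 0.9607843137254902)
check_exported(0.9803921568627451, 0.9607843137254902)
```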

@sungmanc previously approved these changes on Dec 28, 2022
@JihwanEom (Contributor) left a comment

Can we also use 3 or 4 GPUs?

Review threads (resolved): otx/cli/tools/train.py (×2), otx/cli/utils/multi_gpu.py (×6)
@eunwoosh (Contributor, Author)

Yes, we can also use 3 or 4 GPUs.

@eunwoosh requested a review from sungmanc on December 28, 2022 01:40
@eunwoosh merged commit a31c064 into feature/otx on Dec 28, 2022
@eunwoosh deleted the multigpu_enablement branch on December 28, 2022 02:28
Labels
ALGO (Any changes in OTX Algo Tasks implementation), CLI (Any changes in OTE CLI), TEST (Any changes in tests)
5 participants