[OTX] Enable multi-GPU training #1392

Merged: 23 commits merged into feature/otx from multigpu_enablement on Dec 28, 2022

Conversation

@eunwoosh (Contributor) commented Nov 25, 2022

Summary

Enable multi-GPU training for the classification, detection, and segmentation tasks.

This PR includes:

  • Enable multi-GPU training (a minimal launcher sketch follows this list)
  • Make it possible to set output_path in each task class
  • Add test code for multi-GPU training
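
For readers less familiar with the mechanism, the sketch below shows the general pattern a feature like this typically builds on: spawn one worker process per GPU, initialize a process group, and wrap the model in DistributedDataParallel. It is a minimal illustration only; the worker function, placeholder model, and output path are hypothetical and do not reproduce the actual otx/cli/utils/multi_gpu.py implementation.

```python
# Minimal sketch (not the PR's actual code): one process per GPU, model wrapped in DDP.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train_worker(rank: int, world_size: int, output_path: str) -> None:
    """Hypothetical per-GPU worker; each spawned process drives exactly one GPU."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 2).cuda(rank)  # placeholder for the real task model
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across ranks

    # ... build a DataLoader with a DistributedSampler and run the training loop here ...

    if rank == 0:  # only rank 0 writes results to output_path
        os.makedirs(output_path, exist_ok=True)
        torch.save(ddp_model.module.state_dict(), os.path.join(output_path, "weights.pth"))
    dist.destroy_process_group()


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(train_worker, args=(num_gpus, "./outputs"), nprocs=num_gpus)
```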

@eunwoosh requested a review from supersoob on November 25, 2022 12:13
@eunwoosh force-pushed the multigpu_enablement branch from 916da4e to a67a36b on December 8, 2022 06:49
@github-actions bot added the ALGO label on Dec 13, 2022
@eunwoosh force-pushed the multigpu_enablement branch from a2d5e2d to a7d48b0 on December 13, 2022 04:28
@github-actions bot added the TEST label on Dec 13, 2022
@eunwoosh changed the title from "Multigpu enablement" to "[OTX] Enable multi-GPU training" on Dec 15, 2022
@eunwoosh force-pushed the multigpu_enablement branch from ef4eadd to 10967d0 on December 15, 2022 02:36
@github-actions bot removed the ALGO label on Dec 15, 2022
@eunwoosh marked this pull request as ready for review on December 15, 2022 05:01
@eunwoosh requested a review from a team as a code owner on December 15, 2022 05:01
Review threads (resolved): otx/cli/tools/train.py (×3)
@jaegukhyun (Contributor) left a comment

Generally it looks good to me. Frankly speaking, I'm not an expert on multiprocessing, so let's check the test results. I left some comments; most of them are actually questions. I also have two questions about the overall PR.

  1. Does using multiprocessing give more benefit than using nn.parallel.DistributedDataParallel? I have read this article.
  2. Don't we need a stress test for multiprocessing? Multiprocessing jobs always look fine when the workload is small (training schedule, number of jobs), but when the volume of work grows they can suddenly fail with errors.

@eunwoosh (Contributor, Author) commented Dec 15, 2022

Thanks for the comments! I think I can answer your questions.

Answer 1: You're right, so I implemented it with nn.parallel.DistributedDataParallel.
Answer 2: Actually, I'm not sure I understand your exact point. Do you mean that we need a test that runs multiple multi-GPU trainings?
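
As a side note on Answer 1: unlike nn.DataParallel, each DistributedDataParallel process loads its own batches, so the dataset is usually sharded with a DistributedSampler. The snippet below is a minimal sketch with hypothetical dataset and loader names, not code from this PR.

```python
# Sketch only: give each DDP rank a disjoint shard of the dataset.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def build_loader(rank: int, world_size: int, batch_size: int = 32) -> DataLoader:
    # Dummy dataset standing in for the real task dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    # Call sampler.set_epoch(epoch) at the start of every epoch so shuffling differs per epoch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```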

@jaegukhyun (Contributor)

Regarding Answer 2: I'm not sure, but I think we should check whether multi-GPU training can run multiple training jobs at the same time. For example, run the training jobs below simultaneously (a launch sketch follows this list):

  1. one training job on GPUs 1, 2
  2. one training job on GPUs 1, 2, 3
  3. one training job on GPUs 2, 3
  4. one training job on GPUs 2, 3, 4

Did you check this situation in your environment? The test code can cover the simple case, but we should also check the complex case in a local environment.
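
One rough way to reproduce this scenario locally is to launch several jobs at once, each pinned to its own GPU set via CUDA_VISIBLE_DEVICES. This is only a sketch; train_job.py is a hypothetical stand-in for whatever training command is actually used.

```python
# Sketch of the stress scenario: several concurrent jobs on overlapping GPU sets.
import os
import subprocess

GPU_SETS = ["1,2", "1,2,3", "2,3", "2,3,4"]  # mirrors the four jobs listed above

procs = []
for gpus in GPU_SETS:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    # "train_job.py" is a placeholder for the real training command.
    procs.append(subprocess.Popen(["python", "train_job.py"], env=env))

for proc in procs:
    proc.wait()  # all jobs should finish without errors
```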

@eunwoosh (Contributor, Author)

I understand. I'll check this situation in a local environment and update the results.

@eunwoosh (Contributor, Author)

@jaegukhyun I checked that four multi-GPU trainings ran well in parallel.

@harimkang (Contributor) left a comment

I left a few comments.

Review threads (resolved): otx/algorithms/anomaly/tasks/inference.py, otx/cli/tools/train.py (×2), otx/cli/utils/hpo.py
@sungmanc (Contributor) left a comment

Please replace the print statements with the logger. I also left some comments.

Review threads (resolved): otx/cli/tools/train.py (×2), otx/cli/utils/multi_gpu.py (×4)
@eunwoosh (Contributor, Author) commented Dec 28, 2022

There are two failing cases in the pre-merge check, as below.

TestToolsMPAMultilabelClassification.test_otx_eval_openvino[Custom_Image_Classification_EfficientNet-V2-S]
AssertionError: trained_performance[k]=1.0, exported_performance[k]=0.9607843137254902
TestToolsMPAMultilabelClassification.test_otx_eval_openvino[Custom_Image_Classification_EfficientNet-B0]
AssertionError: trained_performance[k]=0.9803921568627451, exported_performance[k]=0.9607843137254902

These failures are due to the gap between the exported model's performance and the trained model's performance.
I think the difference is acceptable, so could you approve the PR?
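
For context, checks like the failing assertions above usually allow a small accuracy drop after export. The snippet below is a minimal sketch of such a tolerance-based comparison; the tolerance value is an assumption, not the project's actual threshold.

```python
# Sketch: accept a bounded accuracy drop between the trained and exported models.
TOLERANCE = 0.05  # assumed value; the real pre-merge threshold may differ


def check_exported(trained_acc: float, exported_acc: float) -> None:
    assert exported_acc >= trained_acc - TOLERANCE, (
        f"exported accuracy {exported_acc:.4f} is more than {TOLERANCE} "
        f"below trained accuracy {trained_acc:.4f}"
    )


# Both reported gaps (~0.04 and ~0.02) would pass with this assumed tolerance.
check_exported(1.0, 0.9607843137254902)
check_exported(0.9803921568627451, 0.9607843137254902)
```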

@sungmanc previously approved these changes on Dec 28, 2022
@JihwanEom (Contributor) left a comment

Can we also use 3 or 4 GPUs?

Review threads (resolved): otx/cli/tools/train.py (×2), otx/cli/utils/multi_gpu.py (×6)
@eunwoosh (Contributor, Author)

Yes, we can also use 3 or 4 GPUs.

@eunwoosh requested a review from sungmanc on December 28, 2022 01:40
@eunwoosh merged commit a31c064 into feature/otx on Dec 28, 2022
@eunwoosh deleted the multigpu_enablement branch on December 28, 2022 02:28
Labels
ALGO (Any changes in OTX Algo Tasks implementation), CLI (Any changes in OTE CLI), TEST (Any changes in tests)
5 participants