[OTX] Enable multi-GPU training #1392
Conversation
Generally it looks good to me. Frankly speaking, I'm not an expert in multiprocessing, so let's check the test results. I left some comments, most of them questions. I also have two questions about the overall PR.
- Does using multiprocessing give more benefits than using nn.parallel.DistributedDataParallel? I have read this article.
- Don't we need a stress test for multiprocessing? Multiprocessing always looks fine when running a small amount of work (training schedule, # of jobs), but when the volume of work grows, it can fail unexpectedly.
Thanks for the comment! I think I can answer your questions.
Answer 1: You're right, so I implemented it using nn.parallel.DistributedDataParallel.
Answer 2: I'm not sure, but I think we should check whether multi-GPU training can run multiple training jobs at the same time.
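For readers unfamiliar with the approach, here is a minimal sketch of DDP-based multi-GPU training, with one process per GPU spawned via torch.multiprocessing. It is illustrative only; the ToyModel class, hyper-parameters, and master address/port values are made-up placeholders and do not reflect the actual OTX code.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):  # hypothetical stand-in for a real task model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)


def train_worker(rank: int, world_size: int):
    # One process per GPU; NCCL is the usual backend for GPU training.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = ToyModel().cuda(rank)
    # DDP synchronizes gradients across ranks during backward()
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):  # dummy training loop with random data
        optimizer.zero_grad()
        out = ddp_model(torch.randn(8, 10, device=rank))
        loss = out.sum()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```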
I understand. I'll check the situation in my local environment and update the result.
@jaegukhyun I checked that 4 multi-GPU training jobs ran well in parallel.
I left a few comments.
Please replace the print statements with the logger. I also left some comments.
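A minimal sketch of the suggested change, using Python's standard logging module; OTX ships its own logger utilities, so the exact import and message in the PR may differ.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

num_gpus = 4  # placeholder value for the example

# Before: print(f"Spawned {num_gpus} training processes")
logger.info("Spawned %d training processes", num_gpus)
```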
There are two failing cases in the Pre-Merge Check, as below. These are due to the gap between the exported model's performance and the trained model's performance.
Can we also use 3 or 4 GPUs?
Yes.
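As an illustration of why the GPU count is flexible: in the DDP sketch above, the number of worker processes simply follows the number of visible GPUs, so 3 or 4 GPUs work the same way as 2. The snippet below is a hedged example of selecting a subset of devices, not the actual OTX mechanism.

```python
import os

# Example: restrict training to three of the available GPUs. Setting this
# before importing torch keeps the reported device count consistent.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

import torch

world_size = torch.cuda.device_count()  # 3 on a machine with at least 3 GPUs
# world_size is then passed to mp.spawn(..., nprocs=world_size)
```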
Summary
Enable multi-GPU training for the classification, detection, and segmentation tasks.
This PR includes