[OTX] Bugfix: multi GPU raise error when num_workers isn't set as 0. #1475
Merged
eunwoosh
requested review from
goodsong81,
supersoob,
harimkang,
JihwanEom,
sungmanc and
jaegukhyun
January 2, 2023 08:43
eunwoosh
changed the title
[OTX] bugfix: multi GPU raise error when num_workers isn't set as 0.
[OTX] Bugfix: multi GPU raise error when num_workers isn't set as 0.
Jan 2, 2023
goodsong81
reviewed
Jan 2, 2023
harimkang
approved these changes
Jan 3, 2023
goodsong81
approved these changes
Jan 3, 2023
cih9088
added a commit
to cih9088/training_extensions
that referenced
this pull request
Jan 5, 2023
…t as 0. (openvinotoolkit#1475)" This reverts commit c076902.
goodsong81
added a commit
that referenced
this pull request
Jan 16, 2023
* Update MPA submodule to origin/otx * [OTX-MMCV] Public mmdetection (#1382) Enable model training and NNCF in mmdet (#1355) * Enable detection training on latest mmcv/det - ATSS / SSD / YOLOX - NNCF support for ATSS * fix: import errors * feat: add monkey patch to mmdet modules - most of patches would be just wrapping for not tracing in nncf context * feat: add trainable yolox - add trainable yolox - recursively search dataset cfg for nested dataset classes * fix: change device to cpu when nncf tracing * feat: add trainable ssd * refactor: rearange nncf adapter * feat: add trainable mask rcnn models * refactor: move out common utils * fix: ssd head bug * feat: add lr scheduler for accuracy aware runner * refactor: nncf module and monkey patch * fix: proper clustering anchors for ssd * fix: unable to trace the first module in NNCFNetwork * fix: bring back ssd head structure * feat: add train_step method to NNCFNetwork * fix: mismatches * fix: update pipeline for wrapper * fix: add missing file * Fix merge error * Enable model training and NNCF in mmseg (#1400) * refactor: remove redundant * feat: enable mmseg training * feat: add nncf related stuff * fix: change lr config * fix: align nncf target metric * refactor: use mpa for training and inference * test: enable tests * fix: minor bug * refactor: patcher * fix: build consistent nncf graph * fix: minor bug * fix: remove unused backup * fix: dealt with datacontainer * [OTX-MMCLS] Enable NNCF (#1435) * fix: use patcher * feat: update mmcls version * feat: enable NNCF for mmcls * refactor: add build NNCF model functions * fix: minor bug * fix: typo * fix: make sure importing nncf when enabled only * fix: inherit from base super class of otx * [OTX] Introduce mmdeploy to export cls/seg/det models (#1466) * feat: export using mmdeploy * fix: adapt mmdeploy exported model * test: enable openvino export * fix: patch depending on fn type * feat: mmdeploy for classification model * test: enable export and openvino 
performance test * fix: change temporary requirements * refactor: use builder * fix: do not propagate logger * fix: remove image channel format conversion * fix: handle unlabeled data * fix: run eval before optimizing nncf network * feat: change confidence threshold after nncf optimization * fix: remove redundant attribute * fix: official released openvino version * fix: remove redundants * feat: public mm series libraries * feat: otx refactoring and bug fix * Revert "[OTX] Bugfix: multi GPU raise error when num_workers isn't set as 0. (#1475)" This reverts commit c076902. * feat: enable multi-nodes distributed training * fix: redundant parts * Revert "[OTX] Evaluate a model before training starts (#1472)" This reverts commit f728295. * feat: enable evaluation before and after training * style: fix failed cases * fix: disable sam optimizer for nncf task * fix: add frezelayer hook for segmentation * fix: deepcopy instead of shallowcopy * fix: enable temporary disabled features * fix: handle nncf state simply * fix: remove submodule * feat: proper test runner handler * fix: add forcetrainmodehook * fix: make sure model is evaluated before run * fix: more merge conflicts * fix: buffer line by line in userspace * fix: patch torch, etc. only when nncf task is executed * fix: restrict kornia version * fix: restrict version * fix: align data pipeline for supcon * fix: unclutter things * fix: ignore annoying leftover data.yaml Co-authored-by: Songki Choi <songki.choi@intel.com>
Summary
Reason for the bug
When a process is spawned, the default multiprocessing start method is set to "spawn". This raises an error when a dataloader is used with num_workers > 0, because the dataloader holds a DatasetItemEntity, which has a thread lock attribute, and a thread lock cannot be pickled.
I don't know the exact reason, but when a new process is forked instead, an unpicklable argument can still be passed to it.
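
The pickling failure can be reproduced outside OTX. The snippet below is a minimal sketch, not the OTX code: `ItemWithLock` is a hypothetical stand-in for DatasetItemEntity, and `can_pickle` is a helper made up for illustration. With the "spawn" start method, Python pickles the objects handed to the child process, so anything holding a `threading.Lock` fails. With "fork", the child inherits the parent's memory and the arguments are not pickled, which would likely explain why forking works here.

```python
import pickle
import threading


class ItemWithLock:
    """Hypothetical stand-in for a dataset item that holds a thread lock."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()  # this attribute cannot be pickled


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


item = ItemWithLock(42)
lock_is_picklable = can_pickle(item)            # lock attribute blocks pickling
plain_is_picklable = can_pickle({"value": 42})  # plain data pickles fine

print("item with lock picklable:", lock_is_picklable)
print("plain dict picklable:", plain_is_picklable)
```

If an object like this has to cross a "spawn" boundary, one common approach is to drop the lock in `__getstate__` and recreate it in `__setstate__`; whether that suits DatasetItemEntity is not something this PR states.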