[OTX] Bugfix: multi GPU raise error when num_workers isn't set as 0. #1475

Merged 2 commits on Jan 3, 2023


@eunwoosh eunwoosh commented Jan 2, 2023

Summary

  • Fix a bug where an error is raised during multi-GPU training with a non-zero num_workers.

Reason of the bug

When a process is spawned, the default multiprocessing start method is set to "spawn". This raises an error when a dataloader is used with num_workers > 0.
This is because the dataloader holds a DatasetItemEntity, which has a thread lock attribute, and a thread lock cannot be pickled.
I don't know the exact reason, but when a new process is forked, an unpicklable argument can be passed to it.
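
The difference between the two start methods can be reproduced outside of OTX. The following is a minimal sketch, not the code changed in this PR: `ItemWithLock` is a hypothetical stand-in for DatasetItemEntity, and `read_value`, `is_picklable` and `run_in_child` are illustrative helpers. With "spawn", the arguments of a new process are pickled in the parent, so the lock triggers a TypeError. With "fork", the child receives a copy of the parent's memory and nothing is pickled.

```python
import multiprocessing as mp
import pickle
import threading


class ItemWithLock:
    """Hypothetical stand-in for an object that holds a thread lock."""

    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()  # thread locks cannot be pickled


def read_value(item, queue):
    """Runs in the child process and sends the item's value back."""
    queue.put(item.value)


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


def run_in_child(start_method, item):
    """Pass item to a child process created with the given start method."""
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    proc = ctx.Process(target=read_value, args=(item, queue))
    proc.start()  # "spawn" pickles the arguments here; "fork" does not
    result = queue.get(timeout=10)
    proc.join()
    return result


if __name__ == "__main__":
    item = ItemWithLock(42)
    print("picklable:", is_picklable(item))
    if "fork" in mp.get_all_start_methods():
        print("fork:", run_in_child("fork", item))
    try:
        run_in_child("spawn", item)
    except TypeError as err:
        print("spawn failed:", err)
```

"fork" is only available on POSIX systems, which is why the sketch checks `mp.get_all_start_methods()` before using it.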

@eunwoosh eunwoosh requested a review from a team as a code owner January 2, 2023 08:43
@github-actions github-actions bot added the CLI Any changes in OTE CLI label Jan 2, 2023
@eunwoosh eunwoosh changed the title [OTX] bugfix: multi GPU raise error when num_workers isn't set as 0. [OTX] Bugfix: multi GPU raise error when num_workers isn't set as 0. Jan 2, 2023
@eunwoosh eunwoosh merged commit c076902 into feature/otx Jan 3, 2023
@eunwoosh eunwoosh deleted the es/multi_gpu_num_worker_fix branch January 3, 2023 04:47
cih9088 added a commit to cih9088/training_extensions that referenced this pull request Jan 5, 2023
cih9088 added a commit to cih9088/training_extensions that referenced this pull request Jan 6, 2023
cih9088 added a commit to cih9088/training_extensions that referenced this pull request Jan 9, 2023
goodsong81 added a commit that referenced this pull request Jan 16, 2023
* Update MPA submodule to origin/otx

* [OTX-MMCV] Public mmdetection (#1382)

Enable model training and NNCF in mmdet (#1355)

* Enable detection training on latest mmcv/det
- ATSS / SSD / YOLOX
- NNCF support for ATSS

* fix: import errors

* feat: add monkey patch to mmdet modules
- most of patches would be just wrapping for not tracing in nncf context

* feat: add trainable yolox
- add trainable yolox
- recursively search dataset cfg for nested dataset classes

* fix: change device to cpu when nncf tracing

* feat: add trainable ssd

* refactor: rearrange nncf adapter

* feat: add trainable mask rcnn models

* refactor: move out common utils

* fix: ssd head bug

* feat: add lr scheduler for accuracy aware runner

* refactor: nncf module and monkey patch

* fix: proper clustering anchors for ssd

* fix: unable to trace the first module in NNCFNetwork

* fix: bring back ssd head structure

* feat: add train_step method to NNCFNetwork

* fix: mismatches

* fix: update pipeline for wrapper

* fix: add missing file

* Fix merge error

* Enable model training and NNCF in mmseg (#1400)

* refactor: remove redundant

* feat: enable mmseg training

* feat: add nncf related stuff

* fix: change lr config

* fix: align nncf target metric

* refactor: use mpa for training and inference

* test: enable tests

* fix: minor bug

* refactor: patcher

* fix: build consistent nncf graph

* fix: minor bug

* fix: remove unused backup

* fix: dealt with datacontainer

* [OTX-MMCLS] Enable NNCF (#1435)

* fix: use patcher

* feat: update mmcls version

* feat: enable NNCF for mmcls

* refactor: add build NNCF model functions

* fix: minor bug

* fix: typo

* fix: make sure importing nncf when enabled only

* fix: inherit from base super class of otx

* [OTX] Introduce mmdeploy to export cls/seg/det models (#1466)

* feat: export using mmdeploy

* fix: adapt mmdeploy exported model

* test: enable openvino export

* fix: patch depending on fn type

* feat: mmdeploy for classification model

* test: enable export and openvino performance test

* fix: change temporary requirements

* refactor: use builder

* fix: do not propagate logger

* fix: remove image channel format conversion

* fix: handle unlabeled data

* fix: run eval before optimizing nncf network

* feat: change confidence threshold after nncf optimization

* fix: remove redundant attribute

* fix: official released openvino version

* fix: remove redundants

* feat: public mm series libraries

* feat: otx refactoring and bug fix

* Revert "[OTX] Bugfix: multi GPU raise error when num_workers isn't set as 0. (#1475)"

This reverts commit c076902.

* feat: enable multi-nodes distributed training

* fix: redundant parts

* Revert "[OTX] Evaluate a model before training starts (#1472)"

This reverts commit f728295.

* feat: enable evaluation before and after training

* style: fix failed cases

* fix: disable sam optimizer for nncf task

* fix: add freezelayer hook for segmentation

* fix: deepcopy instead of shallowcopy

* fix: enable temporary disabled features

* fix: handle nncf state simply

* fix: remove submodule

* feat: proper test runner handler

* fix: add forcetrainmodehook

* fix: make sure model is evaluated before run

* fix: more merge conflicts

* fix: buffer line by line in userspace

* fix: patch torch, etc. only when nncf task is executed

* fix: restrict kornia version

* fix: restrict version

* fix: align data pipeline for supcon

* fix: unclutter things

* fix: ignore annoying leftover data.yaml

Co-authored-by: Songki Choi <songki.choi@intel.com>
Labels
CLI Any changes in OTE CLI