Decrease batch size if CUDA OOM occurs #2022

eunwoosh · 2023-04-18T00:31:04Z

Summary

Add an argument that decreases the batch size when the current value does not fit in CUDA memory.
It currently supports all tasks except anomaly tasks.
It only decreases the batch size, starting from the default value or from the value set through an argument.

Method detail

If enabled, it runs multiple trials with various batch sizes right before training.
To save time, each trial trains for only a single iteration by manipulating the dataset class,
and validation is disabled, both before and after training, where available.
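
The trial loop described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names TruncatedDataset, is_cuda_oom, find_fitting_batch_size and fake_trial are hypothetical, and plain halving is assumed as the search strategy (the PR's own algorithm also considers GPU utilization, which is not modeled here). The only external fact relied on is that PyTorch reports a CUDA OOM as a RuntimeError whose message contains "out of memory".

```python
from typing import Callable, Sequence


class TruncatedDataset:
    """Expose only the first `length` samples of a dataset (hypothetical helper).

    Limiting the dataset to one batch makes each trial run a single iteration.
    """

    def __init__(self, dataset: Sequence, length: int) -> None:
        self._dataset = dataset
        self._length = min(length, len(dataset))

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int):
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._dataset[index]


def is_cuda_oom(error: RuntimeError) -> bool:
    """Return True if the error message looks like a CUDA out-of-memory error."""
    return "out of memory" in str(error).lower()


def find_fitting_batch_size(
    run_trial: Callable[[TruncatedDataset, int], None],
    dataset: Sequence,
    default_bs: int,
    min_bs: int = 1,
) -> int:
    """Halve the batch size until one short trial runs without CUDA OOM."""
    bs = default_bs
    while bs >= min_bs:
        one_batch = TruncatedDataset(dataset, bs)  # one iteration worth of data
        try:
            run_trial(one_batch, bs)  # validation is assumed to be disabled inside
            return bs
        except RuntimeError as error:
            if not is_cuda_oom(error):
                raise  # unrelated failures must not be swallowed
        bs //= 2
    raise RuntimeError("No batch size >= min_bs fits in memory.")


def fake_trial(dataset: TruncatedDataset, batch_size: int) -> None:
    """Stand-in for a real training step: pretend anything above 12 does not fit."""
    if batch_size > 12:
        raise RuntimeError("CUDA out of memory.")


best = find_fitting_batch_size(fake_trial, list(range(1000)), default_bs=64)
print(best)  # 64, 32 and 16 fail in the fake trial, so this prints 8
```

Re-raising errors that are not OOM-related keeps real bugs visible instead of silently shrinking the batch size.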

How to test

You can test it by running otx train with the argument params --learning_parameters.auto_decrease_bs true.

Checklist

I have added unit tests to cover my changes.
I have added integration tests to cover my changes.
I have added e2e tests for validation.
I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
I have linked related issues.

License

I submit my code changes under the same Apache License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

sungmanc

Could you also attach your local E2E tests? We don't have enough time to check with E2E, so local E2E tests are needed for now.

eunwoosh · 2023-04-18T01:43:49Z

> Could you also attach your local E2E tests? We don't have enough time to check with E2E, so local E2E tests are needed for now.

I see. I'll update it after running them.

jaegukhyun · 2023-04-18T01:50:54Z

I have two questions.

1. Should we add this feature to release 1.2? If this is not an urgent issue, I think we can add it in the next release.
2. Don't we need a CLI test for this new feature?

jaegukhyun

Overall LGTM. Could you take a look at my comment?

otx/algorithms/common/adapters/torch/utils/automatic_bs.py

goodsong81

Thank you for the great functionality.
Let's refine the interface a little bit.

otx/algorithms/common/adapters/mmcv/utils/automatic_bs.py

otx/algorithms/common/adapters/torch/utils/__init__.py

otx/algorithms/common/adapters/torch/utils/automatic_bs.py

otx/api/entities/train_parameters.py

otx/cli/tools/train.py

eunwoosh · 2023-04-18T07:55:46Z

I applied all of the comments. Could you review my PR? @jaegukhyun @goodsong81

otx/algorithms/common/adapters/mmcv/utils/automatic_bs.py

sungmanc

Another PR needs to be merged first.

goodsong81

Requesting a few changes. Thanks!
BTW, auto_batch_size might be a better name than auto_adapt_bs. Just an opinion, though :)

otx/algorithms/action/configs/classification/configuration.yaml

otx/algorithms/action/configs/detection/configuration.yaml

otx/algorithms/action/configs/classification/configuration.yaml

otx/algorithms/action/configs/detection/configuration.yaml

otx/algorithms/classification/configs/configuration.yaml

otx/algorithms/common/configs/training_base.py

otx/api/entities/train_parameters.py

goodsong81

Could you revise the warning messages?

otx/algorithms/action/configs/detection/configuration.yaml

otx/algorithms/classification/configs/configuration.yaml

otx/algorithms/action/configs/classification/configuration.yaml

otx/algorithms/common/configs/training_base.py

otx/algorithms/detection/configs/detection/configuration.yaml

otx/algorithms/detection/configs/instance_segmentation/configuration.yaml

otx/algorithms/detection/configs/rotated_detection/configuration.yaml

otx/algorithms/segmentation/configs/configuration.yaml

goodsong81

Could you revise one more time? Thanks!

otx/algorithms/segmentation/configs/ocr_lite_hrnet_18_mod2/template.yaml

otx/algorithms/action/configs/classification/configuration.yaml

goodsong81

Thank you! LGTM.

sungmanc

LGTM

CHANGELOG.md

Co-authored-by: Sungman Cho <sungman.cho@intel.com>

jaegukhyun

I left a comment.

tests/integration/cli/action/test_action_classification.py

goodsong81 · 2023-04-20T11:47:59Z

@eunwoosh I think we can reduce the number of epochs for the training integration test with this feature turned on, because a few epochs are enough to check whether the adjusted batch size actually works.
Let's adjust it later on the develop branch. :)

* HOT-FIX: Revert segmentation model's ignore mode in CLI (#2011) Revert segmentation ignore=True * Improve tiling preprocess (#2013) * prevent timeout during init phase * Fix reg tests (#2008) * Edit regression tests * Change the dataset root * Miss typo * Fix pre-commit * Fix openvino import error due to Tiler init import (#2015) Remove init import for Tiler to prevent OpenVINO import * Bump up version to 1.2.0 (#2017) * Set the python version to "3.10" for code-scan workflow * Add missing __init__.py (#2019) * Add missing __init__.py * Change license * Release 1.2.0rc1 * Fix issue that str2bool not being applied in certain cases (#2023) * Add workaround solution * Fix minor * Remove str int * Fix default dict (#2025) fix: change default to configdict Signed-off-by: Inhyuk Andy Cho <andy.inhyuk.jo@intel.com> * Convert dummy datasets to toy datasets (#1988) * Update cls, det datsets * Remove useless files * Change action datasets * Edit action dataset * change dir * Add xml files * Remove useless * Edite tets * Fix tests * Fix tests * Remove ptc * Remove * Fix precommit * Update dataset, fix cls bug * Remove useless dataset * Edit drop_last * Fix missed part * Change threshold values to unifying * bugfix: squeezing to 1 dimenetion * Change threshold for deployment * Fix multi-gpu issue, e2e tests * Decrease num_workers for tiling test and tiling processes * Revert num_workers for tests * Fix datsets --------- Co-authored-by: eunwoosh <eunwoo.shin@intel.com> * Fix E2E tests (#2032) * Optimize data preprocessing time and enhance overall performance in semantic segmentation (#2020) * HOTFIX: change doc version to 1.2.0 * Add storage cache in Apache Arrow format using Datumaro (#2009) * feat: change label entity to dictionay * feat: add datumaro arrow cache * refacor: move to proper directory * fix: align to the latest * fix: align data to otx * fix: align new version * refactor: disable storage cache for action tasks * test: fix * fix: version back * docs: add to 
changelog * fix: keep __height, __width * docs: add description * test: revert tests * fix: revert back to list * style: ruff * HOT-FIX: Revert segmentation model's ignore mode in CLI (Develop) (#2012) Revert segmentation ignore=True * fix: make force verbose * test: add storage cache test * feat: datumaro 1.2.0 * test: test path exists * test: do deepcopy * style: make black happy --------- Signed-off-by: Inhyuk Andy Cho <andy.inhyuk.jo@intel.com> Co-authored-by: Harim Kang <harim.kang@intel.com> * Fix typo in prediction_to_annotation_converter.py (#2028) * HOT-FIX: Revert segmentation model's ignore mode in CLI (Develop) (#2012) Revert segmentation ignore=True * Bump up version to 1.3.0rc0 (#2016) * bug fix * del * revert * revert changlog --------- Co-authored-by: Harim Kang <harim.kang@intel.com> Co-authored-by: Songki Choi <songki.choi@intel.com> * Fix tiling config loading bug (#2030) * fix tiling loading bug * fix mypy * Make multi gpu child processes done right after evaluation (#2033) * Fixes in exportable code (#2031) * Create Actions domain and task type despite flag * Move import torch to the function * Fix str in dump_features * Move dump_frames to otx/api/utils * Remove __pycache__ from dunped exportable code * Add tests for demo --output option * Update sha for exportable code requirements * Add flag to task_type_to_label_domain * Roll back tests creation to add it in separate PR * Add FEATURE_FLAGS_OTX_ACTION_TASKS init in demo.py * Remove extra comments * Fix linter * Add documentation for the noisy label detection feature (#2034) * Add documentation for noisy label detection feature Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com> * Update CHANGELOG.md Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com> * Add documentation Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com> * Fix typo Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com> * Fix small typo --------- Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com> Co-authored-by: Songki Choi 
<songki.choi@intel.com> * Remove skip tests, fix regression tests (#2036) * Remove skip tests, fix regression tests * Fix precommit * Hide internal options from external GUI (#2037) Signed-off-by: Songki Choi <songki.choi@intel.com> * Add unit test for classification task and configurer (#2035) * Reduce the depth of aumix process (#2038) Enable light augmix * Change samples_per_gpu in _infer_model(Detection) (#2041) Change samples_per_gpu in _infer_model * Decrease batch size if CUDA OOM occurs (#2022) * implement adpating bs * refine impl * implement adaptive bs also in cls, seg task * refine adapt bs algo to consider gpu util * refactor code * write comment and docstring * implement decreasig bs on action task * update learning rate after decreasing batch size * implement test code of mmcv automatic_bs file * remove meta modification * remove unused improt * implement test code of torch automatic_bs file * align with pre commit * add line to tell not supporting anomaly * update CHANGELOG * update docs * change argument help * change file name * apply pr comment * add auto_decrease_bs in learning parameters * align with pre commit * fix typo * add integration test * bugfix * update test code * not execute algo in nncf * suppor nncf * apply comment * align with pre commit * change method to set value * refine warning comment * remove breakpoint * make hpo not use auto decrease batch size * refine warning & typo fix * align with pre commit * Update CHANGELOG.md Co-authored-by: Sungman Cho <sungman.cho@intel.com> * update unit test * update integration test * bufix * Release 1.2.0rc2 Signed-off-by: Songki Choi <songki.choi@intel.com> * Update OTX commit hash for exportable code requiements --------- Signed-off-by: Inhyuk Andy Cho <andy.inhyuk.jo@intel.com> Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com> Signed-off-by: Songki Choi <songki.choi@intel.com> Co-authored-by: Harim Kang <harim.kang@intel.com> Co-authored-by: Eugene Liu <eugene.liu@intel.com> 
Co-authored-by: Sungman Cho <sungman.cho@intel.com> Co-authored-by: Yunchu Lee <yunchu.lee@intel.com> Co-authored-by: Jaeguk Hyun <jaeguk.hyun@intel.com> Co-authored-by: Inhyuk Cho <andy.inhyuk.jo@intel.com> Co-authored-by: eunwoosh <eunwoo.shin@intel.com> Co-authored-by: Soobee Lee <soobee.lee@intel.com> Co-authored-by: Galina Zalesskaya <galina.zalesskaya@intel.com> Co-authored-by: Vinnam Kim <vinnam.kim@intel.com>
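
One of the commits listed above is "update learning rate after decreasing batch size". The rule used by the PR is not shown in this conversation. A minimal sketch, assuming the common linear scaling rule (learning rate proportional to batch size), could look like this. The function name rescale_learning_rate is hypothetical.

```python
def rescale_learning_rate(lr: float, old_bs: int, new_bs: int) -> float:
    """Scale the learning rate linearly with the batch size (assumed rule)."""
    if old_bs <= 0 or new_bs <= 0:
        raise ValueError("Batch sizes must be positive.")
    return lr * new_bs / old_bs


# Example: the batch size was reduced from 64 to 16, so the rate shrinks 4x.
new_lr = rescale_learning_rate(0.01, old_bs=64, new_bs=16)
print(new_lr)
```

Other rules (for example, square-root scaling) are also used in practice, so the choice here is only illustrative.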

github-actions bot added the ALGO, API, CLI, and TEST labels Apr 18, 2023

eunwoosh force-pushed the es/decrease_bs_necessary branch from 37ec5d7 to ce3489c April 18, 2023 00:35

eunwoosh added this to the 1.2.0 milestone Apr 18, 2023

github-actions bot added the DOC label Apr 18, 2023

eunwoosh marked this pull request as ready for review April 18, 2023 01:39

eunwoosh requested a review from a team as a code owner April 18, 2023 01:39

sungmanc requested changes Apr 18, 2023

jaegukhyun reviewed Apr 18, 2023

otx/algorithms/common/adapters/torch/utils/automatic_bs.py (outdated, resolved)

goodsong81 requested changes Apr 18, 2023

github-actions bot removed the CLI label Apr 18, 2023

eunwoosh modified the milestones: 1.2.0, 1.3.0 Apr 18, 2023

eunwoosh changed the base branch from releases/1.2.0 to develop April 18, 2023 07:15

eunwoosh force-pushed the es/decrease_bs_necessary branch from 2ce4479 to f5fab15 April 18, 2023 07:17

eunwoosh requested review from goodsong81, jaegukhyun and sungmanc April 18, 2023 07:55

harimkang reviewed Apr 18, 2023

otx/algorithms/common/adapters/mmcv/utils/automatic_bs.py (resolved)

sungmanc requested changes Apr 19, 2023

goodsong81 requested changes Apr 19, 2023

eunwoosh added 4 commits April 20, 2023 11:24

implement adpating bs

b7c9656

refine impl

6612630

implement adaptive bs also in cls, seg task

1537645

refine adapt bs algo to consider gpu util

3b16392

goodsong81 requested changes Apr 20, 2023

eunwoosh added 3 commits April 20, 2023 15:55

refine warning comment

8ebb478

remove breakpoint

af94e67

make hpo not use auto decrease batch size

a9a9835

goodsong81 requested changes Apr 20, 2023

otx/algorithms/segmentation/configs/ocr_lite_hrnet_18_mod2/template.yaml (outdated, resolved)

otx/algorithms/action/configs/classification/configuration.yaml (outdated, resolved)

github-actions bot added the CLI label Apr 20, 2023

refine warning & typo fix

b96af4c

goodsong81 previously approved these changes Apr 20, 2023

sungmanc previously approved these changes Apr 20, 2023

CHANGELOG.md (outdated, resolved)

align with pre commit

493005d

eunwoosh dismissed stale reviews from sungmanc and goodsong81 via 493005d April 20, 2023 07:34

Update CHANGELOG.md

f1a1673

Co-authored-by: Sungman Cho <sungman.cho@intel.com>

goodsong81 previously approved these changes Apr 20, 2023

update unit test

ee4e226

eunwoosh dismissed goodsong81’s stale review via ee4e226 April 20, 2023 07:51

jaegukhyun reviewed Apr 20, 2023

tests/integration/cli/action/test_action_classification.py (resolved)

update integration test

4260b8d

eunwoosh modified the milestones: 1.3.0, 1.2.0 Apr 20, 2023

jaegukhyun previously approved these changes Apr 20, 2023

goodsong81 previously approved these changes Apr 20, 2023

bufix

a76197b

eunwoosh dismissed stale reviews from goodsong81 and jaegukhyun via a76197b April 20, 2023 10:41

goodsong81 approved these changes Apr 20, 2023

goodsong81 merged commit d705378 into openvinotoolkit:releases/1.2.0 Apr 20, 2023

eunwoosh deleted the es/decrease_bs_necessary branch April 21, 2023 01:21
