Fix a bug where training gets stuck when a detection model is trained in a distributed environment #3904

Merged
3 commits merged on Aug 29, 2024

Conversation

@eunwoosh (Contributor) commented Aug 28, 2024

Summary

This PR fixes #3635

How this change solves the bug

In distributed training, the torchmetrics package syncs metric values across processes.
If one process has values for bboxes, labels, or scores and another process does not, torchmetrics creates an empty torch Tensor for the missing values.
However, if the dtype of those tensors differs between processes, the sync hangs and training gets stuck.
To avoid this, this PR specifies the dtype explicitly.
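
The idea behind the fix, as a minimal sketch: build the metric inputs with explicit dtypes so that an empty tensor on a rank with no detections has the same dtype as a populated tensor on another rank. The helper and names below are illustrative only, not the actual OTX code.

# Minimal sketch, assuming detections arrive as Python lists/arrays and are
# converted to tensors before being handed to torchmetrics.
# All names here are hypothetical, not taken from the OTX code base.
import torch

def to_metric_input(bboxes, labels, scores):
    # Fix the dtypes even for empty inputs, so every process produces tensors
    # with identical dtypes and the cross-process sync cannot hang.
    return {
        "boxes": torch.as_tensor(bboxes, dtype=torch.float32).reshape(-1, 4),
        "scores": torch.as_tensor(scores, dtype=torch.float32).reshape(-1),
        "labels": torch.as_tensor(labels, dtype=torch.int64).reshape(-1),
    }

Without the explicit dtype arguments, torch.as_tensor([]) defaults to float32 while a list of Python ints becomes int64, so a rank with no detections and a rank with detections could end up with mismatched label dtypes.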

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.20%. Comparing base (0a395b2) to head (09bc0b1).
Report is 1 commit behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3904      +/-   ##
===========================================
- Coverage    81.21%   81.20%   -0.02%     
===========================================
  Files          283      283              
  Lines        27169    27169              
===========================================
- Hits         22065    22062       -3     
- Misses        5104     5107       +3     
Flag    Coverage      Δ
py310   ?
py311   81.20% <ø>    +<0.01% ⬆️


@eunwoosh eunwoosh requested a review from sungchul2 August 29, 2024 00:31
@eunwoosh eunwoosh enabled auto-merge August 29, 2024 00:31
@eunwoosh eunwoosh added this pull request to the merge queue Aug 29, 2024
Merged via the queue into openvinotoolkit:develop with commit bc5b7d0 Aug 29, 2024
20 of 21 checks passed
@eunwoosh eunwoosh deleted the fix_det_dist_train branch August 29, 2024 05:40
Successfully merging this pull request may close these issues.

Problems in multi-card distributed training