Fix a bug where training gets stuck when a detection model is trained in a distributed environment #3904

Merged
3 commits merged on Aug 29, 2024

Conversation

@eunwoosh (Contributor) commented Aug 28, 2024

Summary

This PR fixes #3635

How this change solves the bug

In distributed training, the torchmetrics package syncs metric values across processes.
If one process has values for bboxes, labels, or scores and another process does not, torchmetrics creates an empty torch Tensor for the missing values.
However, if the dtype of those tensors differs between processes, the sync hangs and training gets stuck.
To avoid this, this PR specifies the dtype explicitly.
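
The idea behind the fix, as a minimal sketch: build the metric inputs with explicit dtypes so that an empty tensor on a rank with no detections has the same dtype as a populated tensor on another rank. The helper and names below are illustrative only, not the actual OTX code.

# Minimal sketch, assuming detections arrive as Python lists/arrays and are
# converted to tensors before being handed to torchmetrics.
# All names here are hypothetical, not taken from the OTX code base.
import torch

def to_metric_input(bboxes, labels, scores):
    # Fix the dtypes even for empty inputs, so every process produces tensors
    # with identical dtypes and the cross-process sync cannot hang.
    return {
        "boxes": torch.as_tensor(bboxes, dtype=torch.float32).reshape(-1, 4),
        "scores": torch.as_tensor(scores, dtype=torch.float32).reshape(-1),
        "labels": torch.as_tensor(labels, dtype=torch.int64).reshape(-1),
    }

Without the explicit dtype arguments, torch.as_tensor([]) defaults to float32 while a list of Python ints becomes int64, so a rank with no detections and a rank with detections could end up with mismatched label dtypes.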

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.20%. Comparing base (0a395b2) to head (09bc0b1).
Report is 1 commit behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3904      +/-   ##
===========================================
- Coverage    81.21%   81.20%   -0.02%     
===========================================
  Files          283      283              
  Lines        27169    27169              
===========================================
- Hits         22065    22062       -3     
- Misses        5104     5107       +3     
Flag    Coverage      Δ
py310   ?
py311   81.20% <ø>    +<0.01% ⬆️


@eunwoosh eunwoosh requested a review from sungchul2 August 29, 2024 00:31
@eunwoosh eunwoosh enabled auto-merge August 29, 2024 00:31
@eunwoosh eunwoosh added this pull request to the merge queue Aug 29, 2024
Merged via the queue into openvinotoolkit:develop with commit bc5b7d0 Aug 29, 2024
20 of 21 checks passed
@eunwoosh eunwoosh deleted the fix_det_dist_train branch August 29, 2024 05:40
Successfully merging this pull request may close these issues.

Problems in multi-card distributed training