[BUG]: test_dfencoder_distributed_e2e intermittently fails with a raised ProcessRaisedException #1021

Closed
2 tasks done
Tracked by #1141
dagardner-nv opened this issue Jul 6, 2023 · 0 comments · Fixed by #1113
Labels: bug (Something isn't working); dfp ([Workflow] Related to the Digital Fingerprinting (DFP) workflow)

Comments

@dagardner-nv
Contributor

dagardner-nv commented Jul 6, 2023

Version

23.07

Which installation method(s) does this occur on?

Source

Describe the bug.

This test fails intermittently. When run repeatedly, it first failed on the 51st iteration.

Minimum reproducible example

pytest --run_slow tests/dfencoder/test_dfencoder_distributed_e2e.py
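
Because the failure is intermittent, a single run of the command above will usually pass. The following is a minimal sketch of a repeat-until-failure loop, not the procedure actually used for this report. The helper name `run_until_failure` and the iteration cap are hypothetical; only the pytest command comes from the report.

```python
import subprocess
import sys


def run_until_failure(cmd, max_iterations=100):
    """Run cmd repeatedly.

    Returns the 1-based iteration on which cmd first exited non-zero,
    or None if every iteration succeeded.
    """
    for iteration in range(1, max_iterations + 1):
        # check=False: a failing run is the event we are looking for,
        # so it must not raise CalledProcessError.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            return iteration
    return None


# Example call (assumes it is started from the root of a Morpheus checkout):
#
#   cmd = [sys.executable, "-m", "pytest", "--run_slow",
#          "tests/dfencoder/test_dfencoder_distributed_e2e.py"]
#   print(run_until_failure(cmd, max_iterations=100))
```

The cap should be well above the iteration count at which the failure was seen, since a flaky test may need many runs before it fails.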

Relevant log output

        original_trace = self.error_queues[error_index].get()
        msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
        msg += original_trace
>       raise ProcessRaisedException(msg, error_index, failed_process.pid)
E       torch.multiprocessing.spawn.ProcessRaisedException: 
E       
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/home/dagardner/work/morpheus/morpheus/models/dfencoder/multiprocessing.py", line 30, in _wrap
E           fn(i, *args)
E         File "/home/dagardner/work/morpheus/tests/dfencoder/test_dfencoder_distributed_e2e.py", line 176, in _run_test
E           assert min(losses) < LOSS_TARGETS[loss_type][ft] * LOSS_TOLERANCE_RATIO
E       AssertionError

../conda/envs/morpheus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160: ProcessRaisedException

Full env printout

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
dagardner-nv added the bug (Something isn't working) label Jul 6, 2023
dagardner-nv self-assigned this Jul 31, 2023
dagardner-nv moved this from Todo to Review in Morpheus Boards Jul 31, 2023
mdemoret-nv added the dfp ([Workflow] Related to the Digital Fingerprinting (DFP) workflow) label Aug 21, 2023
mdemoret-nv added this to the 23.11 - DFP Improvements milestone Aug 21, 2023
rapids-bot bot pushed a commit that referenced this issue Aug 22, 2023
* Call `manual_seed` from within the subprocess so that the subprocess runs deterministically.
* Add a sleep to a busy-loop in `morpheus/models/dfencoder/multiprocessing.py`.
* Misc pylint fixes.
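
The first item can be illustrated with a small sketch. It is not the code from the fix: the standard library's `random.seed` stands in for the `manual_seed` call, and a fresh interpreter started with `subprocess` stands in for a spawned worker. The names `CHILD_CODE` and `run_child` are hypothetical. The point it shows is that a freshly started child process does not inherit random number generator state from its parent, so the seed has to be applied inside the child.

```python
import subprocess
import sys

# Code run by each child process. The seed is applied inside the child,
# because a freshly started interpreter does not see a seed set in the parent.
CHILD_CODE = """
import random
import sys

random.seed(int(sys.argv[1]))  # stand-in for calling manual_seed in the worker
print(random.random())
"""


def run_child(seed):
    """Start a fresh interpreter, seed it, and return its first random draw."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(seed)],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(completed.stdout.strip())


# Two children given the same seed produce the same draw;
# a different seed produces a different one.
same_a = run_child(42)
same_b = run_child(42)
other = run_child(43)
print(same_a == same_b, same_a != other)
```

If the seed were set only in the parent, each child would start from its own default generator state, which is consistent with a loss threshold being missed on some runs and not others.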

fixes #1021

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1113
github-project-automation bot moved this from Review - Ready for Review to Done in Morpheus Boards Aug 22, 2023