test_multiprocessing_parallel_manager.py causes occasional CI failures after 6h timeout #1158
Comments
Hmm... that test is not related to #1151. I have no idea why it hangs, since it's a simple MPI test that passes and gathers arrays, e.g. whether it's caused by the MPI backend rather than WaterTAP code. I don't really think it's even needed for the multiprocessing manager, as none of the functions being tested are used by multiprocessing; they are just there for MPI compatibility, which multiprocessing does not use. Effectively, the multiprocessing manager takes the stance that everything needs to template MPI functionality, but neither concurrent futures, multiprocessing, nor ray use any of the functions exercised by these tests (test_multprocessing, test_raio, or test_concurentfutures). This is a code design question for @bknueven
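For reference, the scatter/gather round-trip described above looks roughly like the following in mpi4py terms; this is a minimal illustration only, not the actual WaterTAP test, and the array contents are made up:

```python
# Minimal mpi4py illustration (not the actual WaterTAP test) of the pattern
# described above: root scatters chunks of an array to all ranks and gathers
# the per-rank results back. Run with:
#   mpiexec -n 4 python scatter_gather_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Root builds the full array and splits it into one chunk per rank.
chunks = np.array_split(np.arange(16), size) if rank == 0 else None

local = comm.scatter(chunks, root=0)    # each rank receives its chunk
local = local * 2                       # some local work
gathered = comm.gather(local, root=0)   # root collects results from all ranks

if rank == 0:
    print(np.concatenate(gathered))
```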
Trying to track this down. Based on the screenshot, it looks like the three tests in test_multiprocessing_parallel_manager.py ran. But that file only has three tests: https://github.com/watertap-org/watertap/blob/main/watertap/tools/parallel/tests/test_multiprocessing_parallel_manager.py. So maybe the issue is in loading the next module up, which is https://github.com/watertap-org/watertap/blob/main/watertap/tools/parallel/tests/test_parallel_manager.py?
@lbianchi-lbl will look into ways to instrument the tests so that we have some insight into what went wrong if/when it fails again.
The current plan is to follow @bknueven's idea to try to enable logging and/or printed output for the affected tests. Since the failures seem to be non-deterministic, this would allow us to get some extra information when failures do occur without enabling verbose logging for the entire test suite. |
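One possible way to do this (a sketch only; the conftest.py location and logger name below are assumptions, not what will necessarily be implemented) is an autouse fixture in the parallel tests' own conftest.py that turns on verbose logging just for those tests:

```python
# Hypothetical watertap/tools/parallel/tests/conftest.py (sketch only):
# enable verbose log output for the tests in this directory without
# changing the logging configuration of the rest of the test suite.
import logging
import sys

import pytest


@pytest.fixture(autouse=True)
def verbose_parallel_logging():
    # "watertap.tools.parallel" is an assumed logger name for illustration.
    logger = logging.getLogger("watertap.tools.parallel")
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        yield
    finally:
        logger.setLevel(old_level)
        logger.removeHandler(handler)
```

Since pytest only shows captured output for tests that finish, the affected tests would need to run with -s (or pytest's live-log options) for this output to be visible when a test hangs.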
Seeing this on a different test on #1151 (screenshot attached). Maybe there's something special about 38%?
From #1201 (where I introduced a marker that allows the tests located under […] to be run separately). (In passing runs, […].) Now we just need to re-run until the timeout condition occurs in an open tab so that we can see the log...
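In broad strokes (a sketch only; the actual marker name and mechanics introduced in #1201 may differ), marker-based selection in pytest looks like this:

```python
# Sketch only: the marker name "parallel_tools" is hypothetical and not
# necessarily what #1201 introduced.
#
# Register the marker, e.g. in setup.cfg / pytest.ini:
#   [tool:pytest]
#   markers =
#       parallel_tools: tests for the parallel manager backends
import pytest


@pytest.mark.parallel_tools
def test_scatter_gather_roundtrip():
    ...


# Then the marked tests can be run on their own (e.g. in a dedicated CI
# step whose log can be watched separately):
#   pytest -m parallel_tools
#   pytest -m "not parallel_tools"   # everything else
```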
Well, it's in the documentation for this method: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue.empty It's a bit frustrating they bother to include it at all. |
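For reference, the caveat in those docs is that empty() (and qsize()) are not reliable for multiprocessing queues, because a put() from another process may still be in flight when empty() is checked, so any loop that drains the queue based on empty() can race. A small illustrative sketch of a drain that avoids empty() entirely (not taken from the WaterTAP code):

```python
# Illustrative only; not the WaterTAP implementation. Collect results
# from worker processes without ever consulting Queue.empty().
import multiprocessing as mp
import queue  # provides the queue.Empty exception raised by get(timeout=...)


def drain(results, expected, timeout=5.0):
    """Collect up to `expected` items from a multiprocessing.Queue."""
    items = []
    for _ in range(expected):
        try:
            items.append(results.get(timeout=timeout))
        except queue.Empty:
            # Stop (or retry/log) instead of spinning on empty(), which can
            # return True while another process's put() is still flushing.
            break
    return items
```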
It's really frustrating, as this behavior has been so difficult to reproduce. In all the local runs I did on Windows I never had hangs (unless I ran out of RAM, or something else caused a worker to die). Is there a way to implement a timeout and retry for pytest? Sometimes multiprocessing/ray can fail due to pickle issues or other OS-related problems that cause the worker to die or hang.
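There are existing plugins that do roughly this (a suggestion only; neither is currently known to be a test dependency of the repo): pytest-timeout can fail an individual test after a fixed number of seconds instead of letting the job run into the 6 h limit, and pytest-rerunfailures can retry it. A sketch assuming both are installed:

```python
# Sketch assuming `pip install pytest-timeout pytest-rerunfailures`
# (an assumption: neither plugin is known to be set up in watertap).
import pytest


# Fail the test after 10 minutes instead of hanging the CI job for 6 h,
# and retry up to twice in case the failure is transient.
@pytest.mark.timeout(600)
@pytest.mark.flaky(reruns=2, reruns_delay=10)
def test_multiprocessing_scatter_gather():
    ...
```

The same can be applied suite-wide from the command line (pytest --timeout=600 --reruns 2). One caveat: pytest-timeout's default thread-based method cannot always interrupt a test blocked on a hung child process, so --timeout-method=signal (POSIX only) or a subprocess-level timeout may be needed for this particular hang.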
Unless this starts happening significantly more often, I'd suggest waiting until the PS code is moved to its own repository. At that point, we can think of ways to test these issues more systematically (e.g. having a dedicated CI workflow that runs a large number of replicas of this test in parallel).
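Once that move happens, the stress test could be as simple as a small harness that launches many pytest replicas of the suspect module concurrently and reports any replica that misses a deadline; a rough sketch (the file path, replica count, and deadline below are made-up examples):

```python
# Rough flake-hunting sketch: run N independent pytest replicas of the
# suspect test module concurrently and flag any replica that does not
# finish within the deadline. Path, counts, and deadline are examples.
import concurrent.futures
import subprocess
import sys

TEST_FILE = "watertap/tools/parallel/tests/test_multiprocessing_parallel_manager.py"
REPLICAS = 20
DEADLINE_S = 600  # far below the 6 h CI job timeout


def run_one(replica):
    try:
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-x", TEST_FILE],
            capture_output=True,
            text=True,
            timeout=DEADLINE_S,
        )
        return f"replica {replica}: exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        return f"replica {replica}: HUNG (> {DEADLINE_S}s)"


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for result in pool.map(run_one, range(REPLICAS)):
            print(result)
```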
See also: #1102 (comment)