
test_multiprocessing_parallel_manager.py causes occasional CI failures after 6h timeout #1158

Open · lbianchi-lbl opened this issue Sep 29, 2023 · 12 comments · Fixed by #1211
Labels: bug (Something isn't working), parameter-sweep, Priority:Normal (Normal Priority Issue or PR), tools

@lbianchi-lbl
Contributor

lbianchi-lbl commented Sep 29, 2023

[screenshot of the hanging CI test run]

See also: #1102 (comment)

@lbianchi-lbl lbianchi-lbl added bug Something isn't working tools labels Sep 29, 2023
@ksbeattie ksbeattie added the Priority:High High Priority Issue or PR label Oct 5, 2023
@lbianchi-lbl
Contributor Author

lbianchi-lbl commented Oct 19, 2023

Judging from #1151 (comment), this might be related to #1151. @avdudchenko, can you elaborate on this? Would merging in #1151 resolve this issue? EDIT: this doesn't seem to be the case (see comment below)

@avdudchenko
Contributor

Hmm, that test is not related to #1151.

I have no idea why it hangs, since it's a simple MPI test that passes and gathers arrays; that suggests it's caused by the MPI backend rather than WaterTAP code.

I don't really think it's even needed for the multiprocessing manager: none of the functions being tested are used by multiprocessing, and they're there only for MPI compatibility, which multiprocessing doesn't use.

Effectively, the multiprocessing manager takes the stance that everything needs to follow the MPI template, but neither concurrent futures, multiprocessing, nor Ray uses any of the functions exercised by these tests (test_multiprocessing, test_ray, or test_concurrent_futures). This is a code design question for @bknueven.

@bknueven
Contributor

Trying to track this down. Based on the screenshot, it looks like three tests in test_multiprocessing_parallel_manager.py pass? (The three green dots.)

But that file only has three tests: https://github.com/watertap-org/watertap/blob/main/watertap/tools/parallel/tests/test_multiprocessing_parallel_manager.py.

So maybe the issue is in loading the next module up, which is https://github.com/watertap-org/watertap/blob/main/watertap/tools/parallel/tests/test_parallel_manager.py?

@ksbeattie
Contributor

@lbianchi-lbl will look into ways to instrument the tests so that we have more insight into what went wrong if/when it fails again.

@lbianchi-lbl
Contributor Author

The current plan is to follow @bknueven's idea to try to enable logging and/or printed output for the affected tests. Since the failures seem to be non-deterministic, this would allow us to get some extra information when failures do occur without enabling verbose logging for the entire test suite.
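(As a reference for that plan, pytest's built-in log capturing can be turned on for individual tests via the caplog fixture, so only the affected tests become verbose. A minimal sketch; the logger name and test body are placeholders, not actual WaterTAP code:)

```python
import logging

# Placeholder logger name -- the real tests would use the watertap.tools.parallel loggers.
_log = logging.getLogger("watertap.tools.parallel")


def test_with_extra_logging(caplog):
    # Capture DEBUG-level records for this test only, so the rest of the suite
    # keeps its normal verbosity; captured output is shown when the test fails.
    caplog.set_level(logging.DEBUG, logger=_log.name)
    _log.debug("starting parallel run")  # stand-in for the real test body
    assert "starting parallel run" in caplog.messages
```

(The same effect can be had suite-wide with --log-cli-level=DEBUG, but the per-test approach avoids flooding the CI logs.)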

@bknueven
Contributor

[Screenshot 2023-11-10 at 9:21:59 AM]

Seeing this on a different test on #1151. Maybe there's something special about 38%?

@lbianchi-lbl
Contributor Author

lbianchi-lbl commented Nov 15, 2023

From #1201 (where I introduced a marker that allows running tests located under watertap.tools separately from those that aren't), it looks like the problem is indeed somewhere under watertap.tools:

[screenshot of the CI run]

(In passing runs, pytest -m tools takes 1-2 minutes).

Now we just need to keep re-running in an open tab until the timeout condition occurs, so that we can see the log...
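(For reference, the kind of split introduced in #1201 can be done with a custom pytest marker applied from a conftest.py. The sketch below is only an illustration of the mechanism, not the actual #1201 implementation:)

```python
# conftest.py -- illustrative sketch, not the actual change from #1201
import pytest


def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark.
    config.addinivalue_line("markers", "tools: tests located under watertap.tools")


def pytest_collection_modifyitems(config, items):
    # Tag every test collected from watertap/tools so the suite can be split
    # with `pytest -m tools` / `pytest -m "not tools"`.
    for item in items:
        if "watertap/tools" in item.nodeid:
            item.add_marker(pytest.mark.tools)
```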

@lbianchi-lbl
Contributor Author

After a lot of poking and prodding, it looks like the culprit is multiprocessing.Queue.empty() hanging:
[screenshot of the stack trace showing the hang in multiprocessing.Queue.empty()]

@bknueven
Contributor

Well, it's right there in the documentation for this method: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue.empty

It's a bit frustrating that they bother to include the method at all.
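(For context: because of the queue's pipe/feeder-thread design, the docs note that empty() and qsize() are not reliable, so code that polls empty() can spin or hang. A common workaround is to skip empty() entirely and drain the queue with a blocking get() plus a timeout and a sentinel. The sketch below is a generic illustration of that pattern, not WaterTAP code:)

```python
import queue  # provides the Empty exception raised by get(timeout=...)
from multiprocessing import Process, Queue


def producer(q):
    for i in range(3):
        q.put(i)
    q.put(None)  # sentinel: tells the consumer there is nothing more to read


def drain(q, timeout=5):
    # Block on get() with a timeout instead of polling q.empty(); a timeout
    # means the producer died or hung, so we fail loudly rather than waiting forever.
    results = []
    while True:
        try:
            item = q.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError(f"producer did not finish within {timeout}s")
        if item is None:
            return results
        results.append(item)


if __name__ == "__main__":
    q = Queue()
    p = Process(target=producer, args=(q,))
    p.start()
    print(drain(q))  # expected: [0, 1, 2]
    p.join()
```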

@bknueven
Contributor

bknueven commented Feb 2, 2024

@bknueven bknueven reopened this Feb 2, 2024
@avdudchenko
Contributor

https://github.com/watertap-org/watertap/actions/runs/7761451268/job/21169940288

It's really frustrating, as this behavior has been so difficult to reproduce. In all local runs I did on Windows, I never had hangs (unless I ran out of RAM, or something else caused a worker to die).

Is there a way to implement a timeout and retry for pytest? Sometimes multiprocessing/Ray can fail due to pickling issues or other OS-related problems that cause a worker to die or hang.
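(For reference: the third-party pytest-timeout and pytest-rerunfailures plugins provide per-test timeouts and retries. A minimal sketch, assuming both plugins are installed and using a made-up test body:)

```python
import pytest


@pytest.mark.timeout(300)      # pytest-timeout: fail the test if it runs past 5 minutes
@pytest.mark.flaky(reruns=2)   # pytest-rerunfailures: retry a failed test up to twice
def test_parallel_results_gather():
    # Stand-in body; the real tests exercise the parallel managers.
    assert sum(range(4)) == 6
```

(The same can be set globally with --timeout=300 and --reruns 2 on the command line; note that a hard hang inside native/MPI code may require pytest-timeout's thread method, which aborts the whole test session when it fires.)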

@lbianchi-lbl
Contributor Author

Unless this starts happening significantly more often, I'd suggest waiting until the parameter sweep (PS) code is moved to its own repository. At that point, we can think of ways to test these issues more systematically (e.g., a dedicated CI workflow that runs a large number of replicas of this test in parallel).
