
test_multiprocessing_parallel_manager.py causes occasional CI failures after 6h timeout #1158

Open · lbianchi-lbl opened this issue Sep 29, 2023 · 12 comments · Fixed by #1211
Labels: bug (Something isn't working), parameter-sweep, Priority:Normal (Normal Priority Issue or PR), tools

@lbianchi-lbl
Contributor

lbianchi-lbl commented Sep 29, 2023

[screenshot of the hanging CI test run]

See also: #1102 (comment)

@lbianchi-lbl lbianchi-lbl added bug Something isn't working tools labels Sep 29, 2023
@ksbeattie ksbeattie added the Priority:High High Priority Issue or PR label Oct 5, 2023
@lbianchi-lbl
Contributor Author

lbianchi-lbl commented Oct 19, 2023

Judging from #1151 (comment), this might be related to #1151. @avdudchenko, can you elaborate on this? Would merging in #1151 resolve this issue? EDIT: this doesn't seem to be the case (see comment below)

@avdudchenko
Contributor

Hmm, that test is not related to #1151.

I have no idea why it hangs, since it's a simple MPI test that passes and gathers arrays; that suggests it's caused by the MPI backend rather than WaterTAP code.

I don't really think it's even needed for the multiprocessing manager: none of the functions being tested are used by multiprocessing, and they're there only for MPI compatibility, which multiprocessing doesn't use.

Effectively, the multiprocessing manager takes the stance that everything needs to follow the MPI template, but neither concurrent futures, multiprocessing, nor Ray uses any of the functions exercised by these tests (test_multiprocessing, test_ray, or test_concurrent_futures). This is a code design question for @bknueven.

@bknueven
Contributor

Trying to track this down. Based on the screenshot, it looks like three tests in test_multiprocessing_parallel_manager.py pass? (The three green dots.)

But that file only has three tests: https://github.com/watertap-org/watertap/blob/main/watertap/tools/parallel/tests/test_multiprocessing_parallel_manager.py.

So maybe the issue is in loading the next module up, which is https://github.com/watertap-org/watertap/blob/main/watertap/tools/parallel/tests/test_parallel_manager.py?

@ksbeattie
Contributor

@lbianchi-lbl will look into ways to instrument the tests so that we have more insight into what went wrong if/when it fails again.

@lbianchi-lbl
Contributor Author

The current plan is to follow @bknueven's idea to try to enable logging and/or printed output for the affected tests. Since the failures seem to be non-deterministic, this would allow us to get some extra information when failures do occur without enabling verbose logging for the entire test suite.
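(As a reference for that plan, pytest's built-in log capturing can be turned on for individual tests via the caplog fixture, so only the affected tests become verbose. A minimal sketch; the logger name and test body are placeholders, not actual WaterTAP code:)

```python
import logging

# Placeholder logger name -- the real tests would use the watertap.tools.parallel loggers.
_log = logging.getLogger("watertap.tools.parallel")


def test_with_extra_logging(caplog):
    # Capture DEBUG-level records for this test only, so the rest of the suite
    # keeps its normal verbosity; captured output is shown when the test fails.
    caplog.set_level(logging.DEBUG, logger=_log.name)
    _log.debug("starting parallel run")  # stand-in for the real test body
    assert "starting parallel run" in caplog.messages
```

(The same effect can be had suite-wide with --log-cli-level=DEBUG, but the per-test approach avoids flooding the CI logs.)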

@bknueven
Contributor

[Screenshot 2023-11-10 at 9:21:59 AM]

Seeing this on a different test on #1151. Maybe there's something special about 38%?

@lbianchi-lbl
Contributor Author

lbianchi-lbl commented Nov 15, 2023

From #1201 (where I introduced a marker that allows running tests located under watertap.tools separately from those that aren't), it looks like the problem is indeed somewhere under watertap.tools:

[screenshot of the CI run]

(In passing runs, pytest -m tools takes 1-2 minutes).

Now we just need to keep re-running in an open tab until the timeout condition occurs, so that we can see the log...
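(For reference, the kind of split introduced in #1201 can be done with a custom pytest marker applied from a conftest.py. The sketch below is only an illustration of the mechanism, not the actual #1201 implementation:)

```python
# conftest.py -- illustrative sketch, not the actual change from #1201
import pytest


def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark.
    config.addinivalue_line("markers", "tools: tests located under watertap.tools")


def pytest_collection_modifyitems(config, items):
    # Tag every test collected from watertap/tools so the suite can be split
    # with `pytest -m tools` / `pytest -m "not tools"`.
    for item in items:
        if "watertap/tools" in item.nodeid:
            item.add_marker(pytest.mark.tools)
```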

@lbianchi-lbl
Contributor Author

After a lot of poking and prodding, it looks like the culprit is multiprocessing.Queue.empty() hanging:
[screenshot of the stack trace showing the hang in multiprocessing.Queue.empty()]

@bknueven
Contributor

Well, it's right there in the documentation for this method: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue.empty

It's a bit frustrating that they bother to include the method at all.
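(For context: because of the queue's pipe/feeder-thread design, the docs note that empty() and qsize() are not reliable, so code that polls empty() can spin or hang. A common workaround is to skip empty() entirely and drain the queue with a blocking get() plus a timeout and a sentinel. The sketch below is a generic illustration of that pattern, not WaterTAP code:)

```python
import queue  # provides the Empty exception raised by get(timeout=...)
from multiprocessing import Process, Queue


def producer(q):
    for i in range(3):
        q.put(i)
    q.put(None)  # sentinel: tells the consumer there is nothing more to read


def drain(q, timeout=5):
    # Block on get() with a timeout instead of polling q.empty(); a timeout
    # means the producer died or hung, so we fail loudly rather than waiting forever.
    results = []
    while True:
        try:
            item = q.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError(f"producer did not finish within {timeout}s")
        if item is None:
            return results
        results.append(item)


if __name__ == "__main__":
    q = Queue()
    p = Process(target=producer, args=(q,))
    p.start()
    print(drain(q))  # expected: [0, 1, 2]
    p.join()
```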

@bknueven
Contributor

bknueven commented Feb 2, 2024

@bknueven bknueven reopened this Feb 2, 2024
@avdudchenko
Contributor

https://github.com/watertap-org/watertap/actions/runs/7761451268/job/21169940288

It's really frustrating, as this behavior has been so difficult to reproduce. In all local runs I did on Windows, I never had hangs (unless I ran out of RAM, or something else caused a worker to die).

Is there a way to implement a timeout and retry for pytest? Sometimes multiprocessing/Ray can fail due to pickling issues or other OS-related problems that cause a worker to die or hang.
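(For reference: the third-party pytest-timeout and pytest-rerunfailures plugins provide per-test timeouts and retries. A minimal sketch, assuming both plugins are installed and using a made-up test body:)

```python
import pytest


@pytest.mark.timeout(300)      # pytest-timeout: fail the test if it runs past 5 minutes
@pytest.mark.flaky(reruns=2)   # pytest-rerunfailures: retry a failed test up to twice
def test_parallel_results_gather():
    # Stand-in body; the real tests exercise the parallel managers.
    assert sum(range(4)) == 6
```

(The same can be set globally with --timeout=300 and --reruns 2 on the command line; note that a hard hang inside native/MPI code may require pytest-timeout's thread method, which aborts the whole test session when it fires.)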

@lbianchi-lbl
Contributor Author

Unless this starts happening significantly more often, I'd suggest waiting until the parameter sweep (PS) code is moved to its own repository. At that point, we can think of ways to test these issues more systematically (e.g., a dedicated CI workflow that runs a large number of replicas of this test in parallel).
