CI: Python 3.10 on ubuntu-latest issue #851

Closed
tylerjereddy opened this issue Nov 15, 2022 · 3 comments · Fixed by #858
Labels: CI continuous integration, pydarshan

Comments

@tylerjereddy
Collaborator

As discussed in gh-830, there is a mysterious "hang" of the pytest suite with Python 3.10 on the ubuntu-latest GitHub Actions runner on that branch. Logging in to the runner via SSH and running pytest interactively works as expected, which makes this quite unusual. Furthermore, in tylerjereddy#23 I checked that downgrading to pytest 6.2.5 and Python 3.10.0 had no effect on the hang (vs. the newer versions of those libraries in use for the original hang).

I'll keep this issue open as a reference, but since I couldn't reproduce the hang even in the runner itself when working interactively, nor in act with Python 3.10 and ubuntu-latest, there seems to be no sane way to debug this. It looks like switching to the latest Ubuntu LTS release via the ubuntu-22.04 tag alleviates the issue, so I'll probably apply that to Shane's PR and reference this issue in a comment.

Unless we get related user reports, I wouldn't suggest putting any more time into this one, though...

tylerjereddy added the pydarshan and CI continuous integration labels Nov 15, 2022
tylerjereddy added a commit that referenced this issue Nov 15, 2022
* use Ubuntu 22.04 LTS for Python 3.10
testing to avoid mysterious issues
described in gh-851
@tylerjereddy
Collaborator Author

Ah, I can reproduce this locally now when I use pytest-xdist to run the suite on multiple processes, and that is with the newer Ubuntu 22.04 LTS we picked in CI to circumvent the issue. It seems fine in a single process, so an investigation of thread-safety/concurrency issues may be in order. For now I'll run the suite on a single core/thread locally (see the single-process command after the session output below).

python -m pytest -v --pyargs darshan -n 10

=================================================================================================================================================== test session starts ====================================================================================================================================================
platform linux -- Python 3.10.6, pytest-7.2.0, pluggy-1.0.0 -- /home/tyler/python_310_darshan_venv/bin/python
cachedir: .pytest_cache
rootdir: /tmp
plugins: forked-1.4.0, repeat-0.9.1, mpi-0.6, xdist-3.0.2
[gw0] linux Python 3.10.6 cwd: /tmp                                          
[gw1] linux Python 3.10.6 cwd: /tmp                                          
[gw2] linux Python 3.10.6 cwd: /tmp                                          
[gw3] linux Python 3.10.6 cwd: /tmp                                          
[gw4] linux Python 3.10.6 cwd: /tmp                                          
[gw5] linux Python 3.10.6 cwd: /tmp                                          
[gw6] linux Python 3.10.6 cwd: /tmp                                          
[gw7] linux Python 3.10.6 cwd: /tmp                                          
[gw8] linux Python 3.10.6 cwd: /tmp                                          
[gw9] linux Python 3.10.6 cwd: /tmp                                          
[gw0] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]               
[gw1] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                
[gw2] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                 
[gw3] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                  
[gw4] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                   
[gw5] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                    
[gw6] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                     
[gw7] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                      
[gw8] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                       
[gw9] Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]                        
gw0 ok / gw1 ok / gw2 ok / gw3 ok / gw4 ok / gw5 ok / gw6 ok / gw7 ok / gw8 ok / gw9 o
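
For comparison, the single-process run mentioned above is simply the same invocation without pytest-xdist's -n option:

python -m pytest -v --pyargs darshan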

@tylerjereddy
Collaborator Author

To clarify, I could only reproduce on the current version of the feature branch in gh-839.

Because this is quite annoying (a hang with no feedback), I decided to git bisect the hang in the parallel Python test suite (see below). I'll probably just move on for now, but this may be useful information to revisit later when thinking about how to prevent this kind of hang more robustly in the future.

09a5e64b578c82f2fec9aad3243c5f57d3c20305 is the first bad commit
commit 09a5e64b578c82f2fec9aad3243c5f57d3c20305
Author: Tyler Reddy <tyler.je.reddy@gmail.com>
Date:   Sun Nov 20 15:28:53 2022 -0700

    MAINT: PR 839 revisions
    
    * a number of additional `MPI-IO` and `STDIO` test cases
    were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`
    
    * for the `MPI-IO` cases to pass, special casing was added
    to `log_get_bytes_bandwidth()` such that `total_bytes` is
    actually extracted from `POSIX`

 darshan-util/pydarshan/darshan/backend/cffi_backend.py | 11 ++++++++++-
 darshan-util/pydarshan/darshan/tests/test_cffi_misc.py | 15 +++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

--- a/darshan-util/pydarshan/darshan/backend/cffi_backend.py
+++ b/darshan-util/pydarshan/darshan/backend/cffi_backend.py
@@ -732,7 +732,16 @@ def log_get_bytes_bandwidth(log_path: str, mod_name: str) -> str:
     # in the old perl-based summary reports
     darshan_derived_metrics = log_get_derived_metrics(log_path=log_path,
                                                       mod_name=mod_name)
-    total_mib = darshan_derived_metrics.total_bytes / 2 ** 20
+    if mod_name == "MPI-IO":
+        # for whatever reason, this seems to require
+        # total_bytes reported from POSIX to match the
+        # old perl summary reports
+        darshan_derived_metrics_posix = log_get_derived_metrics(log_path=log_path,
+                                                                mod_name="POSIX")
+        total_mib = darshan_derived_metrics_posix.total_bytes / 2 ** 20
+    else:
+        total_mib = darshan_derived_metrics.total_bytes / 2 ** 20
+
     total_bw = darshan_derived_metrics.agg_perf_by_slowest
     ret_str = f"I/O performance estimate (at the {mod_name} layer): transferred {total_mib:.1f} MiB at {total_bw:.2f} MiB/s"
     return ret_str
diff --git a/darshan-util/pydarshan/darshan/tests/test_cffi_misc.py b/darshan-util/pydarshan/darshan/tests/test_cffi_misc.py
index 86bcbf8c..92060da4 100644
--- a/darshan-util/pydarshan/darshan/tests/test_cffi_misc.py
+++ b/darshan-util/pydarshan/darshan/tests/test_cffi_misc.py
@@ -167,6 +167,9 @@ def test_log_get_generic_record(dtype):
     ("imbalanced-io.darshan",
      "STDIO",
      "I/O performance estimate (at the STDIO layer): transferred 1.1 MiB at 0.01 MiB/s"),
+    ("imbalanced-io.darshan",
+     "MPI-IO",
+     "I/O performance estimate (at the MPI-IO layer): transferred 101785.8 MiB at 101.58 MiB/s"),
     ("laytonjb_test1_id28730_6-7-43012-2131301613401632697_1.darshan",
      "STDIO",
      "I/O performance estimate (at the STDIO layer): transferred 0.0 MiB at 4.22 MiB/s"),
@@ -176,6 +179,18 @@ def test_log_get_generic_record(dtype):
     ("treddy_mpi-io-test_id4373053_6-2-60198-9815401321915095332_1.darshan",
      "STDIO",
      "I/O performance estimate (at the STDIO layer): transferred 0.0 MiB at 16.47 MiB/s"),
+    ("e3sm_io_heatmap_only.darshan",
+     "STDIO",
+     "I/O performance estimate (at the STDIO layer): transferred 0.0 MiB at 3.26 MiB/s"),
+    ("e3sm_io_heatmap_only.darshan",
+     "MPI-IO",
+     "I/O performance estimate (at the MPI-IO layer): transferred 290574.1 MiB at 105.69 MiB/s"),
+    ("partial_data_stdio.darshan",
+     "MPI-IO",
+     "I/O performance estimate (at the MPI-IO layer): transferred 32.0 MiB at 2317.98 MiB/s"),
+    ("partial_data_stdio.darshan",
+     "STDIO",
+     "I/O performance estimate (at the STDIO layer): transferred 16336.0 MiB at 2999.14 MiB/s"),
 ])
 def test_derived_metrics_bytes_and_bandwidth(log_path, mod_name, expected_str):
     # test the basic scenario of retrieving

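For context, here is a minimal usage sketch of the function this commit touches. It is a hypothetical standalone call, assuming "imbalanced-io.darshan" resolves to a full path to that log from the darshan-logs repo (the testsuite resolves such paths via a fixture); the expected string matches the MPI-IO test case added in the diff above.

# hypothetical standalone usage of the backend helper shown in the diff
from darshan.backend.cffi_backend import log_get_bytes_bandwidth

msg = log_get_bytes_bandwidth(log_path="imbalanced-io.darshan", mod_name="MPI-IO")
print(msg)
# I/O performance estimate (at the MPI-IO layer): transferred 101785.8 MiB at 101.58 MiB/s
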
@tylerjereddy
Collaborator Author

Maybe this is related to having two file handles and/or two calls to log_get_derived_metrics in that function? I guess we're going to get rid of the special MPI-IO + POSIX handling in that function anyway, but it seems useful to note the code change that triggers the hang under parallel testing.
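
For reference, a minimal sketch of the context-manager usage that the fix commits below adopt, contrasted with relying on __del__ for cleanup. This assumes the standard pydarshan DarshanReport API; "example.darshan" is a placeholder log path.

import darshan

# old pattern: cleanup of the underlying log handle is deferred to
# __del__/garbage collection, which proved fragile under pytest-xdist
report = darshan.DarshanReport("example.darshan", read_all=True)
print(report.modules.keys())

# pattern the fix switches the testsuite to: the log handle is released
# deterministically when the with-block exits
with darshan.DarshanReport("example.darshan", read_all=True) as report:
    print(report.modules.keys())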

tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Nov 30, 2022
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in darshan-hpcgh-839 and darshan-hpcgh-851
tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Nov 30, 2022
Fixes darshan-hpc#851

* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the testsuite hangs described in
darshan-hpcgh-839 and darshan-hpcgh-851; I pushed this commit into
darshan-hpcgh-839 recently, so if the CI there stops hanging with `3.10`,
on top of my local confirmation, hopefully we're done with
this annoyance

* if the fix is confirmed by the CI over there, I suggest
we encourage the use of `DarshanReport` with a context manager
in our documentation--perhaps we could open an issue for that,
and also look for cases in our source (beyond the tests)
where we might make the same switch
tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Nov 30, 2022
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in darshan-hpcgh-839 and darshan-hpcgh-851
tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Dec 7, 2022
* the GitHub Actions infrastructure is progressively
phasing in a newer Ubuntu base image:
https://github.blog/changelog/2022-11-09-github-actions-ubuntu-latest-workflows-will-use-ubuntu-22-04/

* that means we will somewhat randomly see images
that lack Python `3.6` in the build cache, per:
actions/setup-python#355 (comment)

* instead of pinning to an old version of Ubuntu in GHA for
3.6 support, let's just drop 3.6 from the testing matrix
per:
darshan-hpc#510 (comment)
(it has been EOL for 1 year)

* there's also no reason to retain the special Python `3.10`
handling where we use Ubuntu `22.04` for that case, for two reasons:
1) `22.04` is being rolled out as the new default anyway
2) darshan-hpcgh-851, the Python garbage collection issue with `3.10`, was
resolved by using context managers, so the special treatment is no longer
justified
tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Dec 16, 2022
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in darshan-hpcgh-839 and darshan-hpcgh-851