test: update pytest framework (modflowpy#1493)
* use pytest-benchmark's builtin profiling capability instead of manual implementation
* remove requires_exe(mf6) from test_mf6.py tests that don't run models/simulations
* add @requires_spatial_reference marker to conftest.py (for tests depending on spatialreference.org)
* try both importlib.import_module and pkg_resources.get_distribution in @requires_pkg marker (see the sketch after this list)
* mark test_lgr.py::test_simple_lgr_model_from_scratch as flaky (occasional forrtl error (65): floating invalid)
* split test_export.py::test_polygon_from_ij into network-bound and non-network-bound cases
* add comments to flaky tests with links to potentially similar issues
* add timeouts to CI jobs (10min for build, lint, & smoke, 45min for test, 90min for daily jobs)
* remove unneeded markers from pytest.ini
* match profiling/benchmarking test files in pytest.ini
* mark get-modflow tests as flaky (modflowpy#1489 (comment))
* cache benchmark results in daily CI and compare with prior runs
* various tidying/cleanup
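
A hedged sketch (not the actual `conftest.py` code from this commit) of how a `@requires_pkg`-style check might try `importlib.import_module` first and fall back to `pkg_resources.get_distribution`, as described in the bullet above; the helper names are hypothetical:

```python
from importlib import import_module

import pkg_resources  # provided by setuptools
import pytest


def has_pkg(pkg):
    # prefer a plain import; fall back to distribution metadata for
    # packages that are installed but not importable under that name
    try:
        import_module(pkg)
        return True
    except ImportError:
        try:
            pkg_resources.get_distribution(pkg)
            return True
        except pkg_resources.DistributionNotFound:
            return False


def requires_pkg(*pkgs):
    # skip the decorated test if any required package is missing
    missing = {pkg for pkg in pkgs if not has_pkg(pkg)}
    return pytest.mark.skipif(
        missing,
        reason=f"missing package(s): {', '.join(missing)}",
    )
```
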
wpbonelli committed Aug 10, 2022
1 parent 9c42c37 commit 7d33c40
Showing 17 changed files with 243 additions and 239 deletions.
11 changes: 6 additions & 5 deletions .github/workflows/commit.yml
@@ -18,6 +18,7 @@ jobs:
defaults:
run:
shell: bash
timeout-minutes: 10

steps:
- name: Checkout repo
@@ -50,13 +51,13 @@
run: |
twine check --strict dist/*
lint:
name: Lint
runs-on: ubuntu-latest
defaults:
run:
shell: bash
timeout-minutes: 10

steps:
- name: Checkout repo
@@ -106,14 +107,13 @@
run: |
pylint --jobs=2 --errors-only --exit-zero ./flopy
smoke:
name: Smoke
runs-on: ubuntu-latest
defaults:
run:
shell: bash
timeout-minutes: 10

steps:
- name: Checkout repo
@@ -185,7 +185,6 @@
directory: ./autotest
file: coverage.xml


test:
name: Test
needs: smoke
@@ -204,6 +203,7 @@
path: ~/.cache/pip
- os: macos-latest
path: ~/Library/Caches/pip
timeout-minutes: 45

steps:
- name: Checkout repo
@@ -290,6 +290,7 @@
defaults:
run:
shell: pwsh
timeout-minutes: 45

steps:
- name: Checkout repo
@@ -302,7 +303,7 @@
uses: actions/cache@v2.1.0
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml', 'flopy') }}
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml') }}

# Standard python fails on windows without GDAL installation
# Using custom bash shell ("shell: bash -l {0}") with Miniconda
46 changes: 34 additions & 12 deletions .github/workflows/daily.yml
@@ -25,6 +25,7 @@ jobs:
defaults:
run:
shell: bash
timeout-minutes: 90

steps:
- name: Checkout repo
@@ -90,7 +91,6 @@
file: coverage.xml

examples:

name: Example scripts & notebooks
runs-on: ${{ matrix.os }}
strategy:
@@ -110,6 +110,7 @@
defaults:
run:
shell: bash
timeout-minutes: 90

steps:
- name: Checkout repo
@@ -194,6 +195,7 @@
defaults:
run:
shell: bash
timeout-minutes: 90

steps:
- name: Checkout repo
@@ -230,14 +232,23 @@
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: Run tests
- name: Load cached benchmark results (for comparison)
uses: actions/cache@v2.1.0
with:
path: ./autotest/.benchmarks
key: benchmark-${{ matrix.os }}-${{ matrix.python-version }}

- name: Run benchmarks
working-directory: ./autotest
run: |
pytest -v --cov=flopy --cov-report=xml --durations=0 --benchmark-only --benchmark-autosave --keep-failed=.failed
pytest -v --durations=0 \
--cov=flopy --cov-report=xml \
--benchmark-only --benchmark-autosave --benchmark-compare --benchmark-compare-fail=mean:25% \
--keep-failed=.failed
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: Upload failed test outputs
- name: Upload failed benchmark outputs
uses: actions/upload-artifact@v2
if: failure()
with:
@@ -279,6 +290,7 @@
defaults:
run:
shell: pwsh
timeout-minutes: 90

steps:
- name: Checkout repo
@@ -291,7 +303,7 @@
uses: actions/cache@v2.1.0
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml', 'flopy') }}
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml') }}

# Standard python fails on windows without GDAL installation
# Using custom bash shell ("shell: bash -l {0}") with Miniconda
@@ -362,6 +374,7 @@
defaults:
run:
shell: pwsh
timeout-minutes: 90

steps:
- name: Checkout repo
@@ -374,7 +387,7 @@
uses: actions/cache@v2.1.0
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml', 'flopy') }}
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml') }}

# Standard python fails on windows without GDAL installation
# Using custom bash shell ("shell: bash -l {0}") with Miniconda
@@ -410,7 +423,6 @@
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


- name: Upload failed test outputs
uses: actions/upload-artifact@v2
if: failure()
@@ -446,6 +458,7 @@
defaults:
run:
shell: pwsh
timeout-minutes: 90

steps:
- name: Checkout repo
@@ -458,7 +471,7 @@
uses: actions/cache@v2.1.0
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml', 'flopy') }}
key: ${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.run-type }}-${{ hashFiles('etc/environment.yml') }}

# Standard python fails on windows without GDAL installation
# Using custom bash shell ("shell: bash -l {0}") with Miniconda
@@ -487,14 +500,23 @@
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: Run tests
- name: Load cached benchmark results (for comparison)
uses: actions/cache@v2.1.0
with:
path: ./autotest/.benchmarks
key: benchmark-${{ runner.os }}-${{ matrix.python-version }}

- name: Run benchmarks
working-directory: ./autotest
run: |
pytest -v --cov=flopy --cov-report=xml --durations=0 --benchmark-only --benchmark-autosave --keep-failed=.failed
pytest -v --durations=0 \
--cov=flopy --cov-report=xml \
--benchmark-only --benchmark-autosave --benchmark-compare --benchmark-compare-fail=mean:25% \
--keep-failed=.failed
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

- name: Upload failed test outputs
- name: Upload failed benchmark outputs
uses: actions/upload-artifact@v2
if: failure()
with:
@@ -505,7 +527,7 @@
- name: Upload benchmark results
uses: actions/upload-artifact@v2
with:
name: benchmark-${{ matrix.os }}-${{ matrix.python-version }}
name: benchmark-${{ runner.os }}-${{ matrix.python-version }}
path: |
./autotest/.benchmarks/**/*.json
35 changes: 23 additions & 12 deletions DEVELOPER.md
@@ -192,8 +192,6 @@ Markers are a `pytest` feature that can be used to select subsets of tests. Mark
- `slow`: tests that don't complete in a few seconds
- `example`: exercise scripts, tutorials and notebooks
- `regression`: tests that compare multiple results
- `benchmark`: test that gather runtime statistics
- `profile`: tests measuring performance in detail

Markers can be used with the `-m <marker>` option. For example, to run only fast tests:

@@ -221,9 +219,20 @@ This will retain the test directories created by the test, which allows files to

There is also a `--keep-failed <dir>` option which preserves the outputs of failed tests in the given location, however this option is only compatible with function-scoped temporary directories (the `tmpdir` fixture defined in `conftest.py`).

### Benchmarking
### Performance testing

Benchmarking is accomplished with [`pytest-benchmark`](https://pytest-benchmark.readthedocs.io/en/latest/index.html). Any test function can be turned into a benchmark by requesting the `benchmark` fixture (i.e. declaring a `benchmark` argument), which can be used to wrap any function call. For instance:
Performance testing is accomplished with [`pytest-benchmark`](https://pytest-benchmark.readthedocs.io/en/latest/index.html).

To allow optional separation of performance from correctness concerns, performance test files may either be named as typical test files or match any of the following patterns:

- `benchmark_*.py`
- `profile_*.py`
- `*_profile*.py`
- `*_benchmark*.py`

#### Benchmarking

Any test function can be turned into a benchmark by requesting the `benchmark` fixture (i.e. declaring a `benchmark` argument), which can be used to wrap any function call. For instance:

```python
def test_benchmark(benchmark):
@@ -251,25 +260,27 @@ Rather than alter an existing function call to use this syntax, a lambda can be

```python
def test_benchmark(benchmark):
def sleep_1s():
def sleep_s(s):
import time
time.sleep(1)
time.sleep(s)
return True

assert benchmark(lambda: sleep_1s())
assert benchmark(lambda: sleep_s(1))
```

This can be convenient when the function call is complicated or passes many arguments.
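
As an alternative to the lambda (a minimal sketch based on pytest-benchmark's documented fixture signature, not code from this repository), positional and keyword arguments passed after the callable are forwarded to it:

```python
import time


def test_benchmark_with_args(benchmark):
    # arguments after the callable are forwarded to it on every round,
    # so no wrapping lambda is needed
    result = benchmark(time.sleep, 0.1)
    assert result is None  # time.sleep returns None
```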

To control the number of repetitions and rounds (repetitions of repetitions) use `benchmark.pedantic`, e.g. `benchmark.pedantic(some_function(), iterations=1, rounds=1)`.
Benchmarked functions are repeated several times (the number of iterations depending on the test's runtime, with faster tests generally getting more reps) to compute summary statistics. To control the number of repetitions and rounds (repetitions of repetitions) use `benchmark.pedantic`, e.g. `benchmark.pedantic(some_function, iterations=1, rounds=1)`.
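
For example, a minimal sketch (a hypothetical test reusing a sleep helper like the one above) of a pedantic benchmark pinned to a single round and iteration:

```python
import time


def test_benchmark_pedantic(benchmark):
    def sleep_s(s):
        time.sleep(s)
        return True

    # exactly one round with one iteration; arguments are passed via args=
    assert benchmark.pedantic(sleep_s, args=(1,), iterations=1, rounds=1)
```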

Benchmarking is incompatible with `pytest-xdist` and is disabled automatically when tests are run in parallel. When tests are not run in parallel, benchmarking is enabled by default. Benchmarks can be disabled with the `--benchmark-disable` flag.

Benchmarked functions are repeated several times (the number of iterations depending on the test's runtime, with faster tests generally getting more reps) to compute summary statistics. Benchmarking is incompatible with `pytest-xdist` and is disabled automatically when tests are run in parallel. When tests are not run in parallel, benchmarking is enabled by default. Benchmarks can be disabled with the `--benchmark-disable` flag.
Benchmark results are only printed to `stdout` by default. To save results to a JSON file, use `--benchmark-autosave`. This will create a `.benchmarks` folder in the current working location (if you're running tests, this should be `autotest/.benchmarks`).

Benchmark results are only printed to stdout by default. To save results to a JSON file, use `--benchmark-autosave`. This will create a `.benchmarks` folder in the current working location (if you're running tests, this should appear at `autotest/.benchmarks`).
#### Profiling

### Profiling
Profiling is [distinct](https://stackoverflow.com/a/39381805/6514033) from benchmarking: profiling evaluates a program's call stack in detail, while benchmarking just invokes a function repeatedly and computes summary statistics. Profiling is also accomplished with `pytest-benchmark`: use the `--benchmark-cprofile` option when running tests that use the `benchmark` fixture described above. The option's value is the column to sort results by. For instance, to sort by total time, use `--benchmark-cprofile="tottime"`. See the `pytest-benchmark` [docs](https://pytest-benchmark.readthedocs.io/en/stable/usage.html#commandline-options) for more information.

Profiling is [distinct](https://stackoverflow.com/a/39381805/6514033) from benchmarking in considering program behavior in detail, while benchmarking just invokes functions repeatedly and computes summary statistics. Profiling test files may be named either as typical test files or matching `profile_*.py` or `*_profile*.py`. Functions marked with the `profile` marker are considered profiling tests and will not run unless `pytest` is invoked with the `--profile` (short `-P`) flag.
By default, `pytest-benchmark` will only print profiling results to `stdout`. If the `--benchmark-autosave` flag is provided, performance profile data will be included in the JSON files written to the `.benchmarks` save directory as described in the benchmarking section above.

### Writing tests
