Test suite sometimes picks up way more tests; takes ~6 times longer #343

Open · h-vetinari opened this issue Feb 1, 2025 · 14 comments

@h-vetinari
Member

h-vetinari commented Feb 1, 2025

There's something strange with the tests; in the following run for linux-64+CUDA+MKL (d04bba8 in #340), py311 collects a whole bunch more tests and takes far longer than the other versions (even longer than py312, which is the only version where we run the inductor tests).

TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2916.32s (0:48:36) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py39_hdffab68_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7552 passed, 1375 skipped, 31 xfailed, 75701 warnings in 458.74s (0:07:38) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py39_hdffab68_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestCustomOp and test_data_dependent_compile) or (TestCustomOp and test_functionalize_error) or (TestCustomOpAPI and test_compile) or (TestCustomOpAPI and test_fake) or test_compile_int4_mm or test_compile_int8_mm or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7532 passed, 1375 skipped, 31 xfailed, 75718 warnings in 455.21s (0:07:35) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7552 passed, 1375 skipped, 31 xfailed, 75701 warnings in 459.08s (0:07:39) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py test/inductor/test_torchinductor.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 8196 passed, 1429 skipped, 31 xfailed, 76339 warnings in 2177.80s (0:36:17) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda

The set of modules and skips is exactly the same as on 3.9 or 3.10, so I don't know what would explain this difference in test collection.

(note: it's expected that 3.12 runs longer due to being the only version where we include the torchinductor tests, and that 3.13 has more skips because dynamo doesn't yet support 3.13 in pytorch 2.5)

However, after merging #340 to main, the exact same job yielded completely different behaviour, with every test run collecting 13k+ tests and taking ~50min instead of <10.

TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestCustomOp and test_data_dependent_compile) or (TestCustomOp and test_functionalize_error) or (TestCustomOpAPI and test_compile) or (TestCustomOpAPI and test_fake) or test_compile_int4_mm or test_compile_int8_mm or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13151 passed, 2570 skipped, 91 xfailed, 143235 warnings in 2899.57s (0:48:19) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2898.41s (0:48:18) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2950.04s (0:49:10) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py test/inductor/test_torchinductor.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 1 failed, 14514 passed, 2663 skipped, 91 xfailed, 143956 warnings in 4192.22s (1:09:52) =
# 3.9 not run after flaky failure for 3.12

Originally posted by @h-vetinari in #340 (comment)

@mgorny
Contributor

mgorny commented Feb 11, 2025

Do you think it would be a problem to unconditionally use pytest -v? I can't reproduce this locally, at least with a bunch of random attempts, and it's hard to even guess anything without seeing test names.

@h-vetinari
Member Author

> Do you think it would be a problem to unconditionally use pytest -v?

The logs are already extremely verbose, so if we could avoid adding another ~40k lines that would be great. But otherwise fine.

Speaking of, if you have any ideas on that - I'd love to get rid of another chunk of thousands of lines with ~0 information content; everything about:

copying torch/include/ATen/ops/quantile_native.h -> build/lib.macosx-10.13-x86_64-cpython-310/torch/include/ATen/ops
[...]
copying build/lib.macosx-10.13-x86_64-cpython-310/torch/include/ATen/ops/quantile_native.h -> build/bdist.macosx-10.13-x86_64/wheel/./torch/include/ATen/ops
[...]
adding 'torch/include/ATen/ops/quantile_native.h'

should just go away.

In other words, the logs currently display the entire content of the wheel 3 times per python version, for essentially zero gain. I'm already down to -v, and if I remove that, then the only thing shown is "still building wheel", but that's a step too far, because I do want to see the CMake config and compilation portion of the build. This seems like a very badly designed verbosity API in pip TBH.

@mgorny
Contributor

mgorny commented Feb 11, 2025

These are coming from setuptools, and I think I could reduce their verbosity. I'll try.

@mgorny
Contributor

mgorny commented Feb 11, 2025

diff --git a/recipe/build.sh b/recipe/build.sh
--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -243,7 +243,7 @@ case ${PKG_NAME} in
   libtorch)
     # Call setup.py directly to avoid spending time on unnecessarily
     # packing and unpacking the wheel.
-    $PREFIX/bin/python setup.py build
+    $PREFIX/bin/python setup.py -q build
 
     mv build/lib.*/torch/bin/* ${PREFIX}/bin/
     mv build/lib.*/torch/lib/* ${PREFIX}/lib/
@@ -256,7 +256,7 @@ case ${PKG_NAME} in
     cp build/CMakeCache.txt build/CMakeCache.txt.orig
     ;;
   pytorch)
-    $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -v --no-clean \
+    $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -v --no-clean --config-settings=--global-option=-q \
         | sed "s,${CXX},\$\{CXX\},g" \
         | sed "s,${PREFIX},\$\{PREFIX\},g"
     # Keep this in ${PREFIX}/lib so that the library can be found by

This seems to work for me.

@h-vetinari
Member Author

Picked into 22ee0fe, and I'm turning on -v for the test suite in 47f7f21.

@h-vetinari
Member Author

I think the problem is actually a CUDA-detection issue. Looking at the most recent logs in #326, the "fast" test-suite runs skip all the _cuda_ tests (and perhaps collect fewer of them in the first place), whereas the long runs actually do run the _cuda_ tests (some of which take quite a while).
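
For reference, a quick way to see which mode a given runner ended up in is to query torch directly (just a diagnostic sketch, not recipe code):

```python
# Diagnostic only: if is_available() returns False, pytest ends up
# collecting/skipping the _cuda_ tests instead of running them.
import torch

print(torch.version.cuda)          # CUDA version torch was built against (or None)
print(torch.cuda.is_available())   # runtime + driver actually detected?
print(torch.cuda.device_count())   # number of visible GPUs
```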

@mgorny
Contributor

mgorny commented Feb 13, 2025

Hmm, that would explain why I couldn't reproduce on a non-GPU system. I'll try on a GPU system later today.

@mgorny
Contributor

mgorny commented Feb 13, 2025

Ah, indeed. It collects a total of 8992 tests with the CPU build, and 15892 with the CUDA build.

I wonder if we could and should assert for that somehow. Since we can't unconditionally assume that everyone building this will do so on a GPU system, the assert could be conditional on some external CUDA detection method.
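
Something along these lines, assuming nvidia-smi as the external detection method (a hypothetical sketch, not recipe code):

```python
# Hypothetical sketch: only assert CUDA availability when an external check
# (here: nvidia-smi) says the build machine actually has a GPU.
import shutil
import subprocess

import torch

def gpu_present() -> bool:
    """True if nvidia-smi exists on PATH and exits successfully."""
    smi = shutil.which("nvidia-smi")
    return smi is not None and subprocess.run([smi], capture_output=True).returncode == 0

if gpu_present():
    assert torch.cuda.is_available(), "GPU visible to nvidia-smi, but not to torch"
```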

@h-vetinari
Member Author

h-vetinari commented Feb 14, 2025

> I wonder if we could and should assert for that somehow.

I don't think that's necessary. I'd even go as far as saying that I prefer the current setup. The GPU tests do get run on average at least once or twice per megabuild (so we have the coverage), but having the CUDA detection fail on the other runs, so that they only execute the CPU tests in much less time, actually keeps the overall testing time sane.

Though I'm not opposed to fixing it properly (and perhaps reducing the tests and/or skipping some long-running ones, like I did in 5640240).

@mgorny
Contributor

mgorny commented Feb 14, 2025

Sounds like xkcd#1172 ("Workflow") :-).

@hmaarrfk
Contributor

I would somewhat prefer that we be more "reactionary" with the tests.

Developer time (yours) is quite important, and waiting an extra hour during an iteration cycle is really a downer.

conda-forge used to prioritize "existence tests" rather than exhaustive tests. I think we should steer back in that direction.

Having 5-30 very targeted lines like:

pytest the/name/of/the/file.py::the_test_name

will likely be just as effective at finding build errors.

@h-vetinari
Member Author

> and waiting an extra hour during an iteration cycle is really a downer.

Generally yes, but since the iteration cycle for the GPU builds here is >10h in any case, it barely makes a difference.

And where speed matters for debugging cycles, it's really easy to comment out the pytest call, restrict it to a single python version, or just delete some of the tested modules...

> conda-forge used to prioritize "existence tests" rather than exhaustive tests. I think we should steer back in that direction.

I actually prefer running the full test suite in feedstocks that I maintain, especially if the feedstock does non-trivial surgery. Too often have I seen bizarre stuff that breaks due to trivial oversights (381fcb8 is a recent example of something I wouldn't have found without extensive tests), and in the vast majority of cases the conclusion was that it's been my fault. 😅

Running more than "existence tests" is (in my experience) relatively concrete insurance that the surgery left the patient alive and (mostly) healthy. In the case of pytorch, the full test suite is too massive to make that approach feasible, but in terms of overall philosophy, I'd rather wait an hour longer than end up with avoidable bugs that I don't even know about.

As we discussed across a couple issues recently, I believe there's a middle ground to be found between our positions on this, and I'm happy to iterate on the set of modules and tests.

@hmaarrfk
Contributor

> Generally yes, but since the iteration cycle for the GPU builds here is >10h in any case, it barely makes a difference.

I would like to use build time as a proxy for "scope".

I know many of our jobs depend on and support conda-forge, but honestly, the recipe has become too complicated already. If we can decrease the scope by 10%, I call that a win.

> Too often have I seen bizarre stuff that breaks due to trivial oversights (381fcb8 is a recent example of something I wouldn't have found without extensive tests)

I agree that adding a test here is the right move, and in fact the test that was added is narrow in scope, which is what I'm advocating for. conda-forge shouldn't be treated as an "I'm going to run my business on this" kind of service. Anaconda provides packages for this, and also a support line for more targeted support. We can't recreate this for "free".

Honestly, the scope increase of the pytorch recipe has made me less interested in helping maintain it. The simplicity of "you can build this recipe on any Linux box, GPU or not" shouldn't be undervalued IMO.

I know I've said this many times, but unless this whole thread is a place to bikeshed, I think you are also concerned with the scope and flakiness of the tests. My proposal is as follows:

  1. Detect whether the runner has a GPU enabled. Typically nvidia-smi can be used.
  2. If so, assert that data can be sent to the GPU, e.g. `tensor = torch.zeros((16, 16), dtype=torch.uint8, device='cuda')`.
  3. Run a small test for torch.compile on GPU, if a GPU is available.
  4. Run a small test for torch.compile on CPU, always.
  5. Run a small test for BLAS compatibility, focusing on linkage issues.

If no GPU is detected, we should indicate it in the logs. A pytorch conda-forge maintainer should then use their judgement in pressing the merge button if they feel comfortable doing so.

Key takeaway: I don't think we need to mandate the GPU test suite in the CIs.
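
A rough sketch of what the five steps above could look like (assumed shapes and tolerances, not an agreed-upon implementation):

```python
import shutil
import subprocess

import torch

# 1. Detect whether the runner has a GPU enabled, via nvidia-smi.
has_gpu = (
    shutil.which("nvidia-smi") is not None
    and subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
)

# 2. If so, assert that data can be sent to the GPU.
if has_gpu:
    tensor = torch.zeros((16, 16), dtype=torch.uint8, device="cuda")
    assert tensor.device.type == "cuda"

# 3./4. Small torch.compile test: on GPU if available, on CPU always.
compiled = torch.compile(lambda x: x + 1)
for dev in ["cpu"] + (["cuda"] if has_gpu else []):
    x = torch.ones(4, device=dev)
    assert torch.equal(compiled(x), x + 1)

# 5. Small BLAS compatibility check: a matmul exercises the linked BLAS,
# so linkage problems surface at call time.
a = torch.randn(128, 128, dtype=torch.float64)
assert torch.allclose(a @ torch.eye(128, dtype=torch.float64), a)
```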

@mgorny
Contributor

mgorny commented Feb 14, 2025

One thing we could consider is rerunning failing tests, too. Normally I'd do that via the pytest-rerunfailures option --reruns=5 or similar, but I'm not sure whether that's going to work correctly for the issues we're seeing.
