Test suite sometimes picks up way more tests; takes ~6 times longer #343

Open · h-vetinari opened this issue Feb 1, 2025 · 14 comments

@h-vetinari
Member

h-vetinari commented Feb 1, 2025

There's something strange with the tests; in the following run for linux-64+CUDA+MKL (d04bba8 in #340), py311 collects a whole bunch more tests and takes far longer than the other versions (even longer than py312, which is the only version where we run the inductor tests).

TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2916.32s (0:48:36) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py39_hdffab68_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7552 passed, 1375 skipped, 31 xfailed, 75701 warnings in 458.74s (0:07:38) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py39_hdffab68_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestCustomOp and test_data_dependent_compile) or (TestCustomOp and test_functionalize_error) or (TestCustomOpAPI and test_compile) or (TestCustomOpAPI and test_fake) or test_compile_int4_mm or test_compile_int8_mm or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7532 passed, 1375 skipped, 31 xfailed, 75718 warnings in 455.21s (0:07:35) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7552 passed, 1375 skipped, 31 xfailed, 75701 warnings in 459.08s (0:07:39) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py test/inductor/test_torchinductor.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 8196 passed, 1429 skipped, 31 xfailed, 76339 warnings in 2177.80s (0:36:17) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda

The set of modules and skips is exactly the same as on 3.9 or 3.10, so I don't know what would explain this difference in test collection.

(note: it's expected that 3.12 runs longer due to being the only version where we include the torchinductor tests, and that 3.13 has more skips because dynamo doesn't yet support 3.13 in pytorch 2.5)

However, after merging #340 to main, the exact same job yielded completely different behaviour, with every test run collecting 13k+ tests and taking ~50min instead of <10.

TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestCustomOp and test_data_dependent_compile) or (TestCustomOp and test_functionalize_error) or (TestCustomOpAPI and test_compile) or (TestCustomOpAPI and test_fake) or test_compile_int4_mm or test_compile_int8_mm or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13151 passed, 2570 skipped, 91 xfailed, 143235 warnings in 2899.57s (0:48:19) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2898.41s (0:48:18) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2950.04s (0:49:10) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py test/inductor/test_torchinductor.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 1 failed, 14514 passed, 2663 skipped, 91 xfailed, 143956 warnings in 4192.22s (1:09:52) =
# 3.9 not run after flaky failure for 3.12

Originally posted by @h-vetinari in #340 (comment)

@mgorny
Contributor

mgorny commented Feb 11, 2025

Do you think it would be a problem to unconditionally use pytest -v? I can't reproduce this locally, at least with a bunch of random attempts, and it's hard to even guess anything without seeing test names.

@h-vetinari
Member Author

> Do you think it would be a problem to unconditionally use pytest -v?

The logs are already extremely verbose, so if we could avoid adding another ~40k lines that would be great. But otherwise fine.

Speaking of, if you have any ideas on that - I'd love to get rid of another chunk of thousands of lines with ~0 information content; everything about:

copying torch/include/ATen/ops/quantile_native.h -> build/lib.macosx-10.13-x86_64-cpython-310/torch/include/ATen/ops
[...]
copying build/lib.macosx-10.13-x86_64-cpython-310/torch/include/ATen/ops/quantile_native.h -> build/bdist.macosx-10.13-x86_64/wheel/./torch/include/ATen/ops
[...]
adding 'torch/include/ATen/ops/quantile_native.h'

should just go away.

In other words, the logs currently display the entire content of the wheel 3 times per python version, for essentially zero gain. I'm already down to -v, and if I remove that, then the only thing shown is "still building wheel", but that's a step too far, because I do want to see the CMake config and compilation portion of the build. This seems like a very badly designed verbosity API in pip TBH.

@mgorny
Contributor

mgorny commented Feb 11, 2025

These are coming from setuptools, and I think I could reduce their verbosity. I'll try.

@mgorny
Contributor

mgorny commented Feb 11, 2025

diff --git a/recipe/build.sh b/recipe/build.sh
--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -243,7 +243,7 @@ case ${PKG_NAME} in
   libtorch)
     # Call setup.py directly to avoid spending time on unnecessarily
     # packing and unpacking the wheel.
-    $PREFIX/bin/python setup.py build
+    $PREFIX/bin/python setup.py -q build
 
     mv build/lib.*/torch/bin/* ${PREFIX}/bin/
     mv build/lib.*/torch/lib/* ${PREFIX}/lib/
@@ -256,7 +256,7 @@ case ${PKG_NAME} in
     cp build/CMakeCache.txt build/CMakeCache.txt.orig
     ;;
   pytorch)
-    $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -v --no-clean \
+    $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -v --no-clean --config-settings=--global-option=-q \
         | sed "s,${CXX},\$\{CXX\},g" \
         | sed "s,${PREFIX},\$\{PREFIX\},g"
     # Keep this in ${PREFIX}/lib so that the library can be found by

This seems to work for me.

@h-vetinari
Member Author

Picked into 22ee0fe, and I'm turning on -v for the test suite in 47f7f21.

@h-vetinari
Member Author

I think the problem is actually a CUDA-detection issue. Looking at the most recent logs in #326, the "fast" test-suite runs skip all the _cuda_ tests (and perhaps collect fewer of them in the first place), whereas the long runs actually do run the _cuda_ tests (some of which take quite a while).
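
For reference, a quick way to see which mode a given runner ended up in is to query torch directly (just a diagnostic sketch, not recipe code):

```python
# Diagnostic only: if is_available() returns False, pytest ends up
# collecting/skipping the _cuda_ tests instead of running them.
import torch

print(torch.version.cuda)          # CUDA version torch was built against (or None)
print(torch.cuda.is_available())   # runtime + driver actually detected?
print(torch.cuda.device_count())   # number of visible GPUs
```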

@mgorny
Contributor

mgorny commented Feb 13, 2025

Hmm, that would explain why I couldn't reproduce on a non-GPU system. I'll try on a GPU system later today.

@mgorny
Contributor

mgorny commented Feb 13, 2025

Ah, indeed. It collects a total of 8992 tests with the CPU build, and 15892 with the CUDA build.

I wonder if we could and should assert for that somehow. Since we can't unconditionally assume that everyone building this will do so on a GPU system, the assert could be conditional on some external CUDA detection method.
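
Something along these lines, assuming nvidia-smi as the external detection method (a hypothetical sketch, not recipe code):

```python
# Hypothetical sketch: only assert CUDA availability when an external check
# (here: nvidia-smi) says the build machine actually has a GPU.
import shutil
import subprocess

import torch

def gpu_present() -> bool:
    """True if nvidia-smi exists on PATH and exits successfully."""
    smi = shutil.which("nvidia-smi")
    return smi is not None and subprocess.run([smi], capture_output=True).returncode == 0

if gpu_present():
    assert torch.cuda.is_available(), "GPU visible to nvidia-smi, but not to torch"
```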

@h-vetinari
Member Author

h-vetinari commented Feb 14, 2025

> I wonder if we could and should assert for that somehow.

I don't think that's necessary. I'd even go as far as saying that I prefer the current setup. The GPU tests do get run on average at least once or twice per megabuild (so we have the coverage), but having the CUDA detection fail on the other runs, so that they only execute the CPU tests in much less time, actually keeps the overall testing time sane.

Though I'm not opposed to fixing it properly (and perhaps reducing the tests and/or skipping some long-running ones, like I did in 5640240).

@mgorny
Contributor

mgorny commented Feb 14, 2025

Sounds like xkcd#1172 ("Workflow") :-).

@hmaarrfk
Contributor

I would somewhat prefer that we be more "reactionary" with the tests.

Developer time (yours) is quite important, and waiting an extra hour during an iteration cycle is really a downer.

conda-forge used to prioritize "existence tests" rather than exhaustive tests. I think we should steer back in that direction.

Having 5-30 very targeted lines like:

pytest the/name/of/the/file.py::the_test_name

will likely be just as effective at finding build errors.

@h-vetinari
Member Author

> and waiting an extra hour during an iteration cycle is really a downer.

Generally yes, but since the iteration cycle for the GPU builds here is >10h in any case, it barely makes a difference.

And where speed matters for debugging cycles, it's really easy to comment out the pytest call, restrict it to a single python version, or just delete some of the tested modules...

> conda-forge used to prioritize "existence tests" rather than exhaustive tests. I think we should steer back in that direction.

I actually prefer running the full test suite in feedstocks that I maintain, especially if the feedstock does non-trivial surgery. Too often have I seen bizarre stuff that breaks due to trivial oversights (381fcb8 is a recent example of something I wouldn't have found without extensive tests), and in the vast majority of cases the conclusion was that it's been my fault. 😅

Running more than "existence tests" is (in my experience) relatively concrete insurance that the surgery left the patient alive and (mostly) healthy. In the case of pytorch, the full test suite is too massive to make that approach feasible, but in terms of overall philosophy, I'd rather wait an hour longer than end up with avoidable bugs that I don't even know about.

As we discussed across a couple issues recently, I believe there's a middle ground to be found between our positions on this, and I'm happy to iterate on the set of modules and tests.

@hmaarrfk
Contributor

> Generally yes, but since the iteration cycle for the GPU builds here is >10h in any case, it barely makes a difference.

I would like to use build time as a proxy for "scope".

I know many of our jobs depend on and support conda-forge, but honestly, the recipe has become too complicated already. If we can decrease the scope by 10%, I call that a win.

> Too often have I seen bizarre stuff that breaks due to trivial oversights (381fcb8 is a recent example of something I wouldn't have found without extensive tests)

I agree that adding a test here is the right move, and in fact the test that was added is narrow in scope, which is what I'm advocating for. conda-forge shouldn't be treated as an "I'm going to run my business on this" kind of service. Anaconda provides packages for this, and also a support line for more targeted support. We can't recreate this for "free".

Honestly, the scope increase of the pytorch recipe has made me less interested in helping maintain it. The simplicity of "you can build this recipe on any Linux box, GPU or not" shouldn't be undervalued IMO.

I know I've said this many times, but unless this whole thread is a place to bikeshed, I think you are also concerned with the scope and flakiness of the tests. My proposal is as follows:

  1. Detect whether the runner has a GPU enabled. Typically nvidia-smi can be used.
  2. If so, assert that data can be sent to the GPU, e.g. `tensor = torch.zeros((16, 16), dtype=torch.uint8, device='cuda')`.
  3. Run a small test for torch.compile on GPU, if a GPU is available.
  4. Run a small test for torch.compile on CPU, always.
  5. Run a small test for BLAS compatibility, focusing on linkage issues.

If no GPU is detected, we should indicate it in the logs. A pytorch conda-forge maintainer should then use their judgement in pressing the merge button if they feel comfortable doing so.

Key takeaway: I don't think we need to mandate the GPU test suite in the CIs.
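
A rough sketch of what the five steps above could look like (assumed shapes and tolerances, not an agreed-upon implementation):

```python
import shutil
import subprocess

import torch

# 1. Detect whether the runner has a GPU enabled, via nvidia-smi.
has_gpu = (
    shutil.which("nvidia-smi") is not None
    and subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
)

# 2. If so, assert that data can be sent to the GPU.
if has_gpu:
    tensor = torch.zeros((16, 16), dtype=torch.uint8, device="cuda")
    assert tensor.device.type == "cuda"

# 3./4. Small torch.compile test: on GPU if available, on CPU always.
compiled = torch.compile(lambda x: x + 1)
for dev in ["cpu"] + (["cuda"] if has_gpu else []):
    x = torch.ones(4, device=dev)
    assert torch.equal(compiled(x), x + 1)

# 5. Small BLAS compatibility check: a matmul exercises the linked BLAS,
# so linkage problems surface at call time.
a = torch.randn(128, 128, dtype=torch.float64)
assert torch.allclose(a @ torch.eye(128, dtype=torch.float64), a)
```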

@mgorny
Contributor

mgorny commented Feb 14, 2025

One thing we could consider is rerunning failing tests, too. Normally I'd do that via the pytest-rerunfailures option --reruns=5 or similar, but I'm not sure whether that's going to work correctly for the issues we're seeing.
