Test Ifpack2_unit_tests_MPI_4 unit tests randomly failing in many ATDM and PR builds since at least 2021-08-30 #10016
Comments
@trilinos/framework, note: I did not add the entries to the ... Note that this has only caused 95 test failures across 32 different builds since 2021-08-30, as shown by the triaging script.
Note that this caused at least 28 PR build iterations to fail, so this should be triaged and fixed. I noticed this while trying to find a clean version of Trilinos to do testing with for PR #9973 and TriBITSPub/TriBITS#433.
NOTE: If you run this query and click "Show Matching Output", you can see on one page by how much the tolerance is being missed in all of these test runs. So either the solve tolerance needs to be tightened or the checking tolerance needs to be loosened.
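One plausible source of such small misses is summation order in floating-point arithmetic. The sketch below is an illustration only, not a diagnosis of the Ifpack2 test: it adds the same three numbers in two different orders (as a parallel reduction might from run to run) and compares both results against a made-up checking tolerance. The names `add_in_order`, `passes`, `check_tol`, and `slack` are hypothetical.

```python
def add_in_order(values):
    """Add floats strictly left to right (plain addition, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def passes(value, check_tol, slack=0.0):
    """Return True if value is within check_tol, loosened by a relative slack."""
    return value <= check_tol * (1.0 + slack)


values = [0.1, 0.2, 0.3]
forward = add_in_order(values)                    # (0.1 + 0.2) + 0.3
backward = add_in_order(list(reversed(values)))   # (0.3 + 0.2) + 0.1

check_tol = 0.6  # hypothetical checking tolerance, chosen to sit on the edge

print(forward == backward)                        # the two orders disagree in the last bit
print(passes(forward, check_tol))                 # strict check on the forward sum
print(passes(backward, check_tol))                # strict check on the backward sum
print(passes(forward, check_tol, slack=1e-12))    # loosened check on the forward sum
```

The same idea applies at any scale: when a check sits right at the edge of what the computation delivers, a last-bit difference decides pass or fail, and a little slack in the checking tolerance (or a tighter solve tolerance) removes the coin flip.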
Also note that another unit test, which had a tolerance of 0.01, failed as well, as shown in this query:
But those failures only occurred on 2021-08-30 and not since, so I think you can ignore them.
@trilinos/framework NOTE: Even though there were 95 failures of this test in lots of PR iterations and in several ATDM Trilinos builds over 3+ months, no one reported this (until I did, and I don't count). This shows a gap in the current Trilinos testing and triaging efforts: such errors are not caught and reported sooner. This suggests the need for another screening tool or process that looks at even a single failing test in a PR or nightly build and then constructs CDash queries that look over all builds and over a longer period of time to see if there is a pattern. (This is what I did manually in this case.) In this case, I saw a tolerance that was missed by a small amount and I figured that this was not the first time such a failure was impacting the automated builds. (When you see a tolerance missed by a huge margin, that is more likely to be a serious bug or system issue and not just non-determinism causing significant differences in roundoff errors.)
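A first cut at such a screening step might look like the sketch below. It assumes the failure records have already been pulled from CDash into plain tuples; the record layout, the function name `find_recurring_failures`, and the thresholds are all hypothetical, and no CDash API is called. It flags any test that has failed in more than one build or on more than one testing day.

```python
from collections import defaultdict
from datetime import date


def find_recurring_failures(records, min_builds=2, min_days=2):
    """Group failed-test records and report tests that look like a pattern.

    records: iterable of (test_name, build_name, testing_day) tuples.
    Returns {test_name: (num_failures, num_builds, num_days)} for every test
    seen in at least min_builds builds or on at least min_days days.
    """
    builds = defaultdict(set)
    days = defaultdict(set)
    counts = defaultdict(int)
    for test, build, day in records:
        builds[test].add(build)
        days[test].add(day)
        counts[test] += 1
    return {
        test: (counts[test], len(builds[test]), len(days[test]))
        for test in counts
        if len(builds[test]) >= min_builds or len(days[test]) >= min_days
    }


# Made-up records for illustration only; not taken from CDash.
records = [
    ("test_A", "build-1", date(2021, 9, 1)),
    ("test_A", "build-2", date(2021, 9, 14)),
    ("test_A", "build-1", date(2021, 10, 3)),
    ("test_B", "build-3", date(2021, 9, 2)),
]
recurring = find_recurring_failures(records)
print(recurring)
```

Here `test_A` is reported because it recurs across builds and days, while the one-off `test_B` failure is left out.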
@bartlettroscoe: Thanks for reporting this. This test failure is showing up in our weekly SecondaryATDM triaging monitor. CC: @jwillenbring, @ZUUL42
Fixed by PR #10017
CC: @trilinos/ifpack2, @<triage-contact> (Trilinos <product-area-name> Triage Contact (or "Current ATDM contact"))
Next Action Status
Description
As shown in this query (click "Show Matching Output" in the upper right), the tests:
Ifpack2_unit_tests_MPI_4
in the builds:
PR-9483-test-Trilinos_pullrequest_clang_10.0.0-3559
PR-9483-test-Trilinos_pullrequest_gcc_7.2.0_debug-3527
PR-9483-test-Trilinos_pullrequest_gcc_7.2.0_debug-3591
PR-9627-test-Trilinos_pullrequest_cuda_10.1.105-2132
PR-9627-test-Trilinos_pullrequest_cuda_10.1.105_uvm_off-1129
PR-9660-test-Trilinos_pullrequest_gcc_7.2.0_debug-3528
PR-9660-test-Trilinos_pullrequest_gcc_7.2.0_debug-3538
PR-9676-test-Trilinos_pullrequest_clang_10.0.0-3585
PR-9691-test-Trilinos_pullrequest_clang_10.0.0-3641
PR-9691-test-Trilinos_pullrequest_gcc_7.2.0_debug-3614
PR-9691-test-Trilinos_pullrequest_gcc_7.2.0_debug-3648
PR-9758-test-Trilinos_pullrequest_gcc_7.2.0_debug-3747
PR-9768-test-Trilinos_pullrequest_clang_10.0.0-3765
PR-9773-test-rhel7_sems-clang-7.0.1-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-19
PR-9810-test-Trilinos_pullrequest_gcc_7.2.0_debug-3839
PR-9836-test-Trilinos_pullrequest_clang_10.0.0-3913
PR-9859-test-rhel7_sems-clang-7.0.1-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-49
PR-9859-test-rhel7_sems-clang-7.0.1-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-53
PR-9866-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-77
PR-9876-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-81
PR-9876-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-142
PR-9876-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-188
PR-9883-test-Trilinos_pullrequest_clang_10.0.0-3937
PR-9920-test-Trilinos_pullrequest_gcc_7.2.0_debug-4045
PR-9929-test-Trilinos_pullrequest_gcc_7.2.0_debug-4120
PR-9990-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-package-enables-202
PR-9999-test-Trilinos_pullrequest_clang_10.0.0-4135
PR-Experimental-test-Trilinos_pullrequest_caraway-29
Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release
Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release-debug
Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-debug
Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug
started failing on testing day 2021-08-30.
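To see how many distinct PRs are involved, the PR build names above can be split into parts. This sketch assumes the pattern `PR-<id>-test-<configuration>-<number>`, which is inferred from the names in this list (what the trailing number means is a guess). Names that do not fit, such as the `Trilinos-atdm-*` builds, come back as `None`. The helper name `parse_pr_build` is hypothetical.

```python
import re

# Pattern inferred from the build names listed above (an assumption).
PR_BUILD = re.compile(r"^PR-(?P<pr>[^-]+)-test-(?P<config>.+)-(?P<number>\d+)$")


def parse_pr_build(name):
    """Return (pr_id, configuration, trailing_number), or None if no match."""
    match = PR_BUILD.match(name)
    if match is None:
        return None
    return match.group("pr"), match.group("config"), int(match.group("number"))


# A few names copied from the list above.
names = [
    "PR-9483-test-Trilinos_pullrequest_clang_10.0.0-3559",
    "PR-9483-test-Trilinos_pullrequest_gcc_7.2.0_debug-3527",
    "PR-Experimental-test-Trilinos_pullrequest_caraway-29",
    "Trilinos-atdm-sems-rhel7-clang-7.0.1-openmp-shared-release",
]
parsed = [parse_pr_build(n) for n in names]
distinct_prs = sorted({p[0] for p in parsed if p is not None})
print(distinct_prs)
```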
When the unit test
Ifpack2Chebyshev_double_int_longlong_Test0_UnitTest
fails, it seems to be missing the tolerance by just a little, as shown here:
It looks like other unit tests are also randomly failing to meet the tolerance.
If you run this query and then click "Show Matching Output", you can see by how much the tolerance is being missed in these various tests.
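When reading that output, it can help to express each miss relative to its tolerance. The sketch below is a minimal illustration with hypothetical numbers; the cutoff `small_factor=2.0` is an arbitrary assumption, not a Trilinos convention. A value just over the tolerance is consistent with roundoff-level variation, while a value far over it is more likely a real bug or system issue.

```python
def classify_miss(error, tol, small_factor=2.0):
    """Return (ratio, label), where ratio = error / tol.

    label is "pass" if the error is within tol, "small miss" if it exceeds
    tol by at most small_factor, and "large miss" otherwise.
    """
    if tol <= 0:
        raise ValueError("tol must be positive")
    ratio = error / tol
    if ratio <= 1.0:
        label = "pass"
    elif ratio <= small_factor:
        label = "small miss"
    else:
        label = "large miss"
    return ratio, label


# Hypothetical values, not taken from the CDash output.
print(classify_miss(1.04e-4, 1e-4))
print(classify_miss(3.0e-1, 1e-4))
```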
Current Status on CDash
Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.
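Adjusting the date range can also be scripted. The sketch below rewrites the `begin` and `end` query parameters of a query URL using only the Python standard library. It assumes the dates are carried in parameters with those names, and the URL shown is a placeholder, not the actual query link from this issue. The helper name `with_date_range` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_date_range(url, begin, end):
    """Return url with its begin/end query parameters replaced."""
    parts = urlsplit(url)
    # Keep every parameter except the old date range.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("begin", "end")
    ]
    params += [("begin", begin), ("end", end)]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL for illustration only.
example = (
    "https://cdash.example.org/queryTests.php"
    "?project=Trilinos&begin=2021-08-30&end=2021-12-06"
)
updated = with_date_range(example, "2021-12-01", "2021-12-07")
print(updated)
```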
Steps to Reproduce
One should be able to reproduce this failure as described in:
and the system-specific instructions at:
Just log into any of the associated machines and copy and paste the full CDash build name
<build-name>
listed above and run commands like:
where
<package-name>
is any package that you want to enable to reproduce build and/or test results.
Again, for exact system-specific details on what commands to run to build and run tests, see:
If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.