TpetraCore CrsMatrix pack and unpack test failures on clean build #1395

jwillenbring · 2017-06-05T13:30:11Z

CC: @trilinos/framework

@trilinos/tpetra

There have been two Tpetra test failures on the GCC 4.9.3 MPI build Linux-gcc-4.9.3-MPI_Release_gcc_4.9.3_openmpi_1.8.7_DEV for the last week:

http://testing.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=2934080

It is important to resolve these issues, either by fixing the issue or disabling the test so we can move in the direction of making automated decisions based on 100% clean builds.

The text was updated successfully, but these errors were encountered:

tjfulle · 2017-06-05T14:52:06Z

This error doesn't occur with the standard sems checkin script, I am looking at it now with the jenkins environment that shows failure

mhoemmen · 2017-06-05T14:54:04Z

std::bad_alloc suggests overflow, perhaps due to a signed / unsigned conversion.

tjfulle · 2017-06-05T14:57:48Z

@mhoemmen, perhaps? There is strange behavior on the test that intentionally sends in bad LIDs, like the bad LID is not caught soon enough. I'm getting closer....

mhoemmen · 2017-06-05T15:51:12Z

@tjfulle I'm stuck in meetings all day today and tomorrow or I'd help you out :(

tjfulle · 2017-06-05T15:51:29Z

@mhoemmen, the source of the error is in NumPacketsAndOffsetsFunctor wherein there is a check for exportLID + 1 <= lclMatrix.numRows() (not this exactly, but equivalent). In the move to combine the computation of the number of packets and offsets, the check found its way behind a if (debug) branch. This test intentionally sends in bad exportLIDs but they were not caught since this was a release build.

mhoemmen · 2017-06-05T15:52:32Z

@tjfulle my bad :( can you fix it? feel free to push. Thanks!

tjfulle · 2017-06-05T15:53:22Z

I've got the fix in and the tests are running as we speak :)

bartlettroscoe · 2017-06-05T16:08:56Z

@tjfulle:

This error doesn't occur with the standard sems checkin script

The single standard CI build is a build with debug-mode checking turned on Trilinos_ENABLE_DEBUG=ON (since that best supports developers and users developing against Trilinos). This build has Trilinos_ENABLE_DEBUG=OFF (as you can see at http://testing.sandia.gov/cdash/viewNotes.php?buildid=2934080) which is also a very important type of build for customers actually wanting performance out of Trilinos. We can add an MPI_RELEASE_SHARED_PT extra build to the checkin-test-sems.sh script so you can run this build as well before you push, if you like. That will take twice as long to test and push but for some types of changes, perhaps that is a good idea. Or we can just fix failures for these post-push "Clean" builds ASAP. Those are the only two options right now.

P.S. Note that a PR model (like should be getting worked on in #1155) that runs several builds for each independent PR branch is identical to running the checkin-test-sems.sh script with extra builds (as long as all of the builds run are using the same SEMS env, which would be the case here). The difference is that you supply the hardware to run the builds and tests (and therefore is more scalable than a centralized PR approach).

tjfulle · 2017-06-05T16:13:50Z

@bartlettroscoe, I think adding an optional MPI_RELEASE_SHARED_PT extra build would be helpful, in general. In this particular case, it is better to pull the check out of the if(debug) branch all together. Thus, the standard CI build will now catch this.

mhoemmen · 2017-06-05T16:14:40Z

Thanks @tjfulle ! :-D

@bartlettroscoe Usually I'll do MPI_RELEASE with CUDA, since some packages have had trouble in the past finishing the debug build (libraries too large, so the linker fails) with NVCC.

tjfulle · 2017-06-05T16:39:35Z

@mhoemmen, PR is in :)

tjfulle · 2017-06-05T16:41:23Z

#1398 addresses these failures. @jwillenbring, this issue can be closed when the clean build shows up as clean.

bartlettroscoe · 2017-06-05T17:37:49Z

@tjfulle:

I think adding an optional MPI_RELEASE_SHARED_PT extra build would be helpful, in general

I have added it in my local checkin-test-sems.sh script and am running it locally before pushing.

mhoemmen · 2017-06-05T17:42:43Z

@bartlettroscoe wrote: "am running it locally before pushing" -- does that refer to @tjfulle 's PR, or your changes?

tjfulle · 2017-06-05T17:44:47Z

@mhoemmen , my PR was generated by a commit pushed by the SEMS checkin script already

This build failed in the "Clean" dashboard of Trilinos and there was a request in #1395 to add this. That MPI full optimized/release build is an important customer build so developers might consider running it more before pushing. Build/Test Cases Summary Enabled Packages: Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages 0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min) 1) MPI_RELEASE_SHARED_PT => passed: passed=2338,notpassed=0 (101.69 min)

bartlettroscoe · 2017-06-05T20:36:25Z

"am running it locally before pushing" -- does that refer to @tjfulle 's PR, or your changes?

Both. I expect to see Anasazi failures that are reported for that build, which you can see here:

http://testing.sandia.gov/cdash/index.php?project=Trilinos&filtercount=4&showfilters=1&filtercombine=and&field1=subprojects&compare1=93&value1=Anasazi&field2=buildname&compare2=61&value2=Linux-gcc-4.9.3-MPI_Release_gcc_4.9.3_openmpi_1.8.7_DEV&field3=groupname&compare3=61&value3=Clean&field4=buildstarttime&compare4=84&value4=now

But because these failures may might be random (i.e. #1393), I may not see them (we will see).

mhoemmen · 2017-06-05T21:54:25Z

Commit 9f15cfb fixes this; thanks @tjfulle !

I accidentally set Trilinos_ENABLE_DEBUG=ON which just made this build identical to the standard MPI_RELEASE_DEBUG_SHARED_PT build (which is not helpful). Build/Test Cases Summary Enabled Packages: Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages 0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min) 1) MPI_RELEASE_SHARED_PT => passed: passed=2334,notpassed=0 (8.17 min)

bartlettroscoe · 2017-06-05T22:34:52Z

FYI: I ran the MPI_RELEASE_SHARED_PT build and it passed and pushed (see below).

@tjfulle and @mhoemmen, this means that if you want to test the MPI fully optimized build, you can do that by adding:

./checkin-test-sems.sh --st-extra-builds=MPI_RELEASE_SHARED_PT \
  --do-all --push

But this is not guaranteed to be exactly the same as the "Clean" build Linux-gcc-4.9.3-MPI_Release_gcc_4.9.3_openmpi_1.8.7_DEV set up by the Trilinos Framework team because they don't use the same basic PT set of packages and otherwise same standard SEMS configuration. But hopefully it would catch most of the errors that will break that build.

P.S. The strange thing is that I got an MPI abort error in the test FEI_fei_ubase_MPI_3. But when I ran the full Trilinos test suite again, it passed (and that is what is shown below). I don't remember the last time I got a random MPI failure on this machine running the MPI_RELEASE_DEBUG_SHARED_PT CI build. Not sure what that means.

DID PUSH: Trilinos: crf450.srn.sandia.gov

Mon Jun  5 16:27:50 MDT 2017

Enabled Packages: 
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) MPI_RELEASE_SHARED_PT => passed: passed=2334,notpassed=0 (8.17 min)

*** Commits for repo :
  8cde1d0 Actually disable debug-mode in MPI_RELEASE_SHARED_PT build (#1395)

1) MPI_RELEASE_SHARED_PT Results:
---------------------------------

  passed: Trilinos/MPI_RELEASE_SHARED_PT: passed=2334,notpassed=0
  
  Mon Jun  5 16:27:45 MDT 2017
  
  Enabled Packages: 
  Disabled Packages: PyTrilinos,Claps,TriKota
  Enabled all Packages
  Hostname: crf450.srn.sandia.gov
  Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_SHARED_PT
  
  CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake -DTrilinos_ENABLE_DEBUG=OFF -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
  Make Options: -j16
  CTest Options: -j16 
  
  Pull: Not Performed
  Configure: Not Performed
  Build: Not Performed
  Test: Passed (8.17 min)
  
  100% tests passed, 0 tests failed out of 2334
  
  Label Time Summary:
  Amesos               =  19.03 sec (14 tests)
  Amesos2              =   9.92 sec (8 tests)
  Anasazi              =  93.48 sec (71 tests)
  AztecOO              =  15.46 sec (17 tests)
  Belos                =  91.43 sec (67 tests)
  Domi                 = 119.19 sec (125 tests)
  Epetra               =  41.42 sec (61 tests)
  EpetraExt            =  12.61 sec (10 tests)
  FEI                  =  39.13 sec (43 tests)
  Galeri               =   4.19 sec (9 tests)
  GlobiPack            =   1.06 sec (6 tests)
  Ifpack               =  54.58 sec (53 tests)
  Ifpack2              =  30.39 sec (33 tests)
  Intrepid             =  62.93 sec (152 tests)
  Intrepid2            =  17.38 sec (73 tests)
  Isorropia            =   8.23 sec (6 tests)
  Kokkos               =  30.75 sec (21 tests)
  KokkosKernels        =   1.56 sec (9 tests)
  Komplex              =   1.16 sec (1 test)
  ML                   =  42.55 sec (34 tests)
  MiniTensor           =   0.18 sec (2 tests)
  Moertel              =   1.42 sec (1 test)
  MueLu                = 109.62 sec (56 tests)
  NOX                  = 112.46 sec (101 tests)
  OptiPack             =   5.31 sec (5 tests)
  Panzer               = 224.24 sec (129 tests)
  Phalanx              =   2.94 sec (19 tests)
  Pike                 =   1.93 sec (7 tests)
  Piro                 =  18.92 sec (12 tests)
  Pliris               =   2.54 sec (2 tests)
  ROL                  = 282.70 sec (133 tests)
  RTOp                 =   9.03 sec (24 tests)
  Rythmos              =  44.84 sec (83 tests)
  SEACAS               =   6.91 sec (8 tests)
  STK                  =  23.40 sec (11 tests)
  Sacado               =  33.22 sec (292 tests)
  Shards               =   0.55 sec (4 tests)
  ShyLU                =   7.03 sec (5 tests)
  Stokhos              =  67.29 sec (74 tests)
  Stratimikos          =  18.78 sec (39 tests)
  Teko                 =  28.76 sec (19 tests)
  Tempus               =  41.02 sec (9 tests)
  Teuchos              =  35.01 sec (121 tests)
  ThreadPool           =   7.65 sec (10 tests)
  Thyra                =  55.00 sec (80 tests)
  Tpetra               = 126.74 sec (132 tests)
  TrilinosCouplings    =  21.09 sec (19 tests)
  Triutils             =   2.23 sec (2 tests)
  Xpetra               =  22.70 sec (17 tests)
  Zoltan               = 196.26 sec (16 tests)
  Zoltan2              = 124.45 sec (97 tests)
  
  Total Test time (real) = 490.23 sec
  
  Total time for MPI_RELEASE_SHARED_PT = 8.17 min

tjfulle · 2017-06-06T14:38:26Z

@jwillenbring, these two @trilinos/tpetra tests show up as failures on cdash again this morning. @bartlettroscoe and I have independently run the test suite with all tests passing. Are the tests on cdash using the most updated code?

bartlettroscoe · 2017-06-06T15:01:49Z

@tjfulle:

Are the tests on cdash using the most updated code?

CDash tells you exactly what version of Trilinos is being tested and how to see what new commits were pulled since the last time a build was run. See instructions on finding that info at:

https://github.com/trilinos/Trilinos/wiki/Policies--%7C-Testing#finding_info_on_cdash

mhoemmen · 2017-06-06T15:05:37Z

@bartlettroscoe Just curious -- why is it that changing Tpetra enables FEI tests? FEI doesn't depend on Tpetra at all. It's likely a prerequisite of Panzer, so I understand why it needs to be built, but why should it need to be tested?

tjfulle · 2017-06-06T15:11:16Z

Thanks @bartlettroscoe! It looks like the cdash errors reported are using a checkout of Trilinos that does not yet have yesterday's fixes.

rppawlo · 2017-06-06T16:40:33Z

@bartlettroscoe Just curious -- why is it that changing Tpetra enables FEI tests? FEI doesn't depend on Tpetra at all. It's likely a prerequisite of Panzer, so I understand why it needs to be built, but why should it need to be tested?

FYI: Panzer does not have a dependency on FEI. It was dropped as a required package a while ago.

bartlettroscoe · 2017-06-06T18:38:43Z

@mhoemmen:

why is it that changing Tpetra enables FEI tests?

Just configure Trilinos with:

-DTrilinos_ENABLE_Tpetra=on -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON

and look at the STDOUT output. It tells you exactly why. See:

Sweep forward enabling all forward library dependent packages because Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON ...

[...]
-- Setting Trilinos_ENABLE_Isorropia=ON because Trilinos_ENABLE_Tpetra=ON
[...]
-- Setting Trilinos_ENABLE_ML=ON because Trilinos_ENABLE_Isorropia=ON
[...]
-- Setting Trilinos_ENABLE_FEI=ON because Trilinos_ENABLE_ML=ON
[...]

So FEI depends on ML depends on Isorropia depends on Tpetra and there you have it (sort of like everyone on Hollywood is related to Kevin Bacon in less than 6 steps).

This a a major problem with Trilinos. Trilinos packages need to be better structured into subpakages and then have the dependencies controlled between subpackages. I will write a story for this and try to raise awareness of this problem (once again).

Any questions?

bartlettroscoe · 2017-06-06T18:41:35Z

FYI: Panzer does not have a dependency on FEI. It was dropped as a required package a while ago.

Panzer depending on FEI would not enable the FEI tests, just the libs. FEI needs to have an (indirect) upstream dependency on Tpetra to do that (which it has).

tjfulle · 2017-06-07T14:26:28Z

@trilinos/tpetra tests all pass on the clean builds this morning.

mhoemmen · 2017-06-07T19:47:37Z

@bartlettroscoe wrote:

So FEI depends on ML depends on Isorropia depends on Tpetra and there you have it.

Thanks for clarifying :-) . I didn't realize Isorropia depended on Tpetra. I think that was just someone's abandoned research project. If I do the following, would that decouple FEI from Tpetra?

Separate Isorropia into two subpackages, IsorropiaMain and IsorropiaTpetra, with the latter having all Tpetra content
Make ML depend only on IsorropiaMain?

If so, I'll open a new issue to do this.

bartlettroscoe · 2017-06-08T00:57:00Z

If I do the following, would that decouple FEI from Tpetra?

Separate Isorropia into two subpackages, IsorropiaMain and IsorropiaTpetra, with the latter having all Tpetra content

Make ML depend only on IsorropiaMain?

Yup, that is the type of thing you have to do. Note that Thyra has subpackages ThyraTpetra and ThyraEpetra and downstream packages depends on one or the other (sometimes both) as can be see by:

[rabartl@crf450 Trilinos (develop)]$ find . -name Dependencies.cmake -exec grep -l ThyraEpetra {} \; | grep -v cmake/tribits | grep -v TriBITS/tribits
./packages/tempus/cmake/Dependencies.cmake
./packages/stratimikos/cmake/Dependencies.cmake
./packages/thyra/adapters/tpetra/cmake/Dependencies.cmake
./packages/thyra/adapters/epetraext/cmake/Dependencies.cmake
./packages/panzer/disc-fe/cmake/Dependencies.cmake
./packages/nox/cmake/Dependencies.cmake
./packages/teko/cmake/Dependencies.cmake
./packages/anasazi/cmake/Dependencies.cmake
./packages/rythmos/cmake/Dependencies.cmake
./packages/piro/cmake/Dependencies.cmake

[rabartl@crf450 Trilinos (develop)]$ find . -name Dependencies.cmake -exec grep -l ThyraTpetra {} \; | grep -v cmake/tribits | grep -v TriBITS/tribits
./packages/stratimikos/cmake/Dependencies.cmake
./packages/panzer/disc-fe/cmake/Dependencies.cmake
./packages/muelu/cmake/Dependencies.cmake
./packages/teko/cmake/Dependencies.cmake
./packages/ifpack2/cmake/Dependencies.cmake

But what kills you are sloppy indirect dependencies that end up enabling a bunch of extra downstream dependencies anyway.

I am going to create a more general Issue for this to get this issue on the map so that Trilinos developers can be alerted to start doing this type of work to better control the dependencies in Trilinos (which has many advantages, not just speeding up CI testing). There are likely a dozen or more package in Trilinos that need to be broken down into subpackages and then a bunch more packages that need to narrow their dependence on just the subpackages they need.

mhoemmen · 2017-06-08T15:26:31Z

Thanks @bartlettroscoe ! I think it will help us a lot to purge unnecessary dependencies. A big problem is that people add experimental features that never get enabled in the nightlies, but force optional dependencies that do get enabled. More intelligent use of subpackages could help with that. Also, we have branches and PRs now; we should those instead of polluting the repo with some random's untested dissertationware.

jwillenbring added the pkg: Tpetra label Jun 5, 2017

jwillenbring mentioned this issue Jun 5, 2017

Remove all failures in "Clean" track builds. #1294

Closed

mhoemmen assigned tjfulle Jun 5, 2017

tjfulle mentioned this issue Jun 5, 2017

Tpetra: Move check for bad export LIDs out of if(debug) branch. #1398

Merged

mhoemmen closed this as completed Jun 5, 2017

tjfulle reopened this Jun 6, 2017

tjfulle closed this as completed Jun 7, 2017

mhoemmen mentioned this issue Jun 7, 2017

Isorropia: Separate experimental Tpetra-dependent code into its own subpackage #1406

Closed

bartlettroscoe mentioned this issue Jun 13, 2017

Clean up Trilinos dependencies using subpackages to reduce and better control dependencies #1418

Open

bartlettroscoe mentioned this issue Apr 11, 2019

Trilinos auto PR tester stability issues #3276

Closed

trilinos-autotester mentioned this issue Mar 6, 2021

Tempus: Remove Doxygen dependence on algpseudocode. #8845

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TpetraCore CrsMatrix pack and unpack test failures on clean build #1395

TpetraCore CrsMatrix pack and unpack test failures on clean build #1395

jwillenbring commented Jun 5, 2017 •

edited by bartlettroscoe

Loading

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

tjfulle commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

tjfulle commented Jun 6, 2017

bartlettroscoe commented Jun 6, 2017

mhoemmen commented Jun 6, 2017

tjfulle commented Jun 6, 2017

rppawlo commented Jun 6, 2017

bartlettroscoe commented Jun 6, 2017

bartlettroscoe commented Jun 6, 2017

tjfulle commented Jun 7, 2017

mhoemmen commented Jun 7, 2017

bartlettroscoe commented Jun 8, 2017

mhoemmen commented Jun 8, 2017

TpetraCore CrsMatrix pack and unpack test failures on clean build #1395

TpetraCore CrsMatrix pack and unpack test failures on clean build #1395

Comments

jwillenbring commented Jun 5, 2017 • edited by bartlettroscoe Loading

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

tjfulle commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

tjfulle commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

tjfulle commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

mhoemmen commented Jun 5, 2017

bartlettroscoe commented Jun 5, 2017

tjfulle commented Jun 6, 2017

bartlettroscoe commented Jun 6, 2017

mhoemmen commented Jun 6, 2017

tjfulle commented Jun 6, 2017

rppawlo commented Jun 6, 2017

bartlettroscoe commented Jun 6, 2017

bartlettroscoe commented Jun 6, 2017

tjfulle commented Jun 7, 2017

mhoemmen commented Jun 7, 2017

bartlettroscoe commented Jun 8, 2017

mhoemmen commented Jun 8, 2017

jwillenbring commented Jun 5, 2017 •

edited by bartlettroscoe

Loading