Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TpetraCore CrsMatrix pack and unpack test failures on clean build #1395

Closed
jwillenbring opened this issue Jun 5, 2017 · 29 comments
Closed
Assignees

Comments

@jwillenbring
Copy link
Member

jwillenbring commented Jun 5, 2017

CC: @trilinos/framework

@trilinos/tpetra

There have been two Tpetra test failures on the GCC 4.9.3 MPI build Linux-gcc-4.9.3-MPI_Release_gcc_4.9.3_openmpi_1.8.7_DEV for the last week:

http://testing.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=2934080

It is important to resolve these issues, either by fixing the issue or disabling the test so we can move in the direction of making automated decisions based on 100% clean builds.

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

This error doesn't occur with the standard sems checkin script, I am looking at it now with the jenkins environment that shows failure

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 5, 2017

std::bad_alloc suggests overflow, perhaps due to a signed / unsigned conversion.

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

@mhoemmen, perhaps? There is strange behavior on the test that intentionally sends in bad LIDs, like the bad LID is not caught soon enough. I'm getting closer....

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 5, 2017

@tjfulle I'm stuck in meetings all day today and tomorrow or I'd help you out :(

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

@mhoemmen, the source of the error is in NumPacketsAndOffsetsFunctor wherein there is a check for exportLID + 1 <= lclMatrix.numRows() (not this exactly, but equivalent). In the move to combine the computation of the number of packets and offsets, the check found its way behind a if (debug) branch. This test intentionally sends in bad exportLIDs but they were not caught since this was a release build.

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 5, 2017

@tjfulle my bad :( can you fix it? feel free to push. Thanks!

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

I've got the fix in and the tests are running as we speak :)

@bartlettroscoe
Copy link
Member

@tjfulle:

This error doesn't occur with the standard sems checkin script

The single standard CI build is a build with debug-mode checking turned on Trilinos_ENABLE_DEBUG=ON (since that best supports developers and users developing against Trilinos). This build has Trilinos_ENABLE_DEBUG=OFF (as you can see at http://testing.sandia.gov/cdash/viewNotes.php?buildid=2934080) which is also a very important type of build for customers actually wanting performance out of Trilinos. We can add an MPI_RELEASE_SHARED_PT extra build to the checkin-test-sems.sh script so you can run this build as well before you push, if you like. That will take twice as long to test and push but for some types of changes, perhaps that is a good idea. Or we can just fix failures for these post-push "Clean" builds ASAP. Those are the only two options right now.

P.S. Note that a PR model (like should be getting worked on in #1155) that runs several builds for each independent PR branch is identical to running the checkin-test-sems.sh script with extra builds (as long as all of the builds run are using the same SEMS env, which would be the case here). The difference is that you supply the hardware to run the builds and tests (and therefore is more scalable than a centralized PR approach).

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

@bartlettroscoe, I think adding an optional MPI_RELEASE_SHARED_PT extra build would be helpful, in general. In this particular case, it is better to pull the check out of the if(debug) branch all together. Thus, the standard CI build will now catch this.

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 5, 2017

Thanks @tjfulle ! :-D

@bartlettroscoe Usually I'll do MPI_RELEASE with CUDA, since some packages have had trouble in the past finishing the debug build (libraries too large, so the linker fails) with NVCC.

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

@mhoemmen, PR is in :)

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

#1398 addresses these failures. @jwillenbring, this issue can be closed when the clean build shows up as clean.

@bartlettroscoe
Copy link
Member

@tjfulle:

I think adding an optional MPI_RELEASE_SHARED_PT extra build would be helpful, in general

I have added it in my local checkin-test-sems.sh script and am running it locally before pushing.

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 5, 2017

@bartlettroscoe wrote: "am running it locally before pushing" -- does that refer to @tjfulle 's PR, or your changes?

@tjfulle
Copy link
Contributor

tjfulle commented Jun 5, 2017

@mhoemmen , my PR was generated by a commit pushed by the SEMS checkin script already

bartlettroscoe added a commit that referenced this issue Jun 5, 2017
This build failed in the "Clean" dashboard of Trilinos and there was a request
in #1395 to add this.  That MPI full optimized/release build is an important
customer build so developers might consider running it more before pushing.

Build/Test Cases Summary
Enabled Packages:
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) MPI_RELEASE_SHARED_PT => passed: passed=2338,notpassed=0 (101.69 min)
@bartlettroscoe
Copy link
Member

"am running it locally before pushing" -- does that refer to @tjfulle 's PR, or your changes?

Both. I expect to see Anasazi failures that are reported for that build, which you can see here:

But because these failures may might be random (i.e. #1393), I may not see them (we will see).

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 5, 2017

Commit 9f15cfb fixes this; thanks @tjfulle !

@mhoemmen mhoemmen closed this as completed Jun 5, 2017
bartlettroscoe added a commit that referenced this issue Jun 5, 2017
I accidentally set Trilinos_ENABLE_DEBUG=ON which just made this build
identical to the standard MPI_RELEASE_DEBUG_SHARED_PT build (which is not
helpful).

Build/Test Cases Summary
Enabled Packages:
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) MPI_RELEASE_SHARED_PT => passed: passed=2334,notpassed=0 (8.17 min)
@bartlettroscoe
Copy link
Member

FYI: I ran the MPI_RELEASE_SHARED_PT build and it passed and pushed (see below).

@tjfulle and @mhoemmen, this means that if you want to test the MPI fully optimized build, you can do that by adding:

./checkin-test-sems.sh --st-extra-builds=MPI_RELEASE_SHARED_PT \
  --do-all --push

But this is not guaranteed to be exactly the same as the "Clean" build Linux-gcc-4.9.3-MPI_Release_gcc_4.9.3_openmpi_1.8.7_DEV set up by the Trilinos Framework team because they don't use the same basic PT set of packages and otherwise same standard SEMS configuration. But hopefully it would catch most of the errors that will break that build.

P.S. The strange thing is that I got an MPI abort error in the test FEI_fei_ubase_MPI_3. But when I ran the full Trilinos test suite again, it passed (and that is what is shown below). I don't remember the last time I got a random MPI failure on this machine running the MPI_RELEASE_DEBUG_SHARED_PT CI build. Not sure what that means.


DID PUSH: Trilinos: crf450.srn.sandia.gov

Mon Jun  5 16:27:50 MDT 2017

Enabled Packages: 
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) MPI_RELEASE_SHARED_PT => passed: passed=2334,notpassed=0 (8.17 min)

*** Commits for repo :
  8cde1d0 Actually disable debug-mode in MPI_RELEASE_SHARED_PT build (#1395)

1) MPI_RELEASE_SHARED_PT Results:
---------------------------------

  passed: Trilinos/MPI_RELEASE_SHARED_PT: passed=2334,notpassed=0
  
  Mon Jun  5 16:27:45 MDT 2017
  
  Enabled Packages: 
  Disabled Packages: PyTrilinos,Claps,TriKota
  Enabled all Packages
  Hostname: crf450.srn.sandia.gov
  Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_SHARED_PT
  
  CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake -DTrilinos_ENABLE_DEBUG=OFF -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
  Make Options: -j16
  CTest Options: -j16 
  
  Pull: Not Performed
  Configure: Not Performed
  Build: Not Performed
  Test: Passed (8.17 min)
  
  100% tests passed, 0 tests failed out of 2334
  
  Label Time Summary:
  Amesos               =  19.03 sec (14 tests)
  Amesos2              =   9.92 sec (8 tests)
  Anasazi              =  93.48 sec (71 tests)
  AztecOO              =  15.46 sec (17 tests)
  Belos                =  91.43 sec (67 tests)
  Domi                 = 119.19 sec (125 tests)
  Epetra               =  41.42 sec (61 tests)
  EpetraExt            =  12.61 sec (10 tests)
  FEI                  =  39.13 sec (43 tests)
  Galeri               =   4.19 sec (9 tests)
  GlobiPack            =   1.06 sec (6 tests)
  Ifpack               =  54.58 sec (53 tests)
  Ifpack2              =  30.39 sec (33 tests)
  Intrepid             =  62.93 sec (152 tests)
  Intrepid2            =  17.38 sec (73 tests)
  Isorropia            =   8.23 sec (6 tests)
  Kokkos               =  30.75 sec (21 tests)
  KokkosKernels        =   1.56 sec (9 tests)
  Komplex              =   1.16 sec (1 test)
  ML                   =  42.55 sec (34 tests)
  MiniTensor           =   0.18 sec (2 tests)
  Moertel              =   1.42 sec (1 test)
  MueLu                = 109.62 sec (56 tests)
  NOX                  = 112.46 sec (101 tests)
  OptiPack             =   5.31 sec (5 tests)
  Panzer               = 224.24 sec (129 tests)
  Phalanx              =   2.94 sec (19 tests)
  Pike                 =   1.93 sec (7 tests)
  Piro                 =  18.92 sec (12 tests)
  Pliris               =   2.54 sec (2 tests)
  ROL                  = 282.70 sec (133 tests)
  RTOp                 =   9.03 sec (24 tests)
  Rythmos              =  44.84 sec (83 tests)
  SEACAS               =   6.91 sec (8 tests)
  STK                  =  23.40 sec (11 tests)
  Sacado               =  33.22 sec (292 tests)
  Shards               =   0.55 sec (4 tests)
  ShyLU                =   7.03 sec (5 tests)
  Stokhos              =  67.29 sec (74 tests)
  Stratimikos          =  18.78 sec (39 tests)
  Teko                 =  28.76 sec (19 tests)
  Tempus               =  41.02 sec (9 tests)
  Teuchos              =  35.01 sec (121 tests)
  ThreadPool           =   7.65 sec (10 tests)
  Thyra                =  55.00 sec (80 tests)
  Tpetra               = 126.74 sec (132 tests)
  TrilinosCouplings    =  21.09 sec (19 tests)
  Triutils             =   2.23 sec (2 tests)
  Xpetra               =  22.70 sec (17 tests)
  Zoltan               = 196.26 sec (16 tests)
  Zoltan2              = 124.45 sec (97 tests)
  
  Total Test time (real) = 490.23 sec
  
  Total time for MPI_RELEASE_SHARED_PT = 8.17 min

@tjfulle
Copy link
Contributor

tjfulle commented Jun 6, 2017

@jwillenbring, these two @trilinos/tpetra tests show up as failures on cdash again this morning. @bartlettroscoe and I have independently run the test suite with all tests passing. Are the tests on cdash using the most updated code?

@tjfulle tjfulle reopened this Jun 6, 2017
@bartlettroscoe
Copy link
Member

@tjfulle:

Are the tests on cdash using the most updated code?

CDash tells you exactly what version of Trilinos is being tested and how to see what new commits were pulled since the last time a build was run. See instructions on finding that info at:

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 6, 2017

@bartlettroscoe Just curious -- why is it that changing Tpetra enables FEI tests? FEI doesn't depend on Tpetra at all. It's likely a prerequisite of Panzer, so I understand why it needs to be built, but why should it need to be tested?

@tjfulle
Copy link
Contributor

tjfulle commented Jun 6, 2017

Thanks @bartlettroscoe! It looks like the cdash errors reported are using a checkout of Trilinos that does not yet have yesterday's fixes.

@rppawlo
Copy link
Contributor

rppawlo commented Jun 6, 2017

@bartlettroscoe Just curious -- why is it that changing Tpetra enables FEI tests? FEI doesn't depend on Tpetra at all. It's likely a prerequisite of Panzer, so I understand why it needs to be built, but why should it need to be tested?

FYI: Panzer does not have a dependency on FEI. It was dropped as a required package a while ago.

@bartlettroscoe
Copy link
Member

@mhoemmen:

why is it that changing Tpetra enables FEI tests?

Just configure Trilinos with:

-DTrilinos_ENABLE_Tpetra=on -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON

and look at the STDOUT output. It tells you exactly why. See:

Sweep forward enabling all forward library dependent packages because Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=ON ...

[...]
-- Setting Trilinos_ENABLE_Isorropia=ON because Trilinos_ENABLE_Tpetra=ON
[...]
-- Setting Trilinos_ENABLE_ML=ON because Trilinos_ENABLE_Isorropia=ON
[...]
-- Setting Trilinos_ENABLE_FEI=ON because Trilinos_ENABLE_ML=ON
[...]

So FEI depends on ML depends on Isorropia depends on Tpetra and there you have it (sort of like everyone on Hollywood is related to Kevin Bacon in less than 6 steps).

This a a major problem with Trilinos. Trilinos packages need to be better structured into subpakages and then have the dependencies controlled between subpackages. I will write a story for this and try to raise awareness of this problem (once again).

Any questions?

@bartlettroscoe
Copy link
Member

FYI: Panzer does not have a dependency on FEI. It was dropped as a required package a while ago.

Panzer depending on FEI would not enable the FEI tests, just the libs. FEI needs to have an (indirect) upstream dependency on Tpetra to do that (which it has).

@tjfulle
Copy link
Contributor

tjfulle commented Jun 7, 2017

@trilinos/tpetra tests all pass on the clean builds this morning.

@tjfulle tjfulle closed this as completed Jun 7, 2017
@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 7, 2017

@bartlettroscoe wrote:

So FEI depends on ML depends on Isorropia depends on Tpetra and there you have it.

Thanks for clarifying :-) . I didn't realize Isorropia depended on Tpetra. I think that was just someone's abandoned research project. If I do the following, would that decouple FEI from Tpetra?

  1. Separate Isorropia into two subpackages, IsorropiaMain and IsorropiaTpetra, with the latter having all Tpetra content
  2. Make ML depend only on IsorropiaMain?

If so, I'll open a new issue to do this.

@bartlettroscoe
Copy link
Member

If I do the following, would that decouple FEI from Tpetra?

  1. Separate Isorropia into two subpackages, IsorropiaMain and IsorropiaTpetra, with the latter having all Tpetra content
  2. Make ML depend only on IsorropiaMain?

Yup, that is the type of thing you have to do. Note that Thyra has subpackages ThyraTpetra and ThyraEpetra and downstream packages depends on one or the other (sometimes both) as can be see by:

[rabartl@crf450 Trilinos (develop)]$ find . -name Dependencies.cmake -exec grep -l ThyraEpetra {} \; | grep -v cmake/tribits | grep -v TriBITS/tribits
./packages/tempus/cmake/Dependencies.cmake
./packages/stratimikos/cmake/Dependencies.cmake
./packages/thyra/adapters/tpetra/cmake/Dependencies.cmake
./packages/thyra/adapters/epetraext/cmake/Dependencies.cmake
./packages/panzer/disc-fe/cmake/Dependencies.cmake
./packages/nox/cmake/Dependencies.cmake
./packages/teko/cmake/Dependencies.cmake
./packages/anasazi/cmake/Dependencies.cmake
./packages/rythmos/cmake/Dependencies.cmake
./packages/piro/cmake/Dependencies.cmake

[rabartl@crf450 Trilinos (develop)]$ find . -name Dependencies.cmake -exec grep -l ThyraTpetra {} \; | grep -v cmake/tribits | grep -v TriBITS/tribits
./packages/stratimikos/cmake/Dependencies.cmake
./packages/panzer/disc-fe/cmake/Dependencies.cmake
./packages/muelu/cmake/Dependencies.cmake
./packages/teko/cmake/Dependencies.cmake
./packages/ifpack2/cmake/Dependencies.cmake

But what kills you are sloppy indirect dependencies that end up enabling a bunch of extra downstream dependencies anyway.

I am going to create a more general Issue for this to get this issue on the map so that Trilinos developers can be alerted to start doing this type of work to better control the dependencies in Trilinos (which has many advantages, not just speeding up CI testing). There are likely a dozen or more package in Trilinos that need to be broken down into subpackages and then a bunch more packages that need to narrow their dependence on just the subpackages they need.

@mhoemmen
Copy link
Contributor

mhoemmen commented Jun 8, 2017

Thanks @bartlettroscoe ! I think it will help us a lot to purge unnecessary dependencies. A big problem is that people add experimental features that never get enabled in the nightlies, but force optional dependencies that do get enabled. More intelligent use of subpackages could help with that. Also, we have branches and PRs now; we should those instead of polluting the repo with some random's untested dissertationware.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants